Natural Language Processing: What Is It, and How Can We Represent Text Using the TensorFlow Tokenizer?
One thing that makes us great as humans is our ability to learn new languages and communicate with one another. But can computers match our ability to understand language? It seems so: recent advances in computational power and deep learning have made it possible for computers to process human language. But wait a bit: computers don’t understand human language as it is, so we need to transform it into a form they can work with.
Computers cannot work with raw text, so we need to convert human language into numbers. Strange, right? How can we do that? There are two ways.
Representing Text Using Character Level Representation
We represent text by treating each character as a number. Given that we have C different characters in our text corpus, the word ‘Hello’ could be represented by a tensor of shape C×5. Each letter corresponds to one column of that tensor, encoded as a one-hot vector of length C.
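As a rough illustration of character-level one-hot encoding, here is a minimal sketch in plain Python. The toy corpus "Hello world" and the helper name one_hot_chars are assumptions made for this example only. The result is stored as 5 rows of length C, which is simply the transpose of the C×5 layout described above.

```python
# Hypothetical toy corpus; in practice the character set comes from your own text.
corpus = "Hello world"

# Build the character vocabulary: one index per distinct character.
chars = sorted(set(corpus))
char_to_index = {ch: i for i, ch in enumerate(chars)}
C = len(chars)

def one_hot_chars(word):
    """Return one one-hot vector of length C per character in the word."""
    vectors = []
    for ch in word:
        vec = [0] * C
        vec[char_to_index[ch]] = 1
        vectors.append(vec)
    return vectors

encoded = one_hot_chars("Hello")
print(len(encoded), "x", C)  # number of characters x size of the character set
for ch, vec in zip("Hello", encoded):
    print(ch, vec)
```

Note that the two ‘l’ characters get identical vectors, since the encoding depends only on which character it is, not where it appears.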
Representing Text Using Word Representation
We create a vocabulary of all the words in our text and then represent each word using one-hot encoding. This approach is often better than character-level representation because a single letter by itself does not carry much meaning. The drawback is that, with a large vocabulary, we need to deal with high-dimensional sparse tensors.
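To make the word-level idea concrete, here is a small sketch in plain Python. The two example sentences and the helper name one_hot_word are assumptions for illustration. Each word becomes a vector as long as the vocabulary, with a single 1 in it, which is why large vocabularies lead to very sparse tensors.

```python
# Hypothetical example sentences, used only for this sketch.
sentences = ["I love my dog", "You love my dog"]

# Vocabulary: every distinct lowercase word, in order of first appearance.
vocab = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab.append(word)

word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_word(word):
    """Return a vector of vocabulary length with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(vocab)
print(one_hot_word("dog"))
```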
Text Vectorization
If we want word-level representation, we need to do two things:
- use a tokenizer to split text into tokens
- build a vocabulary of those tokens
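The two steps above can be sketched by hand in plain Python before reaching for a library. This is a simplified approximation, not the Keras implementation: the tokenize helper is hypothetical, and it strips every character in string.punctuation, which is an assumption that differs slightly from the Keras default filter list. Words are numbered from 1 by descending frequency, with ties kept in first-seen order.

```python
import string
from collections import Counter

sentences = ["I love my girlfriend", "I love my dog!", "You love my dog ?"]

def tokenize(text):
    """Step 1: lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

# Step 2: count every token, then number the words by descending frequency.
counts = Counter()
for sentence in sentences:
    counts.update(tokenize(sentence))

# sorted() is stable, so words with equal counts keep their first-seen order.
# Indices start at 1, leaving 0 free for padding.
ordered = sorted(counts, key=lambda w: -counts[w])
word_index = {word: i for i, word in enumerate(ordered, start=1)}
print(word_index)
```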
Now that we know the ways of representing text, let’s dive into how we can do it with TensorFlow and Keras. First, we need to import the dependencies:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Let’s take a look at the following list of sentences
sentences=['I love my girlfriend',
    'I love my dog!',
    'You love my dog ?',
]
We need to initialize the Tokenizer from TensorFlow and then fit it on our sentences:
tokenizer=Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index=tokenizer.word_index
print(word_index)
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'girlfriend': 5, 'you': 6}
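As you can see, each word has been assigned a number, with the most frequent words getting the lowest indices. Note that the tokenizer lowercases the text and strips punctuation by default. A word index on its own does not represent a sentence, though; for that we need to convert each sentence into a sequence of these numbers.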
sequences=tokenizer.texts_to_sequences(sentences)
print(sequences)
[[3, 1, 2, 5], [3, 1, 2, 4], [6, 1, 2, 4]]
As you can see, each sentence has been converted into a sequence of word indices.
For example, the first sequence [3, 1, 2, 5] represents “i love my girlfriend”.
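One way to check what a sequence stands for is to invert the word index and map each number back to its word. The sketch below does this in plain Python, with the word index and sequences copied from the output above; the decode helper is a hypothetical name for this example. A fitted Keras Tokenizer also offers an index_word attribute and a sequences_to_texts method for the same purpose.

```python
# Values copied from the printed output of the fitted tokenizer.
word_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'girlfriend': 5, 'you': 6}
sequences = [[3, 1, 2, 5], [3, 1, 2, 4], [6, 1, 2, 4]]

# Invert the mapping: index -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(sequence):
    """Turn a list of word indices back into a space-separated string."""
    return " ".join(index_word[i] for i in sequence)

decoded = [decode(seq) for seq in sequences]
for seq, text in zip(sequences, decoded):
    print(seq, "->", text)
```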
Now let’s convert some test data that contains words the tokenizer has not seen:
test_data=['I really love my dog',
    'my dog loves my manatee']
test_seq=tokenizer.texts_to_sequences(test_data)
print(test_seq)
[[3, 1, 2, 4], [2, 4, 2]]
Take the first sequence, [3, 1, 2, 4]: it decodes to “i love my dog”, so the word “really” has silently disappeared. The second, [2, 4, 2], decodes to “my dog my”, which is meaningless. This happens because our vocabulary is small and the tokenizer simply drops words it has never seen. When training sequence models you need a large vocabulary, but some unseen words will always remain. How can we handle them?
To solve this, we pass the oov_token argument to the Tokenizer so that every unseen word is replaced with a special out-of-vocabulary (OOV) token:
tokenizer=Tokenizer(num_words=100,oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index=tokenizer.word_index
print(word_index)
{'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'girlfriend': 6, 'you': 7}
As you can see, we now have <OOV> in our vocabulary, at index 1.
Let’s run our test data again and see:
test_seq=tokenizer.texts_to_sequences(test_data)
print(test_seq)
[[4, 1, 2, 3, 5], [3, 5, 1, 3, 1]]
As you can see, the sequence [4, 1, 2, 3, 5] now decodes to “i <OOV> love my dog”, which is better: the position of the unknown word is preserved instead of being dropped.
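The difference between dropping unseen words and replacing them can be sketched in plain Python. This is an illustration of the idea only, not how Keras implements it; the helper names drop_unknown and with_oov are hypothetical, and the word index is copied from the output above.

```python
# Word index copied from the tokenizer fitted with oov_token="<OOV>".
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'girlfriend': 6, 'you': 7}
test_data = ["I really love my dog", "my dog loves my manatee"]

def drop_unknown(text):
    """Skip any word that is not in the vocabulary."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def with_oov(text, oov_token="<OOV>"):
    """Replace any word that is not in the vocabulary with the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(w, oov_index) for w in text.lower().split()]

dropped = [drop_unknown(t) for t in test_data]
test_seq = [with_oov(t) for t in test_data]
print(dropped)
print(test_seq)
```

With the OOV token, every sequence keeps the same length as its sentence, so the model can at least see that a word was there.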
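Another problem is that sentences are not all the same length, which causes trouble when training sequence models because they expect inputs of a uniform shape. To solve this, we pad the shorter sequences with zeros; by default, pad_sequences adds the zeros at the front. The output below was produced with the four-sentence list used in the full code further down (the three sentences above plus ‘Do you think my dog is amazing ?’), after refitting the tokenizer with the OOV token and regenerating the sequences, so the word indices differ from the earlier examples.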
padded=pad_sequences(sequences)
print(padded)
[[ 0 0 0 5 3 2 7]
[ 0 0 0 5 3 2 4]
[ 0 0 0 6 3 2 4]
[ 8 6 9 2 4 10 11]]
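To show what padding does under the hood, here is a minimal plain-Python sketch that pads every sequence to the length of the longest one. The pad helper is a hypothetical stand-in, not the Keras function. It supports padding at the front ("pre") or at the end ("post"); the real pad_sequences exposes the same choice through its padding argument, along with maxlen and truncating for limiting the length.

```python
# Sequences for the four-sentence list, matching the padded output above.
sequences = [[5, 3, 2, 7], [5, 3, 2, 4], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

def pad(seqs, padding="pre", value=0):
    """Pad each sequence with `value` up to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [value] * (maxlen - len(s))
        padded.append(fill + s if padding == "pre" else s + fill)
    return padded

pre = pad(sequences)                   # zeros at the front
post = pad(sequences, padding="post")  # zeros at the end
for row in pre:
    print(row)
for row in post:
    print(row)
```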
The whole code:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
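sentences=['I love my girlfriend',
    'I love my dog!',
    'You love my dog ?',
    'Do you think my dog is amazing ?']
# Fit the tokenizer; unseen words will map to the <OOV> token
tokenizer=Tokenizer(num_words=100,oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index=tokenizer.word_index
print(word_index)
# Convert the training sentences into sequences of word indices
sequences=tokenizer.texts_to_sequences(sentences)
print(sequences)
# Convert test data that contains unseen words
test_data=['I really love my dog',
    'my dog loves my manatee']
test_seq=tokenizer.texts_to_sequences(test_data)
print(test_seq)
# Pad the sequences so they all have the same length
padded=pad_sequences(sequences)
print(padded)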
We have now turned human language into numeric sequences that a computer can work with.
You can also check out my YouTube video.