NLP Basics
NLP drives computer programs that translate text from one language to
another, respond to spoken commands, and summarize large volumes of text
rapidly—even in real time. There’s a good chance you’ve interacted with NLP
in the form of voice-operated GPS systems, digital assistants, speech-to-text
dictation software, customer service chatbots, and other consumer
conveniences. But NLP also plays a growing role in enterprise solutions that
help streamline business operations, increase employee productivity, and
simplify mission-critical business processes.
Tokenization is a simple process that takes raw text and converts it into a
useful sequence of tokens. While tokenization is well known for its use in
cybersecurity and in the creation of NFTs, it is also an important part of the
NLP process: it splits paragraphs and sentences into smaller units that can be
more easily assigned meaning.
The first step of the NLP process is gathering the data (a sentence) and
breaking it into understandable parts (words). Here’s an example of a string of
data:
“What restaurants are nearby?”
This may seem simple, but breaking a sentence into its parts allows a machine
to understand both the parts and the whole. The program can then interpret
each word on its own as well as how it functions in the larger text. This
matters especially for larger amounts of text, because it lets the machine
count how frequently certain words occur and where they tend to appear, which
later steps of natural language processing rely on.
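As a baseline, we can try Python’s built-in str.split() method, which simply splits a string on whitespace. The exact example sentence isn’t shown in the source, so the one below is illustrative; the only detail carried over is that it ends with “1995.”.

```python
# Whitespace-only tokenization with the built-in str.split() method.
sentence = "The first chatbot I ever used was released back in 1995."
tokens = sentence.split()
print(tokens)
# ['The', 'first', 'chatbot', 'I', 'ever', 'used', 'was', 'released', 'back', 'in', '1995.']
```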
As you can see, this built-in Python method already does a good job of
tokenizing a simple sentence. Its one “mistake” was on the last word, where it
kept the sentence-ending punctuation attached to the token “1995.”. We need
tokens to be separated from neighboring punctuation and from the other
significant tokens in a sentence.
You can easily tokenize the sentences and words of the text with the
tokenize module of NLTK.
First, we’re going to import the relevant functions from the NLTK library:
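A minimal sketch, assuming the standard sent_tokenize and word_tokenize helpers from nltk.tokenize:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# The Punkt models used by the sentence tokenizer need to be downloaded once.
nltk.download("punkt")

text = "What restaurants are nearby? I moved here in 1995."
print(sent_tokenize(text))  # splits the text into sentences
print(word_tokenize(text))  # splits the text into word and punctuation tokens
```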
Punctuation-based tokenizer
This tokenizer splits sentences into words based on whitespace and
punctuation.
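One such tokenizer in NLTK is wordpunct_tokenize; the source doesn’t name the exact function, so treat this as an illustrative choice. It splits on whitespace and treats runs of punctuation as separate tokens:

```python
from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("What restaurants are nearby? Don't they close at 10?"))
# ['What', 'restaurants', 'are', 'nearby', '?', 'Don', "'", 't', 'they', 'close', 'at', '10', '?']
```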
Tweet tokenizer
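NLTK’s TweetTokenizer is designed for social-media text: it keeps hashtags and emoticons together, can strip user handles, and can shorten elongated words. A minimal sketch with a made-up tweet:

```python
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tknzr.tokenize("@user Loving the new #NLP course!!! sooooo good :-)"))
# Roughly: ['Loving', 'the', 'new', '#NLP', 'course', '!', '!', '!', 'sooo', 'good', ':-)']
```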
MWE (multi-word expression) tokenizer
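NLTK’s MWETokenizer merges multi-word expressions that you list (“New York” in this sketch) into single tokens after an initial pass of word tokenization:

```python
from nltk.tokenize import MWETokenizer, word_tokenize

# Keep "New York" together as one token, joined with an underscore.
tokenizer = MWETokenizer([("New", "York")], separator="_")
print(tokenizer.tokenize(word_tokenize("I moved to New York in 1995.")))
# ['I', 'moved', 'to', 'New_York', 'in', '1995', '.']
```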
spaCy Tokenizer
The spaCy tokenizer provides the flexibility to specify special tokens that
don’t need to be segmented, or that need to be segmented using special rules
for each language; for example, punctuation at the end of a sentence should be
split off, whereas “U.K.” should remain one token.
Before you can use spaCy, you need to install it and download the data and
models for the English language.
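A minimal sketch of that workflow, assuming the small English model en_core_web_sm:

```python
# Assumed setup, run once in a shell:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The U.K. office opened in 1995.")
print([token.text for token in doc])
# Expected: ['The', 'U.K.', 'office', 'opened', 'in', '1995', '.']
```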
What is Stemming?
We already know that a word has one root (base) form but can appear in
different variations; for example, “play” is the root word, and “playing”,
“played”, and “plays” are different forms of that single word. When these
variations are stripped back mechanically, the resulting stems can carry
incorrect meanings or introduce other errors.
Moreover, stemming is a rule-based approach: it slices inflections off a
word’s prefix or suffix as needed, using a set of commonly used prefixes and
suffixes such as “-ing”, “-ed”, “-es”, and “pre-”. The result is often a stem
that is not actually a word.
There are mainly two errors that occur while performing stemming:
over-stemming and under-stemming. Over-stemming occurs when two words with
different meanings are cut back to the same stem even though they should have
received different stems. Under-stemming occurs when two words that should be
reduced to the same stem end up with different stems.
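A minimal sketch using NLTK’s PorterStemmer shows both behaviours: the “play” forms collapse to one stem, and a stem is not always a real word:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["play", "playing", "played", "plays", "studies"]:
    print(word, "->", stemmer.stem(word))
# play -> play, playing -> play, played -> play, plays -> play, studies -> studi
```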
Vectorization is jargon for a classic approach of converting input data from
its raw format (i.e. text) into vectors of real numbers, which is the format
that ML models expect. The approach has been around ever since computers were
first built, it has worked well across various domains, and it is now standard
in NLP.
In Machine Learning, vectorization is a step in feature extraction. The idea is to
get some distinct features out of the text for the model to train on, by
converting text to numerical vectors.
Bag of Words
One of the simplest vectorization methods for text is a bag-of-words (BoW)
representation. A BoW vector has the length of the entire vocabulary — that is,
the set of unique words in the corpus. The vector’s values represent the
frequency with which each word appears in a given text passage.
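A minimal sketch using scikit-learn’s CountVectorizer (an assumed library choice, since the source doesn’t name one) shows how the vocabulary and the count vectors are built:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "What restaurants are nearby?",
    "Nearby restaurants serve great food.",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # the vocabulary: unique (lowercased) words
print(bow.toarray())                       # one row of word counts per document
```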
TF-IDF
Weighted BoW text vectorization techniques like TF-IDF (short for “term
frequency-inverse document frequency”), on the other hand, attempt to give
higher relevance scores to words that occur in fewer documents within the
corpus. To that end, TF-IDF measures the frequency of a word in a text against
its overall frequency in the corpus.
Think of a document that mentions the word “oranges” with high frequency.
TF-IDF will look at all the other documents in the corpus. If “oranges” occurs in
many documents, then it is not a very significant term and is given a lower
weighting in the TF-IDF text vector. If it occurs in just a few documents,
however, it is considered a distinctive term. In that case, it helps characterize
the document within the corpus and as such receives a higher value in the
vector.
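A minimal sketch using scikit-learn’s TfidfVectorizer (again an assumed library choice) illustrates the “oranges” example: a word concentrated in one document gets a high weight there, while words that appear in every document are down-weighted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "oranges oranges oranges are delicious",
    "apples are delicious",
    "grapes are delicious",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
# "oranges" appears only in the first document, so its TF-IDF weight there is high;
# "are" and "delicious" appear in every document and receive much lower weights.
print(dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0].round(2))))
```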
BM25
While more sophisticated than the simple BoW approach, TF-IDF has some
shortcomings. For example, it does not address the fact that, in short
documents, even just a single mention of a word might mean that the term is
highly relevant. BM25 was introduced to address this and other issues. It is an
improvement over TF-IDF, in that it takes into account the length of the
document. It also dampens the effect of having many occurrences of a word in
a document.
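The text describes BM25 only informally, so here is a minimal sketch of the standard Okapi BM25 scoring formula; k1 and b are the usual tuning parameters, and the helper name bm25_scores and the toy corpus are invented for illustration:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Number of documents containing each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency is dampened by k1 and normalized by document length via b.
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "oranges are grown in warm climates".split(),
    "oranges oranges oranges and more oranges".split(),
    "a short note about apples".split(),
]
# Repeated occurrences of "oranges" raise the second document's score,
# but with diminishing returns rather than linearly.
print(bm25_scores(["oranges"], docs))
```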
Because BoW methods will produce long vectors that contain many zeros,
they’re often called “sparse.” In addition to being language-independent,
sparse vectors are quick to compute and compare. Semantic search systems
use them for quick document retrieval.
Let’s now look at a more recent encoding technique that aims to capture not
just the lexical but also the semantic properties of words.