NLP Basics

Natural Language Processing (NLP) involves the automatic processing of human language and is closely related to various fields such as linguistics, cognitive science, and AI. Tokenization is a crucial first step in NLP, breaking down sentences into manageable parts for further analysis, while challenges include handling languages without clear word boundaries and understanding context. Various Python libraries, including NLTK, TextBlob, and spaCy, provide tools for effective tokenization and further NLP tasks like stemming, lemmatization, and vectorization.

Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of human language. The term ‘NLP’ is sometimes used
rather more narrowly than that, often excluding information retrieval and
sometimes even excluding machine translation. NLP is sometimes contrasted
with ‘computational linguistics’, with NLP being thought of as more applied.
Nowadays, alternative terms are often preferred, like ‘Language Technology’ or
‘Language Engineering’. Language is often used in contrast with speech (e.g.,
Speech and Language Technology). But I’m going to simply refer to NLP and
use the term broadly. NLP is essentially multidisciplinary: it is closely related to
linguistics (although the extent to which NLP overtly draws on linguistic theory
varies considerably). It also has links to research in cognitive science,
psychology, philosophy and maths (especially logic). Within CS, it relates to
formal language theory, compiler techniques, theorem proving, machine
learning and human-computer interaction. Of course it is also related to AI,
though nowadays it’s not generally thought of as part of AI.

NLP drives computer programs that translate text from one language to
another, respond to spoken commands, and summarize large volumes of text
rapidly—even in real time. There’s a good chance you’ve interacted with NLP
in the form of voice-operated GPS systems, digital assistants, speech-to-text
dictation software, customer service chatbots, and other consumer
conveniences. But NLP also plays a growing role in enterprise solutions that
help streamline business operations, increase employee productivity, and
simplify mission-critical business processes.

Tokenization is a simple process that takes raw data and converts it into a
useful data string. While tokenization is well known for its use in cybersecurity
and in the creation of NFTs, tokenization is also an important part of the NLP
process. Tokenization is used in natural language processing to split paragraphs
and sentences into smaller units that can be more easily assigned meaning.

The first step of the NLP process is gathering the data (a sentence) and
breaking it into understandable parts (words). Here’s an example of a string of
data:
“What restaurants are nearby?”

In order for this sentence to be understood by a machine, tokenization is performed on the string to break it into individual parts. With tokenization, we’d get something like this:

‘what’ ‘restaurants’ ‘are’ ‘nearby’

This may seem simple, but breaking a sentence into its parts allows a machine
to understand the parts as well as the whole. This will help the program
understand each of the words by themselves, as well as how they function in
the larger text. This is especially important for larger amounts of text as it
allows the machine to count the frequencies of certain words as well as where
they frequently appear. This is important for later steps in natural language
processing.
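As a minimal sketch, the token list above can be reproduced in Python by lowercasing the sentence and extracting runs of word characters with a regular expression (the exact approach shown here is an illustrative assumption):

import re

sentence = "What restaurants are nearby?"
# Lowercase the sentence and keep only runs of word characters,
# dropping the punctuation.
tokens = re.findall(r"\w+", sentence.lower())
print(tokens)
# ['what', 'restaurants', 'are', 'nearby']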

Tokenization Challenges in NLP


While breaking down sentences seems simple (after all, we build sentences from words all the time), it can be a bit more complex for machines.

A large challenge is being able to segment words when spaces or punctuation marks don’t define the boundaries of the word. This is especially common for languages such as Chinese, Japanese, Korean, and Thai, where spaces do not reliably mark word boundaries.

Another challenge is symbols that change the meaning of a word significantly. We intuitively understand that a ‘$’ sign with a number attached to it ($100) means something different than the number itself (100). Punctuation, especially in less common situations, can cause an issue for machines trying to isolate its meaning as part of a data string.
Contractions such as ‘you’re’ and ‘I’m’ also need to be properly broken down
into their respective parts. Failing to properly tokenize every part of the
sentence can lead to misunderstandings later in the NLP process.

Tokenization is the start of the NLP process, converting sentences into understandable bits of data that a program can work with. Without a strong foundation built through tokenization, the NLP process can quickly devolve into a messy telephone game.

Although tokenization in Python may be simple, we know that it’s the foundation for developing good models and helps us understand the text corpus. This section will list a few tools available for tokenizing text content, like NLTK, TextBlob, spaCy, Gensim, and Keras.

White Space Tokenization

The simplest way to tokenize text is to use whitespace within a string as the “delimiter” of words. This can be accomplished with Python’s split function, which is available on all string object instances as well as on the string built-in class itself. You can change the separator any way you need.
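Here is a minimal sketch with str.split(); the sample sentence is illustrative and chosen to end in “1995.” so that the behaviour described below is visible:

sentence = "This dataset was collected between 1990 and 1995."
# split() with no arguments splits on runs of whitespace.
tokens = sentence.split()
print(tokens)
# ['This', 'dataset', 'was', 'collected', 'between', '1990', 'and', '1995.']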

As you can notice, this built-in Python method already does a good job tokenizing a simple sentence. Its “mistake” was on the last word, where it included the sentence-ending punctuation with the token “1995.”. We need the tokens to be separated from neighboring punctuation and other significant tokens in a sentence.

In the example below, we’ll perform sentence tokenization using the comma as a separator.
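A minimal sketch, assuming an illustrative piece of text:

text = "NLP is fun, it is also challenging, and it is everywhere"
# Splitting on a comma (plus a space) yields sentence-like chunks
# rather than individual words.
chunks = text.split(", ")
print(chunks)
# ['NLP is fun', 'it is also challenging', 'and it is everywhere']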
NLTK Word Tokenize

NLTK (Natural Language Toolkit) is an open-source Python library for Natural Language Processing. It has easy-to-use interfaces for over 50 corpora and lexical resources such as WordNet, along with a set of text processing libraries for classification, tokenization, stemming, and tagging.

You can easily tokenize the sentences and words of the text with the
tokenize module of NLTK.

First, we’re going to import the relevant functions from the NLTK library:

 Word and Sentence tokenizer

N.B.: The sent_tokenize function uses the pre-trained model from tokenizers/punkt/english.pickle.
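A minimal sketch, assuming NLTK is installed and the punkt model has been downloaded; the sample text is illustrative:

import nltk
nltk.download("punkt")  # needed once for sent_tokenize / word_tokenize

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello everyone. Welcome to NLP. We are learning tokenization."
print(sent_tokenize(text))
# ['Hello everyone.', 'Welcome to NLP.', 'We are learning tokenization.']
print(word_tokenize(text))
# ['Hello', 'everyone', '.', 'Welcome', 'to', 'NLP', '.', 'We', 'are',
#  'learning', 'tokenization', '.']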

 Punctuation-based tokenizer

This tokenizer splits sentences into words based on whitespace and punctuation.
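A minimal sketch comparing the two tokenizers; the sentence built around “Amal.M” is an assumption:

from nltk.tokenize import word_tokenize, wordpunct_tokenize

text = "Amal.M is learning NLP, isn't she?"
print(word_tokenize(text))
# ['Amal.M', 'is', 'learning', 'NLP', ',', 'is', "n't", 'she', '?']
print(wordpunct_tokenize(text))
# ['Amal', '.', 'M', 'is', 'learning', 'NLP', ',', 'isn', "'", 't', 'she', '?']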

Notice the difference: word_tokenize treats “Amal.M” as a single word, while wordpunct_tokenize splits it at the punctuation.

 Treebank Word tokenizer

This tokenizer incorporates a variety of common rules for English word tokenization. It separates phrase-terminating punctuation like (?!.;,) from adjacent tokens and retains decimal numbers as a single token. It also contains rules for English contractions: for example, “don’t” is tokenized as [“do”, “n’t”]. You can find all the rules for the Treebank tokenizer in the NLTK documentation.
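A minimal sketch of the Treebank word tokenizer on an illustrative sentence:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
text = "I don't think the price of 3.5 dollars is too high, do you?"
print(tokenizer.tokenize(text))
# ['I', 'do', "n't", 'think', 'the', 'price', 'of', '3.5', 'dollars',
#  'is', 'too', 'high', ',', 'do', 'you', '?']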

 Tweet tokenizer

When we want to tokenize text data like tweets, the tokenizers mentioned above can’t produce practical tokens. To address this, NLTK provides a rule-based tokenizer built specifically for tweets. It keeps elements such as emoticons and hashtags as separate tokens, which we may need for tasks like sentiment analysis.
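A minimal sketch of TweetTokenizer on a made-up tweet; stripping handles and shortening repeated characters are optional settings:

from nltk.tokenize import TweetTokenizer

tweet = "@user NLP is soooo cool!!! :-) #nlp"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize(tweet))
# ['NLP', 'is', 'sooo', 'cool', '!', '!', '!', ':-)', '#nlp']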

 MWET tokenizer

NLTK’s multi-word expression tokenizer (MWETokenizer) provides a function add_mwe() that allows the user to register multi-word expressions before using the tokenizer on the text. More simply, it can merge multi-word expressions into single tokens.
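A minimal sketch; the multi-word expression and sentence are illustrative:

from nltk.tokenize import MWETokenizer, word_tokenize

tokenizer = MWETokenizer()
tokenizer.add_mwe(("natural", "language", "processing"))

text = "I love natural language processing"
# MWETokenizer works on an already-tokenized list and merges the
# registered expression into one token (joined with "_" by default).
print(tokenizer.tokenize(word_tokenize(text)))
# ['I', 'love', 'natural_language_processing']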

TextBlob Word Tokenize

TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Let’s start by installing TextBlob and the NLTK corpora:

$ pip install -U textblob
$ python3 -m textblob.download_corpora

In the code below, we perform word tokenization using the TextBlob library:
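A minimal sketch, assuming an illustrative sentence:

from textblob import TextBlob

text = "TextBlob isn't hard to use, is it?"
blob = TextBlob(text)
# The .words property returns word tokens with punctuation removed.
print(blob.words)
# ['TextBlob', 'is', "n't", 'hard', 'to', 'use', 'is', 'it']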


Notice that the TextBlob tokenizer removes punctuation. In addition, it has rules for English contractions.

spaCy Tokenizer

spaCy is an open-source Python library that parses and understands large volumes of text. With models available for specific languages (English, French, German, etc.), it handles NLP tasks with efficient implementations of common algorithms.

The spaCy tokenizer provides the flexibility to specify special tokens that don’t need to be segmented, or that need to be segmented using special rules for each language; for example, punctuation at the end of a sentence should be split off, whereas “U.K.” should remain one token.

Before you can use spaCy, you need to install it and download the data and models for the English language.

$ pip install spacy
$ python3 -m spacy download en_core_web_sm
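A minimal sketch, assuming the small English model downloaded above; the sample text is illustrative:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The U.K. office opened in 1995. It wasn't easy!")
# Each element of the Doc is a Token; token.text gives its string form.
print([token.text for token in doc])
# ['The', 'U.K.', 'office', 'opened', 'in', '1995', '.', 'It', 'was', "n't", 'easy', '!']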
Gensim Word Tokenizer

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. It offers utility functions for tokenization.
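A minimal sketch using gensim’s utility tokenizer on an illustrative sentence:

from gensim.utils import tokenize

text = "Gensim is a library for topic modeling, built for large corpora."
# tokenize() returns a generator of alphabetic tokens.
print(list(tokenize(text, lowercase=True)))
# ['gensim', 'is', 'a', 'library', 'for', 'topic', 'modeling', 'built', 'for', 'large', 'corpora']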

Tokenization with Keras

The open-source Keras library is one of the most reliable deep learning frameworks. To perform tokenization, we use the text_to_word_sequence method from the keras.preprocessing.text module. A great thing about Keras is that it converts the text to lower case before tokenizing it, which can be quite a time-saver.
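A minimal sketch, assuming TensorFlow is installed so that the function is available through tf.keras:

from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "Keras lowercases the Text and strips punctuation!"
# The function lowercases the text and filters punctuation by default.
print(text_to_word_sequence(text))
# ['keras', 'lowercases', 'the', 'text', 'and', 'strips', 'punctuation']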
Stemming and Lemmatization

Stemming and lemmatization were developed in the 1960s. They are text normalization and text mining procedures in the field of Natural Language Processing that are applied to prepare text, words, and documents for further processing, and they are widely used for tagging, SEO, web search results, and information retrieval.
While implementing NLP, you will often face the issue of words that share a root form but have different representations; for example, the word “caring” can be stripped down to “car” with stemming and to “care” with lemmatization.

What is Stemming?

We already know that a word has one root or base form but many different variations; for example, “play” is the root word, while playing, played, and plays are different forms of that single word. When these words are stripped down incorrectly, they can take on the wrong meaning or introduce other errors.

The process of reducing inflected words to their root forms is called Stemming. It groups related words under the same stem, even if that stem has no appropriate meaning on its own.

Moreover:
 Stemming is a rule-based approach: it slices prefixes or suffixes off inflected words as needed, using a set of commonly used affixes such as “-ing”, “-ed”, “-es”, and “pre-”. The result is often not an actual word.
 Two main errors can occur while performing stemming: over-stemming and under-stemming. Over-stemming occurs when two words with different meanings are reduced to the same stem. Under-stemming occurs when two related words that should share a stem are not reduced to the same stem.
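A minimal sketch of stemming with NLTK’s PorterStemmer, using the “play” family from above plus “studies” to show that a stem is not always a real word:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "plays", "studies"]:
    print(word, "->", stemmer.stem(word))
# playing -> play
# played -> play
# plays -> play
# studies -> studi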

What is Lemmatization?

Lemmatization is a method for grouping the different inflected forms of a word into its root form, which carries the same meaning. It is similar to stemming, but the word it returns has an actual dictionary meaning. Morphological analysis is required to extract the correct lemma of each word.

For example, lemmatization correctly identifies the base form of ‘troubled’ as ‘trouble’, which carries meaning, whereas stemming may simply cut off the ‘ed’ part and produce ‘troubl’, which has the wrong meaning and a spelling error.

‘troubled’ -> Lemmatization -> ‘trouble’
‘troubled’ -> Stemming -> ‘troubl’
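A minimal sketch of lemmatization with NLTK’s WordNetLemmatizer; passing the part of speech (pos="v" for verb) helps it find the right lemma:

import nltk
nltk.download("wordnet")  # needed once for the WordNet lemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("troubled", pos="v"))  # trouble
print(lemmatizer.lemmatize("caring", pos="v"))    # care
print(lemmatizer.lemmatize("plays", pos="v"))     # play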
Word Vectorization

Word embeddings, or word vectorization, is a methodology in NLP for mapping words or phrases from a vocabulary to a corresponding vector of real numbers, which is then used for word predictions and word similarity/semantics.

The process of converting words into numbers is called vectorization.

Word embeddings help in the following use cases.

 Compute similar words
 Text classifications
 Document clustering/grouping
 Feature extraction for text classifications
 Natural language processing.

Vectorization is jargon for a classic approach to converting input data from its raw format (i.e., text) into vectors of real numbers, which is the format that ML models support. This approach has existed ever since computers were first built, it has worked wonderfully across various domains, and it’s now used in NLP.
In Machine Learning, vectorization is a step in feature extraction. The idea is to
get some distinct features out of the text for the model to train on, by
converting text to numerical vectors.

Bag of Words
One of the simplest vectorization methods for text is a bag-of-words (BoW)
representation. A BoW vector has the length of the entire vocabulary — that is,
the set of unique words in the corpus. The vector’s values represent the
frequency with which each word appears in a given text passage.
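A minimal sketch of a BoW representation, here built with scikit-learn’s CountVectorizer (one of several possible implementations); the two-document corpus is illustrative:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I like oranges and apples",
    "Oranges are orange",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

# One column per vocabulary word, one row per document; values are counts.
# (Single-letter tokens such as "I" are dropped by the default token pattern.)
print(vectorizer.get_feature_names_out())
# ['and' 'apples' 'are' 'like' 'orange' 'oranges']
print(bow.toarray())
# [[1 1 0 1 0 1]
#  [0 0 1 0 1 1]]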
TF-IDF
Weighted BoW text vectorization techniques like TF-IDF (short for “term frequency-inverse document frequency”), on the other hand, attempt to give
higher relevance scores to words that occur in fewer documents within the
corpus. To that end, TF-IDF measures the frequency of a word in a text against
its overall frequency in the corpus.

Think of a document that mentions the word “oranges” with high frequency.
TF-IDF will look at all the other documents in the corpus. If “oranges” occurs in
many documents, then it is not a very significant term and is given a lower
weighting in the TF-IDF text vector. If it occurs in just a few documents,
however, it is considered a distinctive term. In that case, it helps characterize
the document within the corpus and as such receives a higher value in the
vector.
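A minimal sketch with scikit-learn’s TfidfVectorizer on an illustrative three-document corpus; it shows that a distinctive word gets a higher weight than a word that appears everywhere:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "oranges are sweet and oranges are juicy",
    "apples are sweet",
    "bananas are yellow",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# "oranges" appears only in the first document, so its TF-IDF weight there
# is higher than that of "are", which occurs in every document.
vocab = vectorizer.vocabulary_
print(tfidf[0, vocab["oranges"]] > tfidf[0, vocab["are"]])  # True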

BM25
While more sophisticated than the simple BoW approach, TF-IDF has some
shortcomings. For example, it does not address the fact that, in short
documents, even just a single mention of a word might mean that the term is
highly relevant. BM25 was introduced to address this and other issues. It is an
improvement over TF-IDF, in that it takes into account the length of the
document. It also dampens the effect of having many occurrences of a word in
a document.
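A minimal sketch using the third-party rank_bm25 package (pip install rank-bm25), one of several available implementations; the corpus and query are illustrative:

from rank_bm25 import BM25Okapi

corpus = [
    "oranges are sweet and juicy".split(),
    "apples are sweet".split(),
    "bananas are yellow".split(),
]
bm25 = BM25Okapi(corpus)

query = "sweet oranges".split()
# One relevance score per document; the first document, which contains
# both query terms, scores highest here.
print(bm25.get_scores(query))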

Because BoW methods will produce long vectors that contain many zeros,
they’re often called “sparse.” In addition to being language-independent,
sparse vectors are quick to compute and compare. Semantic search systems
use them for quick document retrieval.
Let’s now look at a more recent encoding technique that aims to capture not
just the lexical but also the semantic properties of words.

Word2Vec: Inferring Meaning from Context


Words are more than just a collection of letters. As speakers of a language, we
might understand what a word means and how to use it in a sentence. In
short, we would understand its semantics. The sparse, count-based methods
we saw above do not account for the meaning of the words or phrases that our
system processes.
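As a minimal sketch of the idea, a small Word2Vec model can be trained with gensim (assuming gensim 4.x; the toy corpus is illustrative and far too small to learn meaningful vectors):

from gensim.models import Word2Vec

sentences = [
    ["nlp", "maps", "words", "to", "vectors"],
    ["word2vec", "learns", "vectors", "from", "context"],
    ["similar", "words", "get", "similar", "vectors"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["vectors"].shape)       # (50,)
print(model.wv.most_similar("words"))  # nearest neighbours in the toy space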
