NLP Text Preprocessing
Text Analytics
A traditional text analytics framework consists of three consecutive phases: Text Preprocessing, Text Representation, and Knowledge Discovery, as shown in the figure below.
Text Representation
After text preprocessing has been completed, the individual word tokens must be
transformed into a vector representation suitable for input into text mining algorithms.
Knowledge Discovery
Once the text corpus has been transformed into numeric vectors, we can apply existing machine learning or data mining methods such as classification or clustering.
2. Tokenize: Break the text into discrete words called tokens, i.e., transform the text into a list of words (tokens).
3. Remove stopwords (“stopping”): Remove all stopwords, that is, words used to construct the syntax of a sentence but carrying little textual information (conjunctions, articles, and prepositions), such as a, about, an, are, as, at, be, by, for, from, how, will, with, and many others.
4. Stem: Remove prefixes and suffixes to normalize words; for example, run, running, and runs would all be stemmed to run, so words with variant forms are treated as the same feature. Many stemming algorithms exist (Porter, Snowball, and Lancaster), and the choice also depends on the language. Note that lemmatization can be used instead of stemming, depending on the text mining subtask and the corpus language.
5. Normalize spelling: Unify misspellings and other spelling variations into a single
token.
7. Normalize case: Convert the text to either all lower or all upper case (a sketch combining these steps is shown below).
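As a minimal sketch, these steps could be chained together with NLTK; the sample sentence and the resource downloads are assumptions made only for illustration.

```python
# A minimal sketch of the preprocessing steps above using NLTK.
# Assumes the required resources are available, e.g. via
# nltk.download('punkt') and nltk.download('stopwords').
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The president was running for office and runs a large campaign."

# 2. Tokenize: break the text into word tokens.
tokens = nltk.word_tokenize(text)

# 7. Normalize case: lowercase every token.
tokens = [t.lower() for t in tokens]

# 3. Remove stopwords and punctuation: drop syntax words such as "the", "was", "and".
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 4. Stem: reduce each token to its stem (running, runs -> run).
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(stems)  # e.g. ['presid', 'run', 'offic', 'run', 'larg', 'campaign']
```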
Difference between Stemming and Lemmatization
Both stemming and lemmatization are used to normalize a given word by removing affixes and reducing it to a base form. The major differences between them are as follows:
◼ Stemming:
1. Stemming usually operates on a single word without knowledge of the context.
2. In stemming, we do not consider POS (part-of-speech) tags.
3. Stemming is used to group words with a similar basic meaning together.
◼ Lemmatization:
1. Lemmatization usually considers the word together with its context in the sentence.
2. In lemmatization, we consider POS tags.
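A small illustrative comparison with NLTK's PorterStemmer and WordNetLemmatizer (the word list is invented for the example) shows the difference: the stemmer strips suffixes without any context, while the lemmatizer's output changes with the POS tag it is given.

```python
# Illustration of stemming vs. lemmatization with NLTK.
# Assumes the 'wordnet' resource has been downloaded, e.g. nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "studying", "better", "ran"]

for w in words:
    # Stemming: crude suffix stripping, no POS or context information.
    stem = stemmer.stem(w)
    # Lemmatization: dictionary lookup; the POS tag ('v' = verb, 'a' = adjective)
    # changes the result, e.g. "better" -> "good" only when treated as an adjective.
    lemma_v = lemmatizer.lemmatize(w, pos="v")
    lemma_a = lemmatizer.lemmatize(w, pos="a")
    print(f"{w:10s} stem={stem:8s} lemma(verb)={lemma_v:8s} lemma(adj)={lemma_a}")
```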
Preprocessing methods depend on the specific application. Many applications, such as Opinion Mining or Natural Language Processing (NLP), need to analyze the message from a syntactic point of view, which requires that the method retain the original sentence structure. Without this information, it is difficult to distinguish “Which university did the president graduate from?” from “Which president is a graduate of Harvard University?”, which have overlapping vocabularies. In such cases, we need to avoid removing the syntax-carrying words.
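As a quick illustration of this point, stripping the syntax words from the two example questions leaves almost identical bags of words; the small stopword list below is hand-picked just for this sketch.

```python
# Illustration only: after removing syntax words, the two questions become
# nearly indistinguishable bags of words.
stop_words = {"which", "did", "the", "from", "of", "is", "a"}  # hand-picked for this example

q1 = "Which university did the president graduate from?"
q2 = "Which president is a graduate of Harvard University?"

def bag(question):
    words = question.lower().replace("?", "").split()
    return {w for w in words if w not in stop_words}

print(bag(q1))  # {'university', 'president', 'graduate'} (order may vary)
print(bag(q2))  # {'harvard', 'university', 'president', 'graduate'}
```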
Text Representation: Bag of Words and Vector Space Models
The most popular structured representation of text is the vector-space model, which
represents every document (text) from the corpus as a vector whose length is equal to
the vocabulary of the corpus. This results in an extremely high-dimensional space;
typically, every distinct string of characters occurring in the collection of text documents
has a dimension. This includes dimensions for common English words and other strings
such as email addresses and URLs. For a collection of text documents of reasonable
size, the vectors can easily contain hundreds of thousands of elements. For those
readers who are familiar with data mining or machine learning, the vector-space model
can be viewed as a traditional feature vector where words and strings substitute
for more traditional numerical features. Therefore, it is not surprising that many text
mining solutions consist of applying data mining or machine learning algorithms to text
stored in a vector-space representation, provided these algorithms can be adapted or
extended to deal efficiently with the large dimensional space encountered in text
situations.
The vector-space model makes an implicit assumption (called the bag-of-words assumption) that the
order of the words in the document does not matter. This may seem like a big assumption, since text
must be read in a specific order to be understood. For many text mining tasks, such as document
classification or clustering, however, this assumption is usually not a problem. The collection of words
appearing in the document (in any order) is usually sufficient to differentiate between semantic
concepts. The main strength of text mining algorithms is their ability to use all of the words in the document: primary keywords and the remaining general text. Often, keywords alone do not differentiate a document; instead, the usage patterns of the secondary words provide the differentiating characteristics.
Though the bag-of-words assumption works well for many tasks, it is not a universal solution. For
some tasks, such as information extraction and natural language processing, the order of words is
critical for solving the task successfully. Prominent features in both entity extraction and natural
language processing include both preceding and following words and the decision (e.g., the part of
speech) for those words. Specialized algorithms and models for handling sequences such as finite
state machines or conditional random fields are used in these cases.
Another challenge for using the vector-space model is the presence of homographs: words that are spelled the same but have different meanings (for example, “bank” can refer to a financial institution or to the side of a river).
There are many available libraries and APIs (such as Scikit-Learn, Gensim, and NLTK) that make implementing these vector encodings easier.
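For example, a bag-of-words encoding could be sketched with Gensim roughly as follows; the two tokenized toy documents are invented for illustration.

```python
# Minimal bag-of-words sketch with Gensim: each document becomes a sparse
# vector of (token_id, count) pairs over the corpus vocabulary.
from gensim.corpora import Dictionary

# Two already-tokenized toy documents (invented for illustration).
docs = [
    ["cat", "sat", "on", "the", "mat", "the", "cat"],
    ["the", "dog", "chased", "the", "cat"],
]

dictionary = Dictionary(docs)                 # maps every distinct token to an integer id
vectors = [dictionary.doc2bow(doc) for doc in docs]

print(dictionary.token2id)                    # vocabulary: token -> dimension index
print(vectors[0])                             # sparse (token_id, count) pairs; "cat" and "the" appear twice
```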
Frequency vector
In this representation, each document is represented by one vector where a vector
element i represents the number of times (frequency) the ith word appears in the
document. This representation can either be a straight count (integer) encoding as
shown in the following figure or a normalized encoding where each word is weighted
by the total number of words in the document.
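A minimal sketch of such a count encoding with scikit-learn's CountVectorizer (on an invented three-document corpus) might look like this; the last lines show the normalized variant, where each count is divided by the document length.

```python
# Frequency (count) encoding sketch with scikit-learn's CountVectorizer;
# the three short documents are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)          # document-term matrix of raw counts

print(vectorizer.get_feature_names_out())     # corpus vocabulary (one dimension per word)
print(X.toarray())                            # integer counts per document

# Normalized variant: weight each count by the total number of words in the document.
counts = X.toarray().astype(float)
normalized = counts / counts.sum(axis=1, keepdims=True)
print(normalized.round(2))
```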
Frequency vectors can, however, be dominated by tokens that occur very often but carry little meaning. A simple alternative is one-hot encoding, a boolean vector encoding method that marks a particular vector index with a value of true (1) if the token exists in the document and false (0) if it does not. In other words, each element of a one-hot encoded vector reflects either the presence or absence of the token in the described text, as shown in the following figure.
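One way to sketch this presence/absence encoding is to reuse CountVectorizer with binary=True (again on a toy corpus invented for illustration):

```python
# One-hot (boolean) encoding sketch: CountVectorizer with binary=True records
# only presence (1) or absence (0) of a token, not its frequency.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

onehot = CountVectorizer(binary=True)
B = onehot.fit_transform(corpus)

print(onehot.get_feature_names_out())
print(B.toarray())   # every entry is 0 or 1, even for words that appear twice (e.g. "the")
```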
Term Frequency (TF) and Inverse Document Frequency (IDF)
The term frequency of term ti in document dj is defined as

    tfij = fij / maxk fkj

where fij is the number of occurrences of ti in dj and the maximum maxk fkj is computed over all terms that appear in document dj. If term ti does not appear in dj then tfij = 0.

The inverse document frequency of term ti is

    idfi = log(N / dfi)

where N is the total number of documents in the collection and dfi is the number of documents that contain ti. The intuition here is that if a term appears in a large number of documents in the collection, it is probably not important or not discriminative.
TF-IDF
The final TF-IDF term weight is given by:

    TF-IDF(ti, dj) = tfij × idfi

Postscript: TF-IDF weighting has many variants; here we only give the most basic one.
The assumption behind TF-IDF is that words with high term frequency should receive
high weight unless they also have high document frequency. The word “the” is one of
the most commonly occurring words in the English language. “The” often occurs
many times within a single document, but it also occurs in nearly every document.
These two competing effects cancel out to give “the” a low weight.
Computing TF-IDF: An Example
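As an illustrative sketch (the three-document corpus below is invented), the basic formulas above can be computed directly in a few lines; note how the common word “the” ends up with zero weight, matching the intuition described earlier.

```python
# Worked TF-IDF sketch following the basic formulas above
# (tf normalized by the most frequent term in the document, idf = log(N / df)).
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the sun is bright".split(),
]
N = len(docs)

counts = [Counter(doc) for doc in docs]
vocab = sorted({w for doc in docs for w in doc})

# Document frequency: in how many documents does each term appear?
df = {t: sum(1 for c in counts if t in c) for t in vocab}

def tf(t, j):
    # Term frequency normalized by the count of the most frequent term in document j.
    return counts[j][t] / max(counts[j].values())

def tfidf(t, j):
    # Basic weight: tf * idf (natural log here; the base only rescales the weights).
    return tf(t, j) * math.log(N / df[t])

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes,
# while rarer, more discriminative words such as "dog" keep a positive weight.
for term in ["the", "cat", "dog"]:
    print(term, [round(tfidf(term, j), 3) for j in range(N)])
```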
TF-IDF-based applications
Some applications that use TF-IDF:
◼ Keyword extraction and general text data analysis: TF-IDF makes it easy to identify the most informative keywords in a dataset.
◼ Text summarization: in statistical approaches to summarization, TF-IDF is one of the most important features for selecting the content of a document's summary.
◼ Search engines: variations of the TF-IDF weighting scheme are often used to score and rank a document's relevance for a given user query.
◼ Document classification applications.