NLP Text Preprocessing

The document outlines a general framework for text analytics, consisting of three main phases: Text Preprocessing, Text Representation, and Knowledge Discovery. It details the steps involved in text preprocessing, such as tokenization, stopword removal, and stemming, as well as various text representation methods like Bag of Words and TF-IDF. The framework aims to facilitate the extraction of meaningful information from text data through structured representation and machine learning techniques.

A General Framework for Text Analytics

 A traditional text analytics framework consists of three consecutive phases: Text Preprocessing, Text Representation, and Knowledge Discovery, as shown in the figure below.

Figure 1. A Traditional Framework for Text Analytics


A General Framework for Text Analytics
 Text Preprocessing
Text preprocessing aims to make the input documents more consistent to facilitate
text representation, which is necessary for most text analytics tasks. The text data is
usually preprocessed to remove parts that do not bring any relevant information.

 Text Representation
After text preprocessing has been completed, the individual word tokens must be
transformed into a vector representation suitable for input into text mining algorithms.

 Knowledge Discovery
Once we have successfully transformed the text corpus into numeric vectors, we can apply existing machine learning or data mining methods such as classification or clustering.

By applying text preprocessing, text representation, and knowledge discovery methods, we can mine latent, useful information from the input text corpus, such as the similarity between two messages from social media.
Text Preprocessing
 The possible text preprocessing steps are the same for all text mining tasks, though which steps are chosen depends on the task. The ways to process documents are varied and highly application- and language-dependent. The basic steps are as follows:

1. Choose the scope of the text to be processed (documents, paragraphs, etc.).


Choosing the proper scope depends on the goals of the text mining task: for
classification or clustering tasks, often the entire document is the proper scope; for
sentiment analysis, document summarization, or information retrieval, smaller units
of text such as paragraphs or sections might be more appropriate.

2. Tokenize: Break the text into discrete words called tokens, transforming the text into a list of tokens.

3. Remove stopwords (“stopping”): remove all the stopwords, that is, the words used to construct the syntax of a sentence but carrying little information of their own (such as conjunctions, articles, and prepositions): a, about, an, are, as, at, be, by, for, from, how, will, with, and many others.
Text Preprocessing
4. Stem: Remove prefixes and suffixes to normalize words; for example, run, running, and runs would all be stemmed to run, so that words with variant forms can be treated as the same feature. Many stemming algorithms have been developed (Porter, Snowball, and Lancaster), and the choice also depends on the language. Note that lemmatization can be used instead of stemming, depending on the text mining subtask and the corpus language.

5. Normalize spelling: Unify misspellings and other spelling variations into a single
token.

6. Detect sentence boundaries: Mark the ends of sentences.

7. Normalize case: Convert the text to either all lower or all upper case.
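
As an illustration, the following is a minimal sketch of steps 2, 3, 4, and 7 using NLTK. The sample sentence is hypothetical, and the exact tokens produced may vary slightly with the NLTK version and the resources installed.

```python
# A minimal preprocessing sketch: tokenize, lowercase, remove stopwords, stem.
# Assumes NLTK is installed along with its 'punkt' and 'stopwords' resources.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)        # tokenizer model
nltk.download("stopwords", quiet=True)    # English stopword list

text = "The president is running for a second term."

tokens = nltk.word_tokenize(text.lower())                             # steps 2 and 7
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # step 3
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]                             # step 4

print(stems)   # expected roughly: ['presid', 'run', 'second', 'term']
```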
Text Preprocessing
Difference between Stemming and Lemmatization

 Stemming and lemmatization are both used to normalize a given word, either by removing affixes or by considering the word's meaning. The major differences between them are as follows:
◼ Stemming:
1. Stemming usually operates on a single word without knowledge of the context.
2. In stemming, we do not consider POS (part-of-speech) tags.
3. Stemming is used to group words with a similar basic meaning together.

◼ Lemmatization:
1. Lemmatization usually considers the word and the context of the word in the sentence.
2. In lemmatization, we consider POS tags.
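
A small example of this difference, sketched with NLTK's PorterStemmer and WordNetLemmatizer (assuming the 'wordnet' resource is available; exact outputs may vary by version):

```python
# Sketch contrasting stemming (no context, no POS) with lemmatization (POS-aware).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer chops affixes; the lemmatizer uses the POS tag we supply.
print(stemmer.stem("running"), lemmatizer.lemmatize("running", pos="v"))  # run run
print(stemmer.stem("studies"), lemmatizer.lemmatize("studies", pos="n"))  # studi study
print(stemmer.stem("better"),  lemmatizer.lemmatize("better",  pos="a"))  # better good
```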
Text Preprocessing
 Preprocessing methods depend on the specific application. Many applications, such as Opinion Mining or other Natural Language Processing (NLP) tasks, need to analyze the message from a syntactic point of view, which requires retaining the original sentence structure. Without this information, it is difficult to distinguish “Which university did the president graduate from?” from “Which president is a graduate of Harvard University?”, which have overlapping vocabularies. In such cases, we need to avoid removing the syntax-carrying words.
Text Representation: Bag of Words and Vector Space Models
The most popular structured representation of text is the vector-space model, which represents every document (text) from the corpus as a vector whose length equals the size of the corpus vocabulary. This results in an extremely high-dimensional space;
typically, every distinct string of characters occurring in the collection of text documents
has a dimension. This includes dimensions for common English words and other strings
such as email addresses and URLs. For a collection of text documents of reasonable
size, the vectors can easily contain hundreds of thousands of elements. For those
readers who are familiar with data mining or machine learning, the vector-space model
can be viewed as a traditional feature vector where words and strings substitute
for more traditional numerical features. Therefore, it is not surprising that many text
mining solutions consist of applying data mining or machine learning algorithms to text
stored in a vector-space representation, provided these algorithms can be adapted or
extended to deal efficiently with the large dimensional space encountered in text
situations.
Text Representation: Bag of Words and Vector Space Models
The vector-space model makes an implicit assumption (called the bag-of-words assumption) that the
order of the words in the document does not matter. This may seem like a big assumption, since text
must be read in a specific order to be understood. For many text mining tasks, such as document
classification or clustering, however, this assumption is usually not a problem. The collection of words
appearing in the document (in any order) is usually sufficient to differentiate between semantic
concepts. The main strength of text mining algorithms is their ability to use all of the words in the
document: primary keywords and the remaining general text. Often, keywords alone do not differentiate
a document, but instead the usage patterns of the secondary words provide the differentiating
characteristics.

Though the bag-of-words assumption works well for many tasks, it is not a universal solution. For
some tasks, such as information extraction and natural language processing, the order of words is
critical for solving the task successfully. Prominent features in both entity extraction and natural
language processing include both preceding and following words and the decision (e.g., the part of
speech) for those words. Specialized algorithms and models for handling sequences such as finite
state machines or conditional random fields are used in these cases.
Another challenge for using the vector-space model is the presence of homographs. These are words
that are spelled the same but have different meanings.

P.S. The Bag of Words model is also known as the Vector Space Model.


Understanding BOW
The bag of words (BOW) model represents text as a bag, or multiset, of its words, disregarding grammar and word order (after text preprocessing has been completed). BOW is often used to generate features; after generating BOW, we can derive the term frequency of each word in the document, which can later be fed to a machine learning algorithm.
To vectorize a corpus with a bag-of-words (BOW) approach, we represent every document from the corpus as a vector whose length equals the size of the corpus vocabulary. We can simplify the
computation by sorting token positions of the vector into alphabetical order, as shown in the following
Figure. Alternatively, we can keep a dictionary that maps tokens to vector positions. Either way, we
arrive at a vector mapping of the corpus that enables us to uniquely represent every document.

Figure 2. Encoding documents as vectors (Tony Ojeda et al., 2018)


Understanding BOW
What should each element in the document vector be? In the next sections, we will explore three types of vector encoding:
- Frequency vector,
- One-hot vector (a binary representation),
- TF-IDF vector (a float-valued weighted vector).

There are many available libraries (such as Scikit-Learn, Gensim, and NLTK) that make implementing these vector encodings easier.
Frequency vector
 In this representation, each document is represented by one vector where a vector
element i represents the number of times (frequency) the ith word appears in the
document. This representation can either be a straight count (integer) encoding as
shown in the following figure or a normalized encoding where each word is weighted
by the total number of words in the document.

Figure 3. Token frequency as vector encoding (Tony Ojeda et al., 2018)
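
A minimal sketch of this count encoding with Scikit-Learn's CountVectorizer follows; the corpus is a hypothetical example, and get_feature_names_out assumes a recent Scikit-Learn version. Feature names are returned in alphabetical order, matching the sorted vector positions described above.

```python
# Frequency (count) encoding of a small hypothetical corpus with Scikit-Learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sneeze!",
    "Wondering, she opened the door to the studio.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)         # sparse document-term count matrix

print(vectorizer.get_feature_names_out())    # vocabulary, alphabetically ordered
print(X.toarray())                           # one row of raw counts per document
```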


One-hot encoding
Because they disregard grammar and the relative position of words in documents, frequency-based
encoding methods suffer from the long tail, or Zipfian distribution, that characterizes natural language.
As a result, tokens that occur very frequently are orders of magnitude more “significant” than other,
less frequent ones. This can have a significant impact on some models (e.g., generalized linear
models) that expect normally distributed features.

A solution to this problem is one-hot encoding, a boolean vector encoding method that marks a
particular vector index with a value of true (1) if the token exists in the document and false (0) if it does
not. In other words, each element of a one-hot encoded vector reflects either the presence or absence
of the token in the described text as shown in the following Figure.

Figure 4. One-hot encoding (Tony Ojeda et al., 2018)
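
A minimal sketch of document-level one-hot encoding, again with Scikit-Learn: passing binary=True to CountVectorizer clips counts to 0/1 (the corpus is hypothetical).

```python
# One-hot (binary presence/absence) encoding with CountVectorizer(binary=True).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sneeze!",
]

onehot = CountVectorizer(binary=True)        # counts are clipped to 0 or 1
X = onehot.fit_transform(corpus)

print(X.toarray())                           # 1 = token present, 0 = absent
```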


One-hot encoding
 One-hot encoding is very useful. Some of the basic applications of the one-hot encoding format are:
◼ Many artificial neural networks accept input data in the one-hot encoding format and generate output vectors that carry a semantic representation as well.
◼ The word2vec algorithm accepts words as input, and these words are first converted into vectors by one-hot encoding.
◼ …..
TF-IDF
 Storing text as weighted vectors first requires choosing a weighting scheme.
The most popular scheme is the TF-IDF weighting approach.

 TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic that indicates how important a word is to a given document in the present dataset or corpus (collection).
TF-IDF
 Term Frequency (TF):
The term frequency for a term ti in document dj is the number of times that ti
appears in document dj, denoted by fij. It can be absolute or relative (normalization
may also be applied).

The normalized term frequency (denoted by tfij) of ti in dj is given by the equation below:

$$tf_{ij} = \frac{f_{ij}}{\max\{f_{1j}, f_{2j}, \ldots, f_{|V|j}\}}$$

where the maximum is computed over all terms that appear in document dj, and |V| is the vocabulary size of the collection. If term ti does not appear in dj, then tfij = 0.

TF-IDF
 Document Frequency (DF)
◼ dfi = document frequency of term ti = number of documents containing term ti
◼ The inverse document frequency (denoted by idfi) of term ti is given by:

$$idf_{i} = \log\frac{N}{df_{i}}$$

- where N is the total number of documents in the collection.
- There is one IDF value for each term ti in a collection.
- The log is used to dampen the effect relative to tf.

The intuition here is that if a term appears in a large number of documents in the
collection, it is probably not important or not discriminative.
TF-IDF
 The final TF-IDF term weight is given by:

$$\text{TF-IDF}(t_i, d_j) = tf_{ij} \times idf_{i}$$

Postscript: tf-idf weighting has many variants; here we only gave the most basic one.

 The assumption behind TF-IDF is that words with high term frequency should receive
high weight unless they also have high document frequency. The word “the” is one of
the most commonly occurring words in the English language. “The” often occurs
many times within a single document, but it also occurs in nearly every document.
These two competing effects cancel out to give “the” a low weight.
Computing TF-IDF: An Example
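The original worked example is not reproduced here, so the following is a minimal sketch that applies the basic formulas from the previous slides (normalized tf, idf = log(N/df), and their product) to a hypothetical two-document corpus.

```python
# Worked TF-IDF example using the basic variant defined on the previous slides,
# applied to a hypothetical two-document corpus.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
N = len(docs)
counts = [Counter(d) for d in docs]          # raw term frequencies f_ij

def tf(term, j):
    # normalized term frequency: f_ij divided by the max frequency in d_j
    return counts[j][term] / max(counts[j].values())

def idf(term):
    # inverse document frequency: log(N / df_i)
    df = sum(1 for c in counts if term in c)
    return math.log(N / df)

def tfidf(term, j):
    return tf(term, j) * idf(term)

print(tfidf("the", 0))   # 0.0 -> "the" appears in every document, so idf = 0
print(tfidf("mat", 0))   # 0.5 * log(2) ~= 0.35 -> rarer term gets a higher weight
```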
TF-IDF Based Applications
Some applications that use TF-IDF:

 In general, TF-IDF makes text data analysis straightforward; it can reveal the most relevant keywords in your dataset.
 If you are developing a text summarization application based on a statistical approach, TF-IDF is a key feature for generating a summary of the document.
 Variations of the TF-IDF weighting scheme are often used by search engines to score and rank a document's relevance for a given user query (see the sketch after this list).
 Document classification applications.
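
As an illustration of the search-engine use case, here is a minimal sketch of scoring documents against a query with Scikit-Learn's TfidfVectorizer and cosine similarity. Note that TfidfVectorizer implements a smoothed, normalized TF-IDF variant rather than the basic formula above; the corpus and query are hypothetical.

```python
# Ranking documents by TF-IDF cosine similarity to a query (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stemming and lemmatization normalize word forms",
    "tf-idf weights terms by how rare they are across the corpus",
    "neural networks accept one-hot encoded input vectors",
]
query = ["how does tf-idf weight rare terms"]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)           # document vectors
q = vectorizer.transform(query)              # query vector in the same space

scores = cosine_similarity(q, D).ravel()     # one relevance score per document
ranking = scores.argsort()[::-1]             # indices from most to least relevant
print(ranking, scores)
```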
