NLP Text Preprocessing
Text Analytics
A traditional text analytics framework consists of three consecutive phases: Text Preprocessing, Text Representation, and Knowledge Discovery, as shown in the figure below.
Text Representation
After text preprocessing has been completed, the individual word tokens must be
transformed into a vector representation suitable for input into text mining algorithms.
Knowledge Discovery
Once the text corpus has been transformed into numeric vectors, we can apply existing machine learning or data mining methods such as classification or clustering.
2. Tokenize: Break the text into discrete words called tokens, i.e., transform the text into a list of words (tokens).
3. Remove stopwords (“stopping”): Remove all stopwords, that is, words used to construct the syntax of a sentence but carrying little textual information (conjunctions, articles, and prepositions), such as a, about, an, are, as, at, be, by, for, from, how, will, with, and many others.
4. Stem: Remove prefixes and suffixes to normalize words; for example, run, running, and runs would all be stemmed to run, so words with variant forms are treated as the same feature. Many stemming algorithms exist (Porter, Snowball, and Lancaster), and the choice also depends on the language. Note that lemmatization can be used instead of stemming, depending on the text mining subtask and the corpus language.
5. Normalize spelling: Unify misspellings and other spelling variations into a single
token.
7. Normalize case: Convert the text to either all lower or all upper case (a sketch combining these steps is shown below).
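As a minimal sketch, these steps could be chained together with NLTK; the sample sentence and the resource downloads are assumptions made only for illustration.

```python
# A minimal sketch of the preprocessing steps above using NLTK.
# Assumes the required resources are available, e.g. via
# nltk.download('punkt') and nltk.download('stopwords').
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The president was running for office and runs a large campaign."

# 2. Tokenize: break the text into word tokens.
tokens = nltk.word_tokenize(text)

# 7. Normalize case: lowercase every token.
tokens = [t.lower() for t in tokens]

# 3. Remove stopwords and punctuation: drop syntax words such as "the", "was", "and".
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 4. Stem: reduce each token to its stem (running, runs -> run).
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(stems)  # e.g. ['presid', 'run', 'offic', 'run', 'larg', 'campaign']
```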
Difference between Stemming and Lemmatization
Both stemming and lemmatization are used to normalize a given word by removing affixes and reducing it to a base form. The major differences between them are as follows:
◼ Stemming:
1. Stemming usually operates on a single word without knowledge of the context.
2. In stemming, we do not consider POS (part-of-speech) tags.
3. Stemming is used to group words with a similar basic meaning together.
◼ Lemmatization:
1. Lemmatization usually considers the word together with its context in the sentence.
2. In lemmatization, we consider POS tags.
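A small illustrative comparison with NLTK's PorterStemmer and WordNetLemmatizer (the word list is invented for the example) shows the difference: the stemmer strips suffixes without any context, while the lemmatizer's output changes with the POS tag it is given.

```python
# Illustration of stemming vs. lemmatization with NLTK.
# Assumes the 'wordnet' resource has been downloaded, e.g. nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "studying", "better", "ran"]

for w in words:
    # Stemming: crude suffix stripping, no POS or context information.
    stem = stemmer.stem(w)
    # Lemmatization: dictionary lookup; the POS tag ('v' = verb, 'a' = adjective)
    # changes the result, e.g. "better" -> "good" only when treated as an adjective.
    lemma_v = lemmatizer.lemmatize(w, pos="v")
    lemma_a = lemmatizer.lemmatize(w, pos="a")
    print(f"{w:10s} stem={stem:8s} lemma(verb)={lemma_v:8s} lemma(adj)={lemma_a}")
```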
Preprocessing methods depend on the specific application. Many applications, such as Opinion Mining or Natural Language Processing (NLP), need to analyze the message from a syntactic point of view, which requires that the method retain the original sentence structure. Without this information, it is difficult to distinguish “Which university did the president graduate from?” from “Which president is a graduate of Harvard University?”, which have overlapping vocabularies. In such cases, we need to avoid removing the syntax-carrying words.
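As a quick illustration of this point, stripping the syntax words from the two example questions leaves almost identical bags of words; the small stopword list below is hand-picked just for this sketch.

```python
# Illustration only: after removing syntax words, the two questions become
# nearly indistinguishable bags of words.
stop_words = {"which", "did", "the", "from", "of", "is", "a"}  # hand-picked for this example

q1 = "Which university did the president graduate from?"
q2 = "Which president is a graduate of Harvard University?"

def bag(question):
    words = question.lower().replace("?", "").split()
    return {w for w in words if w not in stop_words}

print(bag(q1))  # {'university', 'president', 'graduate'} (order may vary)
print(bag(q2))  # {'harvard', 'university', 'president', 'graduate'}
```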
Text Representation: Bag of Words and Vector Space Models
The most popular structured representation of text is the vector-space model, which
represents every document (text) from the corpus as a vector whose length is equal to
the vocabulary of the corpus. This results in an extremely high-dimensional space;
typically, every distinct string of characters occurring in the collection of text documents
has a dimension. This includes dimensions for common English words and other strings
such as email addresses and URLs. For a collection of text documents of reasonable
size, the vectors can easily contain hundreds of thousands of elements. For those
readers who are familiar with data mining or machine learning, the vector-space model
can be viewed as a traditional feature vector where words and strings substitute
for more traditional numerical features. Therefore, it is not surprising that many text
mining solutions consist of applying data mining or machine learning algorithms to text
stored in a vector-space representation, provided these algorithms can be adapted or
extended to deal efficiently with the large dimensional space encountered in text
situations.
The vector-space model makes an implicit assumption (called the bag-of-words assumption) that the
order of the words in the document does not matter. This may seem like a big assumption, since text
must be read in a specific order to be understood. For many text mining tasks, such as document
classification or clustering, however, this assumption is usually not a problem. The collection of words
appearing in the document (in any order) is usually sufficient to differentiate between semantic
concepts. The main strength of text mining algorithms is their ability to use all of the words in the document: primary keywords and the remaining general text. Often, keywords alone do not differentiate a document; instead, the usage patterns of the secondary words provide the differentiating characteristics.
Though the bag-of-words assumption works well for many tasks, it is not a universal solution. For
some tasks, such as information extraction and natural language processing, the order of words is
critical for solving the task successfully. Prominent features in both entity extraction and natural
language processing include both preceding and following words and the decision (e.g., the part of
speech) for those words. Specialized algorithms and models for handling sequences such as finite
state machines or conditional random fields are used in these cases.
Another challenge for using the vector-space model is the presence of homographs: words that are spelled the same but have different meanings (for example, “bank” can refer to a financial institution or to the side of a river).
There are many available libraries and APIs (such as Scikit-Learn, Gensim, and NLTK) that make implementing these vector encodings easier.
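For example, a bag-of-words encoding could be sketched with Gensim roughly as follows; the two tokenized toy documents are invented for illustration.

```python
# Minimal bag-of-words sketch with Gensim: each document becomes a sparse
# vector of (token_id, count) pairs over the corpus vocabulary.
from gensim.corpora import Dictionary

# Two already-tokenized toy documents (invented for illustration).
docs = [
    ["cat", "sat", "on", "the", "mat", "the", "cat"],
    ["the", "dog", "chased", "the", "cat"],
]

dictionary = Dictionary(docs)                 # maps every distinct token to an integer id
vectors = [dictionary.doc2bow(doc) for doc in docs]

print(dictionary.token2id)                    # vocabulary: token -> dimension index
print(vectors[0])                             # sparse (token_id, count) pairs; "cat" and "the" appear twice
```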
Frequency vector
In this representation, each document is represented by one vector where a vector
element i represents the number of times (frequency) the ith word appears in the
document. This representation can either be a straight count (integer) encoding as
shown in the following figure or a normalized encoding where each word is weighted
by the total number of words in the document.
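A minimal sketch of such a count encoding with scikit-learn's CountVectorizer (on an invented three-document corpus) might look like this; the last lines show the normalized variant, where each count is divided by the document length.

```python
# Frequency (count) encoding sketch with scikit-learn's CountVectorizer;
# the three short documents are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)          # document-term matrix of raw counts

print(vectorizer.get_feature_names_out())     # corpus vocabulary (one dimension per word)
print(X.toarray())                            # integer counts per document

# Normalized variant: weight each count by the total number of words in the document.
counts = X.toarray().astype(float)
normalized = counts / counts.sum(axis=1, keepdims=True)
print(normalized.round(2))
```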
Frequency vectors can, however, be dominated by tokens that occur very often but carry little meaning. A simple alternative is one-hot encoding, a boolean vector encoding method that marks a particular vector index with a value of true (1) if the token exists in the document and false (0) if it does not. In other words, each element of a one-hot encoded vector reflects either the presence or absence of the token in the described text, as shown in the following figure.
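One way to sketch this presence/absence encoding is to reuse CountVectorizer with binary=True (again on a toy corpus invented for illustration):

```python
# One-hot (boolean) encoding sketch: CountVectorizer with binary=True records
# only presence (1) or absence (0) of a token, not its frequency.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

onehot = CountVectorizer(binary=True)
B = onehot.fit_transform(corpus)

print(onehot.get_feature_names_out())
print(B.toarray())   # every entry is 0 or 1, even for words that appear twice (e.g. "the")
```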
Term Frequency (TF) and Inverse Document Frequency (IDF)
The term frequency of term ti in document dj is defined as

    tfij = fij / maxk fkj

where fij is the number of occurrences of ti in dj and the maximum maxk fkj is computed over all terms that appear in document dj. If term ti does not appear in dj then tfij = 0.

The inverse document frequency of term ti is

    idfi = log(N / dfi)

where N is the total number of documents in the collection and dfi is the number of documents that contain ti. The intuition here is that if a term appears in a large number of documents in the collection, it is probably not important or not discriminative.
TF-IDF
The final TF-IDF term weight is given by:

    TF-IDF(ti, dj) = tfij × idfi

Postscript: TF-IDF weighting has many variants; here we only give the most basic one.
The assumption behind TF-IDF is that words with high term frequency should receive
high weight unless they also have high document frequency. The word “the” is one of
the most commonly occurring words in the English language. “The” often occurs
many times within a single document, but it also occurs in nearly every document.
These two competing effects cancel out to give “the” a low weight.
Computing TF-IDF: An Example
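As an illustrative sketch (the three-document corpus below is invented), the basic formulas above can be computed directly in a few lines; note how the common word “the” ends up with zero weight, matching the intuition described earlier.

```python
# Worked TF-IDF sketch following the basic formulas above
# (tf normalized by the most frequent term in the document, idf = log(N / df)).
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the sun is bright".split(),
]
N = len(docs)

counts = [Counter(doc) for doc in docs]
vocab = sorted({w for doc in docs for w in doc})

# Document frequency: in how many documents does each term appear?
df = {t: sum(1 for c in counts if t in c) for t in vocab}

def tf(t, j):
    # Term frequency normalized by the count of the most frequent term in document j.
    return counts[j][t] / max(counts[j].values())

def tfidf(t, j):
    # Basic weight: tf * idf (natural log here; the base only rescales the weights).
    return tf(t, j) * math.log(N / df[t])

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes,
# while rarer, more discriminative words such as "dog" keep a positive weight.
for term in ["the", "cat", "dog"]:
    print(term, [round(tfidf(term, j), 3) for j in range(N)])
```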
TF-IDF-based applications
Some applications that use TF-IDF:
◼ Keyword extraction and general text data analysis: TF-IDF makes it easy to identify the most informative keywords in a dataset.
◼ Text summarization: in statistical approaches to summarization, TF-IDF is one of the most important features for selecting the content of a document's summary.
◼ Search engines: variations of the TF-IDF weighting scheme are often used to score and rank a document's relevance for a given user query.
◼ Document classification applications.