The Traditional Approach To Natural Language Processing
The traditional or classical approach to solving NLP tasks is a sequential flow of several key steps, and it is a statistical approach. When we take a closer look at a traditional NLP learning model, we can see a set of distinct tasks taking place, such as preprocessing data to remove unwanted content, feature engineering to obtain good numerical representations of the textual data, learning with machine learning algorithms with the aid of training data, and predicting outputs for novel, unfamiliar data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.
Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often the features are hand-engineered and biased toward human understanding of a language. Feature engineering was of the utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.
Next, the learning algorithm learns to perform well at the given task using the obtained features and, optionally, external resources. For example, for a text summarization task, a thesaurus that contains synonyms of words can be a good external resource. Finally, prediction occurs. Prediction is straightforward: you feed in a new input and obtain the predicted label by forwarding the input through the learned model. The entire process of the traditional approach is depicted in Figure 1.2:
Figure 1.2: The general approach of classical NLP
Feature engineering is used to transform raw text data into an appealing numerical format so that a model can be trained on that data, for example, by converting text into a bag-of-words representation or using the n-gram representation, which we will discuss later. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.
["Bob", "went", "to", "the", "market", "buy", "some", "flowers", "bought", "give", "Mary"]
Next, we will create a feature vector of size V (vocabulary size) for each sentence
showing how many times each word in the vocabulary appears in the sentence. In this
example, the feature vectors for the sentences would respectively be as follows:
[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
A crucial limitation of the bag-of-words method is that it loses contextual information as
the order of words is no longer preserved.
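The bag-of-words computation above can be sketched in a few lines of plain Python. The sentences and the first-appearance ordering of the vocabulary are taken from the example; this is an illustrative sketch, not a production featurizer:

```python
# Build a vocabulary in order of first appearance, then count
# word occurrences per sentence (bag-of-words).
sentences = [
    "Bob went to the market to buy some flowers",
    "Bob bought the flowers to give to Mary",
]

vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def bag_of_words(sentence, vocabulary):
    """Return a count vector of size V for the given sentence."""
    tokens = sentence.split()
    return [tokens.count(word) for word in vocabulary]

vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vectors[0])  # [1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
print(vectors[1])  # [1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
```

Note that any permutation of a sentence's words produces exactly the same vector, which is precisely the loss of word-order information mentioned above.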
N-gram: This is another feature engineering technique that breaks down text into smaller components consisting of n letters (or words). For example, a 2-gram would break the text into two-letter (or two-word) entities. Consider the sentence Bob went to the market to buy some flowers. The letter-level 2-grams for this sentence would be as follows:
["Bo", "ob", "b ", " w", "we", "en", ..., "me", "e ", " f", "fl", "lo", "ow", "we", "er", "rs"]
And the word-level 2-grams would be as follows:
["Bob went", "went to", "to the", "the market", ..., "to buy", "buy some", "some flowers"]
The advantage of this representation (at the letter level) is that the vocabulary will be significantly smaller than if we were to use full words as features for a large corpus.
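Both n-gram variants are simple sliding windows over the text; a minimal sketch in Python, using the same example sentence:

```python
def letter_ngrams(text, n=2):
    """All overlapping n-letter substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n=2):
    """All overlapping n-word sequences of the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Bob went to the market to buy some flowers"
print(letter_ngrams(sentence)[:6])  # ['Bo', 'ob', 'b ', ' w', 'we', 'en']
print(word_ngrams(sentence)[:4])    # ['Bob went', 'went to', 'to the', 'the market']
```

Setting n=1 at the word level recovers the individual words of the bag-of-words vocabulary, which shows how the two representations relate.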
Now let's consider using the traditional approach to solve a language generation task: producing a textual description of a football game from the game's statistics. First, we need to structure our data so that it can be fed into a learning model. For example, we will have data tuples of the form (statistic, a phrase explaining the statistic), as follows:
Total goals = 4, "The game was tied with 2 goals for each team at the end of the first
half"
Team 1 = Manchester United, "The game was between Manchester United and
Barcelona"
Next, we can have a sentence planner that corrects any linguistic mistakes (for example, morphological or grammatical errors) that we might have in the phrases. For example, a sentence planner outputs the phrase I go house as I go home; it can use a database of rules that contains the correct ways of conveying meanings (for example, the need for a preposition between a verb and the word house).
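A toy version of such a rule-based sentence planner could be sketched as follows. The rule database here is a hypothetical stand-in: a real planner would apply morphological and grammatical rules, not literal string replacement:

```python
# A hypothetical rule database mapping incorrect fragments to
# corrected ones. Illustrative only; real systems encode
# morphological/grammatical rules rather than string rewrites.
correction_rules = {
    "go house": "go home",
}

def sentence_planner(phrase, rules):
    """Apply each correction rule to the phrase in turn."""
    for wrong, right in rules.items():
        phrase = phrase.replace(wrong, right)
    return phrase

print(sentence_planner("I go house", correction_rules))  # I go home
```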
Now we can generate a set of phrases for a given set of statistics using a Hidden Markov Model (HMM). Then, we need to aggregate these phrases in such a way that an essay made from the collection of phrases is human-readable and flows correctly. For example, consider the three phrases Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We would like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can order and structure a set of messages that need to be conveyed.
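A minimal discourse planner could order messages according to a predefined discourse structure. The stage labels and their ordering below are assumptions made for illustration:

```python
# Messages tagged with a hypothetical discourse stage; the planner
# sorts them into a natural narrative order (introduction, then
# first half, then second half).
messages = [
    ("second_half", "Player 10 of the Barcelona team scored a goal in the second half"),
    ("intro", "Barcelona played against Manchester United"),
    ("first_half", "Player 3 from Manchester United got a yellow card in the first half"),
]

stage_order = {"intro": 0, "first_half": 1, "second_half": 2}

def discourse_planner(messages, stage_order):
    """Order messages according to the discourse structure."""
    return [text for _, text in sorted(messages, key=lambda m: stage_order[m[0]])]

for line in discourse_planner(messages, stage_order):
    print(line)
```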
Now we can get a set of arbitrary test statistics and obtain an essay explaining the
statistics by following the preceding workflow, which is depicted in Figure 1.3:
Figure 1.3: A step from a classical approach to solving a language modelling task
Here, it is important to note that this is a very high-level explanation covering only the main general-purpose components that are most likely to be included in the traditional approach to NLP. The details can vary widely according to the particular application we are interested in solving. For example, certain tasks might need additional application-specific components (a rule base and an alignment model in machine translation). However, in this book, we do not dwell on such details, as the main objective here is to discuss more modern ways of natural language processing.
Let's list several key drawbacks of the traditional approach, as this will lay a good foundation for discussing the motivation for deep learning: