The traditional approach to Natural Language Processing

The traditional, or classical, approach to solving NLP tasks is a sequential flow of several key steps, and it is a statistical approach. A closer look at a traditional NLP learning model reveals a set of distinct tasks taking place, such as preprocessing the data by removing unwanted text, feature engineering to obtain good numerical representations of textual data, learning with machine learning algorithms with the aid of training data, and predicting outputs for novel, unseen data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.

Understanding the traditional approach

The traditional approach to solving NLP tasks involves a collection of distinct subtasks.

First, the text corpora need to be preprocessed, focusing on reducing the vocabulary and distractions. By distractions, I refer to the things that distract the algorithm (for example, punctuation marks and stop words) from capturing the vital linguistic information required for the task.

Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often the features are hand-engineered and biased toward the human understanding of a language. Feature engineering was of the utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.
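
As an aside, here is a minimal sketch of querying such an external resource; it uses WordNet through the NLTK library and assumes NLTK is installed and the WordNet corpus has been downloaded (neither of which the original text specifies):

# A minimal sketch of using WordNet (via NLTK) as an external resource.
# Assumes `pip install nltk` and nltk.download('wordnet') have been run.
from nltk.corpus import wordnet

def synonym_features(word):
    # Collect synonyms of `word` from WordNet as candidate extra features.
    synonyms = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
    return synonyms

print(synonym_features("flower"))  # e.g. {'flower', 'bloom', 'blossom', ...}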

Next, the learning algorithm learns to perform well at the given task using the obtained features and, optionally, the external resources. For example, for a text summarization task, a thesaurus that contains synonyms of words can be a good external resource. Finally, prediction occurs. Prediction is straightforward: you feed in a new input and obtain the predicted label by forwarding the input through the learned model. The entire process of the traditional approach is depicted in Figure 1.2:
Figure 1.2: The general approach of classical NLP
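
To make this workflow concrete, here is a minimal end-to-end sketch using scikit-learn (an assumption on my part; the text does not prescribe a library), with a toy sentiment dataset and logistic regression chosen purely for illustration:

# A minimal sketch of the classical pipeline: preprocess -> features ->
# learn -> predict. Assumes scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (illustrative only).
train_texts = ["I loved the movie", "What a great game",
               "Terrible, boring film", "I hated the ending"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Feature engineering: bag-of-words counts; lowercasing and stop word
# removal act as light preprocessing.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

# Learning: fit a simple classifier on the engineered features.
model = LogisticRegression()
model.fit(X_train, train_labels)

# Prediction: transform novel text with the same features and classify.
X_new = vectorizer.transform(["What a great movie"])
print(model.predict(X_new))  # e.g. [1]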

Example – generating football game summaries

To gain an in-depth understanding of the traditional approach to NLP, let's consider the task of automatically generating text from the statistics of a football game. As training data, we have several sets of game statistics (for example, score, penalties, and yellow cards) and the corresponding articles written for those games by a journalist. Let's also assume that for a given game, we have a mapping from each statistical parameter to the most relevant phrase of the summary for that parameter. Our task is, given a new game, to generate a natural-looking summary of the game. Of course, this can be as simple as finding the best-matching statistics for the new game in the training data and retrieving the corresponding summary. However, there are more sophisticated and elegant ways of generating text.

If we were to incorporate machine learning to generate natural language, a sequence of operations such as preprocessing the text, tokenization, feature engineering, learning, and prediction would likely be performed.

Preprocessing the text involves operations such as stemming (for example, converting listened to listen) and removing punctuation (for example, ! and ;) in order to reduce the vocabulary (that is, the features), thus reducing the memory requirement. It is important to understand that stemming is not a trivial operation. It might appear that stemming relies on a simple set of rules, such as removing ed from a verb (for example, the stemmed result of listened is listen); however, developing a good stemming algorithm requires more than a simple rule base, as stemming certain words can be tricky (for example, the stemmed result of argued is argue). In addition, the effort required for proper stemming can vary in complexity for other languages.
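
As a quick illustration of this point, the rule-based Porter stemmer (available in the NLTK library, assuming it is installed) handles listened correctly but reduces argued to the non-word argu, showing that simple rules alone are not enough:

# A minimal stemming sketch with NLTK's Porter stemmer.
# Assumes `pip install nltk`.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("listened"))  # listen -- the simple 'remove -ed' rule works
print(stemmer.stem("argued"))    # argu -- a non-word stem; tricky cases need more than rules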

Tokenization is another preprocessing step that might need to be performed. Tokenization is the process of dividing a corpus into small entities (for example, words). This might appear trivial for a language such as English, as words appear in isolation; however, this is not the case for languages such as Thai, Japanese, and Chinese, because words in these languages are not consistently delimited.
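
For a whitespace-delimited language such as English, a rough sketch of tokenization can be as simple as splitting on spaces; NLTK's word_tokenize (an assumed dependency) additionally separates punctuation:

# A minimal tokenization sketch for a whitespace-delimited language.
sentence = "Bob went to the market to buy some flowers!"
print(sentence.split())  # naive: punctuation stays attached, e.g. 'flowers!'

# NLTK's tokenizer also splits off punctuation (assumes `pip install nltk`
# and the 'punkt' tokenizer data downloaded via nltk.download('punkt')).
from nltk.tokenize import word_tokenize
print(word_tokenize(sentence))  # [..., 'flowers', '!']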

Feature engineering is used to transform raw text data into a suitable numerical format so that a model can be trained on that data, for example, by converting text into a bag-of-words representation or using the n-gram representation, which we will discuss later. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.

The following are some of the feature engineering techniques:

Bag-of-words: This is a feature engineering technique that creates feature representations based on word occurrence frequency. For example, let's consider the following sentences:

 Bob went to the market to buy some flowers
 Bob bought the flowers to give to Mary

The vocabulary for these two sentences would be:

["Bob", "went", "to", "the", "market", "buy", "some", "flowers", "bought", "give", "Mary"]

Next, we will create a feature vector of size V (vocabulary size) for each sentence
showing how many times each word in the vocabulary appears in the sentence. In this
example, the feature vectors for the sentences would respectively be as follows:

[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]

[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]

A crucial limitation of the bag-of-words method is that it loses contextual information, as the order of words is no longer preserved.
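
A minimal sketch of this computation in plain Python (the variable names are mine, not from the text):

# A minimal bag-of-words sketch: build a vocabulary, then count word
# occurrences per sentence.
sentences = [
    "Bob went to the market to buy some flowers",
    "Bob bought the flowers to give to Mary",
]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector of size V per sentence.
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in sentences]

print(vectors[0])  # [1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
print(vectors[1])  # [1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]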

N-gram: This is another feature engineering technique that breaks text down into smaller components consisting of n letters (or words). For example, a 2-gram would break the text into two-letter (or two-word) entities. As an illustration, consider this sentence:

Bob went to the market to buy some flowers

The letter-level 2-gram decomposition for this sentence is as follows:

["Bo", "ob", "b ", " w", "we", "en", ..., "me", "e "," f", "fl", "lo", "ow", "we", "er", "rs"]

The word-level 2-gram decomposition is this:

["Bob went", "went to", "to the", "the market", ..., "to buy", "buy some", "some flowers"]

The advantage of the letter-level representation is that the vocabulary will be significantly smaller than if we were to use words as features for large corpora.
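
A short sketch of both decompositions (the helper function is my own):

# A minimal n-gram sketch covering letter-level and word-level units.
def ngrams(items, n):
    # Slide a window of size n over a sequence (a string or a word list).
    return [items[i:i + n] for i in range(len(items) - n + 1)]

sentence = "Bob went to the market to buy some flowers"

# Letter-level 2-grams over the raw character string.
print(ngrams(sentence, 2))  # ['Bo', 'ob', 'b ', ' w', 'we', ...]

# Word-level 2-grams over the token list.
print([" ".join(g) for g in ngrams(sentence.split(), 2)])
# ['Bob went', 'went to', 'to the', 'the market', ...]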

Next, we need to structure our data so that we can feed it into a learning model. For example, we will have data tuples of the form (statistic, a phrase explaining the statistic), as follows:

Total goals = 4, "The game was tied with 2 goals for each team at the end of the first
half"

Team 1 = Manchester United, "The game was between Manchester United and
Barcelona"

Team 1 goals = 5, "Manchester United managed to get 5 goals"
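
In code, such training pairs might simply be kept as a list of tuples (a trivial sketch of the structure just described):

# Training pairs of (statistic, explanatory phrase).
training_data = [
    ("Total goals = 4", "The game was tied with 2 goals for each team at the end of the first half"),
    ("Team 1 = Manchester United", "The game was between Manchester United and Barcelona"),
    ("Team 1 goals = 5", "Manchester United managed to get 5 goals"),
]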

The learning process may comprise three submodules: a Hidden Markov Model (HMM), a sentence planner, and a discourse planner. In our example, an HMM might learn the morphological structure and grammatical properties of the language by analyzing the corpus of related phrases. More specifically, we will concatenate each phrase in our dataset to form a sequence, where the first element is the statistic followed by the phrase explaining it. Then, we will train the HMM by asking it to predict the next word, given the current sequence. Concretely, we will first input the statistic to the HMM and get its prediction; then, we will append that prediction to the current sequence and ask the HMM for another prediction, and so on. This enables the HMM to output meaningful phrases, given statistics.
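
As a rough sketch of this generation loop, the following uses a plain first-order Markov chain over words, a deliberate simplification of the HMM described above (the chain observes words directly rather than hidden states; all names are illustrative):

import random
from collections import defaultdict

# Reusing two of the (statistic, phrase) pairs structured earlier.
training_data = [
    ("Team 1 = Manchester United", "The game was between Manchester United and Barcelona"),
    ("Team 1 goals = 5", "Manchester United managed to get 5 goals"),
]

# Count next-word transitions from the concatenated sequences.
transitions = defaultdict(list)
for statistic, phrase in training_data:
    words = [statistic] + phrase.split()
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word].append(next_word)

def generate_phrase(statistic, max_words=20):
    # Repeatedly predict the next word and append it to the sequence.
    sequence = [statistic]
    while len(sequence) <= max_words and transitions[sequence[-1]]:
        sequence.append(random.choice(transitions[sequence[-1]]))
    return " ".join(sequence[1:])  # drop the statistic itself

print(generate_phrase("Team 1 goals = 5"))  # e.g. Manchester United managed to get 5 goals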

Next, we can have a sentence planner that corrects any linguistic mistakes (for example, morphological or grammatical ones) that we might have in the phrases. For example, a sentence planner might rewrite the phrase I go house as I go home; it can use a database of rules that contains the correct ways of conveying meanings (for example, the need for a preposition between a verb and the word house).
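
A toy sketch of such a rule base (the rules and names here are entirely hypothetical):

# A toy, hypothetical sentence-planner rule base: each rule maps an
# incorrect pattern to its corrected form.
correction_rules = {
    "go house": "go home",
}

def plan_sentence(phrase):
    # Apply simple pattern-replacement rules to fix linguistic mistakes.
    for wrong, right in correction_rules.items():
        phrase = phrase.replace(wrong, right)
    return phrase

print(plan_sentence("I go house"))  # I go home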

Now we can generate a set of phrases for a given set of statistics using an HMM. Then, we need to aggregate these phrases in such a way that the essay made from the collection of phrases is human-readable and flows correctly. For example, consider the three phrases Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We would like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can order and structure a set of messages that need to be conveyed.
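
One naive way to sketch this ordering is to rank each message with a hand-assigned priority; this is purely illustrative and not a real discourse-planning algorithm:

# A naive discourse-planner sketch: order messages by a hand-assigned
# priority (match context first, then first-half events, then second-half).
phrases = [
    ("Player 10 of the Barcelona team scored a goal in the second half", 3),
    ("Barcelona played against Manchester United", 1),
    ("Player 3 from Manchester United got a yellow card in the first half", 2),
]

ordered = [phrase for phrase, _ in sorted(phrases, key=lambda p: p[1])]
print(". ".join(ordered) + ".")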

Now we can get a set of arbitrary test statistics and obtain an essay explaining the
statistics by following the preceding workflow, which is depicted in Figure 1.3:
Figure 1.3: A classical approach to solving a language modelling task

Here, it is important to note that this is a very high-level explanation that covers only the main general-purpose components most likely to be included in the traditional approach to NLP. The details can vary greatly according to the particular application we are interested in solving. For example, certain tasks might need additional, application-specific components (such as a rule base and an alignment model in machine translation). However, in this book, we do not stress such details, as the main objective here is to discuss more modern ways of natural language processing.

Drawbacks of the traditional approach

Let's list several key drawbacks of the traditional approach as this would lay a good
foundation for discussing the motivation for deep learning:

 The preprocessing steps used in traditional NLP force us to trade off potentially useful information embedded in the text (for example, punctuation and tense information) in order to make learning feasible by reducing the vocabulary. Though preprocessing is still used in modern deep-learning-based solutions, it is not as crucial as it is for the traditional NLP workflow, due to the large representational capacity of deep networks.
 Feature engineering needs to be performed by hand. In order to design a reliable system, good features need to be devised. This process can be very tedious, as different feature spaces need to be explored extensively. Additionally, effectively exploring robust features requires domain expertise, which can be scarce for certain NLP tasks.
 Various external resources are needed for the traditional approach to perform well, and there are not many freely available ones. Such external resources often consist of manually created information stored in large databases. Creating one for a particular task can take several years, depending on the complexity of the task (for example, a machine translation rule base).
