The Traditional Approach To Natural Language Processing
The traditional or classical approach to solving NLP tasks is a sequential flow of several key steps, and it is a statistical approach. When we take a closer look at a traditional NLP learning model, we can see a set of distinct tasks taking place, such as preprocessing data to remove unwanted content, feature engineering to obtain good numerical representations of the textual data, learning with machine learning algorithms with the aid of training data, and predicting outputs for novel, unfamiliar data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.
Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often the features are hand-engineered and biased toward human understanding of a language. Feature engineering was of the utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.
Next, the learning algorithm learns to perform well at the given task using the obtained features and, optionally, external resources. For example, for a text summarization task, a thesaurus that contains synonyms of words can be a good external resource. Finally, prediction occurs. Prediction is straightforward: you feed in a new input and obtain the predicted label by forwarding the input through the learned model. The entire process of the traditional approach is depicted in Figure 1.2:
Figure 1.2: The general approach of classical NLP
Feature engineering is used to transform raw text data into an appealing numerical format so that a model can be trained on that data, for example, by converting text into a bag-of-words representation or using the n-gram representation, which we will discuss later. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.
["Bob", "went", "to", "the", "market", "buy", "some", "flowers", "bought", "give", "Mary"]
Next, we will create a feature vector of size V (vocabulary size) for each sentence
showing how many times each word in the vocabulary appears in the sentence. In this
example, the feature vectors for the sentences would respectively be as follows:
[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
A crucial limitation of the bag-of-words method is that it loses contextual information as
the order of words is no longer preserved.
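The bag-of-words computation above can be sketched in a few lines of plain Python. The sentences and the first-appearance ordering of the vocabulary are taken from the example; this is an illustrative sketch, not a production featurizer:

```python
# Build a vocabulary in order of first appearance, then count
# word occurrences per sentence (bag-of-words).
sentences = [
    "Bob went to the market to buy some flowers",
    "Bob bought the flowers to give to Mary",
]

vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def bag_of_words(sentence, vocabulary):
    """Return a count vector of size V for the given sentence."""
    tokens = sentence.split()
    return [tokens.count(word) for word in vocabulary]

vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vectors[0])  # [1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
print(vectors[1])  # [1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
```

Note that any permutation of a sentence's words produces exactly the same vector, which is precisely the loss of word-order information mentioned above.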
N-gram: This is another feature engineering technique that breaks down text into smaller components consisting of n letters (or words). For example, a 2-gram would break the text into two-letter (or two-word) entities. Consider the sentence Bob went to the market to buy some flowers. The letter-level 2-grams for this sentence would be as follows:
["Bo", "ob", "b ", " w", "we", "en", ..., "me", "e ", " f", "fl", "lo", "ow", "we", "er", "rs"]
And the word-level 2-grams would be as follows:
["Bob went", "went to", "to the", "the market", ..., "to buy", "buy some", "some flowers"]
The advantage of this representation (at the letter level) is that the vocabulary will be significantly smaller than if we were to use full words as features for a large corpus.
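Both n-gram variants are simple sliding windows over the text; a minimal sketch in Python, using the same example sentence:

```python
def letter_ngrams(text, n=2):
    """All overlapping n-letter substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n=2):
    """All overlapping n-word sequences of the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Bob went to the market to buy some flowers"
print(letter_ngrams(sentence)[:6])  # ['Bo', 'ob', 'b ', ' w', 'we', 'en']
print(word_ngrams(sentence)[:4])    # ['Bob went', 'went to', 'to the', 'the market']
```

Setting n=1 at the word level recovers the individual words of the bag-of-words vocabulary, which shows how the two representations relate.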
Now let's consider using the traditional approach to solve a language generation task: producing a textual description of a football game from the game's statistics. First, we need to structure our data so that it can be fed into a learning model. For example, we will have data tuples of the form (statistic, a phrase explaining the statistic), as follows:
Total goals = 4, "The game was tied with 2 goals for each team at the end of the first
half"
Team 1 = Manchester United, "The game was between Manchester United and
Barcelona"
Next, we can have a sentence planner that corrects any linguistic mistakes (for example, morphological or grammatical errors) that we might have in the phrases. For example, a sentence planner outputs the phrase I go house as I go home; it can use a database of rules that contains the correct ways of conveying meanings (for example, the need for a preposition between a verb and the word house).
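A toy version of such a rule-based sentence planner could be sketched as follows. The rule database here is a hypothetical stand-in: a real planner would apply morphological and grammatical rules, not literal string replacement:

```python
# A hypothetical rule database mapping incorrect fragments to
# corrected ones. Illustrative only; real systems encode
# morphological/grammatical rules rather than string rewrites.
correction_rules = {
    "go house": "go home",
}

def sentence_planner(phrase, rules):
    """Apply each correction rule to the phrase in turn."""
    for wrong, right in rules.items():
        phrase = phrase.replace(wrong, right)
    return phrase

print(sentence_planner("I go house", correction_rules))  # I go home
```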
Now we can generate a set of phrases for a given set of statistics using a Hidden Markov Model (HMM). Then, we need to aggregate these phrases in such a way that an essay made from the collection of phrases is human-readable and flows correctly. For example, consider the three phrases Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We would like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can order and structure a set of messages that need to be conveyed.
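A minimal discourse planner could order messages according to a predefined discourse structure. The stage labels and their ordering below are assumptions made for illustration:

```python
# Messages tagged with a hypothetical discourse stage; the planner
# sorts them into a natural narrative order (introduction, then
# first half, then second half).
messages = [
    ("second_half", "Player 10 of the Barcelona team scored a goal in the second half"),
    ("intro", "Barcelona played against Manchester United"),
    ("first_half", "Player 3 from Manchester United got a yellow card in the first half"),
]

stage_order = {"intro": 0, "first_half": 1, "second_half": 2}

def discourse_planner(messages, stage_order):
    """Order messages according to the discourse structure."""
    return [text for _, text in sorted(messages, key=lambda m: stage_order[m[0]])]

for line in discourse_planner(messages, stage_order):
    print(line)
```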
Now we can get a set of arbitrary test statistics and obtain an essay explaining the
statistics by following the preceding workflow, which is depicted in Figure 1.3:
Figure 1.3: A step from a classical approach to solving a language modelling task
Here, it is important to note that this is a very high-level explanation covering only the main general-purpose components that are most likely to be included in the traditional approach to NLP. The details can vary widely according to the particular application we are interested in solving. For example, certain tasks might need additional application-specific components (a rule base and an alignment model in machine translation). However, in this book, we do not dwell on such details, as the main objective here is to discuss more modern ways of natural language processing.
Let's list several key drawbacks of the traditional approach, as this will lay a good foundation for discussing the motivation for deep learning: