A Beginner's Guide To Natural Language Processing - IBM Developer
In 1954, IBM demonstrated the ability to translate Russian sentences into English
using machine translation on an IBM 701 mainframe. While simple by today’s
standards, the demonstration highlighted the enormous potential of machine
translation. In this article, we’ll examine natural language processing (NLP) and how
it can help us converse more naturally with computers.
NLP is one of the most important subfields of machine learning for a variety of
reasons. Natural language is the most natural interface between a user and a
machine. In the ideal case, this involves speech recognition and voice generation.
Even Alan Turing recognized this in his 1950 paper “Computing Machinery and
Intelligence,” in which he defined the “Turing test” as a way to assess a machine’s
ability to exhibit intelligent behavior through a natural language conversation.
NLP isn’t a singular entity but a spectrum of areas of research. Figure 1 illustrates a
voice assistant, which is a common product of NLP today. The NLP areas of study
are shown in the context of the fundamental blocks of the voice assistant
application.
Beyond voice assistants, a key driver of NLP is the massive amount of unstructured
text data that exists in the world, most of it generated for human consumption. If a
machine could process, organize, and understand this text, it would unlock a large
number of useful applications for future machine learning systems and put a vast
amount of knowledge to work. Wikipedia alone is an enormous source of such text.
History
NLP, much like AI, has a history of ups and downs. IBM’s early work in 1954 on the
Georgetown demonstration emphasized the huge benefits of machine translation
(translating more than 60 Russian sentences into English). This early approach used
six grammar rules and a dictionary of 250 words, and it prompted large investments
in machine translation, but rules-based approaches could not scale into production
systems.
MIT’s SHRDLU (named based upon frequency order of letters in English) was
developed in the late 1960s in LISP and used natural language to allow a user to
manipulate and query the state of a blocks world. The blocks world, a virtual world
filled with different blocks, could be manipulated by a user with commands like “Pick
up a big red block.” Objects could be stacked and queried to understand the state of
the world (“Is there anything to the right of the red pyramid?”). At the time, this
demonstration was viewed as highly successful, but it could not scale to more
complex and ambiguous environments.
During the 1970s and early 1980s, many chatbot-style applications were developed
that could converse about restricted topics. These were precursors to what is now
called conversational AI, which is widely and successfully used in many domains.
In the late 1980s, NLP research moved from rules-based approaches to statistical
models. With the introduction of the Internet, NLP became even more important as a
flood of textual information became machine accessible.
Using Python’s Natural Language Toolkit (NLTK), you can see the product of parts-
of-speech tagging. In the example below, the final set of tuples represents the
tokenized words along with their tags (using the UPenn tagset). This tagset consists
of 36 tags, such as VBG (verb, gerund, or present participle), NN (singular noun),
PRP (personal pronoun), and so on.
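The following is a minimal sketch of such a tagging session; the sentence and its
output are illustrative rather than the article’s original listing, and the resource
names passed to nltk.download may differ slightly across NLTK versions.

>>> import nltk
>>> nltk.download('punkt')                         # tokenizer models (one-time download)
>>> nltk.download('averaged_perceptron_tagger')    # POS tagger model (one-time download)
>>> tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
>>> tags = nltk.pos_tag(tokens)                    # uses the UPenn tagset by default
>>> tags
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
 ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]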
Tagging words might not seem difficult, but because words can mean different
things depending upon where they are used, the process can be complicated. Parts-
of-speech tagging serves as a prerequisite for a variety of other NLP tasks.
The HMM is useful in that a human doesn’t need to construct the underlying graph
by hand; a machine can construct it from a corpus of valid text. Additionally, it can be
constructed from tuples of words, either as bigrams (the probability that one word
follows another) or as n-grams (for example, with n=3, the probability that a word
follows the two words before it in a sequence). HMMs have been applied not just to
NLP but to a variety of other fields (such as protein or DNA sequencing).
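To make the bigram idea concrete, here is a small sketch (the toy corpus is invented
for illustration) that estimates transition probabilities, the probability of one word
following another, from raw text. A full HMM for tagging would additionally need
emission probabilities linking hidden tags to observed words, which are omitted here.

from collections import Counter, defaultdict

# Toy corpus (invented for illustration); in practice this would be a large body of valid text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams (adjacent word pairs) and the words that start them.
bigram_counts = Counter(zip(corpus, corpus[1:]))
first_word_counts = Counter(corpus[:-1])

# Estimate P(next word | current word) from the counts.
transitions = defaultdict(dict)
for (w1, w2), count in bigram_counts.items():
    transitions[w1][w2] = count / first_word_counts[w1]

print(transitions["the"])  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}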
Modern approaches
Modern approaches to NLP primarily focus on neural network architectures. As
neural network architectures rely on numerical processing, an encoding is required
to process words. Two common methods are one-hot encodings and word vectors.
Word encodings
A one-hot encoding translates words into unique vectors that can then be
numerically processed by a neural network. Consider the words from our last
bigram example. We create a one-hot vector whose dimension equals the number of
words to represent, and we assign a single bit in that vector to each word. This
creates a unique mapping that can be used as input to a neural network (each bit in
the vector as input to a neuron) (see Figure 2). This encoding has an advantage over
simply encoding the words as numbers (label encoding) because networks train
more effectively with one-hot vectors, which do not imply any ordering or relative
magnitude among the words.
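A minimal sketch of one-hot encoding follows; the five-word vocabulary is an
assumption for illustration, not the article’s bigram example.

import numpy as np

# Hypothetical vocabulary; assign each word a fixed index.
vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's position.
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]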
Traditional recurrent neural networks (RNNs) are trained through a variation of
back-propagation called back-propagation through time (BPTT). A popular variation
of RNNs is long short-term memory units (LSTMs), which have a unique architecture
and the ability to forget information.
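As a rough sketch (using Keras, with illustrative layer sizes that are not from the
article), an LSTM for a simple sequence prediction task might be set up as follows;
the framework performs BPTT automatically when the model is trained.

import tensorflow as tf

# Illustrative sizes, not taken from the article.
vocab_size, embed_dim = 10_000, 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # word indices -> dense vectors
    tf.keras.layers.LSTM(128),                           # recurrent layer with gated, forgettable state
    tf.keras.layers.Dense(1, activation="sigmoid"),      # e.g. a single binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Calling model.fit(sequences, labels) trains the LSTM with back-propagation through time.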
Reinforcement learning has also been applied as the training algorithm for RNNs
that implement text-based summarization.
Deep learning
Deep learning/deep neural networks have been applied successfully to a variety of
problems. You’ll find deep learning at the heart of Q&A systems, document
summarization, image caption generation, text classification and modeling, and
many others. Note that these cases represent natural language understanding and
natural language generation.
Deep learning refers to neural networks with many layers (the deep part) that take
features as input and extract higher-level features from this data. Deep learning
networks are able to learn a hierarchy of representations and different levels of
abstraction of their input. Deep learning networks can use supervised learning or
unsupervised learning and can be formed as hybrids with other approaches (such as
incorporating a recurrent neural network into a deep learning network).
The most common approach to deep learning networks is the convolutional neural
network (CNN), which is predominantly used in image-processing applications (such
as classifying the contents of an image). Figure 5 illustrates a simple CNN for
sentiment analysis. It consists of an input layer of word encodings (from the
tokenized input), which then feeds the convolutional layers. The convolutional layers
segment the input into many “windows” to produce feature maps. These feature
maps are pooled with a max operation, which reduces the dimensionality of the
output and provides the final representation of the input. This representation is fed
into a final neural network layer that produces the classification (such as positive,
neutral, or negative).
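A rough sketch of that pipeline in Keras follows; the layer sizes and the three
sentiment classes are illustrative assumptions rather than values from the article or
its figure.

import tensorflow as tf

# Illustrative hyperparameters; not taken from the article.
vocab_size, embed_dim, num_classes = 10_000, 128, 3

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),                      # word encodings
    tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu"),  # windows over the sequence -> feature maps
    tf.keras.layers.GlobalMaxPooling1D(),                                  # max pooling keeps each map's strongest response
    tf.keras.layers.Dense(num_classes, activation="softmax"),              # positive / neutral / negative
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])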
While CNNs have proven to be efficient in image and language domains, other types
of networks can also be used. As noted earlier, the long short-term memory (LSTM)
is a variation of the RNN. LSTM cells are more complex than typical neurons, since
they include state and a number of internal gates that can be used to accept input,
output data, or forget internal state information. LSTMs are commonly used in
natural language applications. One of the most interesting uses of LSTMs was in
concert with a CNN, where the CNN provided the ability to process an image and the
LSTM was trained to generate a textual sentence describing the contents of the
input image.
Going further
The importance of NLP is demonstrated by the ever-growing list of applications that
use it. NLP provides the most natural interface to computers and to the wealth of
unstructured data available online. In 2011, IBM demonstrated Watson™, which
competed with two of Jeopardy’s greatest champions and defeated them using a
natural language interface. Watson also used the 2011 version of Wikipedia as its
knowledge source – an important milestone on the path to language processing and
understanding and an indicator of what’s to come.
M. Tim Jones
https://developer.ibm.com/technologies/artificial-intelligence/articles/a-beginners-guide-to-natural-language-processing/