

A beginner's guide to natural language processing
Apr. 9th, 2019

In 1954, IBM demonstrated the ability to translate Russian sentences into English using machine translation on an IBM 701 mainframe. While simple by today’s standards, the demonstration highlighted the enormous potential of machine translation. In this article, we’ll examine natural language processing (NLP) and how it can help us converse more naturally with computers.

NLP is one of the most important subfields of machine learning for a variety of reasons. Natural language is the most natural interface between a user and a machine. In the ideal case, this involves speech recognition and voice generation. Even Alan Turing recognized this in his 1950 article “Computing Machinery and Intelligence,” in which he defined the “Turing test” as a way to assess a machine’s ability to exhibit intelligent behavior through a natural language conversation.

NLP isn’t a singular entity but a spectrum of areas of research. Figure 1 illustrates a
voice assistant, which is a common product of NLP today. The NLP areas of study
are shown in the context of the fundamental blocks of the voice assistant
application.

Beyond voice assistants, a key driver for NLP is the massive amount of unstructured text data that exists in the world. If machines could process, organize, and understand this text (which was generated primarily for human consumption), we could unlock a large number of useful machine learning applications and a vast amount of knowledge that could be put to work. Wikipedia, as one example, includes a large amount of knowledge that is linked in many ways to illustrate the relationships among topics. Wikipedia itself would be an invaluable source of unstructured data to which NLP could be applied.

Let’s now explore the history and methods for NLP.

History
NLP, much like AI, has a history of ups and downs. IBM’s early work in 1954 for the Georgetown demonstration emphasized the huge benefits of machine translation (translating over 60 Russian sentences into English). This early approach used six grammar rules and a dictionary of 250 words, and it resulted in large investments in machine translation, but rules-based approaches could not scale to production systems.

MIT’s SHRDLU (named after the approximate frequency order of letters in English) was developed in the late 1960s in LISP and used natural language to allow a user to manipulate and query the state of a blocks world. The blocks world, a virtual world filled with different blocks, could be manipulated by a user with commands like “Pick up a big red block.” Objects could be stacked and queried to understand the state of the world (“Is there anything to the right of the red pyramid?”). At the time, this demonstration was viewed as highly successful, but it could not scale to more complex and ambiguous environments.

During the 1970s and early 1980s, many chatbot-style applications were developed that could converse about restricted topics. These were precursors to what is now called conversational AI, widely and successfully used in many domains. Other applications, such as Lehnert’s Plot Units, implemented narrative summarization. This permitted summarization of a simple story with “plot units,” such as motivation, success, mixed-blessing, and other narrative building blocks.

In the late 1980s, NLP systems research moved from rules-based approaches to statistical models. With the rise of the Internet and the flood of textual information that became machine accessible, NLP grew even more important.

Early work in NLP


In the 1960s, work began on applying meaning to sequences of words. In a process called tagging, sentences could be broken down into their parts of speech to understand the role of each word within the sentence. These taggers relied on human-constructed, rules-based algorithms to “tag” words with their grammatical context in a sentence (for example, noun, verb, or adjective). But there was considerable complexity in this tagging, since in English there can be up to 150 different part-of-speech tags.

Using Python’s Natural Language Toolkit (NLTK), you can see the product of parts-of-speech tagging. In this example, the final set of tuples represents the tokenized words along with their tags (using the Penn Treebank tagset). This tagset consists of 36 tags, such as VBG (verb, gerund, or present participle), NN (singular noun), PRP (personal pronoun), and so on.

>>> quote = "Knowing yourself is the beginning of all wisdom."

>>> tokens = nltk.word_tokenize( quote )

>>> tags = nltk.pos_tag( tokens )

>>> tags

[('Knowing', 'VBG'), ('yourself', 'PRP'), ('is', 'VBZ'),

('the', 'DT'), ('beginning', 'NN'), ('of', 'IN'), ('all', 'DT'),

('wisdom', 'NN'), ('.', '.')]

>>>


Tagging words might not seem difficult, but because words can mean different things depending upon where they are used, the process can be complicated. Parts-of-speech tagging is a prerequisite for a variety of other NLP tasks.

Strict rules-based approaches to tagging have given way to statistical methods where ambiguity exists. Given a body of text (or corpus), one could calculate the probability of a word following another word. In some cases, the probabilities are quite high, while in others they are zero. The massive graph of words and their transition probabilities is the product of training a machine to figure out which words are more likely to follow others, and it can be used in a variety of ways. As an example, in a speech recognition application, this word graph could be used to identify a word that was garbled by noise (based upon the probabilities of the word sequence that preceded it). It could also be used in an auto-correct application (to recommend a word for one that was misspelled). This problem is commonly solved using a Hidden Markov Model (HMM).

The HMM is useful in that a human doesn’t need to construct this graph; a machine can construct it from a corpus of valid text. Additionally, it can be constructed from pairs of words (the probability that one word follows another, called a bigram) or from n-grams (for example, with n=3, the probability that a word follows two other words in sequence). HMMs have been applied not just to NLP but to a variety of other fields (such as protein and DNA sequencing).

The following example illustrates the construction of bigrams from a simple sentence in NLTK:

>>> sentence = "the man we saw saw a saw"

>>> tokens = nltk.word_tokenize( sentence )

>>> list(nltk.bigrams( tokens ) )

[('the', 'man'), ('man', 'we'), ('we', 'saw'), ('saw', 'saw'),

https://developer.ibm.com/technologies/artificial-intelligence/articles/a-beginners-guide-to-natural-language-processing/ 4/9
3/19/2020 Mercury Reader

('saw', 'a'), ('a', 'saw')]

>>>
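From these bigrams, you can estimate the transition probabilities described above. The following is a minimal sketch (not from the original demonstration) that continues the same NLTK session, using NLTK's ConditionalFreqDist and ConditionalProbDist to compute the probability of one word following another:

>>> # Count how often each word follows each other word, then normalize to probabilities
>>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))
>>> cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)
>>> cpd['saw'].prob('a')      # probability that 'a' follows 'saw'
0.5
>>> cpd['saw'].prob('saw')    # 'saw saw' also occurs once in the sentence
0.5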

Let’s explore some of the modern approaches to NLP tasks.

Modern approaches
Modern approaches to NLP primarily focus on neural network architectures. As
neural network architectures rely on numerical processing, an encoding is required
to process words. Two common methods are one-hot encodings and word vectors.

Word encodings
A one-hot encoding translates words into unique vectors that can then be numerically processed by a neural network. Consider the words from our last bigram example. We create a one-hot vector whose dimension equals the number of words to represent, and we assign a single bit in that vector to each word. This creates a unique mapping that can be used as input to a neural network (each bit in the vector as input to a neuron; see Figure 2). This encoding is advantageous over simply encoding the words as numbers (label encoding) because networks can train more efficiently with one-hot vectors.
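As a minimal sketch (not part of the original article), the following Python snippet one-hot encodes the small vocabulary from the earlier bigram example using NumPy; the vocabulary contents and ordering are illustrative assumptions:

import numpy as np

# Build a small vocabulary from the earlier example sentence (ordering is arbitrary)
vocab = sorted(set("the man we saw saw a saw".split()))   # ['a', 'man', 'saw', 'the', 'we']

# One row of the identity matrix per word: exactly one bit set per vector
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}

print(one_hot["saw"])   # [0 0 1 0 0]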

Another encoding is word vectors, which represent words as high-dimensional vectors whose elements are real values. But rather than assigning each element to a word (as in one-hot encoding), each element represents some aspect of the word (such as singular vs. plural or noun vs. verb), and the vectors are commonly 100 to 1,000 elements wide (their dimensionality). What makes this encoding interesting is that words are now numerically related, and the encoding opens up applying mathematical operations to word vectors (such as adding, subtracting, or negating them).
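The following toy sketch illustrates that arithmetic. The 4-dimensional vectors are invented purely for illustration (real word vectors are learned from a corpus and are far wider), so only the idea, not the numbers, should be taken literally:

import numpy as np

# Invented "word vectors"; real ones come from training on a corpus
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy: king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
closest = max(vectors, key=lambda w: cosine(vectors[w], target))
print(closest)   # queen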
https://developer.ibm.com/technologies/artificial-intelligence/articles/a-beginners-guide-to-natural-language-processing/ 5/9
3/19/2020 Mercury Reader

Recurrent neural networks


Developed in the 1980s, recurrent neural networks (RNNs) have found a unique
place in NLP. As the name implies, RNNs — as compared to typical feed-forward
neural networks — operate in the time domain. RNNs unfold in time and operate in
stages where prior outputs feed subsequent stage inputs (see Figure 3 for an
unrolled network example). This type of architecture applies well to NLP since the
network considers not just the words (or their encodings) but the context in which
the words appear (what follows, what came before). In this contrived network
example, input neurons are fed with the word encodings, and the outputs feed
forward through the network to the output nodes (with the goal of an output word
encoding for language translation). In practice, each word encoding is fed one at a
time and propagated through. On the next time step, the next word encoding is fed
(with output occurring only after the last word is fed).

Traditional RNNs are trained through a variation of back-propagation called back-propagation through time (BPTT). A popular variation of RNNs is the long short-term memory (LSTM) unit, which has a unique architecture and the ability to forget information.
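To make the unrolling concrete, here is a minimal NumPy sketch of a single recurrent layer's forward pass. The vocabulary, dimensions, and random weights are illustrative assumptions rather than a trained model:

import numpy as np

# A single recurrent layer unrolled over a word sequence
vocab = ["the", "man", "we", "saw", "a"]
embedding_dim, hidden_dim = 8, 16

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))   # one encoding per word
W_xh = 0.1 * rng.normal(size=(embedding_dim, hidden_dim))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))      # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_dim)

def rnn_forward(token_ids):
    """Feed one word encoding per time step; the hidden state carries the context."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        h = np.tanh(embeddings[t] @ W_xh + h @ W_hh + b_h)   # prior output feeds the next stage
    return h                                                 # final state summarizes the sequence

state = rnn_forward([vocab.index(w) for w in ["the", "man", "we", "saw"]])
print(state.shape)   # (16,)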

Reinforcement learning
Reinforcement learning focuses on selecting actions in an environment to maximize some cumulative reward (a reward that is not immediately understood but is learned over many actions). Actions are selected based on a policy, which determines whether a given action should explore new states (unknown territory where learning can take place) or exploit known states (relying on past experience).
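A common way to balance exploration and exploitation is an epsilon-greedy policy. The following sketch is purely illustrative (the action names and values are invented), showing how such a policy might choose between exploring and exploiting:

import random

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy policy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(q_values))     # explore: try something new
    return max(q_values, key=q_values.get)       # exploit: rely on past experience

# Hypothetical action values for a translation agent (invented for illustration)
q = {"commit-translation": 0.7, "wait-for-more-words": 0.4}
print(choose_action(q))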


In the context of NLP and machine translation, observations are the sequences of words that are presented. The state represents a partial translation, and the action is the determination of whether a translation can be provided or whether more information is needed (more observations, or words). As further observations are provided, the state may indicate that sufficient information is available and a translation can be presented. The key to this approach is that the translation is done incrementally, with reinforcement learning identifying when to commit to a translation and when to wait for more information (useful in languages where the main verb occurs at the end of the sentence).

Reinforcement learning has also been applied as the training algorithm for RNNs
that implement text-based summarization.

Deep learning
Deep learning and deep neural networks have been applied successfully to a variety of problems. You’ll find deep learning at the heart of Q&A systems, document summarization, image caption generation, text classification and modeling, and many other applications. Note that these cases represent both natural language understanding and natural language generation.

Deep learning refers to neural networks with many layers (the “deep” part) that take features as input and extract higher-level features from this data. Deep learning networks are able to learn a hierarchy of representations and different levels of abstraction of their input. Deep learning networks can use supervised learning or unsupervised learning and can be formed as hybrids of other approaches (such as incorporating a recurrent neural network with a deep learning network).


One of the most common deep learning architectures is the convolutional neural network (CNN), which is predominantly used in image-processing applications (such as classifying the contents of an image). Figure 5 illustrates a simple CNN for sentiment analysis. It consists of an input layer of word encodings (from the tokenized input), which then feeds the convolutional layers. The convolutional layers segment the input into many “windows” to produce feature maps. These feature maps are pooled with a max operation, which reduces the dimensionality of the output and provides the final representation of the input. This is fed into a final neural network that produces the classification (such as positive, neutral, or negative).
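The following Keras sketch approximates the kind of architecture described above for sentiment analysis. It is not the article's code; the vocabulary size, sequence length, and layer sizes are assumptions for illustration:

import tensorflow as tf

# Word encodings -> convolution -> max pooling -> classifier (sizes are assumptions)
vocab_size, seq_len = 10000, 100

inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")       # tokenized, integer-encoded input
x = tf.keras.layers.Embedding(vocab_size, 64)(inputs)          # word encodings
x = tf.keras.layers.Conv1D(128, 5, activation="relu")(x)       # "windows" over the input produce feature maps
x = tf.keras.layers.GlobalMaxPooling1D()(x)                    # max pooling reduces dimensionality
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)    # positive / neutral / negative
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()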

While CNNs have proven to be efficient in image and language domains, other types of networks can also be used. The long short-term memory (LSTM) network is another kind of RNN. LSTM cells are more complex than typical neurons, since they include state and a number of internal gates that can be used to accept input, output data, or forget internal state information. LSTMs are commonly used in natural language applications. One of the most interesting uses of LSTMs was in concert with a CNN, where the CNN provided the ability to process an image and the LSTM was trained to generate a textual description of the contents of the input image.
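As a companion to the CNN sketch above (again illustrative, with assumed sizes rather than the article's configuration), the following swaps the convolutional layers for a single LSTM layer, whose gated cells carry state across the word sequence:

import tensorflow as tf

# An LSTM-based text classifier; vocabulary size, sequence length, and layer sizes are assumptions
inputs = tf.keras.Input(shape=(100,), dtype="int32")
x = tf.keras.layers.Embedding(10000, 64)(inputs)     # word encodings
x = tf.keras.layers.LSTM(128)(x)                     # gated cells carry (and can forget) state across the sequence
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])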

Going further
The importance of NLP is demonstrated by the ever-growing list of applications that
use it. NLP provides the most natural interface to computers and to the wealth of
unstructured data available online. In 2011, IBM demonstrated Watson™, which competed with two of Jeopardy’s greatest champions and defeated them using a natural language interface. Watson also used the 2011 version of Wikipedia as its knowledge source, an important milestone on the path to language processing and understanding and an indicator of what’s to come.

M. Tim Jones

https://developer.ibm.com/technologies/artificial-intelligence/articles/a-beginners-guide-to-natural-language-processing/

