Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables
computers to comprehend, generate, and manipulate human language. NLP makes it possible to interrogate data using natural language text or voice; this is also called "language in." Most consumers have probably interacted with NLP without
realizing it. For instance, NLP is the core technology behind virtual assistants, such as the
Oracle Digital Assistant (ODA), Siri, Cortana, or Alexa. When we ask questions of these
virtual assistants, NLP is what enables them to not only understand the user’s request, but to
also respond in natural language. NLP applies both to written text and speech, and can be
applied to all human languages. Other examples of tools powered by NLP include web
search, email spam filtering, automatic translation of text or speech, document
summarization, sentiment analysis, and grammar/spell checking. For example, some email
programs can automatically suggest an appropriate reply to a message based on its
content—these programs use NLP to read, analyze, and respond to your message.
There are several other terms that are roughly synonymous with NLP. Natural language
understanding (NLU) and natural language generation (NLG) refer to using computers to
understand and produce human language, respectively. NLG can provide a verbal description of what has happened; this is also called "language out," summarizing meaningful information into text using a concept known as the "grammar of graphics." In practice, NLU is often used to mean NLP: the understanding by computers of the structure and meaning of human language, which allows developers and users to interact with computers using natural sentences. Computational linguistics (CL) is the scientific
field that studies computational aspects of human language, while NLP is the engineering
discipline concerned with building computational artifacts that understand, generate, or
manipulate human language.
Research on NLP began shortly after the invention of digital computers in the 1950s, and
NLP draws on both linguistics and AI. However, the major breakthroughs of the past few
years have been powered by machine learning, which is a branch of AI that develops systems
that learn and generalize from data. Deep learning is a kind of machine learning that can learn
very complex patterns from large datasets, which means that it is ideally suited to learning the
complexities of natural language from datasets sourced from the web.
Healthcare: As healthcare systems all over the world move to electronic medical records, they are
encountering large amounts of unstructured data. NLP can be used to analyze and gain new
insights into health records.
Legal: To prepare for a case, lawyers must often spend hours examining large collections of
documents and searching for material relevant to a specific case. NLP technology can automate
the process of legal discovery, cutting down on both time and human error by sifting through large
volumes of documents.
Finance: The financial world moves extremely fast, and any competitive advantage is important.
In the financial field, traders use NLP technology to automatically mine information from
corporate documents and news releases to extract information relevant to their portfolios and
trading decisions.
Customer service: Many large companies are using virtual assistants or chatbots to help answer
basic customer inquiries and information requests (such as FAQs), passing on complex questions
to humans when necessary.
Insurance: Large insurance companies are using NLP to sift through documents and reports
related to claims, in an effort to streamline the way business gets done.
Another kind of model is used to recognize and classify entities in documents. For each word
in a document, the model predicts whether that word is part of an entity mention, and if so,
what kind of entity is involved. For example, in “XYZ Corp shares traded for $28 yesterday”,
“XYZ Corp” is a company entity, “$28” is a currency amount, and “yesterday” is a date. The
training data for entity recognition is a collection of texts, where each word is labeled with
the kinds of entities the word refers to. This kind of model, which produces a label for each
word in the input, is called a sequence labeling model.
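As a concrete illustration, here is a minimal sketch of entity recognition using the spaCy library (described later in this document). It assumes spaCy and its small English model en_core_web_sm are installed; the labels shown in the comments are the ones that model typically assigns.

import spacy

# Load a pretrained English pipeline (assumes it was installed with
# `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")
doc = nlp("XYZ Corp shares traded for $28 yesterday")

# The entity recognizer groups tokens into labeled mentions,
# e.g. ORG (company), MONEY (currency amount), DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Underneath, this is a sequence labeling model: every token gets an
# IOB tag marking whether it begins, continues, or lies outside a mention.
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)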
Sequence-to-sequence models are a very recent addition to the family of models used in NLP. A sequence-to-sequence (or seq2seq) model takes an entire sentence or document as
input (as in a document classifier) but it produces a sentence or some other sequence (for
example, a computer program) as output. (A document classifier only produces a single
symbol as output). Example applications of seq2seq models include machine translation,
which, for example, takes an English sentence as input and returns its French translation as
output; document summarization (where the output is a summary of the input); and semantic
parsing (where the input is a query or request in English, and the output is a computer
program implementing that request).
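As an illustrative sketch, the lines below run a pretrained English-to-French translation model through the Hugging Face transformers library (described later). The choice of task string and the fact that the library downloads a default model on first use are assumptions about the environment.

# A minimal seq2seq example: English-to-French machine translation
# with a pretrained model from the transformers library.
from transformers import pipeline

translator = pipeline("translation_en_to_fr")  # the library picks a default T5 model
result = translator("NLP applies both to written text and speech.")
print(result[0]["translation_text"])  # the French translation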
Deep learning, pretrained models, and transfer learning: Deep learning is the most
widely used kind of machine learning in NLP. In the 1980s, researchers developed neural
networks, in which a large number of primitive machine learning models are combined into a
single network: by analogy with brains, the simple machine learning models are sometimes
called “neurons.” These neurons are arranged in layers, and a deep neural network is one with
many layers. Deep learning is machine learning using deep neural network models.
Because of their complexity, generally it takes a lot of data to train a deep neural network,
and processing it takes a lot of compute power and time. Modern deep neural network NLP
models are trained on a diverse array of sources, such as all of Wikipedia and data scraped
from the web. The training data might be on the order of 10 GB or more in size, and it might
take a week or more on a high-performance cluster to train the deep neural network.
(Researchers find that training even deeper models on even larger datasets yields even higher performance, so there is currently a race to train bigger and bigger models on larger and larger datasets.)
The voracious data and compute requirements of deep neural networks would seem to
severely limit their usefulness. However, transfer learning enables a trained deep neural
network to be further trained to achieve a new task with much less training data and compute
effort. The simplest kind of transfer learning is called fine-tuning: it consists of first
training the model on a large generic dataset (for example, Wikipedia) and then further
training (“fine-tuning”) the model on a much smaller task-specific dataset that is labeled with
the actual target task. Perhaps surprisingly, the fine-tuning datasets can be extremely small,
maybe containing only hundreds or even tens of training examples, and fine-tuning requires only minutes on a single CPU. Transfer learning makes it easy to deploy deep
learning models throughout the enterprise.
There is now an entire ecosystem of providers delivering pretrained deep learning models
that are trained on different combinations of languages, datasets, and pretraining tasks. These
pretrained models can be downloaded and fine-tuned for a wide variety of different target
tasks.
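To make the fine-tuning idea concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The model name (distilbert-base-uncased), the three-example toy dataset, and the hyperparameters are illustrative assumptions, not a recommended recipe.

# A minimal fine-tuning sketch: a generically pretrained model is
# further trained on a tiny task-specific labeled dataset.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Toy task-specific dataset (labels: 1 = positive, 0 = negative).
data = Dataset.from_dict({
    "text": ["great product", "terrible service", "works as expected"],
    "label": [1, 0, 1],
})

model_name = "distilbert-base-uncased"  # pretrained on large generic text corpora
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=data).train()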
Stop word removal: A “stop word” is a token that is ignored in later processing; stop words are typically short, frequent words such as “a,” “the,” or “an.” Bag-of-words models and search
engines often ignore stop words in order to reduce processing time and storage within the
database. Deep neural networks typically do take word order into account (that is, they are
not bag-of-words models) and do not do stop word removal because stop words can convey
subtle distinctions in meaning (for example, “the package was lost” and “a package is lost”
don’t mean the same thing, even though they are the same after stop word removal).
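A minimal sketch of stop word removal, using a small hand-picked stop word list, shows how the two example sentences above collapse to the same tokens once stop words are dropped:

# Hand-picked stop word list for illustration only; real systems use
# larger curated lists.
STOP_WORDS = {"a", "an", "the", "is", "was"}

def remove_stop_words(text):
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The package was lost"))  # ['package', 'lost']
print(remove_stop_words("A package is lost"))     # ['package', 'lost']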
Part-of-speech tagging and syntactic parsing: Part-of-speech (PoS) tagging is the process
of labeling each word with its part of speech (for example, noun, verb, adjective, etc.). A
syntactic parser identifies how words combine to form phrases, clauses, and entire sentences. PoS tagging is a sequence labeling task, syntactic parsing is an extended kind of sequence labeling task, and deep neural networks are the state-of-the-art technology for both PoS
tagging and syntactic parsing. Before deep learning, PoS tagging and syntactic parsing were
essential steps in sentence understanding. However, modern deep learning NLP models
generally only benefit marginally (if at all) from PoS or syntax information, so neither PoS
tagging nor syntactic parsing are widely used in deep learning NLP.
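For reference, here is a minimal sketch of PoS tagging and dependency parsing with the spaCy library, assuming the en_core_web_sm model is installed.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The trader sold the shares yesterday")

# Each token receives a part-of-speech tag and a dependency relation
# linking it to its syntactic head.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)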
NLP Programming Languages
Python:
NLP libraries and toolkits are generally available in Python, and for this reason by far
the majority of NLP projects are developed in Python. Python’s interactive development
environment makes it easy to develop and test new code.
Java and C++:
For processing large amounts of data, C++ and Java are often preferred because they can
support more efficient code.
TensorFlow and PyTorch: These are the two most popular deep learning toolkits. They are
freely available for research and commercial purposes. While they support multiple
languages, their primary language is Python. They come with large libraries of prebuilt
components, so even very sophisticated deep learning NLP models often only require
plugging these components together. They also support high-performance computing
infrastructure, such as clusters of machines with graphics processing unit (GPU) accelerators.
They have excellent documentation and tutorials.
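As a small illustration of plugging prebuilt components together, the sketch below composes standard PyTorch layers into a toy text classifier; the vocabulary size, dimensions, and layer choices are arbitrary assumptions made for brevity.

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    # Embedding, LSTM, and linear layers are all prebuilt PyTorch components.
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # final hidden state
        return self.out(hidden[-1])            # (batch, num_classes)

model = TextClassifier()
logits = model(torch.randint(0, 10000, (4, 20)))  # a batch of 4 fake sentences
print(logits.shape)                               # torch.Size([4, 2])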
AllenNLP: This is a library of high-level NLP components (for example, simple chatbots)
implemented in PyTorch and Python. The documentation is excellent.
HuggingFace: This company distributes hundreds of different pretrained deep learning
NLP models, as well as a plug-and-play software toolkit in TensorFlow and PyTorch that
enables developers to rapidly evaluate how well different pretrained models perform on their
specific tasks.
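As an illustration of that plug-and-play usage, the snippet below loads one pretrained sentiment model by name and scores a sentence; comparing models is largely a matter of swapping the model argument. The specific model name is an assumption about what is available on the hub.

from transformers import pipeline

# Swap the `model` argument to evaluate a different pretrained model
# on the same task-specific examples.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The claim was processed quickly and without errors."))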
Spark NLP: Spark NLP is an open source text processing library for advanced NLP for the
Python, Java, and Scala programming languages. Its goal is to provide an application
programming interface (API) for natural language processing pipelines. It offers pretrained
neural network models, pipelines, and embeddings, as well as support for training custom
models.
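A minimal Spark NLP sketch, assuming pyspark and spark-nlp are installed; "explain_document_dl" is one of the pretrained pipelines the library distributes.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a local Spark session with Spark NLP loaded
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# The pretrained pipeline tokenizes, tags, and runs entity recognition.
result = pipeline.annotate("Spark NLP annotates text at scale.")
print(result["entities"], result["pos"])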
SpaCy NLP: SpaCy is a free, open source library for advanced NLP in Python, and it is
specifically designed to help build applications that can process and understand large
volumes of text. SpaCy is known to be highly intuitive and can handle many of the tasks
needed in common NLP projects.
In summary, natural language processing is an exciting area of artificial intelligence
development that fuels a wide range of new products such as search engines, chatbots,
recommendation systems, and speech-to-text systems. As human interfaces with computers
continue to move away from buttons, forms, and domain-specific languages, the demand for natural language processing will continue to grow. For this reason, Oracle
Cloud Infrastructure is committed to providing on-premises performance with our
performance-optimized compute shapes and tools for NLP. Oracle Cloud Infrastructure
offers an array of GPU shapes that you can deploy in minutes to begin experimenting with
NLP.