Unit 4

4. NATURAL LANGUAGE PROCESSING, UNDERSTANDING, AND GENERATION
Differentiate NLP, NLU, and NLG

NLP (natural language processing) is the broad field of processing natural language text:
tokenization, tagging, parsing, and similar tasks. NLU (natural language understanding) is a
subset of NLP that focuses on extracting the meaning and intent behind the text. NLG (natural
language generation) works in the opposite direction: it produces natural language output,
such as a chatbot's follow-up question.

Chatbot Architecture:

Fig: Architecture diagram for chatbots


Let’s say an airline company has built a chatbot to book a flight via their website or social media
pages. The following are the steps as per the architecture shown in Figure:
1. Customer says, “Help me book a flight for tomorrow from London to New York”
through the airline’s Facebook page. In this case, Facebook becomes the presentation
layer. A fully functional chatbot could be integrated into a company’s website, social
network page, and messaging apps like Skype and Slack.
2. Next, the message is carried to the messaging backend where the plain text passes
through an NLP/NLU engine, where the text is broken into tokens, and the message is
converted into a machine-understandable command.
3. The decision engine then matches the command with preconfigured workflows. So, for
example, to book a flight, the system needs a source and a destination. This is where
NLG helps. The chatbot will ask, “Sure, I will help you book your flight from
London to New York. Could you please let me know if you prefer your flight from
Heathrow or Gatwick Airport?” The chatbot picks up the source and destination and
automatically generates a follow-up question asking which airport the customer prefers.
4. The chatbot now hits the data layer and fetches the flight information from preferred data
sources, which could typically be connected to live booking systems. The data source
provides flight availability, price, and many other services as per the design.
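To make steps 2 and 3 concrete, here is a deliberately simplified sketch of the NLU,
decision-engine, and NLG stages. The workflow definition, slot names, and regex rules
are all hypothetical; a production bot would use a trained NLU engine rather than
hand-written patterns.

import re

# Hypothetical workflow definition for the decision engine
BOOK_FLIGHT_WORKFLOW = {
    "intent": "book_flight",
    "required_slots": ["source", "destination", "date", "airport"],
}

def understand(message):
    """Toy NLU: extract an intent and slots from plain text."""
    slots = {}
    m = re.search(r"from ([\w ]+?) to ([\w ]+)", message, re.IGNORECASE)
    if m:
        slots["source"], slots["destination"] = m.group(1), m.group(2)
    if re.search(r"\btomorrow\b", message, re.IGNORECASE):
        slots["date"] = "tomorrow"
    intent = "book_flight" if re.search(r"book .*flight", message, re.IGNORECASE) else "unknown"
    return {"intent": intent, "slots": slots}

def decide_and_respond(parsed):
    """Toy decision engine + NLG: ask for the first missing slot."""
    missing = [s for s in BOOK_FLIGHT_WORKFLOW["required_slots"]
               if s not in parsed["slots"]]
    if missing:
        return f"Sure, I can help with that. Could you tell me your {missing[0]}?"
    s = parsed["slots"]
    return f"Booking your flight from {s['source']} to {s['destination']} on {s['date']}."

msg = "Help me book a flight for tomorrow from London to New York"
print(decide_and_respond(understand(msg)))
# -> asks for the missing "airport" slot, like the Heathrow/Gatwick
#    follow-up question in the example above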

Further, to differentiate between NLP and NLU, the Venn diagram in Figure
shows a few applications of NLP and NLU. It shows NLU as a subset of NLP. The
overall objective is to process and understand the natural language text to make machines
think like humans.

Fig: Applications of NLP and NLU


Popular Open Source NLP and NLU Tools:

1. NLTK: The Natural Language Toolkit (NLTK) is a Python library for processing
natural language text. It has an Apache 2.0 open source license and is written in the
Python programming language. The following are some of the tasks NLTK can perform:
 Classification of text: Classifying text into different categories for better
organization and content filtering
 Tokenization of sentences: Breaking sentences into words for symbolic and
statistical natural language processing
 Stemming words: Reducing words to their base or root form
 Part-of-speech (POS) tagging: Tagging words with their parts of speech, which
groups words with similar grammatical properties
 Parsing text: Determining the syntactic structure of text based on the underlying
grammar
 Semantic reasoning: The ability to understand the meaning of words to create
representations
NLTK is the first choice of tool for teaching NLP. It is also widely used as a
platform for prototyping and research. A short usage sketch follows.
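The following is a minimal sketch of three of the tasks listed above (tokenization,
POS tagging, and stemming) using NLTK's standard APIs; the sample sentence is invented
for illustration.

import nltk
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer and POS-tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The flights were delayed by heavy storms."

tokens = nltk.word_tokenize(sentence)              # tokenization
tags = nltk.pos_tag(tokens)                        # part-of-speech tagging
stems = [PorterStemmer().stem(t) for t in tokens]  # stemming

print(tokens)  # ['The', 'flights', 'were', 'delayed', 'by', 'heavy', 'storms', '.']
print(tags)    # [('The', 'DT'), ('flights', 'NNS'), ...]
print(stems)   # ['the', 'flight', 'were', 'delay', ...]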
2. spaCy: Most organizations that build a product involving natural language data are
adopting spaCy. It stands out by offering a production-grade NLP engine that is
accurate and fast, and its extensive documentation further increases the adoption rate.
It is developed in Python and Cython. The language models in spaCy are trained using
deep learning, which provides high accuracy for NLP tasks.
Currently, the following are some high-level capabilities of spaCy:
 Covers NLTK features: Provides all the features of NLTK, like tokenization, POS
tagging, dependency trees, named entity recognition, and many more.
 Deep learning workflow: spaCy supports deep learning workflows that can
connect to models trained on popular frameworks like TensorFlow, Keras, scikit-learn,
and PyTorch. This makes spaCy a very powerful library for building and deploying
sophisticated language models for real-world applications.
 Multi-language support: Provides support for more than 50 languages including
French, Spanish, and Greek.
 Processing pipeline: Offers an easy-to-use and very intuitive processing pipeline for
performing a series of NLP tasks in an organized manner. For example, a pipeline for
performing POS tagging, parsing the sentence, and named entity recognition could
be defined in a list like this: pipeline = ["tagger", "parser", "ner"]. This makes the code
easy to read and quick to debug.
 Visualizers: Using displaCy, it becomes easy to draw a dependency tree and entity
recognizer output. We can add our own colors to make the visualization aesthetically
pleasing. It renders quickly in a Jupyter notebook as well. A short sketch of the
pipeline in action follows.
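As an illustration of the points above, this minimal sketch runs spaCy's default
pipeline on an invented sentence and reads off POS tags, dependency labels, and named
entities. It assumes the small English model en_core_web_sm has been installed.

import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Book a flight from London to New York tomorrow.")

# POS tags and dependency labels produced by the pipeline
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities (London and New York are typically labeled GPE)
for ent in doc.ents:
    print(ent.text, ent.label_)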

3. CoreNLP: Stanford CoreNLP is one of the oldest and most robust tools for all natural
language tasks. Its suite of functions offers many linguistic analysis capabilities,
including the already discussed POS tagging, dependency tree, named entity recognition,
sentiment analysis, and others. Unlike spaCy and NLTK, CoreNLP is written in Java. It
also provides Java APIs to use from the command line and third-party APIs for working
with modern programming languages. The following are the core features of CoreNLP:
 Fast and robust: Since it is written in Java, which is a time-tested and robust
programming language, CoreNLP is a favorite for many developers.
 A broad range of grammatical analysis: Like NLTK and spaCy, CoreNLP also
provides a good number of analytical capabilities to process and understand natural
language.
 API integration: CoreNLP has excellent API support for running it from the
command line and programming languages like Python via a third-party API or web
service.
 Supports multiple operating systems (OSs): CoreNLP works on Windows, Linux,
and macOS.
 Language support: Like spaCy, CoreNLP provides useful language support, which
includes Arabic, Chinese, and many more.
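Since CoreNLP runs on the JVM, a common way to use it from Python is through its
built-in HTTP server rather than the Java API. The following is a minimal sketch,
assuming a CoreNLP server has already been started locally on port 9000 (the endpoint
and properties format follow the CoreNLP server documentation); the example sentence
is invented.

import json
import requests

# Assumes a server started with, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Book a flight from London to New York.".encode("utf-8"),
)
doc = resp.json()

# Print each token with its POS tag and named entity label
for sentence in doc["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"], token["ner"])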

4. Gensim: gensim is a popular library written in Python and Cython. It is robust and
production-ready, which makes it another popular choice for NLP and NLU. It can help
analyze the semantic structure of plain-text documents and surface the important
topics. The following are some core features of gensim:
 Topic modeling: It automatically extracts semantic topics from documents. It
provides various statistical models, including latent Dirichlet allocation (LDA), for
topic modeling.
 Pretrained models: It has many pretrained models that provide out-of-the-box
capabilities to develop general-purpose functionalities quickly.
 Similarity retrieval: gensim’s capability to extract semantic structures from any
document makes it an ideal library for similarity queries on numerous topics.
 Common NLP features: It also provides many of the features available in spaCy,
NLTK, and CoreNLP. A short topic-modeling sketch follows.
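The following is a minimal topic-modeling sketch using gensim's LDA implementation.
The four toy documents are invented; real topic modeling needs a much larger corpus.

from gensim import corpora
from gensim.models import LdaModel

# Tiny invented corpus, already tokenized
texts = [
    ["flight", "booking", "airport", "london"],
    ["flight", "delay", "airport", "weather"],
    ["pizza", "cheese", "delivery", "order"],
    ["pasta", "cheese", "recipe", "dinner"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=1)
print(lda.print_topics())  # two topics: travel-related vs food-related words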
5. TextBlob: TextBlob is a relatively less popular but easy-to-use Python library that
provides various NLP capabilities like the libraries discussed above. It extends the
features provided by NLTK but in a much-simplified form. The following are some of
the features of TextBlob:
 Sentiment analysis: It provides an easy-to-use method for computing polarity and
subjectivity scores that measure the sentiment of a given text.
 Language translations: Its language translation is powered by Google Translate,
which provides support for more than 100 languages.
 Spelling correction: It uses the simple spelling-correction method Peter Norvig
demonstrated on his blog at http://norvig.com/spell-correct.html, which is about
70% accurate. A short sketch of the sentiment and spelling features follows.
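Here is a small sketch of the sentiment and spelling-correction features; the review
text (with a deliberate typo, "grat") is invented.

from textblob import TextBlob

review = TextBlob("The flight was grat and the crew was very friendly.")

# Sentiment: polarity in [-1, 1], subjectivity in [0, 1]
print(review.sentiment)

# Spelling correction based on Norvig's method
print(review.correct())  # "The flight was great and the crew was very friendly."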
6. fastText: fastText is a specialized library for learning word embeddings and text
classification. It was developed by researchers in Facebook's AI Research (FAIR) lab. It
is written in C++ and Python, making it very efficient and fast at processing even large
chunks of data. The following are some of the features of fastText:
 Word embedding learning: Provides word embedding models using the
skipgram and Continuous Bag of Words (CBOW) approaches via unsupervised training.
 Word vectors for out-of-vocabulary words: It can produce word vectors even
when a word is not present in the training vocabulary.
 Text classification: fastText provides a fast text classifier, which the paper
“Bag of Tricks for Efficient Text Classification” reports is often on par with deep
learning classifiers in accuracy while training many orders of magnitude faster.
A short embedding sketch follows.
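The following is a minimal sketch of learning skipgram word embeddings with the
fasttext Python package and querying a vector for an out-of-vocabulary word; the
training file name data.txt is hypothetical.

import fasttext

# data.txt is a hypothetical plain-text training file, one document per line
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# Vector for an in-vocabulary word
print(model.get_word_vector("flight")[:5])

# fastText builds vectors from character n-grams, so it can also return a
# vector for a word that never appeared in the training vocabulary
print(model.get_word_vector("flighttt")[:5])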
Natural Language Processing

Natural language processing deals with understanding and manipulating natural language
text or speech to perform specific useful tasks. NLP combines ideas and concepts from
computer science, linguistics, mathematics, artificial intelligence, machine learning,
and psychology.

1. Processing Textual Data: The dataset can be downloaded from
www.kaggle.com/snap/amazon-fine-food-reviews, which is made available under a CC0:
Public Domain license.
 Reading the CSV File:
Using the read_csv function from the pandas library, we read the Reviews.csv file into
a food_review data frame and print the top rows (Figure):
import pandas as pd
food_review = pd.read_csv("Reviews.csv")
food_review.head()

 Sampling: Using the sample function from the pandas data frame, let’s randomly
pick the text of 1000 reviews and print the top rows (see Figure ):
food_review_text = pd.DataFrame(food_review["Text"])
food_review_text_1k = food_review_text.sample(n=1000, random_state=123)
food_review_text_1k.head()

 Tokenization Using NLTK: The first step in processing text data is to separate a
sentence into individual words. This process is called tokenization. We will use
NLTK's word_tokenize function to create a column in the food_review_text_1k data
frame we created above and print the top rows to see the output of tokenization
(Figure):
import nltk
nltk.download('punkt')  # tokenizer models needed by word_tokenize

food_review_text_1k['tokenized_reviews'] = food_review_text_1k['Text'].apply(nltk.word_tokenize)
food_review_text_1k.head()
 Word Search Using Regex: Let's take the first row in the data frame and search for
the presence of a word using a regular expression (regex). The regex searches for
any five-letter word that has c as its first character and i as its third character. We
can write various regex searches for any pattern of interest. We use the re.search()
function to perform this search:
import re

# Search: all 5-letter words with c as the first letter and i as the third letter
search_word = set([w for w in food_review_text_1k['tokenized_reviews'].iloc[0] if re.search('^c.i..$', w)])
print(search_word)
{'chips'}
 Word Search Using the Exact Word: Another way of searching is to use the
exact word. This can be achieved using the str.contains() function in pandas. In
the following example, we search for the word “great” in all of the reviews. The rows
of the reviews containing the word are retrieved; these can be considered positive
reviews. See Figure.
#Search for the word "great" in reviews
food_review_text_1k[food_review_text_1k['Text'].str.contains('great')]

Fig: Samples with a specific word
