Unit 4
Chatbot Architecture:
Further, to differentiate between NLP and NLU, the Venn diagram in Figure
shows a few applications of NLP and NLU, depicting NLU as a subset of NLP. The
overall objective is to process and understand natural language text so that machines
can reason about it much as humans do.
1. NLTK: The Natural Language Toolkit (NLTK) is a Python library for processing
human language data, primarily English text. It is released under the Apache 2.0 open
source license. The following are some of the tasks NLTK can perform:
Classification of text: Classifying text into different categories for better
organization and content filtering
Tokenization of sentences: Breaking sentences into words for symbolic and
statistical natural language processing
Stemming words: Reducing words to their base or root form
Part-of-speech (POS) tagging: Tagging words with their parts of speech, which
groups words with similar grammatical properties
Parsing text: Determining the syntactic structure of text based on the underlying
grammar
Semantic reasoning: Understanding the meaning of words in order to build semantic
representations
NLTK is often the first choice of tool for teaching NLP. It is also widely used as a
platform for prototyping and research.
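To make these tasks concrete, here is a minimal sketch of tokenization, stemming, and POS tagging; the sentence is illustrative, and it assumes the punkt and averaged_perceptron_tagger resources have been downloaded:
import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')                       # one-time: tokenizer model
nltk.download('averaged_perceptron_tagger')  # one-time: POS tagger model

sentence = "NLTK makes teaching natural language processing much easier."

tokens = nltk.word_tokenize(sentence)      # break the sentence into words
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]  # reduce words to their root form
tags = nltk.pos_tag(tokens)                # tag each token with its part of speech

print(tokens)
print(stems)
print(tags)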
2. spaCy: Most organizations that build a product involving natural language data are
adopting spaCy. It stands out by offering a production-grade NLP engine that is both
accurate and fast, and its extensive documentation further increases the adoption rate.
It is developed in Python and Cython. The language models in spaCy are trained using
deep learning, which provides high accuracy across NLP tasks.
Currently, the following are some high-level capabilities of spaCy:
Covers NLTK features: Provides all the NLTK-style features, such as tokenization,
POS tagging, dependency trees, named entity recognition, and many more.
Deep learning workflow: spaCy supports deep learning workflows that can
connect to models trained with popular frameworks like TensorFlow, Keras, scikit-
learn, and PyTorch. This makes spaCy a very potent library for building and
deploying sophisticated language models for real-world applications.
Multi-language support: Provides support for more than 50 languages including
French, Spanish, and Greek.
Processing pipeline: Offers an easy-to-use and very intuitive processing pipeline for
performing a series of NLP tasks in an organized manner. For example, a pipeline for
performing POS tagging, parsing the sentence, and extracting named entities could
be defined in a list like this: pipeline = ["tagger", "parser", "ner"]. This makes the code
easy to read and quick to debug.
Visualizers: Using displaCy, it becomes easy to visualize dependency trees and
named entities. We can add our own colors to make the visualization aesthetically
pleasing, and it renders directly in a Jupyter notebook as well (see the sketch after this list).
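As a quick illustration of the pipeline and visualizer described above, here is a minimal sketch; it assumes the small English model has been installed with python -m spacy download en_core_web_sm, and the example sentence is illustrative:
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline: tagger, parser, ner
doc = nlp("Apple is opening a new office in Paris next year.")

# POS tags and dependency labels from the pipeline components
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities found by the "ner" component
for ent in doc.ents:
    print(ent.text, ent.label_)

# displaCy renders the dependency tree, e.g., inside a Jupyter notebook
from spacy import displacy
displacy.render(doc, style="dep")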
3. CoreNLP: Stanford CoreNLP is one of the oldest and most robust tools for all natural
language tasks. Its suite of functions offers many linguistic analysis capabilities,
including the already discussed POS tagging, dependency tree, named entity recognition,
sentiment analysis, and others. Unlike spaCy and NLTK, CoreNLP is written in Java. It
also provides Java APIs to use from the command line and third-party APIs for working
with modern programming languages. The following are the core features of CoreNLP:
Fast and robust: Since it is written in Java, which is a time-tested and robust
programming language, CoreNLP is a favorite for many developers.
A broad range of grammatical analysis: Like NLTK and spaCy, CoreNLP also
provides a good number of analytical capabilities to process and understand natural
language.
API integration: CoreNLP has excellent API support for running it from the
command line and programming languages like Python via a third-party API or web
service.
Support for multiple operating systems (OSs): CoreNLP works on Windows, Linux,
and macOS.
Language support: Like spaCy, CoreNLP provides useful language support, which
includes Arabic, Chinese, and many more.
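One common route from Python is the CoreNLP client that ships with Stanford's stanza package. The sketch below is one possible setup, assuming the CoreNLP distribution has been downloaded and the CORENLP_HOME environment variable points to it:
from stanza.server import CoreNLPClient

# Start a local CoreNLP server, annotate a sentence, and read the results
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate("Stanford University is located in California.")
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.ner)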
4. Gensim: gensim is a popular library written in Python and Cython. It is robust and
production-ready, which makes it another popular choice for NLP and NLU. It can help
analyze the semantic structure of plain-text documents and surface the important
topics. The following are some core features of gensim:
Topic modeling: It automatically extracts semantic topics from documents. It
provides various statistical models, including latent Dirichlet allocation (LDA), for topic
modeling.
Pretrained models: It has many pretrained models that provide out-of-the-box
capabilities to develop general-purpose functionalities quickly.
Similarity retrieval: gensim’s capability to extract semantic structures from any
document makes it an ideal library for similarity queries on numerous topics.
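A minimal topic-modeling sketch with gensim's LDA implementation (the tiny corpus below is illustrative):
from gensim import corpora, models

# A toy corpus of already-tokenized documents
texts = [
    ["coffee", "beans", "roast", "brew"],
    ["tea", "leaves", "brew", "cup"],
    ["espresso", "coffee", "machine", "brew"],
]

# Map each word to an integer id, then build bag-of-words vectors
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model and inspect the discovered topics
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)
for topic in lda.print_topics():
    print(topic)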
5. TextBlob: TextBlob is a relatively less popular but easy-to-use Python library that
provides various NLP capabilities like the libraries discussed above. It extends the
features provided by NLTK but in a much-simplified form. The following are some of
the features of TextBlob:
Sentiment analysis: It provides an easy-to-use method for computing polarity and
subjectivity scores that measure the sentiment of a given text.
Language translation: Its language translation is powered by Google Translate,
which supports more than 100 languages.
Spelling correction: It uses the simple spelling correction method demonstrated by
Peter Norvig, Google's Director of Research, on his blog at
http://norvig.com/spell-correct.html. The approach is about 70% accurate.
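A minimal sketch of the sentiment and spelling features (the review text is illustrative; translation is omitted because it requires network access to Google Translate):
from textblob import TextBlob

blob = TextBlob("The food was absolutly wonderful!")

# Sentiment: polarity in [-1, 1], subjectivity in [0, 1]
print(blob.sentiment)

# Spelling correction based on Norvig's approach
print(blob.correct())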
6. fastText: fastText is a specialized library for learning word embeddings and text
classification. It was developed by researchers at Facebook AI Research (FAIR). It
is written in C++ and Python, making it very efficient and fast at processing even large
chunks of data. The following are some of the features of fastText:
Word embedding learning: Provides word embedding models learned by
unsupervised training using skip-gram and continuous bag of words (CBOW).
Word vectors for out-of-vocabulary words: It can produce word vectors even for
words not present in the training vocabulary, thanks to its use of subword information.
Text classification: fastText provides a fast text classifier which, per the paper titled
"Bag of Tricks for Efficient Text Classification," is often on par with deep learning
classifiers in accuracy while being much faster to train.
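A minimal sketch of both use cases with the fasttext Python bindings; the file names are illustrative (data.txt is a plain-text corpus, and train.txt holds one example per line prefixed with a label like __label__positive):
import fasttext

# Unsupervised word embeddings with the skip-gram model
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# Word vectors are available even for out-of-vocabulary words,
# thanks to subword (character n-gram) information
vector = model.get_word_vector("chatbot")
print(vector.shape)

# Supervised text classification
clf = fasttext.train_supervised("train.txt")
print(clf.predict("this was great"))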
Natural Language Processing: Natural language processing deals with understanding
and manipulating natural language text or speech to perform specific, useful tasks. NLP
combines ideas and concepts from computer science, linguistics, mathematics, artificial
intelligence, machine learning, and psychology.
Sampling: Using the sample function of the pandas data frame, let's randomly
pick the text of 1,000 reviews and print the top rows (see Figure):
import pandas as pd  # food_review is assumed to be the reviews data frame loaded earlier

# Keep only the review text column
food_review_text = pd.DataFrame(food_review["Text"])
# Sample 1,000 reviews; random_state makes the sample reproducible
food_review_text_1k = food_review_text.sample(n=1000, random_state=123)
food_review_text_1k.head()
Tokenization Using NLTK: The first step in processing text data is to separate a
sentence into individual words. This process is called tokenization. We will use
NLTK's word_tokenize function to create a column in the food_review_text_1k data
frame we created above and print the top rows to see the output of tokenization
(Figure):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer model

# Tokenize each review into a list of words
food_review_text_1k['tokenized_reviews'] = food_review_text_1k['Text'].apply(nltk.word_tokenize)
food_review_text_1k.head()
Word Search Using Regex: Let's take the first row in the data frame and search for
the presence of a word using a regular expression (regex). The regex matches
any five-letter word that has c as its first character and i as its third. We can write
various regexes to search for any pattern of interest. We use the re.search() function to
perform this search:
import re

#Search: All 5-letter words with c as its first letter and i as its third letter
search_word = set([w for w in food_review_text_1k['tokenized_reviews'].iloc[0]
                   if re.search('^c.i..$', w)])
print(search_word)
{'chips'}
Word Search Using the Exact Word: Another way of searching for a word is to use
the exact word. This can be achieved using the str.contains() function in pandas. In
the following example, we search for the word "great" in all of the reviews; the rows
of reviews containing the word are retrieved and can be considered positive reviews.
See Figure.
#Search for the word "great" in reviews
food_review_text_1k[food_review_text_1k['Text'].str.contains('great')]
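Note that str.contains also accepts parameters for case-insensitive and literal (non-regex) matching; a minimal variant on the same data frame:
#Case-insensitive search, treating the pattern as a plain string
food_review_text_1k[food_review_text_1k['Text'].str.contains('great', case=False, regex=False)]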