0% found this document useful (0 votes)

61 views8 pages

Building A Simple Chatbot From Scratch in Python1

Uploaded by

Nikal Poudel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

61 views8 pages

Building A Simple Chatbot From Scratch in Python1

Uploaded by

Nikal Poudel

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 8

Faculty of Technology

University of Sunderland

Building a Simple Chatbot from Scratch in Python (using NLTK)

A chatbot is an artificial intelligence-powered piece of software in a device (Siri, Alexa, Google

Assistant etc), application, website or other networks that try to gauge consumer’s needs and then
assist them to perform a particular task like a commercial transaction, hotel booking, form
submission etc .Today almost every company has a chatbot deployed to engage with the users.
Some of the ways in which companies are using chatbots are:

 To deliver flight information

 to connect customers and their finances
 As customer support

The possibilities are (almost) limitless.

How do Chatbots work?

There are broadly two variants of chatbots: Rule-Based and Self learning.

1. In a Rule-based approach, a bot answers questions based on some rules on which it is

trained on. The rules defined can be very simple to very complex. The bots can handle
simple queries but fail to manage complex ones.
2. The Self learning bots are the ones that use some Machine Learning-based approaches and
are definitely more efficient than rule-based bots. These bots can be of further two types:
Retrieval Based or Generative

In retrieval-based models, a chatbot uses some heuristic to select a response from a library of
predefined responses. The chatbot uses the message and context of conversation for selecting the
best response from a predefined list of bot messages. The context can include a current position in
the dialog tree, all previous messages in the conversation, previously saved variables (e.g.
username). Heuristics for selecting a response can be engineered in many ways, from rule-based if-
else conditional logic to machine learning classifiers.

Generative bots can generate the answers and not always replies with one of the answers from a set
of answers. This makes them more intelligent as they take word by word from the query and
generates the answers.

Building the Bot
Pre-requisites
A hands-on knowledge of scikit library and NLTK is assumed. However, if you are new to NLP, you
can still read the article and then refer back to resources.

NLP
The field of study that focuses on the interactions between human language and computers is called
Natural Language Processing, or NLP for short. It sits at the intersection of computer science,
artificial intelligence, and computational linguistics [Wikipedia].NLP is a way for computers to
analyse, understand, and derive meaning from human language in a smart and useful way. By
utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

summarization, translation, named entity recognition, relationship extraction, sentiment analysis,

speech recognition, and topic segmentation.

NLTK: A Brief Intro
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with
human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources
such as WordNet, along with a suite of text processing libraries for classification, tokenization,
stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.
NLTK has been called “a wonderful tool for teaching and working in, computational linguistics using
Python,” and “an amazing library to play with natural language.”

Downloading and installing NLTK

1. Install NLTK: run pip install nltk

2. Test installation: run python then type import nltk

Installing NLTK Packages
import NLTK and run nltk.download().This will open the NLTK downloader from where you can
choose the corpora and models to download. You can also download all packages at once.

Text Pre- Processing with NLTK

The main issue with text data is that it is all in text format (strings). However, the Machine learning
algorithms need some sort of numerical feature vector in order to perform the task. So before we
start with any NLP project we need to pre-process it to make it ideal for working. Basic text pre-
processing includes:

 Converting the entire text into uppercase or lowercase, so that the algorithm does not treat
the same words in different cases as different
 Tokenization: Tokenization is just the term used to describe the process of converting the
normal text strings into a list of tokens i.e. words that we actually want. Sentence tokenizer
can be used to find the list of sentences and Word tokenizer can be used to find the list of
words in strings.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

 Removing Noise i.e. everything that isn’t in a standard number or letter.

 Removing Stop words. Sometimes, some extremely common words which would appear to
be of little value in helping select documents matching a user need are excluded from the
vocabulary entirely. These words are called stop words
 Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to
their stem, base or root form — generally a written word form. Example if we were to stem
the following words: “Stems”, “Stemming”, “Stemmed”, “and Stemtization”, the result
would be a single word “stem”.
 Lemmatization: A slight variant of stemming is lemmatization. The major difference
between these is, that, stemming can often create non-existent words, whereas lemmas are
actual words. So, your root stem, meaning the word you end up with, is not something you
can just look up in a dictionary, but you can look up a lemma. Examples of Lemmatization

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

are that “run” is a base form for words like “running” or “ran” or that the word “better” and
“good” are in the same lemma, so they are considered the same.

Bag of Words
After the initial pre-processing phase, we need to transform text into a meaningful vector (or array)
of numbers. The bag-of-words is a representation of text that describes the occurrence of words
within a document. It involves two things;

 A vocabulary of known words.

 A measure of the presence of known words

Why is it is called a “bag” of words? That is because any information about the order or structure of
words in the document is discarded and the model is only concerned with whether the known
words occur in the document, not where they occur in the document.

The intuition behind the Bag of Words is that documents are similar if they have similar content.
Also, we can learn something about the meaning of the document from its content alone.
For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to
vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).

TF-IDF Approach
A problem with the Bag of Words approach is that highly frequent words start to dominate in the
document (e.g. larger score) but may not contain as much “informational content”. Also, it will give
more weight to longer documents than shorter documents.

One approach is to rescale the frequency of words by how often they appear in all documents so
that the scores for frequent words like “the” that are also frequent across all documents are
penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-
IDF for short, where:

Term Frequency: is a scoring of the frequency of the word in the current document.

Inverse Document Frequency: is a scoring of how rare the word is across documents.

Tf-idf weight is a weight often used in information retrieval and text mining. This weight is a
statistical measure used to evaluate how important a word is to a document in a collection or corpus

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

Example:

Consider a document containing 100 words wherein the word ‘phone’ appears 5 times.
The term frequency (i.e., tf) for phone is then (5 / 100) = 0.05. Now, assume we have 10
million documents and the word phone appears in one thousand of these. Then, the
inverse document frequency (i.e., IDF) is calculated as log (10,000,000 / 1,000) = 4.
Thus, the Tf-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
Tf-IDF can be implemented in scikit learn as:
from sklearn.feature_extraction.text import TfidfVectorizer

Cosine Similarity
TF-IDF is a transformation applied to texts to get two real-valued vectors in vector space. We can
then obtain the Cosine similarity of any pair of vectors by taking their dot product and dividing that
by the product of their norms. That yields the cosine of the angle between the vectors. Cosine
similarity is a measure of similarity between two non-zero vectors. Using this formula, we can find
out the similarity between any two documents d1 and d2.

where d1,d2 are two non-zero vectors.

Now we have a fair idea of the NLP process. It is time that we get to our real task i.e Chatbot
creation. We will name the chatbot here as ‘ROBO🤖’’

Importing the necessary libraries:

Another import…

Corpus
For our example, we will be using the Wikipedia page for chatbots as our corpus. Copy the
contents from the page and place it in a text file named ‘chatbot.txt’.

Reading in the data
We will read in the corpus.txt file and convert the entire corpus into a list of sentences and a
list of words for further pre-processing.

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

Let see an example of the sent_tokens and the word_tokens

Pre-processing the raw text

We shall now define a function called LemTokens which will take as input the tokens and return
normalized tokens.

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

Keyword matching
Next, we will define a function for a greeting by the bot i.e. if a user’s input is a greeting, the bot
shall return a greeting response.ELIZA uses a simple keyword matching for greetings. We will utilize
the same concept here.

Generating Response
To generate a response from our bot for input questions, the concept of document similarity will be
used. So, we begin by importing necessary modules.

 From scikit learn library, import the TFidf vectorizer to convert a collection of raw
documents to a matrix of TF-IDF features.

 Also, import cosine similarity module from scikit learn library

This will be used to find the similarity between words entered by the user and the words in the
corpus. This is the simplest possible implementation of a chatbot.

We define a function response which searches the user’s utterance for one or more known
keywords and returns one of several possible responses. If it doesn’t find the input matching any of
the keywords, it returns a response:” I am sorry! I don’t understand you”

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

Finally, we will feed the lines that we want our bot to say while starting and ending a
conversation depending upon user’s input.

So that’s pretty much it. We have the coded our first chatbot in NLTK. Run it and see how you
can interact with it.

CET313 Artificial Intelligence

Faculty of Technology
University of Sunderland

Conclusion
Though it is a very simple bot with hardly any cognitive skills, it’s a good way to get into NLP and get
to know about chatbots. Though ‘ROBO’ responds to user input. It won’t fool your friends, and for a
production system you’ll want to consider one of the existing bot platforms or frameworks, but this
example should help you think through the design and challenge of creating a chatbot.

CET313 Artificial Intelligence

Statement of Purpose
0% (1)
Statement of Purpose
7 pages
Randall Platinum Plan
No ratings yet
Randall Platinum Plan
18 pages
Ram Chandra Padwal - Pratical Guide To NLTK For Data Science
No ratings yet
Ram Chandra Padwal - Pratical Guide To NLTK For Data Science
37 pages
Chat Bot
No ratings yet
Chat Bot
10 pages
Python Chatbot Project
No ratings yet
Python Chatbot Project
10 pages
NLP Manual (1-12) 1
No ratings yet
NLP Manual (1-12) 1
56 pages
Nlp Lab Manual
No ratings yet
Nlp Lab Manual
21 pages
Minorproject Ishant
No ratings yet
Minorproject Ishant
18 pages
Rajeev Mishra 20 SCSE1180087
No ratings yet
Rajeev Mishra 20 SCSE1180087
29 pages
NLP Final Review
No ratings yet
NLP Final Review
32 pages
Course Project Report For: Artificial Intelligence EL-3011
No ratings yet
Course Project Report For: Artificial Intelligence EL-3011
8 pages
CSDM2-Text Preprocessing For NL Data - 011050
No ratings yet
CSDM2-Text Preprocessing For NL Data - 011050
6 pages
Whats App
No ratings yet
Whats App
24 pages
Natural Language Processing With Python's NLTK Package – Real Python
No ratings yet
Natural Language Processing With Python's NLTK Package – Real Python
27 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
55 pages
text classification reseach paper
No ratings yet
text classification reseach paper
4 pages
TextFeatureEnginerring-NLP lec2
No ratings yet
TextFeatureEnginerring-NLP lec2
60 pages
A Tutorial On: Linguistic Data Analysis
No ratings yet
A Tutorial On: Linguistic Data Analysis
99 pages
AP for NLP-LO1
No ratings yet
AP for NLP-LO1
61 pages
Natural Language Processing Notes Class 10
No ratings yet
Natural Language Processing Notes Class 10
10 pages
NLP_DeepNLP
No ratings yet
NLP_DeepNLP
61 pages
AP for NLP-Word 2 Vec
No ratings yet
AP for NLP-Word 2 Vec
33 pages
Week 6: Introduction To Natural Language Processing
No ratings yet
Week 6: Introduction To Natural Language Processing
18 pages
Lecture 8 - Text Analytics NLP
No ratings yet
Lecture 8 - Text Analytics NLP
24 pages
Statistical Learning and Text Classification With NLTK and Scikit-Learn
No ratings yet
Statistical Learning and Text Classification With NLTK and Scikit-Learn
24 pages
ch5&6_lecture_AI
No ratings yet
ch5&6_lecture_AI
69 pages
Chapter 7.1 - Introducing Natural Language Processing
No ratings yet
Chapter 7.1 - Introducing Natural Language Processing
39 pages
Introduction To NLP
No ratings yet
Introduction To NLP
50 pages
NLP record
No ratings yet
NLP record
16 pages
NLP For ML - Spam Classifier
No ratings yet
NLP For ML - Spam Classifier
14 pages
AI Zone: Log in Sign Up
No ratings yet
AI Zone: Log in Sign Up
24 pages
Intro To NLP: Natural Language Toolkit
No ratings yet
Intro To NLP: Natural Language Toolkit
11 pages
21 01 23
No ratings yet
21 01 23
8 pages
A Beginner's Guide To Natural Language Processing - IBM Developer
No ratings yet
A Beginner's Guide To Natural Language Processing - IBM Developer
9 pages
UNIT-V-NLP Using NLTK
No ratings yet
UNIT-V-NLP Using NLTK
19 pages
Natural Language Processing_NOTES
No ratings yet
Natural Language Processing_NOTES
4 pages
Lesson 1 Intro
No ratings yet
Lesson 1 Intro
51 pages
Ai Phase 3 Project
No ratings yet
Ai Phase 3 Project
18 pages
Natural Language Processing
No ratings yet
Natural Language Processing
25 pages
Session 11-12 - Text Analytics
No ratings yet
Session 11-12 - Text Analytics
38 pages
Ai & ML Week-11
No ratings yet
Ai & ML Week-11
32 pages
SMA (TASK1 AND 2) ... HARDCOPY (Final) ..Pranchal..
No ratings yet
SMA (TASK1 AND 2) ... HARDCOPY (Final) ..Pranchal..
11 pages
ML7 - Text Classification
No ratings yet
ML7 - Text Classification
13 pages
NLTK
No ratings yet
NLTK
16 pages
Natural Language Processing
No ratings yet
Natural Language Processing
10 pages
NLP Q2 21SAL54 Scheme
No ratings yet
NLP Q2 21SAL54 Scheme
6 pages
3. Text Classification
No ratings yet
3. Text Classification
60 pages
Aiproject Report
No ratings yet
Aiproject Report
11 pages
big data analytics Chap 11
No ratings yet
big data analytics Chap 11
8 pages
Sentiment Analysis Using Supervised Machine Learning Ijariie13051
No ratings yet
Sentiment Analysis Using Supervised Machine Learning Ijariie13051
7 pages
PPT for Assignment-10 (Machine Learning With Python_NLP-2)
No ratings yet
PPT for Assignment-10 (Machine Learning With Python_NLP-2)
37 pages
SL-3_Assignment No 7
No ratings yet
SL-3_Assignment No 7
14 pages
What Is Natural Language Processing (NLP)
No ratings yet
What Is Natural Language Processing (NLP)
15 pages
Natural Language Processing (NLP)
No ratings yet
Natural Language Processing (NLP)
5 pages
Natural Language Processing manual
No ratings yet
Natural Language Processing manual
39 pages
Lab3 IR BIM
No ratings yet
Lab3 IR BIM
14 pages
Module III
No ratings yet
Module III
42 pages
Natural Language Processing Revision Notes
No ratings yet
Natural Language Processing Revision Notes
4 pages
NLP Lab Manual-1
No ratings yet
NLP Lab Manual-1
18 pages
NLP - 1_250119_222702 (1)
No ratings yet
NLP - 1_250119_222702 (1)
71 pages
Machine Learning, NLP_ Text Classification Using Scikit-learn, Python and NLTK
No ratings yet
Machine Learning, NLP_ Text Classification Using Scikit-learn, Python and NLTK
9 pages
Bi22bc CET324 Assignment 1
100% (1)
Bi22bc CET324 Assignment 1
8 pages
Conditions - UWA Global Excellence Scholarship F19 1756 (2023)
No ratings yet
Conditions - UWA Global Excellence Scholarship F19 1756 (2023)
6 pages
British Army - English Paper
No ratings yet
British Army - English Paper
3 pages

Building A Simple Chatbot From Scratch in Python1

Uploaded by

Building A Simple Chatbot From Scratch in Python1

Uploaded by

Faculty of Technology

Building a Simple Chatbot from Scratch in Python (using NLTK)

A chatbot is an artificial intelligence-powered piece of software in a device (Siri, Alexa, Google

 To deliver flight information

The possibilities are (almost) limitless.

1. In a Rule-based approach, a bot answers questions based on some rules on which it is

CET313 Artificial Intelligence

summarization, translation, named entity recognition, relationship extraction, sentiment analysis,

Downloading and installing NLTK

1. Install NLTK: run pip install nltk

Text Pre- Processing with NLTK

 Removing Noise i.e. everything that isn’t in a standard number or letter.

CET313 Artificial Intelligence

 A vocabulary of known words.

CET313 Artificial Intelligence

where d1,d2 are two non-zero vectors.

Importing the necessary libraries:

CET313 Artificial Intelligence

Let see an example of the sent_tokens and the word_tokens

Pre-processing the raw text

CET313 Artificial Intelligence

 Also, import cosine similarity module from scikit learn library

CET313 Artificial Intelligence

CET313 Artificial Intelligence

CET313 Artificial Intelligence

You might also like