
Chat-Bot using NLTK Library

Submitted by:
Ishant Kumawat
19bcon085
So what is a chatbot?

A chatbot is an artificial intelligence-powered piece of software embedded in a device (Siri, Alexa, Google Assistant, etc.), application, website, or other network. It gauges a consumer's needs and then assists in performing a particular task such as a commercial transaction, a hotel booking, or a form submission. Today almost every company has a chatbot deployed to engage with its users. Some of the ways in which companies are using chatbots are:

 To deliver flight information
 To connect customers with their finances
 As customer support
How do Chatbots work?

There are broadly two variants of chatbots: rule-based and self-learning.

1. In a rule-based approach, a bot answers questions based on a set of rules it is trained on. The rules can range from very simple to very complex. Such bots can handle simple queries but fail to manage complex ones. A minimal sketch of this idea appears after this list.

2. Self-learning bots use machine learning-based approaches and are more efficient than rule-based bots. These bots can be of two further types: retrieval-based or generative.
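
As a hedged illustration of the rule-based approach, the sketch below pairs regular-expression patterns with canned replies; the patterns, responses, and function name are hypothetical examples, not part of any particular chatbot library.

import re

# A rule-based bot: each rule pairs a regex pattern with a canned reply.
# These patterns and responses are illustrative only.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.I), "Hello! How can I help you today?"),
    (re.compile(r"\bflight\b", re.I), "Please share your flight number for status updates."),
    (re.compile(r"\b(bye|goodbye)\b", re.I), "Goodbye! Have a nice day."),
]

def reply(message: str) -> str:
    for pattern, response in RULES:
        if pattern.search(message):
            return response
    # Anything outside the rules exposes the approach's main weakness.
    return "Sorry, I don't understand that yet."

print(reply("Hi there"))               # Hello! How can I help you today?
print(reply("What about my flight?"))  # Please share your flight number...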
Pre-Requisites
1. Scikit-Learn: Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. Please note that sklearn is used to build machine learning models; it should not be used for reading, manipulating, or summarizing data. There are better libraries for that (e.g. NumPy, Pandas). A small usage sketch follows the feature list below.
Important Features of scikit-learn:

 Simple and efficient tools for data mining and data analysis. It features various
classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means, etc.

 Accessible to everybody and reusable in various contexts.

 Built on top of NumPy, SciPy, and matplotlib.

 Open source, commercially usable – BSD license.
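
As a rough sketch of how sklearn might fit into a chatbot, the example below trains a tiny intent classifier; the toy dataset, the choice of TF-IDF features, and the Naive Bayes model are our own illustrative assumptions, not something the slides prescribe.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative "intent" dataset; a real bot would need far more examples.
texts = ["book a flight", "flight status please", "check my balance",
         "transfer money to savings", "talk to support", "I need help with an error"]
intents = ["flight", "flight", "finance", "finance", "support", "support"]

# TF-IDF turns text into numeric features; Naive Bayes classifies them.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, intents)

print(model.predict(["what is my account balance"]))  # likely ['finance']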


NLP (Natural Language Processing):

Natural language processing (NLP) refers to the branch of computer science, and more specifically the branch of artificial intelligence (AI), concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP has existed for more than 50 years and has roots in the field of linguistics. It has a variety of real-world applications in a number of fields, including medical research, search engines, and business intelligence.

NLP combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to 'understand' its full meaning, complete with the speaker's or writer's intent and sentiment.

NLP enables computers to understand natural language as humans do. Whether the language is spoken or written, natural language processing uses artificial intelligence to take real-world input, process it, and make sense of it in a way a computer can understand. Just as humans have different sensors (such as ears to hear and eyes to see), computers have programs to read text and microphones to collect audio. And just as humans have a brain to process that input, computers have programs to process their respective inputs. At some point in processing, the input is converted into code that the computer can understand.
There are two main phases to natural language processing:
1. Data Pre-Processing and 2. Algorithm Development.

Data pre-processing involves preparing and "cleaning" text data so that machines can analyze it. Pre-processing puts data into a workable form and highlights features in the text that an algorithm can work with. There are several ways this can be done, including:
Tokenization:

 Tokens are the building blocks of Natural Language.

 Tokenization is a common task in Natural Language Processing (NLP). It is a fundamental step both in traditional NLP methods like the Count Vectorizer and in advanced deep learning-based architectures like Transformers.

 Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or sub-words. Hence, tokenization can be broadly classified into 3 types: word, character, and sub-word (n-gram characters) tokenization.
 As tokens are the building blocks of Natural Language, the most common way of processing raw text happens at the token level.

 Tokenization is the foremost step when modeling text data. Tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary. A vocabulary is the set of unique tokens in the corpus.

 Remember that a vocabulary can be constructed by considering each unique token in the corpus or by considering only the top K most frequently occurring words. A short tokenization sketch follows this list.
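
A minimal sketch of sentence and word tokenization with NLTK; the sample text is our own, and the download call assumes the standard "punkt" tokenizer models (newer NLTK releases may also ask for "punkt_tab").

import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, fetched once

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Chatbots are fun to build. NLTK makes tokenization easy!"

print(sent_tokenize(text))
# ['Chatbots are fun to build.', 'NLTK makes tokenization easy!']

print(word_tokenize(text))
# ['Chatbots', 'are', 'fun', 'to', 'build', '.', 'NLTK', 'makes', 'tokenization', 'easy', '!']

# The vocabulary is the set of unique tokens in the corpus.
vocabulary = set(word_tokenize(text.lower()))
print(sorted(vocabulary))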
Stop Word Removal:

 This is when common words are removed from text so that the unique words offering the most information about the text remain.

 Stop word removal is one of the most commonly used preprocessing steps across different NLP applications. The idea is simply to remove the words that occur commonly across all the documents in the corpus. Typically, articles and pronouns are classified as stop words. These words have no significance in some NLP tasks such as information retrieval and classification, because they are not very discriminative.

 On the contrary, in some NLP applications stop word removal has very little impact. Most of the time, the stop word list for a given language is a well hand-curated list of the words that occur most commonly across corpora. Removing stop words therefore helps build a cleaner dataset with better features for a machine learning model. The sketch after this list shows the step in NLTK.
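
A brief sketch of stop word removal using NLTK's built-in English stop word list; the sample sentence is illustrative.

import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is a simple example showing the removal of stop words"
stop_words = set(stopwords.words("english"))

# Keep only tokens that are not in the stop word list.
tokens = word_tokenize(text.lower())
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['simple', 'example', 'showing', 'removal', 'stop', 'words']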
Lemmatization and Stemming:

 Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing.

 Stemming and lemmatization are themselves forms of NLP and are widely used in text mining. Text mining is the process of analyzing texts written in natural language and extracting high-quality information from them. It involves looking for interesting patterns in the text, or extracting data from the text to be inserted into a database. Text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modelling (i.e., learning relations between named entities). The sketch after this list contrasts the two techniques in NLTK.
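
A small sketch contrasting NLTK's Porter stemmer with its WordNet lemmatizer; the word list is our own illustration.

import nltk
nltk.download("wordnet", quiet=True)  # lexicon used by the lemmatizer
nltk.download("omw-1.4", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better", "wolves"]

# Stemming chops suffixes heuristically; the result may not be a real word.
print([stemmer.stem(w) for w in words])          # ['run', 'studi', 'better', 'wolv']

# Lemmatization maps words to dictionary forms, optionally using a POS hint.
print([lemmatizer.lemmatize(w) for w in words])  # ['running', 'study', 'better', 'wolf']
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'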
Part-of-Speech Tagging:

This is when words are marked based on the part of speech they represent, such as nouns, verbs, and adjectives. Part-of-speech tags are properties of words that define their main context, function, and usage in a sentence. Some of the commonly used part-of-speech tags are listed below, followed by a short tagging sketch:

i. Nouns: define any object or entity.

ii. Verbs: define some action.

iii. Adjectives and Adverbs: act as modifiers, quantifiers, or intensifiers in a sentence.
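
A quick sketch of POS tagging with NLTK's default tagger; the sentence is illustrative, and the exact tags can vary slightly between NLTK versions.

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # default POS tagger

from nltk import pos_tag, word_tokenize

sentence = "The quick brown fox jumps over the lazy dog"
print(pos_tag(word_tokenize(sentence)))
# Output resembles:
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]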
NLTK Library
• The Python programming language provides a wide range of tools and libraries for attacking specific NLP tasks. Many of these are found in the Natural Language Toolkit, or NLTK, an open source collection of libraries, programs, and educational resources for building NLP programs.

• NLTK includes libraries for many of the NLP tasks listed above, plus libraries for subtasks such as sentence parsing, word segmentation, stemming and lemmatization (methods of trimming words down to their roots), and tokenization (for breaking phrases, sentences, paragraphs, and passages into tokens that help the computer better understand the text). It also includes libraries for implementing capabilities such as semantic reasoning, the ability to reach logical conclusions based on facts extracted from text. A combined pre-processing sketch follows below.
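
Putting the earlier steps together, a minimal NLTK pre-processing pipeline for a chatbot might look like the sketch below; the helper name preprocess and the sample sentence are our own assumptions.

import nltk
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(text):
    """Tokenize, drop stop words and punctuation, then lemmatize."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

print(preprocess("The bots were answering questions about flights quickly!"))
# ['bot', 'answering', 'question', 'flight', 'quickly']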
NLP Use Cases :
 Spam Detection

 Machine Translation

 Virtual Agents and Chat-Bots

 Social Media and Sentiment Analysis

 Text Summarization

 Text Classification

 Text Extraction
References:
 Analytics Vidhya

 Medium.com

 IBM official NLP documentation

 KDnuggets

 Wikipedia

 Udemy
Thank You !!
