
Advanced Python for NLP

CONTENTS:

• NLP
• NLTK
• NLP Pre-processing
WHAT IS NATURAL LANGUAGE PROCESSING (NLP)?
• Natural Language Processing is an interdisciplinary field of Artificial Intelligence.
• It is a set of techniques for teaching computers to understand and interpret human languages, much as we do.
• It is the art of extracting information and hidden insights from unstructured text.
• It is a sophisticated field that enables computers to process text data at large scale.
• The ultimate goal of NLP is to make computers and computer-controlled bots understand and interpret human languages, just as we do.
COMPONENTS OF NLP

• Natural Language Understanding
• Natural Language Generation

Figure: Components of NLP
NATURAL LANGUAGE UNDERSTANDING:
• NLU helps the machine understand and analyze human language by extracting information from large volumes of text, such as keywords, emotions, relations, and semantics.

Let's see what challenges a machine faces:

He is looking for a match.
• What do you understand by the word 'match'?
• This is Lexical Ambiguity. It occurs when a word has multiple meanings. Lexical ambiguity can be resolved using part-of-speech (POS) tagging techniques.

The fish is ready to eat.
• What do you understand by the above example?
• This is Syntactic Ambiguity, also called Grammatical Ambiguity, which arises when a sequence of words admits more than one meaning.
NATURAL LANGUAGE GENERATION:

• It is the process of producing meaningful phrases and sentences in the form of natural language.
• It consists of:
• Text planning − retrieving the relevant content from the domain.
• Sentence planning − selecting important words, meaningful phrases, and sentences.
APPLICATIONS OF NLP

• Sentiment Analysis
• Chatbots
• Virtual Assistants
• Speech Recognition
• Machine Translation
• Advertisement Matching
• Information Extraction
• Grammatical Error Detection
• Fake News Detection
• Text Summarization
LIBRARIES FOR NLP

Here are some of the libraries for leveraging the power of Natural Language Processing:
• Natural Language Toolkit (NLTK)
• spaCy
• Gensim
• Stanford CoreNLP
• TextBlob
WHAT IS NATURAL LANGUAGE TOOLKIT?
• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.
• NLTK is a leading platform for building Python programs to work with human language data.

Installing NLTK:

pip install nltk
WHAT IS spaCy?

spaCy is a fast, production-oriented NLP library for Python. Its core features include:
• Tokenization
• POS Tagging
• NER
• Lemmatization
• Sentence Boundary Detection
• Text Classification
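
A minimal sketch of a typical spaCy pipeline (this assumes the small English model has been installed with python -m spacy download en_core_web_sm; the sample sentence is made up):

import spacy

nlp = spacy.load("en_core_web_sm")  # load the small English pipeline
doc = nlp("Apple is looking at buying a U.K. startup.")

for token in doc:  # tokenization, POS tagging, lemmatization in one pass
    print(token.text, token.pos_, token.lemma_)

for ent in doc.ents:  # named entity recognition
    print(ent.text, ent.label_)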
WHAT IS Gensim?

Gensim is a Python library for unsupervised topic modeling and document similarity. Its key features include:
• Topic Modeling
• Word2Vec
• Document Similarity
• Efficient Data Handling

DATA PREPROCESSING USING NLTK
• Data preprocessing is the process of cleaning unstructured text data so that it can be used to predict, analyze, and extract information. Real-world text data is unstructured and inconsistent, so data preprocessing is a necessary step.
• The various data preprocessing methods are:
• Tokenization
• Frequency Distribution of Words
• Filtering Stop Words
• Stemming
• Lemmatization
• Parts of Speech (POS) Tagging
• Named Entity Recognition
• WordNet
• These are some of the methods to process text data in NLP. The list is not exhaustive, but it serves as a great starting point for anyone who wants to get started with NLP.
TOKENIZING

• The process of breaking down text data into individual tokens (words, sentences, or characters) is known as Tokenization. It is the first step in Text Analytics.
• It's your first step in turning unstructured data into structured data, which is easier to analyze.
• Tokenizing can be done in two ways:
• Tokenizing by word
• Tokenizing by sentence
• Importing the tokenizers from NLTK:

from nltk.tokenize import sent_tokenize, word_tokenize
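
A minimal sketch of both tokenizers (the sample text is made up; depending on your NLTK version, the required resource may be 'punkt' or 'punkt_tab'):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

text = "NLP is fun. It turns unstructured text into structured data."
print(sent_tokenize(text))  # ['NLP is fun.', 'It turns unstructured text into structured data.']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'It', 'turns', ...]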
STOPWORDS

• Stop words are filtered out because they are repetitive and don't hold much information. For example, words like {that, these, below, is, are, etc.} don't provide any information, so they need to be removed from the text. Stop words are considered noise. NLTK provides a huge list of stop words.
• Very common words like 'in', 'is', and 'an' are often used as stop words since they don't add a lot of meaning to a text in and of themselves.
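
A minimal sketch of stop-word filtering with NLTK's built-in English list (the sample sentence is made up):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example showing how stop words are filtered out."
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # the common function words ('This', 'is', 'an', ...) are gone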
STEMMING

• Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words "helping" and "helper" share the root "help."
• Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it's being used.
• NLTK provides:
• Porter Stemmer
• Snowball Stemmer
• Understemming and overstemming are two ways stemming can go wrong:
• Understemming happens when two related words should be reduced to the same stem but aren't. This is a false negative.
• Overstemming happens when two unrelated words are reduced to the same stem even though they shouldn't be. This is a false positive.
Porter Stemmer
• It is one of the earliest and most widely used stemming algorithms. It applies a series of rules to strip suffixes from words, reducing them to their root form.

Snowball Stemmer
• Snowball Stemmer, also known as the Porter2 Stemmer, is an improvement over the original Porter Stemmer. It was developed by Martin Porter to address some of the limitations of the original algorithm.
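
A minimal sketch comparing the two stemmers (the word list is chosen for illustration; the two algorithms can produce slightly different stems for some words):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # a.k.a. Porter2

for word in ['helping', 'helper', 'running', 'generously']:
    print(word, '->', porter.stem(word), '/', snowball.stem(word))
# e.g. 'helping' -> 'help' and 'running' -> 'run' under both stemmers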
POS TAGGING
A summary that you can use to get started with NLTK's POS tags:

Tags that start with    Deals with
JJ                      Adjectives
NN                      Nouns
RB                      Adverbs
PRP                     Pronouns
VB                      Verbs
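
A minimal sketch of POS tagging with NLTK (depending on your NLTK version, the tagger resource may be named 'averaged_perceptron_tagger' or 'averaged_perceptron_tagger_eng'):

import nltk
from nltk import pos_tag, word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# a list of (word, tag) pairs, e.g. ('quick', 'JJ'), ('fox', 'NN')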
LEMMATIZING

• Like stemming, lemmatization is also used to reduce words to their root word. Lemmatizing returns a complete, meaningful word (a lemma). It uses vocabulary and morphological analysis to transform a word into its root word.
• For example:
• "engineers" is lemmatized to "engineer"
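
A minimal sketch with NLTK's WordNet lemmatizer (note that the pos argument matters: the default part of speech is noun):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('engineers'))        # 'engineer' (default pos is noun)
print(lemmatizer.lemmatize('caring', pos='v'))  # 'care' (treated as a verb)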
STEMMING V/S LEMMATIZATION

Stemming:
• A process that stems or removes the last few characters from a word, often leading to incorrect meanings and spellings.
• For instance, stemming the word 'Caring' would return 'Car'.
• Used for large datasets where performance is an issue.

Lemmatization:
• Considers the context and converts the word to its meaningful base form, which is called the Lemma.
• For instance, lemmatizing the word 'Caring' would return 'Care'.
• Computationally expensive since it involves look-up tables and morphological analysis.
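
A side-by-side sketch of the 'Caring' example; NLTK's aggressive Lancaster stemmer is used here because it exhibits the 'Car' behaviour described above (the gentler Porter stemmer may keep the trailing 'e'):

import nltk
from nltk.stem import LancasterStemmer, WordNetLemmatizer

nltk.download('wordnet')

print(LancasterStemmer().stem('caring'))             # 'car'  - a stem, not a real word
print(WordNetLemmatizer().lemmatize('caring', 'v'))  # 'care' - a meaningful lemma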
CHUNKING

• While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.
• Note: A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.
• Here are some examples:
• "A planet"
• "A tilting planet"
• "A swiftly tilting planet"
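
A minimal sketch of noun-phrase chunking with NLTK's RegexpParser (the grammar below is one simple illustrative pattern, not the only way to define a noun phrase):

import nltk
from nltk import pos_tag, word_tokenize, RegexpParser

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# NP = optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)

tagged = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
tree = parser.parse(tagged)
print(tree)  # noun phrases appear as (NP ...) subtrees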
CHINKING

• Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.
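
A minimal sketch: chunk everything first, then chink (exclude) verbs and prepositions; }...{ is NLTK's chink syntax:

import nltk
from nltk import pos_tag, word_tokenize, RegexpParser

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

grammar = r"""
Chunk: {<.*>+}        # include: start by chunking every tagged token
       }<VB.*|IN>{    # exclude (chink): verbs and prepositions
"""
parser = RegexpParser(grammar)

tagged = pos_tag(word_tokenize("The cat sat on the mat"))
print(parser.parse(tagged))  # the verb and preposition are chinked out of the chunks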
USING NAMED ENTITY RECOGNITION (NER)
• Named entities are noun phrases that refer to specific locations, people, organizations, and so on. With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity they are.
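
A minimal sketch with NLTK's ne_chunk (the sample sentence is made up; depending on your NLTK version, the chunker resource may be 'maxent_ne_chunker' or 'maxent_ne_chunker_tab'):

import nltk
from nltk import pos_tag, word_tokenize, ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Barack Obama was born in Hawaii and worked in Washington."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)  # entities appear as subtrees labelled PERSON, GPE, ORGANIZATION, ...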
Word Embedding
• Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. These vectors capture the semantic meanings of words such that words with similar meanings have similar vector representations.
• Word embeddings are crucial for various natural language processing (NLP) tasks because they help in understanding the context and relationships between words in a more meaningful way than traditional bag-of-words or one-hot encoding methods.
Word Embedding
• Vector Space Representation: Words are represented as dense vectors of fixed size. Each word is mapped to a point in a continuous vector space.
• Semantic Similarity: Words with similar meanings are located close to each other in the vector space. For example, the vectors for "king" and "queen" might be close to each other.
• Dimensionality Reduction: Word embeddings reduce the dimensionality of word representations, making them more computationally efficient while preserving semantic information.
Word Embedding
Popular word embedding models include:
• Word2Vec
• GloVe (Global Vectors for Word Representation)
• FastText
• BERT (Bidirectional Encoder Representations from Transformers)
Word2Vec Model
• It is a popular technique used in natural language processing (NLP) to transform words into numerical vectors of fixed dimensionality.
• It is based on the idea that words occurring in similar contexts tend to have similar meanings.
• Word2Vec represents each word as a vector such that words with similar contexts have similar vectors.
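
A minimal sketch of training Word2Vec with Gensim on a made-up toy corpus (real models need far more data; the hyperparameter values here are only illustrative):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# vector_size: embedding dimensionality; window: context size; min_count: ignore rare words
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

print(model.wv["cat"].shape)         # (50,) - one dense vector per word
print(model.wv.most_similar("cat"))  # nearest neighbours in the toy vector space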
How Word2Vec Works
• There are two main approaches to generating word vectors
using Word2Vec:

• Continuous Bag of Words (CBOW):

• Skip-gram:
How Word2Vec Works
• Continuous Bag of Words (CBOW):

• The CBOW model predicts the target word given the context
words.

• It takes the context of each word (a few words before and after
the target word) and tries to predict the target word based on
these context words.
How Word2Vec Works
• Skip-gram:

• The Skip-gram model does the reverse of CBOW. It predicts the context words given a target word.

• For each word in the text, the model uses that word to predict the words within a certain range before and after it.
How Word2Vec Works: Example
• Consider a simple sentence: "The cat sat on the mat."

• In CBOW, for the target word "sat," the context words are
["The", "cat", "on", "the", "mat"]. The model uses these context
words to predict "sat."

• In Skip-gram, for the target word "sat," the model will use "sat"
to predict each of the context words ["The", "cat", "on", "the",
"mat"].
