U1 NLP Complete
School of Computing
2. Make students understand the concepts of morphology, syntax, semantics, and pragmatics of language, and enable them to give appropriate examples illustrating each concept.
3. Teach them to recognize the significance of pragmatics for natural language understanding.
4. Enable students to describe applications based on natural language processing and to identify the points of syntactic, semantic, and pragmatic processing.
5. Help them understand natural language processing and learn how to apply basic algorithms in this field.
CO2: Analyze approaches to syntax and semantic parsing with pronoun resolution.
CO3: Implement semantic roles, relations, and frames, including coreference resolution.
CO5: Apply the knowledge of the various levels of analysis involved in NLP and implement them.
• According to industry estimates, only about 21% of the available data is in structured form.
• Data is constantly being generated as we send messages on WhatsApp, Facebook, and other social media.
• The majority of this data exists in textual format, which is a highly unstructured form.
• To produce significant and actionable insights from this data, it is important to get acquainted with the techniques of text analysis and natural language processing.
Introduction to NLP
Need for NLP
1. User-Friendly Interfaces: NLP allows for intuitive and user-friendly interfaces using natural language,
reducing the need for complex programming syntax.
2. Accessibility and Inclusivity: NLP makes technology accessible to a wider audience, including those with
limited technical expertise or disabilities.
3. Conversational Systems: NLP enables the development of conversational agents, enhancing user interaction
and system efficiency.
4. Data Extraction and Analysis: NLP extracts insights from unstructured text data, enabling sentiment
analysis, information retrieval, and text summarization.
5. Voice-based Interaction: NLP powers voice assistants and speech recognition systems for hands-free and
natural interaction.
6. Human-Machine Collaboration: NLP enables seamless communication and collaboration between humans
and machines.
7. Natural Language Understanding: NLP allows machines to comprehend context, semantics, and intent,
enabling advanced applications and personalized experiences.
Applications of NLP
1. Gmail - when you type a sentence in Gmail, you will notice that it tries to auto-complete it. This auto-completion is done using NLP.
2. Spam filters - if email had no spam filters, your inbox would be flooded with unwanted mail. Using NLP, we can detect spam by its keywords and filter it out of your inbox.
3. Language translation - translating a sentence from one language to another.
4. Customer service chatbots - for example, with a bank's service chatbot, you type in a message and often there is no human on the other end. The chatbot can interpret your language, derive the intent from it, and respond to your question on its own; when it does not work well, it connects you to a human agent.
5. Voice assistants such as Amazon Alexa and Google Assistant.
6. Google Search - the BERT language model helps interpret search queries and return relevant results.
Advantages of NLP
• NLP helps users ask questions about any subject and get a direct response within seconds.
• NLP offers exact answers to a question; it does not offer unnecessary or unwanted information.
• NLP helps computers communicate with humans in their own languages.
• It is very time efficient.
• Most companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.
Phases of NLP:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Discourse
5. Pragmatics

Syntax Analysis
• Given the possible POS tags generated in the previous step, a syntax analyser checks whether the sentence conforms to the grammar of the language.
Pragmatics
• The final stage of NLP, Pragmatics interprets the given text using
information from the previous steps. Given a sentence, “Turn off the
lights” is an order or request to switch off the lights.
Stemming
• A stemming algorithm works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
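As a minimal sketch (assuming NLTK is installed; the words are chosen only for illustration), a Porter stemmer can be applied like this:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Stemming simply strips common affixes, so the result may not be a real word.
    for word in ["running", "flies", "easily", "connection"]:
        print(word, "->", stemmer.stem(word))
    # e.g. "connection" -> "connect", but "easily" -> "easili"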
Lemmatization
• Lemmatization is somewhat similar to stemming, as it maps several words onto one common root.
• For example, a lemmatizer should map gone, going, and went to go.
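A similar sketch with NLTK's WordNetLemmatizer (assuming the WordNet corpus has been downloaded) reproduces the mapping above:

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # one-time download of the WordNet corpus
    lemmatizer = WordNetLemmatizer()
    # pos="v" tells the lemmatizer to treat each word as a verb.
    for word in ["gone", "going", "went"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # All three map to "go".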
POS tags
• The grammatical type of a word is referred to as its POS tag or part of speech.
• The parts of speech are nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, and interjections.
Named Entity Recognition
• Named entity recognition is the process of detecting named entities such as person names, company names, and locations; that is, phrase identification.
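As an illustrative sketch, not part of the slides' original toolchain: spaCy's small English model (en_core_web_sm, assuming it is installed) can detect such entities:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Sundar Pichai works at Google in California.")
    for ent in doc.ents:
        # Each entity carries its text span and a label such as PERSON, ORG, or GPE.
        print(ent.text, ent.label_)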
• A free morpheme is a single meaningful unit of a word that can stand alone in the language. For example: cat, mat, trust, slow.
• A bound morpheme cannot stand alone; it has no real meaning on its own. For example, the -ed in walked cannot stand alone, and the un- in unpleasant is not a standalone morpheme. Bound morphemes typically form prefixes and suffixes.
• Bound morphemes can be grouped into a further two categories:
1. Derivational 2. Inflectional
Derivational
• Look at the word able and let it become ability: the adjective becomes a noun.
• The verb send becomes the noun sender with the addition of -er.
• Changing stable to unstable gives the word the opposite meaning.
• In other words, the meaning of the word is completely changed by adding a derivational morpheme to a base word.
Inflectional
• Inflectional morphemes are additions to the base word that do not change the word itself, but rather serve as grammatical indicators. They show grammatical information. For example:
1. Laugh becomes the past tense by adding -ed, changing the word to laughed.
All these examples show how morphology participates in the study of linguistics.
• Tagging is a disambiguation task; words are ambiguous, having more than one possible part of speech, and the goal is to find the correct tag for the situation.
• The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
Introduction to POS Tagging
• Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each
word in a text is labeled with its corresponding part of speech.
• This can include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging is useful for a variety of NLP tasks, such as information extraction, named
entity recognition, and machine translation.
• It can also be used to identify the grammatical structure of a sentence and to disambiguate
words that have multiple meanings.
• For example:
• Text: "The cat sat on the mat."
• POS tags: The: determiner, cat: noun, sat: verb, on: preposition, the: determiner, mat: noun
What is Part-of-speech (POS) tagging ?
• It is the process of converting a sentence into two forms: a list of words, and a list of tuples (where each tuple has the form (word, tag)).
• The tag in this case is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech     Tag
Noun (Singular)    NN
Verb               VB
Determiner         DT
Adjective          JJ
Adverb             RB
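A minimal sketch with NLTK (assuming its tokenizer and tagger resources are downloaded; resource names vary slightly across NLTK versions) that yields exactly this list-of-tuples form:

    import nltk

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    # Tokenize the sentence, then tag each word with its part of speech.
    words = nltk.word_tokenize("The cat sat on the mat.")
    print(nltk.pos_tag(words))
    # e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ...]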
• Collect a dataset of annotated text: This dataset will be used to train and test the POS tagger. The text should be annotated with the correct POS tags for each word.
• Preprocess the text: This may include tasks such as tokenization (splitting the text into individual words), lowercasing, and removing punctuation.
• Divide the dataset into training and testing sets: The training set will be used to train the POS tagger, and the testing set will be used to evaluate its performance.
• Train the POS tagger: This may involve building a statistical model, such as a hidden Markov model (HMM), or defining a set of rules for a rule-based or transformation-based tagger. The model or rules will be trained on the annotated text in the training set.
• Test the POS tagger: Use the trained model or rules to predict the POS tags of the words in the testing set. Compare the predicted tags to the true tags and calculate metrics such as precision and recall to evaluate the performance of the tagger.
• Fine-tune the POS tagger: If the performance of the tagger is not satisfactory, adjust the model or rules and repeat the training and testing process until the desired level of accuracy is achieved.
• Use the POS tagger: Once the tagger is trained and tested, it can be used to perform POS tagging on new, unseen text.
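A rough sketch of this train/test workflow (assuming NLTK's treebank sample is available; a simple unigram tagger stands in here for the statistical model mentioned above):

    import nltk
    from nltk.tag import UnigramTagger

    nltk.download("treebank")
    sents = nltk.corpus.treebank.tagged_sents()

    # Divide the annotated sentences into training and testing sets.
    split = int(0.8 * len(sents))
    train, test = sents[:split], sents[split:]

    tagger = UnigramTagger(train)   # "train" the tagger on annotated text
    # Proportion of correctly tagged words on held-out data
    # (use tagger.evaluate(test) on older NLTK versions).
    print(tagger.accuracy(test))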
Applications of POS Tagging
• Information extraction:
• POS tagging can be used to identify specific types of information in a text, such as names,
locations, and organizations.
• This is useful for tasks such as extracting data from news articles or building knowledge bases for artificial intelligence systems.
• Named entity recognition:
• POS tagging can be used to identify and classify named entities in a text, such as people, places,
and organizations.
• This is useful for tasks such as building customer profiles or identifying key figures in a news story.
• Text classification:
• POS tagging can be used to help classify texts into different categories, such as spam emails or
sentiment analysis.
• By analyzing the POS tags of the words in a text, algorithms can better understand the content
and tone of the text.
Rule-based POS Tagging
1. Define a set of rules for assigning POS tags to words. For example:
• If the word ends in “-tion,” assign the tag “noun.”
• If the word ends in “-ment,” assign the tag “noun.”
• If the word is all uppercase, assign the tag “proper noun.”
• If the word is a verb ending in “-ing,” assign the tag “verb.”
2. Iterate through the words in the text and apply the rules to each word in turn. For example:
• “Nation” would be tagged as “noun” based on the first rule.
• “Investment” would be tagged as “noun” based on the second rule.
• “UNITED” would be tagged as “proper noun” based on the third rule.
• “Running” would be tagged as “verb” based on the fourth rule.
3. Output the POS tags for each word in the text.
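A toy sketch of these rules in plain Python (the function name and rule ordering are choices of this sketch; the uppercase check runs first so that "UNITED" is not caught by another rule):

    def rule_based_tag(word: str) -> str:
        # Rule 3: all-uppercase words are tagged as proper nouns.
        if word.isupper():
            return "proper noun"
        # Rules 1 and 2: "-tion" and "-ment" endings indicate nouns.
        if word.lower().endswith(("tion", "ment")):
            return "noun"
        # Rule 4: "-ing" endings are tagged as verbs.
        if word.lower().endswith("ing"):
            return "verb"
        return "unknown"

    for w in ["Nation", "Investment", "UNITED", "Running"]:
        print(w, "->", rule_based_tag(w))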
• This is in contrast to rule-based POS tagging, which assigns tags to words based
on pre-defined rules, and to statistical POS tagging, which relies on a trained
model to predict tags based on probability.
Challenges in POS Tagging
• Ambiguity: Some words can have multiple POS tags depending on the context in which they appear, making it difficult to determine the correct tag. For example, the word "bass" can be a noun (a type of fish) or an adjective (having a low frequency or pitch).
• Out-of-vocabulary (OOV) words: Words that are not present in the training data of a POS tagger can be
difficult to tag accurately, especially if they are rare or specific to a particular domain.
• Complex grammatical structures: Languages with complex grammatical structures, such as languages
with many inflections or free word order, can be more challenging to tag accurately.
• Lack of annotated training data: Some languages or domains may have limited annotated training data,
making it difficult to train a high-performing POS tagger.
• Inconsistencies in annotated data: Annotated data can sometimes contain errors or inconsistencies, which
can negatively impact the performance of a POS tagger.
Language Modeling
• Language modeling is a fundamental task in Natural Language Processing (NLP) that involves building a
statistical model to predict the probability distribution of words in a given language.
• The language model learns the patterns and relationships between words in a corpus of text.
• It can be used to generate new text, evaluate the likelihood of a sentence, and perform speech recognition, machine translation, spam filtering, etc.
• In language modeling, the primary goal is to estimate the probability of a sequence of words (a sentence or a
phrase) using the conditional probability of each word given its preceding context.
• The model learns from large amounts of text data to predict the likelihood of a particular word given the
previous words in a sentence.
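In standard notation, the goal described above is the chain rule of probability, which an N-gram model approximates by truncating the conditioning context:

    P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
                             \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})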
There are different types of language models, but two prominent approaches are
• N-gram Language Models
• Neural Language Models
N-gram Language Models: N-gram language models are simple and widely used in early NLP
tasks. An N-gram model predicts the probability of a word based on the previous (N-1) words in
a sentence.
For example, a trigram model (3-gram) predicts the probability of a word given the two
preceding words. The model estimates the probabilities based on the frequency of word
sequences observed in the training data.
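A minimal sketch in plain Python (the toy corpus is invented for illustration) of estimating a trigram probability from counts:

    from collections import Counter

    corpus = "natural language processing makes natural language useful".split()

    # Count trigrams and their bigram prefixes.
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def trigram_prob(w1, w2, w3):
        # P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    print(trigram_prob("natural", "language", "processing"))  # 0.5 on this corpus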
Neural Language Models: Neural language models, such as recurrent neural networks (RNNs)
and transformer-based models, have gained significant popularity in recent years due to their
ability to capture long-range dependencies and contextual information
These models learn complex patterns in the language and can generate more coherent and
contextually relevant text.
• Recurrent Neural Networks (RNNs): RNNs are a class of neural networks designed for
sequential data processing.
• They process input sequences step by step, maintaining a hidden state that captures information
from previous steps.
• This hidden state acts as the context for the current word prediction.
• However, RNNs have challenges with capturing long-range dependencies and can suffer from
vanishing or exploding gradients.
• Transformer-based Models: Transformer-based models, like the famous BERT (Bidirectional
Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) series,
have revolutionized language modeling.
N-grams
• In technical terms, N-grams can be defined as neighboring sequences of items in a document.
• They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
Example word probability distributions (each column sums to 1):

Word         Dist. 1   Dist. 2   Dist. 3
The          0.4       0.05      0.1
Processing   0.3       0.3       0.7
Natural      0.12      0.15      0.1
Language     0.18      0.5       0.1
MWETokenizer
• The multi-word expression tokenizer is a rule-based, “add-on”
tokenizer offered by NLTK.
• Once the text has been tokenized by a tokenizer of choice,
some tokens can be re-grouped into multi-word expressions.
• For example, the name Martha Jones is combined into a single
token instead of being broken into two tokens.
• This tokenizer is very flexible since it is agnostic of the base
tokenizer that was used to generate the tokens.
A MWETokenizer takes a string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs.
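A short sketch with NLTK's MWETokenizer (the name Martha Jones follows the example above; separator=" " is chosen for readability, while NLTK's default separator is an underscore):

    from nltk.tokenize import MWETokenizer

    # Register the multi-word expression to merge after base tokenization.
    tokenizer = MWETokenizer([("Martha", "Jones")], separator=" ")
    tokens = "Martha Jones went to the party".split()
    print(tokenizer.tokenize(tokens))
    # ['Martha Jones', 'went', 'to', 'the', 'party']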
STUDENT EVALUATION
1. N-grams are defined as the combination of N keywords together. How many bi-grams can be generated from the given sentence: "The Father of our nation is Mahatma Gandhiji"?
A. 8   B. 9
C. 7   D. 4
2. It is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.
A. Statistical Language Modelling   B. Probabilistic Language Modelling
C. Neural Language Modelling   D. Natural Language Understanding
3. It is a measure of how well a probability distribution predicts a sample.
A. Entropy   B. Perplexity
C. Cross-Entropy   D. Information Gain
4. What are the Python libraries used in NLP?
A. Pandas   B. NLTK
C. Spacy   D. All of the above
• Be aware of collocations, and try to recognize them when you see or hear them.
• Treat collocations as single blocks of language. Think of them as individual blocks or
chunks, and learn strongly support, not strongly + support.
• When you learn a new word, write down other words that collocate with it (remember
rightly, remember distinctly, remember vaguely, remember vividly).
• Read as much as possible. Reading is an excellent way to learn vocabulary and
collocations in context and naturally.
• Revise what you learn regularly. Practise using new collocations in context as soon as
possible after learning them.
• Learn collocations in groups that work for you. You could learn them by topic (time,
number, weather, money, family) or by a particular word (take action, take a chance,
take an exam).
• You can find information on collocations in any good learner's dictionary. And you can
also find specialized dictionaries of collocations.
Collocations
Suppose we have a large dataset of movie reviews, and we want to find collocations that frequently appear
together in positive reviews. We are particularly interested in identifying collocations related to the theme of
"amazing special effects" in movies.
Step 1: Preprocess the Data. First, we preprocess the movie reviews by tokenizing them into words and removing any stop words, punctuation, and numbers.
Step 2: Calculate Association Measures. Next, we calculate the association measures for different word pairs. Let's say we want to consider the Dice coefficient as our association measure.
• For each word pair (A, B), we calculate the Dice coefficient as follows:
• Dice(A, B) = 2 × (number of times A and B co-occur) / (number of times A occurs + number of times B occurs)
• Step 3: Identify Significant Collocations. Now, we look for collocations with high Dice coefficients, indicating strong associations. Let's say we find the following collocations with their corresponding Dice coefficients:
1. "amazing special" - Dice coefficient: 0.85
2. "special effects" - Dice coefficient: 0.80
3. "stunning visuals" - Dice coefficient: 0.75
4. "spectacular CGI" - Dice coefficient: 0.70
5. "mind-blowing
06/02/2025
action" - Dice coefficient: 0.65 NATURAL LANGUAGE PROCESSING
Step 4: Interpretation. Based on the Dice coefficients, we can see that the word pairs "amazing special" and "special effects" have the highest associations in positive movie reviews. This suggests that reviewers often mention "amazing special" and "special effects" together when praising movies with exceptional visual effects.
• In this real-world example, association measures like the Dice coefficient helped us identify
significant collocations related to "amazing special effects" in movie reviews.
• These collocations can be useful for sentiment analysis, recommending movies to users who
appreciate stunning visuals, or improving the understanding of what aspects of movies are highly
praised by reviewers.
• Word2Vec: Word2Vec is a popular word embedding technique that learns continuous word
representations from large amounts of text data. It offers two algorithms: Continuous Bag of Words
(CBOW) and Skip-gram. These models generate dense word vectors that capture semantic
similarities between words based on their context.
• GloVe (Global Vectors for Word Representation): GloVe is another widely used method for
learning word embeddings. It combines the global co-occurrence statistics of words in a corpus to
create word vectors. GloVe embeddings capture both semantic and syntactic relationships between
words.
• FastText: FastText is an extension of Word2Vec that represents each word as a bag of character n-grams. It can generate word embeddings for out-of-vocabulary words based on their character-level information, making it useful for handling misspellings and rare words.
• Let's demonstrate a simple example of word embeddings using Word2Vec, one of the popular
techniques for learning word representations. For this example, we will use a small dataset of
movie reviews and create word embeddings using the Word2Vec algorithm.
• Step 1: Preprocess the Data. Suppose we have the following movie reviews:
1. "The movie was fantastic, with amazing special effects."
2. "The plot was engaging and kept me hooked till the end."
3. "The acting was superb, especially by the lead actor."
4. "The film had stunning visuals and great cinematography."
• We need to preprocess the data by tokenizing the sentences and converting the text to lowercase.
• Step 2: Train Word2Vec Model - next, we train a Word2Vec model using the tokenized reviews.
• Step 3: Retrieve Word Embeddings - now, we can access the word embeddings for specific words using the trained Word2Vec model.
• Step 4: Similar Words - we can also find words similar to a given word based on their embeddings.
• Step 5: Word Similarity - additionally, we can measure the similarity between two words.
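A compact sketch of Steps 1-5 using gensim (assuming gensim 4.x; hyperparameters such as vector_size=50 and epochs=200 are illustrative choices, not values from the slides):

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    reviews = [
        "The movie was fantastic, with amazing special effects.",
        "The plot was engaging and kept me hooked till the end.",
        "The acting was superb, especially by the lead actor.",
        "The film had stunning visuals and great cinematography.",
    ]

    # Step 1: tokenize and lowercase each review.
    tokenized = [simple_preprocess(r) for r in reviews]

    # Step 2: train the Word2Vec model on the tiny corpus.
    model = Word2Vec(sentences=tokenized, vector_size=50, window=3,
                     min_count=1, epochs=200)

    # Step 3: retrieve the embedding vector for a word.
    print(model.wv["fantastic"][:5])

    # Step 4: find words similar to a given word.
    print(model.wv.most_similar("movie", topn=3))

    # Step 5: measure the similarity between two words.
    print(model.wv.similarity("fantastic", "amazing"))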
Word2Vec Example
• The resulting word embeddings and similarity scores will depend on the specific
corpus and the number of training iterations, but they should capture the semantic
relationships between words based on their context in the reviews.
• For instance, "fantastic" and "amazing" are likely to have a high similarity score,
as they both frequently appear together in positive contexts in the dataset.
Similarly, "plot" and "visuals" might also have a reasonable similarity score if they
co-occur in sentences discussing movie elements.
References
1) Slides by Dan Jurafsky:
• https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
2) This e-book can be followed:
• https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
3) This channel by Dan Jurafsky and Christopher Manning teaches each topic from the ground up:
• https://www.youtube.com/watch?v=808M7q8QX0E&list=PLaZQkZp6WhWyvdiP49JG-rjyTPck_hvEu
4) https://www.shiksha.com/online-courses/articles/pos-tagging-in-nlp/
5) https://web.stanford.edu/~jurafsky/slp3/
6) https://www.studocu.com/in/document/srm-institute-of-science-and-technology/natural-language-processing/nlp-notes-unit-1/39506511?origin=home-recent-1