
18CSE359T – Natural Language Processing

“Enabling Computers to Understand Natural Language like Humans”

Department of Computational Intelligence

School of Computing

SRM Institute of Science and Technology



Mrs. S. Amudha
Assistant Professor
E-mail: [email protected]
Contact No: 9791994531
Areas: Deep Learning, NLP, Data Analytics, Predictive Analysis, Internet of Things, Data Science, Wireless Sensor Networks
Specialization: Internet of Things
Affiliation: Department of Computational Intelligence, School of Computing, SRM Institute of Science and Technology, KTR
Overall experience: 18 years 8 months
SRM experience: 10 years
Total publications: 2 SCI, 18 Scopus journals

Scopus journal publications related to NLP
1) "Classification of Toxicity in Social Media Comments Using the Binary Relevance – Logistic Regression and BERT Model", ICONDEEPCOM Conference, 2023.
2) "Medicine @ Care System for Smart Hospitals Using NLP", major project work, 2022.
3) "Understanding Short Text Through Lexical Semantic Analysis", published under licence by IOP Publishing Ltd, IOP Conference Series: Materials Science and Engineering, Volume 1130, International Conference on Advances in Renewable and Sustainable Energy Systems (ICARSES 2020), 3rd–5th December, Chennai, India.
4) "Personalized Dynamic User Interfaces", published under licence by IOP Publishing Ltd, IOP Conference Series: Materials Science and Engineering, Volume 1130, International Conference on Advances in Renewable and Sustainable Energy Systems (ICARSES 2020), 3rd–5th December, Chennai, India.


Book Chapter Publications
• Book: Deep Sciences for Computing and Communications
• Chapter No: 27
• Chapter DOI: 10.1007/978-3-031-27622-4_27
• Title: Modelling Air Pollution and Traffic Congestion Problem Through Mobile Application
• To be published in: ICONDEEPCOM 2022, CCIS 1719
• Springer Nature Book Series
• Authors: S. Amudha, J. Shobana, M. Satheeshkumar, P. Chithra


Book Chapter Publications
• Book: Deep Sciences for Computing and Communications
• Chapter No: 28
• Chapter DOI: 10.1007/978-3-031-27622-4_27
• Title: Emotion Recognition of People Based on Facial Expressions in Real-Time Event
• To be published in: ICONDEEPCOM 2022, CCIS 1719
• Springer Nature Book Series
• Authors: Amutha Devi, E. Poongothai, S. Amudha


Course Objectives
1. Teach students the leading trends and systems in natural language processing.
2. Help them understand the concepts of morphology, syntax, semantics and pragmatics of language, and enable them to give appropriate examples illustrating these concepts.
3. Teach them to recognize the significance of pragmatics for natural language understanding.
4. Enable students to describe applications based on natural language processing and to identify the roles of syntactic, semantic and pragmatic processing in them.
5. Help them understand natural language processing and learn how to apply basic algorithms in this field.


Course Outcomes
CO1: Construct approaches to syntax and semantics in NLP.
CO2: Analyze approaches to syntax and semantic parsing with pronoun resolution.
CO3: Implement semantic roles, relations and frames, including coreference resolution.
CO4: Implement summarization, information retrieval and machine translation.
CO5: Apply the knowledge of various levels of analysis involved in NLP and implement different techniques of NLP like word embeddings, CBOW and Skip-gram.


NLP
• Unit I - Introduction to NLP
• Unit II - Syntax parsing
• Unit III - Semantic relations
• Unit IV - Information extraction
• Unit V - Statistical & Probabilistic approaches to NLP tasks



UNIT 1 - Topics
• Introduction to Natural Language Processing
• Steps – Morphology – Syntax – Semantics
• Morphological Analysis (Morphological Parsing)
• Stemming – Lemmatization
• Parts of Speech Tagging
• Approaches to NLP Tasks (Rule-based, Statistical, Machine Learning)
• N-grams
• Multiword Expressions
• Collocations (Association Measures, Coefficients and Context Measures)
• Vector Representation of Words
Introduction to NLP
• The motive of learning a language is to communicate and share information successfully with others.
• Industry estimates suggest that only about 21% of the available data is in structured form.
• Data is being generated constantly as we send messages on WhatsApp, Facebook and other social media.
• The majority of this data exists in textual format, which is highly unstructured.
• In order to produce significant and actionable insights from this data, it is important to get acquainted with the techniques of text analysis and natural language processing.
Introduction to NLP
• Text analytics/mining is the process of deriving meaningful information from natural language text.
• It usually involves structuring the input text, deriving patterns, and evaluating and interpreting the output.
• Natural language processing is a part of computer science and artificial intelligence which deals with human languages.
What is NLP?
• Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables machines to understand human language.
• The main intention of NLP is to build systems that are able to make sense of text and then automatically execute tasks like spell-check, text translation, topic classification, etc.


Need of NLP
• The need to study Natural Language Processing (NLP) arises from:
• the increasing role of language in human-computer interaction;
• the vast amount of unstructured textual data available.
• To enable interactions between computers and humans, computers need to understand the natural languages used by humans.
• Natural language processing is all about making computers learn, process, and manipulate natural languages.
Need of NLP
• Processing text data is an essential task as there is an abundance of text available everywhere.
• Text data can be found in various sources such as books, websites, social media, news articles, research papers, emails, and more.
• However, text data is often unstructured, meaning it lacks a predefined format or organization.
• To harness the valuable information contained within text data, it is necessary to process and analyze it effectively.


Need of NLP
• Text is processed to extract insights, identify patterns, perform sentiment analysis, categorize documents, automate text generation, and enable information retrieval.
• By processing text data, valuable knowledge can be derived, enabling businesses, researchers, and individuals to make informed decisions, gain insights, improve customer experiences, develop intelligent systems, and drive innovation across various industries.
Use of NLP

1. User-Friendly Interfaces: NLP allows for intuitive and user-friendly interfaces using natural language,
reducing the need for complex programming syntax.
2. Accessibility and Inclusivity: NLP makes technology accessible to a wider audience, including those with
limited technical expertise or disabilities.
3. Conversational Systems: NLP enables the development of conversational agents, enhancing user interaction
and system efficiency.
4. Data Extraction and Analysis: NLP extracts insights from unstructured text data, enabling sentiment
analysis, information retrieval, and text summarization.
5. Voice-based Interaction: NLP powers voice assistants and speech recognition systems for hands-free and
natural interaction.
6. Human-Machine Collaboration: NLP enables seamless communication and collaboration between humans
and machines.
7. Natural Language Understanding: NLP allows machines to comprehend context, semantics, and intent,
enabling advanced applications and personalized experiences.



Applications of NLP
• Twitter or Facebook sentiment analysis, which is being used heavily now.
• Customer chat services provided by various companies; the process behind them is driven by NLP.
• Speech recognition, including voice assistants like Google Assistant and Alexa.
• Translating data from one language to another.
• Advertisement matching, i.e., recommendation of ads based on your history.
Real life use cases of NLP
1. Gmail - when you are typing a sentence in Gmail, you will notice that it tries to auto-complete it. Auto-completion is done using NLP.
2. Spam filters - if email had no spam filters, your inbox would be flooded with unwanted mail. Using NLP we can filter such messages and, using keywords, take them out of your inbox.
3. Language translation - translating a sentence from one language to another.
4. Customer service chatbots - e.g., in a bank's service chat you type in a message and often there is no human on the other end. The chatbot interprets your language, derives intent from it, and responds to your question on its own; when it doesn't work well, it connects you to a human being.
5. Voice assistants such as Amazon Alexa and Google Assistant.
6. Google Search - the BERT language model helps return correct results for search queries.


Advantages of NLP
• NLP helps users to ask questions about any subject and get a direct response within seconds.
• NLP offers exact answers to the question, meaning it does not return unnecessary and unwanted information.
• NLP helps computers to communicate with humans in their own languages.
• It is very time efficient.
• Most companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.


Disadvantages of NLP
• NLP may not capture context.
• NLP can be unpredictable.
• NLP may require more keystrokes.
• NLP systems struggle to adapt to new domains; they have limited functionality, which is why an NLP system is typically built for a single, specific task.


Components of NLP
1. Natural Language Understanding (NLU)
• Natural Language Understanding (NLU) helps the machine to understand and analyse human language by extracting metadata from content such as concepts, entities, keywords, emotions, relations, and semantic roles.
• NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
• NLU involves the following tasks:
• Mapping the given input into a useful representation.
• Analyzing different aspects of the language.


Components of NLP
2. Natural Language Generation (NLG)
• Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation.
• It mainly involves text planning, sentence planning, and text realization.


Programming Languages for NLP

The following slides list widely used NLP libraries and APIs.


NLP Libraries
• Scikit-learn: provides a wide range of algorithms for building machine learning models in Python.
• Natural Language Toolkit (NLTK): a complete toolkit for all NLP techniques.
• Pattern: a web mining module for NLP and machine learning.
• TextBlob: provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
• Quepy: used to transform natural language questions into queries in a database query language.
• SpaCy: an open-source NLP library used for data extraction, data analysis, sentiment analysis, and text summarization.
• Gensim: works with large datasets and processes data streams.
List of NLP APIs
• IBM Watson API: combines different sophisticated machine learning techniques to enable developers to classify text into various custom categories.
• Chatbot API: allows you to create intelligent chatbots for any service.
• Speech to Text API: used to convert speech to text.
• Text Analysis API by AYLIEN: used to derive meaning and insights from textual content.
• Cloud NLP API: used to improve the capabilities of an application using natural language processing technology.
Difference between Natural Language and Computer Language



Student Evaluation MCQ
1) What is the field of Natural Language Processing (NLP)?
a) Computer Science  b) Artificial Intelligence  c) Linguistics  d) All of the mentioned
2) What is the main challenge of NLP?
a) Handling Ambiguity of Sentences  b) Handling Tokenization  c) Handling POS-Tagging  d) All of the mentioned
3) What is Machine Translation?
a) Converts one human language to another  b) Converts human language to machine language  c) Converts any human language to English  d) Converts machine language to human language
4) Natural language processing is divided into the two subfields of:
A. symbolic and numeric  B. algorithmic and heuristic  C. time and motion  D. understanding and generation
5) The natural language is also known as:
A. 3rd generation language  B. 4th generation language  C. 5th generation language  D. 6th generation language
Process of NLP
1. Morphological Analysis / Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Discourse
5. Pragmatics


Morphological Analysis / Lexical Analysis
• Morphological or lexical analysis deals with text at the individual word level.
• It looks for morphemes, the smallest units of a word.
• The first phase of NLP is lexical analysis.
• This phase scans the input text as a stream of characters and converts it into meaningful lexemes.
• It divides the whole text into paragraphs, sentences, and words.
• For example, irrationally can be broken into ir (prefix), rational (root) and -ly (suffix).
• Lexical analysis finds the relation between these morphemes and converts the word into its root form.
• A lexical analyzer also assigns the possible part-of-speech (POS) tags to the word.
Syntax Analysis
• Syntax analysis ensures that a given piece of text has correct structure.
• It tries to parse the sentence to check for correct grammar at the sentence level.
• Given the possible POS tags generated in the previous step, a syntax analyzer assigns POS tags based on the sentence structure.


Syntax
• Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.
• Syntax techniques:
1. Lemmatization: reducing the various inflected forms of a word into a single form for easy analysis.
2. Morphological segmentation: dividing words into individual units called morphemes.
3. Word segmentation: dividing a large piece of continuous text into distinct units.
4. Part-of-speech tagging: identifying the part of speech for every word.
5. Parsing: undertaking a grammatical analysis of the provided sentence.
6. Sentence breaking: placing sentence boundaries in a large piece of text.
7. Stemming: cutting inflected words down to their root form.
Semantics
• Semantics refers to the meaning that is conveyed by a text. It involves the interpretation of words and how sentences are structured.
1. Named entity recognition (NER): determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of individuals and names of places.
2. Word sense disambiguation: giving meaning to a word based on the context.
Semantic Analysis
• Consider the sentence: “The apple ate a banana”. Although the sentence is syntactically correct, it doesn’t make sense because apples can’t eat.
• Semantic analysis looks for meaning in the given sentence. It also deals with combining words into phrases.
• For example, “red apple” provides information regarding one object; hence we treat it as a single phrase.
• Similarly, we can group names referring to the same category, person, object or organization. “Robert Hill” refers to the same person and not two separate names – “Robert” and “Hill”.
Discourse
• Discourse deals with the effect of a previous sentence on the sentence in consideration. In the text, “Jack is a bright student. He spends most of the time in the library.”, discourse assigns “he” to refer to “Jack”.

Pragmatics
• The final stage of NLP, pragmatics interprets the given text using information from the previous steps. For example, the sentence “Turn off the lights” is understood as an order or request to switch off the lights.


Morphological Analysis (Morphological Parsing)
• The goal of morphological parsing is to find out what morphemes a given word is built from.
• For example, a morphological parser should be able to tell us that the word cats is the plural form of the noun stem cat, and that the word mice is the plural form of the noun stem mouse.
• So, given the string cats as input, a morphological parser should produce an output that looks similar to cat N PL.


Morphological Analysis (Morphological Parsing)
• Morphological parsing yields information that is useful in many NLP applications. In parsing, e.g., it helps to know the agreement features of words. Similarly, grammar checkers need to know agreement information to detect such mistakes.
• Morphological information also helps spell checkers to decide whether something is a possible word or not, and in information retrieval it is used to search not only for cats, if that is the user's input, but also for cat.


Pipeline of NLP in AI /Steps in NLP
1. Tokenization
2. Stemming
3. Lemmatization
4. POS tags
5. Named Entity Recognition
6. Chunking



Tokenization
• Tokenization processes strings into tokens, which are small structures or units such as words (see the sketch below).
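A minimal sketch of tokenization with NLTK, assuming NLTK and its 'punkt' tokenizer data are installed (the sample sentence is illustrative):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer data, downloaded once
from nltk.tokenize import word_tokenize

text = "Natural language processing enables computers to understand humans."
tokens = word_tokenize(text)  # split the string into word and punctuation tokens
print(tokens)
# ['Natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'humans', '.']
```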


Stemming
• Stemming normalizes words into their base or root form.
• A stemming algorithm works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
• For example, affects, affected and affecting all reduce to the single root word affect, as sketched below.
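A minimal sketch of stemming using NLTK's Porter stemmer (the word list is illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["affects", "affected", "affecting"]:
    # the stemmer strips common suffixes to reach a root form
    print(word, "->", stemmer.stem(word))
# affects -> affect, affected -> affect, affecting -> affect
```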


Lemmatization
• Lemmatization performs a morphological analysis of the word; to do so, it needs a detailed dictionary which the algorithm can look through to link the original word to its root word.
• It is somewhat similar to stemming, as it maps several words onto one common root.
• The output of lemmatization is a proper dictionary word.
• For example, a lemmatizer should map gone, going and went to go, as in the sketch below.
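A minimal sketch of lemmatization with NLTK's WordNet lemmatizer, assuming the 'wordnet' corpus has been downloaded via nltk.download("wordnet"):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the inputs as verbs
for word in ["gone", "going", "went"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# gone -> go, going -> go, went -> go
```

Unlike the stemmer above, the lemmatizer uses a dictionary lookup, so an irregular form like "went" still maps to the proper word "go".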
POS tags
• The grammatical type of a word is referred to as its POS tag (part of speech).
• Parts of speech include nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions and interjections.
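A minimal sketch of POS tagging with NLTK, assuming the 'punkt' and 'averaged_perceptron_tagger' data are installed (the sentence is illustrative):

```python
import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model, downloaded once
from nltk import word_tokenize, pos_tag

tokens = word_tokenize("The cat sat on the mat.")
print(pos_tag(tokens))  # each token is paired with a Penn Treebank tag
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
```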


Named Entity Recognition
• The process of detecting named entities such as person names, company names and locations; that is, phrase identification.
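A minimal sketch of named entity recognition with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sentence and the exact labels are illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook visited the Apple office in Singapore last Monday.")
for ent in doc.ents:
    # each detected entity span carries a label such as PERSON, ORG, GPE, DATE
    print(ent.text, ent.label_)
# e.g. Tim Cook PERSON, Apple ORG, Singapore GPE, last Monday DATE
```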


Chunking
• Chunking picks up individual pieces of information and groups them into bigger pieces.
• It is a grouping of words or tokens into phrases.
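A minimal sketch of noun-phrase chunking with NLTK's rule-based chunker; the grammar rule and the pre-tagged sentence are illustrative:

```python
import nltk

# tokens already annotated with POS tags (as produced by the previous steps)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

# NP chunk = optional determiner + any adjectives + a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
print(parser.parse(sentence))  # prints a tree with NP chunks grouped together
```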


Why is NLP hard?
• Ambiguity and variability of linguistic expression:
 - Variability: many forms can mean the same thing.
 - Ambiguity: one form can mean many things.
• There are many different kinds of ambiguity.
• Each NLP task has to address a distinct set of them.


Steps – Morphology – Syntax – Semantics
• Phoneme - the smallest individual unit of language; a single distinguishable sound.
• Morphology - word formation.


Morphology
• Morphology helps linguists understand the structure of words by putting together morphemes.
• A morpheme is the smallest grammatical, meaningful part of language.
• There are 2 types: free morphemes and bound morphemes.
• A free morpheme is a single meaningful unit of a word that can stand alone in the language. For example: cat, mat, trust, slow.
• A bound morpheme cannot stand alone; it has no real meaning on its own. For example: in walked, (ed) cannot stand alone, and in unpleasant, (un) is not a stand-alone morpheme. Bound morphemes appear as parts of prefixes and suffixes.
• Bound morphemes can be grouped into a further two categories:
1. Derivational 2. Inflectional
Derivational
• A derivational morpheme is added to the base form of a word to create a new word.
• Look at the word able and let it become ability. In this instance the adjective becomes a noun.
• The word send as a verb morpheme becomes sender, a noun, with the addition of er.
• Stable to unstable changes the meaning of the word to its opposite.
• In other words, the meaning of the word is completely changed by adding a derivational morpheme to a base word.


Inflectional
• Inflectional morphemes are additions to the base word that do not change the word, but rather serve as grammatical indicators. They show grammatical information. For example:
1. Laugh becomes the past tense by adding ed, changing the word to laughed.
2. Dog to dogs changes the word from singular to plural.
3. Swim to swimming changes the verb into a progressive verb.
All these examples show how morphology participates in the study of linguistics.


Parts of Speech Tagging
• This section introduces the task of part-of-speech tagging: taking a sequence of words and assigning each word a part of speech like NOUN or VERB.
• A related task is named entity recognition (NER): assigning words or phrases tags like PERSON, LOCATION, or ORGANIZATION.


Introduction - Part-of-Speech Tagging
• Part-of-speech tagging is the process of assigning a part-of-speech to each word in a text.
• The input is a sequence x1, x2, ..., xn of (tokenized) words and a tagset, and the output is a sequence y1, y2, ..., yn of tags, each output yi corresponding exactly to one input xi.
Introduction
• Tagging is a disambiguation task; words are ambiguous - they have more than one possible part-of-speech - and the goal is to find the correct tag for the situation.
• For example, book can be a verb (book that flight) or a noun (hand me that book).
• That can be a determiner (Does that flight serve dinner) or a complementizer (I thought that your flight was earlier).
• The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
Introduction to POS Tagging
• Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text is labeled with its corresponding part of speech.
• This can include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity recognition, and machine translation.
• It can also be used to identify the grammatical structure of a sentence and to disambiguate words that have multiple meanings.
• Example:
• Text: “The cat sat on the mat.”
• POS tags: The: determiner, cat: noun, sat: verb, on: preposition, the: determiner, mat: noun
What is Part-of-speech (POS) tagging?
• It is a process of converting a sentence to forms - a list of words, or a list of tuples (where each tuple has the form (word, tag)).
• The tag in this case is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech    Tag
Noun (Singular)   NN
Noun (Plural)     NNS
Verb              VB
Determiner        DT
Adjective         JJ
Adverb            RB


Universal Part-of-Speech Tagset

Tag    Meaning               English Examples
ADJ    adjective             new, good, high, special, big, local
ADP    adposition            on, of, at, with, by, into, under
ADV    adverb                really, already, still, early, now
CONJ   conjunction           and, or, but, if, while, although
DET    determiner, article   the, a, some, most, every, no, which
NOUN   noun                  year, home, costs, time, Africa
NUM    numeral               twenty-four, fourth, 1991, 14:24
PRT    particle              at, on, out, over, per, that, up, with
PRON   pronoun               he, their, her, its, my, I, us
VERB   verb                  is, say, told, given, playing, would
.      punctuation marks     . , ; !
X      other                 ersatz, esprit, dunno, gr8, univeristy
Use of Parts of Speech Tagging in NLP
1. To understand the grammatical structure of a sentence: by labeling each word with its POS, we can better understand the syntax and structure of a sentence.
2. To disambiguate words with multiple meanings: some words, such as “bank,” can have multiple meanings depending on the context in which they are used.
3. To improve the accuracy of NLP tasks: POS tagging can help improve the performance of various NLP tasks, such as named entity recognition and text classification.
4. To facilitate research in linguistics: POS tagging can also be used to study the patterns and characteristics of language use and to gain insights into the structure and function of different parts of speech.
Steps Involved in POS Tagging
• Collect a dataset of annotated text: this dataset will be used to train and test the POS tagger. The text should be annotated with the correct POS tags for each word.
• Preprocess the text: this may include tasks such as tokenization (splitting the text into individual words), lowercasing, and removing punctuation.
• Divide the dataset into training and testing sets: the training set will be used to train the POS tagger, and the testing set will be used to evaluate its performance.
• Train the POS tagger: this may involve building a statistical model, such as a hidden Markov model (HMM), or defining a set of rules for a rule-based or transformation-based tagger. The model or rules will be trained on the annotated text in the training set.
• Test the POS tagger: use the trained model or rules to predict the POS tags of the words in the testing set. Compare the predicted tags to the true tags and calculate metrics such as precision and recall to evaluate the performance of the tagger.
• Fine-tune the POS tagger: if the performance of the tagger is not satisfactory, adjust the model or rules and repeat the training and testing process until the desired level of accuracy is achieved.
• Use the POS tagger: once the tagger is trained and tested, it can be used to perform POS tagging on new, unseen text.
Application of POS Tagging
• Information extraction:
 - POS tagging can be used to identify specific types of information in a text, such as names, locations, and organizations.
 - This is useful for tasks such as extracting data from news articles or building knowledge bases for artificial intelligence systems.
• Named entity recognition:
 - POS tagging can be used to identify and classify named entities in a text, such as people, places, and organizations.
 - This is useful for tasks such as building customer profiles or identifying key figures in a news story.
• Text classification:
 - POS tagging can be used to help classify texts into different categories, such as spam emails or sentiment analysis.
 - By analyzing the POS tags of the words in a text, algorithms can better understand the content and tone of the text.


Application of POS Tagging
• Machine translation:
 - POS tagging can be used to help translate texts from one language to another by identifying the grammatical structure and relationships between words in the source language and mapping them to the target language.
• Natural language generation:
 - POS tagging can be used to generate natural-sounding text by selecting appropriate words and constructing grammatically correct sentences.
 - This is useful for tasks such as chatbots and virtual assistants.


Different POS Tagging Techniques
1. Rule-Based POS Tagging
• This is one of the oldest approaches to POS tagging.
• It involves using a dictionary consisting of all the possible POS tags for a given word.
• If a word has more than one possible tag, hand-written rules are used to assign the correct tag based on the tags of surrounding words.
• For example, if the word preceding a given word is an article, then that word has to be a noun.
• Consider the words: A Book
• Get all the possible POS tags for the individual words: A - Article; Book - Noun or Verb
• Use the rules to assign the correct POS tag: as per the possible tags, “A” is an article and we can assign it directly. But book can be either a noun or a verb. However, if we consider “A Book”, A is an article, and following our rule above, Book has to be a noun. Thus, we assign the tag Noun to book.
• POS tags: [(“A”, “Article”), (“Book”, “Noun”)]
Rule-based POS Tagging
These rules may be either:
• Context-pattern rules, or
• Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.
Rule-based POS tagging has a two-stage architecture:
• First stage - it uses a dictionary to assign each word a list of potential parts-of-speech.
• Second stage - it uses large lists of hand-written disambiguation rules to narrow the list down to a single part-of-speech for each word.


Properties of Rule-Based POS Tagging
• These taggers are knowledge-driven taggers.
• The rules in rule-based POS tagging are built manually.
• The information is coded in the form of rules.
• There is a limited number of rules, approximately around 1000.
• Smoothing and language modeling are defined explicitly in rule-based taggers.


Rule-based POS tagger Example

1. Define a set of rules for assigning POS tags to words. For example:
• If the word ends in “-tion,” assign the tag “noun.”
• If the word ends in “-ment,” assign the tag “noun.”
• If the word is all uppercase, assign the tag “proper noun.”
• If the word is a verb ending in “-ing,” assign the tag “verb.”
2. Iterate through the words in the text and apply the rules to each word in turn. For example:
• “Nation” would be tagged as “noun” based on the first rule.
• “Investment” would be tagged as “noun” based on the second rule.
• “UNITED” would be tagged as “proper noun” based on the third rule.
• “Running” would be tagged as “verb” based on the fourth rule.
3. Output the POS tags for each word in the text.

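A minimal sketch of the hypothetical rules above in plain Python (the rules and words come from this slide; this is an illustration, not a production tagger):

```python
def rule_based_tag(word: str) -> str:
    # rule 3: all-uppercase words are tagged "proper noun"
    if word.isupper():
        return "proper noun"
    lower = word.lower()
    # rules 1 and 2: "-tion" / "-ment" endings are tagged "noun"
    if lower.endswith("tion") or lower.endswith("ment"):
        return "noun"
    # rule 4: "-ing" endings are tagged "verb"
    if lower.endswith("ing"):
        return "verb"
    return "unknown"  # no rule matched

for w in ["Nation", "Investment", "UNITED", "Running"]:
    print(w, "->", rule_based_tag(w))
# Nation -> noun, Investment -> noun, UNITED -> proper noun, Running -> verb
```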


Stochastic POS Tagging
Another technique of tagging is stochastic POS tagging.
A model that includes frequency or probability (statistics) can be called stochastic, so any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging.
The simplest stochastic taggers apply the following approaches to POS tagging:
Word Frequency Approach
• In this approach, the stochastic tagger disambiguates words based on the probability that a word occurs with a particular tag.
• In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word.
• The main issue with this approach is that it may yield inadmissible sequences of tags.
Tag Sequence Probabilities
• This is another approach to stochastic tagging, where the tagger calculates the probability of a given sequence of tags occurring.
• It is also called the n-gram approach, because the best tag for a given word is determined by the probability with which it occurs with the n previous tags.
Statistical POS Tagging
• Statistical part-of-speech (POS) tagging is a method of labeling words with their corresponding parts of speech using statistical techniques.
• This is in contrast to rule-based POS tagging, which relies on pre-defined rules, and to unsupervised learning-based POS tagging, which does not use any annotated training data.
• In statistical POS tagging, a model is trained on a large annotated corpus of text to learn the patterns and characteristics of different parts of speech.
• The model uses this training data to predict the POS tag of a given word based on the context in which it appears and the probability of different POS tags occurring in that context.
• Statistical POS taggers can be more accurate and efficient than rule-based taggers, especially for tasks with large or complex datasets.


Transformation-based Tagging (TBT)
• Transformation-based tagging (TBT) is a method of part-of-speech (POS) tagging that uses a series of rules to transform the tags of words in a text.
• This is in contrast to rule-based POS tagging, which assigns tags to words based on pre-defined rules, and to statistical POS tagging, which relies on a trained model to predict tags based on probability.


Working Principles

• Here is an example of how a TBT system might work:


1. Define a set of rules for transforming the tags of words in the text. For example:
• If the word is a verb and appears after a determiner, change the tag to “noun.”
• If the word is a noun and appears after an adjective, change the tag to “adjective.”
2. Iterate through the words in the text and apply the rules in a specific order. For example:
• In the sentence “The cat sat on the mat,” the word “sat” would be changed from a verb to a noun
based on the first rule.
• In the sentence “The red cat sat on the mat,” the word “red” would be changed from an adjective
to a noun based on the second rule.
3. Output the transformed tags for each word in the text.



Hidden Markov Model POS Tagging
• Hidden Markov models (HMMs) are a type of statistical model that can be used for part-of-speech (POS) tagging in natural language processing (NLP).
• In an HMM-based POS tagger, a model is trained on a large annotated corpus of text to learn the patterns and characteristics of different parts of speech.
• The model uses this training data to predict the POS tag of a given word based on the probability of different tags occurring in the context of the word.


HMM (Hidden Markov Model)
• HMM (Hidden Markov Model) is a stochastic technique for POS tagging.
• Hidden Markov models are known for their applications to reinforcement learning and temporal pattern recognition, such as:
• Speech
• Handwriting
• Gesture recognition
• Musical score following
• Partial discharges
• Bioinformatics


Example
• The transition probability is the likelihood of a particular sequence - for example, how likely it is that a noun is followed by a modal, a modal by a verb, and a verb by a noun.
• This probability is known as the transition probability.
• It should be high for a particular sequence to be correct.


Example
• Mary Jane can see Will
• Spot will see Mary
• Will Jane spot Mary?
• Mary will pat Spot



Counting Table - Emission Probability


Emission probabilities
• The probability that Mary is a Noun = 4/9
• The probability that Mary is a Modal = 0
• The probability that Will is a Noun = 1/9
• The probability that Will is a Modal = 3/4
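A minimal sketch of how these emission probabilities P(word | tag) can be computed by counting over the four example sentences; the per-word tags (N = noun, M = modal, V = verb) are assumed annotations consistent with the probabilities above:

```python
from collections import Counter

tagged = [
    [("mary", "N"), ("jane", "N"), ("can", "M"), ("see", "V"), ("will", "N")],
    [("spot", "N"), ("will", "M"), ("see", "V"), ("mary", "N")],
    [("will", "M"), ("jane", "N"), ("spot", "V"), ("mary", "N")],
    [("mary", "N"), ("will", "M"), ("pat", "V"), ("spot", "N")],
]
pair_counts = Counter(p for sent in tagged for p in sent)      # count(word, tag)
tag_counts = Counter(t for sent in tagged for _, t in sent)    # count(tag)

def emission(word, tag):
    # P(word | tag) = count(word, tag) / count(tag)
    return pair_counts[(word, tag)] / tag_counts[tag]

print(emission("mary", "N"))   # 4/9
print(emission("will", "N"))   # 1/9
print(emission("will", "M"))   # 3/4
```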


Challenges in POS Tagging

• Ambiguity: Some words can have multiple POS tags depending on the context in which they appear,
making it difficult to determine their correct tag. For example, the word “bass” can be a noun (a type of
fish) or an adjective (having a low frequency or pitch).

• Out-of-vocabulary (OOV) words: Words that are not present in the training data of a POS tagger can be
difficult to tag accurately, especially if they are rare or specific to a particular domain.

• Complex grammatical structures: Languages with complex grammatical structures, such as languages
with many inflections or free word order, can be more challenging to tag accurately.

• Lack of annotated training data: Some languages or domains may have limited annotated training data,
making it difficult to train a high-performing POS tagger.

• Inconsistencies in annotated data: Annotated data can sometimes contain errors or inconsistencies, which
can negatively impact the performance of a POS tagger.
STUDENT EVALUATION
1) What is the main challenge of NLP?
A. Handling Tokenization  B. Handling POS-Tagging  C. Handling Ambiguity of Sentences  D. None of the above
2) All of the following are challenges associated with natural language processing except:
A. dividing up a text into individual words in English  B. understanding the context in which something is said  C. recognizing typographical or grammatical errors in texts  D. distinguishing between words that have more than one meaning
3) In linguistic morphology, _____________ is the process for reducing inflected words to their root form.
A. Stemming  B. Rooting  C. Text-Proofing  D. Both A and B
4) Morphological segmentation:
A. is an extension of propositional logic  B. does discourse analysis  C. separates words into individual morphemes and identifies the class of the morphemes  D. none of the mentioned
Language modelling in NLP

• Language modeling is a fundamental task in Natural Language Processing (NLP) that involves building a
statistical model to predict the probability distribution of words in a given language.

• The language model learns the patterns and relationships between words in a corpus of text

• Language models can be used to generate new text, evaluate the likelihood of a sentence, and perform speech recognition, machine translation, spam filtering, etc.

• In language modeling, the primary goal is to estimate the probability of a sequence of words (a sentence or a
phrase) using the conditional probability of each word given its preceding context.

• The model learns from large amounts of text data to predict the likelihood of a particular word given the
previous words in a sentence.



Language modelling in NLP

There are different types of language models, but two prominent approaches are
• N-gram Language Models
• Neural Language Models

N-gram Language Models: N-gram language models are simple and widely used in early NLP
tasks. An N-gram model predicts the probability of a word based on the previous (N-1) words in
a sentence.
For example, a trigram model (3-gram) predicts the probability of a word given the two
preceding words. The model estimates the probabilities based on the frequency of word
sequences observed in the training data.
Neural Language Models: Neural language models, such as recurrent neural networks (RNNs)
and transformer-based models, have gained significant popularity in recent years due to their
ability to capture long-range dependencies and contextual information
These models learn complex patterns in the language and can generate more coherent and
contextually relevant text.



Language modelling in NLP

• Recurrent Neural Networks (RNNs): RNNs are a class of neural networks designed for
sequential data processing.
• They process input sequences step by step, maintaining a hidden state that captures information
from previous steps.
• This hidden state acts as the context for the current word prediction.
• However, RNNs have challenges with capturing long-range dependencies and can suffer from
vanishing or exploding gradients.
• Transformer-based Models: Transformer-based models, like the famous BERT (Bidirectional
Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) series,
have revolutionized language modeling.



What Are N-Grams?
• N-grams are contiguous sequences of words, symbols, or tokens in a document.
• In technical terms, they can be defined as neighboring sequences of items in a document.
• They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
• An N-gram means a sequence of N words. For example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram).
• They have a wide range of applications, like language models, semantic features, spelling correction, machine translation, text mining, etc.
N-gram Language Model
• An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech.
• The items can be letters, words, or base pairs according to the application.
• The N-grams typically are collected from a text or speech corpus (a long text dataset).
• An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language.
• A good N-gram model can predict the next word in the sentence, i.e., the value of P(w|h), the probability of a word w given a history h.
N-Gram Types
• Examples of N-grams: unigrams (“This”, “article”, “is”, “on”, “NLP”) or bigrams (“This article”, “article is”, “is on”, “on NLP”).
• Now, we will establish how to find the next word in a sentence: we need to calculate P(w|h), where w is the candidate for the next word.
• Types:
• Unigram
• Bigram
• Trigram
• N-gram
N-GRAM QUESTION
• Take the following sentences as the training corpus:
• Thank you so much for your help.
• I really appreciate your help.
• Excuse me, do you know what time it is?
• I'm really sorry for not inviting you.
• I really like your watch.
• Suppose we're calculating the probability of word “w1” occurring after the word “w2”; the formula for this is as follows:
• count(w2 w1) / count(w2)
• which is the number of times the words occur in the required sequence, divided by the number of times the word before the expected word occurs in the corpus.
ANSWER
From our example sentences, let's calculate the probability of the word “like” occurring after the word “really”:
count(really like) / count(really) = 1/3 = 0.33
Similarly, for the other two possibilities:
count(really appreciate) / count(really) = 1/3 = 0.33
count(really sorry) / count(really) = 1/3 = 0.33
Here P(w|h) is the probability of a word w given some history h.
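A minimal sketch that reproduces these counts over the five training sentences (lowercased for simplicity):

```python
from collections import Counter

corpus = [
    "thank you so much for your help",
    "i really appreciate your help",
    "excuse me do you know what time it is",
    "i'm really sorry for not inviting you",
    "i really like your watch",
]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)                   # count(w2)
    bigrams.update(zip(words, words[1:]))    # count(w2 w1)

def p_next(w2, w1):
    # P(w1 | w2) = count(w2 w1) / count(w2)
    return bigrams[(w2, w1)] / unigrams[w2]

print(p_next("really", "like"))        # 1/3 ≈ 0.33
print(p_next("really", "appreciate"))  # 1/3
print(p_next("really", "sorry"))       # 1/3
```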


Metrics for Language Modeling
• Common metrics for evaluating language models include entropy, cross-entropy, and perplexity; perplexity measures how well a probability distribution predicts a sample.


• For example, let's take the sentence: ‘Natural Language Processing’. For predicting the first word, say the candidate words have the following probabilities:

word         P(word | <start>)
The          0.4
Processing   0.3
Natural      0.12
Language     0.18


• Now, we know the probability of getting the first word as ‘Natural’. But what is the probability of getting the word ‘Language’ after the word ‘Natural’?

word         P(word | ‘Natural’)
The          0.05
Processing   0.3
Natural      0.15
Language     0.5


• After getting the probability of generating the words ‘Natural Language’, what is the probability of getting ‘Processing’?

word         P(word | ‘Language’)
The          0.1
Processing   0.7
Natural      0.1
Language     0.1
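Putting the three tables together, a short sketch of how the chained conditional probabilities score the whole sentence; the perplexity line is an added illustration of the metric named above, not from the slides:

```python
# P(Natural | <start>) * P(Language | Natural) * P(Processing | Language)
p = 0.12 * 0.5 * 0.7
print(p)  # 0.042

# perplexity: inverse probability, normalized by sentence length (3 words)
perplexity = (1 / p) ** (1 / 3)
print(perplexity)  # ≈ 2.88 (lower is better)
```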


What are MWEs?
• Multiword expressions (MWEs) are expressions which are made up of at least 2 words and which can be syntactically and/or semantically idiosyncratic in nature.
• An MWE is a sequence of words that has lexical, orthographic, phonological, morphological, syntactic, semantic, pragmatic or translational properties not predictable from the individual components or their normal mode of combination.
• Multiword expressions are word combinations with linguistic properties that cannot be predicted from the properties of the individual words or the way they have been combined.
• MWEs occur frequently and are usually highly domain-dependent.
• A proper treatment of MWEs is essential for the success of NLP systems.
MWE
• A sequence, continuous or discontinuous, of words or other elements, which is or appears to be prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.
• A language word is a lexical unit in the language that stands for a concept, e.g. train, water, ability.
• However, that may not always be true, e.g. Prime Minister.
• Due to institutionalized usage, we tend to think of ‘Prime Minister’ as a single concept. Here the concept crosses word boundaries.
• Simply put, a multiword expression (MWE):
 a. crosses word boundaries
 b. is lexically, syntactically, semantically, pragmatically and/or statistically idiosyncratic
• E.g. traffic signal, Real Madrid, green card, fall asleep, leave a mark, ate up, figured out, kick the bucket, spill the beans, ad hoc.
Idiosyncrasies
• Statistical idiosyncrasies
 - Usage of the multiword has been conventionalized, though it is still semantically decomposable. E.g. traffic signal, good morning.
• Lexical idiosyncrasies
 - Lexical items generally not seen in the language, probably borrowed from other languages. E.g. ad hoc, ad hominem.
• Syntactic idiosyncrasies
 - Conventional grammar rules don't hold; these multiwords exhibit peculiar syntactic behaviour.


Idiosyncrasies
• Semantic idiosyncrasy
 - The meaning of the multiword is not completely composable from those of its constituents.
 - This arises from figurative or metaphorical usage.
 - The degree of compositionality varies.
 E.g. blow hot and cold - keep changing opinions
      spill the beans - reveal a secret
      run for office - contest for an official post


MWE Characteristics
• Basis for MWE extraction:
 o Non-compositionality
  - Non-decomposable - e.g. blow hot and cold
  - Partially decomposable - e.g. spill the beans
 o Syntactic flexibility
  - Can undergo inflections, insertions, passivizations - e.g. promise(d/s) him the moon
  - The more non-compositional the phrase, the less syntactically flexible it is
 o Substitutability
  - MWEs resist substitution of their constituents by similar words - e.g. ‘many thanks’ cannot be expressed as ‘several thanks’ or ‘many gratitudes’
 o Institutionalization
  - Results in statistical significance of collocations
 o Paraphrasability
  - Sometimes it is possible to replace the MWE by a single word - e.g. leave out replaced by omit
• Based on syntactic forms and compositionality:
 o Institutionalized noun collocations - e.g. traffic signal, George Bush, green card
 o Phrasal verbs (verb-particle constructions) - e.g. call up, eat up
 o Light verb constructions (V-N collocations) - e.g. fall asleep, give a demo
 o Verb phrase idioms - e.g. sweep under the rug
MWETokenizer
• The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK.
• Once the text has been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.
• For example, the name Martha Jones is combined into a single token instead of being broken into two tokens.
• This tokenizer is very flexible since it is agnostic of the base tokenizer that was used to generate the tokens.
• An MWETokenizer takes a string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens using a lexicon of MWEs.
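A minimal sketch of NLTK's MWETokenizer re-grouping an already-tokenized sentence; the MWE lexicon entries are illustrative:

```python
from nltk.tokenize import MWETokenizer

# lexicon of multi-word expressions to merge back into single tokens
tokenizer = MWETokenizer([("Martha", "Jones"), ("natural", "language")],
                         separator="_")
tokens = "Martha Jones studies natural language processing".split()
print(tokenizer.tokenize(tokens))
# ['Martha_Jones', 'studies', 'natural_language', 'processing']
```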


Student Evaluation
1. N-grams are defined as the combination of N keywords together. How many bi-grams can be generated from the given sentence: The Father of our nation is Mahatma Gandhiji
A. 8  B. 9  C. 7  D. 4
2. It is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.
A. Statistical Language Modelling  B. Probabilistic Language Modelling  C. Neural Language Modelling  D. Natural Language Understanding
3. It is a measure of how well a probability distribution predicts a sample.
A. Entropy  B. Perplexity  C. Cross-Entropy  D. Information Gain
4. What are the Python libraries used in NLP?
A. Pandas  B. NLTK  C. Spacy  D. All of the mentioned above


COLLOCATIONS (ASSOCIATION MEASURES, COEFFICIENTS AND CONTEXT MEASURES)
• Collocations are pairs or groups of words that frequently appear together in natural language.
• They often have a strong association or tendency to co-occur due to their semantic or syntactic relationship.
• Association measures and coefficients are statistical methods used to quantify the strength of the association between words in a collocation.
• These measures help identify meaningful word combinations and can be useful in various natural language processing tasks, such as information retrieval, text mining, and machine translation.


COLLOCATIONS (ASSOCIATION MEASURES, COEFFICIENTS AND CONTEXT MEASURES)
• Collocations are phrases or expressions containing multiple words which are highly likely to co-occur.
• For example - ‘social media’, ‘school holiday’, ‘machine learning’, ‘Universal Studios Singapore’, etc.
• A collocation is two or more words that often go together.
• These combinations just sound "right" to native English speakers, who use them all the time.
• On the other hand, other combinations may be unnatural and just sound "wrong".
Why learn collocations?

• Your language will be more natural and more easily understood.


• You will have alternative and richer ways of expressing yourself.
• It is easier for our brains to remember and use language in chunks or
blocks rather than as single words.



How to learn collocations?
• Be aware of collocations, and try to recognize them when you see or hear them.
• Treat collocations as single blocks of language. Think of them as individual blocks or chunks, and learn strongly support, not strongly + support.
• When you learn a new word, write down other words that collocate with it (remember rightly, remember distinctly, remember vaguely, remember vividly).
• Read as much as possible. Reading is an excellent way to learn vocabulary and collocations in context and naturally.
• Revise what you learn regularly. Practise using new collocations in context as soon as possible after learning them.
• Learn collocations in groups that work for you. You could learn them by topic (time, number, weather, money, family) or by a particular word (take action, take a chance, take an exam).
• You can find information on collocations in any good learner's dictionary. And you can also find specialized dictionaries of collocations.
Collocations
Some commonly used association measures and coefficients for collocations:
• Pointwise Mutual Information (PMI): PMI measures the degree of association between two words, considering their probability of co-occurrence versus the probability of their individual occurrences. A higher PMI value indicates a stronger association.
• Log-Likelihood Ratio (LLR): the LLR compares the likelihood of observing a collocation in a corpus to the likelihood of the individual words occurring independently. It helps identify statistically significant collocations.
• Dice Coefficient: the Dice coefficient calculates the ratio of twice the number of times two words occur together to the sum of their individual occurrences. It ranges from 0 to 1, with 1 indicating a perfect collocation.
• Mutual Information (MI): MI measures the reduction in uncertainty of one word's occurrence when the other word is known. It captures both positive and negative associations.
• Frequency-based measures: simple measures like raw frequency, conditional probability, and relative frequency can also be used to identify collocations.
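A minimal sketch of scoring bigram collocations by PMI with NLTK's collocation tools; the toy word sequence is illustrative:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("the movie had amazing special effects and the special effects "
         "were praised for amazing visuals").split()

# collect bigram and unigram frequencies from the token stream
finder = BigramCollocationFinder.from_words(words)
measures = BigramAssocMeasures()

# rank candidate bigrams by pointwise mutual information
print(finder.nbest(measures.pmi, 3))  # top 3 bigrams by PMI
```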


Example
Suppose we have a large dataset of movie reviews, and we want to find collocations that frequently appear together in positive reviews. We are particularly interested in identifying collocations related to the theme of "amazing special effects" in movies.
Step 1: Preprocess the data. First, we preprocess the movie reviews by tokenizing them into words and removing any stop words, punctuation, and numbers.
Step 2: Calculate association measures. Next, we calculate the association measures for different word pairs. Let's say we use the Dice coefficient as our association measure.
• For each word pair (A, B), we calculate the Dice coefficient as follows:
• Dice(A, B) = 2 × (Number of times A and B co-occur) / (Number of times A occurs + Number of times B occurs)
Step 3: Identify significant collocations. Now, we look for collocations with high Dice coefficients, indicating strong associations. Let's say we find the following collocations with their corresponding Dice coefficients:
1. "amazing special" - Dice coefficient: 0.85
2. "special effects" - Dice coefficient: 0.80
3. "stunning visuals" - Dice coefficient: 0.75
4. "spectacular CGI" - Dice coefficient: 0.70
5. "mind-blowing action" - Dice coefficient: 0.65
Example
Step 4: Interpretation. Based on the Dice coefficients, we can see that the word pairs "amazing special" and "special effects" have the highest associations in positive movie reviews. This suggests that reviewers often mention "amazing special" and "special effects" together when praising movies with exceptional visual effects.
• In this real-world example, association measures like the Dice coefficient helped us identify significant collocations related to "amazing special effects" in movie reviews.
• These collocations can be useful for sentiment analysis, recommending movies to users who appreciate stunning visuals, or improving the understanding of which aspects of movies are highly praised by reviewers.


Types of collocation

• adverb + adjective: completely satisfied (NOT downright satisfied)


• adjective + noun: excruciating pain (NOT excruciating joy)
• noun + noun: a surge of anger (NOT a rush of anger)
• noun + verb: lions roar (NOT lions shout)
• verb + noun: commit suicide (NOT undertake suicide)
• verb + expression with preposition: burst into tears (NOT blow up in tears)
• verb + adverb: wave frantically (NOT wave feverishly)



Vector representation of words in NLP
• In Natural Language Processing (NLP), vector representation of words is a crucial concept used to convert words into numerical vectors. These vector representations are also known as word embeddings.
• Word embeddings are essential because they capture semantic and syntactic relationships between words and enable machine learning models to process and understand natural language text.
There are several methods to create word embeddings; some of the commonly used techniques include:
• One-Hot Encoding: in one-hot encoding, each word in the vocabulary is represented as a binary vector where all elements are zero except for the index corresponding to the word's position in the vocabulary, which is set to 1. This method creates sparse and high-dimensional vectors that lack meaningful semantic relationships between words (see the sketch below).
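A minimal sketch of one-hot encoding over a tiny illustrative vocabulary:

```python
import numpy as np

vocab = ["natural", "language", "processing"]
index = {w: i for i, w in enumerate(vocab)}  # word -> position in vocabulary

def one_hot(word):
    # all zeros except a 1 at the word's vocabulary index
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("language"))  # [0 1 0]
```

Note that every pair of distinct one-hot vectors is equally far apart, which is exactly why these vectors carry no semantic relationships.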
Techniques used
• Word2Vec: Word2Vec is a popular word embedding technique that learns continuous word representations from large amounts of text data. It offers two algorithms: Continuous Bag of Words (CBOW) and Skip-gram. These models generate dense word vectors that capture semantic similarities between words based on their context.
• GloVe (Global Vectors for Word Representation): GloVe is another widely used method for learning word embeddings. It uses the global co-occurrence statistics of words in a corpus to create word vectors. GloVe embeddings capture both semantic and syntactic relationships between words.
• fastText: fastText is an extension of Word2Vec that represents each word as a bag of character n-grams. It can generate word embeddings for out-of-vocabulary words based on their character-level information, making it useful for handling misspellings and rare words.


Word 2 Vector Techniques

• BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based


model that generates contextual word embeddings. Unlike traditional methods that generate static
embeddings, BERT considers the context of the word within the sentence, producing highly
contextualized word representations.
• ELMo (Embeddings from Language Models): ELMo is another contextual word embedding model
that uses a bi-directional language model. It generates word embeddings based on the entire context of
the sentence, capturing the polysemy (multiple meanings) of words.
• ULMFiT (Universal Language Model Fine-tuning): ULMFiT is a transfer learning approach for NLP
that utilizes pre-trained language models to fine-tune embeddings for specific downstream tasks. It
enables efficient training on smaller datasets.
• These word embeddings can be used as input features for various NLP tasks, such as sentiment
analysis, machine translation, named entity recognition, and more.
• They help improve the performance of NLP models by providing a more compact and meaningful
representation of words in numerical form.



Word2Vec Example
• Let's demonstrate a simple example of word embeddings using Word2Vec, one of the popular techniques for learning word representations. For this example, we will use a small dataset of movie reviews and create word embeddings using the Word2Vec algorithm.
• Step 1: Preprocess the data. Suppose we have the following movie reviews:
1. "The movie was fantastic, with amazing special effects."
2. "The plot was engaging and kept me hooked till the end."
3. "The acting was superb, especially by the lead actor."
4. "The film had stunning visuals and great cinematography."
• We need to preprocess the data by tokenizing the sentences and converting the text to lowercase.
• Step 2: Train a Word2Vec model using the tokenized reviews.
• Step 3: Retrieve word embeddings. Now, we can access the word embeddings for specific words using the trained Word2Vec model.
• Step 4: Similar words. We can also find words similar to a given word based on their embeddings.
• Step 5: Word similarity. Additionally, we can measure the similarity between two words.
The sketch below walks through these five steps.
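A minimal sketch of the five steps using gensim (assuming gensim 4.x is installed); the hyperparameters are toy values, and the printed numbers will vary because training on four sentences is randomized and far too small to be meaningful:

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

reviews = [
    "The movie was fantastic, with amazing special effects.",
    "The plot was engaging and kept me hooked till the end.",
    "The acting was superb, especially by the lead actor.",
    "The film had stunning visuals and great cinematography.",
]
# Step 1: tokenize and lowercase each review
sentences = [simple_preprocess(r) for r in reviews]

# Step 2: train a Word2Vec model (toy settings for a toy corpus)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Step 3: retrieve the embedding for a specific word
print(model.wv["fantastic"][:5])

# Step 4: find words similar to a given word
print(model.wv.most_similar("fantastic", topn=3))

# Step 5: measure the similarity between two words
print(model.wv.similarity("fantastic", "amazing"))
```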
Word2Vec Example

• The resulting word embeddings and similarity scores will depend on the specific
corpus and the number of training iterations, but they should capture the semantic
relationships between words based on their context in the reviews.
• For instance, "fantastic" and "amazing" are likely to have a high similarity score,
as they both frequently appear together in positive contexts in the dataset.
Similarly, "plot" and "visuals" might also have a reasonable similarity score if they
co-occur in sentences discussing movie elements.



REFERENCES
1) Stanford slides for all topics:
• https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
2) E-book which can be followed:
• https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
3) Channel by Dan Jurafsky and Christopher Manning where they teach each topic from scratch:
• https://www.youtube.com/watch?v=808M7q8QX0E&list=PLaZQkZp6WhWyvdiP49JG-rjyTPck_hvEu
4) https://www.shiksha.com/online-courses/articles/pos-tagging-in-nlp/
5) https://web.stanford.edu/~jurafsky/slp3/
6) https://www.studocu.com/in/document/srm-institute-of-science-and-technology/natural-language-processing/nlp-notes-unit-1/39506511?origin=home-recent-1
