
21CSE356T – Natural Language Processing

“Enabling Computers to Understand Natural Language like Humans”

Department of Networking and Communications

School of Computing

SRM Institute of Science and Technology



Course Objectives
1. Introduce students to the leading trends and systems in natural language processing.

2. Help them understand the concepts of morphology, syntax, semantics and pragmatics of language, and enable them to give appropriate examples illustrating each concept.

3. Teach them to recognize the significance of pragmatics for natural language understanding.

4. Enable students to describe applications based on natural language processing and to identify the points of syntactic, semantic and pragmatic processing involved.

5. Build an understanding of natural language processing and of how to apply basic algorithms in this field.


Course Outcomes
CO1: Construct approaches to syntax and semantics in NLP.

CO2: Analyze approaches to syntax and semantic parsing with pronoun resolution.

CO3: Implement semantic roles, relations and frames, including coreference resolution.

CO4: Implement summarization, information retrieval and machine translation.

CO5: Apply the knowledge of the various levels of analysis involved in NLP and implement different NLP techniques such as word embeddings, CBOW and Skip-gram.


NLP
• Unit I - Introduction to NLP
• Unit II - Syntax parsing
• Unit III - Semantic and Discourse Analysis
• Unit IV - Language Models
• Unit V - NLP Applications



Introduction to NLP
• The motive of learning a language is to communicate and to share information with others successfully.

• Industry estimates suggest that only about 21% of the available data is in structured form.

• Data is generated constantly as we send messages on WhatsApp, Facebook and other social media.

• The majority of data exists in textual format, which is a highly unstructured form.

• To produce significant and actionable insights from this data, it is important to get acquainted with the techniques of text analysis and natural language processing.


Introduction to NLP
• Text analytics/mining is the process of deriving meaningful information from natural language text.

• It usually involves structuring the input text, deriving patterns, and evaluating and interpreting the output.

• Natural language processing is a part of computer science and artificial intelligence that deals with human languages.
What is NLP?

• Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables machines to understand human language.
• The main intention of NLP is to build systems that are able to make sense of text and audio,
• and then automatically execute tasks like spell-check, text translation, topic classification, etc.


Need of NLP

• The need to study Natural Language Processing (NLP) arises from:
  - the increasing role of language in human-computer interaction;
  - the vast amount of unstructured textual data available.

• For computers to interact with humans, they need to understand the natural languages used by humans.

• Natural language processing is all about making computers learn, process, and manipulate natural languages.
Need of NLP

• Processing text data is an essential task, as there is an abundance of text available everywhere.
• Text data can be found in various sources such as books, websites, social media, news articles, research papers, emails, and more.
• However, text data is often unstructured, meaning it lacks a predefined format or organization.
• To harness the valuable information contained within text data, it is necessary to process and analyze it effectively.


Need of NLP

• Text is processed to extract insights, identify patterns, perform sentiment analysis, categorize documents, automate text generation, and enable information retrieval.
• By processing text data, valuable knowledge can be derived, enabling businesses, researchers, and individuals to make informed decisions, gain insights, improve customer experiences, develop intelligent systems, and drive innovation across various industries.


Use of NLP

1. User-Friendly Interfaces: NLP allows for intuitive and user-friendly interfaces using natural language,
reducing the need for complex programming syntax.
2. Accessibility and Inclusivity: NLP makes technology accessible to a wider audience, including those with
limited technical expertise or disabilities.
3. Conversational Systems: NLP enables the development of conversational agents, enhancing user interaction
and system efficiency.
4. Data Extraction and Analysis: NLP extracts insights from unstructured text data, enabling sentiment
analysis, information retrieval, and text summarization.
5. Voice-based Interaction: NLP powers voice assistants and speech recognition systems for hands-free and
natural interaction.
6. Human-Machine Collaboration: NLP enables seamless communication and collaboration between humans
and machines.
7. Natural Language Understanding: NLP allows machines to comprehend context, semantics, and intent,
enabling advanced applications and personalized experiences.



Applications of NLP


Real life use cases of NLP

1. Gmail - when you are typing a sentence in Gmail, you will notice that it tries to auto-complete it. Auto-completion is done using NLP.
2. Spam filters - if email had no spam filters, your inbox would be flooded with unwanted mail. Using NLP we can filter such messages, using keywords to take them out of your inbox.
3. Language translation - translating a sentence from one language to another.
4. Customer service chatbots - for example, in a bank's service chatbot, you type in a message and many times there is no human on the other end. The chatbot can interpret your language, derive the intent out of it, and respond to your question on its own; when it doesn't work well, it connects you to a human being.
5. Voice assistants - such as Amazon Alexa and Google Assistant.
6. Google Search - the BERT language model helps return relevant results for search queries.
Advantages of NLP

• Users can ask questions about any subject and get a direct response within seconds.
• NLP offers exact answers to questions; it does not return unnecessary and unwanted information.
• NLP helps computers communicate with humans in their own languages.
• It is very time efficient.
• Most companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.


Disadvantages of NLP

• NLP may not capture context.
• NLP is unpredictable.
• NLP may require more keystrokes.
• NLP systems are often unable to adapt to a new domain and have limited functionality; this is why an NLP system is usually built for a single, specific task.


Components of NLP

1. Natural Language Understanding (NLU)

• Natural Language Understanding (NLU) helps the machine to understand and analyse human language by extracting metadata from content, such as concepts, entities, keywords, emotion, relations, and semantic roles.
• NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
• NLU involves the following tasks:
  - mapping the given input into a useful representation;
  - analyzing different aspects of the language.


Components of NLP

2. Natural Language Generation (NLG)

• Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation.
• It mainly involves text planning, sentence planning, and text realization.


Programming Languages for NLP



NLP Libraries

• Scikit-learn: provides a wide range of algorithms for building machine learning models in Python.
• Natural Language Toolkit (NLTK): a complete toolkit for all NLP techniques.
• Pattern: a web mining module for NLP and machine learning.
• TextBlob: provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
• Quepy: used to transform natural language questions into queries in a database query language.
• SpaCy: an open-source NLP library used for data extraction, data analysis, sentiment analysis, and text summarization.
• Gensim: works with large datasets and processes data streams.
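As a quick, hedged illustration of two of these libraries in practice (a minimal sketch, assuming nltk and textblob are installed along with the NLTK tokenizer data they rely on):

# Minimal sketch: tokenizing with NLTK and sentiment with TextBlob.
# Assumes: pip install nltk textblob, plus NLTK's 'punkt' tokenizer data.
import nltk
from textblob import TextBlob

nltk.download("punkt", quiet=True)

text = "NLP enables computers to understand human language. It works fabulously."
print(nltk.word_tokenize(text))   # NLTK: split text into word tokens

blob = TextBlob(text)
print(blob.sentiment)             # TextBlob: (polarity, subjectivity) sentiment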
List of NLP APIs

• IBM Watson API
  IBM Watson API combines different sophisticated machine learning techniques to enable developers to classify text into various custom categories.
• Chatbot API
  Chatbot API allows you to create intelligent chatbots for any service.
• Speech to Text API
  Speech to Text API is used to convert speech to text.
• Text Analysis API by AYLIEN
  Text Analysis API by AYLIEN is used to derive meaning and insights from textual content.
• Cloud NLP API
  The Cloud NLP API is used to improve the capabilities of an application using natural language processing technology.
Difference between Natural language and Computer Language



Student Evaluation MCQ

1) What is the field of Natural Language Processing (NLP)?
   a) Computer Science   b) Artificial Intelligence
   c) Linguistics   d) All of the mentioned
2) What is the main challenge of NLP?
   a) Handling Ambiguity of Sentences   b) Handling Tokenization
   c) Handling POS-Tagging   d) All of the mentioned
3) What is Machine Translation?
   a) Converts one human language to another   b) Converts human language to machine language
   c) Converts any human language to English   d) Converts machine language to human language
4) Natural language processing is divided into the two subfields of -
   a) symbolic and numeric   b) algorithmic and heuristic
   c) time and motion   d) understanding and generation
5) The natural language is also known as .....................
   a) 3rd Generation language   b) 4th Generation language
   c) 5th Generation language   d) 6th Generation language
Levels/Process of NLP

1. Morphological Analysis / Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Discourse
5. Pragmatics

(The phases form a pipeline: Lexical Analysis → Syntax Analysis → Semantic Analysis → Discourse → Pragmatics.)


Morphological Analysis/ Lexical Analysis
• Morphological or Lexical Analysis deals with text at the individual word level.
• It looks for morphemes, the smallest units of a word.
• The first phase of NLP is Lexical Analysis.
• This phase scans the source text as a stream of characters and converts it into meaningful lexemes.
• It divides the whole text into paragraphs, sentences, and words.
• For example, "irrationally" can be broken into ir (prefix), rational (root) and -ly (suffix).
• Lexical Analysis finds the relation between these morphemes and converts the word into its root form.
• A lexical analyzer also assigns the possible Part-Of-Speech (POS) tags to the word.
• It takes into consideration the dictionary of the language.
• For example, the word "character" can be used as a noun or a verb.
Syntax Analysis

Syntax Analysis ensures that a given piece of text has correct structure. It tries to parse the sentence to check for correct grammar at the sentence level.

Given the possible POS tags generated in the previous step, a syntax analyzer assigns POS tags based on the sentence structure.


Syntax
• Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.
• Syntax techniques:
1. Lemmatization - reducing the various inflected forms of a word to a single form for easy analysis.
2. Morphological segmentation - dividing words into individual units called morphemes.
3. Word segmentation - dividing a large piece of continuous text into distinct units.
4. Part-of-speech tagging - identifying the part of speech for every word.
5. Parsing - undertaking a grammatical analysis of the provided sentence.
6. Sentence breaking - placing sentence boundaries in a large piece of text.
7. Stemming - cutting inflected words down to their root form.


Semantics
• Semantics refers to the meaning that is conveyed by a text. It involves the interpretation of words and how sentences are structured.

1. Named entity recognition (NER) - determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of individuals and names of places.

2. Word sense disambiguation - giving meaning to a word based on its context.


Semantic Analysis

• Consider the sentence: “The apple ate a banana”. Although the sentence is
syntactically correct, it doesn’t make sense because apples can’t eat.
• Semantic analysis looks for meaning in the given sentence. It also deals with
combining words into phrases.
• For example, “red apple” provides information regarding one object; hence we
treat it as a single phrase.
• Similarly, we can group names referring to the same category, person, object or
organization. “Robert Hill” refers to the same person and not two separate
names – “Robert” and “Hill”



Discourse

• Discourse deals with the effect of a previous sentence on the sentence under consideration. In the text, "Jack is a bright student. He spends most of the time in the library.", discourse assigns "he" to refer to "Jack".

Pragmatics

• The final stage of NLP, Pragmatics, interprets the given text using information from the previous steps. For example, the sentence "Turn off the lights" is an order or request to switch off the lights.


Regular Expressions (RE)


Regular expression (RE): a formula (in a special language) that is used for specifying simple classes of strings.

String: a sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).

REs can be used to specify search strings as well as to define a language in a formal way.

Search requires a pattern to search for, and a corpus of texts to search through.

■ Search through the corpus and return all texts that contain the pattern.
RE Patterns


The search string can consist of a single character or a sequence of characters.

RE               String matched
/woodchucks/     "interesting links to woodchucks and lemurs"
/a/              "Sarah Ali stopped by Mona's"
/Alice says,/    ""My gift please," Alice says,"
/book/           "all our pretty books"
/!/              ""Leave him behind!" said Sam"


RE Disjunctions


Regular Expressions are case sensitive.

The string of characters inside [ ] specify a disjunction of
characters to match.
RE Range


How to conveniently specify any capital letters ?

Use brackets [ ] with the dash (-) to specify any one character in a
range

[2-5] – specifies any one of 2, 3, 4, or 5
RE Negation


Uses of the caret ^ for negation or just to mean ^

^ symbol is first after open square brace [ , the resulting pattern is
negated
RE Kleene Star

Regular expressions allow repetition of things.

Kleene star (*) - zero or more occurrences of the previous character or expression.

Kleene * ----- /baaa*!/ --- matches baa!, baaa!, baaaa!, ...

Kleene plus (+) - one or more occurrences of the previous character.

Kleene + ---- /[0-9]+/ specifies "a sequence of digits"

Use the period /./ to specify any character - a wildcard that matches any single character (except a carriage return).
RE Kleene Star

RE           Description
/a*/         Zero or more a's
/a+/         One or more a's
/a?/         Zero or one a's
/cat|dog/    'cat' or 'dog'
/^cat$/      A line containing only 'cat'
/\bun\B/     Beginnings of longer strings that start with 'un'
RE Anchors, Boundaries


The caret ^ matches the start of a line.

The dollar sign $ matches the end of a line.


Ex: /^The boat\.$/ matches a line that contains The boat.

\b matches a word boundary while \B matches a non-boundary

Ex: /\b55\b/ matches the string: There are 55 bottles of honey
but not There are 255 bottles of honey
RE Disjunction, Grouping

The pipe symbol | is called the disjunction operator.

Example: /food|wood/ matches either the string food or the string wood.

What is the pattern for matching both the string puppy and the string puppies?

/puppy|ies/ --> matches the strings puppy and ies, hence wrong: the sequence puppy takes precedence over the pipe operator.

Use parentheses ( and ) to make the disjunction ( | ) apply only to a specific pattern:

/pupp(y|ies)/ --> matches the strings puppy and puppies.
RE Operator Precedence


Kleene* operator applies by default only to a single character, not a
whole sequence.


Ex: Write a pattern to match the string:
Column 1 Column 2 Column 3


/Column_[09]+_*/ matches a column followed by any number of
spaces


The star applies only to the space _ that precedes it, not a whole
sequence


/(Column_[09]+_)*/ --> match the word Column followed by a
number, the whole pattern repeated any number of times
RE Operator Precedence

Operator precedence, from highest to lowest:

1. Parenthesis             ( )
2. Counters                * + ? { }
3. Sequences and anchors   the ^my end$
4. Disjunction             |

Counters have higher precedence than sequences:
■ /the*/ matches theeeee but not thethe

Sequences have higher precedence than disjunction:
■ /cooky|ies/ matches cooky or ies but not cookies
RE – A Simple Example

Write a RE to match the English article "the" in the following text:
the
The
the124
@the_
The – new line

/the/       missed 'The'

/[tT]he/    fixes the case, but also matches the 'the' embedded in words like 'others'; include a word boundary

/\b[tT]he\b/    in Perl, a word is a sequence of letters, digits and underscores, so this misses 'the25' and 'the_', where we still need to find 'the'

/[^a-zA-Z][tT]he[^a-zA-Z]/    makes sure there are no alphabetic letters on either side of the. Issue: it won't find the word 'The' when it begins a line.

Specify that before the 'the' we require either the beginning of line or a non-alphabetic character, and the same at the end:

/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
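A quick way to sanity-check the final pattern is to run it with Python's re module (a minimal sketch; the test strings are the ones from the slide, and Python simply drops the surrounding /.../ delimiters):

# Testing the final 'the' pattern with Python's re module.
import re

pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"
tests = ["the", "The", "the124", "@the_", "others", "theology"]

for s in tests:
    found = re.search(pattern, s)
    print(s, "->", "match" if found else "no match")
# the, The, the124 and @the_ match; others and theology do not.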
RE – A Complex Example

Exercise: Write a regular expression that will match
■ "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000"

First consider a RE for prices:

/$[0-9]+/                        # whole dollars

What about $155.55? Deal with fractions of dollars:

/$[0-9]+\.[0-9][0-9]/            # fractions of dollars - but this only allows $155.55, not $155

Make the cents optional and add word boundaries:

/$[0-9]+(\.[0-9][0-9])?/         # cents optional
/\b$[0-9]+(\.[0-9][0-9])?\b/     # word boundaries

Specification for processor speed (in megahertz = MHz or gigahertz = GHz):

/\b[0-9]+_*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/

where /_*/ means "zero or more spaces" (the underscore marks a space).

Memory size:

/\b[0-9]+_*(Mb|[Mm]egabytes?)\b/

Allow gigabyte fractions like 5.5 Gb:

/\b[0-9](\.[0-9]+)?_*(Gb|[Gg]igabytes?)\b/

Operating system and vendor:

/\b(Win95|Win98|WinNT|Windows_*(NT|95|98|2000)?)\b/

/\b(Mac|Macintosh|Apple)\b/
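These patterns can also be tried out in Python. One practical wrinkle the slides gloss over: $ is itself a regex anchor, so a literal dollar sign must be escaped as \$ in real code, and the slide's visible-space mark _ becomes an ordinary space (a hedged sketch):

# Trying the price and processor-speed patterns on the example ad.
import re

price = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?\b")
speed = re.compile(r"\b[0-9]+ *(?:MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")

ad = "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000"
print(price.findall(ad))   # ['$1000']
print(speed.findall(ad))   # ['500 MHz']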
Morphological Analysis (Morphological Parsing)

• The goal of morphological parsing is to find out what morphemes a given word is built from. For example, a morphological parser should be able to tell us that the word cats is the plural form of the noun stem cat, and that the word mice is the plural form of the noun stem mouse. So, given the string cats as input, a morphological parser should produce an output that looks similar to cat N PL.


Morphological Analysis (Morphological Parsing)

• Morphological parsing yields information that is useful in many NLP applications.


In parsing, e.g., it helps to know the agreement features of words. Similarly,
grammar checkers need to know agreement information to detect such mistakes.

• But morphological information also helps spell checkers to decide whether


something is a possible word or not, and in information retrieval it is used to
search not only cats, if that's the user's input, but also for cat.



Pipeline of NLP in AI /Steps in NLP
1. Tokenization
2. Stemming
3. Lemmatization
4. POS tags
5. Named Entity Recognition
6. Chunking



Tokenization
• Tokenization processes the strings into tokens. Tokens are small structures or units.
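A minimal tokenization sketch with NLTK (assuming the package and its 'punkt' tokenizer data are installed):

# Tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)

sentence = "Natural language processing enables computers to understand text."
print(nltk.word_tokenize(sentence))
# ['Natural', 'language', 'processing', 'enables', 'computers', 'to',
#  'understand', 'text', '.']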



Stemming

• Stemming normalizes words into their base or root form.

• A stemming algorithm works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.

• For example, affects, affected and affecting all reduce to the single root word affect.
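A sketch of stemming with NLTK's Porter stemmer:

# Stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["affects", "affected", "affecting"]:
    print(word, "->", stemmer.stem(word))
# affects -> affect, affected -> affect, affecting -> affect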


Lemmatization

• Lemmatization performs a morphological analysis of the word; to do so, it is necessary to have a detailed dictionary that the algorithm can look through to link the original word back to its root word.

• It is somewhat similar to stemming, as it maps several words onto one common root.

• The output of lemmatization is a proper word.

• For example, a lemmatizer should map gone, going and went to go.
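A sketch of lemmatization with NLTK's WordNet lemmatizer (assuming the 'wordnet' data has been downloaded); note that the part of speech must be supplied for verbs:

# Lemmatization with NLTK's WordNet lemmatizer.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

for word in ["gone", "going", "went"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # 'v' = verb
# gone -> go, going -> go, went -> go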
POS tags

• Broadly speaking, the grammatical type of a word is referred to as its POS tag (part of speech).
• Parts of speech include nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions and interjections.
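A sketch of POS tagging with NLTK (assuming the 'averaged_perceptron_tagger' data has been downloaded):

# POS tagging with NLTK's default tagger.
import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#  ('the', 'DT'), ('mat', 'NN')]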


Named Entity Recognition

• Named Entity Recognition is the process of detecting named entities such as person names, company names and locations; that is, phrase identification.
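A sketch of NER with spaCy (assuming spacy is installed and its small English model has been downloaded with: python -m spacy download en_core_web_sm; the exact entities found depend on the model):

# Named Entity Recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai is the CEO of Google, headquartered in California.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Sundar Pichai -> PERSON, Google -> ORG, California -> GPE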


Chunking
• Chunking is picking up individual pieces of information and grouping them into bigger pieces.
• It is a grouping of words or tokens.
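A sketch of noun-phrase chunking with NLTK's regular-expression chunker, grouping POS-tagged tokens into bigger pieces:

# Chunking POS-tagged tokens into noun phrases (NP).
import nltk

tagged = [("The", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# NP = optional determiner, any number of adjectives, then a noun
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(tagged))
# (S (NP The/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))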


Need of feature extraction techniques
• Machine Learning algorithms learn from a pre-defined set of
features from the training data to produce output for the test
data.
• But the main problem in working with language processing is
that machine learning algorithms cannot work on the raw text
directly.
• So, we convert text into a matrix (or vector) of features.
• Popular methods of feature extraction are :
• Bag-of-Words

• TF-IDF
Bag of Words:
• It represents a text document as a multiset of its words,
disregarding grammar and word order, but keeping the frequency
of words.
• This representation is useful for tasks such as text classification,
document similarity, and text clustering.
• To transform tokens into a set of features.
• In document classification, For example, in a task of review based
sentiment analysis, the presence of words like ‘fabulous’,
‘excellent’ indicates a positive review, while words
like ‘annoying’, ‘poor’ point to a negative review



Bag of Words:
• There are 3 steps in creating a BoW model:
• The first step is text preprocessing, which involves:
  - converting the entire text into lower-case characters;
  - removing all punctuation and unnecessary symbols.
• The second step is to create a vocabulary of all unique words from the corpus. Let's suppose we have a set of review texts.
• Example: ("good movie", "not a good movie", "did not like")


Bag of Words:
• Take all the unique words from the above set of reviews to create a vocabulary, which is going to be as follows:
  {good, movie, not, a, did, like}
• The third step is to create a matrix of features by assigning a separate column to each word, while each row corresponds to a review. This process is known as Text Vectorization.
• Each entry in the matrix signifies the presence (or absence) of the word in the review. We put 1 if the word is present in the review, and 0 if it is not present.
• Unigram features do not preserve word order.
• Bigram features preserve some order, but the feature set turns out to be very large (complex to process) and sparse.
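A sketch of these three steps with scikit-learn's CountVectorizer, using the three reviews above (the token_pattern override is only there to keep the one-letter word 'a', which the default pattern drops):

# Bag of Words with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good movie", "not a good movie", "did not like"]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
# ['a' 'did' 'good' 'like' 'movie' 'not']
print(X.toarray())
# [[0 0 1 0 1 0]
#  [1 0 1 0 1 1]
#  [0 1 0 1 0 1]]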
Issues of Bag of Words:
1. High dimensionality: The resulting feature space can be very high dimensional, which may
lead to issues with overfitting and computational efficiency.

2. Lack of context information: The bag of words model only considers the frequency of words
in a document, disregarding grammar, word order, and context.

3. Insensitivity to word associations: The bag of words model doesn’t consider the associations
between words, and the semantic relationships between words in a document.

4. Lack of semantic information: As the bag of words model only considers individual words, it
does not capture semantic relationships or the meaning of words in context.

5. Importance of stop words: Stop words, such as “the”, “and”, “a”, etc., can have a large impact
on the bag of words representation of a document, even though they may not carry much
meaning.

6. Sparsity: For many applications, the bag of words representation of a document can be very
sparse, meaning that most entries in the resulting feature vector will be zero. This can lead
to issues with computational efficiency and difficulty in interpretability.



TF-IDF Vectorizer :

• TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure.
• The basic idea is that a word that occurs frequently in a document but rarely in the entire corpus is more informative than a word that occurs frequently in both the document and the corpus.
• TF-IDF is used for:
1. Text retrieval and information retrieval systems
2. Document classification and text categorization
3. Text summarization
4. Feature extraction for text data in machine learning algorithms.
TF-IDF Vectorizer :

• Term Frequency (TF): term frequency specifies how frequently a term appears in the entire document.
• It can be thought of as the probability of finding a word within the document.
• It calculates the number of times a word w_i occurs in a review r_j, with respect to the total number of words in the review r_j. It is formulated as:

  TF(w_i, r_j) = (number of times w_i occurs in r_j) / (total number of words in r_j)


TF-IDF Vectorizer :

• Inverse Document Frequency (IDF): a measure of whether a term is rare or frequent across the documents in the entire corpus.
• It highlights those words which occur in very few documents across the corpus; in simple language, words that are rare have a high IDF score.
• IDF is a log-normalised value:

  IDF(w_i) = log(N / df(w_i))

  where N is the total number of documents (reviews) in the corpus and df(w_i) is the number of documents containing w_i.
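A sketch with scikit-learn's TfidfVectorizer on the same three reviews (note that scikit-learn uses a slightly smoothed variant of the IDF formula above):

# TF-IDF features with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["good movie", "not a good movie", "did not like"]

vectorizer = TfidfVectorizer(token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # rarer words such as 'did' and 'like' get higher weights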


Why is NLP hard?

• Ambiguity and variability of linguistic expression:
  - variability: many forms can mean the same thing;
  - ambiguity: one form can mean many things.
• There are many different kinds of ambiguity.
• Each NLP task has to address a distinct set of them.


Steps – Morphology – Syntax – Semantics

• Phoneme - the smallest individual unit of language; a single distinguishable sound.
• Morphology - word formation.


Morphology
• Morphology helps linguists understand the structure of words by putting together morphemes.

• A morpheme is the smallest grammatical, meaningful part of language.

• There are 2 types: free morphemes and bound morphemes.

• A free morpheme is a single meaningful unit of a word that can stand alone in the language. For example: cat, mat, trust, slow.

• A bound morpheme cannot stand alone; it has no real meaning if it is on its own. For example, in walked, (ed) cannot stand alone, and in unpleasant, (un) is not a stand-alone morpheme. Bound morphemes typically appear as prefixes and suffixes.

• Bound morphemes can be grouped into a further two categories:

1. Derivational 2. Inflectional


Derivational

• A derivational morpheme is added to the base form of a word to create a new word.

• Look at the word able and let it become ability: in this instance the adjective becomes a noun.

• The word send as a verb morpheme becomes sender, a noun, with the addition of er.

• Meanwhile, stable to unstable changes the meaning of the word to its opposite.

• In other words, the meaning of the word is completely changed by adding a derivational morpheme to a base word.


Inflectional

• Inflectional morphemes are additions to the base word that do not change the word, but rather serve as grammatical indicators. They show grammatical information. For example:

1. Laugh becomes the past tense laughed by adding ed.

2. Dog to dogs changes the word from singular to plural.

3. Swim to swimming changes the verb into a progressive verb.

• All these examples show how morphology participates in the study of linguistics.


Parts of Speech Tagging

• We introduce the task of part-of-speech tagging: taking a sequence of words and assigning each word a part of speech like NOUN or VERB.
• A related task is named entity recognition (NER): assigning words or phrases tags like PERSON, LOCATION, or ORGANIZATION.


Introductions- Part-of-Speech Tagging

• Part-of-speech tagging is the process of assigning a part-of-speech to each word in a text.
• The input is a sequence x1, x2, ..., xn of (tokenized) words and a tagset, and
• the output is a sequence y1, y2, ..., yn of tags,
• each output yi corresponding exactly to one input xi.
Introduction
• Tagging is a disambiguation task; words are ambiguous - they have more than one possible part-of-speech - and the goal is to find the correct tag for the situation.

• For example, book can be a verb (book that flight) or a noun (hand me that book).

• That can be a determiner (Does that flight serve dinner) or a complementizer (I thought that your flight was earlier).

• The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
Introduction to POS Tagging

• Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text is
labeled with its corresponding part of speech.
• This can include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity recognition, and
machine translation.
• It can also be used to identify the grammatical structure of a sentence and to disambiguate words that have
multiple meanings.
• example,
• Text: “The cat sat on the mat.”
• POS tags:
• The: determiner , cat: noun , sat: verb , on: preposition , the: determiner ,mat: noun



What is Part-of-speech (POS) tagging ?

• It is a process of converting a sentence to other forms - a list of words, or a list of tuples (where each tuple has the form (word, tag)).
• The tag in this case is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech     Tag
Noun (Singular)    NN
Noun (Plural)      NNS
Verb               VB
Determiner         DT
Adjective          JJ
Adverb             RB


Universal Part-of-Speech Tagset

Tag    Meaning               English Examples
ADJ    adjective             new, good, high, special, big, local
ADP    adposition            on, of, at, with, by, into, under
ADV    adverb                really, already, still, early, now
CONJ   conjunction           and, or, but, if, while, although
DET    determiner, article   the, a, some, most, every, no, which
NOUN   noun                  year, home, costs, time, Africa
NUM    numeral               twenty-four, fourth, 1991, 14:24
PRT    particle              at, on, out, over, per, that, up, with
PRON   pronoun               he, their, her, its, my, I, us
VERB   verb                  is, say, told, given, playing, would
.      punctuation marks     . , ; !
X      other                 ersatz, esprit, dunno, gr8, univeristy
Use of Parts of Speech Tagging in NLP

1. To understand the grammatical structure of a sentence: By labeling each


word with its POS, we can better understand the syntax and structure of a
sentence.
2. To disambiguate words with multiple meanings: Some words, such as
“bank,” can have multiple meanings depending on the context in which they
are used.
3. To improve the accuracy of NLP tasks: POS tagging can help improve the
performance of various NLP tasks, such as named entity recognition and
text classification.
4. To facilitate research in linguistics: POS tagging can also be used to study
the patterns and characteristics of language use and to gain insights into the
structure and function of different parts of speech.
Steps Involved in the POS tagging

• Collect a dataset of annotated text: this dataset will be used to train and test the POS tagger. The text should be annotated with the correct POS tags for each word.
• Preprocess the text: this may include tasks such as tokenization (splitting the text into individual words), lowercasing, and removing punctuation.
• Divide the dataset into training and testing sets: the training set will be used to train the POS tagger, and the testing set will be used to evaluate its performance.
• Train the POS tagger: this may involve building a statistical model, such as a hidden Markov model (HMM), or defining a set of rules for a rule-based or transformation-based tagger. The model or rules will be trained on the annotated text in the training set.
• Test the POS tagger: use the trained model or rules to predict the POS tags of the words in the testing set. Compare the predicted tags to the true tags and calculate metrics such as precision and recall to evaluate the performance of the tagger.
• Fine-tune the POS tagger: if the performance of the tagger is not satisfactory, adjust the model or rules and repeat the training and testing process until the desired level of accuracy is achieved.
• Use the POS tagger: once the tagger is trained and tested, it can be used to perform POS tagging on new, unseen text.
Different POS Tagging Techniques

1. Rule-Based POS Tagging

• This is one of the oldest approaches to POS tagging.
• It involves using a dictionary consisting of all the possible POS tags for a given word.
• If any of the words have more than one tag, hand-written rules are used to assign the correct tag based on the tags of surrounding words.
• For example, if the word preceding a word is an article, then the word has to be a noun.
• Consider the words: A Book
• Get all the possible POS tags for the individual words: A - Article; Book - Noun or Verb
• Use the rules to assign the correct POS tag: as per the possible tags, "A" is an Article and we can assign it directly. But "book" can be either a Noun or a Verb. However, if we consider "A Book", A is an article and, following our rule above, Book has to be a Noun. Thus, we assign the tag Noun to book.
• POS Tag: [("A", "Article"), ("Book", "Noun")]
Rule-based POS Tagging

These rules may be either:
• Context-pattern rules, or
• Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.

Rule-based POS tagging has a two-stage architecture:
• First stage - it uses a dictionary to assign each word a list of potential parts-of-speech.
• Second stage - it uses large lists of hand-written disambiguation rules to sort the list down to a single part-of-speech for each word.


Properties of Rule-Based POS Tagging

• These taggers are knowledge-driven taggers.
• The rules in rule-based POS tagging are built manually.
• The information is coded in the form of rules.
• There is a limited number of rules, approximately around 1000.
• Smoothing and language modeling are defined explicitly in rule-based taggers.


Rule-based POS tagger Example

1. Define a set of rules for assigning POS tags to words. For example:
• If the word ends in “-tion,” assign the tag “noun.”
• If the word ends in “-ment,” assign the tag “noun.”
• If the word is all uppercase, assign the tag “proper noun.”
• If the word is a verb ending in “-ing,” assign the tag “verb.”
2. Iterate through the words in the text and apply the rules to each word in turn. For example:
• “Nation” would be tagged as “noun” based on the first rule.
• “Investment” would be tagged as “noun” based on the second rule.
• “UNITED” would be tagged as “proper noun” based on the third rule.
• “Running” would be tagged as “verb” based on the fourth rule.
3. Output the POS tags for each word in the text.
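A minimal sketch of this rule set in Python (the rules and test words are the ones above; the function name and the fallback tag are illustrative):

# A toy rule-based tagger implementing the four rules above.
def rule_based_tag(word):
    if word.endswith("tion") or word.endswith("ment"):
        return "noun"
    if word.isupper():
        return "proper noun"
    if word.endswith("ing"):        # simplification: treat -ing words as verbs
        return "verb"
    return "unknown"                # no rule fired

for w in ["Nation", "Investment", "UNITED", "Running"]:
    print(w, "->", rule_based_tag(w))
# Nation -> noun, Investment -> noun, UNITED -> proper noun, Running -> verb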



Stochastic POS Tagging

Another technique of tagging is Stochastic POS Tagging. The question that arises here is which model can be called stochastic: a model that includes frequency or probability (statistics) can be called stochastic. Any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging.

The simplest stochastic taggers apply the following approaches to POS tagging:

Word Frequency Approach
In this approach, the stochastic taggers disambiguate the words based on the probability that a word occurs with a particular tag. We can also say that the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. The main issue with this approach is that it may yield an inadmissible sequence of tags.

Tag Sequence Probabilities
This is another approach to stochastic tagging, where the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, because the best tag for a given word is determined by the probability with which it occurs with the n previous tags.


Statistical POS Tagging

• Statistical part-of-speech (POS) tagging is a method of labeling words with their


corresponding parts of speech using statistical techniques.
• This is in contrast to rule-based POS tagging, which relies on pre-defined rules,
and to unsupervised learning-based POS tagging, which does not use any
annotated training data.
• In statistical POS tagging, a model is trained on a large annotated corpus of text
to learn the patterns and characteristics of different parts of speech.
• The model uses this training data to predict the POS tag of a given word based
on the context in which it appears and the probability of different POS tags
occurring in that context.
• Statistical POS taggers can be more accurate and efficient than rule-based
taggers, especially for tasks with large or complex datasets.
Transformation-based tagging (TBT)

• Transformation-based tagging (TBT) is a method of part-of-speech (POS) tagging


that uses a series of rules to transform the tags of words in a text.

• This is in contrast to rule-based POS tagging, which assigns tags to words based
on pre-defined rules, and to statistical POS tagging, which relies on a trained
model to predict tags based on probability.



Working Principles

• Here is an example of how a TBT system might work:


1. Define a set of rules for transforming the tags of words in the text. For example:
• If the word is a verb and appears after a determiner, change the tag to “noun.”
• If the word is a noun and appears after an adjective, change the tag to “adjective.”
2. Iterate through the words in the text and apply the rules in a specific order. For example:
• In the sentence “The cat sat on the mat,” the word “sat” would be changed from a verb to a noun
based on the first rule.
• In the sentence “The red cat sat on the mat,” the word “red” would be changed from an adjective to
a noun based on the second rule.
3. Output the transformed tags for each word in the text.



Hidden Markov Model POS tagging

• Hidden Markov models (HMMs) are a type of statistical model that


can be used for part-of-speech (POS) tagging in natural language
processing (NLP).
• In an HMM-based POS tagger, a model is trained on a large annotated
corpus of text to learn the patterns and characteristics of different parts
of speech.
• The model uses this training data to predict the POS tag of a given
word based on the probability of different tags occurring in the context
of the word.
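As a hedged sketch, NLTK ships a supervised HMM trainer that can be fitted on an annotated corpus such as the Penn Treebank sample (assuming the 'treebank' corpus data has been downloaded; in older NLTK versions the accuracy() method is called evaluate()):

# Training an HMM POS tagger on NLTK's treebank sample.
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download("treebank", quiet=True)

sents = treebank.tagged_sents()
train, test = sents[:3000], sents[3000:]

tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)

print(tagger.tag(["The", "cat", "sat", "on", "the", "mat"]))
print("accuracy:", tagger.accuracy(test))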



Challenges in POS Tagging

• Ambiguity: Some words can have multiple POS tags depending on the context in which they appear,
making it difficult to determine their correct tag. For example, the word “bass” can be a noun (a type of
fish) or an adjective (having a low frequency or pitch).

• Out-of-vocabulary (OOV) words: Words that are not present in the training data of a POS tagger can
be difficult to tag accurately, especially if they are rare or specific to a particular domain.

• Complex grammatical structures: Languages with complex grammatical structures, such as languages
with many inflections or free word order, can be more challenging to tag accurately.

• Lack of annotated training data: Some languages or domains may have limited annotated training
data, making it difficult to train a high-performing POS tagger.

• Inconsistencies in annotated data: Annotated data can sometimes contain errors or inconsistencies,
which can negatively impact the performance of a POS tagger.
STUDENT EVALUATION

1) What is the main challenge of NLP?
   A. Handling Tokenization   B. Handling POS-Tagging
   C. Handling Ambiguity of Sentences   D. None of the above
2) All of the following are challenges associated with natural language processing except
   A. dividing up a text into individual words in English   B. understanding the context in which something is said
   C. recognizing typographical or grammatical errors in texts   D. distinguishing between words that have more than one meaning
3) In linguistic morphology, _____________ is the process of reducing inflected words to their root form.
   A. Stemming   B. Rooting
   C. Text-Proofing   D. Both A and B
4) Morphological segmentation
   A. is an extension of propositional logic
   B. does discourse analysis
   C. separates words into individual morphemes and identifies the class of the morphemes
   D. none of the mentioned
What are MWEs?
• Multiword expressions (MWEs) are expressions which are
made up of at least 2 words and which can be syntactically
and/or semantically idiosyncratic in nature.
• Sequence of words that has lexical, orthographic, phonological,
morphological, syntactic, semantic, pragmatic or translational
properties not predictable from the individual components or their
normal mode of combination



MWE
An MWE is a "sequence of words" with unpredictable properties; each part of that definition deserves unpacking:
• "sequence of"
  - not necessarily contiguous in a concrete utterance;
  - not necessarily always in the same order in each utterance.
• "words"
  - ambiguity between type and token (intentional);
  - inflected word form vs. lemma;
  - ambiguity between character sequences separated from other character sequences by spaces and other separators (narrow interpretation) and abstract lexical units of the grammar (broad interpretation).
• "that has properties not predictable from the individual components and their normal mode of combination"


MWETokenizer

• The multi-word expression tokenizer is a rule-based, “add-on”


tokenizer offered by NLTK. Once the text has been tokenized by
a tokenizer of choice, some tokens can be re-grouped into
multi-word expressions.
• For example, the name Martha Jones is combined into a single
token instead of being broken into two tokens. This tokenizer is
very flexible since it is agnostic of the base tokenizer that was
used to generate the tokens.
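A sketch of the Martha Jones example with NLTK's MWETokenizer (the default separator is '_'; here a space is used so the merged token reads naturally):

# Re-grouping tokens into a multi-word expression with NLTK.
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer([("Martha", "Jones")], separator=" ")

base_tokens = "Martha Jones met the doctor yesterday".split()
print(tokenizer.tokenize(base_tokens))
# ['Martha Jones', 'met', 'the', 'doctor', 'yesterday']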



Student Evaluation
1. N-grams are defined as combinations of N keywords together. How many bi-grams can be generated from the given sentence: "The Father of our nation is Mahatma Gandhiji"?
   A. 8   B. 9
   C. 7   D. 4
2. It is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.
   A. Statistical Language Modelling   B. Probabilistic Language Modelling
   C. Neural Language Modelling   D. Natural Language Understanding
3. It is a measure of how well a probability distribution predicts a sample.
   A. Entropy   B. Perplexity
   C. Cross-Entropy   D. Information Gain
4. Which Python libraries are used in NLP?
   A. Pandas   B. NLTK
   C. Spacy   D. All the mentioned above


Vector representation of words in NLP

• In Natural Language Processing (NLP), vector representation of words is a crucial concept used to convert words into numerical vectors. These vector representations are also known as word embeddings.
• Word embeddings are essential because they capture semantic and syntactic relationships between words and enable machine learning models to process and understand natural language text.
• There are several methods for creating word embeddings; some commonly used techniques are described below.
• One-Hot Encoding: in one-hot encoding, each word in the vocabulary is represented as a binary vector where all elements are zero except for the index corresponding to the word's position in the vocabulary, which is set to 1. This method creates sparse, high-dimensional vectors that lack meaningful semantic relationships between words.
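A minimal one-hot encoding sketch over a tiny illustrative vocabulary:

# One-hot encoding: a 1 only at the word's position in the vocabulary.
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print("cat ->", one_hot("cat"))   # cat -> [1 0 0 0]
print("mat ->", one_hot("mat"))   # mat -> [0 0 1 0]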
Techniques used

• Word2Vec: Word2Vec is a popular word embedding technique that learns continuous word
representations from large amounts of text data. It offers two algorithms: Continuous Bag of Words
(CBOW) and Skip-gram. These models generate dense word vectors that capture semantic
similarities between words based on their context.
• GloVe (Global Vectors for Word Representation): GloVe is another widely used method for
learning word embeddings. It combines the global co-occurrence statistics of words in a corpus to
create word vectors. GloVe embeddings capture both semantic and syntactic relationships between
words.
• FastText: FastText is an extension of Word2Vec that represents each word as a bag of character n-grams. It can generate word embeddings for out-of-vocabulary words based on their character-level information, making it useful for handling misspellings and rare words.


Word Embedding Techniques (continued)

• BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based


model that generates contextual word embeddings. Unlike traditional methods that generate static
embeddings, BERT considers the context of the word within the sentence, producing highly
contextualized word representations.
• ELMo (Embeddings from Language Models): ELMo is another contextual word embedding model
that uses a bi-directional language model. It generates word embeddings based on the entire context of
the sentence, capturing the polysemy (multiple meanings) of words.
• ULMFiT (Universal Language Model Fine-tuning): ULMFiT is a transfer learning approach for NLP
that utilizes pre-trained language models to fine-tune embeddings for specific downstream tasks. It
enables efficient training on smaller datasets.
• These word embeddings can be used as input features for various NLP tasks, such as sentiment analysis,
machine translation, named entity recognition, and more.
• They help improve the performance of NLP models by providing a more compact and meaningful
representation of words in numerical form.



Word2Vec Example

• Let's demonstrate a simple example of word embeddings using Word2Vec, one of the popular
techniques for learning word representations. For this example, we will use a small dataset of movie
reviews and create word embeddings using the Word2Vec algorithm.
• Step 1: Preprocess the Data Suppose we have the following movie reviews:
1. "The movie was fantastic, with amazing special effects."
2. "The plot was engaging and kept me hooked till the end."
3. "The acting was superb, especially by the lead actor."
4. "The film had stunning visuals and great cinematography."
• We need to preprocess the data by tokenizing the sentences and converting the text to lowercase
• Step 2: Train Word2Vec Model Next, we train a Word2Vec model using the tokenized reviews
• Step 3: Retrieve Word Embeddings Now, we can access the word embeddings for specific words
using the trained Word2Vec model
• Step 4: Similar Words We can also find words similar to a given word based on their embeddings
• Step 5: Word Similarity Additionally, we can measure the similarity between two words
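A hedged sketch of these five steps with gensim (assuming gensim is installed; with such a tiny corpus the exact vectors and similarity scores will vary from run to run):

# Word2Vec on the four movie reviews above, using gensim.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

reviews = [
    "The movie was fantastic, with amazing special effects.",
    "The plot was engaging and kept me hooked till the end.",
    "The acting was superb, especially by the lead actor.",
    "The film had stunning visuals and great cinematography.",
]

# Step 1: preprocess - lowercase and tokenize each review
tokenized = [simple_preprocess(r) for r in reviews]

# Step 2: train a small Word2Vec model
model = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, epochs=100)

# Step 3: retrieve a word embedding
print(model.wv["movie"][:5])

# Step 4: words similar to a given word
print(model.wv.most_similar("fantastic", topn=3))

# Step 5: similarity between two words
print(model.wv.similarity("fantastic", "amazing"))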
Word2Vec Example (continued)

• The resulting word embeddings and similarity scores will depend on the specific
corpus and the number of training iterations, but they should capture the semantic
relationships between words based on their context in the reviews.
• For instance, "fantastic" and "amazing" are likely to have a high similarity score, as
they both frequently appear together in positive contexts in the dataset. Similarly,
"plot" and "visuals" might also have a reasonable similarity score if they co-occur
in sentences discussing movie elements.



Language modelling in NLP

• Language modeling is a fundamental task in Natural Language Processing (NLP) that involves building a statistical model to predict the probability distribution of words in a given language.

• The language model learns the patterns and relationships between words in a corpus of text and can be used to generate new text, evaluate the likelihood of a sentence, and perform speech recognition or machine translation.

• In language modeling, the primary goal is to estimate the probability of a sequence of words (a sentence or a phrase) using the conditional probability of each word given its preceding context.

• The model learns from large amounts of text data to predict the likelihood of a particular word given the previous words in a sentence.


Language modelling in NLP

There are different types of language models; two prominent approaches are:
• N-gram Language Models
• Neural Language Models

N-gram Language Models: N-gram language models are simple and were widely used in early NLP tasks. An N-gram model predicts the probability of a word based on the previous (N-1) words in a sentence. For example, a trigram model (3-gram) predicts the probability of a word given the two preceding words. The model estimates the probabilities based on the frequency of word sequences observed in the training data. A sketch of this counting approach follows below.

Neural Language Models: neural language models, such as recurrent neural networks (RNNs) and transformer-based models, have gained significant popularity in recent years due to their ability to capture long-range dependencies and contextual information. These models learn complex patterns in the language and can generate more coherent and contextually relevant text.
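A minimal sketch of the counting approach behind an N-gram model, here a bigram (2-gram) model over a two-sentence toy corpus:

# A toy bigram language model: P(word | previous word) from raw counts.
from collections import Counter, defaultdict

corpus = ["the cat sat on the mat", "the cat ate the fish"]

bigrams = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(words, words[1:]):
        bigrams[prev][word] += 1

def prob(word, prev):
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total if total else 0.0

print(prob("cat", "the"))   # 2 of the 4 'the' continuations are 'cat' -> 0.5
print(prob("sat", "cat"))   # 1 of the 2 'cat' continuations is 'sat' -> 0.5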


Language modelling in NLP

• Recurrent Neural Networks (RNNs): RNNs are a class of neural networks designed for sequential
data processing. They process input sequences step by step, maintaining a hidden state that captures
information from previous steps.
• This hidden state acts as the context for the current word prediction. However, RNNs have
challenges with capturing long-range dependencies and can suffer from vanishing or exploding
gradients.
• Transformer-based Models: Transformer-based models, like the famous BERT (Bidirectional
Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) series,
have revolutionized language modeling.



REFERENCE

1) Stanford slides covering all topics:
• https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
2) E-book which can be followed:
• https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
3) Channel by Dan Jurafsky and Christopher Manning, where they teach each topic from the ground up:
• https://www.youtube.com/watch?v=808M7q8QX0E&list=PLaZQkZp6WhWyvdiP49JG-rjyTPck_hvEu
4) https://www.shiksha.com/online-courses/articles/pos-tagging-in-nlp/
5) https://web.stanford.edu/~jurafsky/slp3/
