Unit I Introduction
Processing
NLP
Modern conversational agents can:
• Answer questions
• Book flights
• Find restaurants
• Perform functions for which they rely on a much more sophisticated understanding of the user’s intent
Survey Analysis
Text Classification
Language Modelling
Information Extraction
Information Retrieval
Conversational Agents
Text Summarization
Question Answering
Machine Translation
Topic Modelling
Speech Recognition
Origins of NLP
Syntactic category
Duck can be a noun or verb
Her can be possessive or dative pronoun
Word meaning
Make can mean create or cook
Why is NLP Hard?
Ambiguities
Ambiguity is Pervasive
Lexical Ambiguity
Ambiguity is Explosive
Lexical Ambiguity
Neologisms
Non-standard use of English in social media
Unfriend
i.e., the 50th most common word should occur with 3 times the frequency
of the 150th most common word
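The statement above follows from Zipf's law, under which a word's frequency is roughly inversely proportional to its rank. A minimal sketch of that idealized relationship follows; the function name `zipf_frequency` and its parameters are illustrative assumptions, and real corpora only approximate this curve.

```python
def zipf_frequency(rank, top_frequency=1.0, exponent=1.0):
    """Idealized Zipf frequency f(r) = C / r**s (hypothetical helper).

    top_frequency (C) is the frequency of the rank-1 word and
    exponent (s) is 1.0 in the classic form of the law.
    """
    return top_frequency / rank ** exponent


# Ratio between the 50th and the 150th most common word.
ratio = zipf_frequency(50) / zipf_frequency(150)
print(round(ratio, 6))  # approximately 3
```

With the exponent fixed at 1, the ratio depends only on the two ranks (150 / 50), not on the constant C.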
Empirical evaluation from Tom Sawyer
Empirical Laws
Zipf’s Other laws
Empirical Laws
Heaps’ Law
Words – What counts as a word?
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
Tokens are the total number N of running words.
Ignore punctuation and find the number of tokens and types in the following sentence:
They picnicked by the pool, then lay back on the grass and looked at the stars
16 tokens, 14 types
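The count above can be checked with a short sketch. The helper name `count_tokens_and_types` is an assumption made for illustration. It strips ASCII punctuation and splits on whitespace, which is enough for this sentence but is not a general-purpose tokenizer. Words are compared case-sensitively here; lowercasing first would not change the result for this sentence.

```python
import string


def count_tokens_and_types(sentence):
    """Return (number of tokens N, number of types |V|) for a sentence."""
    # Remove punctuation characters, then split on whitespace.
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    # Types are the distinct words among the tokens.
    types = set(tokens)
    return len(tokens), len(types)


sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars")
n_tokens, n_types = count_tokens_and_types(sentence)
print(n_tokens, n_types)  # 16 14
```

The word "the" occurs three times, which is why there are two fewer types than tokens.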
Notion of Corpus:
Corpora
• It does not follow linguistic rules; rather, it uses a set of 5 rules for different cases that are applied in phases to generate stems.
Create a function that takes a sentence and returns the stemmed sentence.
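One possible shape for such a function is sketched below. It deliberately uses a toy suffix-stripping stemmer, not the phased rule set described above and not any library's stemmer. The rule table `SUFFIX_RULES` and the names `toy_stem` and `stem_sentence` are assumptions made for illustration.

```python
# Toy suffix rules, tried in order: (suffix, replacement).
# A simplified stand-in for a real rule-based stemmer.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 leading letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem_sentence(sentence):
    """Stem every whitespace-separated word and rejoin with spaces."""
    return " ".join(toy_stem(word) for word in sentence.split())


print(stem_sentence("The cats were jumping and played"))
# the cat were jump and play
```

Stems produced this way need not be valid words: `toy_stem("ponies")` returns `poni`.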
Lemmatization
• Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma.
• For example, runs, running, and ran are all forms of the word run; therefore run is the lemma of all these words.
• Because lemmatization returns an actual word of the language, it is used where valid words are required.
• Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up lemmas of words.
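The lookup idea behind lemmatization can be illustrated with a small dictionary. This is a sketch only: the table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical stand-ins, and the table covers only the example words from the bullets above, not the WordNet database.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
# A real lemmatizer consults a full lexical database instead.
LEMMA_TABLE = {
    "runs": "run",
    "running": "run",
    "ran": "run",
}


def toy_lemmatize(word):
    """Return the lemma if the word is in the table, else the word itself."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for form in ["runs", "running", "ran"]:
    print(form, "->", toy_lemmatize(form))
```

Unlike suffix stripping, a lookup handles irregular forms such as "ran". With NLTK, the corresponding call would typically be `WordNetLemmatizer().lemmatize(word, pos="v")`, which requires the WordNet data to be downloaded first.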
Standardization of Data
The common operations performed to standardize the data are
Generating Text
Predicting Sequences
There are different types of language models, and they can be broadly categorized into
Statistical Language Models (SLM)