Unit I Introduction


Natural Language Processing (NLP)

 NLP is among the hottest topics in the field of data science.


 Companies are putting tons of money into research in this field.
 Everyone is trying to understand NLP and its applications to build a career around it.
 Every business out there wants to integrate NLP into its operations somehow.
Are you using NLP these days?
Search Autocorrect and Autocomplete – Language Translator
Social media monitoring
 More people these days have started using social media to post their thoughts about a particular product, policy, or matter.
 These posts can contain useful information about an individual's likes and dislikes.
 Analyzing this unstructured data can help generate valuable insights; NLP comes to the rescue here too.
 Various NLP techniques are used by companies to analyze social media posts and learn what customers think about their products.
 Companies also use social media monitoring to understand the issues and problems that their customers face while using their products.
Chatbots

Modern conversational agents can
• answer questions,
• book flights,
• find restaurants,
functions for which they rely on a much more sophisticated understanding of the user's intent.
Survey Analysis

 Surveys are an important way of evaluating a company's performance and of getting customers' feedback on various products.
 They are useful in understanding flaws and help companies improve their products.
 NLP is used to analyze surveys and generate insights from them, such as determining user sentiment and analyzing product reviews to understand the pros and cons.
Targeted Advertising – Hiring and Recruitment

 Targeted advertising is a type of online advertising where ads are shown to users based on their online activity.
 It saves companies a lot of money because relevant ads are shown only to potential customers.
Voice Assistants
Conventional vs. NLP-based search
What is NLP?

 Natural language processing is a sub-field of linguistics, computer science, and AI concerned with the interactions between computers and human language.
 NLP enables computers to understand complex language structure and retrieve meaningful pieces of information from it.
 Modern challenges in NLP involve
 speech recognition,
 natural language understanding, and
 natural language generation.
Why study NLP?

 Text is the largest repository of human knowledge:
 news articles, web pages, scientific articles, patents, emails, government documents…
 tweets, Facebook posts, comments, Quora posts, etc.
 What are the top ten languages on the internet in terms of millions of users?
 Goals of NLP
 Fundamental and scientific goal – a deep understanding of broad language.
 Engineering goal – design, implement, and test systems that process natural languages for practical applications.
Applications of NLP

 Text Classification
 Language Modelling
 Information Extraction
 Information Retrieval
 Conversational Agents
 Text Summarization
 Question Answering
 Machine Translation
 Topic Modelling
 Speech Recognition
Origins of NLP

 Alan Turing’s Turing Test (1950)


 1950s – 1960s : Early Developments
 Georgetown – IBM Experiment (1954)
 Chomsky’s Transformational Generative Grammar (1957)
 1960s – 1970s : Rule-based approaches
 1970s – 1980s : Rise of statistical methods
 1980s – 1990s : Corpus Linguistics and Machine Learning
 2000s – present : Deep Learning and Neural networks.
Challenges of NLP
Why is NLP Hard? Lexical Ambiguity
Ambiguity is pervasive
Activity
 Find at least 5 meanings of this sentence:

I made her duck

 Syntactic category
 Duck can be a noun or a verb
 Her can be a possessive or a dative pronoun
 Word meaning
 Make can mean create or cook
Why is NLP Hard? Ambiguities
Ambiguity is Pervasive
Ambiguity is Explosive

Why is language ambiguous?
Natural Language vs. Computer Languages

 Ambiguity is the primary difference between natural languages and programming languages.
 The goal in the production and comprehension of natural language is efficient communication.
 Allowing resolvable ambiguity
 permits shorter linguistic expressions and
 avoids language becoming overly complex.
 Natural language relies on people's ability to use their knowledge and inference abilities to properly resolve ambiguities.
 Programming languages, in contrast, are designed to be unambiguous: they are defined by a grammar that produces a unique parse for each sentence in the language.
Why else is NLP hard?

 Non-standard use of English in social media
 See you, I will text you later.
 Segmentation issues
 The New York-New Haven Road
 Idioms
 Dark horse
 Ball in the court
 Burn the midnight oil
 Neologisms
 Unfriend
 Retweet
 Google / Skype (used as verbs)
 New senses of existing words
 That's sick, dude
 Giants – multinationals, manufacturers
 Tricky entity names
 Where is A Bug's Life playing…
 Let It Be was recorded…
Empirical Laws

 Function Words vs. Content Words
 Function words have little lexical meaning but serve as important elements of the structure of sentences.
 Function words are closed-class words:
 prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, particles, etc.
 E.g.: a, an, the
 In a frequency list of English words, most of the top entries are function words; the list is dominated by the little words of English, which play important grammatical roles.
Empirical Laws
Type vs. Token
 Type
 A concept; a unique word.
 Token
 An instance of a concept; a running word.
 The type-token distinction separates a concept from the objects which are particular instances of the concept.
 Type-Token Ratio (TTR)
 The ratio of the number of different words (types) to the number of running words (tokens) in a given text or corpus.
 The index indicates how often, on average, a new word form appears in the text or corpus.

                 Tom Sawyer (Mark Twain)    Shakespeare's complete works
Word tokens      71,370                     884,647
Word types       8,018                      29,066
TTR              0.112                      0.032
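A minimal sketch of computing the type-token ratio for a piece of text; the helper name and example sentence are illustrative.

def type_token_ratio(text):
    tokens = text.lower().split()    # running words (tokens)
    types = set(tokens)              # distinct words (types)
    return len(types) / len(tokens)

print(type_token_ratio("the cat sat on the mat and the dog sat too"))
# 8 types / 11 tokens ≈ 0.727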
Empirical Laws
Observation on various texts
 Consider various texts from conversation, academic prose, news, and fiction. Which one will have the highest TTR and which one will have the lowest TTR?
 High TTR – a tendency to use new words.
 Low TTR – the same words are used repeatedly.
Word distribution from Tom Sawyer
Empirical Laws
Zipf’s Law
 Count the frequency of each word type in a large corpus.
 List the word types in decreasing order of their frequency.
 Zipf's law: a word's frequency is inversely proportional to its rank in this list, i.e., frequency × rank ≈ constant.
 For example, the 50th most common word should occur with 3 times the frequency of the 150th most common word.
Empirical evaluation from Tom Sawyer
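A minimal sketch of checking Zipf's law empirically: for each sampled rank, the product rank × frequency should stay roughly constant. The corpus file name is hypothetical; any large plain-text file will do.

from collections import Counter

text = open("tom_sawyer.txt", encoding="utf-8").read()   # hypothetical file name
counts = Counter(text.lower().split())
ranked = counts.most_common()                            # word types in decreasing frequency

for rank in (1, 10, 50, 100, 500):
    word, freq = ranked[rank - 1]
    print(rank, word, freq, rank * freq)   # the last column should be roughly constant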
Empirical Laws
Zipf’s Other laws
Empirical Laws
Heap’s Law
Words – What counts as a word?

 A corpus (plural corpora) is a computer-readable collection of text or speech.
 For example, the Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.).

How many words are in the following Brown sentence?
Sentence: He stepped out into the hall, was delighted to encounter a water brother.
 This sentence has 13 words if we don't count punctuation marks as words, 15 if we count punctuation.
 Are capitalized tokens like They and uncapitalized tokens like they the same word?
 How about inflected forms like cats versus cat?
 These two words have the same lemma cat but are different wordforms.
 A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the
same word sense.
 The wordform is the full inflected or derived form of the word.
Notion of Corpus:
Words – Types and Tokens

 Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
 Tokens are the total number N of running words.
 Ignoring punctuation, find the number of tokens and types in the following sentence:

They picnicked by the pool, then lay back on the grass and looked at the stars

16 tokens
14 types
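A minimal sketch that reproduces the count above, ignoring punctuation and case.

import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars"
tokens = re.findall(r"\w+", sentence.lower())   # drop punctuation, lowercase
print(len(tokens))        # 16 tokens
print(len(set(tokens)))   # 14 types ('the' occurs three times)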
Notion of Corpus:
Corpora

 Any particular piece of text that we study is produced by


 one or more specific speakers or writers,
 in a specific dialect of a specific language,
 at a specific time,
 in a specific place,
 for a specific function.
 The most important dimension of variation is the language.
 NLP algorithms are most useful when they apply across many languages. The world has 7097
languages.
 It is important to test algorithms on more than one language, and particularly on languages with different properties; in contrast, there is an unfortunate current tendency for NLP algorithms to be developed or tested only on English.
 Code switching: the phenomenon of using multiple languages within a single communicative act.
 Other dimensions of variation include genre, demographic characteristics of the writer, and time.
Text-processing Basics
Tokenization
 Tokenization is the process of segmenting a string of characters into
words.
 What is sentence segmentation? –
 The problem of deciding where the sentences begin and end.
 Depending on the application at hand, you might have to perform sentence segmentation as well.
 What are the challenges in sentence segmentation?
 ! and ? are quite unambiguous; the period (.) is quite ambiguous.
 What are the strategies to build a sentence segmenter?
 Hand-written rules, regular expressions, machine learning
Text-processing Basics
Tokenization (continued)

 Tokenization is the process of segmenting a string of characters into words.

Example sentence: I have a can opener; but I can't open these cans

 Word token
 An occurrence of a word.
 The sentence above has 11 word tokens.
 Word type
 A distinct realization of a word.
 The sentence above has 10 word types.
 Practice
 NLTK toolkit, Stanford CoreNLP, Unix commands

 Issues in tokenization
 Finland's
 What're, I'm, shouldn't
 San Francisco
 m.p.h.
 Handling hyphenation
 End-of-line hyphens
 Lexical hyphens
 Sententially determined hyphens
 Language-specific issues
 French and German
 Chinese and Japanese
 Sanskrit
Using Python’s split() function
Tokenization using Regular Expressions
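A minimal sketch of tokenization with Python's re module; the patterns shown are illustrative, not a complete tokenizer.

import re

sentence = "I have a can opener; but I can't open these cans."
# \w+ matches runs of word characters: punctuation is dropped,
# but "can't" is split into 'can' and 't'.
print(re.findall(r"\w+", sentence))
# Keeping internal apostrophes so "can't" stays a single token:
print(re.findall(r"\w+(?:'\w+)?", sentence))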
Tokenization using NLTK
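A minimal sketch using NLTK's sentence and word tokenizers, assuming NLTK is installed and its 'punkt' models have been downloaded.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")   # one-time download of the tokenizer models

text = "He stepped out into the hall. He was delighted to encounter a water brother."
print(sent_tokenize(text))   # sentence segmentation
print(word_tokenize(text))   # word tokenization; punctuation becomes separate tokens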
Word Normalization, Stemming and Lemmatization
 Used to prepare text, words, and documents for further processing.
 Reduce inflected or variant forms to a base form:
 am, are, is – be
 car, car's, cars, cars' – car
 Finds the correct dictionary headword form.
 Morphemes are divided into two categories:
 Stems – the core meaning-bearing units
 Affixes – prefixes (un-, anti-, etc.) and suffixes (-ity, -ation, etc.)
 Stemming and lemmatization help us obtain the root forms of inflected words.
Stemming
• Helps us obtain the root forms of inflected words.
• The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, mis-.
• Crude chopping of affixes: stemming a word or sentence may result in words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.
• A computer program that stems words is called a stemming program, or stemmer.
• PorterStemmer is a stemming algorithm available in NLTK which uses suffix stripping.
• It does not rely on linguistic analysis; instead it applies a set of 5 rule groups for different cases, in phases, to generate stems.
Exercise: create a function which takes a sentence and returns the stemmed sentence (a sketch follows below).
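A minimal sketch of the exercise above, assuming NLTK is installed: tokenize the sentence, stem every token with PorterStemmer, and join the stems back together.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def stem_sentence(sentence):
    stemmer = PorterStemmer()
    tokens = word_tokenize(sentence)                         # split into word tokens
    return " ".join(stemmer.stem(token) for token in tokens)

print(stem_sentence("The children are playing happily in the gardens"))
# e.g. 'the children are play happili in the garden'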
Lemmatization

• Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma.
• For example, runs, running, and ran are all forms of the word run; therefore run is the lemma of all these words.
• As lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
• Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up the lemmas of words.
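A minimal sketch using NLTK's WordNetLemmatizer, assuming NLTK and its WordNet data are installed; the example words are illustrative.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-time download of the WordNet database

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # 'run' (pos='v' treats it as a verb)
print(lemmatizer.lemmatize("ran", pos="v"))       # 'run'
print(lemmatizer.lemmatize("cats"))               # 'cat' (default part of speech is noun)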
Standardization of Data
The common operations performed to standardize the data are listed below; a short sketch of a few of them follows the list.

 Removal of duplicate whitespaces and punctuation.
 Accent removal.
 Capital letter removal.
 Removal or substitution of special characters/emojis (e.g.: remove hashtags).
 Substitution of contractions (very common in English; e.g.: 'I'm' → 'I am').
 Transformation of word numerals into numbers (e.g.: 'twenty three' → '23').
 Substitution of values for their type (e.g.: '$50' → 'MONEY').
 Acronym normalization (e.g.: 'US' → 'United States'/'U.S.A') and abbreviation normalization (e.g.: 'btw' → 'by the way').
 Normalization of date formats, social security numbers, etc.
 Spell correction — very important if you're dealing with open user inputs, such as tweets, IMs and emails.
 Removal of gender/time/grade variation with stemming or lemmatization.
 Substitution of rare words with more common synonyms.
 Stop word removal (more a dimensionality reduction technique than a normalization technique).
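A minimal sketch of a few of these operations (lowercasing, hashtag removal, contraction expansion, whitespace cleanup); the contraction table is a tiny illustrative subset, not a complete resource.

import re

CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "won't": "will not", "btw": "by the way"}

def standardize(text):
    text = text.lower()                          # capital letter removal
    text = re.sub(r"#\w+", "", text)             # remove hashtags
    for short, full in CONTRACTIONS.items():     # expand contractions / abbreviations
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s']", " ", text)        # drop remaining special characters
    text = re.sub(r"\s+", " ", text).strip()     # collapse duplicate whitespace
    return text

print(standardize("BTW I'm loving this #NLP course!!!"))
# 'by the way i am loving this course'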
Spelling Correction – Edit Distance

 Isolated word error correction
 Pick the word that is closest to 'behaf'.
 How do we define 'closest'?
 We need a distance metric.
 The simplest metric is edit distance.
 Edit Distance
 The minimum edit distance between two strings is defined as the minimum number of editing operations needed to transform one string into the other:
 Insertion
 Deletion
 Substitution
 Levenshtein distance – each substitution has cost 1.
 Alternate version – each substitution has cost 2 (counted as a deletion plus an insertion).
Defining minimum edit distance matrix
Edit Distance calculation
Algorithm using Dynamic Programming
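A minimal sketch of the dynamic-programming algorithm with Levenshtein costs (insertion, deletion, and substitution all cost 1); the function name is illustrative.

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                                # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j                                # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + sub)   # substitution (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 5 with these costs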
Tracing Edit Distance
Computing Alignments

 Computing the edit distance may not be sufficient for some applications – we often need to align the characters of the two strings to each other.
 We do this by keeping a backtrace.
 Every time we enter a cell, remember where we came from.
 When we reach the end, trace back the path from the upper right corner to read off the alignment.
 Performance
 Time – O(nm)
 Space – O(nm)
 Backtrace – O(n+m)
Language models

 A language model is a computational model or algorithm designed to understand, generate, and predict human language.
 Language models are a fundamental part of natural language processing (NLP) and machine learning applications that involve dealing with textual data.
 The primary goals of a language model include:
 Understanding language
 Generating text
 Predicting sequences
 There are different types of language models, and they can be broadly categorized into:
 Statistical Language Models (SLM)
 Grammar-based Language Models
 Neural Language Models


Grammar-based Language Models

 Grammar-based language models rely on predefined rules and structures to generate sentences. These rules are often based on formal grammatical frameworks, such as context-free grammars.
 The model uses syntactic rules to define the permissible arrangements of words in a sentence.
 Example: in a grammar-based LM, you might have rules specifying that a sentence must consist of a noun phrase followed by a verb phrase (see the sketch below).
 Challenge – these models may struggle with handling natural language variations and may not capture the full complexity of language.
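A minimal sketch of this idea using a toy context-free grammar in NLTK, assuming NLTK is installed; the grammar and sentence are illustrative only.

import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> 'the' | 'a'
  N  -> 'dog' | 'ball'
  V  -> 'chased' | 'caught'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog chased a ball".split()
for tree in parser.parse(sentence):   # yields only parses permitted by the rules
    print(tree)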
Statistical Language Model

 SLMs are based on statistical patterns observed in a given dataset. They estimate the probability of a sequence of words occurring based on the frequencies of these sequences in the training data.
 N-gram models: SLMs often use n-gram models, where the probability of a word is conditioned on the previous n-1 words. Commonly used n-grams include bigrams (n=2) and trigrams (n=3).
 Example: in an SLM, the probability of the word "rain" might be higher if the preceding words are "the" and "it" compared to other combinations (see the bigram sketch below).
 Challenge – data sparsity issues.
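A minimal sketch of a bigram SLM estimated by maximum likelihood from a tiny toy corpus; real models need far more data and smoothing to cope with data sparsity.

from collections import Counter

corpus = [
    "i think it will rain today",
    "it will rain in the evening",
    "i think the evening will be cold",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("will", "rain"))   # 2/3 in this toy corpus
print(bigram_prob("will", "be"))     # 1/3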
