UNIT-1 Notes
Advantages of NLP
o NLP helps users to ask questions about any subject and get a direct response within
seconds.
o NLP offers exact answers to the question; it does not return unnecessary or unwanted
information.
o NLP helps computers to communicate with humans in their own languages.
o It is very time-efficient.
o Most companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.
Components of NLP
Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords,
emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in both
spoken and written language.
Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning,
and Text Realization.
NLU vs. NLG
o NLU is the process of reading and interpreting language; NLG is the process of writing or
generating language.
o NLU produces non-linguistic outputs from natural language inputs; NLG constructs natural
language outputs from non-linguistic inputs.
NLP Terminology
•Phonology − It is the study of organizing sound systematically.
•Morphology − It is the study of the construction of words from primitive meaningful units.
•Morpheme − It is the primitive unit of meaning in a language.
•Syntax − It refers to arranging words to make a sentence. It also involves determining
the structural role of words in the sentence and in phrases.
•Semantics − It is concerned with the meaning of words and how to combine words
into meaningful phrases and sentences.
•Pragmatics − It deals with using and understanding sentences in different situations
and how the interpretation of the sentence is affected.
•Discourse − It deals with how the immediately preceding sentence can affect
the interpretation of the next sentence.
•World Knowledge − It includes the general knowledge about the world.
Phases of NLP
The phases of NLP are: Lexical Analysis, Syntactic Analysis (Parsing), Semantic Analysis,
Discourse Integration, and Pragmatic Analysis.
NLP Libraries
Scikit-learn: It provides a wide range of algorithms for building machine learning models in
Python.
Natural Language Toolkit (NLTK): NLTK is a complete toolkit for all NLP techniques.
TextBlob: It provides an easy interface for basic NLP tasks like sentiment analysis,
noun phrase extraction, or POS tagging.
Quepy: Quepy is used to transform natural language questions into queries in a database
query language.
SpaCy: SpaCy is an open-source NLP library which is used for Data Extraction, Data
Analysis, Sentiment Analysis, and Text Summarization.
Gensim: Gensim works with large datasets and processes data streams.
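As a quick illustration of these libraries, the sketch below uses TextBlob for sentiment
analysis, noun phrase extraction, and POS tagging. It assumes TextBlob and its corpora are
installed (pip install textblob, then python -m textblob.download_corpora); the example
sentence is made up.

from textblob import TextBlob

# A small example sentence to analyse (invented for illustration)
text = "Natural language processing helps computers understand human language easily."
blob = TextBlob(text)

print(blob.sentiment)      # Sentiment(polarity, subjectivity), polarity in [-1.0, 1.0]
print(blob.noun_phrases)   # noun phrase extraction
print(blob.tags)           # (word, POS tag) pairs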
Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units
called morphemes. A morpheme is the minimal meaning-bearing unit in a language.
The word cats contains two morphemes: the morpheme cat and the morpheme -s. As this
example suggests, it is often useful to distinguish two broad classes of morphemes:
stems and affixes. The stem is the “main” morpheme of the word, supplying the main
meaning, while the affixes add “additional” meanings of various kinds. Affixes are further
divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede the stem, suffixes
follow the stem, circumfixes do both, and infixes are inserted inside the stem. For example,
the word eats is composed of a stem eat and the suffix -s. The word unbuckle is composed of
the prefix un- and the stem buckle. A word can have more than one affix. For example, the
word rewrites has the prefix re-, the stem write, and the suffix -s. There are many ways to
combine morphemes to create words. Four of these methods are common and play important
roles in speech and language processing: inflection, derivation, compounding, and
cliticization.
Inflectional Morphology
English has a relatively simple inflectional system; only nouns, verbs, and sometimes
adjectives can be inflected, and the number of possible inflectional affixes is quite small.
English nouns have only two kinds of inflection: an affix that marks plural and an affix
that marks possessive.
The regular plural is spelled -s after most nouns. The possessive suffix is realized by
apostrophe + -s for regular singular nouns and for plural nouns not ending in -s (children's).
English has three kinds of verbs: main verbs (eat, sleep, impeach), modal verbs (can,
will, should), and primary verbs (be, have, do). A large class of these verbs are called
regular because just by knowing the stem we can predict the other forms by adding one of
three predictable endings and making some regular spelling changes. The irregular verbs are
those that have some more or less idiosyncratic forms of inflection.
Derivational Morphology
Derivation is the combination of a word stem with a grammatical morpheme, usually
resulting in a word of a different class, often with a meaning hard to predict exactly. A very
common kind of derivation in English is the formation of new nouns, often from verbs or
adjectives. This process is called nominalization.
Cliticization
A clitic is a unit whose status lies between that of a word and an affix. In English, clitics are
often realized as contracted forms attached to a host word, for example:
am → 'm
have → 've
would → 'd
will → 'll
An initial hypothesis for English adjective morphotactics might be that adjectives can have
an optional prefix (un-), an obligatory root (big, cool, etc.), and an optional suffix (-er, -est,
or -ly).
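This hypothesis can be sketched as a tiny finite-state pattern. The regular expression below is
only an illustration of the idea (the word list is a made-up toy lexicon); note that it neither
handles spelling changes such as big → bigger nor rules out over-generated forms such as
unbig, which is exactly why real morphological analyzers use finite-state transducers over a
full lexicon.

import re

# optional prefix (un-), an obligatory root from a toy lexicon, optional suffix
adjective_pattern = re.compile(r"^(un)?(big|cool|clear|happy|real)(er|est|ly)?$")

for word in ["big", "cooler", "coolest", "unclear", "unclearly", "dog"]:
    print(word, bool(adjective_pattern.match(word)))   # dog -> False, the rest -> True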
FINITE-STATE TRANSDUCERS
A transducer maps between one representation and another; a finite-state transducer or FST
is a type of finite automaton which maps between two sets of symbols. An FST defines a
relation between sets of strings.
Let’s begin with a formal definition. An FST can be formally defined with 7 parameters:
Q : a finite set of N states q0, q1, . . . , qN−1
Σ : a finite set corresponding to the input alphabet
Δ : a finite set corresponding to the output alphabet
q0 ∈ Q : the start state
F ⊆ Q : the set of final states
δ(q, w) : the transition function or transition matrix between states. Given a state q ∈ Q and
a string w ∈ Σ∗, δ(q, w) returns a set of new states Q′ ⊆ Q. δ is thus a function from
Q × Σ∗ to 2^Q.
σ(q, w) : the output function giving the set of possible output strings for each state and
input. Given a state q ∈ Q and a string w ∈ Σ∗, σ(q, w) gives a set of output strings, each a
string o ∈ Δ∗. σ is thus a function from Q × Σ∗ to 2^(Δ∗).
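As an illustrative sketch only (not a standard library API), the minimal Python class below
stores transitions of the form (state, input symbol) → set of (next state, output string) and
transduces an input symbol by symbol, roughly mirroring the δ and σ functions above. The
states, arcs, and the cat+N+PL example are invented for illustration.

from collections import defaultdict

class SimpleFST:
    # A toy finite-state transducer: (state, input symbol) -> {(next state, output string)}
    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(set)

    def add_arc(self, state, in_sym, next_state, out_str):
        self.arcs[(state, in_sym)].add((next_state, out_str))

    def transduce(self, symbols):
        # follow all paths through the machine, accumulating output strings
        paths = {(self.start, "")}
        for sym in symbols:
            paths = {(nxt, out + out_str)
                     for (state, out) in paths
                     for (nxt, out_str) in self.arcs[(state, sym)]}
        # keep only the outputs of paths that end in a final state
        return {out for (state, out) in paths if state in self.finals}

# Toy example: map the surface form "cats" to the lexical form "cat+N+PL"
fst = SimpleFST(start="q0", finals={"q2"})
fst.add_arc("q0", "c", "q0", "c")
fst.add_arc("q0", "a", "q0", "a")
fst.add_arc("q0", "t", "q1", "t")
fst.add_arc("q1", "s", "q2", "+N+PL")
print(fst.transduce("cats"))   # {'cat+N+PL'}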
Python code
import polyglot
# Run once to download the English morphology model: polyglot download morph2.en
from polyglot.text import Word

word = Word("Independently", language="en")
print(word, word.morphemes)
Stemming
•Stemming is a technique used to extract the base form of the words by removing affixes
from them.
•Stemming is a rule-based approach because it slices prefixes or suffixes from inflected
words as needed.
•There are mainly two errors that occur while performing Stemming, Over-stemming, and
Under-stemming.
A computer program or subroutine that stems words may be called a stemming program,
stemming algorithm, or stemmer.
•The Porter Stemmer uses suffix stripping to produce stems. It does not follow a linguistic
set of rules to produce stems for different cases; for this reason, the Porter stemmer often
generates stems that are not actual English words.
import nltk
from nltk.stem import PorterStemmer

word_stemmer = PorterStemmer()
word_stemmer.stem('writing')   # returns 'write'
Lemmatization
•Lemmatization is a method that switches any kind of word to its base root form, called the
lemma.
•Morphological analysis requires the extraction of the correct lemma of each word.
•Unlike stemming, it gives a stripped word that has an actual dictionary meaning.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # needed the first time
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')   # returns 'book'
Part-of-Speech Tagging
Part-of-speech tagging is the process of assigning a part-of-speech or other syntactic class
marker to each word in a corpus.
ENGLISH WORD CLASSES
Parts of speech can be divided into two broad supercategories: closed class types and open
class types. Closed classes are those that have relatively fixed membership. For example,
prepositions are a closed class because there is a fixed set of them in English; new
prepositions are rarely coined. By contrast nouns and verbs are open classes because new
nouns and verbs are continually coined or borrowed from other languages. Closed class
words are also generally function words like of, it, and, or you, which tend to be very short,
occur frequently, and often have structuring uses in grammar.
There are four major open classes that occur in the languages of the world: nouns,
verbs, adjectives, and adverbs. The closed classes include words such as prepositions,
determiners, pronouns, conjunctions, auxiliary verbs, particles, and numerals.
Most tagging algorithms fall into one of two classes: rule-based taggers and stochastic
taggers. Rule-based taggers generally involve a large database of hand-written
disambiguation rules which specify, for example, that an ambiguous word is a noun rather
than a verb if it follows a determiner. Stochastic taggers generally resolve tagging
ambiguities by using a training corpus to compute the probability of a given word having a
given tag in a given context.
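For a quick look at a pre-trained stochastic tagger, NLTK's pos_tag function can be used as
below; it needs the punkt tokenizer and the averaged perceptron tagger model downloaded
first (resource names can vary slightly between NLTK versions), and the example sentence is
made up.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The race for outer space started decades ago."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # list of (word, Penn Treebank tag) pairs, e.g. ('race', 'NN')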
A Hidden Markov Model (HMM) allows us to talk about both observed events (like the
words that we see in the input) and hidden events (like part-of-speech tags) that we think of
as causal factors in our probabilistic model.
An HMM is specified by the following components:
Q = q1 q2 . . . qN : a set of N states
A = a11 a12 . . . aNN : a transition probability matrix A, each aij representing the probability
of moving from state i to state j
O = o1 o2 . . . oT : a sequence of T observations, each one drawn from a vocabulary
V = v1, v2, . . . , vV
B = bi(ot) : a sequence of observation likelihoods, also called emission probabilities, each
expressing the probability of an observation ot being generated from a state i
q0, qF : a special start state and end (final) state
An HMM thus has two kinds of probabilities; the A transition probabilities, and the B
observation likelihoods, corresponding respectively to the prior and likelihood probabilities.
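A minimal sketch of how the A (transition) and B (emission) probabilities can be estimated
from a tagged corpus by relative frequency; the two hand-tagged sentences are invented for
illustration.

from collections import defaultdict

# A tiny hand-tagged corpus (invented for illustration)
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

transition_counts = defaultdict(lambda: defaultdict(int))   # C(tag_i followed by tag_j)
emission_counts = defaultdict(lambda: defaultdict(int))     # C(tag_i emitting word)
tag_counts = defaultdict(int)

for sentence in tagged_sentences:
    prev_tag = "<s>"                     # special start state q0
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        tag_counts[tag] += 1
        prev_tag = tag

def transition_prob(prev_tag, tag):
    # a_ij = C(tag_i, tag_j) / C(tag_i)
    total = sum(transition_counts[prev_tag].values())
    return transition_counts[prev_tag][tag] / total if total else 0.0

def emission_prob(tag, word):
    # b_i(o_t) = C(tag_i, word) / C(tag_i)
    return emission_counts[tag][word] / tag_counts[tag] if tag_counts[tag] else 0.0

print(transition_prob("DT", "NN"))   # 1.0 in this toy corpus
print(emission_prob("NN", "dog"))    # 0.5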
TRANSFORMATION-BASED TAGGING
Transformation-Based Tagging, sometimes called Brill tagging, is an instance of the
Transformation-Based Learning (TBL) approach to machine learning and draws inspiration
from both the rule-based and stochastic taggers. Like the rule-based taggers, TBL is based on
rules that specify what tags should be assigned to what words. But like the stochastic taggers,
TBL is a machine learning technique, in which rules are automatically induced from the data.
Like some but not all of the HMM taggers, TBL is a supervised learning technique; it
assumes a pre-tagged training corpus.
For example, suppose the word race is most likely to be a noun:
P(NN|race) = .98
P(VB|race) = .02
This means that both occurrences of race will initially be tagged as NN, even where one of
them should be a verb.
The TBL algorithm first labels every word with its most likely tag. It then examines every
possible transformation and selects the one that results in the most improved tagging. Finally, it
re-tags the data according to this rule. The last two stages are repeated until some stopping
criterion is reached, such as insufficient improvement over the previous pass. Note that
stage two requires that TBL knows the correct tag of each word; that is, TBL is a supervised
learning algorithm.
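The three stages described above can be sketched schematically in Python. This is only an
illustration of the TBL loop, not Brill's actual templates or implementation; the rule format
(a function that rewrites a tag sequence) and the helper names are invented.

# Schematic sketch of the Transformation-Based Learning loop (illustrative only)
def tbl_train(words, gold_tags, most_likely_tag, candidate_rules, min_gain=1):
    # Stage 1: label every word with its most likely tag
    tags = [most_likely_tag[w] for w in words]
    learned_rules = []
    while True:
        # Stage 2: score every candidate transformation against the gold (correct) tags
        best_rule, best_gain = None, 0
        current_correct = sum(t == g for t, g in zip(tags, gold_tags))
        for rule in candidate_rules:
            new_tags = rule(words, tags)
            gain = sum(n == g for n, g in zip(new_tags, gold_tags)) - current_correct
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        # Stop when no transformation improves the tagging enough
        if best_rule is None or best_gain < min_gain:
            break
        # Stage 3: re-tag the data according to the best rule and record it
        tags = best_rule(words, tags)
        learned_rules.append(best_rule)
    return learned_rules

# Example candidate rule: change NN to VB when the previous tag is TO
def nn_to_vb_after_to(words, tags):
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == "NN" and tags[i - 1] == "TO":
            new_tags[i] = "VB"
    return new_tags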
N-gram
The idea of word prediction can be formalized with probabilistic models called N-gram
models, which predict the next word from the previous N − 1 words. Such statistical models
of word sequences are also called language models. An N-gram is a sequence of N tokens
(or words). N-grams are essential in speech recognition, handwriting recognition, and
statistical machine translation.
• An utterance such as "I do uh main- mainly business data processing" has two kinds of
disfluencies. The broken-off word main- is called a fragment. Words like uh and um are
called fillers or filled pauses.
• Consider inflected forms like cats versus cat. These two words have the same
lemma cat but are different wordforms.
• The wordform is the full inflected or derived form of the word.
Simple Unsmoothed N-Gram
The goal is to compute the probability of a word w given some history h, or P(w|h). Suppose
the history h is “its water is so transparent that” and we want to know the probability that the
next word is the: P(the | its water is so transparent that). One way to compute this
probability is to estimate it from relative frequency counts:
P(the | its water is so transparent that) = C(its water is so transparent that the) /
C(its water is so transparent that)
While this method of estimating probabilities directly from counts works fine in many cases,
it turns out that even the web isn’t big enough to give us good estimates in most cases.
In order to represent the probability of a particular random variable Xi taking on the value
"the", or P(Xi = "the"), we will use the simplification P(the). We will represent a sequence of
N words either as w1 . . . wn or w1^n. For the joint probability of each word in a sequence
having a particular value, P(X1 = w1, X2 = w2, . . . , Xn = wn), we will use P(w1, w2, . . . , wn).
To compute the probability of an entire sequence P(w1, w2, . . . , wn), one thing we can do is
decompose this probability using the chain rule of probability:
P(X1 . . . Xn) = P(X1) P(X2|X1) P(X3|X1^2) . . . P(Xn|X1^(n−1)) = Π_{k=1}^{n} P(Xk | X1^(k−1))
Applying the chain rule to words, we get:
P(w1^n) = P(w1) P(w2|w1) P(w3|w1^2) . . . P(wn|w1^(n−1)) = Π_{k=1}^{n} P(wk | w1^(k−1))
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. The intuition of
the N-gram model is that instead of computing the probability of a word given its entire
history, we will approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given all the
previous words, P(wn|w1^(n−1)), by using only the conditional probability of the preceding
word, P(wn|wn−1). In other words, instead of computing the probability
P( the| Walden Pond’s water is so transparent that)
we approximate it with the probability P( the| that)
When a bigram model is used to predict the conditional probability of the next word, the
following approximation is made:
P(wn|w1^(n−1)) ≈ P(wn|wn−1)
This assumption that the probability of a word depends only on the previous word is called a
Markov assumption. Markov models are the class of probabilistic models that assume that
we can predict the probability of some future unit without looking too far into the past. We
can generalize the bigram (which looks one word into the past) to the trigram (which looks
two words into the past) and thus to the N-gram (which looks N − 1 words into the past).
Thus the general equation for this N-gram approximation to the conditional probability
of the next word in a sequence is:
P(wn|w1^(n−1)) ≈ P(wn|w_(n−N+1)^(n−1))
Example
•“the cat sat on the mat”
•P(S)=P(the)⋅P(cat|the)⋅P(sat|cat)⋅P(on|sat)⋅P(the|on)⋅P(mat|the)
•P(mat| the cat sat on the )=P(mat|the)
Given the bigram assumption for the probability of an individual word, we can compute the
probability of a complete word sequence by substituting the bigram approximation into the
chain rule:
P(w1^n) ≈ Π_{k=1}^{n} P(wk|wk−1)
The simplest and most intuitive way to estimate probabilities is called Maximum Likelihood
Estimation, or MLE. The MLE for the parameters of an N-gram model is obtained by taking
counts from a corpus and normalizing them so they lie between 0 and 1. For bigrams:
P(wn|wn−1) = C(wn−1 wn) / Σ_w C(wn−1 w) = C(wn−1 wn) / C(wn−1)
Example
•<s> I am Sam </s>
•<s> Sam I am </s>
•<s> I do not like green eggs and ham </s>
•P(I|<s>) = 2/3
•P(Sam|<s>) = 1/3
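A short sketch that computes these bigram MLE estimates directly from the three example
sentences (treating <s> and </s> as ordinary tokens):

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

# MLE bigram probability: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
def bigram_prob(prev_word, word):
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("<s>", "I"))     # 2/3
print(bigram_prob("<s>", "Sam"))   # 1/3
print(bigram_prob("I", "am"))      # 2/3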
When using a statistical model of language given some corpus of relevant data, we start by
dividing the data into training and test sets. We train the statistical parameters of the model
on the training set, and then use this trained model to compute probabilities on the test set.
This training-and-testing paradigm can also be used to evaluate different N-gram
architectures.
Since our evaluation metric is based on test set probability, it’s important not to let the test
sentences into the training set. Suppose we are trying to compute the probability of a
particular “test” sentence. If our test sentence is part of the training corpus, we will
mistakenly assign it an artificially high probability when it occurs in the test set. We call this
situation training on the test set . Training on the test set introduces a bias that makes the
probabilities all look too high and causes huge inaccuracies in perplexity.
In addition to training and test sets, other divisions of data are often useful. Sometimes
we need an extra source of data to augment the training set. Such extra data is called a
held-out set.
SMOOTHING
There is a major problem with the maximum likelihood estimation process we have seen for
training the parameters of an N-gram model. This is the problem of sparse data caused by the
fact that our maximum likelihood estimate was based on a particular set of training data. For
any N-gram that occurred a sufficient number of times, we might have a good estimate of its
probability. But because any corpus is limited, some perfectly acceptable English word
sequences are bound to be missing from it. This missing data means that the N-gram matrix
for any given training corpus is bound to have a very large number of cases of putative “zero
probability N-grams” that should really have some non-zero probability. We’ll want to
modify the maximum likelihood estimates for computing N-gram probabilities, focusing on
the N-gram events that we incorrectly assumed had zero probability. We use the term
smoothing for such modifications that address the poor estimates that are due to variability
in small data sets.
Laplace Smoothing
One simple way to do smoothing might be just to take our matrix of bigram counts,
before we normalize them into probabilities, and add one to all the counts. This algorithm is
called Laplace smoothing , or Laplace’s Law. Laplace smoothing merely adds one to each
count. A related way to view smoothing is as discounting (lowering) some non-zero counts
in order to get the probability mass that will be assigned to the zero counts.
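A minimal sketch of add-one (Laplace) smoothing for bigram probabilities on the same small
corpus as above, using the smoothed estimate P(wn|wn−1) = (C(wn−1 wn) + 1) / (C(wn−1) + V),
where V is the vocabulary size:

from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>"]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)   # vocabulary size

# Add-one (Laplace) smoothing: P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
def laplace_bigram_prob(prev_word, word):
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(laplace_bigram_prob("<s>", "I"))       # seen bigram: count 2 -> (2 + 1) / (3 + V)
print(laplace_bigram_prob("green", "ham"))   # unseen bigram: count 0 -> 1 / (1 + V), no longer zero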
There are two ways to use such an N-gram hierarchy: backoff and interpolation. In
backoff, if we have non-zero trigram counts, we rely solely on the trigram counts; we only
“back off” to a lower-order N-gram if we have zero evidence for the higher-order N-gram. By
contrast, in interpolation, we always mix the probability estimates from all the N-gram
estimators, i.e., we do a weighted interpolation of trigram, bigram, and unigram counts.
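A minimal sketch of simple linear interpolation, where trigram, bigram, and unigram
estimates are mixed with fixed weights λ1 + λ2 + λ3 = 1; the weight values below are
arbitrary placeholders (in practice they are tuned on a held-out set):

# Simple linear interpolation of trigram, bigram, and unigram estimates.
# p_tri, p_bi, p_uni are assumed to be MLE estimates computed elsewhere.
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas              # must sum to 1; tuned on held-out data in practice
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Example: the trigram estimate is zero (unseen), but bigram and unigram still contribute
print(interpolated_prob(p_tri=0.0, p_bi=0.25, p_uni=0.01))   # 0.3*0.25 + 0.1*0.01 = 0.076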
Multiword Expressions
Multiword expressions (MWEs) are expressions which are made up of at least 2
words and which can be syntactically and/or semantically idiosyncratic in nature.
Moreover, they act as a single unit at some level of linguistic analysis.
Classification of MWEs
1 Fixed expressions
Fixed expressions are fully lexicalized and can neither be varied
morphosyntactically nor modified internally. Examples of fixed expressions are:
in short, by and large, every which way. They are fixed, as you cannot say in shorter or in
very short.
2 Semi-fixed expressions
In semi-fixed expressions, word order and composition are strictly invariable, while
inflection, variation in reflexive form, and determiner selection are possible.
In non-decomposable idioms (i.e. idioms in which the meaning cannot be assigned to
the parts of the MWE), such as kick the bucket, the verb can be inflected according to a
particular context: he kicks the bucket.
Another type of semi-fixed expression is the compound nominal, such as car park or peanut
butter. Compound nominals are syntactically unalterable but can inflect for number: 2 car parks.
3 Syntactically-Flexible Expressions
Syntactically-flexible expressions have a wider range of syntactic variability than
semi-fixed expressions. They occur in the form of decomposable idioms, verb-particle
constructions, and light verbs.
Decomposable idioms are likely to be syntactically flexible to some degree. Examples
are let the cat out of the bag and sweep under the rug.
Verb-particle constructions, such as write up and look up, are made up of a verb and one or
more particles.
For light verb constructions, such as make a mistake or give a demo, it is difficult to predict
which light verb combines with a given noun.
Frequency
Surely the simplest method for finding collocations in a text corpus is counting. If two words
occur together a lot, then that is evidence that they have a special function that is not simply
explained as the function that results from their combination.
For example, a t test for the bigram new companies in a corpus of 14,307,668 tokens uses the
relative frequencies
P(new) = 15828/14307668
P(companies) = 4675/14307668
Under the null hypothesis H0 that the two words occur independently,
P(new companies) = P(new) × P(companies) ≈ 3.615 × 10^−7.
The resulting t value is t ≈ 0.999932, which is too small to reject H0, so new companies
would not be counted as a collocation by this test.
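A short sketch of the t-score computation behind these numbers; the bigram count of 8 for
new companies is not given above and is assumed here from the standard textbook version of
this example.

import math

# Figures from the example above; the bigram count of 8 is an assumption taken
# from the standard textbook version of this example.
N = 14307668                 # corpus size in tokens
c_new = 15828                # C(new)
c_companies = 4675           # C(companies)
c_bigram = 8                 # C(new companies), assumed

p_new = c_new / N
p_companies = c_companies / N
mu = p_new * p_companies     # expected bigram probability under H0 (independence)
x_bar = c_bigram / N         # observed bigram probability (sample mean)
s2 = x_bar                   # sample variance, approximated by the mean for rare events

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 5))           # ≈ 0.99993, matching the t ≈ 0.999932 quoted above (up to rounding)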
Likelihood ratios
Likelihood ratios are another approach to hypothesis testing. For each of two hypotheses
(that the two words occur independently, and that they do not), we compute the
maximum-likelihood estimates of the population parameters that maximize the probability of
the observed data, and then compare how likely the data are under each hypothesis.
CODE
import nltk
nltk.download('stopwords')
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
from nltk.corpus import stopwords

sentence = "natural language processing (NLP) is one of the fields of artificial intelligence which uses machine learning techniques"
words = sentence.split()

# filter out stopwords and very short tokens before scoring collocations
stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset

bigram_collocation = BigramCollocationFinder.from_words(words)
bigram_collocation.apply_word_filter(filter_stops)
print(bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15))