
UNIT-1

1 Introduction to Natural Language Processing: Steps – Morphology – Syntax – Semantics
2 Morphological Analysis (Morphological Parsing): Stemming – Lemmatization
3 Parts of Speech Tagging
4 Approaches to NLP Tasks (Rule-based, Statistical, Machine Learning)
5 N-grams
6 Multiword Expressions
7 Collocations (Association Measures, Coefficients and Context Measures)
8 Vector Representation of Words
9 Language Modeling

Introduction to Natural Language Processing

Natural language processing (NLP) is the intersection of computer science, linguistics, and machine learning. The field focuses on communication between computers and humans in natural language; NLP is about making computers understand and generate human language. Applications of NLP techniques include voice assistants such as Amazon's Alexa and Apple's Siri, as well as machine translation and text filtering.

Advantages of NLP
o NLP helps users ask questions about any subject and get a direct response within seconds.
o NLP offers exact answers to a question; it does not return unnecessary or unwanted information.
o NLP helps computers communicate with humans in their own languages.
o It is very time efficient.
o Many companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.

Components of NLP

There are the following two components of NLP -

1. Natural Language Understanding (NLU)

Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords,
emotion, relations, and semantic roles.

NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.

NLU involves the following tasks -

o Mapping the given input into a useful representation.
o Analyzing different aspects of the language.

2. Natural Language Generation (NLG)

Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning,
and Text Realization.

Difference between NLU and NLG

NLU is the process of reading and interpreting language; it produces non-linguistic outputs (representations) from natural language inputs.
NLG is the process of writing or generating language; it produces natural language outputs from non-linguistic inputs.

NLP Terminology
• Phonology − It is the study of organizing sound systematically.
• Morphology − It is the study of the construction of words from primitive meaningful units.
• Morpheme − It is the primitive unit of meaning in a language.
• Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.
• Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
• Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
• Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
• World Knowledge − It includes general knowledge about the world.

Phases of NLP

NLP Libraries

Scikit-learn: It provides a wide range of algorithms for building machine learning models in
Python.

Natural Language Toolkit (NLTK): NLTK is a complete toolkit for all NLP techniques.

Pattern: It is a web mining module for NLP and machine learning.

TextBlob: It provides an easy interface for basic NLP tasks such as sentiment analysis, noun phrase extraction, and POS tagging.
Quepy: Quepy is used to transform natural language questions into queries in a database
query language.

SpaCy: SpaCy is an open-source NLP library which is used for Data Extraction, Data
Analysis, Sentiment Analysis, and Text Summarization.

Gensim: Gensim works with large datasets and processes data streams.
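
As a minimal usage sketch (not part of the notes), two of the libraries above can be exercised as follows; this assumes the relevant resources have been installed (pip install nltk spacy and python -m spacy download en_core_web_sm):

import nltk
import spacy

nltk.download('punkt')                            # tokenizer models for NLTK
text = "Natural language processing helps computers understand human language."

print(nltk.word_tokenize(text))                   # NLTK: word tokenization

nlp = spacy.load("en_core_web_sm")                # spaCy: small English pipeline
doc = nlp(text)
print([(tok.text, tok.pos_) for tok in doc])      # spaCy: tokens with part-of-speech tags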

Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes. A morpheme is the minimal meaning-bearing unit in a language.
The word cats contains two morphemes: the morpheme cat and the morpheme -s. As this example suggests, it is often useful to distinguish two broad classes of morphemes: stems and affixes. The stem is the "main" morpheme of the word, supplying the main meaning, while the affixes add "additional" meanings of various kinds. Affixes are further divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede the stem, suffixes follow the stem, circumfixes do both, and infixes are inserted inside the stem. For example, the word eats is composed of a stem eat and the suffix -s, and the word unbuckle is composed of the prefix un- and the stem buckle. A word can have more than one affix: the word rewrites has the prefix re-, the stem write, and the suffix -s. There are many ways to combine morphemes to create words. Four of these methods are common and play important roles in speech and language processing: inflection, derivation, compounding, and cliticization.

Inflectional Morphology
English has a relatively simple inflectional system; only nouns, verbs, and sometimes adjectives can be inflected, and the number of possible inflectional affixes is quite small. English nouns have only two kinds of inflection: an affix that marks plural and an affix that marks possessive.

            Regular nouns         Irregular nouns
Singular    cat        fox        mouse      ox
Plural      cats       foxes      mice       oxen

The regular plural is spelled -s after most nouns. The possessive suffix is realized by apostrophe + -s for regular singular nouns and for plural nouns not ending in -s (children's).
English has three kinds of verbs: main verbs (eat, sleep, impeach), modal verbs (can, will, should), and primary verbs (be, have, do). Verbs are called regular when, just by knowing the stem, we can predict the other forms by adding one of three predictable endings and making some regular spelling changes. The irregular verbs are those that have some more or less idiosyncratic forms of inflection.

Morphological class    Regular verbs             Irregular verbs
Stem                   walk       try            eat        catch
-s form                walks      tries          eats       catches
-ing form              walking    trying         eating     catching
Past (-ed) form        walked     tried          ate        caught

Derivational Morphology
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. A very common kind of derivation in English is the formation of new nouns, often from verbs or adjectives. This process is called nominalization.

Suffix     Base verb/adjective     Derived noun
-ation     computerize (V)         computerization
-ee        employ (V)              employee
-er        kill (V)                killer
-ness      lazy (A)                laziness

Adjectives can also be derived from nouns and verbs.


Suffix     Base word           Derived adjective
-al        computation (N)     computational
-able      trace (V)           traceable
-less      clue (N)            clueless
-ly        man (N)             manly

Cliticization
A clitic is a morpheme whose status lies between that of a word and an affix; in English it is usually realized as a contracted form attached to a preceding word.

Full form clitic

am ‘m

have ‘ve

would ‘d

will ‘ll

Morphological Analysis (Morphological Parsing) Stemming – Lemmatization

In order to build a morphological parser, we need at least the following:

1. Lexicon: the list of stems and affixes, together with basic information about them (whether a stem is a noun stem or a verb stem, etc.).
2. Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. For example, the fact that the English plural morpheme follows the noun rather than preceding it is a morphotactic fact.
3. Orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., the y→ie spelling rule that changes city + -s to cities rather than citys).

The FSA for English nominal inflection assumes that the lexicon includes regular nouns (reg-noun) that take the regular -s plural. The lexicon also includes irregular noun forms that don't take -s, both singular irreg-sg-noun (goose, mouse) and plural irreg-pl-noun (geese, mice).
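
To make the morphotactics concrete, here is a minimal sketch (word lists invented for illustration) of such a noun-inflection FSA written as a membership test; orthographic rules such as fox + -s → foxes are deliberately ignored in this toy version:

reg_nouns      = {"cat", "dog", "hand"}
irreg_sg_nouns = {"goose", "mouse", "ox"}
irreg_pl_nouns = {"geese", "mice", "oxen"}

def accepts(word: str) -> bool:
    """True if the word is covered by the toy noun-inflection FSA."""
    if word in irreg_sg_nouns or word in irreg_pl_nouns:
        return True                                  # irregular forms listed directly in the lexicon
    if word in reg_nouns:
        return True                                  # bare regular stem (singular)
    return word.endswith("s") and word[:-1] in reg_nouns   # regular stem + plural -s

print(accepts("cats"), accepts("geese"), accepts("gooses"))   # True True False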

A similar model for English verbal inflection


This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, and irreg-past-verb-form), plus four affix classes (-ed past, -ed participle, -ing participle, and third-person singular -s).

For adjectives, an initial hypothesis might be that they have an optional prefix (un-), an obligatory root (big, cool, etc.), and an optional suffix (-er, -est, or -ly).

FINITE-STATE TRANSDUCERS

A transducer maps between one representation and another; a finite-state transducer, or FST, is a type of finite automaton which maps between two sets of symbols. An FST defines a relation between sets of strings.

Let's begin with a formal definition. An FST can be formally defined with seven parameters:

Q : a finite set of N states q0, q1, ..., qN−1
Σ : a finite set corresponding to the input alphabet
Δ : a finite set corresponding to the output alphabet
q0 ∈ Q : the start state
F ⊆ Q : the set of final states
δ(q, w) : the transition function between states; given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q′ ⊆ Q. δ is thus a function from Q × Σ* to 2^Q.
σ(q, w) : the output function, giving the set of possible output strings for each state and input; given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).

FSTs have two additional closure properties.


• inversion: The inversion of a transducer T (T−1) simply switches the input and
output labels. Thus if T maps from the input alphabet I to the output alphabet O, T−1
maps from O to I.
• composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to
O2, then T1 ◦ T2 maps from I1 to O2.

Python code
# Requires the polyglot package and its English morphology model,
# downloaded with:  polyglot download morph2.en
import polyglot
from polyglot.text import Word

word = Word("Independently", language="en")
print(word, word.morphemes)

Stemming
• Stemming is a technique used to extract the base form of words by removing affixes from them.
• Stemming is a rule-based approach because it slices prefixes or suffixes off the inflected word as needed.
• There are two main errors that occur while performing stemming: over-stemming and under-stemming.

A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.
• The Porter stemmer uses suffix stripping to produce stems. It does not follow a linguistic set of rules, so the stems it produces are not always actual English words.

from nltk.stem import PorterStemmer

word_stemmer = PorterStemmer()
print(word_stemmer.stem('writing'))   # e.g. 'write'

Lemmatization
• Lemmatization is a method that reduces any word form to its base (root) form, called the lemma.
• Morphological analysis requires the extraction of the correct lemma of each word.
• It gives a stripped word that has a dictionary meaning.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')               # WordNet data is required by the lemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))   # 'book'

Difference between Stemming and Lemmatization

1. Stemming is faster because it chops words without knowing the context of the word in the sentence. Lemmatization is slower, but it considers the context of the word before proceeding.
2. Stemming is a rule-based approach; lemmatization is a dictionary-based approach.
3. Stemming is less accurate; lemmatization is more accurate.
4. When converting a word to its root form, stemming may create a word with no real meaning; lemmatization always gives a dictionary word.
5. Stemming is preferred when the meaning of the word is not important for the analysis (e.g., spam detection); lemmatization is recommended when the meaning is important (e.g., question answering).
6. Example: "Studies" => "Studi" (stemming) vs. "Studies" => "Study" (lemmatization).
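
A quick way to see the contrast is to run both normalizers on the same words (a small sketch; assumes the WordNet data has been downloaded via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["studies", "writing", "meeting"]:
    print(w, "->", stemmer.stem(w), "|", lemmatizer.lemmatize(w))
# e.g. "studies" -> "studi" (stem) but "study" (lemma)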

Parts of Speech Tagging

Part-of-speech tagging is the process of assigning a part-of-speech or other syntactic class marker to each word in a corpus.

ENGLISH WORD CLASSES
Parts of speech can be divided into two broad supercategories: closed class types and open class types. Closed classes are those that have relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely coined. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. Closed class words are also generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar.
There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs. The closed classes include words such as prepositions, determiners, pronouns, conjunctions, auxiliary verbs, particles, and numerals.

Most tagging algorithms fall into one of two classes: rule-based taggers and stochastic taggers. Rule-based taggers generally involve a large database of hand-written disambiguation rules which specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.

RULE-BASED PART-OF-SPEECH TAGGING


The earliest algorithms for automatically assigning parts-of-speech were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts-of-speech. The second stage used large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word. In the first stage of the tagger, each word is run through the two-level lexicon transducer and the entries for all possible parts-of-speech are returned. The tagger then applies a large set of constraints (as many as 3,744 constraints in the EngCG-2 system) to the input sentence to rule out incorrect parts-of-speech.

Example: If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective. (A toy sketch of this rule follows.)
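
The sketch below illustrates just that one contextual rule; the lexicon and names are invented here, and real systems such as EngCG apply thousands of such constraints:

DETERMINERS = {"the", "a", "an"}
LEXICON = {"the": {"DT"}, "a": {"DT"}, "dog": {"NN"}, "barks": {"VBZ", "NNS"}}

def tag_unknown(prev_word, word, next_word):
    """Apply the determiner ... noun heuristic to a word not found in the lexicon."""
    if prev_word in DETERMINERS and "NN" in LEXICON.get(next_word, set()):
        return "JJ"   # adjective, by the contextual rule
    return "NN"       # otherwise fall back to the most common open-class tag

print(tag_unknown("the", "blue", "dog"))   # -> JJ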

HMM PART-OF-SPEECH TAGGING


Bayesian inference or Bayesian classification was applied successfully to language problems as early as the late 1950s. Bayes' rule lets us break down any conditional probability P(x|y) into three other probabilities:

P(x|y) = P(y|x) P(x) / P(y)

HMM taggers make two simplifying assumptions. The first assumption is that the probability of a word appearing depends only on its own part-of-speech tag; it is independent of the other words and the other tags around it:

P(w_1^n | t_1^n) ≈ ∏_{i=1}^{n} P(w_i | t_i)

The second assumption is that the probability of a tag appearing depends only on the previous tag (the bigram assumption):

P(t_1^n) ≈ ∏_{i=1}^{n} P(t_i | t_{i−1})

Combining the two assumptions, an HMM tagger chooses the tag sequence t_1^n that maximizes ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i−1}).

A Hidden Markov Model (HMM) allows us to talk about both observed events (like the words that we see in the input) and hidden events (like the part-of-speech tags).
An HMM is specified by the following components:

Q = q1 q2 ... qN : a set of N states
A = a11 a12 ... ann : a transition probability matrix, each a_ij representing the probability of moving from state i to state j
O = o1 o2 ... oT : a sequence of T observations, each drawn from a vocabulary V = v1, v2, ..., vV
B = b_i(o_t) : a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i
q0, qF : a special start state and end (final) state

An HMM thus has two kinds of probabilities: the A transition probabilities and the B observation likelihoods, corresponding respectively to the prior and likelihood probabilities.

The Viterbi Algorithm for HMM Tagging


For any model, such as an HMM, that contains hidden variables, the task of determining
which sequence of variables is the underlying source of some sequence of observations is
called the decoding task. The Viterbi algorithm is perhaps the most common decoding
algorithm used for HMMs, whether for part-of-speech tagging or for speech recognition.
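
As an illustration, the following is a compact sketch of Viterbi decoding for a toy bigram HMM tagger; the probability tables are made-up numbers for illustration, not estimates from a real corpus:

def viterbi(words, tags, trans, emit, start):
    """trans[t1][t2]=P(t2|t1), emit[t][w]=P(w|t), start[t]=P(t|<s>)."""
    # first column: start probability times emission probability
    V = [{t: (start[t] * emit[t].get(words[0], 1e-8), None) for t in tags}]
    for w in words[1:]:
        # for each tag, keep the best (probability, previous-tag) pair
        V.append({t: max(((V[-1][p][0] * trans[p][t] * emit[t].get(w, 1e-8)), p)
                         for p in tags) for t in tags})
    # follow backpointers from the best final state
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for col in reversed(V[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))

tags  = ["DT", "NN", "VBZ"]
start = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans = {"DT": {"DT": 0.01, "NN": 0.9, "VBZ": 0.09},
         "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
         "VBZ": {"DT": 0.5, "NN": 0.3, "VBZ": 0.2}}
emit  = {"DT": {"the": 0.9}, "NN": {"dog": 0.5, "race": 0.4},
         "VBZ": {"runs": 0.5, "race": 0.1}}
print(viterbi(["the", "dog", "runs"], tags, trans, emit, start))  # ['DT', 'NN', 'VBZ']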

TRANSFORMATION-BASED TAGGING
Transformation-Based Tagging, sometimes called Brill tagging, is an instance of the Transformation-Based Learning (TBL) approach to machine learning and draws inspiration from both the rule-based and stochastic taggers. Like the rule-based taggers, TBL is based on rules that specify what tags should be assigned to what words. But like the stochastic taggers, TBL is a machine learning technique in which rules are automatically induced from the data. Like some but not all of the HMM taggers, TBL is a supervised learning technique; it assumes a pre-tagged training corpus.
For example, suppose the word race is most likely to be a noun:
P(NN|race) = .98
P(VB|race) = .02
This means that every occurrence of race will initially be coded as NN.
TBL first labels every word with its most likely tag. It then examines every possible transformation and selects the one that results in the most improved tagging. Finally, it re-tags the data according to this rule. The last two stages are repeated until some stopping criterion is reached, such as insufficient improvement over the previous pass. Note that stage two requires that TBL knows the correct tag of each word; that is, TBL is a supervised learning algorithm.

N-gram
The idea of word prediction is captured by probabilistic models called N-gram models, which predict the next word from the previous N − 1 words. Such statistical models of word sequences are also called language models. An N-gram is a sequence of N tokens (or words). N-grams are essential in speech recognition, handwriting recognition, and statistical machine translation.

COUNTING WORDS IN CORPORA

• Counting of things in natural language is based on a corpus (plural corpora), an online collection of text or speech. An utterance is the spoken correlate of a sentence.

Consider the sentence "I do uh main- mainly business data processing".

• This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses.
• Consider inflected forms like cats versus cat. These two words have the same lemma cat but are different wordforms.
• The wordform is the full inflected or derived form of the word.
Simple Unsmoothed N-Gram
The goal is to compute the probability of a word w given some history h, or P(w|h). Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is the: P(the | its water is so transparent that). One way to compute this probability is to estimate it from relative frequency counts:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

While this method of estimating probabilities directly from counts works fine in many cases, it turns out that even the web isn't big enough to give us good estimates in most cases.
To represent the probability of a particular random variable Xi taking on the value "the", or P(Xi = "the"), we will use the simplification P(the). We'll represent a sequence of N words either as w1 ... wn or as w_1^n. For the joint probability of each word in a sequence having a particular value, P(X = w1, Y = w2, Z = w3, ...), we'll write P(w1, w2, ..., wn).
To compute probabilities of entire sequences P(w1, w2, ..., wn), one thing we can do is decompose the probability using the chain rule of probability:

P(X1 ... Xn) = P(X1) P(X2|X1) P(X3|X_1^2) ... P(Xn|X_1^{n−1}) = ∏_{k=1}^{n} P(Xk | X_1^{k−1})

Applying the chain rule to words, we get:

P(w_1^n) = P(w1) P(w2|w1) P(w3|w_1^2) ... P(wn|w_1^{n−1}) = ∏_{k=1}^{n} P(wk | w_1^{k−1})
The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. The intuition of the N-gram model is that instead of computing the probability of a word given its entire history, we approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given all the previous words, P(wn | w_1^{n−1}), by using only the conditional probability given the preceding word, P(wn | w_{n−1}). In other words, instead of computing the probability

P(the | Walden Pond's water is so transparent that)

we approximate it with the probability P(the | that).
When a bigram model is used to predict the conditional probability of the next word, the following approximation is made:

P(wn | w_1^{n−1}) ≈ P(wn | w_{n−1})

This assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the N-gram (which looks N − 1 words into the past). Thus the general equation for the N-gram approximation to the conditional probability of the next word in a sequence is:

P(wn | w_1^{n−1}) ≈ P(wn | w_{n−N+1}^{n−1})

Example
•“the cat sat on the mat”
•P(S)=P(the)⋅P(cat|the)⋅P(sat|cat)⋅P(on|sat)⋅P(the|on)⋅P(mat|the)
•P(mat| the cat sat on the )=P(mat|the)

Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting the bigram approximation into the chain rule:

P(w_1^n) ≈ ∏_{k=1}^{n} P(wk | w_{k−1})

The simplest and most intuitive way to estimate probabilities is called Maximum Likelihood Estimation, or MLE. The MLE is estimated for the parameters of an N-gram model by taking counts from a corpus and normalizing them so they lie between 0 and 1:

P(wn | w_{n−1}) = C(w_{n−1} wn) / Σ_w C(w_{n−1} w) = C(w_{n−1} wn) / C(w_{n−1})

Example
•<s> I am Sam </s>
•<s> Sam I am </s>
•<s> I do not like green eggs and ham </s>
• P(I|<s>) = 2/3
• P(Sam|<s>) = 1/3
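
These numbers can be reproduced with a short sketch (the helper name p_mle is ours, not from the notes):

from collections import Counter

corpus = [["<s>", "I", "am", "Sam", "</s>"],
          ["<s>", "Sam", "I", "am", "</s>"],
          ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams  = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_mle(w, prev):
    """P(w | prev) = C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "<s>"))  # 1/3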

TRAINING AND TEST SETS

When using a statistical model of language given some corpus of relevant data, we start by dividing the data into training and test sets. We train the statistical parameters of the model on the training set, and then use this trained model to compute probabilities on the test set.
This training-and-testing paradigm can also be used to evaluate different N-gram
architectures.
Since our evaluation metric is based on test set probability, it’s important not to let the test
sentences into the training set. Suppose we are trying to compute the probability of a
particular “test” sentence. If our test sentence is part of the training corpus, we will
mistakenly assign it an artificially high probability when it occurs in the test set. We call this
situation training on the test set . Training on the test set introduces a bias that makes the
probabilities all look too high and causes huge inaccuracies in perplexity.

In addition to training and test sets, other divisions of data are often useful. Sometimes we need an extra source of data to augment the training set. Such extra data is called a held-out set.

EVALUATING N-GRAMS: PERPLEXITY


The best way to evaluate the performance of a language model is to embed it in an application and measure the total performance of the application. Such end-to-end evaluation is called extrinsic evaluation. Extrinsic evaluation is the only way to know if a particular improvement in a component is really going to help the task at hand. An intrinsic evaluation metric is one which measures the quality of a model independent of any application.
Perplexity is the most common intrinsic evaluation metric for N-gram language models. The intuition of perplexity is that given two probabilistic models, the better model is the one that has a tighter fit to the test data, or predicts the details of the test data better. We can measure better prediction by looking at the probability the model assigns to the test data; the better model will assign a higher probability to the test data.

PP(W) = P(w1 w2 ... wN)^(−1/N)

Minimizing perplexity is equivalent to maximizing the test set probability according to the language model.
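
As a rough sketch of the computation (reusing the p_mle helper from the bigram example above; a real test set would also need smoothing so that no bigram has zero probability):

import math

def perplexity(sentence, prob):
    """sentence includes <s> ... </s>; prob(w, prev) returns P(w | prev)."""
    log_p = 0.0
    for prev, w in zip(sentence, sentence[1:]):
        log_p += math.log(prob(w, prev))
    N = len(sentence) - 1                 # number of predicted tokens
    return math.exp(-log_p / N)           # same as P(w1 ... wN)^(-1/N)

print(perplexity(["<s>", "I", "am", "Sam", "</s>"], p_mle))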

SMOOTHING
There is a major problem with the maximum likelihood estimation process we have seen for training the parameters of an N-gram model. This is the problem of sparse data, caused by the fact that our maximum likelihood estimate was based on a particular set of training data. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. This missing data means that the N-gram matrix for any given training corpus is bound to have a very large number of cases of putative "zero probability N-grams" that should really have some non-zero probability. We therefore modify the maximum likelihood estimates for computing N-gram probabilities, focusing on the N-gram events that we incorrectly assumed had zero probability. We use the term smoothing for such modifications that address the poor estimates that are due to variability in small data sets.
Laplace Smoothing
One simple way to do smoothing might be just to take our matrix of bigram counts,
before we normalize them into probabilities, and add one to all the counts. This algorithm is
called Laplace smoothing , or Laplace’s Law. Laplace smoothing merely adds one to each
count. A related way to view smoothing is as discounting (lowering) some non-zero counts
in order to get the probability mass that will be assigned to the zero counts.
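
A minimal sketch of the add-one estimate, reusing the bigram and unigram counters from the toy corpus above (the helper name p_laplace is ours):

V = len(unigrams)                        # vocabulary size

def p_laplace(w, prev):
    """P_Laplace(w | prev) = (C(prev, w) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("Sam", "I"))             # unseen bigram, but still gets a small non-zero probability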

There are two ways to use this N-gram "hierarchy": backoff and interpolation. In backoff, if we have non-zero trigram counts, we rely solely on the trigram counts; we only "back off" to a lower-order N-gram if we have zero evidence for a higher-order N-gram. By contrast, in interpolation we always mix the probability estimates from all the N-gram estimators, i.e., we do a weighted interpolation of trigram, bigram, and unigram counts.
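
As a concrete (and standard) formulation of simple linear interpolation, not spelled out in these notes, the interpolated trigram estimate is

P̂(wn | w_{n−2} w_{n−1}) = λ1 P(wn) + λ2 P(wn | w_{n−1}) + λ3 P(wn | w_{n−2} w_{n−1})

where the weights satisfy λ1 + λ2 + λ3 = 1 and are typically tuned on a held-out corpus.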

Multiword Expressions
Multiword expressions (MWEs) are expressions which are made up of at least 2
words and which can be syntactically and/or semantically idiosyncratic in nature.
Moreover, they act as a single unit at some level of linguistic analysis.

Classification of MWEs
1 Fixed expressions
Fixed expressions are fully lexicalized and can neither be varied morphosyntactically nor modified internally. Examples of fixed expressions are: in short, by and large, every which way. They are fixed, as you cannot say in shorter or in very short.
2 Semi-fixed expressions
In semi-fixed expressions, word order and composition are strictly invariable, while inflection, variation in reflexive form, and determiner selection are possible.
In non-decomposable idioms (i.e., idioms in which the meaning cannot be assigned to the parts of the MWE) such as kick the bucket, the verb can be inflected according to a particular context: he kicks the bucket.
Another type of semi-fixed expression is the compound nominal, such as car park or peanut butter. These are syntactically unalterable but can inflect for number: 2 car parks.
3 Syntactically-Flexible Expressions
Syntactically-flexible expressions have a wider range of syntactic variability than semi-fixed expressions. They occur in the form of decomposable idioms, verb-particle constructions, and light verbs.
Decomposable idioms are likely to be syntactically flexible to some degree. Examples are let the cat out of the bag and sweep under the rug.
Verb-particle constructions, such as write up and look up, are made up of a verb and one or more particles.
For light verb constructions, such as make a mistake or give a demo, it is difficult to predict which light verb combines with a given noun.

4 Institutionalized Phrases
Institutionalized phrases are conventionalized phrases, such as salt and pepper, traffic light, and to kindle excitement. They are semantically and syntactically compositional, but statistically idiosyncratic.
Collocation
A collocation is an expression consisting of two or more words that correspond to some
conventional way of saying things. Collocations include noun phrases like strong tea and
weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the
rich and powerful. Collocations are characterized by limited compositionality. Collocations
are not fully compositional in that there is usually an element of meaning added to the
combination.

Frequency
Surely the simplest method for finding collocations in a text corpus is counting. If two words
occur together a lot, then that is evidence that they have a special function that is not simply
explained as the function that results from their combination.

Mean and Variance


Frequency-based search works well for fixed phrases. But many collocations consist of two
words that stand in a more flexible relationship to one another. Consider the verb knock and
one of its most frequent arguments, door. Here are some examples of knocking on or at a
door from our corpus:
a. she knocked on his door
b. they knocked at the door
c. 100 women knocked on Donaldson's door
d. a man knocked on the metal front door
The words that appear between knocked and door vary, and the distance between the two words is not constant, so a fixed-phrase approach would not work here. A short note is in order here on collocations that occur as a fixed phrase versus those that are more variable. We define a collocational window (usually a window of 3 to 4 words on each side of a word), and we enter every word pair in there as a collocational bigram. One way of discovering the relationship between knocked and door is to compute the mean and variance of the offsets (signed distances) between the two words in the corpus. The mean is simply the average offset.
The mean and deviation characterize the distribution of distances between two words in a corpus. We can use this information to discover collocations by looking for pairs with low deviation. A low deviation means that the two words usually occur at about the same distance. Zero deviation means that the two words always occur at exactly the same distance. (A small sketch of this computation is given below.)
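
The sketch below applies the idea to the four example sentences above, using a full-sentence window rather than the 3–4 word window mentioned:

import statistics

sentences = ["she knocked on his door",
             "they knocked at the door",
             "100 women knocked on Donaldson's door",
             "a man knocked on the metal front door"]

offsets = []
for s in sentences:
    toks = s.split()
    if "knocked" in toks and "door" in toks:
        offsets.append(toks.index("door") - toks.index("knocked"))   # signed distance

print(offsets)                          # [3, 3, 3, 5]
print(statistics.mean(offsets))         # mean offset: 3.5
print(statistics.pstdev(offsets))       # low deviation -> fairly rigid pattern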
Hypothesis Testing
Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing. We formulate a null hypothesis H0 that there is no association between the words beyond chance occurrences, compute the probability p that the event would occur if H0 were true, and then reject H0 if p is too low.

H0: P(w1 w2) = P(w1) P(w2)

The null model implies that the probability of co-occurrence is just the product of the probabilities of the individual words.
The t test
A test that has been widely used for collocation discovery is the t test. The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ. The test statistic is

t = (x̄ − μ) / sqrt(s² / N)

where x̄ is the sample mean, s² is the sample variance, and N is the sample size.
To see how to use the t test for finding collocations, let us compute the t value for new companies (the counts below come from a corpus of 14,307,668 tokens):

H0: P(new companies) = P(new) P(companies)
P(new) = 15828 / 14307668
P(companies) = 4675 / 14307668
P(new companies) ≈ 3.615 × 10^−7 under H0

t ≈ 0.999932

Since this value is below the critical value (2.576 at a significance level of 0.005), we cannot reject the null hypothesis; the data do not give good evidence that new companies is a collocation.

Pearson's chi-square test


Use of the t test has been criticized because it assumes that probabilities are approximately normally distributed, which is not true in general. An alternative test for dependence which does not assume normally distributed probabilities is the χ² (chi-square) test. The essence of the test is to compare the observed co-occurrence frequencies in a contingency table with the frequencies expected under independence:

χ² = Σ_{i,j} (O_ij − E_ij)² / E_ij

If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.

Likelihood ratios
Likelihood ratios are another approach to hypothesis testing. A likelihood ratio compares two hypotheses about a word pair (independence versus dependence), using for each hypothesis the maximum-likelihood estimates of the population parameters that maximize the probability of the observed data; the pair is a good collocation candidate if the data are much more likely under the dependence hypothesis.

CODE
import nltk
nltk.download('stopwords')
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
from nltk.corpus import stopwords

sentence = ("natural language processing (NLP) is one of the fields of artificial "
            "intelligence which uses machine learning techniques")
words = sentence.split()

stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset        # drop short words and stopwords

bigram_collocation = BigramCollocationFinder.from_words(words)
bigram_collocation.apply_word_filter(filter_stops)         # apply the filter defined above
print(bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15))
