Unit - 4 NLP - R20

UNIT - IV Interpretation and Modelling Lecture

Semantic Interpretation: Semantic & logical form, Word senses & ambiguity, The basic logical form language, Encoding ambiguity in the logical form, Verbs & states in logical form, Thematic roles, Speech acts & embedded sentences, Defining semantic structure: model theory. Language Modelling: Introduction, n-Gram models, Language model evaluation, Parameter estimation, Language model adaptation, Types of language models, Language-specific modelling problems, Multilingual and cross-lingual language modelling.

Question and answer pattern


Short Answer Questions

1. Define language modelling.


A model that specifies the a priori probability of a particular word sequence in the
language of interest.
Given an alphabet or inventory of units Σ and a sequence W = w1w2 ... wt ∈ Σ∗, a
language model can be used to compute the probability of W based on parameters
previously estimated from a training set.
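The following is a minimal sketch (invented numbers, not from the notes) of this idea: given unigram parameters already estimated from some training set, it computes the probability of a word sequence W over the inventory Σ.

```python
# A minimal sketch (invented toy probabilities): given unigram parameters
# estimated from some training set, compute P(W) for a word sequence W.
params = {"the": 0.30, "cat": 0.10, "sat": 0.05, "mat": 0.05, "on": 0.20}  # assumed estimates

def p_sequence(words, p=params):
    prob = 1.0
    for w in words:
        prob *= p.get(w, 0.0)   # unseen words get probability 0 in this toy sketch
    return prob

print(p_sequence(["the", "cat", "sat"]))   # 0.30 * 0.10 * 0.05 = 0.0015
```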

2. What is an n-gram model?


The probability of a word sequence W of nontrivial (nonzero) length cannot be computed directly, because unrestricted natural language permits an infinite number of word sequences of variable lengths. The probability P(W) can be decomposed into a product of component probabilities according to the chain rule of probability:

P(W) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · ... · P(wt | w1 w2 ... wt−1)

Because the individual terms in this product are still too difficult to compute directly, statistical language models make use of the n-gram approximation, which is why they are also called n-gram models.
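As a rough illustration (toy corpus, not from the notes), the sketch below applies the bigram (n = 2) form of the approximation, estimating each P(wi | wi−1) by relative frequency.

```python
# A minimal sketch (toy corpus): the bigram approximation
# P(W) ≈ product over i of P(w_i | w_{i-1}).
from collections import Counter

corpus = "<s> the cat sat on the mat </s> <s> the dog sat on the rug </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(prev, word):
    # relative-frequency estimate of P(word | prev)
    return bigrams[(prev, word)] / unigrams[prev]

def p_sequence(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sequence("<s> the cat sat on the rug </s>".split()))   # 0.0625
```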

3. Define language model evaluation.


Typically, two criteria are used to evaluate a language model: coverage rate and perplexity, both measured on a held-out test set that does not form part of the training data.
The coverage rate measures the percentage of n-grams in the test set that are represented in the language model. A special case of this is the out-of-vocabulary rate (or OOV rate), which is 100 minus the unigram coverage rate, or, in other words, the percentage of unique word types not covered by the language model.
Perplexity can be thought of as the average number of equally likely successor words when transitioning from one position in the word string to the next. If the model has no predictive power at all, perplexity is equal to the vocabulary size.
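The sketch below (assumed smoothed probability function, not from the notes) computes perplexity from per-word log probabilities and confirms the last point: a model that assigns 1/V to every word has perplexity equal to the vocabulary size V.

```python
# A minimal sketch: perplexity as the inverse geometric mean of per-word
# probabilities, PP = 2 ** (-(1/N) * sum(log2 P(w_i | history))).
import math

def perplexity(test_words, prob_fn):
    log_sum = 0.0
    for i, w in enumerate(test_words):
        p = prob_fn(test_words[:i], w)   # P(w | history); must be > 0 (i.e., smoothed)
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(test_words))

# A model with no predictive power assigns P = 1/V to every word,
# so its perplexity equals the vocabulary size V.
V = 1000
print(perplexity(["any"] * 50, lambda hist, w: 1.0 / V))   # ≈ 1000.0
```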

4. What do you mean by parameter estimation?


Parameter estimation is a branch of statistics that involves using sample data to estimate the parameters of a distribution. In language modelling, the n-gram probabilities are the parameters that must be estimated from a training corpus. A number of estimation techniques are available, including probability plotting, rank regression (least squares), maximum likelihood estimation, and Bayesian parameter estimation.
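For n-gram models, the maximum likelihood estimate reduces to a relative frequency. A minimal sketch (toy data, not from the notes):

```python
# Maximum likelihood estimation of bigram parameters as relative frequencies:
# P_ML(w | v) = c(v, w) / c(v), computed from a tiny toy corpus.
from collections import Counter, defaultdict

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

p_ml = defaultdict(dict)
for (v, w), c in bigram_counts.items():
    p_ml[v][w] = c / context_counts[v]

print(p_ml["the"])   # {'cat': 0.5, 'mat': 0.5}
```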

5. What is language model adaptation?


The amount of available language model training data is often insufficient, particularly when porting a speech or language processing system to a new domain, topic, or language (for example, from English to Telugu, or from Topic 1 to Topic 2 within English). For this reason, much effort has been invested in language model adaptation, that is, designing and tuning a language model such that it performs well on a new test set for which little equivalent training data is available.
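One common adaptation strategy (sketched below with assumed component models and an invented weight, not from the notes) is linear interpolation of a small in-domain model with a large out-of-domain model.

```python
# A minimal sketch: linear interpolation for adaptation,
# P(w | h) = lam * P_in(w | h) + (1 - lam) * P_out(w | h).
def interpolate(p_in, p_out, lam=0.3):
    def p(history, word):
        return lam * p_in(history, word) + (1.0 - lam) * p_out(history, word)
    return p

# Toy stand-ins for the two component models; lam would normally be tuned
# on held-out in-domain data.
p_in  = lambda h, w: 0.02 if w == "telugu" else 0.001    # small in-domain model
p_out = lambda h, w: 0.0001 if w == "telugu" else 0.01   # large out-of-domain model
p_adapted = interpolate(p_in, p_out, lam=0.5)
print(p_adapted((), "telugu"))   # 0.5*0.02 + 0.5*0.0001 = 0.01005
```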

6. Briefly define coverage rate and perplexity.


Coverage rate measures the percentage of n-grams in the test set that are
represented in the language model. A special case of this is the out-of-vocabulary
rate (or OOV rate), which is 100 minus the unigram coverage rate, or, in other words,
the percentage of unique word types not covered by the language model.
Perplexity can be thought of as the average number of equally likely successor
words when transitioning from one position in the word string to the next. If the model
has no predictive power at all, perplexity is equal to the vocabulary size.


7. What is a class-based language model?


Class-based language models are a simple way of addressing data sparsity in
language modelling. Words are first clustered into classes, either by automatic
means or based on linguistic criteria, for example, using part-of-speech (POS)
classes. Class-based models have been successful in reducing perplexity as well as improving practical performance in a wide range of language processing systems.
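A minimal sketch (invented word classes and toy probabilities, not from the notes) of a basic class-based bigram, where the word depends only on its class and the class depends on the preceding class:

```python
# Class-based bigram sketch: P(w_i | w_{i-1}) ≈ P(w_i | c_i) * P(c_i | c_{i-1}),
# assuming words are conditionally independent of other words given the class.
word_class = {"monday": "DAY", "tuesday": "DAY", "cat": "NOUN", "dog": "NOUN"}

p_word_given_class = {("monday", "DAY"): 0.5, ("tuesday", "DAY"): 0.5,
                      ("cat", "NOUN"): 0.6, ("dog", "NOUN"): 0.4}      # toy estimates
p_class_given_class = {("NOUN", "DAY"): 0.1, ("DAY", "NOUN"): 0.05}    # toy estimates

def p_class_bigram(prev_word, word):
    c_prev, c = word_class[prev_word], word_class[word]
    return p_word_given_class[(word, c)] * p_class_given_class[(c_prev, c)]

print(p_class_bigram("cat", "monday"))   # P(monday | DAY) * P(DAY | NOUN) = 0.05
```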

8. What do you understand by language-specific modelling problems?


Speech and language processing technology has been ported to a range of other
languages, some of which have highlighted problems with the standard n-gram
modelling approach and have necessitated modifications to the traditional language
modelling framework. There are three types of language-specific problems:
morphological complexity, lack of word segmentation, and spoken versus written
languages.

9. Differentiate between spoken and written languages.


10. What is multilingual language modelling?

A system can be presented with multiple languages sequentially (e.g., different users speaking different languages, without advance indication of which language will be encountered next) or simultaneously, as happens in the case of code switching. Here, speakers may use several languages or dialects side by side, often within the same utterance.

Example from Franco and Solorio:

I need to tell her que no voy a poder ir.

'I need to tell her that I won't be able to make it.'

Long Answer Questions

1. Explain three types of language models.

There are three main types of language models:

a) Statistical Language Models

N-Gram

Unigram

Bidirectional

Exponential

Continuous Space

b) Neural Language Models

c) Class-Based Language Models

a)Statistical Language Models:

N-Gram: This is one of the simplest approaches to language modelling. Here, a probability distribution over a sequence of ‘n’ words is created, where ‘n’ can be any number and defines the size of the gram (the sequence of words being assigned a probability).

If n=4, a gram may look like: “can you help me”.

Basically, ‘n’ is the amount of context that the model is trained to consider. There
are different types of N-Gram models such as unigrams, bigrams, trigrams, etc.
Unigram: The unigram is the simplest type of language model. It doesn't look at any conditioning context in its calculations; it evaluates each word or term independently. Unigram models are commonly used for language processing tasks such as information retrieval. The unigram is the foundation of a more specific model variant called the query likelihood model, which is used in information retrieval to examine a pool of documents and match the most relevant one to a specific query.
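A minimal sketch of the query likelihood idea (toy documents, add-one smoothing as an assumption, not from the notes): each document gets its own unigram model, and documents are ranked by the probability that model assigns to the query.

```python
# Query likelihood sketch: score each document by the probability its unigram
# model assigns to the query, then return the best-matching document.
from collections import Counter

docs = {"d1": "the cat sat on the mat".split(),
        "d2": "dogs chase cats in the park".split()}

def query_likelihood(query, doc_tokens):
    counts, n = Counter(doc_tokens), len(doc_tokens)
    score = 1.0
    for w in query:
        score *= (counts[w] + 1) / (n + len(counts))   # add-one smoothing (an assumption)
    return score

query = "cat mat".split()
best = max(docs, key=lambda d: query_likelihood(query, docs[d]))
print(best)   # expected: d1
```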

Bidirectional: Unlike n-gram models, which analyze text in one direction (conditioning only on the preceding words), bidirectional models analyze text in both directions, backwards and forwards. These models can predict any word in a sentence or body of text by using every other word in the text. Examining text bidirectionally increases result accuracy.
This type is often utilized in machine learning and speech generation applications.
For example, Google uses a bidirectional model to process search queries.

Exponential: This type of statistical model evaluates text using an equation that combines n-grams and feature functions. Here the features and parameters of the desired results are already specified. The model is based on the principle of maximum entropy, which states that the probability distribution with the most entropy, subject to the feature constraints, is the best choice. Exponential models make fewer statistical assumptions, which means the results are more likely to be accurate.

Continuous space: In this type of statistical model, words are represented as a non-linear combination of weights in a neural network. The process of assigning a weight vector to a word is known as word embedding. This type of model proves helpful in scenarios where the vocabulary keeps growing and includes many rare or unique words. In such cases, count-based models such as n-grams do not work well: as the number of words increases, the number of possible word sequences grows, and the patterns predicting the next word become weaker.

b) Neural Language Models: These language models are based on neural networks and are often considered an advanced approach to executing NLP tasks. Neural language models overcome the shortcomings of classical models such as n-grams and are used for complex tasks such as speech recognition or machine translation. Language is significantly complex and keeps on evolving; therefore, the more expressive the language model is, the better it tends to perform at NLP tasks. Compared to the n-gram model, an exponential or continuous space model is a better option for NLP tasks because such models are designed to handle ambiguity and language variation. Language models should also be able to manage long-range dependencies.

For example, a model should be able to understand words derived from different
languages.
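To make the neural approach concrete, here is a minimal sketch (toy sizes, one possible architecture, not from the notes) of a tiny feedforward neural language model that embeds the previous two words and predicts the next word.

```python
# A minimal feedforward neural LM sketch in PyTorch: embed the previous two
# words, pass through a hidden layer, and output log-probabilities over the
# vocabulary for the next word.
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden=32, context=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                     # shape: (batch, context)
        e = self.emb(context_ids).flatten(1)            # (batch, context * emb_dim)
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)   # log P(next word | context)

model = TinyNeuralLM(vocab_size=100)
log_probs = model(torch.tensor([[3, 7]]))               # ids of the previous two words
print(log_probs.shape)                                  # torch.Size([1, 100])
```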

c) Class-Based Language Models: Class-based language models are a simple way of addressing data sparsity in language modelling. Words are first clustered into classes, either by automatic means or based on linguistic criteria, for example, using part-of-speech (POS) classes. The statistical model makes the assumption that words are conditionally independent of other words given the current word class; in a more general formulation, the current word is conditioned not only on the current word class but also on the preceding word classes. Class-based models have been successful in reducing perplexity as well as improving practical performance in a wide range of language processing systems; however, they typically need to be interpolated with a word-based language model.

2. Differentiate between class-based and variable-length language models.


Class-Based Language Models: Class-based language models are a simple way
of addressing data sparsity in language modelling. Words are first clustered into
classes, either by automatic means or based on linguistic criteria, for example, using
part-of-speech (POS) classes.


The statistical model makes the assumption that words are conditionally independent
of other words given the current word class.

In a more general formulation, the current word is conditioned not only on the current word class but also on the preceding word classes.

Class-based models have been successful in reducing perplexity as well as improving practical performance in a wide range of language processing systems; however, they typically need to be interpolated with a word-based language model.

Variable-Length Language Models: In standard language modeling, vocabulary


units are defined by simple criteria, such as whitespace delimiters, and the prediction of the probability of the next word is based on a fixed-length history.

Several modifications to this basic approach have been developed that aim at
redefining vocabulary units in a data-driven way, resulting in merged units composed
out of a variable number of basic units.

These approaches are termed variable-length n-gram models. The challenge in


these models is to find the best segmentation of the word sequence w1 w2 ...wt into
language modeling units in addition to estimating the language model probabilities.
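As a rough illustration of data-driven unit merging (the actual segmentation algorithms in the literature differ; this toy sketch simply joins the most frequent pair of adjacent units, in the spirit of byte-pair-encoding-style approaches):

```python
# A minimal sketch: build variable-length units by repeatedly merging the most
# frequent pair of adjacent units in the training text.
from collections import Counter

def merge_units(tokens, num_merges=1):
    units = list(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(units, units[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(units):
            if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
                merged.append(a + "_" + b)   # new merged, variable-length unit
                i += 2
            else:
                merged.append(units[i])
                i += 1
        units = merged
    return units

print(merge_units("new york is bigger than new york city".split()))
# ['new_york', 'is', 'bigger', 'than', 'new_york', 'city']
```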

3. Differentiate between multilingual and cross-lingual language modelling.

Multilingual language modelling: A system can be presented with multiple


languages sequentially (e.g., different users speaking different languages, without
advance indication of which language will be encountered next), or simultaneously,
as happens in the case of code switching. Here, speakers may use several
languages or dialects side by side, often within the same utterance.

Cross-lingual language modelling: Cross-lingual learning is a paradigm for


transferring knowledge from one natural language to another. The transfer of knowledge can help us overcome the lack of data in the target languages and create intelligent systems and machine learning models for languages where this was not possible previously.

The main difference between multilingual and cross-lingual language modelling is:


Cross-lingual embeddings attempt to ensure that words that mean the same thing in
different languages map to almost the same vector.

Multilingual embeddings only require that the embeddings work well in language A and in language B separately, without any guarantees about interaction between different languages.

4. Write a short note on language-specific modelling problems.

The majority of language modeling research has focused on the English language.
However, speech and language processing technology has been ported to a range
of other languages, some of which have highlighted problems with the standard n-
gram modeling approach and have necessitated modifications to the traditional
language modeling framework. Here, we look at three types of language-specific
problems: morphological complexity, lack of word segmentation, and spoken versus
written languages.

5. Explain coverage rate and perplexity.

Typically, two criteria are used to define language model evaluation: coverage rate
and perplexity on a held-out test set that does not form part of the training data.

The coverage rate measures the percentage of n-grams in the test set that are
represented in the language model. A special case of this is the out-of-vocabulary
rate (or OOV rate), which is 100 minus the unigram coverage rate, or, in other words,
the percentage of unique word types not covered by the language model.

Perplexity can be thought of as the average number of equally likely successor


words when transitioning from one position in the word string to the next. If the model
has no predictive power at all, perplexity is equal to the vocabulary size.

6. Write a short note on language modelling.

Definition: A model that specifies the a priori probability of a particular word sequence in the language of interest. Given an alphabet or inventory of units Σ and a sequence W = w1 w2 ... wt ∈ Σ*, a language model can be used to compute the probability of W based on parameters previously estimated from a training set.


Most commonly, the inventory Σ (also called vocabulary) is the list of unique words
encountered in the training data; however, as we will see in this chapter, selecting
the units over which a language model should be defined can be a rather difficult
problem, particularly in languages other than English.

For example: Telugu, Arabic, Urdu, Kashmiri, Persian, Hindi, etc.

A language model is usually combined with some other model or models that hypothesize possible word sequences.

In speech recognition, a speech recognizer combines acoustic model scores (and possibly other scores, such as pronunciation model scores) with language model scores to decode spoken word sequences from an acoustic signal.

In machine translation, a language model is used to score translation hypotheses


generated by a translation model. Language models have also become a standard
tool in information retrieval, authorship identification, and document classification. In

several related fields, language models are used that are defined not over words but
over acoustic units or isolated text characters.

One of the core approaches to language identification, for example, relies on


language models over phones or phonemes; in optical character recognition,
language models predicting character sequences are used.
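As a rough illustration of the language-identification case (toy samples and a simplified matching score standing in for a smoothed character-level language model, not from the notes):

```python
# A minimal sketch: identify the language of a text by comparing its character
# trigrams against character trigram counts from small samples of each language.
from collections import Counter

samples = {"english": "the quick brown fox jumps over the lazy dog",
           "german":  "der schnelle braune fuchs springt ueber den faulen hund"}

def char_trigrams(text):
    text = f"  {text} "                                   # pad so word edges form trigrams
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

models = {lang: char_trigrams(text) for lang, text in samples.items()}

def identify(text):
    grams = char_trigrams(text)
    # Weighted trigram overlap; a stand-in for a smoothed character LM probability.
    return max(models, key=lambda lang: sum(models[lang][g] * c for g, c in grams.items()))

print(identify("the dog jumps"))    # expected: english
```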

7. Differentiate between discriminative and syntax-based language models.

Discriminative Language Models: Standard n-gram models embody a generative


model for assigning a probability to a given word sequence W. However, in
practical applications like machine translation or speech recognition, the task of a
language model is often to separate good sentence hypotheses from bad sentence
hypotheses. For this reason, it would be desirable to train language model
parameters discriminatively, such that word strings of widely differing quality receive
maximally distinct probability estimates.

Syntax-Based Language Models: A well-known drawback of n-gram language


models is that they cannot take into account relevant words in the history that fall
outside the limited window of the directly preceding n − 1 words.

However, natural language exhibits many types of long-distance dependencies, where the choice of the current word depends on words that are relatively far removed in terms of sentence position. For example, in a sentence such as "Investors who bought the stock last year are now selling", the plural noun "Investors" triggers the plural verb "are", but it is not taken into account as a conditioning variable by an n-gram model, where n is usually no larger than 4 or 5.

To address this problem, several approaches to syntax-based language modeling


have been developed, whose goal is to explicitly model such syntactic relationships
and use them to estimate better probabilities. Most of these approaches use a
statistical parser to construct the syntactic representation S of the sentence and
define a probability model that incorporates S.


8. Explain language modelling for morphologically rich languages.

A morphologically rich language is characterized by a large number of different unique word forms (types) in relation to the number of word tokens in a text, which is due to productive morphological (word-formation) processes in the language.
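As a rough illustration (toy strings; the second line uses approximate romanized Turkish forms purely for contrast), the type/token ratio gives a quick numerical handle on this: richer morphology yields more unique word forms per running word.

```python
# A minimal sketch: the type/token ratio as a crude measure of morphological richness.
def type_token_ratio(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

english = "i live you live he lives we live they live"
agglutinative = "yasiyorum yasiyorsun yasiyor yasiyoruz yasiyorlar"  # approximate, illustrative forms
print(type_token_ratio(english), type_token_ratio(agglutinative))    # 0.7 vs 1.0
```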

Multiple word forms can be derived from a small number of base forms. A morpheme is the smallest meaning-bearing unit in a language. Morphemes can be either free (i.e., they can occur on their own) or bound (i.e., they must be combined with some other morpheme).

Morphological processes include compounding (forming a new word out of two


independently existing free morphemes), derivation (combination of a free
morpheme and a bound morpheme to form a new word), and inflection (combination
of a free and a bound morpheme to signal a particular grammatical feature).

Germanic languages, for example, are notorious for their high degree of
compounding, especially for nominals.

Turkish, an agglutinative language, combines several morphemes into a single word; thus, the same material that would be expressed as a syntactic phrase in English can be found as a single whitespace-delimited unit in a Turkish sentence, such as: görülmemeliydik = ‘we should not have been seen’.

As a result, Turkish has a huge number of possible words. Many languages have
rich inflectional paradigms. In languages like Finnish and Arabic, a root (base form)
may have thousands of different morphological realizations.

The table (not reproduced here) shows two Modern Standard Arabic (MSA) inflectional paradigms: one for present tense verbal inflections of the root skn (basic meaning: ‘live’), and one for pronominal possessive inflections of the root ktb (basic meaning: ‘book’).


9. What is the need for using language modelling? Explain.

Language models determine word probability by analyzing text data. They interpret
this data by feeding it through an algorithm that establishes rules for context in
natural language. Then, the model applies these rules in language tasks to
accurately predict or produce new sentences.

10. Write a short note on n-gram notation.

The probability of a word sequence W of nontrivial (nonzero) length cannot be computed directly, because unrestricted natural language permits an infinite number of word sequences of variable lengths.

The probability P(W) can be decomposed into a product of component probabilities according to the chain rule of probability:

P(W) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · ... · P(wt | w1 w2 ... wt−1)

Because the individual terms in this product are still too difficult to compute directly, statistical language models make use of the n-gram approximation, which is why they are also called n-gram models.

The assumption is that all previous words except for the n − 1 words directly
preceding the current word are irrelevant for predicting the current word, or,
alternatively, that they are equivalent.

Depending on the length of n, we can distinguish between:

unigrams (n = 1), bigrams (n = 2), trigrams (n = 3), or 4-grams, 5-grams, and so on.

“John gifted a watch to his mother.”

unigrams (n = 1), “John”, “gifted”, “a”, “watch”, “to”, “his”, “mother”

bigrams (n = 2), “John gifted”, “gifted a”, “a watch”, “watch to”, “to his”, “his mother”

trigrams (n = 3), “John gifted a”, “gifted a watch”, “a watch to”, “watch to his”, “to his mother”
So on…
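A minimal sketch (not from the notes) that extracts the n-grams of a sentence for any n, reproducing the unigram, bigram, and trigram lists above:

```python
# Extract the n-grams of a sentence for a given n.
def ngrams(sentence, n):
    words = sentence.replace(".", "").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "John gifted a watch to his mother."
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```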
