Unit - 4 NLP - R20
Semantic Interpretation - Semantic & Logical Form, Word Senses & Ambiguity, the
Basic Logical Form Language, Encoding Ambiguity in the Logical Form, Verbs & States
in Logical Form, Thematic Roles, Speech Acts & Embedded Sentences, Defining
Semantic Structure: Model Theory. Language Modelling - Introduction, n-Gram Models,
Language Model Evaluation, Parameter Estimation, Language Model Adaptation,
Types of Language Models, Language-Specific Modelling Problems, Multilingual and
Cross-lingual Language Modelling.
Perplexity can be thought of as the average number of equally likely successor
words when transitioning from one position in the word string to the next. If the model
has no predictive power at all, perplexity is equal to the vocabulary size.
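As a sketch of the standard definition (for a test string W of N words under model probability P):

    PP(W) = P(w1 w2 ... wN)^(-1/N)

If every one of the V vocabulary words were equally likely at each position, then P(W) = (1/V)^N and the perplexity would come out to V, which matches the statement above about the vocabulary size.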
A system can be presented with multiple languages sequentially (e.g., different users
speaking different languages, without advance indication of which language will be
encountered next), or simultaneously, as happens in the case of code switching.
Here, speakers may use several languages or dialects side by side, often within the
same utterance.
Types of Language Models:
N-Gram
Unigram
Bidirectional
Exponential
Continuous Space
N-Gram: In an n-gram model, 'n' defines the size of the gram (or sequence of words
being assigned a probability). Basically, 'n' is the amount of context that the model is
trained to consider. There are different types of N-Gram models such as unigrams,
bigrams, trigrams, etc.
Unigram: The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently.
Unigram models commonly handle language processing tasks such as information
retrieval. The unigram is the foundation of a more specific model variant called the
query likelihood model, which uses information retrieval to examine a pool of
documents and match the most relevant one to a specific query.
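A minimal query-likelihood sketch in Python (the toy documents, the scoring function, and the fixed epsilon used for unseen words are illustrative assumptions, not a full retrieval system):

    from collections import Counter

    def unigram_model(doc_tokens):
        # relative-frequency estimate of P(word) for one document
        counts = Counter(doc_tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def query_likelihood(query_tokens, model, epsilon=1e-6):
        # P(query | document) under the unigram model; epsilon stands in
        # for proper smoothing of words unseen in the document
        p = 1.0
        for w in query_tokens:
            p *= model.get(w, epsilon)
        return p

    docs = {"d1": "the cat sat on the mat".split(),
            "d2": "dogs chase cats in the park".split()}
    models = {d: unigram_model(toks) for d, toks in docs.items()}
    print(max(models, key=lambda d: query_likelihood("cat mat".split(), models[d])))  # d1

The document whose unigram model gives the query the highest probability is returned as the most relevant one.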
Exponential: This type of statistical model evaluates text by using an equation that
combines n-grams and feature functions. Here the features and parameters of the
desired results are already specified. The model is based on the principle of maximum
entropy, which states that the probability distribution with the most entropy is the best
choice. Exponential models make fewer statistical assumptions, which means the
chances of obtaining accurate results are higher.
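A sketch of the usual maximum-entropy (exponential) form, where the feature functions f_i and weights λ_i are generic placeholders and h is the conditioning history:

    P(w | h) = exp( Σ_i λ_i f_i(w, h) ) / Z(h),   with   Z(h) = Σ_w' exp( Σ_i λ_i f_i(w', h) )

The normalizer Z(h) makes the scores over the vocabulary sum to one, so the exponentiated feature weights define a proper probability distribution.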
Continuous space: In this type of statistical model, words are represented as a non-
linear combination of weights in a neural network. The process of assigning a weight to
a word is known as word embedding. This type of model proves helpful in scenarios
where the data set of words continues to become large and include unique words. In
cases where the data set is large and consists of rarely used or unique words, linear
models such as n-grams do not work. This is because, with increasing words, the
possible word sequences increase, and thus the patterns predicting the next word
become weaker.
For example, a model should be able to understand words derived from different
languages.
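A toy, untrained sketch of the idea (the vocabulary, layer sizes, and random weights below are purely illustrative; a real continuous-space model learns these parameters from data):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<s>", "john", "gifted", "a", "watch", "to", "his", "mother"]
    V, d, h = len(vocab), 8, 16          # vocabulary size, embedding size, hidden size
    E = rng.normal(size=(V, d))          # word embeddings: one continuous vector per word
    W1 = rng.normal(size=(2 * d, h))     # hidden layer for a two-word context
    W2 = rng.normal(size=(h, V))         # output layer over the vocabulary

    def next_word_distribution(context_ids):
        x = np.concatenate([E[i] for i in context_ids])   # continuous context representation
        z = np.tanh(x @ W1) @ W2
        p = np.exp(z - z.max())
        return p / p.sum()                                 # softmax over the vocabulary

    p = next_word_distribution([vocab.index("john"), vocab.index("gifted")])
    print(vocab[int(p.argmax())])        # arbitrary here, since the weights are untrained

Because similar words receive similar embedding vectors, the model can generalize to word sequences it has never seen, which is exactly where plain n-gram counts break down.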
In a class-based n-gram model, the statistical model makes the assumption that words
are conditionally independent of other words given the current word class. In other
variants, the current word is conditioned not only on the current word class but also on
the preceding word classes.
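A sketch of the standard class-based bigram factorization (c_i denotes the class of word w_i):

    P(w_i | w_{i-1}) ≈ P(w_i | c_i) · P(c_i | c_{i-1})

The class-to-class and word-given-class probabilities are estimated from data, so a word that was never seen after a particular predecessor can still receive a reasonable probability through its class.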
Several modifications to this basic approach have been developed that aim at
redefining vocabulary units in a data-driven way, resulting in merged units composed
of a variable number of basic units.
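One well-known data-driven scheme of this kind is byte-pair-encoding-style merging (named here only as an illustration; the notes do not commit to a specific algorithm). A minimal sketch:

    from collections import Counter

    def most_frequent_pair(words):
        # count adjacent unit pairs across all words
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        return pairs.most_common(1)[0][0] if pairs else None

    def merge_pair(words, pair):
        # replace every occurrence of the pair with a single merged unit
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        return merged

    corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]
    for _ in range(5):                        # five merge operations
        pair = most_frequent_pair(corpus)
        print("merging", pair)
        corpus = merge_pair(corpus, pair)
    print(corpus)                             # words now consist of variable-length merged units

Repeatedly merging the most frequent pair of units yields vocabulary entries that fall between characters and whole words, which is useful for morphologically rich languages.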
The main difference between multilingual and cross-lingual language modelling is:
Cross-lingual embeddings attempt to ensure that words that mean the same thing in
different languages map to almost the same vector.
Multilingual embeddings only require that the embeddings work well in language A and
work well in language B separately, without any guarantees about interaction
between different languages.
The majority of language modeling research has focused on the English language.
However, speech and language processing technology has been ported to a range
of other languages, some of which have highlighted problems with the standard n-
gram modeling approach and have necessitated modifications to the traditional
language modeling framework. Here, we look at three types of language-specific
problems: morphological complexity, lack of word segmentation, and spoken versus
written languages.
Typically, two criteria are used to evaluate language models: the coverage rate and
the perplexity on a held-out test set that does not form part of the training data.
The coverage rate measures the percentage of n-grams in the test set that are
represented in the language model. A special case of this is the out-of-vocabulary
rate (or OOV rate), which is 100 minus the unigram coverage rate, or, in other words,
the percentage of unique word types not covered by the language model.
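A small Python sketch of the OOV-rate calculation (the toy vocabulary and test sentence are made up for illustration):

    def oov_rate(test_tokens, vocabulary):
        # percentage of unique word types in the test data not covered by the vocabulary
        types = set(test_tokens)
        oov_types = [w for w in types if w not in vocabulary]
        return 100.0 * len(oov_types) / len(types)

    vocabulary = {"the", "cat", "sat", "on", "mat"}
    print(oov_rate("the cat sat on the sofa".split(), vocabulary))   # 20.0, since 'sofa' is unseen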
Given a word sequence W = w1w2 ... wt ∈ Σ∗, a language model can be used to compute the
probability of W based on parameters previously estimated from a training set.
Parameters:
Syntactic relations
Topic features
Most commonly, the inventory Σ (also called vocabulary) is the list of unique words
encountered in the training data; however, as we will see in this chapter, selecting
the units over which a language model should be defined can be a rather difficult
problem, particularly in languages other than English.
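As a concrete sketch of parameter estimation over such an inventory, relative-frequency (maximum-likelihood) bigram estimates can be computed as follows (the toy corpus and the <s>/</s> sentence markers are illustrative):

    from collections import Counter

    def estimate_bigram_model(sentences):
        # maximum-likelihood estimates: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
        unigram, bigram = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            unigram.update(tokens[:-1])
            bigram.update(zip(tokens, tokens[1:]))
        return {pair: count / unigram[pair[0]] for pair, count in bigram.items()}

    model = estimate_bigram_model(["john gifted a watch", "john gifted a book"])
    print(model[("gifted", "a")])   # 1.0 in this toy corpus: 'a' always follows 'gifted'

In practice these raw counts are smoothed so that unseen n-grams do not receive zero probability.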
A language model is usually combined with some other model or models that
hypothesize possible word sequences. For example, a speech recognizer combines
acoustic model scores (and possibly other scores, such as pronunciation model
scores) with language model scores to decode spoken word sequences from an
acoustic signal.
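A sketch of the log-linear score combination this describes (the weight and the log-probabilities below are made-up illustrations; real systems tune the language model weight on held-out data):

    def hypothesis_score(log_p_acoustic, log_p_lm, lm_weight=10.0):
        # log-linear combination of acoustic and language model scores
        return log_p_acoustic + lm_weight * log_p_lm

    # two competing word sequences for the same audio (made-up numbers)
    hyp_a = hypothesis_score(log_p_acoustic=-120.0, log_p_lm=-14.0)
    hyp_b = hypothesis_score(log_p_acoustic=-118.0, log_p_lm=-21.0)
    print("A" if hyp_a > hyp_b else "B")   # the language model can flip the ranking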
In several related fields, language models are used that are defined not over words but
over acoustic units or isolated text characters.
Multiple word forms can be derived from a small number of tokens. A morpheme is the
smallest meaning-bearing unit in a language. Morphemes can be either free (i.e., they
can occur on their own) or bound (i.e., they must be combined with some other
morpheme).
Germanic languages, for example, are notorious for their high degree of
compounding, especially for nominals. Agglutinative languages such as Turkish build
words by attaching long sequences of suffixes to a stem. As a result, Turkish has a huge
number of possible words. Many languages have rich inflectional paradigms. In
languages like Finnish and Arabic, a root (base form) may have thousands of different
morphological realizations.
The table shows two Modern Standard Arabic (MSA) inflectional paradigms: one for
present-tense verbal inflections of the root skn (basic meaning 'live'), and one for
pronominal possessive inflections of the root ktb (basic meaning 'book').
Language models determine word probability by analyzing text data. They interpret
this data by feeding it through an algorithm that establishes rules for context in
natural language. Then, the model applies these rules in language tasks to
accurately predict or produce new sentences.
The assumption is that all previous words except for the n − 1 words directly
preceding the current word are irrelevant for predicting the current word, or,
alternatively, that they are equivalent.
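Written out in the same notation as above, the n-gram approximation is

    P(w_i | w1 ... w_{i-1}) ≈ P(w_i | w_{i-n+1} ... w_{i-1})

so only the n − 1 most recent words enter the conditioning context.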
For the sentence “John gifted a watch to his mother”:
bigrams (n = 2): “John gifted”, “gifted a”, “a watch”, “watch to”, “to his”, “his mother”
trigrams (n = 3): “John gifted a”, “gifted a watch”, “a watch to”, “watch to his”, “to his mother”
and so on.
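A short Python sketch that produces exactly these lists (the helper name ngrams is an illustrative choice):

    def ngrams(tokens, n):
        # all contiguous sequences of n words from the token list
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    sentence = "John gifted a watch to his mother".split()
    print(ngrams(sentence, 2))   # the bigrams listed above
    print(ngrams(sentence, 3))   # the trigrams listed above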