Unit - 4 NLP - R20
Semantic Interpretation - Semantic & Logical Form, Word Senses & Ambiguity, the
Basic Logical Form Language, Encoding Ambiguity in the Logical Form, Verbs & States
in Logical Form, Thematic Roles, Speech Acts & Embedded Sentences, Defining
Semantic Structure: Model Theory. Language Modelling - Introduction, n-Gram Models,
Language Model Evaluation, Parameter Estimation, Language Model Adaptation,
Types of Language Models, Language-Specific Modelling Problems, Multilingual and
Cross-lingual Language Modelling.
Perplexity can be thought of as the average number of equally likely successor
words when transitioning from one position in the word string to the next. If the model
has no predictive power at all, perplexity is equal to the vocabulary size.
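As a sketch of the standard definition (for a test string W of N words under model probability P):

    PP(W) = P(w1 w2 ... wN)^(-1/N)

If every one of the V vocabulary words were equally likely at each position, then P(W) = (1/V)^N and the perplexity would come out to V, which matches the statement above about the vocabulary size.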
A system can be presented with multiple languages sequentially (e.g., different users
speaking different languages, without advance indication of which language will be
encountered next), or simultaneously, as happens in the case of code switching.
Here, speakers may use several languages or dialects side by side, often within the
same utterance.
Types of Language Models:
N-Gram
Unigram
Bidirectional
Exponential
Continuous Space
N-Gram: In an n-gram model, 'n' defines the size of the gram (or sequence of words
being assigned a probability). Basically, 'n' is the amount of context that the model is
trained to consider. There are different types of N-Gram models such as unigrams,
bigrams, trigrams, etc.
Unigram: The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently.
Unigram models commonly handle language processing tasks such as information
retrieval. The unigram is the foundation of a more specific model variant called the
query likelihood model, which uses information retrieval to examine a pool of
documents and match the most relevant one to a specific query.
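A minimal query-likelihood sketch in Python (the toy documents, the scoring function, and the fixed epsilon used for unseen words are illustrative assumptions, not a full retrieval system):

    from collections import Counter

    def unigram_model(doc_tokens):
        # relative-frequency estimate of P(word) for one document
        counts = Counter(doc_tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def query_likelihood(query_tokens, model, epsilon=1e-6):
        # P(query | document) under the unigram model; epsilon stands in
        # for proper smoothing of words unseen in the document
        p = 1.0
        for w in query_tokens:
            p *= model.get(w, epsilon)
        return p

    docs = {"d1": "the cat sat on the mat".split(),
            "d2": "dogs chase cats in the park".split()}
    models = {d: unigram_model(toks) for d, toks in docs.items()}
    print(max(models, key=lambda d: query_likelihood("cat mat".split(), models[d])))  # d1

The document whose unigram model gives the query the highest probability is returned as the most relevant one.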
Exponential: This type of statistical model evaluates text by using an equation that
combines n-grams and feature functions. Here the features and parameters of the
desired results are already specified. The model is based on the principle of maximum
entropy, which states that the probability distribution with the most entropy is the best
choice. Exponential models make fewer statistical assumptions, which means the
chances of obtaining accurate results are higher.
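A sketch of the usual maximum-entropy (exponential) form, where the feature functions f_i and weights λ_i are generic placeholders and h is the conditioning history:

    P(w | h) = exp( Σ_i λ_i f_i(w, h) ) / Z(h),   with   Z(h) = Σ_w' exp( Σ_i λ_i f_i(w', h) )

The normalizer Z(h) makes the scores over the vocabulary sum to one, so the exponentiated feature weights define a proper probability distribution.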
Continuous space: In this type of statistical model, words are represented as a non-
linear combination of weights in a neural network. The process of assigning a weight to
a word is known as word embedding. This type of model proves helpful in scenarios
where the data set of words continues to become large and include unique words. In
cases where the data set is large and consists of rarely used or unique words, linear
models such as n-grams do not work. This is because, with increasing words, the
possible word sequences increase, and thus the patterns predicting the next word
become weaker.
For example, a model should be able to understand words derived from different
languages.
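A toy, untrained sketch of the idea (the vocabulary, layer sizes, and random weights below are purely illustrative; a real continuous-space model learns these parameters from data):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<s>", "john", "gifted", "a", "watch", "to", "his", "mother"]
    V, d, h = len(vocab), 8, 16          # vocabulary size, embedding size, hidden size
    E = rng.normal(size=(V, d))          # word embeddings: one continuous vector per word
    W1 = rng.normal(size=(2 * d, h))     # hidden layer for a two-word context
    W2 = rng.normal(size=(h, V))         # output layer over the vocabulary

    def next_word_distribution(context_ids):
        x = np.concatenate([E[i] for i in context_ids])   # continuous context representation
        z = np.tanh(x @ W1) @ W2
        p = np.exp(z - z.max())
        return p / p.sum()                                 # softmax over the vocabulary

    p = next_word_distribution([vocab.index("john"), vocab.index("gifted")])
    print(vocab[int(p.argmax())])        # arbitrary here, since the weights are untrained

Because similar words receive similar embedding vectors, the model can generalize to word sequences it has never seen, which is exactly where plain n-gram counts break down.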
In a class-based n-gram model, the statistical model makes the assumption that words
are conditionally independent of other words given the current word class. In other
variants, the current word is conditioned not only on the current word class but also on
the preceding word classes.
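A sketch of the standard class-based bigram factorization (c_i denotes the class of word w_i):

    P(w_i | w_{i-1}) ≈ P(w_i | c_i) · P(c_i | c_{i-1})

The class-to-class and word-given-class probabilities are estimated from data, so a word that was never seen after a particular predecessor can still receive a reasonable probability through its class.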
Several modifications to this basic approach have been developed that aim at
redefining vocabulary units in a data-driven way, resulting in merged units composed
of a variable number of basic units.
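One well-known data-driven scheme of this kind is byte-pair-encoding-style merging (named here only as an illustration; the notes do not commit to a specific algorithm). A minimal sketch:

    from collections import Counter

    def most_frequent_pair(words):
        # count adjacent unit pairs across all words
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        return pairs.most_common(1)[0][0] if pairs else None

    def merge_pair(words, pair):
        # replace every occurrence of the pair with a single merged unit
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        return merged

    corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]
    for _ in range(5):                        # five merge operations
        pair = most_frequent_pair(corpus)
        print("merging", pair)
        corpus = merge_pair(corpus, pair)
    print(corpus)                             # words now consist of variable-length merged units

Repeatedly merging the most frequent pair of units yields vocabulary entries that fall between characters and whole words, which is useful for morphologically rich languages.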
The main difference between multilingual and cross-lingual language modelling is:
Cross-lingual embeddings attempt to ensure that words that mean the same thing in
different languages map to almost the same vector.
Multilingual embeddings only require that the embeddings work well in language A and
work well in language B separately, without any guarantees about interaction
between different languages.
The majority of language modeling research has focused on the English language.
However, speech and language processing technology has been ported to a range
of other languages, some of which have highlighted problems with the standard n-
gram modeling approach and have necessitated modifications to the traditional
language modeling framework. Here, we look at three types of language-specific
problems: morphological complexity, lack of word segmentation, and spoken versus
written languages.
Typically, two criteria are used to evaluate language models: the coverage rate and
the perplexity on a held-out test set that does not form part of the training data.
The coverage rate measures the percentage of n-grams in the test set that are
represented in the language model. A special case of this is the out-of-vocabulary
rate (or OOV rate), which is 100 minus the unigram coverage rate, or, in other words,
the percentage of unique word types not covered by the language model.
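A small Python sketch of the OOV-rate calculation (the toy vocabulary and test sentence are made up for illustration):

    def oov_rate(test_tokens, vocabulary):
        # percentage of unique word types in the test data not covered by the vocabulary
        types = set(test_tokens)
        oov_types = [w for w in types if w not in vocabulary]
        return 100.0 * len(oov_types) / len(types)

    vocabulary = {"the", "cat", "sat", "on", "mat"}
    print(oov_rate("the cat sat on the sofa".split(), vocabulary))   # 20.0, since 'sofa' is unseen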
Given a word sequence W = w1w2 ... wt ∈ Σ∗, a language model can be used to compute the
probability of W based on parameters previously estimated from a training set.
Parameters:
Syntactic relations
Topic features
Most commonly, the inventory Σ (also called vocabulary) is the list of unique words
encountered in the training data; however, as we will see in this chapter, selecting
the units over which a language model should be defined can be a rather difficult
problem, particularly in languages other than English.
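As a concrete sketch of parameter estimation over such an inventory, relative-frequency (maximum-likelihood) bigram estimates can be computed as follows (the toy corpus and the <s>/</s> sentence markers are illustrative):

    from collections import Counter

    def estimate_bigram_model(sentences):
        # maximum-likelihood estimates: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
        unigram, bigram = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            unigram.update(tokens[:-1])
            bigram.update(zip(tokens, tokens[1:]))
        return {pair: count / unigram[pair[0]] for pair, count in bigram.items()}

    model = estimate_bigram_model(["john gifted a watch", "john gifted a book"])
    print(model[("gifted", "a")])   # 1.0 in this toy corpus: 'a' always follows 'gifted'

In practice these raw counts are smoothed so that unseen n-grams do not receive zero probability.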
A language model is usually combined with some other model or models that
hypothesize possible word sequences. For example, a speech recognizer combines
acoustic model scores (and possibly other scores, such as pronunciation model
scores) with language model scores to decode spoken word sequences from an
acoustic signal.
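A sketch of the log-linear score combination this describes (the weight and the log-probabilities below are made-up illustrations; real systems tune the language model weight on held-out data):

    def hypothesis_score(log_p_acoustic, log_p_lm, lm_weight=10.0):
        # log-linear combination of acoustic and language model scores
        return log_p_acoustic + lm_weight * log_p_lm

    # two competing word sequences for the same audio (made-up numbers)
    hyp_a = hypothesis_score(log_p_acoustic=-120.0, log_p_lm=-14.0)
    hyp_b = hypothesis_score(log_p_acoustic=-118.0, log_p_lm=-21.0)
    print("A" if hyp_a > hyp_b else "B")   # the language model can flip the ranking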
In several related fields, language models are used that are defined not over words but
over acoustic units or isolated text characters.
Multiple word forms can be derived from a small number of tokens. A morpheme is the
smallest meaning-bearing unit in a language. Morphemes can be either free (i.e., they
can occur on their own) or bound (i.e., they must be combined with some other
morpheme).
Germanic languages, for example, are notorious for their high degree of
compounding, especially for nominals. Agglutinative languages such as Turkish build
words by attaching long sequences of suffixes to a stem. As a result, Turkish has a huge
number of possible words. Many languages have rich inflectional paradigms. In
languages like Finnish and Arabic, a root (base form) may have thousands of different
morphological realizations.
The table shows two Modern Standard Arabic (MSA) inflectional paradigms: one for
present-tense verbal inflections of the root skn (basic meaning 'live'), and one for
pronominal possessive inflections of the root ktb (basic meaning 'book').
Language models determine word probability by analyzing text data. They interpret
this data by feeding it through an algorithm that establishes rules for context in
natural language. Then, the model applies these rules in language tasks to
accurately predict or produce new sentences.
The assumption is that all previous words except for the n − 1 words directly
preceding the current word are irrelevant for predicting the current word, or,
alternatively, that they are equivalent.
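Written out in the same notation as above, the n-gram approximation is

    P(w_i | w1 ... w_{i-1}) ≈ P(w_i | w_{i-n+1} ... w_{i-1})

so only the n − 1 most recent words enter the conditioning context.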
For the sentence “John gifted a watch to his mother”:
bigrams (n = 2): “John gifted”, “gifted a”, “a watch”, “watch to”, “to his”, “his mother”
trigrams (n = 3): “John gifted a”, “gifted a watch”, “a watch to”, “watch to his”, “to his mother”
and so on.
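A short Python sketch that produces exactly these lists (the helper name ngrams is an illustrative choice):

    def ngrams(tokens, n):
        # all contiguous sequences of n words from the token list
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    sentence = "John gifted a watch to his mother".split()
    print(ngrams(sentence, 2))   # the bigrams listed above
    print(ngrams(sentence, 3))   # the trigrams listed above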