NLP Unit V Notes
Prepared by
K SWAYAMPRABHA
Assistant Professor
UNIT - V
Coreference resolution involves identifying all the expressions in a text that refer to
the same entity. For example, in the sentence "John went to the store. He bought
some bread," the word "he" refers to John. Discourse segmentation involves
identifying the boundaries between different discourse units, such as sentences or
paragraphs.
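As a purely illustrative sketch (not how production coreference systems work), the following Python snippet links each pronoun to the most recent preceding capitalized token; real resolvers rely on learned models and far richer features. The token list and the PRONOUNS set are toy assumptions.

```python
# Naive pronoun resolution sketch: link each pronoun to the most recent
# capitalized token, used here as a crude stand-in for a named entity.
PRONOUNS = {"he", "she", "it", "they"}

def resolve_pronouns(tokens):
    """Return (pronoun_index, antecedent_index) pairs for a token list."""
    links = []
    last_entity = None
    for i, tok in enumerate(tokens):
        if tok.istitle() and tok.lower() not in PRONOUNS:
            last_entity = i                     # remember candidate antecedent
        elif tok.lower() in PRONOUNS and last_entity is not None:
            links.append((i, last_entity))      # link pronoun to antecedent
    return links

tokens = "John went to the store . He bought some bread".split()
print(resolve_pronouns(tokens))  # [(6, 0)] -> "He" refers to "John"
```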
Text coherence is the degree to which a text is logically organized and easy to
understand. It is often evaluated based on how well the text maintains a coherent
topic, how well its parts relate to each other, and how well it uses discourse
markers to signal shifts in topic or perspective.
Text classification involves categorizing texts based on their content. For example,
a news article may be classified as sports, politics, or entertainment. Text
classification is often used in applications such as sentiment analysis, spam
filtering, and topic modeling.
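As a small illustration, the following scikit-learn sketch trains a bag-of-words Naive Bayes classifier on a hypothetical three-document corpus; the texts and labels are invented for the example, and a real classifier would need far more data.

```python
# Toy text classification: bag-of-words features + Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the team won the match in the final minute",   # sports
    "parliament passed the new budget bill",        # politics
    "the actor stars in a new comedy film",         # entertainment
]
train_labels = ["sports", "politics", "entertainment"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the striker scored two goals"]))  # -> ['sports'] on this toy data
```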
Cohesion
Coherence and cohesion are two important concepts in discourse processing that
are essential for understanding the overall meaning of a text. While coherence
refers to the overall clarity and logical organization of a text, cohesion refers to the
specific linguistic devices that writers use to connect the different parts of a text.
Cohesion is the use of linguistic devices, such as conjunctions, reference words, and
lexical repetition, to link different parts of a text together. Cohesion creates a sense
of unity in a text and helps the reader to follow the writer's intended meaning.
Examples of cohesive devices include pronouns (e.g., he, she, it), conjunctions
(e.g., and, but, or), adverbs (e.g., however, therefore), and lexical repetition (e.g.,
repeating the same word or phrase multiple times).
There are several types of reference resolution, including anaphora resolution, where an expression refers back to something mentioned earlier (as with "he" and "John" above), and cataphora resolution, where an expression refers forward to something mentioned later.
Discourse cohesion and structure are two related concepts that play important roles
in creating effective communication and understanding in natural language.
Discourse cohesion refers to how different parts of a text are connected through the
use of linguistic devices, such as pronouns, conjunctions, lexical repetition, and
other cohesive markers. Cohesion creates a sense of unity and coherence in a text,
helping readers to follow the writer's intended meaning and to understand the
relationships between different ideas.
Discourse structure, on the other hand, refers to the larger organization and
arrangement of ideas within a text. It involves how ideas are presented and how
they relate to each other, including the use of headings, subheadings, paragraphs,
and other structural devices. Discourse structure helps readers to navigate a text
and to understand its overall organization, which can also contribute to its
coherence and clarity.
Effective discourse cohesion and structure are important for creating clear and
coherent communication in both written and spoken language. When a text is well-
structured and cohesive, readers or listeners are more likely to understand and
remember the content. Discourse cohesion and structure are also important in
many natural language processing tasks, such as summarization, question-
answering, and text classification, where understanding the relationships between
ideas and the overall organization of a text is essential.
n-Gram Models
n-gram models are statistical language models used in natural language processing
and computational linguistics. They are based on the idea of predicting the
probability of a word given the preceding n-1 words in a text.
n-gram models are based on the assumption that the probability of a word depends
only on the preceding n-1 words, which is known as the Markov assumption. They
are trained on a large corpus of text data and estimate the probability of a word
given its context using maximum likelihood estimation or other statistical methods.
n-gram models are used in a wide range of natural language processing tasks, such
as speech recognition, machine translation, and text classification. They are often
used as a baseline model for comparison with other more complex language
models.
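A minimal sketch of a bigram (n = 2) model with maximum-likelihood estimates on a toy corpus: the probability of a word is the count of the bigram divided by the count of the preceding word, P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).

```python
# Bigram language model with maximum-likelihood estimates.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])   # counts of the history word

def bigram_prob(prev, word):
    """MLE probability of `word` given the previous word."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```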
Parameter Estimation
1. Maximum-Likelihood Estimation and Smoothing
2. Bayesian Parameter Estimation
3. Large-Scale Language Models
In NLP, models are typically trained on a large corpus of annotated data, and the
objective is to estimate the values of the model parameters that maximize the
likelihood of the observed data. The most commonly used method for parameter
estimation is maximum likelihood estimation (MLE), which involves finding the set
of parameters that maximizes the probability of the observed data. Other methods
for parameter estimation include Bayesian estimation, which involves finding the
posterior distribution of the parameters given the data, and empirical Bayes, which
involves using a hierarchical model to estimate the parameters.
MLE involves finding the values of the model parameters that maximize the
likelihood of the observed data. The likelihood function measures the probability of
observing the data given the model parameters, and the goal of MLE is to find the
parameter values that make this probability as high as possible. The maximum-
likelihood estimate is the set of parameter values that maximizes the likelihood
function.
In practice, MLE can be difficult to apply directly to NLP tasks, as the likelihood
function may be complex and high-dimensional. One common approach is to use
smoothing techniques to estimate the probabilities of unseen events, which can
improve the accuracy of the model and reduce overfitting.
Smoothing techniques are important for handling the problem of data sparsity,
which occurs when the training data contains few or no examples of certain events
or combinations of events. By smoothing the probability estimates, the model can
make reasonable predictions for unseen events and reduce the impact of noisy or
incomplete data.
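A minimal sketch of add-one (Laplace) smoothing for bigram probabilities on a toy corpus: every bigram, seen or unseen, receives a non-zero probability, which directly addresses the sparsity problem described above.

```python
# Add-one (Laplace) smoothed bigram probabilities.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])
vocab = set(corpus)

def smoothed_prob(prev, word):
    """P(word | prev) with one pseudo-count added to every bigram."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

print(smoothed_prob("the", "cat"))  # seen bigram: (2 + 1) / (3 + 6)
print(smoothed_prob("the", "dog"))  # unseen bigram: still non-zero, (0 + 1) / (3 + 6)
```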
Bayesian parameter estimation treats the model parameters as random variables: a prior distribution over the parameters is combined with the likelihood of the observed data to produce a posterior distribution. The choice of prior distribution can have a significant impact on the posterior distribution and the resulting parameter estimates. A common approach is to use a conjugate prior, chosen so that the posterior has the same functional form as the prior, which allows for convenient mathematical analysis. For example, if the likelihood is Gaussian with known variance, the conjugate prior for the mean is another Gaussian distribution.
Bayesian parameter estimation offers several advantages over MLE. One advantage
is that it allows for the incorporation of prior knowledge or beliefs about the
parameters, which can help reduce the impact of noisy or incomplete data. Another
advantage is that it provides a probabilistic framework for uncertainty
quantification, allowing for the calculation of confidence intervals and credible
intervals for the parameter estimates.
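A small worked example of Bayesian estimation for unigram word probabilities, assuming a symmetric Dirichlet prior (the conjugate prior for the multinomial likelihood); under this assumption the posterior mean simply adds the prior pseudo-count alpha to each observed count.

```python
# Posterior-mean word probabilities under a symmetric Dirichlet(alpha) prior:
# P(w) = (count(w) + alpha) / (N + alpha * V)
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
vocab = sorted(set(corpus))
alpha = 0.5                      # prior pseudo-count; a modelling choice
N, V = len(corpus), len(vocab)

posterior_mean = {w: (counts[w] + alpha) / (N + alpha * V) for w in vocab}
print(posterior_mean["the"])     # (2 + 0.5) / (6 + 2.5) ≈ 0.294
```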
One of the key challenges in training large-scale language models is handling the
sheer amount of data and computational resources required. Training these models
can require weeks or months of computing time on powerful hardware, and the
resulting models can have billions of parameters. As a result, large-scale language
models are typically trained on specialized hardware such as graphics processing
units (GPUs) or tensor processing units (TPUs).
Another challenge with large-scale language models is managing the biases and
ethical implications of the generated language. These models learn from the
patterns in the data they are trained on, which can include biases and stereotypes
present in the training data. Additionally, the ability of these models to generate
convincing language raises concerns about the potential misuse of the technology,
such as the spread of misinformation or the creation of fake news.
Language model adaptation, in which a model trained on general data is further trained (fine-tuned) on task- or domain-specific data, has been applied successfully in a wide range of NLP tasks, including sentiment analysis, text classification, named entity recognition, and machine translation. However, it does require a small amount of task-specific data, which may not always be available or representative of the target domain.
Each type of language model has its own strengths and weaknesses, and the choice
of model will depend on the specific task and domain being considered. N-gram
models and neural network models are the most widely used types of language
models due to their simplicity and effectiveness, while transformer-based models
are rapidly gaining popularity due to their ability to capture complex dependencies
between words.
Class-based language models reduce data sparsity by grouping words into equivalence classes; building one typically involves four steps (a minimal sketch follows this list):
1. Word clustering: The first step is to cluster words based on their distributional similarity. This can be done using unsupervised clustering algorithms such as k-means clustering or hierarchical clustering.
2. Class construction: After clustering, each cluster is assigned a class label. The
number of classes can be predefined or determined automatically based on the size
of the training corpus and the desired level of granularity.
3. Probability estimation: Once the classes are constructed, the probability of a word
given its class is estimated using a variety of techniques, such as maximum
likelihood estimation or Bayesian estimation.
4. Language modeling: The final step is to use the estimated probabilities to build a
language model that can predict the probability of a sequence of words.
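The following sketch walks through these four steps on a toy corpus, clustering words by their neighbouring-word counts with scikit-learn's KMeans; the corpus, the feature choice, and the number of classes are all illustrative assumptions rather than a production recipe.

```python
# Minimal class-based bigram model sketch.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Step 1: distributional features = counts of left/right neighbouring words.
features = np.zeros((len(vocab), 2 * len(vocab)))
for prev, w in zip(corpus, corpus[1:]):
    features[idx[w], idx[prev]] += 1                # left neighbour of w
    features[idx[prev], len(vocab) + idx[w]] += 1   # right neighbour of prev

# Step 2: cluster words and assign each a class label.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
word_class = {w: int(labels[idx[w]]) for w in vocab}

# Step 3: MLE estimates of class transition and word emission probabilities.
class_bigrams = Counter(
    (word_class[a], word_class[b]) for a, b in zip(corpus, corpus[1:])
)
class_hist = Counter(word_class[w] for w in corpus[:-1])  # history class counts
class_all = Counter(word_class[w] for w in corpus)
word_counts = Counter(corpus)

# Step 4: score a bigram as P(class(w) | class(prev)) * P(w | class(w)).
def class_bigram_prob(prev_word, word):
    c_prev, c_w = word_class[prev_word], word_class[word]
    p_class = class_bigrams[(c_prev, c_w)] / class_hist[c_prev]
    p_word = word_counts[word] / class_all[c_w]
    return p_class * p_word

print(class_bigram_prob("the", "cat"))
```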
However, class-based models also have some limitations, such as the need for a
large training corpus to build accurate word clusters and the potential loss of some
information due to the grouping of words into classes.
Overall, class-based language models are a useful tool for reducing the sparsity
problem in language modeling and improving the accuracy of language models,
particularly in cases where data is limited or out-of-vocabulary words are common.
The main advantage of variable-length language models is that they can handle
input sequences of any length, which is particularly useful for tasks such as
machine translation or summarization, where the length of the input or output can
vary greatly.
The goal of discriminative models is to learn a mapping from the input to the
output, given a training dataset. Discriminative models can be used for a variety of
tasks, such as text classification, sequence labeling, and machine translation.
Syntax-based language models can be used for a variety of tasks, such as text
generation, machine translation, and question answering. They can also be
evaluated using standard metrics, such as perplexity or BLEU score, although the
evaluation is often more complex due to the additional syntactic information.
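For reference, perplexity is the exponential of the average negative log-probability that the model assigns to each token of a held-out text (lower is better). A quick sketch with hypothetical per-token probabilities:

```python
# Perplexity = exp(-(1/N) * sum(log p_i)) over the per-token probabilities.
import math

token_probs = [0.25, 0.10, 0.50, 0.05]   # hypothetical model probabilities
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(perplexity)                         # roughly 6.3 for these values
```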
Maximum-entropy (MaxEnt) models can be used to model both local and global context, and can incorporate various types of features, such as word identity, part-of-speech, and syntactic information. The model is trained on a corpus of text by estimating its parameters with an optimization algorithm, such as gradient descent.
MaxEnt language models have been used for a variety of NLP tasks, including part-
of-speech tagging, named entity recognition, and sentiment analysis. They have
been shown to perform well on tasks that require the modeling of complex
interactions between different types of linguistic features.
MaxEnt models have some advantages over other types of language models, such
as the ability to incorporate diverse feature sets and the ability to handle sparse
data. However, they can be computationally expensive and require careful selection
of features and regularization parameters to prevent overfitting.
Overall, MaxEnt language models are a useful tool for NLP tasks that involve rich, overlapping sets of linguistic features.
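Since a MaxEnt classifier is equivalent to multinomial logistic regression, a minimal sketch can be built with scikit-learn; the part-of-speech-style features and labels below are hypothetical toy data, not a real tagging setup.

```python
# MaxEnt (multinomial logistic regression) over hand-crafted feature dictionaries.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One feature dict per token, with its (toy) POS label.
train_features = [
    {"word": "dog",  "prev": "the", "suffix2": "og"},
    {"word": "runs", "prev": "dog", "suffix2": "ns"},
    {"word": "cat",  "prev": "the", "suffix2": "at"},
    {"word": "eats", "prev": "cat", "suffix2": "ts"},
]
train_labels = ["NOUN", "VERB", "NOUN", "VERB"]

# DictVectorizer one-hot encodes the features; LogisticRegression fits the MaxEnt model.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_features, train_labels)

print(model.predict([{"word": "bird", "prev": "the", "suffix2": "rd"}]))  # -> ['NOUN']
```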
Factored language models, which represent each word as a bundle of factors such as its surface form, lemma, part of speech, and morphological features, have been used for a variety of NLP tasks, including machine translation, speech recognition, and information retrieval. They have been shown to outperform traditional language models in many cases, especially when dealing with complex or noisy linguistic data.
Several kinds of language models make use of tree structures:
1. Semantic Role Labeling (SRL) Language Models: SRL models identify the semantic role played by each constituent in a sentence, such as the agent, the patient, or the instrument of the action described by the verb. These models use syntactic and semantic information to create a tree structure that represents the relationship between words and their roles.
2. Discourse Parsing Language Models: Discourse parsing models are used to analyze
the structure and organization of a discourse, such as the relationships between
sentences and paragraphs. These models use tree structures to represent the
discourse structure, and can be used for tasks such as summarization and
information extraction.
3. Dependency Parsing Language Models: Dependency parsing models are used to identify the grammatical relationships between words in a sentence, such as subject-verb and verb-object relationships. These models use a tree structure to represent the dependencies between words, and can be used for tasks such as machine translation and sentiment analysis (see the parsing sketch after this list).
4. Constituent Parsing Language Models: Constituent parsing models are used to
identify the constituent structures of a sentence, such as phrases and clauses.
These models use tree structures to represent the hierarchical structure of a
sentence, and can be used for tasks such as text generation and summarization.
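As a concrete example of the dependency relations described in item 3, the following sketch uses spaCy's pretrained English pipeline (assuming the en_core_web_sm model has been downloaded) to print each token's syntactic head and dependency label.

```python
# Dependency parsing with spaCy.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John bought some bread at the store.")

# Each token is linked to its head with a labelled dependency, forming a tree.
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")
# e.g. "John --nsubj--> bought", "bread --dobj--> bought"
```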
The basic idea behind topic models is that a document is a mixture of several latent
topics, and each word in the document is generated by one of these topics. The
model tries to learn the distribution of these topics from the corpus, and uses this
information to predict the probability distribution of words in each document.
One of the most popular Bayesian topic-based language models is Latent Dirichlet
Allocation (LDA). LDA assumes that the corpus is generated by a mixture of latent
topics, and each topic is a probability distribution over the words in the corpus. The
model uses a Dirichlet prior over the topic distributions, which encourages sparsity
and prevents overfitting.
LDA has been used for a variety of NLP tasks, including text classification,
information retrieval, and topic modeling. It has been shown to be effective in
uncovering hidden themes and patterns in large corpora of text, and can be used to
identify key topics and concepts in a document.
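A minimal LDA sketch using scikit-learn's LatentDirichletAllocation on a hypothetical four-document corpus; with such a tiny corpus the learned topics are only suggestive, but the workflow (count matrix, fit, inspect per-topic words) is the same at scale.

```python
# Latent Dirichlet Allocation on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the election results surprised the parliament",
    "the striker scored a late goal",
    "the senate debated the new tax bill",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)          # LDA works on word-count matrices

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)           # per-document topic mixtures

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")                   # top words per topic (noisy on toy data)
```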
The basic idea behind neural network language models is to learn a distributed
representation of words, where each word is represented as a vector in a high-
dimensional space. These representations capture the semantic and syntactic
relationships between words, and can be used to predict the probability distribution
of the next word in a sequence.
One of the most popular types of neural network language models is the recurrent
neural network (RNN) language model, which uses a type of neural network that is
designed to handle sequential data. RNNs have a hidden state that captures the
context of the previous words in the sequence, and this context is used to predict
the probability distribution of the next word.
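A minimal PyTorch sketch of such an RNN language model: an embedding layer, an LSTM whose hidden state summarises the left context, and a linear layer producing next-word logits. The vocabulary size and dimensions are arbitrary illustrative choices, and training code is omitted.

```python
# Minimal recurrent language model (illustrative sketch only).
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        embedded = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        hidden, _ = self.rnn(embedded)          # hidden state summarises left context
        return self.out(hidden)                 # next-word logits at every position

model = RNNLanguageModel(vocab_size=1000)
dummy_batch = torch.randint(0, 1000, (2, 10))   # 2 sequences of 10 token ids
print(model(dummy_batch).shape)                 # torch.Size([2, 10, 1000])
```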
Another popular type of neural network language model is the transformer model,
which uses self-attention to model the relationships between words in a sequence.
Transformer models have become increasingly popular in recent years, and have
been used to achieve state-of-the-art performance on a variety of NLP tasks,
including language modeling, machine translation, and text classification.
One major challenge in building language models for specific languages is data
availability. Many languages do not have large corpora of text that are suitable for
training language models, which can make it difficult to build models that are
accurate and robust. In addition, even when data is available, it may be difficult to
obtain high-quality annotations, such as part-of-speech tags or syntactic parses.
Finally, semantics, or the meaning of words and sentences, can also pose
challenges for language modeling. Different languages may have different ways of
expressing the same concept, or may have words that have multiple meanings
depending on context. This can make it difficult to build models that accurately
capture the meaning of sentences and phrases.
A common way to deal with large vocabularies and rich morphology is to model subword units rather than whole words. Common approaches include:
1. Character n-grams: One common approach is to use character n-grams, which are sequences of n characters within a word. For example, the word "language" can be represented as a set of character 3-grams: {"lan", "ang", "ngu", "gua", "uag", "age"} (a small extraction sketch follows this list). This approach can be effective for capturing the morphology of words, as well as for handling out-of-vocabulary (OOV) words.
2. Morphemes: Another approach is to use morphemes, which are the smallest units
of meaning within a word. For example, the word "languages" can be broken down
into the morphemes "language" and "-s", indicating plural. This approach can be
effective for capturing the morphology and semantics of words, but can require
more computational resources for segmentation and analysis.
3. Hybrid approaches: Some approaches combine character n-grams and morphemes
to create hybrid subword units. For example, the word "languages" could be
represented as a set of hybrid subword units: {"lan", "ang", "ngu", "gua", "uag",
"age", "es"}, where the "-s" morpheme is represented separately. This approach
can be effective for capturing both morphology and OOV words.
4. Word pieces: Another approach is to use a learned vocabulary of "word pieces",
which are variable-length subword units that are learned during training. This
approach, used by models such as BERT and GPT, can be effective for capturing
complex morphology and semantics, and can also handle OOV words.
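A small sketch of character n-gram extraction, reproducing the "language" example from item 1:

```python
# Extract overlapping character n-grams from a word.
def char_ngrams(word, n=3):
    """Return the list of character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("language"))   # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```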
Word segmentation is especially challenging for languages written without spaces between words, for example (a simple segmentation sketch follows this list):
1. Chinese: In Chinese, there are no spaces between words, and written text consists of a sequence of characters. This makes it difficult to determine where one word ends and the next one begins, especially since the same character sequence can often be segmented into words in more than one way depending on the context.
2. Japanese: Japanese has a writing system consisting of three scripts: kanji (Chinese characters), hiragana, and katakana. Kanji characters can have multiple readings, hiragana is used mainly for grammatical particles and inflections, and katakana is used mainly for loanwords. There are no spaces between words, and the mix of kanji, hiragana, and katakana can vary depending on the context.
3. Thai: Thai is a tonal language written without spaces between words. Spaces are used mainly to separate phrases or sentences rather than words, so word boundaries must be inferred from the context, which makes segmentation difficult.
4. Khmer: Khmer is the official language of Cambodia and is also written without spaces between words. Spaces and punctuation marks such as the khan (។) signal phrase and sentence boundaries rather than word boundaries, and compounds are often written as a single unbroken string, so word boundaries must be inferred from the context.
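A minimal sketch of greedy forward maximum matching, a classic dictionary-based segmentation baseline for such languages; the dictionary and input string are toy assumptions, and real segmenters typically use statistical or neural models.

```python
# Greedy forward maximum matching: repeatedly take the longest dictionary match.
def max_match(text, dictionary, max_word_len=4):
    """Segment `text` using the longest-match-first heuristic."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)   # single characters are always accepted
                i += length
                break
    return words

dictionary = {"北京", "大学", "北京大学", "生活"}
print(max_match("北京大学生活", dictionary))   # ['北京大学', '生活']
```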
Spoken language differs from written language in several ways that matter for language modeling:
1. Vocabulary: Spoken language tends to have a more limited vocabulary than written language. This is because spoken language is often more informal and less precise, relying on context and gestures to convey meaning. Written language, on the other hand, tends to be more formal and precise, with a wider range of vocabulary.
2. Grammar: Spoken language is often less strict in terms of grammar and syntax,
with more reliance on intonation and gestures to convey meaning. Written
language, on the other hand, tends to follow more rigid grammatical rules and
conventions.
3. Context: Spoken language is often dependent on context and situational cues, such
as facial expressions and body language, to convey meaning. Written language, on
the other hand, is often self-contained and can be read and understood without
relying on external context.
4. Disfluencies: Spoken language often contains disfluencies, such as pauses,
repetitions, and filler words like "um" and "uh." These are less common in written
language, which is typically more polished and edited.
5. Acoustic Characteristics: Spoken language has a unique set of acoustic
characteristics, including pitch, volume, and timing, that are not present in written
language. These characteristics can be used to help identify speakers and
differentiate between different types of speech, such as questions, statements, and
commands.
Multilingual language modeling refers to training a single model on data from several languages so that it can process input in any of them. Crosslingual language modeling, on the other hand, refers to the task of training a language model on data from one language and using it to process input in another language. The goal is to create a model that can transfer knowledge from one language to another, even if the languages are unrelated. This can be useful for tasks such as crosslingual document classification, where the model needs to be able to classify documents written in different languages.
There are several challenges associated with multilingual and crosslingual language
modeling, including:
1. Vocabulary size: Different languages have different vocabularies, which can make it
challenging to train a model that can handle input from multiple languages.
2. Grammatical structure: Different languages have different grammatical structures,
which can make it challenging to create a model that can handle input from
multiple languages.
3. Data availability: It can be challenging to find enough training data for all the
languages of interest.
Multilingual and crosslingual language modeling are active areas of research, with
many potential applications in machine translation, crosslingual information
retrieval, and other areas.
Another approach is to use a shared embedding space for the different languages.
In this approach, the embeddings for words in different languages are learned
jointly, allowing the model to transfer knowledge across languages. This approach
has been shown to be effective for low-resource languages.
Another approach is to use parallel corpora, which are pairs of texts in two different
languages that have been aligned sentence-by-sentence. These parallel corpora can
be used to train models that can map sentences in one language to sentences in
another language, which can be used for tasks like machine translation.