
NLP JNTUH Unit 1

The document discusses key linguistic concepts such as irregularity, ambiguity, productivity, and various morphological models. It explains the differences between generative and discriminative sequence classification methods, highlighting their training complexities and performance in tasks like sentence segmentation. Additionally, it presents performance results of different classifiers on speech corpora, demonstrating the effectiveness of various approaches.


Irregularity:

● Irregularity means having forms or structures in language that don't fit the usual
patterns or rules. These are exceptions to the standard ways of forming words or
sentences.
● Irregular verbs are verbs that don't follow the standard rule of adding -d, -ed, or -ied to
form their past simple or past participle forms (e.g., "go" becomes "went" instead of
"goed").
● Examples of irregular verbs include "run" (past: "ran"), "buy" (past: "bought"), and
"take" (past: "took").
● Irregularities can also appear in other parts of language, like noun plurals (e.g., "child"
becomes "children" instead of "childs") or comparative adjectives (e.g., "good" becomes
"better" instead of "gooder").
● Understanding irregularities is important because they often appear in everyday
language, and using them correctly makes speech and writing more natural.
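
In practice, a morphological generator usually handles irregular forms through an exception list that is consulted before the regular rule is applied. A minimal, illustrative Python sketch (the word list and spelling rules below are simplified examples, not a full treatment of English):

```python
# Irregular past-tense forms stored as explicit exceptions.
IRREGULAR_PAST = {
    "go": "went",
    "run": "ran",
    "buy": "bought",
    "take": "took",
}

def past_tense(verb: str) -> str:
    """Return the past tense, checking the irregular exceptions first."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    # Regular rule: add -d, -ed, or -ied depending on the ending.
    if verb.endswith("e"):
        return verb + "d"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"
    return verb + "ed"

print(past_tense("go"))     # went   (irregular)
print(past_tense("walk"))   # walked (regular)
print(past_tense("carry"))  # carried (regular, -ied)
```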

Ambiguity:

● Ambiguity means that something, like a sentence or word, can have two or more
possible meanings. In both speaking and writing, there are two main types of ambiguity:
1. Lexical Ambiguity:
● Lexical ambiguity occurs when a word has more than one meaning, and it is
unclear which meaning is intended in a given context.
● This happens because some words have multiple definitions or uses.
○ Example: "Bank"
■ Could mean a financial institution (where you store money).
■ Or it could refer to the side of a river (a riverbank).
2. Syntactic Ambiguity: the presence of two or more possible meanings within a single
sentence or sequence of words.
● This type of ambiguity happens when the way words are arranged allows for
multiple possible meanings.
○ Example: "The chicken is ready to eat."
■ This could mean that the chicken is prepared and ready for someone to
eat it (the chicken is the food).
■ Or it could mean that the chicken itself is hungry and ready to eat
something.

Issues and Challenges: Ambiguity is a challenge in communication because it can lead to
misunderstandings or confusion.

Linguistic Ambiguity: This happens when language is unclear and can be understood in
different ways. It can make it hard or even impossible for someone (or an AI program) to
figure out the exact meaning without more context or information.
○ Example: "I saw someone on the hill with a telescope." This sentence could
mean you used a telescope to see someone on the hill, or that the person you
saw had a telescope.

Homonyms: These are words that sound alike or are spelled alike but have different meanings or functions.

○ Examples:
■ "Bore" (to drill a hole) and "boar" (a wild pig).
■ "Two" (the number) and "too" (meaning also).

Productivity:

● Productivity in language refers to our unlimited ability to create new sentences and
expressions. This means we can use any language to say things that have never
been said before. It’s also called open-endedness or creativity.
● The term can also refer to specific parts of language, like prefixes or suffixes, that
help us create new words of the same type (e.g., adding "-ness" to "happy" to make
"happiness").
● Productivity is most often talked about in relation to word-formation, which is how we
create new words.
● Humans constantly come up with new ways to express ideas and describe new
things by using their language creatively. This ability, called productivity, allows us to
create an infinite number of sentences.
● Other animals don't have this kind of flexibility in communication. For example, cicadas
have only four signals, and vervet monkeys have 36 vocal calls. They can’t create new
signals to talk about new experiences or events.
● The limitless ability to create and understand completely new sentences is known as
open-endedness.
● Another important part of human creativity is the freedom to respond in any way we
choose. People can say whatever they want in any situation, or they can choose to
say nothing at all.

Morphological Models

• Dictionary Lookup

• Finite-State Morphology

• Unification-Based Morphology

• Functional Morphology
Dictionary Lookup:

● Morphological parsing is a process where word forms in a language are matched
with their corresponding linguistic meanings or structures.
● To analyze a word, systems often look it up directly in word lists, dictionaries, or
databases.
● A dictionary in this context is a data structure designed to quickly provide
precomputed results, like word analyses.
● This data structure can be optimized for fast lookups, making the process efficient.
● The results from these lookups can also be shared across different applications.
● Lookup operations in dictionaries are typically simple and fast.
● Dictionaries can be created using various data structures like lists, binary search
trees, tries, hash tables, etc.
● Efficient Retrieval: Using optimized data structures in dictionaries helps in retrieving
word information quickly, which is crucial for real-time language processing.
● Broad Application: Dictionary-based methods are adaptable and have been
implemented in various linguistic tools and systems for multiple languages.
● Scalability: With the vast availability of data online, dictionaries can be continuously
updated and expanded, ensuring they cover a wide range of word forms and usages.
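
As a rough illustration of dictionary lookup, the sketch below stores precomputed analyses in a Python dict (a hash table), so each lookup is a single fast operation; the entries and the analysis format are invented for this example:

```python
# Precomputed morphological analyses keyed by surface word form (hash-table dictionary).
ANALYSES = {
    "children": [("child", "Noun", "Plural")],
    "took":     [("take", "Verb", "Past")],
    "banks":    [("bank", "Noun", "Plural"), ("bank", "Verb", "Present+3sg")],
}

def lookup(word_form: str):
    """Return all stored analyses for a word form, or an empty list if it is unknown."""
    return ANALYSES.get(word_form.lower(), [])

print(lookup("banks"))  # two analyses: the word form is ambiguous
print(lookup("goed"))   # []  -- not in the dictionary
```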

Finite-State Morphology:

● Finite-state morphological models use specifications written by programmers that are
directly compiled into finite-state transducers (FSTs).
● Two popular tools for this approach are XFST (Xerox Finite-State Tool) and LexTools.

Finite-State Transducers:

● A finite-state transducer is a computational device that extends the functionality of a
finite-state automaton.
● Essentially, an FST is like a finite-state automaton but operates on two (or more)
tapes: it reads input from one tape and writes output to another.
● Think of transducers as "translating machines" that convert input symbols (like
letters or words) into output symbols.
● FSTs are made up of a finite set of nodes connected by directed edges. These
nodes are called states, and the edges are called arcs.
● As you move through the network from the initial states to the final states along these
arcs, the FST reads input symbols and writes corresponding output symbols.
● The sequences that the transducer accepts define the input language, while the
sequences it outputs define the output language.
● Efficiency: FSTs are efficient for processing language because they handle regular
patterns and relations quickly.
● Flexibility: They can be used in various linguistic tasks, like word formation, spelling
correction, and more.
● Widely Used: Due to their power and flexibility, FSTs are commonly used in
computational linguistics and natural language processing.
● Complex Relations: FSTs can model complex relationships between input and output,
making them ideal for tasks like morphological analysis, where one form of a word needs
to be transformed into another.
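
The toy transducer below illustrates the idea of states and arcs: each arc reads one input symbol and writes one output symbol, and an input is accepted only if the path ends in a final state. The transitions (mapping the surface form "cats" to a lemma plus tags) are invented purely for illustration:

```python
# A toy finite-state transducer: states connected by arcs, each arc labelled
# with an (input symbol, output symbol) pair.
class FST:
    def __init__(self, start, finals, arcs):
        self.start = start    # initial state
        self.finals = finals  # set of final (accepting) states
        self.arcs = arcs      # {(state, input_symbol): (next_state, output_symbol)}

    def transduce(self, symbols):
        """Read input symbols and write output symbols; return None if rejected."""
        state, output = self.start, []
        for sym in symbols:
            if (state, sym) not in self.arcs:
                return None
            state, out = self.arcs[(state, sym)]
            output.append(out)
        return "".join(output) if state in self.finals else None

# Maps the surface form "cats" to the analysis "cat+N+Pl".
arcs = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+Pl"),
}
fst = FST(start=0, finals={3, 4}, arcs=arcs)
print(fst.transduce("cats"))  # cat+N+Pl
print(fst.transduce("cat"))   # cat (state 3 is also final, so the bare stem is accepted)
```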

Unification-Based Morphology:

● Unification-based morphology focuses on providing complete grammatical
descriptions of languages, especially within frameworks like head-driven phrase
structure grammar (HPSG).
● Unification is a key process where feature structures are combined to create a more
detailed structure.
● Feature structures can be visualized as directed acyclic graphs (DAGs), where
nodes represent variable values and paths represent variable names.
● These structures are often displayed as attribute-value matrices. For example, an
attribute named "number" might have the value "singular." Attributes can be atomic
(simple values like "singular") or complex (like another feature structure, a list, or a set).
● Unification can fail if the feature structures contain conflicting information.
● Unification can be monotonic, meaning that all information from the original feature
structures is preserved in the result.
● Morphological models based on unification are often formulated as logic programs
and use unification to solve the constraints defined by the model.
● Advantages include better abstraction for developing a morphological grammar and the
elimination of redundant information.
● Unification-based models have been successfully implemented for languages such as
Russian, Czech, Slovene, Persian, Hebrew, and Arabic.
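
A minimal sketch of unification over attribute-value structures represented as Python dicts: the merge keeps all information from both structures and fails when atomic values conflict. Real feature structures are DAGs that can share values, which this toy version does not model:

```python
def unify(fs1, fs2):
    """Unify two feature structures (dicts); return None if they conflict."""
    result = dict(fs1)
    for attr, value in fs2.items():
        if attr not in result:
            result[attr] = value                 # new information is simply added
        elif isinstance(result[attr], dict) and isinstance(value, dict):
            sub = unify(result[attr], value)     # recurse into complex values
            if sub is None:
                return None
            result[attr] = sub
        elif result[attr] != value:
            return None                          # conflicting atomic values: failure
    return result

noun = {"pos": "noun", "agr": {"number": "singular"}}
verb_requires = {"agr": {"number": "singular", "person": "3rd"}}
print(unify(noun, verb_requires))  # merged structure with information from both
print(unify({"number": "singular"}, {"number": "plural"}))  # None (conflict)
```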

Functional Morphology:

● Functional morphology uses principles from functional programming and type
theory.
● It views morphological operations (like word formation) as pure mathematical
functions and organizes these operations into different types and categories.
● This approach isn’t limited to one type of language structure; it’s especially useful for
fusional languages, where one word part (morpheme) expresses multiple grammatical
features.
● Key language concepts like paradigms (patterns), rules, exceptions, grammatical
categories, lexemes (words or word stems), morphemes (smallest units of meaning),
and morphs (specific forms) can all be modeled using functional morphology.
● Functional morphology implementations are designed to be reusable as
programming libraries. These libraries can handle the complete morphological
structure of a language and can be used in various applications.
● A functional morphology model can be turned into finite-state transducers for specific
tasks, or it can be used in a more flexible, interactive way.
● Many functional morphology models are built into general-purpose programming
languages, giving developers the ability to use advanced programming techniques to
create real-world applications.
● Functional morphology models achieve high levels of abstraction.
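
As a loose illustration of the functional view, the sketch below treats inflection as a pure function from a lexeme and grammatical features to a word form, with irregular entries stored in a small exception table; the paradigm and feature names are invented for the example:

```python
# Inflection as a pure function: (lexeme, features) -> word form.
IRREGULAR = {("be", ("Pres", "3sg")): "is", ("be", ("Past", "3sg")): "was"}

def inflect(lexeme: str, tense: str, person: str) -> str:
    """Pure paradigm function for a toy fragment of English verb inflection."""
    key = (lexeme, (tense, person))
    if key in IRREGULAR:
        return IRREGULAR[key]
    if tense == "Pres" and person == "3sg":
        return lexeme + "s"
    if tense == "Past":
        return lexeme + ("d" if lexeme.endswith("e") else "ed")
    return lexeme

print(inflect("walk", "Pres", "3sg"))  # walks
print(inflect("be", "Past", "3sg"))    # was
```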

Generative Sequence Classification Methods:

● Generative Approach: This method focuses on learning about each class (or
category) by understanding how data is generated for that class.
○ Learning Process: It learns the joint probability distribution p(x,y), which
means it tries to model how both the features (x) and the classes (y) are
related.
○ Data Modeling: It models the distribution of data within each class
separately. For instance, it learns what a lion and an elephant look like based on
images from the zoo.
○ Reconstruction: It can generate new samples that are similar to those from the
classes it has learned about. For example, it can generate images of lions and
elephants that resemble the ones seen before.
○ Understanding: It has a deeper understanding of the overall structure of the
data and the relationships between different features.
○ Applications: Useful in scenarios where you need to generate new data,
simulate scenarios, or understand the underlying distribution of the data.
Examples include generative adversarial networks (GANs), hidden
Markov models (HMMs), and Naive Bayes classifiers.
○ Flexibility: Can be used for tasks like data imputation, anomaly detection, and
more because it understands the data generation process.
○ Advantages:
■ Can handle missing data by generating it.
■ Useful for scenarios where understanding the data generation process is
crucial.
○ Disadvantages:
■ Can be more complex to train due to the need to model the entire
distribution.
■ May require more data and computation.
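
A minimal Naive Bayes sketch of the generative idea: it estimates p(y) and p(x|y) from counts, so it effectively models the joint distribution p(x, y), and it classifies by picking the class with the highest joint probability. The toy sentence-boundary data and features are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy training data: feature tuples (pause after word, next word capitalised) and labels.
data = [
    (("long_pause", "cap"),  "boundary"),
    (("long_pause", "cap"),  "boundary"),
    (("short_pause", "cap"), "boundary"),
    (("short_pause", "low"), "no_boundary"),
    (("no_pause", "low"),    "no_boundary"),
    (("no_pause", "cap"),    "no_boundary"),
]

# Estimate p(y) and p(x_i | y) by counting.
label_counts = Counter(y for _, y in data)
feature_counts = defaultdict(Counter)
for x, y in data:
    for i, value in enumerate(x):
        feature_counts[(y, i)][value] += 1

def joint_prob(x, y):
    """Approximate p(x, y) = p(y) * prod_i p(x_i | y), with simple add-one smoothing."""
    p = label_counts[y] / len(data)
    for i, value in enumerate(x):
        counts = feature_counts[(y, i)]
        p *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
    return p

def classify(x):
    return max(label_counts, key=lambda y: joint_prob(x, y))

print(classify(("long_pause", "cap")))  # boundary
print(classify(("no_pause", "low")))    # no_boundary
```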

Discriminative Sequence Classification Methods:

● Discriminative Approach:
○ This method focuses on distinguishing between different classes based on
the features provided.
○ Learning Process: It learns the conditional probability distribution p(y∣x),
which means it tries to model the probability of a class given the features.
○ Feature Differences: It focuses on learning the differences between classes
by directly analyzing features and their relationships. For instance, it
identifies specific features that differentiate a lion from an elephant.
○ Classification: It is primarily used for making classifications or predictions.
For example, it can classify an unknown animal as a lion or elephant based on its
features.
○ Efficiency: Often requires less data to achieve high accuracy in
classification because it focuses on distinguishing features rather than
understanding the entire data distribution.
○ Applications: Commonly used in tasks like image recognition, spam
detection, and speech recognition.
○ Examples include logistic regression and support vector machines (SVMs).
○ Performance: Typically performs better in classification tasks where the primary
goal is to distinguish between categories, rather than understanding how each
category is generated.
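
For contrast, a minimal logistic regression sketch of the discriminative idea: it models p(y|x) directly and learns a decision boundary by gradient descent over numeric features. The toy data mirrors the generative example above and is purely illustrative:

```python
import numpy as np

# Toy features: [pause length in seconds, next word capitalised (0/1)].
X = np.array([[0.9, 1], [0.7, 1], [0.4, 1], [0.3, 0], [0.05, 0], [0.1, 1]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = sentence boundary, 0 = no boundary

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the negative log-likelihood of p(y | x).
for _ in range(2000):
    p = sigmoid(X @ w + b)        # predicted p(y = 1 | x)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

print(np.round(sigmoid(X @ w + b), 2))             # probabilities for the training examples
print(sigmoid(np.array([0.8, 1]) @ w + b) > 0.5)   # True: predicted boundary
```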

Summary:

● Generative Models aim to understand and model how data is generated, providing a
deeper insight into the data distribution and enabling data generation.
● Discriminative Models aim to focus on distinguishing between different categories
based on features, often leading to more accurate and efficient classification in practice.

Complexity of Approaches

1. Complexity in Training and Prediction:
○ Generative Models:
■ Training Complexity: Generally less complex because they focus on
modeling the overall data distribution. Training involves learning the joint
probability distribution p(x,y).
■ Prediction Complexity: Typically faster since the model has learned the
entire data distribution and can use this to generate or classify new data
directly.
■ Performance: Often requires more data and computational resources
but can handle a variety of tasks, including generating new data.
○ Discriminative Models:
■ Training Complexity: More complex because they focus on learning the
boundary between classes. Training involves adjusting feature weights
through multiple passes over the data to optimize classification
performance.
■ Prediction Complexity: Generally slower despite simpler models
because prediction requires evaluating feature weights for each instance.
However, some discriminative models can make predictions quickly once
trained.
■ Performance: Typically performs better on smaller training sets
compared to generative models. They are often more accurate for
classification tasks.
2. Preprocessing:
○ Some algorithms, particularly generative ones, may require preprocessing of
data. This includes converting continuous features into discrete features or
normalizing data to improve performance.
3. Sequence Models:
○ Decoding Complexity: For sequence classification, additional complexity
arises from decoding, which involves finding the best sequence of decisions.
Naively this means evaluating every possible sequence, which is computationally
expensive, so decoders typically use dynamic programming (e.g., the Viterbi
algorithm) to search the space efficiently (see the sketch after this list).
4. Real-World Performance:
○ Generative Models: Perform well when there is ample training data and when
understanding the data generation process is crucial.
○ Discriminative Models: Often excel in real-world classification tasks,
especially with smaller datasets. They are typically more accurate but may
require more sophisticated training processes.
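
Below is a compact sketch of dynamic-programming decoding (the Viterbi algorithm) for a sequence model, referenced in item 3 above: instead of scoring every possible label sequence, it keeps only the best-scoring path into each state at each position. The transition and emission scores are invented toy numbers:

```python
import math

states = ["boundary", "no_boundary"]
# Toy log-probabilities (invented numbers for illustration).
start = {"boundary": math.log(0.3), "no_boundary": math.log(0.7)}
trans = {
    ("boundary", "boundary"): math.log(0.1), ("boundary", "no_boundary"): math.log(0.9),
    ("no_boundary", "boundary"): math.log(0.3), ("no_boundary", "no_boundary"): math.log(0.7),
}
emit = {
    ("boundary", "long_pause"): math.log(0.8), ("boundary", "no_pause"): math.log(0.2),
    ("no_boundary", "long_pause"): math.log(0.1), ("no_boundary", "no_pause"): math.log(0.9),
}

def viterbi(observations):
    """Return the best label sequence without enumerating all |states|^n sequences."""
    scores = [{s: start[s] + emit[(s, observations[0])] for s in states}]
    back = []
    for obs in observations[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[-1][p] + trans[(p, s)])
            col[s] = scores[-1][best_prev] + trans[(best_prev, s)] + emit[(s, obs)]
            ptr[s] = best_prev
        scores.append(col)
        back.append(ptr)
    # Trace the best path backwards from the final column.
    best = max(states, key=lambda s: scores[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["no_pause", "no_pause", "long_pause"]))
```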

Performance of Approaches for Sentence Segmentation in Speech

Evaluation Metrics:
1. Error Rate: Measures the ratio of errors to the total number of examples. Lower
error rates indicate better performance.
2. F1 Measure: The harmonic mean of recall and precision, which balances both
metrics to provide a single performance measure. Higher F1 scores reflect better
performance.
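
A small sketch of how these two metrics are computed from prediction counts (the numbers in the example are invented):

```python
def error_rate(num_errors: int, num_examples: int) -> float:
    """Fraction of examples classified incorrectly (lower is better)."""
    return num_errors / num_examples

def f1(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall (higher is better)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

print(error_rate(num_errors=25, num_examples=1000))  # 0.025
print(round(f1(true_positives=80, false_positives=20, false_negatives=30), 3))  # 0.762
```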

Performance Results:

● Mandarin TDT4 Multilingual Broadcast News Speech Corpus:
○ MaxEnt Classifier: F1 measure of 69.1%
○ Adaboost: F1 measure of 72.6%
○ Support Vector Machines (SVMs): F1 measure of 72.7%
○ Combination of 3 Classifiers Using Logistic Regression: The report suggests
this approach may improve results, though exact F1 measures are not specified.
● Turkish Broadcast News Corpus:
○ HELM: F1 measure of 78.2%
○ fHELM with Morphology Features: F1 measure of 86.2%
○ Adaboost: F1 measure of 86.9%
○ Conditional Random Fields (CRFs): F1 measure of 89.1%
○ Note: HELMs (hidden event language models) were trained on the same
corpus as the other classifiers, highlighting that model performance can vary
based on the type of classifier and the additional features used.
