Module-1

Notes

Introduction to NLP, NLP issues and strategies

What is Natural Language Processing (NLP)?


Natural Language Processing (NLP) is a fascinating field that sits at the crossroads of
linguistics, computer science, and artificial intelligence (AI). At its core, NLP is
concerned with enabling computers to understand, interpret, and generate human
language in a way that is both smart and useful.

In simpler terms, NLP allows computers to “read” and “understand” text or speech, much like humans do. It equips machines with the ability to process large amounts of natural language data, extract relevant information, and perform tasks ranging from language translation to sentiment analysis.

OR

1. What is Natural Language Processing (NLP)?

• Natural language processing is a field at the intersection of
  • computer science
  • artificial intelligence
  • and linguistics.

• Goal: for computers to process or “understand” natural language in order to perform tasks that are useful, e.g.,
  • performing tasks, like making appointments or buying things
  • language translation
  • question answering
  • Siri, Google Assistant, Facebook M, Cortana …

• Fully understanding and representing the meaning of language (or even defining it) is a difficult goal.

• Perfect language understanding is AI-complete.

What’s special about human language?
How Does Natural Language Processing (NLP) Work?
NLP systems rely on a combination of linguistic rules, statistical models, and machine
learning algorithms to process and understand human language.

Here’s a simplified overview of how NLP works.

Text Preprocessing

Before any analysis can take place, raw text data is preprocessed to remove noise,
tokenize sentences into words, and convert them into a format suitable for analysis.
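As a minimal sketch of this step in Python (assuming the NLTK library is installed and its 'punkt' tokenizer data has been downloaded; the example text is illustrative):

    import string
    import nltk

    raw = "NLP lets computers 'read' text!  It powers translation, search, and more."
    # simple noise removal: lowercase and strip punctuation
    cleaned = raw.lower().translate(str.maketrans("", "", string.punctuation))
    # split the cleaned text into word tokens
    tokens = nltk.word_tokenize(cleaned)
    print(tokens)
    # ['nlp', 'lets', 'computers', 'read', 'text', 'it', 'powers', 'translation', 'search', 'and', 'more']
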
Feature Extraction

NLP models extract features from the preprocessed text, such as word frequencies,
syntactic patterns, or semantic representations, which are then used as input for further
analysis.

Machine Learning

Many NLP tasks involve training machine learning models on labeled datasets to learn
patterns and relationships in the data. These models are then used to make predictions
or perform tasks such as classification, translation, or summarization.
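For instance, a tiny supervised text-classification sketch with scikit-learn (the labels and example sentences are invented for illustration; real systems train on much larger labeled datasets):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # toy labeled dataset
    texts  = ["great product, works well", "terrible support, very slow",
              "absolutely love it", "waste of money"]
    labels = ["pos", "neg", "pos", "neg"]

    # vectorize the text and train a classifier in one pipeline
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)

    print(clf.predict(["the product is great"]))   # expected: ['pos']
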

Evaluation and Optimization

NLP systems are continuously evaluated and optimized to improve their performance.
This may involve fine-tuning model parameters, incorporating new data, or
developing more sophisticated algorithms.

Challenges in Natural Language Processing (NLP)


Natural Language Processing (NLP) has witnessed remarkable advancements in
recent years, but it still faces several challenges that hinder its full potential.

Ambiguity and Polysemy

One of the fundamental challenges in NLP is dealing with the ambiguity and
polysemy inherent in natural language. Words often have multiple meanings
depending on context, making it challenging for NLP systems to accurately interpret
and understand text.

Data Sparsity and Quality

NLP models require large amounts of annotated data for training, but obtaining high-
quality labeled data can be challenging. Furthermore, data sparsity and inconsistency
pose significant hurdles in building robust NLP systems, leading to suboptimal
performance in real-world applications.

Context and Understanding

Understanding context is crucial for NLP tasks such as sentiment analysis, summarization, and language translation. However, capturing and representing context accurately remains a challenging task, especially in complex linguistic environments.

Multilingualism and Language Variations


NLP systems must be able to handle multiple languages and dialects to cater to
diverse user populations. However, language variations, slang, and dialectical
differences pose challenges in developing universal NLP solutions that work
effectively across different linguistic contexts.

Lack of Domain-Specific Data

Many Natural Language Processing applications require domain-specific knowledge and terminology, but obtaining labeled data for specialized domains can be difficult. This lack of domain-specific data limits the performance of NLP systems in specialized domains such as healthcare, legal, and finance.

Semantic Understanding and Reasoning

NLP systems often struggle with semantic understanding and reasoning, especially in
tasks that require inferencing or commonsense reasoning. Capturing the subtle
nuances of human language and making accurate logical deductions remain significant
challenges in NLP research.

Handling Noise and Uncertainty

Natural language data is often noisy and ambiguous, containing errors, misspellings,
and grammatical inconsistencies. NLP systems must be robust enough to handle such
noise and uncertainty while maintaining accuracy and reliability in their outputs.

Ethical and Bias Concerns

NLP models can inadvertently perpetuate biases present in the training data, leading to
unfair or discriminatory outcomes. Addressing ethical concerns and mitigating biases
in NLP systems is crucial to ensuring fairness and equity in their applications.

Scalability and Performance

Scalability is a critical challenge in NLP, particularly with the increasing complexity and size of language models. Building scalable NLP solutions that can handle large datasets and complex computations while maintaining high performance remains a daunting task.

Interdisciplinary Collaboration

NLP research requires collaboration across multiple disciplines, including linguistics, computer science, cognitive psychology, and domain-specific expertise. Bridging the gap between these disciplines and fostering interdisciplinary collaboration is essential for advancing the field of NLP and addressing NLP challenges effectively.

Challenges in NLP and their solutions at a glance:

Challenges in NLP                  Solutions
Ambiguity and Polysemy             Contextual embeddings capture contextual meaning
Data Sparsity and Quality          Semi-supervised learning leverages unlabeled data
Context and Understanding          Advanced language models reason about context
Multilingualism and Variations     Multilingual NLP research addresses language variations
Lack of Domain-Specific Data       Domain adaptation adapts models to specific domains
Semantic Understanding             Hybrid approaches combine symbolic reasoning with statistical learning
Noise and Uncertainty              Robust preprocessing filters out noise
Ethical and Bias Concerns          Fairness-aware training mitigates biases
Scalability and Performance        Distributed computing scales NLP systems
Interdisciplinary Collaboration    Collaboration fosters innovation

Solutions to Challenges in Natural Language Processing (NLP)
While Natural Language Processing (NLP) faces several challenges, researchers and
practitioners have devised innovative solutions to overcome these hurdles.

Let’s explore some key strategies for addressing the top 10 challenges of NLP.

Ambiguity and Polysemy

 Utilize contextual embeddings and deep learning techniques to capture the contextual meaning of words (see the sketch after this list)
 Incorporate knowledge graphs and semantic ontologies to disambiguate ambiguous terms based on their relationships with other concepts
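A hedged sketch of the first idea, using the Hugging Face transformers library to compare contextual vectors of the same word in two sentences (assumes transformers, PyTorch, and the bert-base-uncased weights are available; this is an illustration, not the only way to do it):

    from transformers import AutoTokenizer, AutoModel
    import torch

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def vector_for(word, sentence):
        # encode the sentence and locate the position of `word` among the wordpieces
        enc = tok(sentence, return_tensors="pt")
        idx = enc["input_ids"][0].tolist().index(tok.convert_tokens_to_ids(word))
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # one vector per wordpiece
        return hidden[idx]

    v_money = vector_for("bank", "I deposited money at the bank.")
    v_river = vector_for("bank", "We sat on the bank of the river.")
    print(float(torch.cosine_similarity(v_money, v_river, dim=0)))
    # noticeably below 1.0: the same word gets different vectors in different contexts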

Data Sparsity and Quality

 Implement semi-supervised and transfer learning approaches to leverage unlabeled data and pre-trained models
 Employ data augmentation techniques to generate synthetic data and improve dataset diversity
 Collaborate with domain experts to curate high-quality labeled datasets tailored to specific applications

Context and Understanding

 Develop sophisticated language models that can capture and reason about context at multiple levels of abstraction
 Explore dynamic context-aware models that adaptively adjust their representations based on contextual cues

Multilingualism and Language Variations

 Invest in multilingual NLP research to develop models that can handle diverse languages and dialects effectively
 Incorporate cross-lingual transfer learning techniques to transfer knowledge from resource-rich languages to resource-poor languages
 Collaborate with linguists and language experts to address linguistic variations and cultural nuances in different languages

Lack of Domain-Specific Data

 Explore domain adaptation and fine-tuning techniques to transfer knowledge from general domains to specialized domains
 Foster collaborations with industry partners and domain experts to collect and annotate domain-specific datasets
 Investigate unsupervised and weakly supervised learning methods to leverage unannotated or partially labeled domain-specific data

Semantic Understanding and Reasoning

 Develop hybrid approaches that combine symbolic reasoning with statistical learning for a more robust semantic understanding
 Explore commonsense knowledge bases and reasoning engines to incorporate background knowledge into NLP models
 Investigate neural-symbolic methods that integrate neural networks with symbolic reasoning engines for enhanced semantic reasoning

Handling Noise and Uncertainty


 Develop robust preprocessing techniques to filter out noise and handle data inconsistencies
 Investigate uncertainty-aware models that can quantify and propagate uncertainty through the entire NLP pipeline
 Utilize ensemble methods and model ensembling techniques to improve robustness and reliability in the face of noisy input data

Ethical and Bias Concerns

 Implement fairness-aware training procedures and bias mitigation techniques to identify and mitigate biases in NLP models
 Foster diversity and inclusivity in dataset collection and annotation processes to reduce bias propagation
 Promote transparency and accountability in NLP research by openly documenting biases and limitations in model performance

Scalability and Performance

 Invest in distributed computing infrastructure and parallel processing techniques to scale NLP systems to large datasets
 Explore model compression and quantization methods to reduce the computational overhead of deploying large-scale NLP models
 Develop efficient algorithms and data structures optimized for specific NLP tasks to improve performance and resource utilization

Interdisciplinary Collaboration

 Foster interdisciplinary collaboration between researchers, practitioners, and stakeholders from diverse fields, such as linguistics, computer science, psychology, and domain-specific expertise
 Establish collaborative research initiatives and funding opportunities to facilitate knowledge exchange and interdisciplinary research in NLP
 Promote open-source development and community-driven initiatives to encourage collaboration and knowledge sharing across different disciplines

NLP Application:

Study from PPT


Tools of NLP

These tools are used across various stages of NLP processing:

1. Tokenizers

 Break text into words, sentences, or subwords.


 Examples: nltk.word_tokenize(), SpaCy, Hugging Face Tokenizers.

2. Part-of-Speech (POS) Taggers

 Label each word with its grammatical role (noun, verb, etc.).
 Tools: SpaCy, NLTK, Stanford POS Tagger.
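A small POS-tagging sketch with NLTK (assuming the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded; the tags shown are the typical output):

    import nltk

    tokens = nltk.word_tokenize("The dog chased the cat")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('dog', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('cat', 'NN')]
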

3. Named Entity Recognition (NER) Tools

 Identify names, places, dates, etc.


 Tools: SpaCy, Flair, AllenNLP.
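A short NER sketch with spaCy (assuming the en_core_web_sm model has been installed via "python -m spacy download en_core_web_sm"; the sentence is illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple opened a new office in Paris in March 2024.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # typically: Apple ORG / Paris GPE / March 2024 DATE
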

4. Parsing Tools

 Syntactic parsing (dependency and constituency parsing).


 Tools: CoreNLP, spaCy, SyntaxNet.

5. Lemmatizers & Stemmers

 Reduce words to their root forms.


 Tools: WordNetLemmatizer (NLTK), Porter Stemmer, spaCy lemmatizer.
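A quick comparison of stemming and lemmatization in NLTK (assuming the WordNet data has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["running", "studies", "better"]:
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
    # stems can be crude truncations ('studi'), lemmas are dictionary forms ('study')
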

6. Vectorization Tools

 Convert words into numerical vectors.


 Techniques: TF-IDF, Word2Vec, GloVe, BERT embeddings.
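A minimal Word2Vec sketch with Gensim (the toy corpus is far too small for meaningful similarities, but the API is the same for real data):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"],
                 ["dogs", "and", "cats", "are", "pets"]]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
    print(model.wv["cat"][:5])               # first 5 dimensions of the learned vector
    print(model.wv.similarity("cat", "dog")) # cosine similarity between two word vectors
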

7. Machine Learning Frameworks

 Used to build NLP models.


 Examples: Scikit-learn, TensorFlow, PyTorch, Hugging Face Transformers.

Top 10 Natural Language Processing Tools and Platforms

1. Google Cloud Natural Language API

Overview:
Google Cloud’s Natural Language API offers pre-trained machine
learning models that can perform tasks like sentiment analysis, entity
recognition, and syntax analysis. This tool is widely used for text
classification, document analysis, and content moderation.

Key Features:

 Sentiment analysis for understanding the emotional tone of text.
 Entity extraction for identifying people, places, and organizations.
 Content classification and syntax parsing for text structure analysis.

Why Choose It: Google’s Cloud NLP is scalable, easy to integrate with
Google Cloud services, and ideal for businesses needing to process
large volumes of text data in real-time.

2. IBM Watson Natural Language Understanding

Overview:
IBM Watson is one of the leading AI platforms, and its NLP tool,
Watson Natural Language Understanding (NLU), helps businesses
extract insights from unstructured text. It is particularly strong in
analyzing tone, emotion, and language translation.

Key Features:

 Emotion analysis for detecting sentiments like joy, anger, and sadness.
 Keyword extraction to identify important phrases in documents.
 Metadata extraction, including information about authors and dates from documents.

Why Choose It: With its easy-to-use API and sophisticated analytics
capabilities, Watson NLU is perfect for companies seeking deep text
analysis, including sentiment, keywords, and relations in the text.

3. SpaCy
Overview:
SpaCy is an open-source NLP library designed specifically for
building industrial-strength applications. It provides developers with
state-of-the-art speed, accuracy, and support for advanced NLP
tasks, making it a favorite among data scientists and developers.

Key Features:

 Tokenization, part-of-speech tagging, and named entity recognition (NER).
 Support for multiple languages and customizable pipelines.
 Easy integration with deep learning libraries like TensorFlow and PyTorch.

Why Choose It: If you’re building custom NLP solutions and need high
performance with flexibility, SpaCy is a great choice for its speed and
modular architecture.

4. Microsoft Azure Text Analytics

Overview:
Microsoft Azure’s Text Analytics API provides a cloud-based service
for NLP, allowing businesses to process text using pre-built machine
learning models. The platform is known for its user-friendly API and
integration with other Azure services.

Key Features:

 Sentiment analysis, key phrase extraction, and language detection.
 Named entity recognition to identify people, locations, and brands.
 Multi-language support and real-time processing capabilities.

Why Choose It: Azure Text Analytics is ideal for businesses already
using Microsoft services and looking for a simple, reliable tool for text
analysis.
5. Amazon Comprehend

Overview:
Amazon Comprehend is a fully managed NLP service that uses
machine learning to extract insights from text. It automatically
identifies the language of the text, extracts key phrases, and detects
the sentiment.

Key Features:

 Real-time language detection and entity recognition.
 Custom entity recognition for identifying domain-specific entities.
 Integrated with AWS for easy deployment and scalability.

Why Choose It: For organizations already leveraging AWS, Amazon Comprehend provides seamless integration, scalability, and ease of use for NLP applications in the cloud.

6. Stanford NLP

Overview:
Stanford NLP is a widely-used open-source NLP toolkit developed by
Stanford University. It offers a range of NLP tools and models based
on state-of-the-art machine learning algorithms for various
linguistic tasks.

Key Features:

 Tokenization, part-of-speech tagging, and named entity recognition.
 Dependency parsing and coreference resolution.
 Available in multiple languages and highly customizable.

Why Choose It: Stanford NLP is perfect for academic research or enterprises needing comprehensive NLP functionalities with robust algorithms for deep linguistic analysis.
7. Hugging Face Transformers

Overview:
Hugging Face is renowned for its open-source library, Transformers,
which provides state-of-the-art NLP models, including pre-trained
models like BERT, GPT, and T5. Hugging Face also offers an easy-to-
use API and an extensive ecosystem for developers.

Key Features:

 Pre-trained models for various NLP tasks, including translation, question-answering, and text summarization.
 Easy integration with TensorFlow and PyTorch.
 Supports fine-tuning for domain-specific needs.

Why Choose It: Hugging Face is an excellent choice for developers looking for access to powerful pre-trained models or for those who need the flexibility to fine-tune models for custom use cases.

8. TextRazor

Overview:
TextRazor is an NLP API designed for real-time text analysis. It can
extract entities, relationships, and topics from large text documents. It
also provides users with highly accurate and customizable entity
extraction.

Key Features:

 Named entity recognition, relationship extraction, and dependency parsing.
 Topic classification and custom taxonomy building.
 Sentiment analysis and multi-language support.

Why Choose It: TextRazor is ideal for real-time applications that need
deep analysis, customizable entity extraction, and robust text
classification.
9. MonkeyLearn

Overview:
MonkeyLearn is an AI-based text analysis tool that offers a no-code
interface for businesses looking to leverage NLP without needing in-
depth technical expertise. It provides solutions for sentiment analysis,
keyword extraction, and categorization.

Key Features:

 No-code platform for easy model creation and integration.
 Sentiment analysis, text classification, and keyword extraction.
 Customizable text analysis models based on specific business needs.

Why Choose It: MonkeyLearn is perfect for businesses or teams without a technical background who want to integrate NLP capabilities without the need for coding.

10. Gensim

Overview:
Gensim is an open-source library primarily focused on topic
modeling and document similarity analysis. It is widely used for
processing large volumes of unstructured text and transforming it
into insights through unsupervised learning algorithms.

Key Features:

 Topic modeling with techniques like Latent Dirichlet Allocation (LDA).
 Document similarity comparison and word embeddings.
 Memory-efficient processing of large text datasets.

Why Choose It: Gensim is a great tool for researchers and data
scientists focusing on topic modeling and document clustering in
large-scale datasets.
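A minimal topic-modeling sketch with Gensim's LDA (the four toy documents are invented; real use needs a much larger corpus and more preprocessing):

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["cricket", "bat", "ball", "match"],
            ["election", "vote", "party", "minister"],
            ["football", "goal", "match", "ball"],
            ["parliament", "vote", "bill", "minister"]]

    dictionary = corpora.Dictionary(docs)                 # maps each word to an id
    corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words per document
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

    for topic_id, words in lda.print_topics():
        print(topic_id, words)    # two topics: roughly 'sports' vs 'politics'
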
Uses of Natural Language Processing in Data
Analytics
Natural Language Processing (NLP) plays a significant role in data
analytics by enabling organizations to extract insights from
unstructured text data. Here are some of the key uses of NLP in data
analytics:

1. Sentiment Analysis

 Application: Businesses use NLP to analyze customer feedback, social media posts, and reviews to gauge public sentiment about their products or services.
 Benefit: This helps in understanding customer opinions and preferences, guiding marketing strategies, product improvements, and brand reputation management.
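As a quick illustration, a one-call sentiment analysis with the Hugging Face transformers pipeline (downloads a default pre-trained model on first use; the review text is made up):

    from transformers import pipeline

    sentiment = pipeline("sentiment-analysis")
    print(sentiment("The delivery was fast and the product quality is excellent."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
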

2. Text Classification

 Application: NLP algorithms can classify text into predefined categories, such as spam detection in emails or categorizing support tickets based on urgency or topic.
 Benefit: Automating the classification process saves time, enhances efficiency, and improves the accuracy of data categorization.

3. Named Entity Recognition (NER)

 Application: NER identifies and classifies key entities (e.g., names, organizations, locations) in text data, which is essential for data extraction in various domains like finance, healthcare, and marketing.
 Benefit: By pinpointing important entities, businesses can streamline their data collection processes and gain valuable insights from structured and unstructured data.

4. Customer Insights and Segmentation


 Application: NLP helps analyze customer interactions and
feedback to segment customers based on behavior,
preferences, and needs.
 Benefit: This enables targeted marketing efforts and
personalized customer experiences, improving engagement
and satisfaction.

5. Topic Modeling

 Application: NLP techniques, such as Latent Dirichlet Allocation (LDA), can identify underlying topics in a collection of documents or text data.
 Benefit: Organizations can uncover trends and insights from large text corpora, aiding strategic decision-making and content development.

6. Chatbots and Virtual Assistants

 Application: NLP powers chatbots and virtual assistants that interact with users in natural language, answering queries, providing information, and assisting with tasks.
 Benefit: These tools enhance customer support efficiency, reduce response times, and improve user satisfaction.

7. Search and Information Retrieval

 Application: NLP enhances search engines and information retrieval systems by allowing users to search using natural language queries.
 Benefit: Improved search capabilities lead to more relevant results and a better user experience, especially in content-heavy environments.

8. Text Summarization

 Application: NLP techniques can automatically generate summaries of long documents, articles, or reports.
 Benefit: This helps users quickly grasp key points without reading lengthy texts, saving time and improving information consumption.

9. Fraud Detection and Risk Management

 Application: Financial institutions use NLP to analyze transaction descriptions, customer communications, and reports to detect unusual patterns or potential fraud.
 Benefit: Enhanced detection capabilities reduce financial risks and improve regulatory compliance.

10. Voice Analytics

 Application: NLP is applied in analyzing voice interactions, converting spoken language into text, and extracting insights from call center data.
 Benefit: Organizations can monitor customer interactions, assess service quality, and derive actionable insights for process improvements.

Context-Free Grammars (CFG), and parsing techniques

In natural language processing, a context-free grammar (CFG) is a set of production rules used to generate all the possible sentences in a given language.

A CFG is a formal grammar in the sense that it consists of a set of terminals, which are the basic units of the language, and a set of non-terminals, which are used to generate the terminals through a set of production rules. CFGs are often used in natural language parsing and generation. They are also used in natural language understanding, where a CFG can be used to analyze the syntactic structure of a sentence.
Context free Grammar with examples

A context-free grammar (CFG) is a set of production rules used to generate all the possible sentences in a given language. A CFG consists of a set of terminals, which are the basic units of the language, and a set of non-terminals, which are used to generate the terminals through a set of production rules.

For example, consider the following CFG for a simple arithmetic language:

S -> E
E -> E + T | T
T -> T * F | F
F -> (E) | num

where S is the start symbol; E, T, and F are non-terminals; and +, *, (, ), and num are terminals.

This CFG generates all the possible arithmetical expressions, such as:

num

(num)

num + num

num * num

(num + num) * num


CFGs can be used to generate the sentences in a language or to parse
sentences, where the input sentence is analyzed to determine if it
can be generated by the grammar, and if so, how it can be generated.
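The arithmetic grammar above can be written and parsed directly with NLTK (a sketch; the terminal 'num' stands in for any number token):

    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> E
    E -> E '+' T | T
    T -> T '*' F | F
    F -> '(' E ')' | 'num'
    """)

    parser = nltk.ChartParser(grammar)   # chart parsing copes with the left-recursive rules
    for tree in parser.parse("num + num * num".split()):
        print(tree)
    # (S (E (E (T (F num))) + (T (T (F num)) * (F num))))
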

It’s important to note that context-free grammars cannot capture some properties of natural language, such as certain word-order constraints and long-distance dependencies. For those, natural language processing uses context-sensitive grammars or dependency grammars.

Top-down parsing and bottom-up parsing with examples in NLP

In natural language processing, parsing is the process of analyzing a sentence to determine its grammatical structure, and there are two main approaches to parsing: top-down parsing and bottom-up parsing.

Top-down parsing is a parsing technique that starts with the highest level of a grammar’s production rules, and then works its way down to the lowest level. It begins with the start symbol of the grammar and applies the production rules recursively to expand it into a parse tree. One example of a top-down parsing algorithm is Recursive Descent Parsing.

For example, consider the following CFG:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> the | a
N -> dog | cat | boy | girl
V -> chased | hugged
A top-down parser would begin with the start symbol “S” and
then apply the production rule “S -> NP VP” to expand it into “NP
VP”. The parser would then apply the production rule “NP -> Det N”
to expand “NP” into “Det N”.
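A recursive-descent (top-down) parse of this grammar can be sketched with NLTK:

    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat' | 'boy' | 'girl'
    V -> 'chased' | 'hugged'
    """)

    parser = nltk.RecursiveDescentParser(grammar)
    for tree in parser.parse("the dog chased the cat".split()):
        print(tree)
    # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))
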

Bottom-up parsing is a parsing technique that starts with the sentence’s words and works its way up to the highest level of the grammar’s production rules. It begins with the input sentence and applies the production rules in reverse, reducing the input sentence to the start symbol of the grammar. One example of a bottom-up parsing algorithm is Shift-Reduce Parsing.

For example, consider the same CFG:

A bottom-up parser would begin with the input sentence “the dog
chased the cat” and would apply the production rules in reverse to
reduce it to the start symbol “S”. The parser would start by matching
“the dog” to the “Det N” production rule, then “chased” to the “V”
production rule, and finally “the cat” to another “Det N” production
rule. These reduce steps will be repeated until the input sentence is
reduced to “S”, the start symbol of the grammar.
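The same sentence can be parsed bottom-up with NLTK's shift-reduce parser (a sketch using the same toy grammar):

    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat' | 'boy' | 'girl'
    V -> 'chased' | 'hugged'
    """)

    sr_parser = nltk.ShiftReduceParser(grammar)
    for tree in sr_parser.parse("the dog chased the cat".split()):
        print(tree)
    # the parser shifts words onto a stack and reduces them via
    # Det N -> NP, V NP -> VP, NP VP -> S until only S remains
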

Both top-down and bottom-up parsing are used in natural language processing, and both have their advantages and disadvantages. Top-down parsing is easier to implement, but it can get stuck in an infinite loop if the grammar is ambiguous or if the input sentence is not part of the language generated by the grammar. Bottom-up parsing is more powerful and can handle ambiguous grammars and input sentences that are not part of the language generated by the grammar, but it is more complex to implement.

Inflectional and derivational morphology

Morphological Analysis and Morphological Generation might be considered highly relevant in most Natural Language Processing applications. Because morphological analysis is a technique for recognising a word, the result can be employed at a later stage. With this in mind, this study explains how morphological analysis and generation may be demonstrated as critical components of several Natural Language Processing domains such as spell checkers and machine translation.

Morphological analysis is a field of linguistics that studies the structure of words. It identifies how a word is produced through the use of morphemes. A morpheme is a basic unit of the English language. The morpheme is the smallest element of a word that has grammatical function and meaning. Free morpheme and bound morpheme are the two types of morphemes. A single free morpheme can become a complete word.

For instance, 'bus', 'bicycle', and so forth. A bound morpheme, on the other hand, cannot stand alone and must be joined to a free morpheme to produce a word; '-ing' and 'un-' are examples of bound morphemes.

Inflectional Morphology and Derivational Morphology are the two types of morphology. Both of these types have their own significance in various areas related to Natural Language Processing.
• What is a morphological analyzer?
In inflected languages, words are formed through morphological
processes such as affixation. For example, by adding the suffix ‘-s’ to
the verb ‘to dance’, we form the third person singular ‘dances’.

A morphological analyzer assigns the attributes of a given word by evaluating what morphological processes the form has undergone. If you give it the word ‘bailaré’ in Spanish, it will tell you it is the first person, singular, simple future, indicative form of the verb ‘bailar’.
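For English, spaCy's token.morph gives a similar analysis (a sketch assuming en_core_web_sm is installed; the printed attributes are typical, not guaranteed verbatim):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    token = nlp("She dances every evening.")[1]   # the word 'dances'
    print(token.lemma_)   # dance
    print(token.morph)    # e.g. Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
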

Morphological Parsing.

It is the process of determining the morphemes from which a given word is constructed. Morphemes are the smallest meaningful units, which cannot be divided further. A morpheme can be a stem or an affix. The stem is the root word, whereas an affix can be a prefix, suffix, or infix. For example:

Unsuccessful → un + success + ful
(prefix) (stem) (suffix)

The order of morphemes also matters to the morphological parser. To design a morphological parser we require three things: a lexicon, morphotactics, and orthographic rules.
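A very simplified affix-stripping sketch of this idea (the tiny lexicon and affix lists below are toy stand-ins for a real lexicon, morphotactics, and orthographic rules):

    PREFIXES = ["un", "re", "dis"]
    SUFFIXES = ["ful", "ness", "ing", "ed", "s"]
    LEXICON = {"success", "happy", "play"}

    def parse(word):
        # try every (prefix, suffix) combination and keep one whose stem is in the lexicon
        for p in [""] + PREFIXES:
            for s in [""] + SUFFIXES:
                if word.startswith(p) and word.endswith(s):
                    stem = word[len(p):len(word) - len(s)] if s else word[len(p):]
                    if stem in LEXICON:
                        return p, stem, s
        return None

    print(parse("unsuccessful"))   # ('un', 'success', 'ful')
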

Types of Morphology:

• Inflectional Morphology: modification of a word to express different grammatical categories. Inflectional morphology is the study of processes, including affixation and vowel change, that distinguish word forms in certain grammatical categories. Inflectional morphology consists of at least five categories, provided in the following excerpt from Language Typology and Syntactic Description: Grammatical Categories and the Lexicon. As the text will explain, derivational morphology cannot be so easily categorized because derivation isn’t as predictable as inflection. Examples: cats, men, etc.

• Derivational Morphology: defined as morphology that creates new lexemes, either by changing the syntactic category (part of speech) of a base or by adding substantial, non-grammatical meaning, or both. On the one hand, derivation may be distinguished from inflectional morphology, which typically does not change category but rather modifies lexemes to fit into various syntactic contexts; inflection typically expresses distinctions like number, case, tense, aspect, and person, among others. On the other hand, derivation may be distinguished from compounding, which also creates new lexemes, but by combining two or more bases rather than by affixation, reduplication, subtraction, or internal modification of various sorts. Although the distinctions are generally useful, in practice applying them is not always easy.

APPROACHES TO MORPHOLOGY:

• Morpheme-Based Morphology:

In this approach, words are analyzed as arrangements of morphemes (an item-and-arrangement approach).

• Lexeme-Based Morphology:

Lexeme-based morphology usually takes what is called an "item-and-process" approach. Instead of analyzing a word form as a set of morphemes arranged in sequence, a word form is said to be the result of applying rules that alter a word form or stem in order to produce a new one.

• Word-Based Morphology:

Word-based morphology is usually a word-and-paradigm approach. The theory takes paradigms as a central notion. Instead of stating rules to combine morphemes into word forms or to generate word forms from stems, word-based morphology states generalizations that hold between the forms of inflectional paradigms.

Probabilistic Language Modeling: Applications of language modelling

Language Model
Language modeling (LM) refers to the use of various probabilistic
and statistical techniques to determine the probability of a given
sequence of words occurring in a sentence. LMs analyze the bodies
of text data to provide a basis for their word prediction. Language
models are widely used in natural language processing (NLP)
applications, especially ones that generate text as an output. Some of
the applications include question answering and machine
translation.

We all use online tools to translate text from one language to another for varying reasons. This is a popular example of an NLP application called Machine Translation. In Machine Translation, we take as input a bunch of words from one language and then convert these words into another language. Now, there may be many potential translations that a system might give us, and we will want to compute the probability of each of the translations to understand which one is the most accurate.

For example, we know that the probability of the first sentence, "the cat is small", will be higher than that of the second, "small the is cat". That is how we arrive at the right translation.

This ability to model the rules of a language as a probability gives great capabilities for NLP tasks. Language models are used in various applications such as speech recognition, Optical Character Recognition, handwriting recognition, machine translation, part-of-speech tagging, parsing, information retrieval, and many other tasks.

Working of Language Models


Language models function by calculating the probability of the next word from the text in the provided corpus. These models interpret the data by feeding it through various algorithms.

It is then the responsibility of the algorithms to create rules for context in natural language. These models learn the characteristics and features of a language in order to predict words. With this type of learning, the model prepares itself for understanding phrases and then predicts the next words in sentences.

A number of probabilistic approaches are used in order to train a language model. These approaches may differ on the basis of the purpose for which a specific language model is created and used. The math applied for analysis and the amount of text data to be analyzed make a significant difference in the approach adopted for creating and training a language model.

For example, a language model which is used for predicting the next
word in a long document (such as Google Docs) will be completely
different from those used in predicting the next word in a search
query. The approach followed to train the model will also be unique
and different in both cases.

Importance of Language Models


Language modeling is extremely important in various modern NLP applications. It is the reason machines are able to understand qualitative information. Every language model, in one way or another, turns qualitative information into quantitative information. That is the reason why people can communicate with machines, to a limited extent, as they do with each other.

Language modeling is used in a variety of industries including healthcare, transportation, legal, finance, military, government, and tech. It is very likely that most individuals reading this have interacted with a language model in some way at some point in the day, whether by engaging with a voice assistant, using Google search, or relying on an autocomplete text function.

Talking about the history of language modeling, its roots can be traced back to 1948. That year, a paper titled "A Mathematical Theory of Communication" was published by Claude Shannon. In that paper, he talked about the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. That paper had a very big impact on the telecommunications industry, laying the foundation for information theory and language modeling. The Markov model is still used today, and n-grams specifically are tied very closely to this concept.

Types of Language Models

Language Models are primarily classified into two types:

Statistical Language Models:

Statistical language models focus on the development of probabilistic models. These models should be capable of predicting the next word in the sequence given the words that precede it in the corpus as context. These language models use traditional statistical techniques such as N-grams, Hidden Markov Models (HMM), and certain linguistic rules in order to learn the probability distribution of words in the textual data.

Statistical Language Models are further classified into:

N-Gram:
A relatively simple type of language model is the N-gram model. We use the N-gram model to create a probability distribution for a sequence of n words. Here n can be any positive integer, and it defines the size of the model, i.e., the length of the word sequence to which a probability is assigned. For example, if n = 5, it refers to a sequence of five words: "can you please call me". The model then assigns probabilities to sequences of size n. Basically, n can be thought of as the amount of context the model is considering. The model is a unigram if n = 1, a bigram if n = 2, a trigram if n = 3, and so on. Longer n-grams are usually referred to simply by their length, such as 4-gram or 5-gram.
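A count-based bigram model can be sketched in a few lines of Python (toy corpus; a real model needs smoothing and far more data):

    from collections import Counter, defaultdict

    corpus = ["the cat is small", "the cat sat", "the dog is big"]

    bigrams = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            bigrams[prev][cur] += 1

    def prob(prev, cur):
        # maximum-likelihood estimate of P(cur | prev)
        total = sum(bigrams[prev].values())
        return bigrams[prev][cur] / total if total else 0.0

    print(prob("the", "cat"))   # 2/3: 'cat' follows 'the' in 2 of 3 cases
    print(prob("cat", "is"))    # 1/2
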

Hidden Markov Model:

The Hidden Markov Model (HMM) deals with Part-of-Speech (POS) tagging in a given text. POS tagging is a process where you assign a word its appropriate part-of-speech tag. Being a statistical model, the HMM uses probabilities as its parameters, the two parameters being the word emission probability and the tag transition probability. The word emission probability is the probability of a word given a particular tag, and the tag transition probability is the probability of a tag given the previous tag. The HMM uses these two parameters to predict the tag sequence for a given sentence. As it is a generative model, it makes use of the joint probability of the two parameters. It computes the best tag sequence, the one that maximizes the probability of getting each word for its particular tag.

To calculate the best tag sequence, the HMM uses the Viterbi algorithm, which is based on dynamic programming. In the Viterbi algorithm, we consider all the possible tags for each word in the sentence. At the start of the sentence, we initialize 𝛿 to 1 for all the states and then, using the two parameters, calculate 𝛿 for each tag. For the next word we calculate 𝛿 for all the tags, multiply it with the previous 𝛿, and keep the maximum of these for each tag. The algorithm keeps a backtrace for all the calculations, and when it reaches the end of the sentence, it backtracks to the start of the sentence, thus selecting the best tag sequence.
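A compact Viterbi sketch over a toy HMM (the tags, probabilities, and sentence are invented for illustration):

    # toy HMM: two tags with hand-set emission and transition probabilities
    tags = ["DET", "NOUN"]
    emit = {"DET":  {"the": 0.9, "dog": 0.1},
            "NOUN": {"the": 0.1, "dog": 0.9}}
    trans = {"<s>":  {"DET": 0.7, "NOUN": 0.3},
             "DET":  {"DET": 0.1, "NOUN": 0.9},
             "NOUN": {"DET": 0.4, "NOUN": 0.6}}

    def viterbi(words):
        # delta[tag] = best score of any tag sequence ending in `tag`
        delta = {t: trans["<s>"][t] * emit[t][words[0]] for t in tags}
        back = []
        for w in words[1:]:
            prev_delta = delta
            step = {}
            delta = {}
            for t in tags:
                best_prev = max(tags, key=lambda p: prev_delta[p] * trans[p][t])
                delta[t] = prev_delta[best_prev] * trans[best_prev][t] * emit[t][w]
                step[t] = best_prev
            back.append(step)
        # follow the backpointers from the best final tag
        best = max(tags, key=lambda t: delta[t])
        path = [best]
        for step in reversed(back):
            path.insert(0, step[path[0]])
        return path

    print(viterbi(["the", "dog"]))   # ['DET', 'NOUN']
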


Bidirectional:
Unlike n-gram models, which analyze text in one direction (left to right, using only the preceding words), bidirectional models analyze text in both directions, backward and forward. These models can predict any word in a sentence or body of text by using every other word in the text. Examining text bidirectionally increases result accuracy. This type is often utilized in machine learning and speech-generation applications. For example, Google uses a bidirectional model to process search queries.

We’ll either use an n-gram language model or a variant of a recurrent neural network (RNN). An RNN (theoretically) gives us infinite left context (words to the left of the target word). But what we’d really like is to use both the left and right contexts to determine how well the word fits within the sentence. A bidirectional language model can enable this. The problem statement: predict every word within the sentence given the rest of the words within the sentence.
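A bidirectional model such as BERT can be queried for exactly this task with the Hugging Face fill-mask pipeline (a sketch; the model is downloaded on first use and the completions shown are typical, not guaranteed):

    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill("The cat [MASK] on the mat.")[:3]:
        print(pred["token_str"], round(pred["score"], 3))
    # likely completions such as 'sat', 'was', 'lay',
    # predicted from both the left and the right context
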


Maximum Entropy Model:


Another statistical model is the Maximum Entropy Model. In Natural Language Processing, entropy refers to the uncertainty of a distribution. The Maximum Entropy Model is a discriminative model. Along with making use of probability, it also defines a set of features for an observation, which are combined in a probabilistic framework. These features are binary-valued functions; a feature can be anything from the observed word being capitalized to the observed word being the first word in the sentence. The features are weighted, and combined they contribute to the choice of the POS tag for the current word. The Maximum Entropy Markov Model (MEMM) is trained over a dataset having these features and then uses the conditional probabilities along with these trained features to predict the POS tag of a given word. An HMM predicts the probability of a tag producing a certain observation, while a MEMM predicts a tag given an observation; this is due to the discriminative approach of the MEMM.
To predict the best possible sequence for a sentence, the MEMM uses the Beam Search algorithm. The basic idea of Beam Search is to find the best n possibilities for a word and select the best one out of those. When the algorithm moves to the next word, it calculates the best n possibilities for the current word, then computes the best n sequences together with the previous word and selects the best one out of those. This continues until we get the tag sequence for the entire sentence.


Neural Language Models:

These language models are based on neural networks and are usually considered an advanced approach to executing natural language processing tasks. They overcome the drawbacks of statistical models such as n-grams and are used for more complex tasks such as machine translation.

These language models reduce the “curse of dimensionality”, which refers to the need for huge numbers of training examples when learning extremely complex functions. When the number of input variables is increased, the number of required examples grows exponentially. The curse arises when a huge number of different combinations of values of the input variables must be discriminated from each other, and the learning algorithm requires at least one example per relevant combination of those values. In the context of language models, this problem arises from the huge number of possible sequences of words: for example, with a sequence of 10 words taken from a vocabulary of 100,000 there are 100,000^10 = 10^50 possible sequences.

Neural networks have the ability to learn distributed representations. The distributed representation of a symbol is a vector (or a tuple) of features that characterize the meaning of that symbol and are not mutually exclusive.

The benefit of the distributed representation approach is that it permits the model to generalize well to sequences that aren’t in the set of training word sequences, but which are similar in terms of their features (their distributed representation). Since neural networks are known to map nearby inputs to nearby outputs, the predictions corresponding to word sequences with similar features are mapped to similar predictions. Because many unique combinations of feature values are possible, a very large set of possible meanings can be represented compactly, which allows a model with a comparatively small number of parameters to fit a large training set.

Applications & Limitations of Language Models


As we saw, the language models have evolved over time and are a
great asset for real-life applications. Let us look at some of the
commonly used applications which use language models.

 Automatic Speech Recognition (ASR) Software: Automatic Speech Recognition, known as ASR for short, is the technology that allows human beings to use their voices to speak with a computer interface in a way that, to an extent, resembles a normal human conversation. The variant of ASR which comes closest to allowing real conversation between people and machine intelligence is the Siri program on iPhone devices. Although it still has a long way to go before reaching the apex of its development, we’re already seeing some remarkable results. However, despite a high accuracy of 96 to 99%, this software can only achieve such results under ideal conditions, where the questions are of a simple yes-or-no type or have only a limited number of possible response options.

 Machine Translation: Machine translation is the process where computer software is used to translate a text input from one natural language to another. The most common example of this is Google Translate, where the user can select any two languages to translate to and from. While translating a sentence, it is of utmost importance that the meaning of the text remains unchanged in the target language. Although it seems straightforward, in reality it is not. Translation is not just word-to-word substitution: all the intricacies, the elements of the text, the common usage of the words, and how they affect each other in the text must be interpreted by the translator. Extensive expertise in grammar, syntax, semantics, and so on, in both the source and the target language, as well as familiarity with each local language, is required. At the end of the day, translation does not mean substitution but expressing the same thoughts and feelings in another way.

 Speech Translation: When people from different parts of the world come together for an event, it is crucial that they
understand what others are saying. If two people do not speak or
understand each other’s languages then this becomes an issue.
Although they can get around this problem by having a third
person to translate for both of them, this is not feasible when the
number of people and their languages are high in number.
Speech translation is the process by which the spoken phrases
are instantly translated and spoken aloud in a second language.
This is different from phrase translation, where the system only
translates a fixed and finite set of phrases that have been
manually entered into the system. Speech translation technology
enables speakers of different languages to communicate. It thus
is of tremendous value for humankind in terms of science, cross-
cultural exchange, and global business. When a speaker speaks
into the microphone, ASR technology is used to recognize the
spoken phrase, using the Machine Translation, the phrase is
translated to the desired language, and finally using voice
synthesis, it is spoken out with minimal delay.

 Spelling Correction: Automatic spelling correction is important for many NLP applications like web search engines,
text summarization, sentiment analysis, etc. In most approaches,
the software is trained using parallel data of noisy and correct
word mappings from different sources for automatic spelling
correction. It is a process where the written text is checked and in
case of a misspelled word, suggestions are provided or it is
corrected by the software for the user. It is used to flag words in a
document that may not be spelled correctly. The incorrectly
spelled words are then corrected using the trained model.

Now let us have a look at the limitations of various language models:

One of the most obvious and common limitations of language models stems from the current state of Artificial Intelligence and Machine Learning: although we have come far enough to create usable applications, these are not 100% accurate. A little inaccuracy can lead to misinterpretation of certain words or sentences. But this is not a dead end, as the future will bring more and more advances in this field and the technology will only get better.

There is a serious shortcoming of N-gram models in real-world applications. As the N-gram model works on the probabilities of various words and their preceding context to make sense of a given sentence, it becomes completely clueless if a sentence contains words outside the trained space. For example, if our N-gram model has the following sentences in its training set: "This is a wonderful idea" and "We are going to have dinner", and it encounters a sentence like "I am grateful for the gift", it cannot make sense of the sentence because the words "I", "am", "grateful", "for", "the", and "gift" are not in the vocabulary of the model. Thus, if we need to build a useful application, we need to work with an extremely large dataset. Even after using a huge dataset, it is possible to construct sentences which the model will not be able to understand.

One of the main issues in language modeling is data sparsity: the number of words and valid word combinations is spread across an effectively infinite number of documents and datasets. This data is in principle infinite, and most language modeling paradigms can only give reliable probability estimates for frequent combinations.

English vocabulary sizes used in natural language processing applications such as speech recognition and translation involve tens of thousands, possibly hundreds of thousands, of different words. With even 100,000 words in the equations for neural models, the computational bottleneck is at the output layer, where the number of operations grows with the size of the vocabulary. This is much more than the number of operations typically involved in computing probability predictions for n-gram models.

In addition to the computational challenges, the representation of a fixed-size context is another challenge. We need a way to represent longer-term context, which learns a representation of context that summarizes the past word sequence in a way that preserves information to predict the future.

Machines also do not understand language as closely as we understand our own language. We can understand the tone of a sentence and use contextual information to make sense of a spoken sentence even when it is not direct. For example, humans can understand the sarcasm in someone’s words, whereas machines may have difficulty understanding it.

There are many low-resource languages with very little data available for learning. Language models cannot be used well for these languages because of the scarcity of data. On the other hand, if a document is too large, it is difficult to keep track of context: most language models keep track of the current context and information, but when the machine is too far into the document, it gets difficult to keep track of the relevant information while reading.
