Unit - 1

Origin and Challenges of NLP

Natural language processing (NLP) is

 A field of computer science, artificial intelligence, and linguistics
 Concerned with the interactions between computers and human (natural) languages
 Specifically, the process of a computer extracting meaningful information from natural
language input and/or producing natural language output
Below are the steps involved and some challenges that are faced in the machine learning process
for NLP:
Breaking the sentence
Formally referred to as “sentence boundary disambiguation”, this breaking process is no longer
difficult to achieve, but it is a critical step, especially in the case of highly unstructured data
that includes structured information. A breaking application should be intelligent enough to
separate paragraphs into their appropriate sentence units. Highly complex data might not always
be available in easily recognizable sentence form; it may exist as tables, graphics, notations,
page breaks, etc., which must be appropriately processed for the machine to derive meaning the
same way a human would when interpreting text.
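As a minimal illustration (not from the original text), sentence boundary disambiguation can be performed with NLTK's pretrained punkt models; the sample text here is invented:

```python
# A minimal sketch of sentence boundary disambiguation using NLTK.
# Assumes the "punkt" tokenizer models can be downloaded at run time.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # fetch the pretrained sentence tokenizer

text = ("Prices rose 5% in Q3. Dr. Smith, however, expects a dip. "
        "See Table 2 for details.")

# Abbreviations such as "Dr." make naive splitting on "." unreliable;
# the punkt model handles many of these cases.
for sentence in sent_tokenize(text):
    print(sentence)
```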
Solution: Tagging the parts of speech (POS) and generating dependency graphs
NLP applications employ POS tagging tools that assign a POS tag to each word or
symbol in a given text. A dependency graph generated in the same pipeline then captures the
syntactic role of each word in the sentence. The POS tags can be further processed to
create meaningful single or compound vocabulary terms, as the sketch below illustrates.
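A minimal sketch with spaCy (an assumption: the small English model en_core_web_sm must be installed separately via `python -m spacy download en_core_web_sm`):

```python
# A minimal sketch of POS tagging and dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The stock price of Apple rose by 20 percent.")

# Each token carries a POS tag and an arc in the dependency graph
# (its syntactic head plus the relation label).
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} -> {token.head.text}")
```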
Understanding the context
Challenge: The same word can change meaning with its setting. Consider two sentences such as
“I enjoy working in a bank” and “I enjoy walking near a river bank.” The context of these
sentences is quite different.
Solution: There are several methods today to help train a machine to understand the differences
between such sentences. Some of the popular methods use custom-made knowledge graphs in
which, for example, both senses of “bank” are represented with statistically derived weights.
When a new document is under observation, the machine refers to the graph to determine the
setting before proceeding.
One challenge in building the knowledge graph is domain specificity. Knowledge graphs cannot,
in a practical sense, be made to be universal.
Example: In the sentences above, “enjoy working in a bank” suggests an occupation (work, job,
profession), while “near a river bank” could refer to any type of work or activity that can be
performed beside a river.
Two sentences with totally different contexts in different domains might confuse the machine if
forced to rely solely on knowledge graphs. It is therefore critical to enhance the methods used
with a probabilistic approach in order to derive context and proper domain choice.
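As a purely illustrative sketch of the idea (every name and sense list here is hypothetical; real knowledge graphs are far larger and statistically weighted):

```python
# A toy "knowledge graph" lookup in the spirit of the bank example above:
# pick the sense whose neighbouring concepts overlap most with the context.
SENSE_GRAPH = {
    "bank": {
        "financial institution": {"work", "job", "money", "account"},
        "river bank": {"river", "water", "walk", "fishing"},
    }
}

def pick_sense(word, context_words):
    """Score each sense by its overlap with the observed context."""
    senses = SENSE_GRAPH.get(word, {})
    scores = {s: len(neigh & set(context_words)) for s, neigh in senses.items()}
    return max(scores, key=scores.get) if scores else None

print(pick_sense("bank", ["enjoy", "working", "job"]))  # financial institution
print(pick_sense("bank", ["walk", "near", "river"]))    # river bank
```

A probabilistic system would replace the raw overlap counts with learned sense probabilities, which is exactly the enhancement the paragraph above calls for.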
Extracting named entities (often referred to as Named Entity Recognition, or NER)
Challenge: The next big challenge is to successfully execute NER, which is essential when
training a machine to distinguish between ordinary vocabulary and named entities. In many
instances, these entities are surrounded by dollar amounts, place names, numbers, dates, and
times; it is critical to identify and express the connections between these elements, and only
then can a machine fully interpret a given text.
Solution: This problem, however, has been solved to a great degree by well-known NLP toolkits
such as Stanford CoreNLP and AllenNLP.
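A minimal NER sketch using spaCy's pretrained model (the sentence is invented; the labels shown are spaCy's standard entity types):

```python
# Extracting named entities with spaCy (same en_core_web_sm model as above).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple's stock rose by 20% to $168 on Feb 20, 2018 in New York.")

# Entities come back as spans with a type label (ORG, MONEY, DATE, GPE, ...).
for ent in doc.ents:
    print(f"{ent.text:15} {ent.label_}")
```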
Use Case: Transforming unstructured data into structured format
Challenge: Putting unstructured data into a format that can be reused for analysis.
Historically, this task was done manually by humans.
Example: Consider the following example, which contains a named entity, an event, a financial
element, and its values under different time scales. “The recent developments in technology have
enabled the stock price of Apple to rise by 20% to $168 as at Feb 20, 2018 from $140 in Q3
2017.” This sentence can be broken down into a structure along these lines:

Named entity: Apple
Event: rise in stock price
Financial element: stock price
Change: +20%
Value as at Feb 20, 2018: $168
Value in Q3 2017: $140
This is extremely challenging through linguistics alone. Not all sentences are written in a single
fashion, since authors follow their unique styles. While linguistics is an initial approach toward
extracting the data elements from a document, it doesn’t stop there. A semantic layer that
understands the relationships between data elements, their values, and their surroundings has to
be machine-trained to produce a modular output in a given format, roughly as the sketch below
attempts.
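A hedged sketch of the first step only: grouping recognized entities by type. Attaching each value to the correct time period would additionally require relation extraction, which is not shown here.

```python
# Turning the example sentence into a rough structured record by grouping
# spaCy entities by their type label.
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")
sentence = ("The recent developments in technology have enabled the stock "
            "price of Apple to rise by 20% to $168 as at Feb 20, 2018 "
            "from $140 in Q3 2017.")

record = defaultdict(list)
for ent in nlp(sentence).ents:
    record[ent.label_].append(ent.text)

# e.g. {'ORG': ['Apple'], 'PERCENT': ['20%'], 'MONEY': [...], 'DATE': [...]}
print(dict(record))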
2. Challenges of NLP for AI
Artificial intelligence has become part of our everyday lives – Alexa and Siri, text and email
autocorrect, customer service chatbots. They all use machine learning algorithms to process and
respond to human language. A branch of AI called Natural Language Processing (NLP) allows
machines to “understand” natural human language. A combination of linguistics and computer
science, NLP works to transform regular spoken or written language into something that can be
processed by machines.
NLP is a powerful tool with huge benefits, but there are still a number of Natural Language
Processing limitations and problems:
1. Contextual words and phrases and homonyms
2. Synonyms
3. Irony and sarcasm
4. Ambiguity
5. Errors in text or speech
6. Colloquialisms and slang
7. Domain-specific language
8. Low-resource languages
9. Lack of research and development
1. Contextual words and phrases and homonyms
The same words and phrases can have different meanings according to the context of a sentence,
and many words – especially in English – have the exact same pronunciation but totally different
meanings.
For example: I ran to the store because we ran out of milk. Can I run something past you real
quick?
Homonyms – two or more words that are pronounced the same but have different definitions –
can be problematic for question answering and speech-to-text applications because the intended
word cannot be determined from the sound alone. Choosing between their and there, for
example, is a common problem even for humans.
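One common baseline for choosing a word sense from context is the Lesk algorithm; here is a minimal sketch with NLTK (output quality on sentences this short varies, so treat it as an illustration of the mechanism only):

```python
# Context-based sense disambiguation with NLTK's Lesk implementation.
# Assumes the "wordnet" and "punkt" resources can be downloaded.
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

for sent in ["I ran to the store because we ran out of milk.",
             "Can I run something past you real quick?"]:
    sense = lesk(word_tokenize(sent), "run")  # pick a WordNet sense of "run"
    print(sense, "-", sense.definition() if sense else "no sense found")
```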
2. Synonyms
Synonyms can lead to issues similar to contextual understanding because we use many different
words to express the same idea. Furthermore, some of these words may convey exactly the same
meaning, while others may differ in degree (small, little, tiny, minute), and different
people use synonyms to denote slightly different meanings within their personal vocabulary.
3. Irony and sarcasm
Irony and sarcasm present problems for machine learning models because they generally use
words and phrases that, strictly by definition, may be positive or negative. Models can be trained
with certain cues that frequently accompany ironic or sarcastic phrases, like “yeah right,”
“whatever,” etc., and word embeddings (where words that have the same meaning have a similar
representation), but it’s still a tricky process.
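A deliberately naive sketch of the cue-phrase idea mentioned above (the cue list is invented for illustration; real models combine such cues with embeddings and broader context):

```python
# Flag possible sarcasm by checking for cue phrases that often accompany
# ironic statements. A toy heuristic, not a trained model.
SARCASM_CUES = {"yeah right", "whatever", "as if", "sure, sure"}

def maybe_sarcastic(text: str) -> bool:
    lowered = text.lower()
    return any(cue in lowered for cue in SARCASM_CUES)

print(maybe_sarcastic("Yeah right, that's going to work."))  # True
print(maybe_sarcastic("That plan should work."))             # False
```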
4. Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.
 Lexical ambiguity: a single word that could be used as a verb, noun, or adjective (e.g., “watch”).
 Syntactic ambiguity: This kind of ambiguity occurs when a sentence is parsed in different
ways. For example, the sentence “The man saw the girl with the telescope”. It is ambiguous
whether the man saw the girl carrying a telescope or he saw her through his telescope.
 Anaphoric ambiguity: This kind of ambiguity arises from the use of anaphora in discourse.
For example: “The horse ran up the hill. It was very steep. It soon got tired.” The anaphoric
reference of “it” differs in the two sentences (“it” points to the hill in the first and to the horse
in the second), which causes ambiguity.
 Pragmatic ambiguity: This kind of ambiguity refers to a situation where the context of a
phrase gives it multiple interpretations. In simple words, pragmatic ambiguity arises when the
statement is not specific. For example, the sentence “I like you too” can have multiple
interpretations: I like you (just like you like me), or I like you (just like someone else does).
Even for humans, such a sentence alone is difficult to interpret without the context of
surrounding text. POS (part-of-speech) tagging is one NLP solution that can help solve the
problem, somewhat, as the sketch below shows.
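A minimal sketch of POS tagging resolving lexical ambiguity with NLTK (taggers can still err on genuinely ambiguous sentences; the two examples are invented):

```python
# POS tags make the noun/verb ambiguity of "watch" explicit.
# Assumes NLTK's punkt and perceptron-tagger models can be downloaded.
import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

print(pos_tag(word_tokenize("I watch the game every week.")))   # watch -> verb
print(pos_tag(word_tokenize("My watch stopped this morning.")))  # watch -> noun
```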
5. Errors in text or speech
Misspelled or misused words can create problems for text analysis. Spelling mistakes can occur
for a variety of reasons, from typing errors to extra spaces between letters or missing letters.
Autocorrect and grammar correction applications can handle common mistakes, but don’t always
understand the writer’s intention.
For example, if the misspelled word is “speling,” the system will find the correct word:
“spelling.”
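A minimal sketch of dictionary-based correction using only Python's standard library (the vocabulary here is a toy stand-in for a real lexicon):

```python
# difflib suggests the closest match from a known vocabulary, but, as noted
# above, it has no notion of the writer's actual intention.
import difflib

vocabulary = ["spelling", "speaking", "spewing", "sibling"]
print(difflib.get_close_matches("speling", vocabulary, n=1))  # ['spelling']
```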
6. Colloquialisms and slang
Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems
for NLP – especially for models intended for broad use. Unlike formal language, colloquialisms
may have no “dictionary definition” at all, and these expressions may even have different
meanings in different geographic areas. Furthermore, cultural slang is constantly morphing and
expanding, so new words pop up every day.
This is where training and regularly updating custom models can be helpful, although it
oftentimes requires quite a lot of data.
7. Domain-specific language
Different businesses and industries often use very different language. An NLP model needed for
healthcare, for example, would be very different from one used to process legal documents.
These days there are a number of analysis tools trained for specific fields, but extremely niche
industries may need to build or train their own models.
8. Low-resource languages
NLP applications have largely been built for the most common, widely used languages.
However, many languages, especially those spoken by people with less access to technology,
often go overlooked and under-processed. For example, by some estimates (depending on where
one draws the line between language and dialect) there are over 3,000 languages in Africa alone.
There simply isn’t very much data on many of these languages. However, new techniques,
like multilingual transformers (such as Google’s BERT, “Bidirectional Encoder Representations
from Transformers”) and multilingual sentence embeddings, aim to identify and leverage
universal similarities that exist between languages.
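As a hedged illustration of multilingual sentence embeddings (the library and model name below are assumptions about what is installed, not something the text prescribes):

```python
# Embedding sentences from different languages into one vector space with
# the sentence-transformers package.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
sentences = ["The weather is nice today.",   # English
             "Il fait beau aujourd'hui."]    # French, same meaning

embeddings = model.encode(sentences)
# Semantically similar sentences land close together across languages.
print(util.cos_sim(embeddings[0], embeddings[1]))
```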
9. Lack of research and development
Machine learning requires a lot of data to perform at its best: billions of pieces of
training data. The more data NLP models are trained on, the smarter they become. That said, data
(and human language!) is only growing by the day, as are new machine learning techniques and
custom algorithms. All of the problems above will require more research and new techniques in
order to be solved.
Language Modeling
A language model is the core component of modern Natural Language Processing (NLP). It’s a
statistical model that is designed to analyze the pattern of human language and predict the
likelihood of a sequence of words or tokens.
NLP-based applications use language models for a variety of tasks, such as audio to text
conversion, speech recognition, sentiment analysis, summarization, spell correction, etc.
Let’s understand how language models help in processing these NLP tasks:
-Speech Recognition: Smart speakers, such as Alexa, use automatic speech recognition (ASR)
mechanisms for translating speech into text. During this translation, the ASR mechanism
analyzes the intent/sentiment of the user by differentiating between words, for example by
disambiguating homophone phrases such as “Let her” vs. “Letter” and “But her” vs. “Butter”.
-Machine Translation: When translating a Chinese phrase “我在吃” into English, the translator
can give several choices as output:
I am eating
Me am eating
Eating am I
Here, the language model determines that the translation “I am eating” sounds most natural and
suggests it as the output.
How does a Language Model Work?
Language Models determine the probability of the next word by analyzing the text in data.
These models interpret the data by feeding it through algorithms.
The algorithms are responsible for creating rules for the context in natural language. The models
are prepared for the prediction of words by learning the features and characteristics of a
language. With this learning, the model prepares itself for understanding phrases and predicting
the next words in sentences.
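As a toy illustration of this idea (invented corpus, bigram counts only):

```python
# Count which word follows which in a tiny corpus, then predict the most
# likely next word. A toy stand-in for "learning the features of a language".
from collections import Counter, defaultdict

corpus = "i am eating . i am eating . you are reading".split()

follow = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev][nxt] += 1

def predict_next(word):
    counts = follow[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("am"))  # 'eating' ("am" is followed by "eating" twice)
```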

For training a language model, a number of probabilistic approaches are used. These approaches
vary on the basis of the purpose for which a language model is created. The amount of text data
to be analyzed and the math applied for analysis makes a difference in the approach followed for
creating and training a language model.

For example, a language model used for predicting the next word in a search query will be
absolutely different from those used in predicting the next word in a long document (such as
Google Docs). The approach followed to train the model would be unique in both cases.

What is statistical language modeling in NLP?


Statistical Language Modeling (also called Language Modeling, or LM for short) is the
development of probabilistic models that can predict the next word in a sequence given the
words that precede it.
A statistical language model learns the probability of word occurrence based on examples of
text. Simpler models may look at a context of a short sequence of words, whereas larger models
may work at the level of sentences or paragraphs. Most commonly, language models operate at
the level of words.
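For reference, a LaTeX sketch of the usual chain-rule factorization behind this (the formula is standard background, not stated explicitly in the text above):

```latex
% Chain-rule factorisation of a word-sequence probability; an n-gram model
% approximates each conditional by truncating the history to n-1 words.
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
                 \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```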

You could develop a language model and use it standalone for purposes like generating new
sequences of text that appear to have come from the training corpus.
Language modeling is a core problem for a rather wide range of natural language
processing tasks. Language models are generally used on the front-end or back-end of a more
sophisticated model for a task that needs language understanding.

What are the types of statistical language models?


Statistical models include the development of probabilistic models that are able to predict the
next word in the sequence, given the words that precede it. A number of statistical language
models are in use already. Let’s take a look at some of those popular models:
1. N-Gram
This is one of the simplest approaches to language modelling. Here, a probability distribution is
created over sequences of ‘n’ items, where ‘n’ can be any number and defines the size of the
gram (the sequence of words being assigned a probability). If n=4, a gram may look like: “can
you help me”. Basically, ‘n’ is the amount of context that the model is trained to consider. There
are different types of N-Gram models such as unigrams, bigrams, trigrams, etc., as in the sketch
below.
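A short sketch of n-gram counting with maximum-likelihood probabilities (toy corpus, bigram case):

```python
# Maximum-likelihood n-gram probabilities:
# P(w | history) = count(history + w) / count(history).
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "can you help me can you hear me".split()
bigrams, unigrams = Counter(ngrams(tokens, 2)), Counter(tokens)

def mle(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(mle("you", "can"))   # 1.0  ("can" is always followed by "you")
print(mle("help", "you"))  # 0.5  ("you help" once, "you hear" once)
```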
2. Exponential
This type of statistical model evaluates text using an equation that combines n-grams and
feature functions. Here the features and parameters of the desired results are specified in
advance. The model is based on the principle of maximum entropy, which states that the
probability distribution with the most entropy (consistent with the observed features) is the best
choice. Exponential models make fewer statistical assumptions, which means the chances of
obtaining accurate results are higher.
3. Continuous Space
In this type of statistical model, words are represented as a non-linear combination of weights in
a neural network. The process of assigning a weight vector to a word is known as word
embedding. This type of model proves helpful in scenarios where the data set of words continues
to grow and to include unique words.
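A hedged sketch of learning word embeddings with gensim's Word2Vec (an assumption: the gensim 4.x API; a corpus this small yields meaningless vectors and only demonstrates the mechanics):

```python
# Train tiny word embeddings; each word becomes a dense vector whose
# position is learned by a shallow neural network.
from gensim.models import Word2Vec

sentences = [["white", "horse", "runs"],
             ["black", "horse", "sleeps"],
             ["white", "cat", "sleeps"]]

model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50)
print(model.wv["horse"][:4])                  # first few embedding dimensions
print(model.wv.similarity("white", "black"))  # cosine similarity of two words
```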
What are the applications of statistical language modeling?
Statistical language models are used to generate text in many similar natural language processing
tasks, such as:
1. Speech Recognition: Voice assistants such as Siri and Alexa are examples of how
language models help machines in processing speech audio.

2. Machine Translation: Google Translator and Microsoft Translate are examples of how
NLP models can help in translating one language to another.

3. Sentiment Analysis: This helps in analyzing sentiments behind a phrase. This use case
of NLP models is used in products that allow businesses to understand a customer’s
intent behind opinions or attitudes expressed in the text. Hubspot’s Service Hub is an
example of how language models can help in sentiment analysis.

4. Text Suggestions: Google services such as Gmail or Google Docs use language models
to help users get text suggestions while they compose an email or create long text
documents, respectively.

5. Parsing Tools: Parsing involves analyzing sentences or words that comply with syntax
or grammar rules. Spell checking tools are perfect examples of language modelling and
parsing.

Language models are also used to generate text in other similar language processing tasks like
optical character recognition, handwriting recognition, image captioning, etc.

What are the drawbacks of statistical language modeling?


1. Zero probabilities
Suppose we have a trigram language model that conditions on the previous two words and has a
vocabulary of 10,000 words. Then there are 10¹² possible triplets. If our training data contains
10¹⁰ words, many triples will never be observed in the training data, and so the basic MLE
(Maximum Likelihood Estimate) will assign zero probability to those events. A zero probability
translates to infinite perplexity. To overcome this issue, many techniques have been developed
under the family of smoothing techniques, as in the sketch below.
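A minimal sketch of the simplest member of that family, add-one (Laplace) smoothing (toy corpus; real systems use more refined schemes such as Kneser-Ney):

```python
# Add-one smoothing: every n-gram gets a pseudo-count of 1, so unseen
# events no longer receive zero probability.
from collections import Counter

tokens = "the white horse ran past the black dog".split()
vocab = set(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def laplace(word, prev):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(laplace("horse", "white"))  # seen bigram: count boosted by 1
print(laplace("horse", "black"))  # unseen bigram: small but nonzero
```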

2. Exponential Growth
The second challenge is that the number of possible n-grams grows exponentially: it is the
vocabulary size raised to the nth power. A 10,000-word vocabulary has 10¹² possible trigrams,
and a 100,000-word vocabulary has 10¹⁵.
3. Generalization
The last issue with MLE techniques is the lack of generalization. If the model sees the term
‘white horse’ in the training data but never sees ‘black horse’, the MLE will assign zero
probability to ‘black horse’, however plausible the phrase may be.
