Unit - 1

Origin and Challenges of NLP

Natural language processing (NLP) is

 A field of computer science, artificial intelligence, and linguistics
 Concerned with the interactions between computers and human (natural) languages
 Specifically, the process of a computer extracting meaningful information from natural
language input and/or producing natural language output
Below are the steps involved and some challenges that are faced in the machine learning process
for NLP:
Breaking the sentence
Formally referred to as “sentence boundary disambiguation”, this breaking process is no longer
difficult to achieve, but it is a critical step, especially in the case of highly unstructured data
that includes structured information. A breaking application should be intelligent enough to
separate paragraphs into their appropriate sentence units. Highly complex data might not always
be available in easily recognizable sentence form; it may exist as tables, graphics, notations,
page breaks, etc., which must be appropriately processed for the machine to derive meaning the
same way a human would when interpreting text.
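As a minimal illustration (not from the original text), sentence boundary disambiguation can be performed with NLTK's pretrained punkt models; the sample text here is invented:

```python
# A minimal sketch of sentence boundary disambiguation using NLTK.
# Assumes the "punkt" tokenizer models can be downloaded at run time.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # fetch the pretrained sentence tokenizer

text = ("Prices rose 5% in Q3. Dr. Smith, however, expects a dip. "
        "See Table 2 for details.")

# Abbreviations such as "Dr." make naive splitting on "." unreliable;
# the punkt model handles many of these cases.
for sentence in sent_tokenize(text):
    print(sentence)
```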
Solution: Tagging the parts of speech (POS) and generating dependency graphs
NLP applications employ POS tagging tools that assign a POS tag to each word or
symbol in a given text. A dependency graph generated in the same pipeline then captures the
syntactic role of each word in the sentence. The POS tags can be further processed to
create meaningful single or compound vocabulary terms, as the sketch below illustrates.
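A minimal sketch with spaCy (an assumption: the small English model en_core_web_sm must be installed separately via `python -m spacy download en_core_web_sm`):

```python
# A minimal sketch of POS tagging and dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The stock price of Apple rose by 20 percent.")

# Each token carries a POS tag and an arc in the dependency graph
# (its syntactic head plus the relation label).
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} -> {token.head.text}")
```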
Understanding the context
Challenge: The same word can change meaning with its setting. Consider two sentences such as
“I enjoy working in a bank” and “I enjoy walking near a river bank.” The context of these
sentences is quite different.
Solution: There are several methods today to help train a machine to understand the differences
between such sentences. Some of the popular methods use custom-made knowledge graphs in
which, for example, both senses of “bank” are represented with statistically derived weights.
When a new document is under observation, the machine refers to the graph to determine the
setting before proceeding.
One challenge in building the knowledge graph is domain specificity. Knowledge graphs cannot,
in a practical sense, be made to be universal.
Example: In the sentences above, “enjoy working in a bank” suggests an occupation (work, job,
profession), while “near a river bank” could refer to any type of work or activity that can be
performed beside a river.
Two sentences with totally different contexts in different domains might confuse the machine if
forced to rely solely on knowledge graphs. It is therefore critical to enhance the methods used
with a probabilistic approach in order to derive context and proper domain choice.
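As a purely illustrative sketch of the idea (every name and sense list here is hypothetical; real knowledge graphs are far larger and statistically weighted):

```python
# A toy "knowledge graph" lookup in the spirit of the bank example above:
# pick the sense whose neighbouring concepts overlap most with the context.
SENSE_GRAPH = {
    "bank": {
        "financial institution": {"work", "job", "money", "account"},
        "river bank": {"river", "water", "walk", "fishing"},
    }
}

def pick_sense(word, context_words):
    """Score each sense by its overlap with the observed context."""
    senses = SENSE_GRAPH.get(word, {})
    scores = {s: len(neigh & set(context_words)) for s, neigh in senses.items()}
    return max(scores, key=scores.get) if scores else None

print(pick_sense("bank", ["enjoy", "working", "job"]))  # financial institution
print(pick_sense("bank", ["walk", "near", "river"]))    # river bank
```

A probabilistic system would replace the raw overlap counts with learned sense probabilities, which is exactly the enhancement the paragraph above calls for.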
Extracting named entities (often referred to as Named Entity Recognition, or NER)
Challenge: The next big challenge is to successfully execute NER, which is essential when
training a machine to distinguish between ordinary vocabulary and named entities. In many
instances, these entities are surrounded by dollar amounts, place names, numbers, dates, and
times; it is critical to identify and express the connections between these elements, and only
then can a machine fully interpret a given text.
Solution: This problem, however, has been solved to a great degree by well-known NLP toolkits
such as Stanford CoreNLP and AllenNLP.
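A minimal NER sketch using spaCy's pretrained model (the sentence is invented; the labels shown are spaCy's standard entity types):

```python
# Extracting named entities with spaCy (same en_core_web_sm model as above).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple's stock rose by 20% to $168 on Feb 20, 2018 in New York.")

# Entities come back as spans with a type label (ORG, MONEY, DATE, GPE, ...).
for ent in doc.ents:
    print(f"{ent.text:15} {ent.label_}")
```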
Use Case: Transforming unstructured data into structured format
Challenge: Putting unstructured data into a format that can be reused for analysis.
Historically, this task was done manually by humans.
Example: Consider the following example, which contains a named entity, an event, a financial
element, and its values under different time scales. “The recent developments in technology have
enabled the stock price of Apple to rise by 20% to $168 as at Feb 20, 2018 from $140 in Q3
2017.” This sentence can be broken down into a structure along these lines:

Named entity: Apple
Event: rise in stock price
Financial element: stock price
Change: +20%
Value as at Feb 20, 2018: $168
Value in Q3 2017: $140
This is extremely challenging through linguistics alone. Not all sentences are written in a single
fashion, since authors follow their unique styles. While linguistics is an initial approach toward
extracting the data elements from a document, it doesn’t stop there. A semantic layer that
understands the relationships between data elements, their values, and their surroundings has to
be machine-trained to produce a modular output in a given format, roughly as the sketch below
attempts.
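A hedged sketch of the first step only: grouping recognized entities by type. Attaching each value to the correct time period would additionally require relation extraction, which is not shown here.

```python
# Turning the example sentence into a rough structured record by grouping
# spaCy entities by their type label.
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")
sentence = ("The recent developments in technology have enabled the stock "
            "price of Apple to rise by 20% to $168 as at Feb 20, 2018 "
            "from $140 in Q3 2017.")

record = defaultdict(list)
for ent in nlp(sentence).ents:
    record[ent.label_].append(ent.text)

# e.g. {'ORG': ['Apple'], 'PERCENT': ['20%'], 'MONEY': [...], 'DATE': [...]}
print(dict(record))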
2. Challenges of NLP for AI
Artificial intelligence has become part of our everyday lives – Alexa and Siri, text and email
autocorrect, customer service chatbots. They all use machine learning algorithms to process and
respond to human language. A branch of AI called Natural Language Processing (NLP) allows
machines to “understand” natural human language. A combination of linguistics and computer
science, NLP works to transform regular spoken or written language into something that can be
processed by machines.
NLP is a powerful tool with huge benefits, but there are still a number of Natural Language
Processing limitations and problems:
1. Contextual words and phrases and homonyms
2. Synonyms
3. Irony and sarcasm
4. Ambiguity
5. Errors in text or speech
6. Colloquialisms and slang
7. Domain-specific language
8. Low-resource languages
9. Lack of research and development
1. Contextual words and phrases and homonyms
The same words and phrases can have different meanings according to the context of a sentence,
and many words – especially in English – have the exact same pronunciation but totally different
meanings.
For example: I ran to the store because we ran out of milk. Can I run something past you real
quick?
Homonyms – two or more words that are pronounced the same but have different definitions –
can be problematic for question answering and speech-to-text applications because the intended
word cannot be determined from the sound alone. Choosing between their and there, for
example, is a common problem even for humans.
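One common baseline for choosing a word sense from context is the Lesk algorithm; here is a minimal sketch with NLTK (output quality on sentences this short varies, so treat it as an illustration of the mechanism only):

```python
# Context-based sense disambiguation with NLTK's Lesk implementation.
# Assumes the "wordnet" and "punkt" resources can be downloaded.
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

for sent in ["I ran to the store because we ran out of milk.",
             "Can I run something past you real quick?"]:
    sense = lesk(word_tokenize(sent), "run")  # pick a WordNet sense of "run"
    print(sense, "-", sense.definition() if sense else "no sense found")
```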
2. Synonyms
Synonyms can lead to issues similar to contextual understanding because we use many different
words to express the same idea. Furthermore, some of these words may convey exactly the same
meaning, while others may differ in degree (small, little, tiny, minute), and different
people use synonyms to denote slightly different meanings within their personal vocabulary.
3. Irony and sarcasm
Irony and sarcasm present problems for machine learning models because they generally use
words and phrases that, strictly by definition, may be positive or negative. Models can be trained
with certain cues that frequently accompany ironic or sarcastic phrases, like “yeah right,”
“whatever,” etc., and word embeddings (where words that have the same meaning have a similar
representation), but it’s still a tricky process.
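A deliberately naive sketch of the cue-phrase idea mentioned above (the cue list is invented for illustration; real models combine such cues with embeddings and broader context):

```python
# Flag possible sarcasm by checking for cue phrases that often accompany
# ironic statements. A toy heuristic, not a trained model.
SARCASM_CUES = {"yeah right", "whatever", "as if", "sure, sure"}

def maybe_sarcastic(text: str) -> bool:
    lowered = text.lower()
    return any(cue in lowered for cue in SARCASM_CUES)

print(maybe_sarcastic("Yeah right, that's going to work."))  # True
print(maybe_sarcastic("That plan should work."))             # False
```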
4. Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.
 Lexical ambiguity: a single word that could be used as a verb, noun, or adjective (e.g., “watch”).
 Syntactic ambiguity: This kind of ambiguity occurs when a sentence is parsed in different
ways. For example, the sentence “The man saw the girl with the telescope”. It is ambiguous
whether the man saw the girl carrying a telescope or he saw her through his telescope.
 Anaphoric ambiguity: This kind of ambiguity arises from the use of anaphora in discourse.
For example: “The horse ran up the hill. It was very steep. It soon got tired.” The anaphoric
reference of “it” differs in the two sentences (“it” points to the hill in the first and to the horse
in the second), which causes ambiguity.
 Pragmatic ambiguity: This kind of ambiguity refers to a situation where the context of a
phrase gives it multiple interpretations. In simple words, pragmatic ambiguity arises when the
statement is not specific. For example, the sentence “I like you too” can have multiple
interpretations: I like you (just like you like me), or I like you (just like someone else does).
Even for humans, such a sentence alone is difficult to interpret without the context of
surrounding text. POS (part-of-speech) tagging is one NLP solution that can help solve the
problem, somewhat, as the sketch below shows.
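A minimal sketch of POS tagging resolving lexical ambiguity with NLTK (taggers can still err on genuinely ambiguous sentences; the two examples are invented):

```python
# POS tags make the noun/verb ambiguity of "watch" explicit.
# Assumes NLTK's punkt and perceptron-tagger models can be downloaded.
import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

print(pos_tag(word_tokenize("I watch the game every week.")))   # watch -> verb
print(pos_tag(word_tokenize("My watch stopped this morning.")))  # watch -> noun
```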
5. Errors in text or speech
Misspelled or misused words can create problems for text analysis. Spelling mistakes can occur
for a variety of reasons, from typing errors to extra spaces between letters or missing letters.
Autocorrect and grammar correction applications can handle common mistakes, but don’t always
understand the writer’s intention.
For example, if the misspelled word is “speling,” the system will find the correct word:
“spelling.”
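A minimal sketch of dictionary-based correction using only Python's standard library (the vocabulary here is a toy stand-in for a real lexicon):

```python
# difflib suggests the closest match from a known vocabulary, but, as noted
# above, it has no notion of the writer's actual intention.
import difflib

vocabulary = ["spelling", "speaking", "spewing", "sibling"]
print(difflib.get_close_matches("speling", vocabulary, n=1))  # ['spelling']
```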
6. Colloquialisms and slang
Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems
for NLP – especially for models intended for broad use. Unlike formal language, colloquialisms
may have no “dictionary definition” at all, and these expressions may even have different
meanings in different geographic areas. Furthermore, cultural slang is constantly morphing and
expanding, so new words pop up every day.
This is where training and regularly updating custom models can be helpful, although it
oftentimes requires quite a lot of data.
7. Domain-specific language
Different businesses and industries often use very different language. An NLP model needed for
healthcare, for example, would be very different from one used to process legal documents.
These days there are a number of analysis tools trained for specific fields, but extremely niche
industries may need to build or train their own models.
8. Low-resource languages
NLP applications have largely been built for the most common, widely used languages.
However, many languages, especially those spoken by people with less access to technology,
often go overlooked and under-processed. For example, by some estimates (depending on where
one draws the line between language and dialect) there are over 3,000 languages in Africa alone.
There simply isn’t very much data on many of these languages. However, new techniques,
like multilingual transformers (such as Google’s BERT, “Bidirectional Encoder Representations
from Transformers”) and multilingual sentence embeddings, aim to identify and leverage
universal similarities that exist between languages.
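As a hedged illustration of multilingual sentence embeddings (the library and model name below are assumptions about what is installed, not something the text prescribes):

```python
# Embedding sentences from different languages into one vector space with
# the sentence-transformers package.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
sentences = ["The weather is nice today.",   # English
             "Il fait beau aujourd'hui."]    # French, same meaning

embeddings = model.encode(sentences)
# Semantically similar sentences land close together across languages.
print(util.cos_sim(embeddings[0], embeddings[1]))
```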
9. Lack of research and development
Machine learning requires a lot of data to perform at its best: billions of pieces of
training data. The more data NLP models are trained on, the smarter they become. That said, data
(and human language!) is only growing by the day, as are new machine learning techniques and
custom algorithms. All of the problems above will require more research and new techniques in
order to be solved.
Language Modeling
A language model is the core component of modern Natural Language Processing (NLP). It’s a
statistical model that is designed to analyze the pattern of human language and predict the
likelihood of a sequence of words or tokens.
NLP-based applications use language models for a variety of tasks, such as audio to text
conversion, speech recognition, sentiment analysis, summarization, spell correction, etc.
Let’s understand how language models help in processing these NLP tasks:
-Speech Recognition: Smart speakers, such as Alexa, use automatic speech recognition (ASR)
mechanisms for translating speech into text. During this translation, the ASR mechanism
analyzes the intent/sentiment of the user by differentiating between words, for example by
disambiguating homophone phrases such as “Let her” vs. “Letter” and “But her” vs. “Butter”.
-Machine Translation: When translating a Chinese phrase “我在吃” into English, the translator
can give several choices as output:
I am eating
Me am eating
Eating am I
Here, the language model determines that the translation “I am eating” sounds most natural and
suggests it as the output.
How does a Language Model Work?
Language Models determine the probability of the next word by analyzing the text in data.
These models interpret the data by feeding it through algorithms.
The algorithms are responsible for creating rules for the context in natural language. The models
are prepared for the prediction of words by learning the features and characteristics of a
language. With this learning, the model prepares itself for understanding phrases and predicting
the next words in sentences.
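As a toy illustration of this idea (invented corpus, bigram counts only):

```python
# Count which word follows which in a tiny corpus, then predict the most
# likely next word. A toy stand-in for "learning the features of a language".
from collections import Counter, defaultdict

corpus = "i am eating . i am eating . you are reading".split()

follow = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev][nxt] += 1

def predict_next(word):
    counts = follow[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("am"))  # 'eating' ("am" is followed by "eating" twice)
```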

For training a language model, a number of probabilistic approaches are used. These approaches
vary on the basis of the purpose for which a language model is created. The amount of text data
to be analyzed and the math applied for analysis makes a difference in the approach followed for
creating and training a language model.

For example, a language model used for predicting the next word in a search query will be
absolutely different from those used in predicting the next word in a long document (such as
Google Docs). The approach followed to train the model would be unique in both cases.

What is statistical language modeling in NLP?


Statistical Language Modeling (also called Language Modeling, or LM for short) is the
development of probabilistic models that can predict the next word in a sequence given the
words that precede it.
A statistical language model learns the probability of word occurrence based on examples of
text. Simpler models may look at a context of a short sequence of words, whereas larger models
may work at the level of sentences or paragraphs. Most commonly, language models operate at
the level of words.
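For reference, a LaTeX sketch of the usual chain-rule factorization behind this (the formula is standard background, not stated explicitly in the text above):

```latex
% Chain-rule factorisation of a word-sequence probability; an n-gram model
% approximates each conditional by truncating the history to n-1 words.
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
                 \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```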

You could develop a language model and use it standalone for purposes like generating new
sequences of text that appear to have come from the training corpus.
Language modeling is a core problem for a rather wide range of natural language
processing tasks. Language models are generally used on the front-end or back-end of a more
sophisticated model for a task that needs language understanding.

What are the types of statistical language models?


Statistical models include the development of probabilistic models that are able to predict the
next word in the sequence, given the words that precede it. A number of statistical language
models are in use already. Let’s take a look at some of those popular models:
1. N-Gram
This is one of the simplest approaches to language modelling. Here, a probability distribution is
created over sequences of ‘n’ items, where ‘n’ can be any number and defines the size of the
gram (the sequence of words being assigned a probability). If n=4, a gram may look like: “can
you help me”. Basically, ‘n’ is the amount of context that the model is trained to consider. There
are different types of N-Gram models such as unigrams, bigrams, trigrams, etc., as in the sketch
below.
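A short sketch of n-gram counting with maximum-likelihood probabilities (toy corpus, bigram case):

```python
# Maximum-likelihood n-gram probabilities:
# P(w | history) = count(history + w) / count(history).
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "can you help me can you hear me".split()
bigrams, unigrams = Counter(ngrams(tokens, 2)), Counter(tokens)

def mle(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(mle("you", "can"))   # 1.0  ("can" is always followed by "you")
print(mle("help", "you"))  # 0.5  ("you help" once, "you hear" once)
```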
2. Exponential
This type of statistical model evaluates text using an equation that combines n-grams and
feature functions. Here the features and parameters of the desired results are specified in
advance. The model is based on the principle of maximum entropy, which states that the
probability distribution with the most entropy (consistent with the observed features) is the best
choice. Exponential models make fewer statistical assumptions, which means the chances of
obtaining accurate results are higher.
3. Continuous Space
In this type of statistical model, words are represented as a non-linear combination of weights in
a neural network. The process of assigning a weight vector to a word is known as word
embedding. This type of model proves helpful in scenarios where the data set of words continues
to grow and to include unique words.
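A hedged sketch of learning word embeddings with gensim's Word2Vec (an assumption: the gensim 4.x API; a corpus this small yields meaningless vectors and only demonstrates the mechanics):

```python
# Train tiny word embeddings; each word becomes a dense vector whose
# position is learned by a shallow neural network.
from gensim.models import Word2Vec

sentences = [["white", "horse", "runs"],
             ["black", "horse", "sleeps"],
             ["white", "cat", "sleeps"]]

model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50)
print(model.wv["horse"][:4])                  # first few embedding dimensions
print(model.wv.similarity("white", "black"))  # cosine similarity of two words
```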
What are the applications of statistical language modeling?
Statistical language models are used to generate text in many similar natural language processing
tasks, such as:
1. Speech Recognition: Voice assistants such as Siri and Alexa are examples of how
language models help machines in processing speech audio.

2. Machine Translation: Google Translator and Microsoft Translate are examples of how
NLP models can help in translating one language to another.

3. Sentiment Analysis: This helps in analyzing sentiments behind a phrase. This use case
of NLP models is used in products that allow businesses to understand a customer’s
intent behind opinions or attitudes expressed in the text. Hubspot’s Service Hub is an
example of how language models can help in sentiment analysis.

4. Text Suggestions: Google services such as Gmail or Google Docs use language models
to help users get text suggestions while they compose an email or create long text
documents, respectively.

5. Parsing Tools: Parsing involves analyzing sentences or words that comply with syntax
or grammar rules. Spell checking tools are perfect examples of language modelling and
parsing.

Language models are also used to generate text in other similar language processing tasks like
optical character recognition, handwriting recognition, image captioning, etc.

What are the drawbacks of statistical language modeling?


1. Zero probabilities
Suppose we have a trigram language model that conditions on the previous two words and has a
vocabulary of 10,000 words. Then there are 10¹² possible triplets. If our training data contains
10¹⁰ words, many triples will never be observed in the training data, and so the basic MLE
(Maximum Likelihood Estimate) will assign zero probability to those events. A zero probability
translates to infinite perplexity. To overcome this issue, many techniques have been developed
under the family of smoothing techniques, as in the sketch below.
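A minimal sketch of the simplest member of that family, add-one (Laplace) smoothing (toy corpus; real systems use more refined schemes such as Kneser-Ney):

```python
# Add-one smoothing: every n-gram gets a pseudo-count of 1, so unseen
# events no longer receive zero probability.
from collections import Counter

tokens = "the white horse ran past the black dog".split()
vocab = set(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def laplace(word, prev):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(laplace("horse", "white"))  # seen bigram: count boosted by 1
print(laplace("horse", "black"))  # unseen bigram: small but nonzero
```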

2. Exponential Growth
The second challenge is that the number of possible n-grams grows exponentially: it is the
vocabulary size raised to the nth power. A 10,000-word vocabulary has 10¹² possible trigrams,
and a 100,000-word vocabulary has 10¹⁵.
3. Generalization
The last issue with MLE techniques is the lack of generalization. If the model sees the term
‘white horse’ in the training data but never sees ‘black horse’, the MLE will assign zero
probability to ‘black horse’, however plausible the phrase may be.
