
NLP - UNIT 1

WHAT IS NLP
NLP stands for Natural Language Processing, a field at the intersection of Computer
Science, Linguistics, and Artificial Intelligence.

It is the technology used by machines to understand, analyse, manipulate, and
interpret human language.

It helps developers organize knowledge for performing tasks such as translation,
automatic summarization, Named Entity Recognition (NER), speech recognition,
relationship extraction, and topic segmentation.
HISTORY OF NLP
(1940-1960) - Focused on Machine Translation (MT)

Research in Natural Language Processing began in the 1940s.

1948 - The first recognisable NLP application was introduced at Birkbeck College,
London.

1950s - There was a conflicting view between linguistics and computer science. Chomsky
published his first book, Syntactic Structures, and claimed that language is generative
in nature.

1957 - Chomsky also introduced the idea of Generative Grammar, which consists of
rule-based descriptions of syntactic structures.
HISTORY OF NLP
(1960-1980) - Flavored with Artificial Intelligence (AI)

From 1960 to 1980, the key developments were:

Augmented Transition Networks (ATN)

An Augmented Transition Network is an extension of the finite-state machine formalism
that can recognise more complex sentence structures than plain finite-state automata.

Case Grammar

Case Grammar was developed by the linguist Charles J. Fillmore in 1968.

Case Grammar uses languages such as English to express the relationship between nouns
and verbs, relationships that are often signalled by prepositions.

In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.

For example: "Neha broke the mirror with the hammer". In this example, Case Grammar
identifies Neha as the agent, the mirror as the theme, and the hammer as the instrument.
HISTORY OF NLP
From 1960 to 1980, key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It allowed users to converse
with the computer in natural language and to move objects around a simulated blocks world.
LUNAR
LUNAR is the classic example of a natural language database interface system; it used
ATNs and Woods' Procedural Semantics.
It was capable of translating elaborate natural language expressions into database queries
and handled 78% of requests without errors.
HISTORY OF NLP
1980 - Current

Until 1980, natural language processing systems were based on complex sets of
hand-written rules. After 1980, NLP began to adopt machine learning algorithms for
language processing.

In the early 1990s, NLP grew rapidly and achieved good accuracy, especially for English
grammar. Around 1990, large electronic text collections were also introduced, which
provided a good resource for training and evaluating natural language programs.

Modern NLP consists of various applications such as speech recognition, machine
translation, and machine reading of text. Combining these applications allows an
artificial intelligence to gain knowledge of the world. Consider the example of Amazon
Alexa: you can ask Alexa a question, and it will reply to you.
ADVANTAGES OF NLP
○ NLP helps us to analyse data from both structured and unstructured sources.
○ NLP is very fast and time efficient.
○ NLP offers end-to-end, exact answers to a question, saving the time that would otherwise
be spent on unnecessary and unwanted information.
○ NLP allows users to ask questions about any subject and get a direct response within
milliseconds.
○ Many companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.


DISADVANTAGES OF NLP
○ Training an NLP model requires a lot of data and computation.
○ Many issues arise for NLP when dealing with informal expressions, idioms, and cultural jargon.
○ NLP results are sometimes inaccurate, and accuracy is directly proportional to the quality of
the training data.
○ NLP systems are typically designed for a single, narrow job; they cannot easily adapt to new
domains and have limited functionality.
○ NLP behaviour can be unpredictable.
○ NLP may require more keystrokes.
IDIOMS

A particular type of idiom, called a phrasal verb, consists of a verb followed by an adverb or
preposition (or sometimes both).

Ex: make over, make out, and make up.

Notice how the meanings have nothing to do with the usual meanings of over, out, and up.

JARGON

Jargon is occupation-specific language used by people in a given profession, the "shorthand"
that people in the same profession use to communicate with each other.

For example, plumbers might use terms such as elbow, ABS, sweating the pipes, reducer,
flapper, snake, and rough-in.
APPLICATIONS OF NLP
● Text and speech processing, e.g. voice assistants such as Alexa and Siri
● Text classification, e.g. Grammarly, Microsoft Word, and Google Docs
● Information extraction, e.g. search engines such as DuckDuckGo and Google
● Chatbots and question answering, e.g. website bots
● Language translation, e.g. Google Translate
● Text summarization
Why NLP is difficult?
NLP is difficult because ambiguity and uncertainty exist in language.
Ambiguity
There are three main types of ambiguity:

○ Lexical Ambiguity
○ Syntactic Ambiguity
○ Referential Ambiguity


○ Lexical Ambiguity

Lexical Ambiguity exists when a single word in a sentence has two or more possible
meanings.

Example:

Manya is looking for a match.

In the above example, the word match could mean that Manya is looking for a partner,
or that Manya is looking for a match in the sense of a game (a cricket match, for
instance).
○ Syntactic Ambiguity

Syntactic Ambiguity exists when the structure of a sentence allows two or more possible
meanings.

Example:

I saw the girl with the binoculars.

In the above example, did I have the binoculars, or did the girl have the binoculars?
○ Referential Ambiguity

Referential Ambiguity exists when it is unclear what a pronoun refers to.

Example: Kavya went to Sunita. She said, "I am hungry."

In the above sentence, you do not know who is hungry, Kavya or Sunita.
Phases of Natural Language Processing
1. Lexical and Morphological Analysis
The first phase of NLP is lexical analysis. This
phase scans the source text as a stream of
characters and converts it into meaningful lexemes.
It divides the whole text into paragraphs, sentences,
and words.
2. Syntactic Analysis (Parsing)
Syntactic analysis is used to check grammar and word
arrangement, and shows the relationships among the
words.
Example: Agra goes to the Poonam
In the real world, "Agra goes to the Poonam" does not make any sense, so
this sentence is rejected by the syntactic analyzer.
3. Semantic Analysis
Semantic analysis is concerned with meaning
representation. It mainly focuses on the literal
meaning of words, phrases, and sentences.
The semantic analyzer disregards sentences such as
"hot ice-cream".
4. Discourse Integration
Discourse integration means that the meaning of a sentence depends upon the
sentences that precede it, and it may also affect the meaning of the
sentences that follow it.
5. Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps
you to discover the intended effect by applying a set
of rules that characterize cooperative dialogues.
For example: "Open the door" is interpreted as a
request instead of an order.
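The first two phases above can be illustrated with NLTK. This is a minimal sketch, assuming NLTK is installed and the required resources (e.g. punkt and the averaged perceptron tagger) have been downloaded; the sample text is only illustrative.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time setup (assumption: these resource names match your NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

text = "Agra goes to the Poonam. Open the door."

# Lexical analysis: split the text into sentences and then into words
sentences = sent_tokenize(text)
tokens = [word_tokenize(s) for s in sentences]

# A first step towards syntactic analysis: part-of-speech tags for each token
tagged = [nltk.pos_tag(t) for t in tokens]
print(tagged)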
WORDS AND THEIR COMPONENTS

1. TOKENS

2. LEXEMES

3. MORPHEMES

4. TYPOLOGY
TOKENS

Tokens are typically words or sub-words in the context of natural language
processing.
Tokenization is the process of dividing a text, sentence, or phrase into smaller
units known as tokens.
These tokens can encompass words, dates, punctuation marks, or even
fragments of words.
TYPES OF TOKENIZATION

Tokenization can be classified into several types based on how the text is segmented. Here
are some types of tokenization:

Word Tokenization:

Word tokenization divides the text into individual words. Many NLP tasks use this
approach, in which words are treated as the basic units of meaning.
Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task",
"."]
Sentence Tokenization:

The text is segmented into sentences during sentence tokenization.
This is useful for tasks requiring individual sentence analysis or
processing.
Example:
Input: "Tokenization is an important NLP task. It helps break
down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps
break down text into smaller units."]
Subword Tokenization:

Subword tokenization entails breaking down words into smaller units,
which can be especially useful when dealing with morphologically rich
languages or rare words.
Example:
Input: "tokenization"
Output: ["token", "ization"]
Character Tokenization:

This process divides the text into individual characters. This can be
useful for modelling character-level language.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a",
"t", "i", "o", "n"]
Implementation of Tokenization using Python 3
Sentence Tokenization using sent_tokenize

from nltk.tokenize import sent_tokenize

# Requires the punkt sentence tokenizer models: nltk.download('punkt')
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)

Output:
['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article.']

Word Tokenization using word_tokenize

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)

Output:
['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']
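Character tokenization needs no external library, and a subword split can be mimicked for illustration. This is a minimal sketch; the subword split shown is hand-crafted for the example and is not the output of a trained subword tokenizer.

text = "Tokenization"

# Character tokenization: every character becomes a token
char_tokens = list(text)
print(char_tokens)

# Purely illustrative subword split into a stem and a suffix
subword_tokens = [text[:5].lower(), text[5:]]   # ['token', 'ization']
print(subword_tokens)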
Issues and Challenges
● Irregularity: word forms are not described by a prototypical
linguistic model.

● Ambiguity: word forms can be understood in multiple ways out of
the context of their discourse.

● Productivity: is the inventory of words in a language finite, or is
it unlimited?
The 10 Biggest Issues for NLP

1. Language differences

In the United States, most people speak English, but if you’re
thinking of reaching an international and/or multicultural audience,
you’ll need to provide support for multiple languages.
2. Training data

● The best AI must also spend a significant amount of time
reading, listening to, and utilizing a language.

● The abilities of an NLP system depend on the training data
provided to it. If you feed the system bad or questionable data,
it’s going to learn the wrong things, or learn in an inefficient
way.
4. Phrasing ambiguities

● Sometimes it’s hard even for another human being to parse out what
someone means when they say something ambiguous. There may not
be a clear concise meaning to be found in a strict analysis of their
words. In order to resolve this, an NLP system must be able to seek
context to help it understand the phrasing. It may also need to ask the
user for clarity.
5. Misspellings

Misspellings are a simple problem for human beings. We can easily
associate a misspelled word with its properly spelled counterpart, and
seamlessly understand the rest of the sentence in which it’s used. But
for a machine, misspellings can be harder to identify. You’ll need to
use an NLP tool with capabilities to recognize common misspellings
of words, and move beyond them.
6. Innate biases

In some cases, NLP tools can carry the biases of their programmers,
as well as biases within the data sets used to train them. Depending
on the application, an NLP could exploit and/or reinforce certain
societal biases, or may provide a better experience to certain types
of users over others. It’s challenging to make a system that works
equally well in all situations, with all people.
7. Words with multiple meanings

No language is perfect, and most languages have words that
have multiple meanings. For example, a user who asks, “how
are you” has a totally different goal than a user who asks
something like “how do I add a new credit card?” Good NLP
tools should be able to differentiate between these phrases with
the help of context.
8. Phrases with multiple intentions

Some phrases and questions actually have multiple intentions, so
your NLP system can’t oversimplify the situation by interpreting only
one of those intentions.

For example, a user may prompt your chatbot with something like, “I
need to cancel my previous order and update my card on file.” Your
AI needs to be able to distinguish these intentions separately.
9. False positives and uncertainty

A false positive occurs when an NLP system flags a phrase that should be
understandable and/or addressable, but that it cannot sufficiently answer.

The solution here is to develop an NLP system that can recognize its
own limitations, and use questions or prompts to clear up the ambiguity.
10. Keeping a conversation moving

Many modern NLP applications are built on dialogue between a
human and a machine.

Accordingly, your NLP AI needs to be able to keep the conversation
moving, providing additional questions to collect more information and
always pointing toward a solution.
Morphology
∙ Morphology is the domain of linguistics that
analyses the internal structure of words.
∙ Morphological analysis – exploring the structure of words
∙ Words are built up of minimal meaningful elements called morphemes:
played = play-ed
cats = cat-s
unfriendly = un-friend-ly
∙ Two types of morphemes:
i. Stems: play, cat, friend
ii. Affixes: -ed, -s, un-, -ly
∙ Two main types of affixes:
i. Prefixes precede the stem: un-
ii. Suffixes follow the stem: -ed, -s, -ly
∙ Stemming = finding the stem by stripping off affixes:
play = play
replayed = re-play-ed
computerized = comput-er-ize-d
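Stemming can be tried quickly with NLTK's Porter stemmer. This is a minimal sketch; the exact stems produced depend on the Porter algorithm's heuristic rules, and the word list is only illustrative.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["played", "replayed", "cats", "computerized", "unfriendly"]:
    # Strip affixes heuristically to obtain an approximate stem
    print(word, "->", stemmer.stem(word))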
Problems in morphological processing
∙ Inflectional morphology: inflected forms are constructed from base forms and
inflectional affixes.
∙ Inflection relates different forms of the same word, for example:

Lemma    Singular    Plural
cat      cat         cats
dog      dog         dogs
knife    knife       knives
sheep    sheep       sheep
mouse    mouse       mice
∙ Derivational morphology: words are constructed from roots (or stems)
and derivational affixes:
inter + national = international
international + ize = internationalize
internationalize + ation = internationalization
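Inflection can be undone with a lemmatizer. This is a minimal sketch using NLTK's WordNet lemmatizer, assuming the wordnet corpus has been downloaded; the word list mirrors the table above.

from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
for word in ["cats", "dogs", "knives", "sheep", "mice"]:
    # pos="n" tells the lemmatizer to treat the word as a noun
    print(word, "->", lemmatizer.lemmatize(word, pos="n"))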
Morphological Models

1. Dictionary Lookup

2. Finite-State Morphology

3. Unification-Based Morphology

4. Functional Morphology
Dictionary Lookup
∙ Morphological parsing is a process by which word forms of a language
are associated with corresponding linguistic descriptions.
∙ Morphological systems that specify these associations by merely enumerating
them case by case (that is, listing every word form and its description one
after another) do not offer any means of generalization.
∙ These approaches do not allow the development of reusable morphological
rules.
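A dictionary-lookup morphological analyser can be sketched as a plain Python mapping. This is only an illustrative sketch; the word list and tag strings are invented for the example.

# Every word form is enumerated explicitly; nothing generalizes to unseen words
LEXICON = {
    "cat":   "cat +N +SG",
    "cats":  "cat +N +PL",
    "goose": "goose +N +SG",
    "geese": "goose +N +PL",
}

def lookup(word):
    # Unknown forms simply fail, which is the main weakness of pure lookup
    return LEXICON.get(word.lower(), "unknown")

print(lookup("Cats"))   # cat +N +PL
print(lookup("dogs"))   # unknown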
Finite-State Morphology
∙ By finite-state morphological models, we mean those in which the specifications
written by human programmers are directly compiled into finite-state
transducers.
∙ The two most popular tools supporting this approach are XFST (Xerox
Finite-State Tool) and LexTools.
∙ Finite-state transducers are computational devices extending the power of
finite-state automata.
∙ A theoretical limitation of finite-state models of morphology is the problem of
capturing reduplication of words or their elements (e.g., to express plurality)
found in several human languages.

Input      Morphologically parsed output
Cats       cat +N +PL
Cat        cat +N +SG
Cities     city +N +PL
Geese      goose +N +PL
Goose      (goose +N +SG) or (goose +V)
Gooses     goose +V +3SG
Merging    merge +V +PRES-PART
Caught     (catch +V +PAST-PART) or (catch +V +PAST)
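The rule-based flavour of this approach can be hinted at with a toy suffix-stripping analyser in Python. This is only a sketch of ordered rewrite rules, not a real finite-state transducer such as one compiled with XFST; the rules and tags are invented for the example and irregular forms like "geese" would be analysed incorrectly.

# Ordered (suffix, replacement, tags) rules, tried in sequence
RULES = [
    ("ies", "y", "+N +PL"),   # cities -> city
    ("s",   "",  "+N +PL"),   # cats   -> cat
    ("",    "",  "+N +SG"),   # default: unchanged singular
]

def analyse(word):
    w = word.lower()
    for suffix, replacement, tags in RULES:
        if w.endswith(suffix):
            return w[:len(w) - len(suffix)] + replacement + " " + tags
    return w

print(analyse("Cities"))  # city +N +PL
print(analyse("Cats"))    # cat +N +PL
print(analyse("Cat"))     # cat +N +SG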
Unification-Based Morphology

∙ The concepts and methods of these formalisms are often closely connected to those of logic
programming.

∙ In finite-state morphological models, both surface and lexical forms are by themselves
unstructured strings of atomic symbols.

∙ In higher-level approaches, linguistic information is expressed by more appropriate data
structures that can include complex values or can be recursively nested if needed.

∙ Advantages of this approach include better abstraction possibilities for developing a
morphological grammar as well as elimination of redundant information from it.

∙ Unification-based models have been implemented for Russian, Czech, and Slovene.
Functional Morphology
∙ Functional morphology defines its models using principles of functional
programming and type theory.

∙ It treats morphological operations and processes as pure mathematical
functions and organizes the linguistic as well as abstract elements of a
model into distinct types of values and type classes.

∙ Functional morphology implementations are intended to be reused as
programming libraries capable of handling the complete morphology of
a language and to be incorporated into various kinds of applications.
1.b. Finding Structure of Documents

● In human language, words and sentences do not appear randomly
but have structure.
● Automatic extraction of the structure of documents helps subsequent
NLP tasks.
● For example, parsing, machine translation, and semantic role labelling
use sentences as the basic processing unit.
Finding Structure of Documents

There are several approaches to finding the structure of documents in
NLP, including:

1. Rule-based methods

2. Machine learning methods

3. Hybrid methods
Finding Structure of Documents
Some of the specific techniques and tools used in finding the structure
of documents in NLP include:

1. Named entity recognition

2. Part-of-speech tagging

3. Dependency parsing

4. Topic modeling
Finding Structure of Documents
1. Named entity recognition: This technique identifies and extracts specific entities,
such as people, places, and organizations, from the document, which can help in
identifying the different sections and topics.

2. Part-of-speech tagging: This technique assigns a part-of-speech tag to each word
in the document, which can help in identifying the syntactic and semantic structure of
the text.

3. Dependency parsing: This technique analyzes the relationships between the
words in a sentence, and can be used to identify the different clauses and phrases in
the text (see the sketch after this list).

4. Topic modeling: This technique uses unsupervised learning algorithms to identify
the different topics and themes in the document, which can be used to organize the
content into different sections.
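Named entity recognition, part-of-speech tagging, and dependency parsing can all be tried with spaCy. This is a minimal sketch, assuming spaCy and its small English model en_core_web_sm are installed; the sample sentence is only illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Terry Winograd wrote SHRDLU at MIT in 1970.")

# Named entity recognition: extracted entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

# Part-of-speech tags and dependency relations for each token
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)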
Finding Structure of Documents
1. Sentence Boundary Detection
Sentence boundary detection is a subtask of finding the structure of documents
in NLP that involves identifying the boundaries between sentences in a document.
This is an important task, as it is a fundamental step in many NLP
applications, such as machine translation, text summarization, and
information retrieval.

2. Topic Boundary Detection

Topic boundary detection is another important subtask of finding the structure of
documents in NLP. It involves identifying the points in a document where the topic
or theme of the text shifts. This task is particularly useful for organizing and
summarizing large amounts of text, as it allows for the identification of different
topics or subtopics within a document.
1. Sentence Boundary Detection
Some of the specific techniques and tools used in sentence boundary
detection include:

1. Regular expressions: These are patterns that can be used to match specific
character sequences in a text, such as periods followed by whitespace
characters, and can be used to identify the end of a sentence (see the sketch
after this list).

2. Hidden Markov Models (HMMs): These are statistical models that can be
used to identify the most likely sequence of sentence boundaries in a text,
based on the probabilities of different sentence boundary markers.

3. Deep learning models: These are neural network models that can learn
complex patterns and features of sentence boundaries from a large corpus of
text, and can be used to achieve state-of-the-art performance in sentence
boundary detection.
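A very rough regular-expression sentence splitter is sketched below; real systems must handle abbreviations, decimal numbers, and quotations, which this sketch deliberately ignores.

import re

text = "Dr. Smith arrived. He sat down. Then he spoke!"

# Split after ., ! or ? when followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# Note: "Dr." is wrongly treated as a sentence end, illustrating the
# limitation of purely pattern-based approaches.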
2. Topic Boundary Detection
1. Lexical cohesion: This method looks at the patterns of words and phrases that appear
in a text, and identifies changes in the frequency or distribution of these patterns as
potential topic boundaries. For example, if the frequency of a particular keyword or
phrase drops off sharply after a certain point in the text, this could indicate a shift in
topic.

2. Discourse markers: This method looks at the use of discourse markers, such as
"however", "in contrast", and "furthermore", which are often used to signal a change in
topic or subtopic. By identifying these markers in a text, it is possible to locate potential
topic boundaries.

3. Machine learning: This method involves training a machine learning model to identify
patterns and features in a text that are associated with topic boundaries. This can
involve using a variety of linguistic and contextual features, such as sentence length,
word frequency, and part-of-speech tags, to identify potential topic boundaries.
2. Topic Boundary Detection
Some of the specific techniques and tools used in topic boundary detection include:

1. Latent Dirichlet Allocation (LDA): This is a probabilistic topic modeling technique that
can be used to identify topics within a corpus of text. By analyzing the distribution of
words within a text, LDA can identify the most likely topics and subtopics within the text,
and can be used to locate topic boundaries (see the sketch after this list).

2. TextTiling: This is a technique that involves breaking a text into smaller segments, or
"tiles", based on the frequency and distribution of key words and phrases. By comparing
the tiles to each other, it is possible to identify shifts in topic or subtopic, and locate
potential topic boundaries.

3. Coh-Metrix: This is a text analysis tool that uses a range of linguistic and
discourse-based features to identify different aspects of text complexity, including topic
boundaries. By analyzing the patterns of words, syntax, and discourse in a text,
Coh-Metrix can identify potential topic boundaries, as well as provide insights into the
overall structure and organization of the text.
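Topic modeling with LDA can be sketched with scikit-learn. This is a minimal sketch; the tiny corpus and the choice of two topics are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the match was won by the cricket team",
    "the batsman scored a century in the match",
    "the bank approved the loan application",
    "interest rates at the bank increased again",
]

# Bag-of-words counts for each document
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the most probable words per topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:]]
    print(f"Topic {i}: {top}")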
2. Methods used in NLP

There are several methods and techniques used in NLP to find the structure of
documents, which include:

1. Sentence boundary detection

2. Part-of-speech tagging

3. Named entity recognition

4. Coreference resolution

5. Topic boundary detection

6. Parsing

7. Sentiment analysis
2.1 Generative Sequence Classification Methods
Generative sequence classification methods are a type of NLP method used to find
the structure of documents. These methods involve using probabilistic models to
classify sequences of words into predefined categories or labels.

One popular generative sequence classification method is the Hidden Markov
Model (HMM).

HMMs are statistical models that can be used to classify sequences of words by
modeling the probability distribution of the observed words given a set of hidden
states.

The hidden states in an HMM can represent different linguistic features, such as
part-of-speech tags or named entities, and the model can be trained using labeled
data to learn the most likely sequence of hidden states for a given sequence of
words.
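An HMM sequence labeller can be trained with NLTK. This is a minimal sketch using the Penn Treebank sample as labeled data, assuming the treebank corpus has been downloaded; the size of the training slice and the test sentence are only illustrative.

from nltk.corpus import treebank
from nltk.tag import hmm

# One-time setup: nltk.download('treebank')
# Labeled sequences: sentences of (word, POS-tag) pairs
train_sents = treebank.tagged_sents()[:3000]

# Estimate HMM transition and emission probabilities from the labeled data
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

# Decode the most likely hidden state (tag) sequence for new words
print(tagger.tag("The cat sat on the mat".split()))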
2.1 Generative Sequence Classification Methods

Another widely used sequence classification method is Conditional Random
Fields (CRFs); strictly speaking, CRFs are discriminative rather than
generative models.

CRFs resemble HMMs in that they label whole sequences, but instead of
modeling the joint probability of words and labels they directly model the
conditional probability of a sequence of labels given a sequence of words,
and they are more flexible in that they can take into account more complex
features and dependencies between labels.
2.2 Discriminative Sequence Classification Methods:
Discriminative local classification methods are another type of NLP method used to
find the structure of documents. These methods involve training a model to classify
each individual word or token in a document based on its features and the context in
which it appears (a small per-token feature-extraction sketch appears at the end of
section 2.2).

One popular example of such a method is Conditional Random Fields (CRFs).

CRFs are discriminative models: they model the conditional probability of a sequence
of labels given a sequence of features, without making assumptions about the
underlying distribution of the data.

CRFs have been used for tasks such as named entity recognition, part-of-speech
tagging, and chunking.
2.2 Discriminative Sequence Classification Methods:

Another example of a discriminative local classification method is Maximum
Entropy Markov Models (MEMMs), which are similar to CRFs but use
maximum entropy modeling to make predictions about the next label in a
sequence given the current label and features.

MEMMs have been used for tasks such as speech recognition, named entity
recognition, and machine translation.

Other discriminative local classification methods include support vector
machines (SVMs), decision trees, and neural networks. These methods
have also been used for tasks such as sentiment analysis, topic
classification, and document categorization.
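The per-token classification idea can be sketched with scikit-learn: each word is turned into a feature dictionary and a logistic regression classifier predicts its label. This is only an illustrative sketch; the toy training data, labels, and features are invented for the example.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    # Features for the i-th token: the word itself, its shape, and its neighbours
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Tiny toy task: label each token as part of a PERSON name or not (O)
sentence = ["Neha", "broke", "the", "mirror", "with", "the", "hammer"]
labels   = ["PERSON", "O", "O", "O", "O", "O", "O"]

X = [token_features(sentence, i) for i in range(len(sentence))]
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X), labels)

# Classify each token of a new sentence independently
test = ["Kavya", "broke", "the", "window"]
print(clf.predict(vec.transform([token_features(test, i) for i in range(len(test))])))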
2.3 Hybrid Approaches:

Hybrid approaches to finding the structure of documents in NLP
combine multiple methods to achieve better results than any one
method alone.

For example, a hybrid approach might combine generative and
discriminative models, or combine different types of models with
different types of features.
3.Complexity of the Approaches:

Finding the structure of documents in natural language processing (NLP) can be a complex task, and
there are several approaches with varying degrees of complexity.

Here are a few examples:

1. Rule-based approaches: These approaches use a set of predefined rules to identify the structure of
a document. For instance, they might identify headings based on font size and style or look for bullet
points or numbered lists. While these approaches can be effective in some cases, they are often
limited in their ability to handle complex or ambiguous structures.

2. Statistical approaches: These approaches use machine learning algorithms to identify the structure
of a document based on patterns in the data. For instance, they might use a classifier to predict whether
a given sentence is a heading or a body paragraph. These approaches can be quite effective, but
they require large amounts of labeled data to train the model.

3. Deep learning approaches: These approaches use deep neural networks to learn the structure of a
document. For instance, they might use a hierarchical attention network to identify headings and
subheadings, or a sequence-to-sequence model to summarize the document. These approaches can
be very powerful, but they require even larger amounts of labeled data and significant
computational resources to train.
4.Performances of the Approaches:

The performance of different approaches for finding the structure of documents in natural
language processing (NLP) can vary depending on the specific task and the complexity of
the document. Here are some general trends:

1. Rule-based approaches: These approaches can be effective when the document
structure is relatively simple and the rules are well-defined. However, they can struggle with
more complex or ambiguous structures, and require a lot of manual effort to define the
rules.

2. Statistical approaches: These approaches can be quite effective when there is a large
amount of labeled data available for training, and the document structure is relatively
consistent across examples. However, they may struggle with identifying new or unusual
structures that are not well-represented in the training data.

3. Deep learning approaches: These approaches can be very effective in identifying
complex and ambiguous document structures, and can even discover new structures that
were not present in the training data. However, they require large amounts of labeled data
and significant computational resources to train, and can be difficult to interpret.
