NLP Unit 1
WHAT IS NLP
NLP stands for Natural Language Processing, a field at the intersection of computer
science, human language (linguistics), and artificial intelligence.
1948 - The first recognisable NLP application was developed at Birkbeck College, London.
1950s - During the 1950s, there were conflicting views between linguistics and computer science.
Chomsky published his first book, Syntactic Structures, and claimed that language is
generative in nature.
1957 - In 1957, Chomsky also introduced the idea of Generative Grammar, a rule-based
description of syntactic structures.
HISTORY OF NLP
(1960-1980) - Flavored with Artificial Intelligence (AI)
An Augmented Transition Network (ATN) is a finite-state machine extended with registers and
recursive calls, which allows it to recognize more than plain regular languages.
Case Grammar
Case Grammar was developed by Linguist Charles J. Fillmore in the year 1968.
Case Grammar describes how languages such as English express the relationship between nouns
and verbs through prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example case grammar identify Neha
as an agent, mirror as a theme, and hammer as an instrument.
HISTORY OF NLP
Between 1960 and 1980, the key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It allowed users to communicate
with the computer in English and to move objects in a simulated blocks world.
LUNAR
LUNAR is the classic example of a natural language database interface system; it used
ATNs and Woods' Procedural Semantics.
It was capable of translating elaborate natural language expressions into database queries
and handled 78% of requests without errors.
HISTORY OF NLP
1980 - Current
Until 1980, natural language processing systems were based on complex sets of
hand-written rules. After 1980, NLP introduced machine learning algorithms for language
processing.
In the early 1990s, NLP started growing faster and achieved good accuracy, especially for
English grammar. Around 1990, large electronic text collections were also introduced, which
provided a good resource for training and evaluating natural language programs.
Modern NLP consists of various applications, such as speech recognition, machine translation,
and machine reading. When all of these applications are combined, they allow artificial
intelligence to gain knowledge of the world. Consider the example of Amazon Alexa: you can ask
Alexa a question, and it will reply to you.
ADVANTAGES OF NLP
○ NLP helps us to analyse data from both structured and unstructured sources.
○ NLP is very fast and time efficient.
○ NLP offers exact, end-to-end answers to a question, saving the time that would otherwise be
spent on unnecessary and unwanted information.
○ NLP lets users ask questions about any subject and receive a direct response within
milliseconds.
○ Many companies use NLP to improve the efficiency and accuracy of documentation processes
and to extract information from large databases.
DISADVANTAGES OF NLP
○ Training an NLP model requires a lot of data and computation.
○ Many issues arise for NLP when dealing with informal expressions, idioms, and cultural jargon.
○ NLP results are not always accurate, and accuracy is directly proportional to the quality of
the data.
○ NLP systems are designed for a single, narrow job; they cannot adapt to new domains and have
limited functionality.
○ NLP can be unpredictable.
○ NLP may require more keystrokes.
IDIOMS
An idiom is an expression whose meaning cannot be worked out from the usual meanings of its
individual words.
A particular type of idiom, called a phrasal verb, consists of a verb followed by an adverb or
preposition (or sometimes both).
Ex: make over, make out, and make up. Notice how their meanings have nothing to do with the
usual meanings of over, out, and up.
JARGON
Jargon is specialised vocabulary used within a particular trade or profession, which can be
hard for outsiders to understand.
For example, plumbers might use terms such as elbow, ABS, sweating the pipes, reducer,
flapper, snake, and rough-in.
APPLICATIONS OF NLP
● Text and speech processing, e.g. voice assistants such as Alexa and Siri
● Text classification, e.g. Grammarly, Microsoft Word, and Google Docs
● Information extraction, e.g. search engines such as DuckDuckGo and Google
● Chatbots and question answering, e.g. website bots
● Language translation, e.g. Google Translate
● Text summarization
Why is NLP difficult?
NLP is difficult because ambiguity and uncertainty exist in language.
Ambiguity
There are the following three types of ambiguity:
○ Lexical Ambiguity
○ Syntactic Ambiguity
○ Referential Ambiguity
○ Lexical Ambiguity
Example: "Manya is looking for a match."
In this example, the word match could mean either that Manya is looking for a
partner or that Manya is looking for a match (a cricket match or another game). A
small WordNet sketch after these examples shows this kind of ambiguity in code.
○ Syntactic Ambiguity
Example: "I saw the girl with the binoculars."
In this example, it is not clear whether I had the binoculars or the girl had the
binoculars.
○ Referential Ambiguity
Example: "Kavya went to Sunita. She said, 'I am hungry.'"
In this sentence, we do not know who is hungry, Kavya or Sunita.
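The lexical ambiguity of a word like "match" can be seen directly in a lexical database. The
following minimal Python sketch (assuming NLTK is installed and the WordNet data has been
downloaded with nltk.download('wordnet')) lists the different senses WordNet records for "match":

# Minimal sketch: list the WordNet senses of the ambiguous word "match".
# Assumes: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet

for synset in wordnet.synsets("match"):
    print(synset.name(), "-", synset.definition())
# Several senses are printed (a contest, a suitable partner, a small stick
# for lighting fires, ...) - one surface word maps to many meanings, which
# is exactly what lexical ambiguity means.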
Phases of Natural Language Processing
1. Lexical and Morphological Analysis
The first phase of NLP is lexical analysis. This phase scans the source text as a stream of
characters and converts it into meaningful lexemes. It divides the whole text into paragraphs,
sentences, and words.
2. Syntactic Analysis (Parsing)
Syntactic analysis is used to check grammar and word arrangement, and to show the
relationships among the words.
Example: "Agra goes to the Poonam"
In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is
rejected by the syntactic analyzer.
3. Semantic Analysis
Semantic analysis is concerned with the meaning
representation. It mainly focuses on the literal
meaning of words, phrases, and sentences.
The semantic analyzer disregards sentences such as "hot ice-cream".
4. Discourse Integration
Discourse integration means that the meaning of a sentence depends upon the sentences that
precede it and also influences the meaning of the sentences that follow it.
5. Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended
effect by applying a set of rules that characterize cooperative dialogues.
For Example: "Open the door" is interpreted as a
request instead of an order.
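As a rough illustration of how some of these phases show up in practice, the sketch below runs a
sentence through spaCy and prints, for each token, the information used by lexical analysis (the
token itself) and by syntactic analysis (its part of speech and dependency relation). This is only
a sketch, assuming spaCy and its small English model en_core_web_sm are installed; it does not
cover semantic, discourse, or pragmatic analysis.

# Sketch: one sentence through a spaCy pipeline.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Open the door.")

for token in doc:
    # token.text -> the lexeme found by lexical analysis
    # token.pos_ -> its part of speech
    # token.dep_ / token.head -> its syntactic relation to another word
    print(token.text, token.pos_, token.dep_, token.head.text)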
WORDS AND THEIR COMPONENTS
1.TOKENS
2.LEXEMES
3.MORPHEMES
4.TYPOLOGY
TOKENS
Tokenization can be classified into several types based on how the text is segmented. Here
are some types of tokenization:
Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use this
approach, in which words are treated as the basic units of meaning.
Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task",
"."]
Character Tokenization:
This process divides the text into individual characters. This can be
useful for character-level language modelling.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a",
"t", "i", "o", "n"]
Implementation for Tokenization using Python3
Sentence Tokenization using sent_tokenize (requires NLTK and its 'punkt' data: nltk.download('punkt'))
from nltk.tokenize import sent_tokenize, word_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)
Output:
['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article.']
Word Tokenization using word_tokenize
word_tokenize(text)
Output:
['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.', 'You', 'are', 'studying', 'NLP', 'article', '.']
Issues and Challenges
● Irregularity: word forms are not described by a prototypical linguistic model.
● Productivity: is the inventory of words in a language finite, or is it unlimited?
The 10 Biggest Issues for NLP
1. Language differences
● Sometimes it’s hard even for another human being to parse out what
someone means when they say something ambiguous. There may not
be a clear concise meaning to be found in a strict analysis of their
words. In order to resolve this, an NLP system must be able to seek
context to help it understand the phrasing. It may also need to ask the
user for clarity.
5. Misspellings
In some cases, NLP tools can carry the biases of their programmers,
as well as biases within the data sets used to train them. Depending
on the application, an NLP could exploit and/or reinforce certain
societal biases, or may provide a better experience to certain types
of users over others. It’s challenging to make a system that works
equally well in all situations, with all people.
7. Words with multiple meanings
For example, a user may prompt your chatbot with something like, “I
need to cancel my previous order and update my card on file.” Your
AI needs to be able to distinguish these intentions separately.
9. False positives and uncertainty
The solution here is to develop an NLP system that can recognize its
own limitations, and use questions or prompts to clear up the ambiguity.
10. Keeping a conversation moving
Morphological Models
1. Dictionary Lookup
2. Finite-State Morphology
3. Unification-Based Morphology
4. Functional Morphology
Dictionary Lookup
∙ Morphological parsing is a process by which word forms of a language are associated with
corresponding linguistic descriptions.
∙ Morphological systems that specify these associations by merely enumerating them case by
case (that is, simply listing them one after another) do not offer any means of generalization.
∙ These approaches do not allow the development of reusable morphological rules.
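A minimal Python sketch of the dictionary-lookup idea is shown below; the tiny lexicon is
hypothetical and only illustrates that every word form must be listed explicitly, so nothing
is gained for unlisted forms.

# Toy dictionary-lookup morphology: each word form is enumerated by hand.
LEXICON = {
    "cats":   ("cat",   "+N +PL"),
    "cat":    ("cat",   "+N +SG"),
    "geese":  ("goose", "+N +PL"),
    "caught": ("catch", "+V +PAST"),
}

def analyze(word):
    # Return (lemma, tags) if the form is listed, otherwise None.
    return LEXICON.get(word.lower())

print(analyze("Cats"))   # ('cat', '+N +PL')
print(analyze("dogs"))   # None - unlisted forms cannot be analysed at all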
Finite-State Morphology
∙ By finite-state morphological models, we mean those in which the specifications
written by human programmers are directly compiled into finite-state
transducers.
∙ The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and
LexTools.
∙ Finite-state transducers are computational devices extending the power of finite-state
automata.
∙ A theoretical limitation of finite-state models of morphology is the problem of
capturing reduplication of words or their elements (e.g., to express plurality)
found in several human languages.
Input      Morphologically parsed output
Cats       cat +N +PL
Cat        cat +N +SG
Cities     city +N +PL
Geese      goose +N +PL
Goose      (goose +N +SG) or (goose +V)
Gooses     goose +V +3SG
Merging    merge +V +PRES-PART
Caught     (caught +V +PAST-PART) or (catch +V +PAST)
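Real finite-state tools such as XFST compile rule specifications into finite-state transducers.
The Python sketch below only mimics the idea with two hand-written suffix rules for regular noun
plurals, so it is a toy illustration rather than a genuine transducer, and it cannot handle
irregular forms such as "geese".

# Toy "finite-state style" analyser for regular English noun plurals.
def analyze_noun(form):
    if form.endswith("ies"):              # cities -> city +N +PL
        return form[:-3] + "y", "+N +PL"
    if form.endswith("s"):                # cats -> cat +N +PL
        return form[:-1], "+N +PL"
    return form, "+N +SG"                 # cat -> cat +N +SG

for w in ["cats", "cities", "cat", "geese"]:
    print(w, "->", analyze_noun(w))       # note: "geese" is analysed wrongly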
Unification-Based Morphology
∙ The concepts and methods of these formalisms are often closely connected to those of logic
programming.
∙ In finite-state morphological models, both surface and lexical forms are by themselves
unstructured strings of atomic symbols.
1. Rule-based methods
3. Hybrid methods
Finding structure of Documents
Some of the specific techniques and tools used in finding the structure
of documents in NLP include:
1. Named entity recognition
2. Part-of-speech tagging
3. Dependency parsing
4. Topic modeling
Finding structure of Documents
1. Named entity recognition: This technique identifies and extracts specific entities,
such as people, places, and organizations, from the document, which can help in
identifying the different sections and topics.
1. Sentence Boundary Detection
1. Regular expressions: These are patterns that can be used to match specific
character sequences in a text, such as periods followed by whitespace
characters, and can be used to identify the end of a sentence (see the sketch after this list).
3. Deep learning models: These are neural network models that can learn
complex patterns and features of sentence boundaries from a large corpus of
text, and can be used to achieve state-of-the-art performance in sentence
boundary detection.
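As a small illustration of the regular-expression approach above, the sketch below splits text on
sentence-final punctuation followed by whitespace and a capital letter. The example text is made
up; note how the naive rule wrongly splits after the abbreviation "Dr.", which is one reason
statistical and deep learning methods are used for this task.

# Naive regex sentence splitter: split after . ! or ? when followed by
# whitespace and a capital letter.
import re

text = "Dr. Smith went to Washington. He arrived late. It was raining!"
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(sentences)
# ['Dr.', 'Smith went to Washington.', 'He arrived late.', 'It was raining!']
# The abbreviation "Dr." is treated as a sentence end, which is wrong.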
2.Topic Boundary Detection
1. Lexical cohesion: This method looks at the patterns of words and phrases that appear
in a text, and identifies changes in the frequency or distribution of these patterns as
potential topic boundaries. For example, if the frequency of a particular keyword or
phrase drops off sharply after a certain point in the text, this could indicate a shift in
topic.
2. Discourse markers: This method looks at the use of discourse markers, such as
"however", "in contrast", and "furthermore", which are often used to signal a change in
topic or subtopic. By identifying these markers in a text, it is possible to locate potential
topic boundaries.
3. Machine learning: This method involves training a machine learning model to identify
patterns and features in a text that are associated with topic boundaries. This can
involve using a variety of linguistic and contextual features, such as sentence length,
word frequency, and part-of-speech tags, to identify potential topic boundaries.
2.Topic Boundary Detection
Some of the specific techniques and tools used in topic boundary detection include:
1. Latent Dirichlet Allocation (LDA): This is a probabilistic topic modeling technique that
can be used to identify topics within a corpus of text. By analyzing the distribution of
words within a text, LDA can identify the most likely topics and subtopics within the text,
and can be used to locate topic boundaries (see the sketch after this list).
2. TextTiling: This is a technique that involves breaking a text into smaller segments, or
"tiles", based on the frequency and distribution of key words and phrases. By comparing
the tiles to each other, it is possible to identify shifts in topic or subtopic, and locate
potential topic boundaries.
3. Coh-Metrix: This is a text analysis tool that uses a range of linguistic and
discourse-based features to identify different aspects of text complexity, including topic
boundaries. By analyzing the patterns of words, syntax, and discourse in a text,
Coh-Metrix can identify potential topic boundaries, as well as provide insights into the
overall structure and organization of the text.
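As a small illustration of the LDA technique described above, the sketch below fits a two-topic
model with scikit-learn on four made-up documents and prints the top words per topic. The
documents and parameter values are hypothetical; a real topic model would need a much larger
corpus.

# LDA topic modelling sketch with scikit-learn (pip install scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat with another cat",
    "stocks fell as investors sold their shares",
    "the dog chased the cat across the garden",
    "the market rallied and share prices rose",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words for each discovered topic.
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:]]
    print("Topic", idx, ":", top)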
2.Methods used in NLP
There are several methods and techniques used in NLP to find the structure of
documents, which include:
2. Part-of-speech tagging
4. Coreference resolution
6. Parsing
7. Sentiment analysis
2.1 Generative Sequence Classification Methods
Generative sequence classification methods are a type of NLP method used to find
the structure of documents. These methods involve using probabilistic models to
classify sequences of words into predefined categories or labels.
HMMs are statistical models that can be used to classify sequences of words by
modeling the probability distribution of the observed words given a set of hidden
states.
The hidden states in an HMM can represent different linguistic features, such as
part-of-speech tags or named entities, and the model can be trained using labeled
data to learn the most likely sequence of hidden states for a given sequence of
words.
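A minimal sketch of a generative sequence classifier is shown below: an HMM part-of-speech
tagger trained with NLTK on a slice of the Penn Treebank sample. It assumes NLTK is installed
and the treebank data has been downloaded with nltk.download('treebank').

# HMM POS tagger sketch (assumes: pip install nltk, nltk.download('treebank')).
import nltk
from nltk.tag import hmm

train_sents = nltk.corpus.treebank.tagged_sents()[:3000]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

# The tagger picks the most likely hidden state (POS tag) sequence
# for the observed words.
print(tagger.tag("The cat sat on the mat .".split()))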
2.1 Generative Sequence Classification Methods
CRFs are similar to HMMs in that they are sequence models, but they model the conditional
probability of a sequence of labels given a sequence of words, and they are more flexible in
that they can take into account more complex features and dependencies between labels.
2.2 Discriminative Sequence Classification Methods:
Discriminative local classification methods are another type of NLP method used to
find the structure of documents. These methods involve training a model to classify
each individual word or token in a document based on its features and the context in
which it appears.
CRFs are discriminative models: they model the conditional probability of a sequence of labels
given a sequence of features, without making assumptions about the underlying distribution of
the data.
CRFs have been used for tasks such as named entity recognition, part-of-speech
tagging, and chunking.
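The sketch below shows the general shape of a CRF-based tagger using the third-party
sklearn-crfsuite package (pip install sklearn-crfsuite). The single training sentence, its
labels, and the feature function are all hypothetical; a real system would use a large
annotated corpus and richer features.

# CRF sequence labelling sketch with sklearn-crfsuite (hypothetical data).
import sklearn_crfsuite

def word_features(sent, i):
    # A tiny feature dictionary for the i-th word of a sentence.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "is_title": word.istitle(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
    }

train_sents = [["Neha", "broke", "the", "mirror"]]
train_labels = [["B-PER", "O", "O", "O"]]

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, train_labels)
print(crf.predict(X_train))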
2.2 Discriminative Sequence Classification Methods:
Maximum Entropy Markov Models (MEMMs) are discriminative sequence models that combine
maximum-entropy (logistic regression) classifiers with Markov-style state transitions. MEMMs
have been used for tasks such as speech recognition, named entity recognition, and machine
translation.
Finding the structure of documents in natural language processing (NLP) can be a complex task, and
there are several approaches with varying degrees of complexity.
1. Rule-based approaches: These approaches use a set of predefined rules to identify the structure of
a document. For instance, they might identify headings based on font size and style or look for bullet
points or numbered lists. While these approaches can be effective in some cases, they are often
limited in their ability to handle complex or ambiguous structures.
2. Statistical approaches: These approaches use machine learning algorithms to identify the structure
of a document based on patterns in the data. For instance, they might use a classifier to predict whether
a given sentence is a heading or a body paragraph. These approaches can be quite effective, but
they require large amounts of labeled data to train the model.
3. Deep learning approaches: These approaches use deep neural networks to learn the structure of a
document. For instance, they might use a hierarchical attention network to identify headings and
subheadings, or a sequence-to-sequence model to summarize the document. These approaches can
be very powerful, but they require even larger amounts of labeled data and significant
computational resources to train.
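As a small illustration of the statistical approach described above, the sketch below trains a
TF-IDF plus logistic-regression classifier to label lines as headings or body text. The training
lines and labels are made up; a real system would need far more labelled data.

# Heading-vs-body line classifier sketch (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = [
    "Introduction",
    "This paper describes our experiments in detail.",
    "Methods",
    "We trained the model on a large annotated corpus.",
]
labels = ["heading", "body", "heading", "body"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(lines, labels)

print(clf.predict(["Results", "The accuracy improved over the baseline."]))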
4. Performance of the Approaches:
The performance of different approaches for finding the structure of documents in natural
language processing (NLP) can vary depending on the specific task and the complexity of
the document. Here are some general trends:
2. Statistical approaches: These approaches can be quite effective when there is a large
amount of labeled data available for training, and the document structure is relatively
consistent across examples. However, they may struggle with identifying new or unusual
structures that are not well-represented in the training data.