Unit 1 and Unit 2 Good Notes
UNIT – 1 7 Hrs
UNIT – 2 8 Hrs
UNIT – 3 6 Hrs
UNIT – 4 7 Hrs
UNIT – 5 8 Hrs
Text Books:
1. Natural Language Processing and Information Retrieval, Tanveer Siddiqui,
U.S. Tiwary, 1st Edition, Oxford University Press, 2008.
2. Speech and Language Processing: An introduction to Natural Language
Reference Books:
1. Natural Language Processing: An information Access Perspective by Kavi
Narayana Murthy, Ess Ess Publications, 2006.
2. Applied Text Analysis with Python by Benjamin Bengfort, Tony Ojeda,
Rebecca Bilbro, O'Reilly Media, June 2018.
Advantages of NLP
o NLP helps users to ask questions about any subject and get a direct response
within seconds.
o NLP offers exact answers to the question; that is, it does not return unnecessary or
unwanted information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.
o Most companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.
Why NLP
• Huge data from tweets, reviews, chats, queries etc.
• Most of it is unstructured
• Hard for humans to handle and manage
• To have a deep understanding of broad natural language
History of NLP
(1940-1960) - Focused on Machine Translation (MT)
1948 - The first recognisable NLP application was introduced at Birkbeck College,
London.
1950s - There were conflicting views between linguistics and computer science.
Chomsky published his first book, Syntactic Structures, and claimed that language is
generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which is a rule-based
description of syntactic structures.
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar
describes how languages such as English express the relationship between nouns and
verbs through prepositions.
For example: "Neha broke the mirror with the hammer". In this example, case grammar
identifies Neha as the agent, the mirror as the theme, and the hammer as the instrument.
SHRDLU
LUNAR
LUNAR is the classic example of a natural language database interface system; it used
ATNs and Woods' Procedural Semantics. It was capable of translating elaborate
natural language expressions into database queries and handled 78% of requests without
errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of
hand-written rules. After 1980, machine learning algorithms were introduced for
language processing.
In the early 1990s, NLP started growing faster and achieved good processing accuracy,
especially for English grammar. In 1990, electronic text was also introduced, which
provided a good resource for training and evaluating natural language programs. Other
factors include the availability of computers with fast CPUs and more memory. The
major factor behind the advancement of natural language processing was the Internet.
Disadvantages of NLP
A list of disadvantages of NLP is given below:
Components of NLP
Natural Language Understanding (NLU) helps the machine to understand and analyse
human language by extracting the metadata from content such as concepts, entities,
keywords, emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in
both spoken and written language.
Natural Language Generation (NLG) acts as a translator that converts the computerized
data into natural language representation. It mainly involves Text planning, Sentence
planning, and Text Realization.
1. Question Answering
2. Spam Detection
3. Sentiment Analysis
Sentiment Analysis is also known as opinion mining. It is used on the web to analyse
the attitude, behaviour, and emotional state of the sender. This application is
implemented through a combination of NLP (Natural Language Processing) and
statistics: values are assigned to the text (positive, negative, or neutral) in order to
identify the mood of the context (happy, sad, angry, etc.).
4. Machine Translation
Machine translation is used to translate text or speech from one natural language to
another natural language.
5. Spelling correction
Microsoft Corporation provides software such as MS Word and PowerPoint with built-in
spelling correction.
6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used in
applications, such as mobile, home automation, video recovery, dictating to Microsoft
Word, voice biometrics, voice user interface, and so on.
7. Chatbot
Implementing chatbots is one of the important applications of NLP. Chatbots are used by
many companies to provide customer chat services.
8. Information extraction
Information extraction is one of the most important applications of NLP. It is used for
extracting structured information from unstructured or semi-structured machine
readable documents.
It converts large bodies of text into more formal representations, such as first-order
logic structures, that are easier for computer programs to manipulate than raw natural
language.
Step 1: Sentence Segmentation
Sentence segmentation is the first step in building the NLP pipeline. It breaks the
paragraph into separate sentences.
1. "Independence Day is one of the important festivals for every Indian citizen."
2. "It is celebrated on the 15th of August each year ever since India got independence
from the British rule."
3. "This day celebrates independence in the true sense."
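The segmentation above can be sketched with a regular expression. This is only a minimal illustration using Python's standard re module; the function name split_sentences is an assumption made for this example, and the rule (split after ".", "!" or "?" followed by whitespace) would wrongly split abbreviations such as "Dr. Rao".

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


paragraph = (
    "Independence Day is one of the important festivals for every Indian citizen. "
    "It is celebrated on the 15th of August each year ever since India got "
    "independence from the British rule. "
    "This day celebrates independence in the true sense."
)

sentences = split_sentences(paragraph)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

Practical pipelines usually rely on a trained sentence tokenizer rather than a single rule like this one.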
Step 2: Word Tokenization
A word tokenizer breaks a sentence into separate words or tokens.
Example:
JavaTpoint offers Corporate Training, Summer Training, Online Training, and Winter
Training.
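A rough sketch of word tokenization for the example sentence, using only the standard re module. The function name tokenize and the choice to treat each punctuation mark as its own token are assumptions for this illustration, not a fixed convention.

```python
import re


def tokenize(sentence):
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


sentence = ("JavaTpoint offers Corporate Training, Summer Training, "
            "Online Training, and Winter Training.")
tokens = tokenize(sentence)
print(tokens)
```

Library tokenizers handle harder cases (contractions, URLs, hashtags) that this pattern does not.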
Step 3: Stemming
Stemming is used to normalize words into their base or root form. For example,
celebrates, celebrated, and celebrating all originate from the single root word
"celebrate." The big problem with stemming is that it sometimes produces a root word
that has no meaning.
For example, intelligence, intelligent, and intelligently all reduce to the single root
"intelligen." In English, the word "intelligen" does not have any meaning.
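To make the idea concrete, here is a toy suffix-stripping stemmer. It is a sketch only: the suffix list and the name toy_stem are hypothetical, and real stemmers such as the Porter stemmer use a much larger rule set. Note that this toy version reduces the three words to "celebrat" rather than "celebrate", which shows how a stem need not be a real word.

```python
# Hypothetical suffix list, checked in order; not a complete rule set.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ("celebrates", "celebrated", "celebrating")]
print(stems)
```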
Step 4: Lemmatization
For example: In lemmatization, the words intelligence, intelligent, and intelligently
have the root word "intelligent", which has a meaning.
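Unlike stemming, lemmatization maps a word to a dictionary form. The sketch below assumes a tiny hand-written lookup table (LEMMAS is hypothetical and exists only for this example); a real lemmatizer would consult a full lexicon and usually the word's part of speech.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {
    "celebrates": "celebrate",
    "celebrated": "celebrate",
    "celebrating": "celebrate",
    "is": "be",
    "was": "be",
}


def toy_lemmatize(word):
    word = word.lower()
    # Words missing from the table are returned unchanged.
    return LEMMAS.get(word, word)


lemmas = [toy_lemmatize(w) for w in ("Celebrated", "celebrating", "was", "festival")]
print(lemmas)
```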
Step 5: Identifying Stop Words
In English, there are a lot of words that appear very frequently like "is", "and", "the", and
"a". NLP pipelines will flag these words as stop words. Stop words might be filtered out
before doing any statistical analysis.
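Stop-word filtering can be sketched as a simple set lookup. The stop list below contains only the four words mentioned above, and the function name remove_stop_words is an assumption for this example; the second call shows how an application such as a band search engine might pass a smaller list that keeps "the".

```python
STOP_WORDS = {"is", "and", "the", "a"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lower case so that "The" and "the" are treated alike.
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "day", "is", "a", "festival", "and", "a", "holiday"]
content_words = remove_stop_words(tokens)
print(content_words)

# A custom stop list that no longer filters out "the".
band_query = remove_stop_words(["The", "Who"], stop_words=STOP_WORDS - {"the"})
print(band_query)
```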
Note: If you are building a rock band search engine, you should not ignore the word
"The."
Step 6: Dependency Parsing
Dependency parsing is used to find how all the words in the sentence are related to
each other.
Step 7: POS tags
POS stands for parts of speech, which include noun, verb, adverb, and adjective. A POS
tag indicates how a word functions, both in meaning and grammatically, within a
sentence. A word can have one or more parts of speech depending on the context in which
it is used.
Step 8: Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of detecting named entities such as a
person name, movie name, organization name, or location.
Example: Steve Jobs introduced iPhone at the Macworld Conference in San Francisco,
California.
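One very simple way to illustrate NER is a gazetteer lookup, that is, matching the text against a list of known names. The sketch below is an assumption-laden toy: the GAZETTEER entries and their labels are hand-written for this one sentence, and only the first occurrence of each name is reported. Real NER systems are usually statistical and can recognise names they have never seen.

```python
# Hypothetical gazetteer: known names and the labels assigned to them.
GAZETTEER = {
    "Steve Jobs": "PERSON",
    "iPhone": "PRODUCT",
    "Macworld Conference": "EVENT",
    "San Francisco": "LOCATION",
    "California": "LOCATION",
}


def find_entities(text, gazetteer=GAZETTEER):
    found = []
    for name, label in gazetteer.items():
        start = text.find(name)  # position of the first occurrence, or -1
        if start != -1:
            found.append((start, name, label))
    found.sort()  # report entities in the order they appear in the text
    return [(name, label) for _, name, label in found]


sentence = ("Steve Jobs introduced iPhone at the Macworld Conference "
            "in San Francisco, California.")
entities = find_entities(sentence)
for name, label in entities:
    print(name, "->", label)
```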
Step 9: Chunking
Chunking is used to collect individual pieces of information and group them into
larger pieces of a sentence.
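As a small sketch of chunking, the function below groups runs of determiner, adjective, and noun tags into noun-phrase chunks. It assumes the sentence has already been POS tagged; the tags here are assigned by hand in a Penn Treebank-like style, and the name chunk_noun_phrases is made up for this example.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group maximal runs of DT/JJ/NN tokens into noun-phrase chunks."""
    np_tags = {"DT", "JJ", "NN"}
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag in np_tags:
            current.append(word)
        elif current:
            # A token outside the pattern closes the current chunk.
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hand-tagged input: (word, tag) pairs.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("drank", "VBD"), ("the", "DT"), ("milk", "NN")]
chunks = chunk_noun_phrases(tagged)
print(chunks)
```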
Phases of NLP
Natural Language Processing is separated into five primary stages or phases, starting
with simple word processing and progressing to identifying complicated phrase
meanings.
• Stop word removal (removing ‘and’, ‘of’, ‘the’ etc. from text)
• Tokenization (breaking the text into sentences or words)
o Word tokenizer
o Sentence tokenizer
o Tweet tokenizer
• Stemming (removing ‘ing’, ‘es’, ‘s’ from the tail of the words)
• Lemmatization (converting the words to their base forms)
o Dependency Parsing
o Parts of Speech (POS) tagging
o Take the following sentence for example: “The guava ate an apple.” The line is
syntactically valid, yet it is illogical because guavas cannot eat.
o Discourse Integration: The term “discourse integration” refers to a sense of
context. The meaning of a sentence may depend on the sentences that precede it, and
it may in turn influence the meaning of the sentences that follow. The same applies
to proper nouns and the pronouns that refer to them.
o Example: "John got ready at 9 AM. Later he took the train to California"
o Here, the machine is able to understand that the word “he” in the second sentence
refers to “John”.
Pragmatic Analysis The fifth and final phase of NLP is pragmatic analysis. It focuses on
the overall communicative and social content of an utterance and its effect on
interpretation. Pragmatic analysis uses a set of rules that describe cooperative
dialogues to help find the intended result. It covers things like word repetition, who
said what to whom, and so on. It takes into account how people communicate with one
another, the context in which they converse, and a variety of other factors. It refers to
the process of abstracting or extracting the meaning of language as used in a situation,
and it interprets the given text using the knowledge gathered in the preceding stages.
For example, “Switch on the TV”, when used in a sentence, is an order or request to
switch the TV on.
"Thank you for coming so late, we have wrapped up the meeting" (Contains sarcasm)
"Can you share your screen?" (here the context is about computer’s screen share during
a remote meeting)
Challenges in NLP
have more than one meaning. That is, a sentence may be interpreted in more than one
way. This leads to uncertainty in choosing the right meaning of a sentence especially
while processing natural languages by computer.
Lexical ambiguity – The word “tried” means “attempted” not “judged” or “tested”.
Also, the word “reach” means “establish communication” not “gain” or “pass” or
“strive”.
Syntactic ambiguity – The phrase “on the mobile” is attached to “reach” and thus means
“using the mobile”. It is not attached to “friend”.
Anaphoric ambiguity – The anaphor “he” refers to the “friend”, not “Raj”.
i. Lexical ambiguity
It is a class of ambiguity caused by a word having multiple senses, especially when the
word is part of a sentence or phrase. A word can have multiple meanings under different
part-of-speech (POS) categories, and under each POS category it may have several
different senses. Lexical ambiguity is about choosing the right sense of a particular
word under the correct POS category.
For example, let us take the sentence “I saw a ship”. Here, the words “saw” and “ship”
can mean multiple things, as follows:
Saw = present tense of the verb saw (cut with a saw) OR past tense of the verb see
(perceive by sight) OR a noun saw (blade for cutting) etc. According to WordNet, the
word “saw” is defined under 3 different senses in the NOUN category and under 25
different senses in the VERB category.
Ship = present tense of the verb ship (transport commercially) OR present tense of the
verb ship (travel by ship) OR a noun ship (a vessel that carries passengers) etc. As per
WordNet, the word “ship” is defined with 1 sense under the NOUN category and 5 senses
under the VERB category.
Because of these multiple meanings, an ambiguity arises in choosing the right sense of
“saw” and “ship”.
Handling lexical ambiguity
Lexical ambiguity can be handled using tasks like POS tagging and Word Sense
Disambiguation.
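One classic approach to Word Sense Disambiguation is a simplified Lesk-style overlap count: pick the sense whose gloss shares the most words with the surrounding context. The sketch below is a toy under stated assumptions: the SENSES table and its glosses are hand-written for this example and are not taken from WordNet, and ties (including zero overlap) fall back to the first sense listed.

```python
# Hypothetical sense inventory: sense label -> gloss keywords (not from WordNet).
SENSES = {
    "saw": {
        "verb: perceive by sight": "perceive notice eyes sight look",
        "verb: cut with a saw": "cut wood timber blade log",
        "noun: blade for cutting": "tool toothed blade cutting wood",
    }
}


def simplified_lesk(word, context_tokens, senses=SENSES):
    context = {token.lower() for token in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        # Score a sense by how many gloss words also occur in the context.
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


sense_1 = simplified_lesk("saw", "I saw a ship with my own eyes".split())
sense_2 = simplified_lesk("saw", "He will saw the log into timber".split())
print(sense_1)
print(sense_2)
```

With only “I saw a ship” as context, none of these toy glosses overlap, so the function would simply return the first sense; richer glosses or a POS tagger would be needed to decide.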
Example: “Darshan plays keyboard. He loves music”. In this sentence, the anaphor “He”
is a pronoun.
“A puppy drank the milk. The cute little dog was satisfied”. In this sentence, the anaphor
“The cute little dog” is a noun phrase.
• The referent of an anaphor may not be explicitly present in the previous sentence.
Instead, the anaphor may refer to a part of an entity (antecedent) in the previous sentence.
Example: “I went to the hospital, and they told me to go home and rest”. Here, the
anaphor “they” refers not to the “hospital” directly but to the “hospital staff”.
• Antecedents may not be in the immediately previous sentence. They may be present in
the sentences before the previous one, or in the same sentence.
Example: “The horse ran up the hill. It was very steep. It soon got tired”. Here, the
anaphor “it” of the third sentence refers to the “horse” in the first sentence.
v. Pragmatic ambiguity
Pragmatics focuses on conversational implicature, a process in which the speaker
implies and the listener infers. Simply put, it is the study of meaning that is not
directly stated, and of how people use language. The pragmatic level of linguistic
processing deals with the use of real-world knowledge and with understanding how this
affects the meaning of what is being communicated. By analyzing the contextual
dimension of the documents and queries, a more detailed representation is derived.
Pragmatic ambiguity arises when the statement is not specific, and the context does not
provide the information needed to clarify the statement (Walton D. (1996) A Pragmatic
Synthesis. In: Fallacies Arising from Ambiguity. Applied Logic Series, vol 1. Springer,
Dordrecht).
Roughly, a pragmatic ambiguity occurs in a requirement if different readers give
different interpretations to it, depending on the context of the requirement. The context
of a requirement includes the other requirements of the same document, which
influence the understanding of the requirement, and the background knowledge of the
reader, which gives a meaning to the concepts expressed in the requirement.
Example:
Sentence | Direct meaning (semantic meaning) | Other meanings (pragmatic meanings)
Do you know what time is it? | Asking for the current time | Expressing anger to someone who missed the due time or something
Will you crack open the door? I am getting hot | To break | Open the door just a little
The chicken is ready to eat | The chicken is ready to eat its breakfast, for example. | The cooked chicken is ready to be served
• Semantics is about literal meaning of the words and their interrelations, whereas
pragmatics focuses on the inferred meaning that the speakers and listeners perceive.
• Semantics is the study of meaning, or more precisely, the study of the relation between
linguistic expressions and their meanings. Pragmatics is the study of context, or more
precisely, a study of the way context can influence our understanding of linguistic
utterances.
Chapter 2
Language Modeling
Study:
Regular Expressions
Spelling Error Detection and Correction
From speech and language processing