Unit 1
Dr.A.Ajina RIT
UNIT -1 Syllabus
What is NLP?
NLP stands for Natural Language Processing, a field at the
intersection of computer science, human language, and
artificial intelligence.
Natural language processing (NLP) is a subfield of
linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and
human language, in particular how to program computers to
process and analyze large amounts of natural language data.
Process information contained in natural language text.
Also known as Computational Linguistics (CL), Human
Language Technology (HLT), and Natural Language
Engineering (NLE).
What is NLP?
It is the technology used by machines to understand, analyze,
manipulate, and interpret human languages.
It helps developers organize knowledge and perform tasks such as
translation, automatic summarization, Named Entity Recognition
(NER), speech recognition, relationship extraction, and topic
segmentation.
NLP is concerned with the development of computational models of
aspects of human language processing.
Main reasons for NLP:
To develop automated tools for language processing
To gain a better understanding of human communication
NLP
Building computational models with human language processing
abilities requires
Knowledge of how humans acquire, store, and process language.
Knowledge of the world and of language.
Two major approaches to NLP
Rationalist Approach: A significant part of the knowledge in the human
mind is not derived by the senses but is fixed in advance, presumably by
genetic inheritance
Empiricist Approach: The brain is able to perform association, pattern
recognition, and generalization and, thus, the structures of Natural
Language can be learned.
Linguistics is the scientific study of language. It deals with the analysis of
every aspect of language, as well as the methods for studying and
modelling them.
Origins of NLP
Theoretical linguists identify rules that describe and restrict the structure of
languages (grammar).
Theoretical Linguistics mainly provide structural description of natural
language and its semantics.
Psycholinguistics explain how humans produce and comprehend natural
language.
They are interested in representation of linguistic structures as well as in the
process by which these structures are produced.
Computational linguistics is concerned with the study of language using
computational models of linguistic phenomena.
It deals with the application of linguistic theories and computational
techniques for NLP.
Computational models may be broadly classified under
Knowledge driven
Data driven
Knowledge driven: rely on explicitly coded linguistic knowledge, often
expressed as a set of handcrafted grammar rules.
Data driven: presumes the existence of a large amount of data and usually
employs some machine learning technique to learn syntactic patterns. The
amount of human effort is smaller, and the performance of these systems
depends on the quantity of the data.
Why NLP is Hard
Structural and lexical ambiguity:
Will Will will Will’s will?
The man saw the boy with the binoculars.
Rose rose to put rose roes on her rows of roses.
Flying planes can be dangerous.
Hole found in the room wall; police are looking into it.
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
→ Buffaloes from Buffalo, NY, whom buffaloes from Buffalo bully, bully buffaloes from Buffalo.
Language imprecision and vagueness:
It is very warm here.
Q: Did your mother call your aunt last night?
A: I’m sure she must have.
Why NLP is Hard
Non-standard English:
Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either
New senses of a word:
That’s sick dude!
Giants
Tricky entity names:
Where is A Bug’s Life playing ...
Let It Be was recorded ...
Neologisms:
unfriend
retweet
Google/Skype/photoshop
Phases of NLP
Lexical and Morphological Analysis
The first phase of NLP is lexical analysis. This phase scans the
input text as a stream of characters and converts it into meaningful
lexemes. It divides the whole text into paragraphs, sentences, and
words.
Syntactic Analysis (Parsing)
Syntactic Analysis is used to check grammar, word arrangements, and
shows the relationship among the words.
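As a rough sketch of the lexical phase, the splitting into paragraphs, sentences, and words can be approximated with a few regular expressions (the rules below are illustrative, not a production tokenizer):

```python
import re

def lexical_analysis(text):
    """Split raw text into paragraphs, sentences, and word tokens."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    result = []
    for p in paragraphs:
        # Naive sentence split: break after ., !, ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", p)
        # Naive word tokenization: keep alphabetic runs (with apostrophes).
        tokenized = [re.findall(r"[A-Za-z']+", s) for s in sentences]
        result.append(tokenized)
    return result

text = "NLP is fun. It is also hard!\n\nParsing comes next."
for para in lexical_analysis(text):
    print(para)
```

Real tokenizers handle abbreviations, numbers, and punctuation far more carefully (e.g., the tokenizers in NLTK or spaCy).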
Phases of NLP
Semantic Analysis
Semantic analysis is concerned with the meaning representation. It
mainly focuses on the literal meaning of words, phrases, and
sentences.
Discourse Integration
Discourse integration means that the meaning of a sentence depends on the
sentences that precede it and may also invoke the meaning of the sentences that follow it.
Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps discover
the intended effect by applying a set of rules that characterize
cooperative dialogues.
Language and Knowledge
People use seven interdependent levels to understand and extract meaning from a text or spoken words. In order
to understand natural languages, it’s important to distinguish among them:
1- Phonetic or phonological level: deals with pronunciation
2- Morphological level: deals with the smallest parts of words that carry meaning, such as suffixes and prefixes.
3- Lexical level: deals with lexical meaning of a word.
4- Syntactic level: deals with grammar and structure of sentences.
5- Semantic level: deals with the meaning of words and sentences.
6- Discourse level: deals with the structure of different kinds of text.
7- Pragmatic level: deals with the knowledge that comes from the outside world, i.e., from outside the content of the
document.
Morphological Analysis
While performing the morphological analysis, each particular word is
analyzed. Non-word tokens such as punctuation are removed from the
words. Hence the remaining words are assigned categories.
For instance: “Ram’s iPhone cannot convert the video from .mkv to
.mp4.” In morphological analysis, the sentence is analyzed word by
word.
Here, Ram is a proper noun, the ’s in Ram’s is a possessive suffix,
and .mkv and .mp4 are file extensions.
For example, swims and swim’s are different: the -s suffix in swims
marks a third-person singular verb, while the ’s in swim’s marks possession.
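A toy sketch of this word-by-word categorization, mirroring the Ram example (the rules and category names are crude illustrations, not a real morphological analyzer):

```python
import re

def analyze_token(tok):
    """Assign a rough morphological category to a single token (toy rules)."""
    if re.fullmatch(r"\.\w+", tok):
        return "file-extension"       # e.g. .mkv, .mp4
    if tok.endswith("'s"):
        return "possessive"           # e.g. Ram's = Ram + possessive suffix
    if tok[0].isupper():
        return "proper-noun"          # crude: any capitalised word
    if tok.endswith("s"):
        return "plural-or-3sg"        # -s is ambiguous: plural noun or 3rd-person verb
    return "base-word"

sentence = ["Ram's", "iPhone", "cannot", "convert", ".mkv", "to", ".mp4"]
print({t: analyze_token(t) for t in sentence})
```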
Discourse Integration
While processing a language there can arise one major ambiguity
known as referential ambiguity. Referential ambiguity is the ambiguity
that can arise when a reference to a word cannot be determined.
For example,
Ram won the race.
Mohan ate half of a pizza.
He liked it.
Determining whether “he” refers to Ram or Mohan, and “it” to the race
or the pizza, requires knowledge of the world.
Syntactic Analysis
Every language has its own grammar rules; violating them produces a syntax error. In syntactic analysis, the
sentence is transformed into a structure that represents the relations between its words. The syntax of a language
is the set of rules its sentences must follow. For example, “To the movies, we are going.” violates standard English
word order. Syntactic analysis uses the results of morphological analysis to build a description of the sentence: the
words, already categorized by the morphological process, are arranged into a defined structure. This process is
called parsing. For example, “The cat chases the mouse in the garden” would be represented as a parse tree:
Here the sentence is broken down according to its categories and
described as a hierarchical structure whose nodes are sentence units.
These parse trees are built during syntactic analysis; if an error
arises, processing stops and a syntax error is reported.
Parsing can be top-down or bottom-up.
Top-down: starts with the start symbol and applies the grammar rules
until each terminal in the sentence is derived.
Bottom-up: starts with the sentence to be parsed and applies the
rules in reverse until the start symbol is reached.
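To make top-down parsing concrete, here is a minimal recursive-descent parser over a toy grammar for the example sentence (the grammar, rule order, and tree representation are all illustrative):

```python
# Toy grammar for "the cat chases the mouse in the garden".
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP", "PP"], ["V", "NP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"]],
    "N":   [["cat"], ["mouse"], ["garden"]],
    "V":   [["chases"]],
    "P":   [["in"]],
}

def parse(symbol, tokens, pos):
    """Top-down parse: try each rule for `symbol` starting at `pos`.
    Returns (tree, next_pos) on success, or None."""
    for rule in GRAMMAR.get(symbol, []):
        children, p, ok = [], pos, True
        for part in rule:
            if part in GRAMMAR:                 # non-terminal: recurse
                sub = parse(part, tokens, p)
                if sub is None:
                    ok = False
                    break
                tree, p = sub
                children.append(tree)
            elif p < len(tokens) and tokens[p] == part:  # terminal matches
                children.append(part)
                p += 1
            else:
                ok = False
                break
        if ok:
            return (symbol, children), p
    return None

tokens = "the cat chases the mouse in the garden".split()
tree, end = parse("S", tokens, 0)
print(tree)
print("parsed all tokens:", end == len(tokens))
```

Each non-terminal tries its rules in order and recurses; a real parser would also backtrack across rule choices and return all parse trees when the sentence is ambiguous.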
Why NLP is difficult?
NLP is difficult because Ambiguity and Uncertainty exist in the language.
There are three main types of ambiguity:
Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible
meanings.
Syntactic Ambiguity
Syntactic ambiguity exists when the structure of a sentence admits two
or more possible meanings.
Referential Ambiguity
Referential ambiguity exists when something is referred to using a
pronoun whose referent cannot be uniquely determined.
Advantages of NLP
NLP helps users to ask questions about any subject and get a direct
response within seconds.
NLP offers exact answers to a question, without unnecessary or
unwanted information.
NLP helps computers to communicate with humans in their languages.
It is very time efficient.
Components of NLP
Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine to understand and
analyze human language by extracting the meta data from content such as
concepts, entities, keywords, emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's
problem in both spoken and written language.
NLU involves the following tasks:
Mapping the given input into a useful representation.
Analyzing different aspects of the language.
Components of NLP
Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that
converts the computerized data into natural language
representation. It mainly involves Text planning, Sentence
planning, and Text Realization.
Applications of NLP
Question Answering
Question Answering focuses on
building systems that automatically
answer the questions asked by
humans in a natural language
Spam Detection
Spam detection is used to detect unwanted e-mails before they
reach a user's inbox.
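Spam filters are commonly built as text classifiers; a minimal Naive Bayes sketch over made-up training data (the messages and labels below are purely illustrative):

```python
from collections import Counter
import math

# Tiny illustrative training set (made up for this sketch).
train = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to monday", "ham"),
    ("see you at the meeting", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def score(text, label):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    logp = math.log(class_counts[label] / sum(class_counts.values()))
    total = sum(word_counts[label].values())
    for w in text.split():
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    return max(("spam", "ham"), key=lambda lab: score(text, lab))

print(classify("free cash prize"))   # leans spam
print(classify("monday meeting"))    # leans ham
```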
Applications of NLP
Sentiment Analysis
Sentiment Analysis is also known as opinion
mining.
It is used on the web to analyze the attitude,
behavior, and emotional state of the sender.
This application is implemented through a
combination of NLP (Natural Language
Processing) and statistics: values (positive,
negative, or neutral) are assigned to the text,
and the mood of the context (happy, sad,
angry, etc.) is identified.
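A minimal lexicon-based sketch of assigning positive/negative/neutral labels (the lexicon and its scores are made up for illustration; real systems use trained models or curated lexicons):

```python
# Toy sentiment lexicon (illustrative scores, not a real resource).
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def sentiment(text):
    """Sum word scores; > 0 is positive, < 0 is negative, 0 is neutral."""
    score = sum(LEXICON.get(w.strip(".,!?"), 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone"))       # positive
print(sentiment("I hate this terrible battery"))  # negative
```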
Applications of NLP
Machine Translation
Machine translation is used to translate text or
speech from one natural language to another
natural language.
Spelling correction
Microsoft provides word-processing software such as MS Word
and PowerPoint with built-in spelling correction.
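A common technique behind spelling correction (not specific to any product) is to suggest the dictionary word with the smallest edit distance to the misspelling; a minimal sketch with a made-up dictionary:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (single rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j]: deletion, dp[j-1]: insertion, prev: match/substitution
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

DICTIONARY = ["language", "natural", "process", "grammar", "sentence"]

def correct(word):
    """Return the dictionary word closest to `word` by edit distance."""
    return min(DICTIONARY, key=lambda w: edit_distance(word, w))

print(correct("langage"))   # language
print(correct("grammer"))   # grammar
```

Real spell checkers also weight candidates by word frequency and keyboard/phonetic similarity rather than edit distance alone.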
Applications of NLP
Chatbot
Implementing a chatbot is one of the important
applications of NLP. Chatbots are used by many companies
to provide chat-based customer services.
Applications of NLP
Information extraction
Information extraction is one of the most important applications of
NLP. It is used for extracting structured information from unstructured
or semi-structured machine-readable documents.
Natural Language Understanding (NLU)
It converts large sets of text into more formal representations, such as
first-order logic structures, that are easier for computer programs to
manipulate.
Challenges of NLP
Factors that make NLP difficult:
Problems of representation and interpretation:
Natural language is highly ambiguous and vague, so it is quite difficult to encode all the sources
of knowledge that humans use to process language.
Identifying the semantics of language.
Words alone do not make a sentence. Instead, it is the words together with their syntactic and
semantic relations that give meaning to a sentence.
Alas! They won.
New words are added continually, and existing words are introduced in new contexts.
Example:
TV channels use 9/11 to refer to the terrorist attack on the World Trade Center.
The only way a machine can learn the meaning of a specific word in a message is by
considering its context, unless some explicitly coded general world or domain knowledge is
available. The context of a word is defined by its co-occurring words.
Challenges of NLP
Idioms, metaphor and ellipses add more complexity to identify the meaning of the
written text.
Idioms: a group of words established by usage as having a meaning not deducible
from those of the individual words.
Example idiom: It’s a piece of cake (meaning: it’s easy).
Metaphor: a figure of speech that describes an object or action in a way
that isn't literally true, but helps explain an idea or make a comparison.
Example: Laughter is the music of the soul.
Ellipses: Use an ellipsis to show an omission, or leaving out, of a word or words in a
quote. Use ellipses to shorten the quote without changing the meaning.
Challenges of NLP
For example: "After school I went to her house, which was a few blocks away, and then
came home."
Shorten the quote by replacing a few words with an ellipsis. Remember, the meaning of
the quote should not change.
"After school I went to her house … and then came home."
We removed the words "which was a few blocks away" and replaced them with an ellipsis
without changing the meaning of the original quote.
Quantifier scoping is another problem. The scope of quantifiers is often not clear, which
poses a problem for automatic processing.
Examples: There are many things to do today.
We have a lot of time left, don’t worry.
Challenges of NLP
Ambiguity of natural language is another difficulty:
As humans, we are aware of the context, of current cultural knowledge, and of the
language and its traditions, and we use all of these to work out meaning. However,
incorporating contextual and world knowledge poses the greatest difficulty in language computing.
There are various sources of ambiguities in natural language
Ambiguity at word level(Lexical Ambiguity)
A word can be ambiguous; for example, it may represent a noun or a verb.
Examples: can, bunk, cat, etc.
Sentence Level Ambiguity(structural Ambiguity)
Example: Stolen rifle found by the tree
A number of grammars have been proposed to describe the structure of sentences.
However, there are an infinite number of sentences to generate, which makes writing
grammar rules, and the grammar itself, extremely complex.
NLP APIs
Natural Language Processing APIs allow developers to integrate human-to-
machine communications and complete several useful tasks such as speech
recognition, chatbots, spelling correction, sentiment analysis, etc.
A list of NLP APIs is given below:
IBM Watson API
Chatbot API
Speech to text API
Sentiment Analysis API
Translation API by SYSTRAN
Text Analysis API by AYLIEN
Cloud NLP API
Google Cloud Natural Language API
NLP Libraries
Scikit-learn:
It provides a wide range of algorithms for building machine learning models in
Python.
Natural language Toolkit (NLTK):
NLTK is a complete toolkit for all NLP techniques.
Pattern:
It is a web mining module for NLP and machine learning.
TextBlob:
It provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
NLP Libraries
Quepy:
Quepy is used to transform natural language questions into
queries in a data base query language.
SpaCy:
SpaCy is an open-source NLP library which is used for Data
Extraction, Data Analysis, Sentiment Analysis, and Text
Summarization.
Gensim:
Gensim works with large data sets and processes data streams.
How to build an NLP pipeline
Step1: Sentence Segmentation
Step2: Word Tokenization
Step3: Stemming
Step 4: Lemmatization
Step 5: Identifying Stop Words
Step 6: Dependency Parsing
Step 7: POS tags
Step 8: Named Entity Recognition (NER)
Step 9: Chunking
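The first few of these steps can be sketched end to end in pure Python (segmentation, tokenization, stop-word removal, and a crude suffix stemmer; all rules here are illustrative, and real pipelines use libraries such as NLTK or spaCy, which also cover POS tagging, dependency parsing, NER, and chunking):

```python
import re

# Small illustrative stop-word list.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of"}

def pipeline(text):
    """Toy pipeline: segmentation -> tokenization -> stop-word removal -> stemming."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())        # Step 1
    processed = []
    for sent in sentences:
        tokens = re.findall(r"[a-z']+", sent.lower())            # Step 2
        content = [t for t in tokens if t not in STOP_WORDS]     # Step 5
        stems = [re.sub(r"(ing|ed|s)$", "", t) for t in content]  # Step 3 (crude)
        processed.append(stems)
    return processed

text = "The cats are chasing mice. Parsing is needed."
print(pipeline(text))
```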
Language and Grammar
Automatic Processing of Language requires the rules and exceptions of a language to be explained to
the computer.
● Grammar defines the language
● It consists of a set of rules that allows us to parse and generate sentences in a Language. These
rules relate information to coding devices at the language level and not at the world knowledge
level.
Main hurdle :
Constantly changing nature of languages and the presence of large number of language exceptions.
Effort to provide specifications for the language has led to many grammars.
● Phrase Structure Grammar
● Transformational Grammar
● Lexical Functional Grammar
● Generalized phrase Structure Grammar
● Dependency Grammar
● Paninian Grammar
● Tree-adjoining Grammar
Language and Grammar
Though many grammars were proposed, Transformational Grammar was identified as the most suitable:
● Noam Chomsky proposed the Transformational Grammar and suggested that each sentence in a
language has two levels of representation, namely a deep structure and surface structure.
● Mapping of deep structure to surface structure is carried out by transformations.
● Deep structure can be transformed in a number of ways to yield many different surface level
representations.
● Sentences with different surface level representations having the same meaning, share a common
deep-level representation.
Transformations change the structure of a sentence but not its meaning; the formalism is
therefore also called Transformational Generative Grammar.
Language and Grammar
● English is an SVO (Subject-Verb-Object) language.
● Transformational grammar has three components:
● Phrase structure grammar
● Transformational rules
● Morphophonemic rules: these rules match each sentence representation to a string of phonemes.
● Each of these components consists of set of rules.
● Phrase structure grammar consists of set of rules that generate natural
language sentences and assign a structural description to them.
Phrase structure grammar
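The original slide shows the phrase structure rules as a figure. As an illustration of how such rewrite rules generate a sentence, here is a minimal leftmost-derivation sketch (the rules and lexicon are assumed for the example and omit the auxiliary):

```python
# Illustrative phrase-structure rules and a tiny lexicon.
RULES = {"S": ["NP", "VP"], "NP": ["Det", "N"], "VP": ["V", "NP"]}
LEXICON = {"Det": ["the", "the"], "N": ["police", "snatcher"], "V": ["caught"]}

def derive(start="S"):
    """Leftmost derivation: rewrite the first non-terminal at each step."""
    lex = {k: list(v) for k, v in LEXICON.items()}  # consumable word lists
    form = [start]
    steps = [" ".join(form)]
    while any(s in RULES or s in lex for s in form):
        for i, s in enumerate(form):
            if s in RULES:                 # expand a phrase-level symbol
                form[i:i + 1] = RULES[s]
                break
            if s in lex:                   # replace a category with a word
                form[i] = lex[s].pop(0)
                break
        steps.append(" ".join(form))
    return steps

for step in derive():
    print(step)
# The derivation ends in: the police caught the snatcher
```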
Language and Grammar
Transformational rules are applied to the terminal strings generated by the phrase
structure rules.
They can be used to transform one phrase marker into another, i.e., one
surface representation into another (e.g., an active sentence into a passive one).
The rule relating active and passive sentences (as given by Chomsky):
(s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)
This rule says that if the input has structure s1, it can be transformed into s2.
Transformational rules can be obligatory or optional.
Obligatory rules: ensure, for example, agreement in number between subject and verb.
Optional rules: modify the structure of a sentence while preserving its
meaning.
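The active-to-passive rule can be sketched directly as a string transformation (a toy function; the past-participle "en" form is supplied by hand, since real English morphology is irregular):

```python
def to_passive(np1, aux, verb, np2, participle):
    """Apply the passive rule:
    NP1 - Aux - V - NP2  ->  NP2 - Aux + be + en - V - by + NP1
    The participle (the 'en' form of V) is passed in explicitly."""
    return f"{np2} {aux} be {participle} by {np1}"

# Active: the police will catch the snatcher
print(to_passive("the police", "will", "catch", "the snatcher",
                 participle="caught"))
# -> the snatcher will be caught by the police
```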
Language and Grammar
Morphophonemic rules: match each sentence representation to a string of
phonemes.
Phoneme, in linguistics, smallest unit of speech distinguishing one word (or
word element) from another, as the element p in “tap,” which separates that
word from “tab,” “tag,” and “tan.”
Consider the sentence:
The police will catch the snatcher
Applying the rule (s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)
yields the passive: The snatcher will be caught by the police.
Language Modelling
● A language model is created by training on a corpus, and the corpus needs to be
sufficiently large.
● A complete set of rules that can generate all possible sentences in a language
provides a model of that language (it deals only with structure, not
meaning).
Language Modelling
Speech Recognition: P(I saw a van) >> P(eyes awe of an)
Machine Translation: which sentence is more plausible in the target language?
P(high winds) > P(large winds)
Completion Prediction: predictive text input systems can guess what you are typing and offer choices on how to complete it.
N-Gram Model
A statistical language model is a probability distribution P(s) over all possible word sequences.
The goal of a statistical model is to estimate the probability (likelihood) of a sentence. This is achieved by decomposing the
sentence probability into a product of conditional probabilities using the chain rule:
P(s) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1) = ∏i P(wi | hi), where hi is the history w1 ... wi-1.
An n-gram model estimates P(wi | hi) by modelling language as a Markov model of order n-1, i.e., by looking at the previous n-1 words only.
A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of
each event depends only on the state attained in the previous event.
The model that limits the history to one previous word is termed the bi-gram (n = 2) model.
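A bigram model can be sketched over a toy corpus (the corpus and the sentence markers <s>, </s> are made up for illustration):

```python
from collections import Counter

corpus = [
    "<s> i saw a van </s>",
    "<s> i saw the man </s>",
    "<s> the man saw me </s>",
]

# Count unigrams and bigrams over the corpus.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(sent):
    """Chain-rule product of bigram probabilities."""
    words = sent.split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_bigram("saw", "i"))              # count(i saw) / count(i) = 2/2 = 1.0
print(p_sentence("<s> i saw a van </s>"))
```

Note that any unseen bigram gets probability 0, which zeroes out the whole sentence probability; this is the zero-probability problem that smoothing techniques address.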
N-Gram Models
Exercise: Consider a corpus C1 of 4 sentences (shown on the slide). What is the total count of unique bi-grams for which
the likelihood will be estimated? Assume we do not perform any pre-processing. (Answer: 24.)
With tri-grams or four-grams the history grows, and it becomes very difficult to match that sequence of words in the
corpus: the probability of occurrence of larger collections of words is very small. The bi-gram model is used to overcome
this problem when finding out which statements are probable in a corpus.
When an n-gram never occurs in the corpus at all, its estimated probability is zero; this leads to the zero-probability
problem.
Advantages of the N-gram model:
Easy to understand and implement for any value of n.