
Natural Language Processing

Unit 1

UNIT-1 Syllabus

Overview and language modeling: Overview: Origins and challenges of NLP; Language and Grammar; Processing Indian Languages; NLP Applications; Information Retrieval.
Language Modeling: Various Grammar-based Language Models; Statistical Language Model.
What is NLP?
 NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence.
 Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
 NLP processes the information contained in natural language text.
 It is also known as Computational Linguistics (CL), Human Language Technology (HLT), and Natural Language Engineering (NLE).
What is NLP?
 It is the technology used by machines to understand, analyze, manipulate, and interpret human languages.
 It helps developers organize knowledge and perform tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
 NLP is concerned with the development of computational models of aspects of human language processing.
 Main reasons for NLP:
 To develop automated tools for language processing.
 To gain a better understanding of human communication.
NLP
 Building computational models with human language processing abilities requires:
 Knowledge of how humans acquire, store, and process language.
 Knowledge of the world and of language.
 Two major approaches to NLP:
Rationalist approach: a significant part of the knowledge in the human mind is not derived from the senses but is fixed in advance, presumably by genetic inheritance.
Empiricist approach: the brain is able to perform association, pattern recognition, and generalization, and thus the structures of natural language can be learned.
 Linguistics is the scientific study of language. It deals with the analysis of every aspect of language, as well as the methods for studying and modelling them.
Origins of NLP
 Theoretical linguists identify the rules that describe and restrict the structure of languages (grammar).
 Theoretical linguistics mainly provides a structural description of natural language and its semantics.
 Psycholinguistics explains how humans produce and comprehend natural language.
 Psycholinguists are interested in the representation of linguistic structures as well as in the processes by which these structures are produced.
 Computational linguistics is concerned with the study of language using computational models of linguistic phenomena.
 It deals with the application of linguistic theories and computational techniques to NLP.
Computational models may be broadly classified as:
 Knowledge-driven
 Data-driven
Knowledge-driven models rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules.
Data-driven models presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. The amount of human effort is smaller, and the performance of these systems depends on the quantity of the data.
Why NLP is Hard
Lexical ambiguity:
● Will Will will Will’s will?
● Rose rose to put rose roes on her rows of roses.
● Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. → Buffaloes from Buffalo, NY, whom buffaloes from Buffalo bully, bully buffaloes from Buffalo.
Structural ambiguity:
● The man saw the boy with the binoculars.
● Flying planes can be dangerous.
● Hole found in the room wall; police are looking into it.
Language imprecision and vagueness:
● It is very warm here.
● Q: Did your mother call your aunt last night? A: I’m sure she must have.
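Lexical ambiguity can be made concrete with WordNet, which enumerates the distinct senses of a word. The following is an illustrative sketch (not part of the original slides); it assumes NLTK is installed and its wordnet corpus can be downloaded:

```python
# A minimal sketch of lexical ambiguity using NLTK's WordNet interface.
# Assumption: nltk is installed and the 'wordnet' corpus can be fetched.
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet corpus if missing
from nltk.corpus import wordnet as wn

# "bank" is lexically ambiguous: WordNet lists many distinct senses.
for synset in wn.synsets("bank")[:4]:
    print(synset.name(), "->", synset.definition())
```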
Why NLP is Hard
Non-standard English:
● Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either
New senses of a word:
● That’s sick dude!
● Giants
Tricky entity names:
● Where is A Bug’s Life playing ...
● Let It Be was recorded ...
Neologisms:
● unfriend
● retweet
● Google/Skype/photoshop
Why NLP is Hard
Non-standard abbreviations:
● See you, I will text you later → C U I’ll txt U L8R
Phases of NLP
Lexical and Morphological Analysis
The first phase of NLP is lexical analysis. This phase scans the source text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.
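As an illustrative sketch of this step (assuming NLTK is installed with its punkt tokenizer models; newer NLTK versions may additionally need the punkt_tab resource):

```python
# A minimal sketch of lexical analysis: splitting raw text into
# sentences and then into word tokens with NLTK.
import nltk
nltk.download("punkt", quiet=True)  # sentence/word tokenizer models

text = "Ram's iPhone cannot convert the video. The file is .mkv."
sentences = nltk.sent_tokenize(text)                 # text -> sentences
words = [nltk.word_tokenize(s) for s in sentences]   # sentences -> words
print(sentences)
print(words)
```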
Syntactic Analysis (Parsing)
Syntactic analysis checks the grammar and word arrangement of a sentence and shows the relationships among the words.
Phases of NLP
 Semantic Analysis
Semantic analysis is concerned with meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.
 Discourse Integration
Discourse integration depends upon the sentences that precede a given sentence and also invokes the meaning of the sentences that follow it.
 Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps you discover the intended effect by applying a set of rules that characterize cooperative dialogues.
Language and Knowledge

 People use seven interdependent levels to understand and extract meaning from text or speech. In order to understand natural languages, it is important to distinguish among them:
1- Phonetic or phonological level: deals with pronunciation.
2- Morphological level: deals with the smallest parts of words that carry meaning, including suffixes and prefixes.
3- Lexical level: deals with the lexical meaning of a word.
4- Syntactic level: deals with grammar and the structure of sentences.
5- Semantic level: deals with the meaning of words and sentences.
6- Discourse level: deals with the structure of different kinds of text.
7- Pragmatic level: deals with knowledge that comes from the outside world, i.e., from outside the content of the document.
Morphological Analysis
During morphological analysis, each particular word is analyzed. Non-word tokens such as punctuation are removed, and the remaining words are assigned categories.
For instance: Ram’s iPhone cannot convert the video from .mkv to .mp4. In morphological analysis the sentence is analyzed word by word: Ram is a proper noun, the ’s in Ram’s is a possessive suffix, and .mkv and .mp4 are file extensions.
For example, swims and swim’s are analyzed differently: the same -s element can mark a third-person singular verb or a plural/possessive form.
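A small illustrative sketch of morphological analysis with NLTK (an assumption for demonstration, not from the slides; it requires the wordnet and omw-1.4 corpora), contrasting crude stemming with dictionary-based lemmatization:

```python
# Morphological analysis sketch: strip affixes with a stemmer and a lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)   # lexical database for the lemmatizer
nltk.download("omw-1.4", quiet=True)   # multilingual wordnet data (some versions)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["swims", "starting", "started"]:
    print(word,
          "| stem:", stemmer.stem(word),                   # rule-based suffix stripping
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # dictionary form, as a verb
```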
Discourse Integration
While processing a language, one major ambiguity that can arise is referential ambiguity: the ambiguity that arises when the referent of a word cannot be determined.
For example:
Ram won the race.
Mohan ate half of a pizza.
He liked it.
Resolving "he" and "it" requires knowledge of the world.
Syntactic Analysis
 There are different rules for different languages; violating these rules gives a syntax error. Here the sentence is transformed into a structure that represents the correlations between the words. This correlation might violate the rules occasionally. The syntax is the set of rules the language has to follow; for example, "To the movies, we are going." will give a syntax error.
 Syntactic analysis uses the results given by morphological analysis to develop the description of the sentence. The sentence, divided into categories by the morphological process, is aligned into a defined structure. This process is called parsing. For example, "the cat chases the mouse in the garden" would be represented as:
[Figure: parse tree for "the cat chases the mouse in the garden"]
Here the sentence is broken down according to the categories and described as a hierarchical structure with sentence units as nodes. These parse trees are built during syntax analysis; if any error arises, processing stops and a syntax error is displayed. Parsing can be top-down or bottom-up:
Top-down: starts with the start symbol and parses the sentence according to the grammar rules until each terminal in the sentence is derived.
Bottom-up: starts with the sentence to be parsed and applies the rules backwards until the start symbol is reached.
A runnable parsing sketch follows below.
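To make parsing concrete, here is an illustrative sketch using NLTK on the slide's example sentence. The grammar rules below are hypothetical, written just for this demo:

```python
# Parse the slide's example sentence with a toy phrase structure grammar.
import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det N | Det N PP
  VP -> V NP | V NP PP
  PP -> P NP
  Det -> 'the'
  N  -> 'cat' | 'mouse' | 'garden'
  V  -> 'chases'
  P  -> 'in'
""")

sentence = "the cat chases the mouse in the garden".split()
parser = nltk.ChartParser(grammar)  # NLTK also offers RecursiveDescentParser
                                    # (top-down) and ShiftReduceParser (bottom-up)
for tree in parser.parse(sentence):
    tree.pretty_print()
```

This tiny grammar also illustrates structural ambiguity: the parser finds two trees, one attaching "in the garden" to the verb phrase and one attaching it to "the mouse".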
Why NLP is difficult?
 NLP is difficult because ambiguity and uncertainty exist in language.
 There are three types of ambiguity:
 Lexical Ambiguity
Lexical ambiguity exists when two or more possible meanings of a sentence reside within a single word.
 Syntactic Ambiguity
Syntactic ambiguity exists when a sentence has two or more possible meanings (parses).
 Referential Ambiguity
Referential ambiguity exists when something is referred to using a pronoun and the referent cannot be determined.
Advantages of NLP
NLP helps users ask questions about any subject and get a direct response within seconds.
NLP offers exact answers to questions; it does not return unnecessary or unwanted information.
NLP helps computers communicate with humans in their own languages.
It is very time efficient.
Disadvantages of NLP
NLP may require more keystrokes.
An NLP system is typically unable to adapt to a new domain; it has limited functionality, which is why an NLP system is usually built for a single, specific task.
Components of NLP
 Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine understand and analyze human language by extracting metadata from content, such as concepts, entities, keywords, emotions, relations, and semantic roles.
 NLU is mainly used in business applications to understand the customer's problem, in both spoken and written language.
NLU involves the following tasks:
It is used to map the given input into a useful representation.
It is used to analyze different aspects of the language.
Components of NLP
Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that
converts the computerized data into natural language
representation. It mainly involves Text planning, Sentence
planning, and Text Realization.

Applications of NLP
Question Answering
Question answering focuses on building systems that automatically answer questions asked by humans in a natural language.
Spam Detection
Spam detection is used to detect unwanted e-mails reaching a user's inbox.
Applications of NLP
Sentiment Analysis
Sentiment analysis is also known as opinion mining.
It is used on the web to analyze the attitude, behavior, and emotional state of the sender.
This application is implemented through a combination of NLP (Natural Language Processing) and statistics: values are assigned to the text (positive, negative, or neutral) to identify the mood of the context (happy, sad, angry, etc.). An illustrative sketch follows below.
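A minimal sentiment-analysis sketch using TextBlob, one of the libraries listed later in this unit (an assumption for illustration; it requires the textblob package to be installed):

```python
# Assign a polarity score to text and map it to a sentiment label.
from textblob import TextBlob

for text in ["I love this phone!", "This movie was terrible."]:
    polarity = TextBlob(text).sentiment.polarity  # -1 (negative) .. +1 (positive)
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{text!r}: polarity={polarity:+.2f} -> {label}")
```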
Applications of NLP

Machine Translation
Machine translation is used to translate text or speech from one natural language to another.
Spelling Correction
 Microsoft Corporation provides word-processor software such as MS Word and PowerPoint with spelling correction.
Applications of NLP
Chatbot
Implementing a chatbot is one of the important applications of NLP. It is used by many companies to provide chat services to their customers.
Applications of NLP

Information extraction
Information extraction is one of the most important applications of NLP. It is used for extracting structured information from unstructured or semi-structured machine-readable documents.
Natural Language Understanding (NLU)
It converts large sets of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate.
Challenges of NLP
 Factors that make NLP difficult:
 Problems of representation and interpretation:
 Natural language is highly ambiguous and vague, so it is quite difficult to embody all the sources of knowledge that humans use to process language.
 Identifying the semantics of language is hard.
 Words alone do not make a sentence. Instead, it is the words together with their syntactic and semantic relations that give meaning to a sentence.
 Alas! They won.
 New words are added continually, and existing words are introduced in new contexts.
 Example: TV channels use 9/11 to refer to the terrorist attack on the World Trade Center.
 The only way a machine can learn the meaning of a specific word in a message is by considering its context, unless some explicitly coded general world or domain knowledge is available. The context of a word is defined by its co-occurring words.
Challenges of NLP
 Idioms, metaphors, and ellipses add more complexity to identifying the meaning of written text.
 Idiom: a group of words established by usage as having a meaning not deducible from those of the individual words.
 Example idiom: It's a piece of cake (meaning: it's easy).
 Metaphor: a figure of speech that describes an object or action in a way that isn't literally true, but helps explain an idea or make a comparison.
 Example: Laughter is the music of the soul.
 Ellipsis: use an ellipsis to show an omission, or leaving out, of a word or words in a quote. Use ellipses to shorten the quote without changing its meaning.
Challenges of NLP
 For example: "After school I went to her house, which was a few blocks away, and then
came home."
 Shorten the quote by replacing a few words with an ellipsis. Remember, the meaning of
the quote should not change.
 "After school I went to her house … and then came home."
 We removed the words "which was a few blocks away" and replaced them with an ellipsis
without changing the meaning of the original quote.
 Quantifier scoping is another problem. Scope of quantifiers is often not clear and poses
problem in automatic processing.
 Example: There are many things to do today.
 We have a lot of time left, don’t worry
Dr.A.Ajina RIT
Challenges of NLP
 Ambiguity of natural language is another difficulty:
 As humans, we are aware of the context and of current cultural knowledge, as well as of the language and its traditions, and we use all of these to work out meaning. However, incorporating contextual and world knowledge poses the greatest difficulty in language computing.
 There are various sources of ambiguity in natural language:
 Ambiguity at the word level (lexical ambiguity)
 A word can be ambiguous; it may represent a noun or a verb.
 Example: can, bunk, cat, etc.
 Sentence-level ambiguity (structural ambiguity)
 Example: Stolen rifle found by the tree.
 A number of grammars have been proposed to describe the structure of sentences. However, there are an infinite number of ways to generate sentences, which makes writing grammar rules, and the grammar itself, extremely complex.
NLP APIs
 Natural Language Processing APIs allow developers to integrate human-to-
machine communications and complete several useful tasks such as speech
recognition, chatbots, spelling correction, sentiment analysis, etc.
 A list of NLP APIs is given below:
IBM Watson API
Chatbot API
Speech to text API
Sentiment Analysis API
Translation API by SYSTRAN
Text Analysis API by AYLIEN
Cloud NLP API
Google Cloud Natural Language API
NLP Libraries
 Scikit-learn:
It provides a wide range of algorithms for building machine learning models in Python.
 Natural Language Toolkit (NLTK):
NLTK is a complete toolkit for all NLP techniques.
 Pattern:
It is a web mining module for NLP and machine learning.
 TextBlob:
It provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
NLP Libraries
Quepy:
Quepy is used to transform natural language questions into queries in a database query language.
SpaCy:
SpaCy is an open-source NLP library used for data extraction, data analysis, sentiment analysis, and text summarization.
Gensim:
Gensim works with large datasets and processes data streams.
How to build an NLP pipeline
Step 1: Sentence Segmentation
Step 2: Word Tokenization
Step 3: Stemming
Step 4: Lemmatization
Step 5: Identifying Stop Words
Step 6: Dependency Parsing
Step 7: POS Tagging
Step 8: Named Entity Recognition (NER)
Step 9: Chunking
A minimal sketch of this pipeline appears below.
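The steps above can be sketched with spaCy, one of the libraries listed in this unit. This is an illustrative sketch, assuming spaCy and its small English model en_core_web_sm are installed; note that spaCy performs lemmatization but has no built-in stemmer (NLTK provides stemmers for Step 3):

```python
# A minimal NLP pipeline sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google was founded in 1998. It is based in California.")

for sent in doc.sents:                      # Step 1: sentence segmentation
    print("Sentence:", sent.text)

for token in doc:                           # Steps 2-7, per token
    print(token.text,                       # word tokenization
          token.lemma_,                     # lemmatization
          token.pos_,                       # POS tag
          token.dep_,                       # dependency relation
          token.is_stop)                    # stop-word flag

for ent in doc.ents:                        # Step 8: named entity recognition
    print(ent.text, ent.label_)

for chunk in doc.noun_chunks:               # Step 9: (noun-phrase) chunking
    print(chunk.text)
```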
Language and Grammar
Automatic processing of language requires the rules and exceptions of a language to be explained to the computer.
● Grammar defines the language.
● It consists of a set of rules that allow us to parse and generate sentences in a language. These rules relate information to coding devices at the language level, not at the world-knowledge level.
Main hurdles:
The constantly changing nature of languages and the presence of a large number of language exceptions.
Efforts to provide specifications for the language have led to many grammars:
● Phrase Structure Grammar
● Transformational Grammar
● Lexical Functional Grammar
● Generalized Phrase Structure Grammar
● Dependency Grammar
● Paninian Grammar
● Tree-adjoining Grammar
Language and Grammar

Of the many grammars proposed, Transformational Grammar was identified as the most influential.
● Noam Chomsky proposed Transformational Grammar and suggested that each sentence in a language has two levels of representation, namely a deep structure and a surface structure.
● The mapping of deep structure to surface structure is carried out by transformations.
● A deep structure can be transformed in a number of ways to yield many different surface-level representations.
● Sentences with different surface-level representations but the same meaning share a common deep-level representation.
Transformations change the structure but not the meaning; hence it is also called Transformational Generative Grammar.
Language and Grammar
● English is an SVO language.
● Transformational grammar has three components:
● Phrase structure grammar
● Transformational rules
● Morphophonemic rules: these rules match each sentence representation to a string of phonemes.
● Each of these components consists of a set of rules.
● Phrase structure grammar consists of a set of rules that generate natural language sentences and assign a structural description to them.
● Sentences that can be generated using these rules are termed grammatical.
Phrase structure grammar

● A grammar is a collection of rules that defines a language as a set of allowable strings of words.
● Rules cover allowable characters, words, and sentences.
● Lexical categories (parts of speech), such as noun or adjective, are categories of words.
● Syntactic categories, such as noun phrase or verb phrase, string together lexical categories.
● Phrase structure combines these syntactic categories into trees representing the phrase structure of sentences: nested phrases, each marked with a category.
Language and Grammar
Transformational rules are applied to the terminal string generated by the phrase structure rules.
They can be used to transform one phrase marker into another.
These rules are used to transform one surface representation into another (e.g., an active sentence into a passive one).
The rule relating active and passive sentences (as given by Chomsky) is:
(s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)
This rule says that if the input has structure s1, it can be transformed into s2.
Transformational rules can be obligatory or optional.
Obligatory rules: ensure, for example, agreement in number between subject and verb.
Optional rules: modify the structure of a sentence while preserving its meaning.
Language and Grammar
Morphophonemic rules match each sentence representation to a string of phonemes.
A phoneme is, in linguistics, the smallest unit of speech distinguishing one word (or word element) from another, such as the element p in "tap", which separates that word from "tab", "tag", and "tan".
Consider the sentence:
The police will catch the snatcher
(s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)

[Figure: structure obtained by applying the phrase structure rules to the example sentence]
Language and Grammar

● Applying the phrase structure rules assigns the structure to the sentence.
● The passive transformation rule then converts the sentence into:
the + snatcher + will + be + en + catch + by + the + police
● Another transformational rule then reorders ‘en+catch’ to ‘catch+en’, and subsequently one of the morphophonemic rules converts ‘catch+en’ to ‘caught’.
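As a toy illustration (an assumption for demonstration, not part of the original material), the passive rule can be mimicked as a rewrite over a flat [NP1, Aux, V, NP2] structure:

```python
# Toy rendering of Chomsky's passive rule:
# NP1 - Aux - V - NP2  -->  NP2 - Aux + be + en - V - by + NP1
def passivize(np1, aux, v, np2):
    """Rewrite an active [NP1, Aux, V, NP2] sequence as its passive form."""
    return [np2, aux, "be", "en+" + v, "by", np1]

active = ("the police", "will", "catch", "the snatcher")
print(" + ".join(passivize(*active)))
# -> the snatcher + will + be + en+catch + by + the police
# A later transformational rule reorders en+catch to catch+en; a
# morphophonemic rule then realizes catch+en as "caught".
```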
Language Modelling

 Model: a description of some complex entity or process. A language model is a description of language.
 Natural language is a complex entity; in order to process it through a computer-based program, we need to build a representation of it, known as language modelling.
 A language model can be either grammar-based or based on probability estimation.
Grammar-based Language Model
● It uses the grammar of a language to create the model.
● It attempts to represent the syntactic structure of the language.
● The grammar consists of hand-coded rules defining the structure and ordering of various constituents.
Statistical Language Modelling
● Creates the language model by training it on a corpus; the corpus needs to be sufficiently large.
● Statistical language modelling is one of the fundamental tasks in many NLP applications, including speech recognition, spelling correction, handwriting recognition, and machine translation.
Various Grammar-based Language Models
This section introduces various approaches to describing a language in a grammatical, rule-based format.
1. Generative Grammars:
● Noam Chomsky suggested that we can generate the sentences of a language if we know its collection of words and rules. Only those sentences that can be generated as per the rules are grammatical.
● A complete set of rules that can generate all possible sentences in a language provides a model of the language (it deals only with structure, not meaning).
Language Modelling
2. Hierarchical Grammar
● Chomsky described classes of grammars in a hierarchical manner, where the top layer contains the grammars represented by its subclasses.
● Type 0 contains Type 1 grammars, which in turn contain Type 2 grammars, which in turn contain Type 3 grammars.
● This can be extended to describe grammars at various levels, as in a class-subclass relationship.
Language Modelling -Applications

Context-Sensitive Spelling Correction
The office is about fifteen minuets from my house
Use a language model:
P(about fifteen minutes from) > P(about fifteen minuets from)
Language Modelling -Applications

Speech Recognition
 P(I saw a van) >> P(eyes awe of an)
Machine Translation
 Which sentence is more plausible in the target language?
 P(high winds) > P(large winds)
Completion Prediction
 A language model also supports predicting the completion of a sentence:
 Please turn off your cell ...
 Your program does not ...
 Predictive text input systems can guess what you are typing and give choices on how to complete it.
N-Gram Model
 A statistical language model is a probability distribution P(s) over all possible word sequences s.
 The dominant approach in statistical language modelling is the n-gram model.
 The goal of a statistical model is to estimate the probability (likelihood) of a sentence. This is achieved by decomposing the sentence probability into a product of conditional probabilities using the chain rule:

P(s) = P(w1 w2 . . . wn) = ∏ i=1..n P(wi | hi), where hi = w1 . . . wi−1 is the history of word wi.

For example:
P(about fifteen minutes from) = P(about) × P(fifteen | about) × P(minutes | about fifteen) × P(from | about fifteen minutes)

 To calculate a sentence probability, we need the probability of each word given the sequence of words preceding it.
 The n-gram model estimates P(wi | hi) by modelling language as a Markov model of order n−1, i.e., by looking at the previous n−1 words only.
 A Markov chain (or Markov process) is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
 A model that limits the history to one word only (n = 2) is termed a bigram model.
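To make the chain rule and the Markov assumption concrete, here is an illustrative sketch (not from the slides; the corpus and names are assumptions) that estimates bigram probabilities by maximum likelihood, P(wi | wi−1) = count(wi−1 wi) / count(wi−1):

```python
# Bigram sentence probability on a toy corpus.
from collections import Counter

corpus = [["I", "saw", "a", "van"], ["I", "saw", "a", "cat"]]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def sentence_probability(words):
    p = unigrams[words[0]] / sum(unigrams.values())   # P(w1), crude unigram estimate
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]    # Markov assumption: 1 word of history
    return p

print(sentence_probability(["I", "saw", "a", "van"]))  # 0.125 on this toy corpus
```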
N-Gram Models
P(office | about fifteen minutes from)
An n-gram model uses only N − 1 words of prior context:
● Unigram: P(office)
● Bigram: P(office | from)
● Trigram: P(office | minutes from)
Markov models and language models:
● An n-gram model is an (N − 1)-order Markov model.
● Unigram model: no history is used, so the model considers only each word's own probability; whichever word has the highest probability in the corpus is preferred.
● Bigram model: one word of history is used (see the example below).
Bi-grams
1. About five minutes from
2. Consider the following corpus C1 of 4 sentences. What is the total count of unique bi-grams for which the likelihood will be estimated? Assume we do not perform any pre-processing. (A counting sketch follows below.)
● today is Sneha’s birthday
● She likes ice cream
● She is also fond of cream cake
● We will celebrate her birthday with ice cream cake
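To check exercise 2, a small sketch that counts the unique bi-grams in C1. It assumes bi-grams are collected within each sentence, case-sensitively, with no sentence-boundary markers (consistent with "no pre-processing"); under these assumptions the count is 18, since ('ice', 'cream') and ('cream', 'cake') each occur twice:

```python
# Count unique within-sentence bigrams in corpus C1 (no pre-processing).
corpus_c1 = [
    "today is Sneha's birthday",
    "She likes ice cream",
    "She is also fond of cream cake",
    "We will celebrate her birthday with ice cream cake",
]

unique_bigrams = set()
for sentence in corpus_c1:
    words = sentence.split()
    unique_bigrams.update(zip(words, words[1:]))  # adjacent word pairs

print(len(unique_bigrams))  # 18: of 20 bigram tokens, 2 pairs repeat
```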
 If we consider trigrams or four-grams, the history grows, and it becomes very difficult to match that sequence of words in the corpus.
 The probability of occurrence of a larger collection of words is very small; to overcome this problem, the bigram model is used.
 When the model is used to find out which statements are probable in the corpus, unseen word sequences lead to the zero-probability problem.
Advantages of the n-gram model:
Easy to understand and implement for any n.
Drawbacks of the n-gram model:
 Underflow due to multiplication of probabilities, especially in long sentences.
Solution: use log probabilities and add them.
 The zero-probability problem.
Solution: use Laplace (add-one) smoothing.
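A minimal sketch of both fixes (an illustration on an assumed toy corpus): log-space addition avoids underflow, and add-one (Laplace) smoothing avoids zero probabilities for unseen bigrams:

```python
# Log-probability of a sentence under an add-one-smoothed bigram model.
import math
from collections import Counter

corpus = [["she", "likes", "ice", "cream"], ["we", "like", "cream", "cake"]]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter(b for s in corpus for b in zip(s, s[1:]))
V = len(unigrams)  # vocabulary size

def log_prob(words):
    """Sum of log bigram probabilities with Laplace (add-one) smoothing."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)  # add-one smoothing
        total += math.log(p)  # add logs instead of multiplying probabilities
    return total

print(log_prob(["she", "likes", "cream", "cake"]))  # finite even for unseen pairs
```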
Processing Indian Languages
● Indian languages make extensive and productive use of complex predicates (CPs).
A complex predicate is a combination of two lexical items. The first and second lexical items of a complex predicate are called the polar and the vector respectively. The way these two lexical items come together to form a CP is quite interesting to examine. Consider the CP పాఠ మాడు [paaTa maaDu] ‘teach a lesson’: the first constituent పాఠ [paaTa] ‘lesson’ is the polar, and the second constituent మాడు [maaDu] ‘do’ is the vector.
● Indian languages use post-position case markers instead of prepositions (example: "on the table").
● Indian languages use verb complexes consisting of sequences of verbs (e.g., gaa rahaa hai, rahii hai).
Processing Indian Languages
● Unlike English, Indic scripts have a non-linear structure.
● Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence structure.
● Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing its meaning:
मैं फल खाता हूँ। (main phal khaataa huun.) (S + O + V)
मैं खाता हूँ फल। (main khaataa huun phal.) (S + V + O)
● Spelling standardization is more subtle in Hindi than in English (standardization rules for spelling).
● Indian languages have a relatively rich set of morphological variants (a morpheme is the minimum meaningful unit; for example: start, starts, starting, started, etc.).
