Introduction To NLP

Natural Language Processing

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

Natural language processing is concerned with the development of computational models of aspects of human language processing.
Main reasons for NLP:
1. To develop automated tools for language processing
2. To gain a better understanding of human communication
Building computational models with human language processing abilities requires:
● Knowledge of how humans acquire, store, and process language.
● Knowledge of the world and of language.

Two major approaches to NLP:

● Rationalist Approach: a significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance.
● Empiricist Approach: the brain is able to perform association, pattern recognition, and generalization, and thus the structures of natural language can be learned.

Linguistics is the scientific study of language. It deals with the analysis of every aspect of language, as well as the methods for studying and modelling them.

Origins of NLP

Theoretical linguists identify rules that describe and restrict the structure of languages (grammar). Theoretical linguistics mainly provides structural descriptions of natural language and its semantics.

Psycholinguists explain how humans produce and comprehend natural language. They are interested in the representation of linguistic structures as well as in the processes by which these structures are produced.

Computational linguistics is concerned with the study of language using computational models of linguistic phenomena. It deals with the application of linguistic theories and computational techniques to NLP. Computational models may be broadly classified as:
● Knowledge driven
● Data driven

Knowledge driven: these models rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules.

Data driven: these models presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. They require less human effort, and their performance depends on the quantity of the data.
People use seven interdependent levels to understand and extract meaning from text or spoken words. In order to understand natural languages, it is important to distinguish among them:

1- Phonetic or phonological level: deals with pronunciation

2- Morphological level: deals with the smallest parts of words that carry meaning, such as suffixes and prefixes.

3- Lexical level: deals with lexical meaning of a word.

4- Syntactic level: deals with grammar and structure of sentences.

5- Semantic level: deals with the meaning of words and sentences.

6- Discourse level: deals with the structure of different kinds of text.

7- Pragmatic level: deals with the knowledge that comes from the outside world, i.e., from outside the content of the document.
1. Morphological Analysis:
During morphological analysis, each particular word is analyzed. Non-word tokens such as punctuation are separated from the words, and the remaining words are assigned categories.
For instance, take "Ram's iPhone cannot convert the video from .mkv to .mp4." The sentence is analyzed word by word: "Ram" is a proper noun, the 's in "Ram's" is a possessive suffix, and ".mkv" and ".mp4" are file extensions, which behave as adjectives in this example.
Each word is assigned a syntactic category. This is an important step, because the judgement of prefixes and suffixes depends on the syntactic category of the word. For example, "swims" and "swim's" are different: the -s suffix marks a plural noun or a third-person singular verb, while the 's marks a possessive. If a prefix or suffix is incorrectly interpreted, the meaning and understanding of the sentence change completely. The interpretation assigns a category to the word and thereby discards the uncertainty from it.
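As a minimal illustration of this step, the following sketch tokenizes the example sentence and assigns a syntactic category to each token using NLTK. This is only one possible realization of morphological analysis, and it assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded.

    # Word-level analysis sketch with NLTK (pip install nltk, plus one-time
    # downloads of 'punkt' and 'averaged_perceptron_tagger').
    import nltk

    sentence = "Ram's iPhone cannot convert the video from .mkv to .mp4."

    tokens = nltk.word_tokenize(sentence)   # splits "Ram's" into "Ram" + "'s"
    tagged = nltk.pos_tag(tokens)           # assigns a category to each token

    for word, tag in tagged:
        print(f"{word:10s} -> {tag}")       # e.g. Ram -> NNP, 's -> POS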
2. Syntactic Analysis:
There are different grammar rules for different languages, and violating these rules gives a syntax error. Here the sentence is transformed into a structure that represents the correlation between the words. Syntax is the set of rules that sentences of the language have to follow; for example, "To the movies, we are going." will give a syntax error. Syntactic analysis uses the results given by morphological analysis to develop the description of the sentence: the words, already divided into categories by the morphological process, are aligned into a defined structure. This process is called parsing. For example, "the cat chases the mouse in the garden" would be represented as a parse tree.
Here the sentence is broken down according to the categories and then described in a hierarchical structure with nodes as sentence units. These parse trees are built while the syntactic analysis runs; if any error arises, the processing stops and a syntax error is displayed. Parsing can be top-down or bottom-up (a sketch of both follows this list):
○ Top-down: starts with the start symbol and parses the sentence according to the grammar rules until each of the terminals in the sentence is derived.
○ Bottom-up: starts with the sentence to be parsed and applies all the rules backwards until the start symbol is reached.
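A minimal sketch of both strategies, using NLTK's toy context-free grammar machinery. The grammar below is an assumption written for this one sentence, not a general grammar of English.

    # Top-down vs. bottom-up parsing of "the cat chases the mouse in the garden"
    # with a toy context-free grammar (pip install nltk).
    import nltk

    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | Det N PP
        VP -> V NP | V NP PP
        PP -> P NP
        Det -> 'the'
        N  -> 'cat' | 'mouse' | 'garden'
        V  -> 'chases'
        P  -> 'in'
    """)

    tokens = "the cat chases the mouse in the garden".split()

    # Top-down: expand from the start symbol S until the terminals are derived.
    for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):
        print(tree)

    # Bottom-up: combine words into larger constituents until S is reached.
    for tree in nltk.parse.BottomUpChartParser(grammar).parse(tokens):
        print(tree)

Each parser returns two trees, because "in the garden" can attach either to the verb phrase or to the noun phrase; this is exactly the structural ambiguity discussed later in these notes.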
3. Semantic Analysis:
Semantic analysis looks after the meaning. It allocates a meaning to each of the structures built by the syntactic analyzer, and every syntactic structure and its objects are mapped into the task domain. If the mapping is possible, the structure is accepted; if not, it is rejected. For example, "hot ice-cream" gives a semantic error. During semantic analysis two main operations are executed (a sketch follows this list):
○ First, each separate word is mapped to appropriate objects in the database. The dictionary meaning of every word is found; a word might have more than one meaning.
○ Second, all the meanings of the different words are integrated to find a proper correlation between the word structures. This process of determining the correct meaning is called lexical disambiguation, and it is done by associating each word with its context.
The process defined above can determine the partial meaning of a sentence. However, semantics and syntax are two distinct concepts: it is possible for a syntactically correct sentence to be semantically incorrect.
For example, "A rock smelled the colour nine." is syntactically correct, as it obeys all the rules of English, but it is semantically incorrect. Semantic analysis verifies that a sentence abides by these constraints and creates correct information.
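One classical and very simple approach to lexical disambiguation is sketched below: look up the dictionary senses of an ambiguous word in WordNet and pick one with the Lesk algorithm, which scores each sense by its overlap with the context words. This is an illustrative choice, not the only way to disambiguate.

    # Lexical disambiguation sketch: WordNet senses + the Lesk algorithm
    # (pip install nltk, plus the 'wordnet' corpus download).
    from nltk.corpus import wordnet
    from nltk.wsd import lesk

    # A word might have more than one dictionary meaning.
    for sense in wordnet.synsets('bank'):
        print(sense.name(), '-', sense.definition())

    # Lesk picks the sense whose definition overlaps most with the context.
    context = "I deposited the cheque at the bank this morning".split()
    print(lesk(context, 'bank'))  # one Synset chosen from the candidates above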
4. Discourse Integration:
While processing a language, one major ambiguity that can arise is referential ambiguity: the ambiguity that occurs when the referent of a word cannot be determined. For example,

Ram won the race.
Mohan ate half of a pizza.
He liked it.

In the above example, "He" can refer to Ram or to Mohan, which creates an ambiguity. The word "He" depends on both of the preceding sentences. This is what discourse integration means: an individual sentence relies upon the sentences that come before it, just as the third sentence above relies on the two before it. The goal of this step is to remove referential ambiguity, and doing so requires knowledge of the world.
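A deliberately naive sketch of one heuristic for referential ambiguity is shown below: resolve a pronoun to the most recently mentioned compatible entity. The capitalization test and the recency rule are assumptions for illustration; real coreference resolvers use syntax, semantics, and world knowledge.

    # Naive pronoun resolution: pick the most recent plausible antecedent.
    def resolve_pronoun(sentences, pronoun="He"):
        candidates = []
        for sentence in sentences:
            for word in sentence.rstrip(".").split():
                # Crude proper-noun test: capitalized and not the pronoun itself.
                if word[0].isupper() and word != pronoun:
                    candidates.append(word)
        # Recency heuristic: the most recently mentioned entity wins.
        return candidates[-1] if candidates else None

    text = ["Ram won the race.", "Mohan ate half of a pizza.", "He liked it."]
    print(resolve_pronoun(text))  # -> 'Mohan'; a human reader may well disagree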


Challenges of NLP
Factors that make NLP difficult:
Problems of representation and interpretation:
Natural language is highly ambiguous and vague, so it is quite difficult to embody all the sources of knowledge that humans use to process language.

Identifying the semantics of language:
Words alone do not make a sentence. Instead, it is the words together with their syntactic and semantic relations that give meaning to a sentence.
Alas! They won.

New words are added continually, and existing words are introduced in new contexts. For example, TV channels use "9/11" to refer to the terrorist attack on the World Trade Center.
The only way a machine can learn the meaning of a specific word in a message is by considering its context, unless some explicitly coded general world or domain knowledge is available. The context of a word is defined by its co-occurring words, as the sketch below illustrates.
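A minimal sketch of that idea: collect the words that appear within a fixed window around each occurrence of a target word. The window size and the toy corpus are assumptions.

    # The context of a word is the set of words that co-occur with it.
    from collections import Counter

    def context_counts(tokens, target, window=2):
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(t for t in tokens[lo:hi] if t != target)
        return counts

    corpus = ("the channel covered 9/11 today and 9/11 memorials "
              "were shown after the 9/11 coverage").split()
    print(context_counts(corpus, "9/11"))  # neighbours hint at the meaning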
Idioms, metaphors, and ellipses add more complexity to identifying the meaning of written text.
Idiom: a group of words established by usage as having a meaning not deducible from those of the individual words.
Example: "It's a piece of cake" (meaning: it's easy).

Metaphor: a figure of speech that describes an object or action in a way that isn't literally true, but helps explain an idea or make a comparison.
Example: "Laughter is the music of the soul."

Ellipsis: an ellipsis shows an omission, or leaving out, of a word or words in a quote. Ellipses shorten the quote without changing its meaning.
For example:
● "After school I went to her house, which was a few blocks away, and then came
home."
Shorten the quote by replacing a few words with an ellipsis. Remember, the meaning of
the quote should not change.
● "After school I went to her house … and then came home."
We removed the words "which was a few blocks away" and replaced them with an ellipsis
without changing the meaning of the original quote.

Quantifier scoping is another problem. The scope of quantifiers is often not clear and poses a problem in automatic processing.
Examples:
There are many things to do today.
We have a lot of time left, don't worry.
Ambiguity of natural language is another difficulty:

As humans, we are aware of the context and of current cultural knowledge, as well as of the language and its traditions, and we utilize these to process meaning. However, incorporating contextual and world knowledge poses the greatest difficulty in language computing.

There are various sources of ambiguity in natural language.

Ambiguity at the word level (lexical ambiguity):
A word can be ambiguous; for example, it may represent either a noun or a verb.
Examples: can, bunk, cat, etc.

Ambiguity at the sentence level (structural ambiguity):
Example: Stolen rifle found by the tree.

A number of grammars have been proposed to describe the structure of sentences. However, a language contains an infinite number of sentences, which makes writing grammar rules, and the grammar itself, extremely complex.
Language and Grammar
Automatic Processing of Language requires the rules and exceptions of a language to be explained to the
computer.

● Grammar defines the language.
● It consists of a set of rules that allows us to parse and generate sentences in a language. These rules relate information to coding devices at the language level, not at the world-knowledge level.

Main hurdle:
The constantly changing nature of languages and the presence of a large number of language exceptions.

Efforts to provide specifications for language have led to many grammars:
● Phrase Structure Grammar
● Transformational Grammar
● Lexical Functional Grammar
● Generalized Phrase Structure Grammar
● Dependency Grammar
● Paninian Grammar
● Tree-adjoining Grammar
Though many grammars were proposed, Transformational Grammar came to be regarded as the most influential.
● Noam Chomsky proposed Transformational Grammar and suggested that each sentence in a language has two levels of representation, namely a deep structure and a surface structure.
● Mapping of deep structure to surface structure is carried out by transformations.
● Deep structure can be transformed in a number of ways to yield many different
surface level representations.
● Sentences with different surface-level representations that have the same meaning share a common deep-level representation.

Transformations change the structure but not the meaning; the formalism is therefore also called Transformational Generative Grammar.
English is an SVO language.

Transformational grammar has three components:

● Phrase structure grammar
● Transformational rules
● Morphophonemic rules: these match each sentence representation to a string of phonemes

Each of these components consists of a set of rules.


Phrase structure grammar consists of a set of rules that generate natural language sentences and assign a structural description to them.
● Sentences that can be generated using these rules are termed grammatical.
Transformational rules are applied to the terminal string generated by the phrase structure rules.
● They can be used to transform one phrase marker into another phrase marker.
● These rules are used to transform one surface representation into another (e.g., an active sentence into a passive one).
● The rule relating active and passive sentences (as given by Chomsky):

(s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)

This rule says that if the input has the structure s1, it can be transformed into s2.

Transformational rules can be obligatory or optional.

● Obligatory rules ensure, for example, agreement in number between subject and verb.
● Optional rules modify the structure of a sentence while preserving its meaning.
Morphophonemic rules: match each sentence representation to a string of phonemes.

A phoneme, in linguistics, is the smallest unit of speech distinguishing one word (or word element) from another, as the element p in "tap," which separates that word from "tab," "tag," and "tan."

Consider the sentence:

The police will catch the snatcher.

● Application of the phrase structure rules will assign a structure to the sentence.
● The passive transformation rule (s1 --> s2 above) will then convert it into:
the + snatcher + will + be + en + catch + by + the + police
● Another transformational rule will then reorder 'en + catch' to 'catch + en', and subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught', yielding "The snatcher will be caught by the police." A toy sketch of this pipeline follows.
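The sketch below hard-codes this one derivation: the constituents arrive pre-segmented, the transformational rule is the s1 --> s2 pattern above, and the morphophonemic table has a single entry. A real grammar would derive the constituents by parsing rather than receive them ready-made.

    # Toy active-to-passive transformation:
    # NP1 - Aux - V - NP2  -->  NP2 - Aux + be + en - V - by + NP1

    # Morphophonemic rule table: 'en' + verb stem -> past participle.
    PARTICIPLES = {"catch": "caught"}

    def passivize(np1, aux, verb, np2):
        # Transformational rule: swap the NPs, insert be+en and the by-phrase.
        deep = [np2, aux, "be", "en", verb, "by", np1]
        print("after transformation:", " ".join(deep))
        # Reorder 'en'+V to V+'en', then apply the morphophonemic rule.
        participle = PARTICIPLES.get(verb, verb + "en")
        return " ".join([np2, aux, "be", participle, "by", np1])

    print(passivize("the police", "will", "catch", "the snatcher"))
    # -> "the snatcher will be caught by the police"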
Processing Indian Languages

● Unlike English, Indic scripts have a nonlinear structure.
● Unlike English, Indian languages have SOV as the default sentence structure.
● Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing its meaning:

मैं फल खाता हूँ। (main phaL khaaTaa huun.) (S + O + V)
मैं खाता हूँ फल। (main khaaTaa huun phaL.) (S + V + O)
● Spelling standardization is more subtle in Hindi than in English (standardization rules for spelling).
● Indian languages have a relatively rich set of morphological variants (a morpheme is the minimum meaningful unit; for example: start, starts, starting, started), as the stemming sketch after this list illustrates.
● Indian languages make extensive and productive use of complex predicates (CPs). A complex predicate is a combination of two lexical items. The first and second lexical items are called the polar and the vector respectively, and the way these two items come together to form a CP is quite interesting to examine. Consider the CP paaTa maaDu 'teach a lesson': the first constituent, paaTa 'lesson', is the polar, and the second constituent, maaDu 'do', is the vector.

● Indian languages use postposition case markers instead of prepositions (for example, English "on the table" corresponds to Hindi "mez par", literally "table on").
● Indian languages use verb complexes consisting of sequences of verbs (gaa raha hai, gaa rahi hai).
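A minimal sketch of collapsing such morphological variants to a common stem, using NLTK's Porter stemmer. Porter is an English stemmer, used here only because the example words are English; Indian-language morphology needs language-specific analyzers.

    # Reducing morphological variants to a common stem (pip install nltk).
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for variant in ["start", "starts", "starting", "started"]:
        print(variant, "->", stemmer.stem(variant))  # all reduce to 'start'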
NLP Applications
The first application of NLP was Machine Translation; more recent applications include information retrieval, information extraction, text summarization, etc.

Machine Translation: translation from one human language to another. It demands knowledge of the words, phrases, and grammars of the two languages involved, as well as world knowledge.

Speech Recognition: the process of mapping acoustic speech signals to a set of words.
Difficulty: wide variation in the pronunciation of words, homophones (dear and deer), and acoustic ambiguities (e.g., "in the rest" vs. "interest").

Speech Synthesis: the automatic production of speech. Such systems can read out mails over the telephone, or even read out a storybook for you.

Natural Language Interfaces to Databases: allow querying a structured database using natural language sentences.
Information Extraction:
It captures and outputs factual information contained in a document, extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction.
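A minimal sketch of rule-based extraction: pull dates and email addresses out of unstructured text with regular expressions. The patterns and the sample text are assumptions; statistical IE systems learn their extractors from data instead.

    # Rule-based information extraction: regex patterns over unstructured text.
    import re

    text = ("The committee will meet on 2024-03-15. "
            "Contact rama.rao@example.org or visit the office by 2024-04-01.")

    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

    print("dates: ", dates)   # -> ['2024-03-15', '2024-04-01']
    print("emails:", emails)  # -> ['rama.rao@example.org']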

Information Retrieval: an IR system assists users in finding the information they require, but it does not explicitly return answers to questions. It notifies users of the existence and location of documents that might contain the required information; that is, it is concerned with identifying the documents relevant to a user's query.
Example: Google search
Question Answering: given a question and a set of documents, a question answering system attempts to find the precise answer, or at least the precise portion of text in which the answer appears. Unlike an information retrieval system, a question answering system benefits from having an information extraction system to identify entities in the text.
Text Summarization: deals with the creation of summaries of documents and involves syntactic, semantic, and discourse-level processing of text.

Some Successful Early NLP Systems

ELIZA: an early natural language processing computer program created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum.

SysTran (System Translation): the first machine translation tool, developed in 1969 for Russian-English translation. SysTran provided the first online machine translation service, Babel Fish, used by AltaVista for handling translation requests from users.

TAUM-METEO: a natural language generation system used in Canada to generate weather reports. It accepts daily weather data and generates weather reports in English and French.
SHRDLU (Winograd, 1972): a natural language understanding system that simulates the actions of a robot in a blocks-world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to manipulate the blocks, to describe the block configurations, and to explain its reasoning.

LUNAR (Woods, 1977): a question answering system that answered questions about moon rocks.
Information Retrieval
● Information refers to the data; here we are concerned with text only, so we consider words as the carriers of information and written text as a message encoded in natural language.
● Retrieval refers to the process of accessing information from memory; it also requires information to be processed and stored. Only information relevant to the query, expressed in the form of that query, is located.
● Information retrieval deals with the organization, storage, retrieval, and evaluation of information relevant to the query.

Information retrieval deals with unstructured data. Retrieval is performed based on the content of the document rather than its structure.

Approaches for accessing large text collections can be broadly classified into two categories:
1) Approaches that construct a topic hierarchy
2) Approaches that rank the documents according to their relevance (a ranking sketch follows this list)
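A minimal sketch of the second approach: rank a toy document collection against a query using TF-IDF term weights and cosine similarity. The weighting scheme is the standard textbook one, and the documents are assumptions.

    # Rank documents by relevance: TF-IDF weighting + cosine similarity.
    import math
    from collections import Counter

    docs = ["the cat chased the mouse",
            "the committee discussed information retrieval",
            "retrieval of lunar rock information"]
    query = "information retrieval"

    def tfidf(text, corpus):
        tokens = text.split()
        vec = {}
        for term, count in Counter(tokens).items():
            df = sum(1 for d in corpus if term in d.split())
            idf = math.log(len(corpus) / df) if df else 0.0
            vec[term] = (count / len(tokens)) * idf
        return vec

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qvec = tfidf(query, docs)
    for doc in sorted(docs, key=lambda d: cosine(tfidf(d, docs), qvec),
                      reverse=True):
        print(round(cosine(tfidf(doc, docs), qvec), 3), doc)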
Issues involved in the design and evaluation of IR systems
1. Representation of the document: most human knowledge is coded in natural language, which is difficult to use directly as a knowledge representation.
2. Most retrieval systems are based on keyword representation, which raises several problems:
a. Polysemy: a lexeme with multiple meanings. Polysemy is the coexistence of many possible meanings for a word or phrase.
Example: He fixed his hair. / They fixed a date for the wedding.

b. Homonymy: the existence of two or more words having the same spelling or pronunciation but different meanings and origins. This ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents.
Homonymy is ambiguity in which words that appear the same have unrelated meanings, e.g., kneed/need, whole/hole, right/write.
c. Synonymy: creates a problem when a document is indexed with one term, the query contains a different term, and the two terms share a common meaning.

d. Keyword representation ignores semantics and contextual information in the retrieval process.

e. Inappropriate characterization of queries by the user: the reason can be lack of knowledge of the subject or even the inherent vagueness of natural language. The user may fail to include relevant terms in the query or may include irrelevant terms.

f. Matching the query representation with that of the document is another issue: the selection of an appropriate similarity measure is crucial in the design of an IR system.

g. Evaluating the performance of IR systems is also a major issue. Recall and precision are the most widely used measures of effectiveness (a sketch of both follows this list).

h. The goal of IR is to retrieve documents relevant to the query, so understanding what constitutes relevance is an important issue.

i. The size of document collections and the varying needs of users also complicate text retrieval: some users require answers of limited scope, while others require documents with a wider scope.
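A minimal sketch of the two effectiveness measures, computed for one query from a retrieved set and a set of ground-truth relevance judgments. The document IDs are assumptions.

    # Recall and precision for one query, given relevance judgments.
    retrieved = {"d1", "d2", "d3", "d4"}    # what the IR system returned
    relevant = {"d2", "d4", "d7"}           # what is actually relevant

    hits = retrieved & relevant             # relevant documents retrieved
    precision = len(hits) / len(retrieved)  # retrieved that are relevant
    recall = len(hits) / len(relevant)      # relevant that were retrieved

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.50, 0.67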
Why NLP?
To design, implement and test systems that can process natural language for practical
applications.

Practical Applications:
● Sentiment Analysis
● Query Completion/Auto correction
● Word Prediction
● Information Retrieval
● Text Summarization
● Spam Detection
Difficulties that we face while designing algorithms for NLP
1. Lexical ambiguity: in a language, the same word can have different meanings; this is called lexical ambiguity.

Example: Rose rose to get a twig.

2. Structural Ambiguity:
Example: The man saw the boy with the binoculars
Flying planes can be dangerous

Ambiguities:
Hospitals are sued by 7 foot doctors.
Stolen painting found by tree.
Teacher strikes idle kids.
A "morpheme" is a short segment of language that meets three basic criteria:

1. It is a word or a part of a word that has meaning.

2. It cannot be divided into smaller meaningful segments without changing its meaning or leaving a meaningless remainder.

3. It has relatively the same stable meaning in different verbal environments.
