Introduction To NLP
Linguistics is the scientific study of language. It deals with the analysis of every aspect of
language, as well as the methods for studying and modelling it.
Origins of NLP
Theoretical linguists identify rules that describe and restrict the structure of
languages (grammar).
Theoretical linguistics mainly provides a structural description of natural language and its
semantics.
Data driven: these approaches presume the existence of a large amount of data and usually employ some
machine learning technique to learn syntactic patterns. The amount of human effort is
lower, and the performance of these systems depends on the quantity of the data.
People use seven interdependent levels to understand and extract meaning from text or spoken words. In order to understand natural
languages, it is important to distinguish among them:
1- Phonological level: deals with the interpretation of speech sounds within and across words.
2- Morphological level: deals with the smallest parts of words that carry meaning, such as suffixes and prefixes.
3- Lexical level: deals with the meaning of individual words.
4- Syntactic level: deals with the grammatical structure of sentences.
5- Semantic level: deals with the meaning of sentences.
6- Discourse level: deals with units of text larger than a single sentence.
7- Pragmatic level: deals with the knowledge that comes from the outside world, i.e., from outside the content of the document.
1. Morphological Analysis:
While performing morphological analysis, each particular word is analyzed. Non-word tokens such as punctuation
are separated from the words, and the remaining words are assigned categories.
For instance: "Ram's iPhone cannot convert the video from .mkv to .mp4." In morphological analysis, the
sentence is analyzed word by word.
So here, Ram is a proper noun, the 's in Ram's is marked as a possessive suffix, and .mkv and .mp4 are marked as file extensions.
As shown above, the sentence is analyzed word by word and each word is assigned a syntactic category. The file extensions
present in the sentence, which behave as adjectives in this example, are also identified, as is the
possessive suffix. This is a very important step, because the interpretation of prefixes and suffixes depends on the
syntactic category of the word. For example, the suffix -s in books and swims is interpreted differently: in one case it
makes a noun plural, while in the other it makes a verb third-person singular. If a prefix or suffix is incorrectly interpreted, the meaning and understanding of the
sentence change completely. The interpretation assigns a category to the word and thereby removes this uncertainty from the
word.
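The word-by-word analysis above can be sketched as a simple tokenizer with hand-written category rules. This is only a toy illustration, not a real morphological analyzer; the regular expressions and category names are assumptions:

```python
import re

# Hand-written category rules (a toy sketch of morphological analysis).
def categorize(token):
    if re.fullmatch(r"\.\w+", token):   # tokens like ".mkv", ".mp4"
        return "file-extension"
    if token.endswith("'s"):            # possessive suffix
        return "possessive"
    if token[0].isupper():              # crude proper-noun heuristic
        return "proper-noun"
    return "word"

def analyze(sentence):
    # keep file extensions like ".mkv" as single tokens; drop other punctuation
    tokens = re.findall(r"\.\w+|[\w']+", sentence)
    return [(tok, categorize(tok)) for tok in tokens]

print(analyze("Ram's iPhone cannot convert the video from .mkv to .mp4"))
```

Each token comes back paired with its category, so .mkv and .mp4 are tagged as file extensions and Ram's as a possessive form.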
2. Syntactic Analysis:
There are different rules for different languages. Violation of these rules will give a syntax error. Here the sentence is
transformed into the structure that represents a correlation between the words. This correlation might violate the rules
occasionally. The syntax represents the set of rules that the official language will have to follow. For example, “To the
movies, we are going.” Will give a syntax error. The syntactic analysis uses the results given by morphological analysis to
develop the description of the sentence. The sentence which is divided into categories given by the morphological process
is aligned into a defined structure. This process is called parsing. For example, the cat chases the mouse in the garden,
would be represented as:
Here the sentence is broken down according to the categories.
Then it is described in a hierarchical structure with nodes as
sentence units. These parse trees are parsed while the syntax
analysis run and if any error arises the processing stops and it
displays syntax error. The parsing can be top-down or
bottom-up.
○ Top-down: Starts with the first symbol and parse
the sentence according to the grammar rules until
each of the terminals in the sentence is parsed.
○ Bottom-up: Starts with the sentence which is to
be parsed and apply all the rules backwards till
the first symbol is reached.
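The top-down strategy can be sketched as a recursive-descent parser with backtracking. The grammar and lexicon below are a toy inventory invented for the example sentence, not a real grammar of English:

```python
# Toy grammar and lexicon (invented for this one example sentence).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N", "PP"], ["Det", "N"]],
    "VP": [["V", "NP", "PP"], ["V", "NP"]],
    "PP": [["P", "NP"]],
}
LEXICON = {"the": "Det", "cat": "N", "mouse": "N", "garden": "N",
           "chases": "V", "in": "P"}

def parse(symbol, tokens, pos):
    """Top-down: expand `symbol` at tokens[pos]; return (tree, next_pos) or None."""
    if symbol in GRAMMAR:                       # non-terminal
        for production in GRAMMAR[symbol]:
            children, p = [], pos
            for sym in production:
                result = parse(sym, tokens, p)
                if result is None:
                    break
                subtree, p = result
                children.append(subtree)
            else:                               # every symbol in the production matched
                return (symbol, children), p
        return None
    # terminal: match the token's part of speech from the lexicon
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
        return (symbol, tokens[pos]), pos + 1
    return None

tree, end = parse("S", "the cat chases the mouse in the garden".split(), 0)
```

The parse succeeds only when every terminal is consumed (`end` equals the number of tokens); a bottom-up parser would instead start from the tagged tokens and reduce them toward S.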
● Semantic Analysis:
● Semantic analysis deals with meaning. It assigns a meaning to each of the structures built by the syntactic
analyzer, mapping every syntactic structure and its objects into the task domain. If a mapping is
possible, the structure is accepted; if not, it is rejected. For example, "hot ice-cream" would give a semantic error. During
semantic analysis two main operations are performed:
○ First, each separate word is mapped to an appropriate object in the database; the dictionary meaning of
every word is looked up. A word may have more than one meaning.
○ Secondly, the meanings of the individual words are combined to find a proper correlation between the
word structures. This process of determining the correct meaning is called lexical disambiguation. It is done by
interpreting each word in its context.
● The process defined above can be used to determine the partial meaning of a sentence. However, syntax and semantics
are distinct concepts: a syntactically correct sentence may be semantically
incorrect.
● For example, "A rock smelled the colour nine." is syntactically correct, as it obeys the rules of English, but it is
semantically incorrect. Semantic analysis verifies that a sentence abides by these rules and yields correct
information.
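Lexical disambiguation as described can be sketched with a simplified Lesk-style overlap count: pick the sense whose definition words overlap most with the sentence. The sense inventory below is a hand-made toy, not a real dictionary:

```python
# Toy sense inventory (invented for illustration, not a real dictionary).
SENSES = {
    "bank": {
        "financial-institution": {"money", "deposit", "loan", "account"},
        "river-side": {"river", "water", "shore", "fishing"},
    }
}

def disambiguate(word, sentence):
    """Pick the sense whose clue words overlap most with the sentence context."""
    context = set(sentence.lower().split())
    scores = {sense: len(context & clues)
              for sense, clues in SENSES[word].items()}
    return max(scores, key=scores.get)

print(disambiguate("bank", "He sat on the bank of the river fishing"))
# river-side
```

The context words "river" and "fishing" push the score toward the river-side sense; in a sentence about deposits and accounts the other sense wins.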
Discourse Integration:
While processing a language, one major ambiguity that can arise is referential ambiguity: the ambiguity
that occurs when the referent of a word cannot be determined. For example:
"Ram went to meet Mohan. They discussed the plan. He then left."
Here, "He" can be Ram or Mohan. This creates an ambiguity. The word "He" depends on
the preceding sentences. This is known as discourse integration: an individual sentence relies upon the sentences
that come before it, as the third sentence above relies on the ones before it. Hence the goal of
this step is to resolve referential ambiguity.
New words are added continually, and existing words are introduced in new contexts.
Example:
TV channels use 9/11 to refer to the terrorist attack on the World Trade Center.
The only way a machine can learn the meaning of a specific word in a message is by
considering its context, unless some explicitly coded general world or domain knowledge is
available. The context of a word is defined by its co-occurring words.
Idioms, metaphors, and ellipses add more complexity to identifying the meaning of
written text.
Idiom: a group of words established by usage as having a meaning not deducible from
those of the individual words.
Example idiom: "It's a piece of cake" (meaning: it's easy).
Quantifier scoping is another problem: the scope of quantifiers is often not clear and
poses problems for automatic processing.
Example:
There are many things to do today.
We have a lot of time left, don’t worry.
Ambiguity of natural language is another difficulty.
As humans, we are aware of the context and of current cultural knowledge, as well as of the
language and its traditions, and we use these to work out meaning. However, incorporating
contextual and world knowledge poses the greatest difficulty in language computing.
A number of grammars have been proposed to describe the structure of sentences.
However, there are an infinite number of ways to generate sentences, which makes writing
grammar rules, and the grammar itself, extremely complex.
Language and Grammar
Automatic processing of language requires the rules and exceptions of a language to be explained to the
computer.
Main hurdle:
the constantly changing nature of languages and the presence of a large number of language exceptions.
Efforts to provide specifications for languages have led to many grammars:
● Phrase Structure Grammar
● Transformational Grammar
● Lexical Functional Grammar
● Generalized Phrase Structure Grammar
● Dependency Grammar
● Paninian Grammar
● Tree-adjoining Grammar
Though many grammars were proposed, Transformational Grammar proved the most
influential.
● Noam Chomsky proposed the Transformational Grammar and suggested that each
sentence in a language has two levels of representation, namely a deep structure and
surface structure.
● Mapping of deep structure to surface structure is carried out by transformations.
● Deep structure can be transformed in a number of ways to yield many different
surface level representations.
● Sentences with different surface level representations having the same meaning, share
a common deep-level representation.
A transformation changes the structure of a sentence but not its meaning; the formalism is therefore also
called Transformational Generative Grammar.
English is an SVO language. An example is the passive transformation rule:
(s1) NP1-Aux-V-NP2 --> NP2-Aux+be+en-V-by+NP1 (s2)
This rule says that if the input has structure s1, it can be transformed to structure s2.
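The passive rule NP1-Aux-V-NP2 → NP2-Aux+be+en-V-by+NP1 can be sketched directly as string manipulation over the four constituents. The tiny past-participle lexicon below is an assumption for illustration:

```python
# Toy past-participle (V+en) lexicon, assumed for illustration.
PAST_PARTICIPLE = {"eat": "eaten", "chase": "chased", "see": "seen"}

def to_passive(np1, aux, verb, np2):
    # NP1-Aux-V-NP2  -->  NP2-Aux+be+en-V-by+NP1
    return f"{np2} {aux} be {PAST_PARTICIPLE[verb]} by {np1}"

print(to_passive("Ram", "will", "eat", "the apple"))
# the apple will be eaten by Ram
```

The active sentence ("Ram will eat the apple") and its passive surface form share the same deep structure; only the transformation applied differs.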
Phoneme: in linguistics, the smallest unit of speech distinguishing one word (or word element) from
another, as the element p in "tap," which separates that word from "tab," "tag," and "tan."
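The tap/tab contrast is a minimal pair: two words of equal length differing in exactly one segment. A rough sketch of checking this, using letters as a stand-in for phonemes (real phonemic analysis would compare phoneme transcriptions, not spelling):

```python
def is_minimal_pair(w1, w2):
    """True if the words have equal length and differ in exactly one position.
    Letters are used here as a crude proxy for phonemes."""
    if len(w1) != len(w2):
        return False
    return sum(a != b for a, b in zip(w1, w2)) == 1

print(is_minimal_pair("tap", "tab"))  # True
```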
Machine Translation: translation from one human language to another; it demands knowledge of the
words, phrases, and grammars of the two languages involved, as well as world knowledge.
Speech Synthesis: the automatic production of speech. Such systems can read out mails over the
telephone, or even read out a storybook for you.
Information Retrieval: an IR system assists users in finding the information
they require, but it does not explicitly return answers to questions. It notifies
the user of the existence and location of documents that might contain the required
information. It is concerned with identifying the documents relevant to a user's query.
Example: Google Search
Question Answering: given a question and a set of documents, a question answering system
attempts to find the precise answer, or at least the precise portion of text in which the answer
appears. Unlike an information retrieval system, a question answering system benefits from
having an information extraction system to identify entities in the text.
Text Summarization: deals with the creation of summaries of documents and involves syntactic,
semantic, and discourse-level processing of text.
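As a baseline, extractive summarization can be sketched by scoring each sentence by the frequency of its words in the whole text and keeping the top n. Real summarizers add the syntactic, semantic, and discourse-level processing just mentioned; this frequency heuristic is only an illustration:

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Keep the n sentences whose words are most frequent across the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))
    return sorted(sentences, key=score, reverse=True)[:n]

print(summarize("Dogs bark. Dogs chase cats. Birds fly."))
```

Sentences that reuse the document's most common words score highest and are kept as the summary.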
TAUM-METEO: a natural language generation system used in Canada to generate weather
reports. It accepts daily weather data and generates weather reports in English and French.
SHRDLU (Winograd 1972): a natural language understanding system that simulates the actions of a robot in a blocks
world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to
manipulate the blocks, to describe the block configurations, and to explain its reasoning.
LUNAR (Woods 1977): a question answering system that answered questions about moon rocks.
Information Retrieval
● Information refers to data, and here we are concerned with text only. So we
consider words as the carriers of information and written text as a message encoded in
natural language.
● Retrieval refers to the process of accessing information from memory; it also
requires information to be processed and stored. Only the information relevant to the
need expressed in the form of a query is located.
● Information retrieval deals with the organization, storage, retrieval, and evaluation of
information relevant to the query.
Information retrieval deals with unstructured data. It is performed based on the content
of the document rather than its structure.
Approaches for accessing large text collections can be broadly classified into two
categories:
1) Approaches that construct a topic hierarchy
2) Approaches that rank the documents according to their relevance to the query
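The second, ranking-based approach can be sketched with TF-IDF term weighting and a simple additive score. The smoothing used here is one common choice, not the only one:

```python
import math
from collections import Counter

def rank(docs, query):
    """Return document indices ordered by TF-IDF relevance to the query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in tokenized for term in set(doc))
    def weight(term, doc):
        tf = doc.count(term)                          # term frequency
        idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed inverse document frequency
        return tf * idf
    scores = [sum(weight(t, doc) for t in query.lower().split())
              for doc in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Terms that are frequent in a document but rare across the collection dominate the score, so documents sharing the query's distinctive terms rank first.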
Issues involved in the design and evaluation of IR Systems
1. Representation of the document: most human knowledge is coded in natural language,
which is difficult to use as a knowledge representation.
2. Most retrieval systems are based on keyword representation, which raises several problems:
a. Polysemy: a lexeme with multiple meanings; the coexistence of many possible meanings for a word or phrase.
b. Homonymy: the existence of two or more words having the same spelling or
pronunciation but different, unrelated meanings and origins,
e.g. kneed/need, whole/hole, right/write.
This ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents.
c. Synonymy: creates a problem when a document is indexed with one term, the query contains a
different term, and the two terms share a common meaning.
3. Inappropriate characterization of queries by the user: the cause can be a lack of knowledge of the subject or
the inherent vagueness of natural language. The user may fail to include relevant terms in the query or
may include irrelevant terms.
4. Matching the query representation with that of the document is another issue: the selection of an appropriate
similarity measure is a crucial issue in the design of an IR system.
5. Evaluating the performance of IR systems is also a major issue. Recall and precision are the most widely
used measures of effectiveness.
6. The goal of IR is to retrieve documents relevant to the query, so understanding what constitutes
relevance is an important issue.
7. The size of document collections and the varying needs of users also complicate text retrieval: some users
require answers of limited scope, while others require documents with a wider scope.
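Recall and precision can be computed directly from the sets of relevant and retrieved documents:

```python
def evaluate(relevant, retrieved):
    """Precision and recall from sets of document ids."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved)  # fraction of retrieved documents that are relevant
    recall = hits / len(relevant)      # fraction of relevant documents that were retrieved
    return precision, recall

# 2 of the 3 retrieved documents are relevant; 2 of the 4 relevant documents were found.
p, r = evaluate(relevant={1, 2, 3, 4}, retrieved={2, 3, 5})
```

Here precision is 2/3 and recall is 2/4, illustrating the usual trade-off: retrieving more documents tends to raise recall but lower precision.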
Why NLP?
To design, implement and test systems that can process natural language for practical
applications.
Practical Applications:
● Sentiment Analysis
● Query Completion/Autocorrection
● Word Prediction
● Information Retrieval
● Text Summarization
● Spam Detection
Difficulties that we face while designing Algorithms for NLP
1. Lexical Ambiguity: in a language, the same word can have different meanings;
this is called lexical ambiguity.
2. Structural Ambiguity:
Example: The man saw the boy with the binoculars
Flying planes can be dangerous
Ambiguities:
Hospitals are sued by 7 foot doctors.
Stolen painting found by tree.
Teacher strikes idle kids.
A "morpheme" is a short segment of language that meets three basic criteria:
1. It is a word or a part of a word that has meaning.
2. It cannot be divided into smaller meaningful segments without changing its meaning or leaving a meaningless remainder.
3. It recurs in differing word environments with a relatively stable meaning.
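Splitting a word into morphemes can be sketched as affix stripping against a toy affix inventory. The prefix and suffix lists below are assumptions; real analyzers use full lexicons and also restore stem spelling changes (e.g. happi → happy):

```python
# Toy affix inventory (an assumption for illustration only).
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def morphemes(word):
    """Strip at most one known prefix and one known suffix, leaving a stem."""
    parts, tail = [], []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            tail.insert(0, s)
            word = word[:-len(s)]
            break
    return parts + [word] + tail

print(morphemes("unhappiness"))  # ['un', 'happi', 'ness']
```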