NLP
Stages in a Comprehensive NLP System
Tokenization
Morphological Analysis
Syntactic Analysis
Semantic Analysis (lexical and compositional)
Pragmatics and Discourse Analysis
Knowledge-Based Reasoning
Text Generation
• NLP works at different levels, which means that machines process and understand
natural language at different levels.
• These levels are (a short code sketch after this list illustrates the first three):
• Morphological level: This level deals with understanding word structure and word
formation.
• Lexical level: This level deals with understanding the part of speech of the word.
• Syntactic level: This level deals with understanding the syntactic structure of a sentence,
that is, parsing the sentence.
• Semantic level: This level deals with understanding the actual meaning of a sentence.
• Discourse level: This level deals with understanding the meaning of a sentence beyond
the sentence level, that is, considering the context.
• Pragmatic level: This level deals with using real-world knowledge to understand the
sentence.
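• A minimal sketch of the first three levels using spaCy, assuming the library and its
small English model en_core_web_sm are installed (the example sentence is illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The children are playing in the garden.")
    for token in doc:
        # token.lemma_ -> morphological level (base form of the word)
        # token.pos_   -> lexical level (part of speech)
        # token.dep_   -> syntactic level (grammatical relation to its head)
        print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)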
History of NLP
• NLP is a field that has emerged from various other fields such as AI, linguistics,
and data science.
• As stated above, the idea emerged from the need for machine translation in
the 1940s.
• The original language pair was English and Russian.
• Other languages, such as Chinese, also came into use in the early 1960s.
• A bleak era came for MT/NLP in 1966 with the ALPAC report, which concluded
that research in the area was progressing too slowly; as a result, MT/NLP
almost died out.
• The situation improved again in the 1980s, when MT/NLP products started
providing useful results to customers.
• After nearly dying out in the 1960s, NLP/MT got a new life when the idea of and
need for Artificial Intelligence emerged. LUNAR, developed in the early 1970s by
W. A. Woods, could analyze, compare, and evaluate the chemical data on lunar
rock and soil composition that was accumulating as a result of the Apollo moon
missions, and could answer related questions.
• In the 1980s, the area of computational grammar became a very active field of
research, linked with the science of reasoning about meaning and with modeling
the user's beliefs and intentions.
• In the 1990s, the pace of growth of NLP/MT increased. Grammars, tools, parsers,
and practical resources related to NLP/MT became available.
• Research on core and forward-looking topics such as word sense disambiguation
and statistically oriented NLP gave new direction to work on the lexicon.
• This quest for the emergence of NLP was joined by other essential topics such as
statistical language processing, information extraction, and automatic
summarization.
• The discussion of the history of NLP cannot be considered complete without
mentioning ELIZA, a chatbot program developed from 1964 to 1966 at the MIT
Artificial Intelligence Laboratory.
• It was created by Joseph Weizenbaum.
• The program was based on a script named DOCTOR, which simulated a Rogerian
psychotherapist and used rules to respond to users' statements.
• It was one of the few chatbots of its time capable of attempting the Turing test.
• Previously, traditional rule-based systems were used for these computations, in which
you had to explicitly write hardcoded rules.
• Today, computations on natural language are done using ML and DL
techniques.
• Let's say we have to extract the names of politicians from a set of political
news articles. If we want to apply a rule-based grammar, we must manually craft
certain rules based on human understanding of language, as sketched below.
• As we can see, using a rule-based system like this would not yield very accurate
results.
• One major disadvantage is that the same rule is not applicable in all cases,
given the complex and nuanced nature of most language.
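• A minimal sketch of such a hand-crafted rule (the titles and name pattern are
illustrative assumptions, not a real system):

    import re

    # Hypothetical rule: a politician's name is one or more capitalized
    # words immediately following a title such as "Senator" or "President".
    PATTERN = re.compile(
        r"\b(?:Senator|President|Minister|Governor)\s+"
        r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
    )

    text = ("President Macron met Senator Jane Smith in Paris. "
            "Later, Obama gave a short speech.")

    print(PATTERN.findall(text))  # ['Macron', 'Jane Smith']
    # 'Obama' is missed because he appears without a title, which is
    # exactly why fixed rules like this are brittle.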
Basic Concepts
Text Corpus or Corpora
• The language data that all NLP tasks depend upon is called the text corpus, or
simply corpus.
• A corpus is a large set of text data in a particular language, such as English,
French, and so on.
• The corpus can consist of a single document or a collection of documents.
• The source of a text corpus can be social network sites like Twitter, blog sites,
open discussion forums like Stack Overflow, books, and several others.
• Some tasks, like machine translation, require a multilingual corpus.
• For example, we might need both the English and French translations of the same
document content to develop a machine translation model.
• For speech tasks, we would also need human voice recordings and the
corresponding transcribed corpus.
• For many NLP tasks, the corpus is split into chunks for further analysis.
• These chunks can be at the paragraph, sentence, or word level, as illustrated in
the sketch below.
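• A minimal sketch of this chunking using NLTK, assuming the library is installed and
the "punkt" tokenizer models have been downloaded via nltk.download("punkt"):

    import nltk

    corpus = ("NLP is fun. It has many applications.\n\n"
              "A corpus can come from tweets, blogs, or books.")

    paragraphs = corpus.split("\n\n")         # paragraph-level chunks
    sentences = nltk.sent_tokenize(corpus)    # sentence-level chunks
    words = nltk.word_tokenize(sentences[0])  # word-level chunks

    print(len(paragraphs), len(sentences), words)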
Paragraph
Sentences
Phrases and words
• A phrase is a group of consecutive words within a sentence that conveys a
specific meaning.
• For example, in the sentence "Tomorrow is going to be a rainy day", the part
"going to be a rainy day" expresses a specific thought.
• Some NLP tasks extract key phrases from sentences for search and retrieval
applications (see the sketch after this list).
• The next smallest unit of text is the word.
• Common tokenizers split sentences into words based on whitespace and
punctuation such as commas.
• One of the problems in NLP is the ambiguity in meaning of the same word used
in different contexts.
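• A minimal sketch of key-phrase extraction via spaCy's noun chunks, assuming the
library and its en_core_web_sm model are installed (noun chunks are only one
simple notion of "key phrase"):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Tomorrow is going to be a rainy day")
    for chunk in doc.noun_chunks:  # base noun phrases
        print(chunk.text)          # e.g. "Tomorrow", "a rainy day"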
N-gram
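• An n-gram is a sequence of n consecutive tokens from a text; a minimal,
dependency-free sketch (the helper function is illustrative):

    def ngrams(tokens, n):
        # all windows of n consecutive tokens
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the quick brown fox".split()
    print(ngrams(tokens, 2))
    # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]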
Bag-of-words
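• A bag-of-words representation turns each document into a vector of word counts,
ignoring word order; a minimal sketch using scikit-learn's CountVectorizer,
assuming scikit-learn is installed:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat"]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)    # sparse document-term matrix

    print(vectorizer.get_feature_names_out())  # learned vocabulary
    print(counts.toarray())                    # per-document word counts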
Applications
• Analyzing sentiment
• Recognizing named entities
• Linking entities
• Translating text
• Natural language interfaces
• Semantic role labeling
• Relation extraction
• SQL query generation, or semantic parsing
• Machine comprehension
• Textual entailment
• Coreference resolution
• Searching
• Question answering and chatbots
• Converting text to voice
• Converting voice to text
• Speaker identification
• Spoken dialog systems
• Other applications