Natural Language Processing Lec 1
Instructor: Touseef Sultan
Core Concept
• Natural language processing is a subfield of linguistics, computer
science, and artificial intelligence concerned with the interactions
between computers and human language, in particular how to
program computers to process and analyze large amounts of natural
language data.
• Natural language processing (NLP) refers to the branch of computer
science—and more specifically, the branch of artificial intelligence or
AI—concerned with giving computers the ability to understand text
and spoken words in much the same way human beings can.
Continue…
• NLP combines computational linguistics—rule-based modeling of
human language—with statistical, machine learning, and deep
learning models. Together, these technologies enable computers to
process human language in the form of text or voice data and to
‘understand’ its full meaning, complete with the speaker or writer’s
intent and sentiment.
• NLP drives computer programs that translate text from one language
to another, respond to spoken commands, and summarize large
volumes of text rapidly—even in real time.
NLP tasks
• Human language is filled with ambiguities that make it incredibly
difficult to write software that accurately determines the intended
meaning of text or voice data.
• Several NLP tasks break down human text and voice data in ways that
help the computer make sense of what it's ingesting. Some of these
tasks include the following:
Tasks
• Speech recognition, also called speech-to-text, is the task of reliably converting voice data
into text data. Speech recognition is required for any application that follows voice
commands or answers spoken questions. What makes speech recognition especially
challenging is the way people talk—quickly, slurring words together, with varying emphasis
and intonation, in different accents, and often using incorrect grammar.
• Part of speech tagging, also called grammatical tagging, is the process of determining the
part of speech of a particular word or piece of text based on its use and context. Part-of-speech
tagging identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in ‘What
make of car do you own?’ (see the NLTK sketch after this list).
• Word sense disambiguation is the selection of the meaning of a word with multiple
meanings through a process of semantic analysis that determines which meaning makes the
most sense in the given context. For example, word sense disambiguation helps distinguish
the meaning of the verb 'make' in ‘make the grade’ (achieve) vs. ‘make a bet’ (place).
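The last two bullets can be tried out with a minimal NLTK sketch. This is only illustrative: it assumes NLTK is installed and that its tokenizer, tagger, and WordNet resources have already been downloaded, the example sentences are the ones quoted above, and the classical Lesk algorithm is used here as just one simple way to do word sense disambiguation.

```python
# Minimal sketch: part-of-speech tagging and word sense disambiguation with NLTK.
# Assumes the tokenizer, tagger, and WordNet data are available via nltk.download().
import nltk
from nltk.wsd import lesk

# Part-of-speech tagging: 'make' is tagged as a verb in the first sentence
# and as a noun in the second.
print(nltk.pos_tag(nltk.word_tokenize("I can make a paper plane")))
print(nltk.pos_tag(nltk.word_tokenize("What make of car do you own?")))

# Word sense disambiguation with the Lesk algorithm: the chosen WordNet sense
# of 'bank' depends on the surrounding context words.
context = nltk.word_tokenize("I deposited money at the bank")
sense = lesk(context, "bank")
print(sense, sense.definition() if sense else None)
```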
Tasks
• Named entity recognition, or NER, identifies words or phrases as useful entities.
NER identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name (see the spaCy sketch after this list).
• Co-reference resolution is the task of identifying if and when two words refer to
the same entity. The most common example is determining the person or object to
which a certain pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve
identifying a metaphor or an idiom in the text (e.g., an instance in which 'bear' isn't
an animal but a large hairy person).
• Sentiment analysis attempts to extract subjective qualities—attitudes, emotions,
sarcasm, confusion, suspicion—from text.
• Natural language generation is sometimes described as the opposite of speech
recognition or speech-to-text; it's the task of putting structured information into
human language.
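As a concrete illustration of the named entity recognition bullet above, here is a minimal sketch with spaCy. It assumes spaCy is installed and the small English model en_core_web_sm has been downloaded; the exact entity labels produced depend on that model.

```python
# Minimal named entity recognition sketch with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Fred drove from Kentucky to Nashville on Friday.")

# Each recognized span carries an entity label such as PERSON, GPE (place), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```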
How does NLP work?
• Natural language processing includes many different techniques for
interpreting human language, ranging from statistical and machine
learning methods to rules-based and algorithmic approaches. We need a
broad array of approaches because the text- and voice-based data varies
widely, as do the practical applications.
• Basic NLP tasks include tokenization and parsing,
lemmatization/stemming, part-of-speech tagging, language detection and
identification of semantic relationships. If you ever diagramed sentences
in grade school, you’ve done these tasks manually before.
• In general terms, NLP tasks break down language into shorter, elemental
pieces, try to understand relationships between the pieces and explore
how the pieces work together to create meaning.
lemmatization vs stemming
• Stemming is a process that removes the last few characters from a word, often
leading to incorrect meanings and spellings. Lemmatization considers the context
and converts the word to its meaningful base form, which is called the lemma.
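A minimal sketch of the difference, using NLTK's PorterStemmer and WordNetLemmatizer; it assumes the WordNet data has been downloaded, and the example words are chosen purely for illustration.

```python
# Minimal sketch contrasting stemming and lemmatization in NLTK.
# Assumes the WordNet corpus is available (nltk.download('wordnet')).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying"]:
    print(word,
          "stem:", stemmer.stem(word),                   # crude suffix stripping
          "lemma:", lemmatizer.lemmatize(word, pos="v"))  # dictionary base form

# Typical output: both words stem to the non-word 'studi', while the
# lemmatizer maps both to the valid base form 'study'.
```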
Natural Language Understanding (NLU)
• NLU is a branch of natural language processing (NLP), which
helps computers understand and interpret human language by
breaking down the elemental pieces of speech. While speech
recognition captures spoken language in real-time, transcribes
it, and returns text, NLU goes beyond recognition to determine
a user’s intent. Speech recognition is powered by statistical
machine learning methods which add numeric structure to
large datasets. In NLU, machine learning models improve over
time as they learn to recognize syntax, context, language
patterns, unique definitions, sentiment, and intent.
Natural language understanding (NLU)
• NLU enables machines to understand and interpret human language by extracting metadata from content. It
performs the following tasks:
• Helps analyze different aspects of language.
• Helps map the input in natural language into valid representations.
• NLU is more difficult than NLG tasks owing to referential, lexical, and syntactic ambiguity.
• Lexical ambiguity: This means that one word holds several meanings. For example, "The man is looking for
the match." The sentence is ambiguous as ‘match’ could mean different things such as a partner or a
competition.
• Syntactic ambiguity: This refers to a sequence of words with more than one meaning. For example, "The
fish is ready to eat.” The ambiguity here is whether the fish is ready to eat its food or whether the fish is ready
for someone else to eat. This ambiguity can be resolved with the help of the part-of-speech tagging technique.
• Referential ambiguity: This involves a word or a phrase that could refer to two or more properties. For
example, Tom met Jerry and John. They went to the movies. Here, the pronoun ‘they’ causes ambiguity as it
isn’t clear who it refers to.
Business applications often rely on NLU to understand what people are saying
in both spoken and written language. This data helps virtual assistants and
other applications determine a user’s intent and route them to the right task.
Natural language generation (NLG)
• NLG is a method of creating meaningful phrases and sentences (natural
language) from data. It comprises three stages: text planning, sentence
planning, and text realization.
• Text planning: Retrieving the relevant content from the underlying data or knowledge base (see the sketch after this list).
• Sentence planning: Forming meaningful phrases and setting the
sentence tone.
• Text realization: Mapping sentence plans to sentence structures.
• Chatbots, machine translation tools, analytics platforms, voice
assistants, sentiment analysis platforms, and AI-powered transcription
tools are some applications of NLG.
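The three stages can be mimicked with a deliberately tiny, template-based sketch in plain Python. The weather record and wording below are invented purely for illustration; real NLG systems are far more sophisticated.

```python
# Toy, template-based illustration of the three NLG stages.
record = {"city": "Lahore", "temp_c": 31, "condition": "sunny"}

# 1. Text planning: decide which facts in the data are worth reporting.
facts = [("condition", record["condition"]), ("temperature", record["temp_c"])]

# 2. Sentence planning: choose phrases and an overall sentence tone.
phrases = [f"the weather in {record['city']} is {facts[0][1]}",
           f"the temperature is {facts[1][1]} degrees Celsius"]

# 3. Text realization: assemble the phrases into a grammatical sentence.
sentence = "Today, " + " and ".join(phrases) + "."
print(sentence)
# -> Today, the weather in Lahore is sunny and the temperature is 31 degrees Celsius.
```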
Techniques and methods of natural
language processing
• Syntax and semantic analysis are two main techniques used with natural language processing.
• Syntax is the arrangement of words in a sentence to make grammatical sense. NLP uses syntax to assess the meaning of a
sentence based on grammatical rules. Syntax techniques include:
• Parsing. This is the grammatical analysis of a sentence. Example: A natural language processing algorithm is fed the
sentence, "The dog barked." Parsing involves breaking this sentence into parts of speech -- i.e., dog = noun, barked = verb.
This is useful for more complex downstream processing tasks.
• Word segmentation. This is the act of taking a string of text and deriving word forms from it. Example: A person scans a
handwritten document into a computer. The algorithm would be able to analyze the page and recognize that the words are
divided by white spaces.
• Sentence breaking. This places sentence boundaries in large texts. Example: A natural language processing algorithm is
fed the text, "The dog barked. I woke up." The algorithm can recognize the period that splits up the sentences using
sentence breaking.
• Morphological segmentation. This divides words into smaller parts called morphemes. Example: The word untestably
would be broken into [[un[[test]able]]ly], where the algorithm recognizes "un," "test," "able" and "ly" as morphemes. This is
especially useful in machine translation and speech recognition.
• Stemming. This divides words with inflection in them to root forms. Example: In the sentence, "The dog barked," the
algorithm would be able to recognize the root of the word "barked" is "bark." This would be useful if a user was analyzing a
text for all instances of the word bark, as well as all of its conjugations. The algorithm can see that they are essentially the
same word even though the letters are different.
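Several of these techniques (sentence breaking, word segmentation, and stemming) can be tried directly with NLTK. The sketch below assumes the library and its 'punkt' tokenizer data are installed, and it reuses the example text from the bullets above.

```python
# Minimal sketch of sentence breaking, word segmentation, and stemming with NLTK.
# Assumes the 'punkt' tokenizer resources have been downloaded.
import nltk
from nltk.stem import PorterStemmer

text = "The dog barked. I woke up."

# Sentence breaking: place sentence boundaries in the text.
sentences = nltk.sent_tokenize(text)          # ['The dog barked.', 'I woke up.']

# Word segmentation: split each sentence into word tokens.
tokens = [nltk.word_tokenize(s) for s in sentences]

# Stemming: reduce inflected forms to a root, e.g. 'barked' -> 'bark'.
stemmer = PorterStemmer()
stems = [[stemmer.stem(t) for t in sent] for sent in tokens]
print(sentences, tokens, stems, sep="\n")
```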
Ambiguities Problem in NLU
• Word sense disambiguation. This derives the meaning of a word
based on context. Example: Consider the sentence, "The pig is in
the pen." The word pen has different meanings. An algorithm using
this method can understand that the use of the word pen here refers
to a fenced-in area, not a writing implement.
• Types of ambiguities, with examples:
• Lexical ambiguity (The tank was full of water)
• Syntactical ambiguity (Old men and women were taken to safe place)
• Semantic ambiguity (The car hit the pole while it was moving)
• Pragmatic ambiguity (The police are coming)
Why is NLP important?
1. Large volumes of textual data
Natural language processing helps computers communicate with humans in their own language and scales
other language-related tasks. For example, NLP makes it possible for computers to read text, hear speech,
interpret it, measure sentiment and determine which parts are important. Today’s machines can analyze
more language-based data than humans, without fatigue and in a consistent, unbiased way. Considering
the staggering amount of unstructured data that’s generated every day, from medical records to social
media, automation will be critical to fully analyze text and speech data efficiently.
2. Structuring a highly unstructured data source
Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally
and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique
set of grammar and syntax rules, terms and slang. When we write, we often misspell or abbreviate words,
or omit punctuation. When we speak, we have regional accents, and we mumble, stutter and borrow terms
from other languages. While supervised and unsupervised learning, and specifically deep learning, are
now widely used for modeling human language, there’s also a need for syntactic and semantic
understanding and domain expertise that are not necessarily present in these machine learning
approaches. NLP is important because it helps resolve ambiguity in language and adds useful numeric
structure to the data for many downstream applications, such as speech recognition or text analytics.
Syntactic & Semantic Analysis
• Syntactic analysis (syntax) and semantic analysis (semantic) are the
two primary techniques that lead to the understanding of natural
language. Language is a set of valid sentences, but what makes a
sentence valid? Syntax and semantics.
• Syntax is the grammatical structure of the text, whereas semantics is
the meaning being conveyed. A sentence that is syntactically correct,
however, is not always semantically correct. For example, “cows flow
supremely” is grammatically valid (subject — verb — adverb) but it
doesn't make any sense.
SYNTACTIC ANALYSIS