Module 05 - Learners Guide
NATURAL LANGUAGE PROCESSING
Let’s start!
WHAT IS NLP?
Natural language processing, or NLP, combines computational
linguistics—rule-based modeling of human language—with statistical and
machine learning models to enable computers and digital devices to
recognize, understand and generate text and speech.
HISTORY OF NLP
• 1950s: Alan Turing introduces the Turing Test, questioning if machines can think, and the first automated
translation experiment translates Russian to English.
• 1960s: Development of ELIZA, a chatbot simulating a psychotherapist, showcasing early NLP capabilities.
• 1970s: Introduction of SHRDLU, an NLP program for understanding natural language in a blocks world
context.
• 1980s: Transition from rule-based to statistical methods in NLP, acknowledging the complexity of human
language.
• 1990s: The rise of the internet provides vast text data, boosting statistical NLP methods and machine
learning models.
• 2000s: Machine learning becomes dominant in NLP, with algorithms learning from large corpora of text for
language processing.
• 2010s: Deep learning and neural networks revolutionize NLP, significantly improving language
understanding and generation.
• Late 2010s: Introduction of Transformer-based models like BERT and GPT, setting new standards for NLP
tasks.
• Present: NLP technologies achieve near-human levels of language understanding and generation across various applications.
• Ongoing: Continuous advancements in deep learning models drive NLP forward, expanding its capabilities
and applications in everyday technology.
FOUNDATIONS OF NLP
TYPES OF DATA
1. Qualitative Data: Categories without numerical values.
2. Quantitative Data: Numerical values, including countable items (discrete) or measurements (continuous).
3. Structured Data: Organized data in databases, easily searchable, like customer databases.
4. Unstructured Data: Not organized in a predefined way, including texts, images, and videos.
5. Semi-structured Data: Not in databases but has some organization, like JSON or XML files.
6. Time Series Data: Data points collected over time intervals.
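These categories can be made concrete in a few lines of Python. This is an illustrative sketch; the sample values are invented, and only the standard-library json module is used:

```python
import json

# Structured: rows with a fixed schema, like a customer database table
structured = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Lin", "country": "SG"},
]

# Semi-structured: JSON text has some organization (keys, nesting)
# but no rigid, table-like schema
semi_structured = '{"user": "Ada", "tags": ["nlp", "ml"], "profile": {"age": 36}}'
parsed = json.loads(semi_structured)

# Unstructured: free-form text with no predefined organization
unstructured = "I loved the course, but module five moved a bit fast."

print(parsed["tags"])    # nested fields can still be addressed by key
print(len(structured))   # structured rows are easy to count and query
```

Most NLP work starts from the unstructured kind and turns it into something structured enough to compute with, which is what the techniques below do.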
TOKENIZATION
Tokenization breaks text into individual elements or tokens, making text analysis manageable and efficient.
LEMMATIZATION
Lemmatization also reduces words to their base or dictionary form but considers the context to ensure the root word's correct meaning.
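Both steps can be sketched in pure Python. The regex tokenizer and the hand-made LEMMAS lookup below are illustrative stand-ins for real tools (NLTK or spaCy use full vocabularies and part-of-speech context):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (simple regex sketch)."""
    return re.findall(r"\w+|[^\w\s]", text)

# Toy lemma dictionary; a real lemmatizer derives these from a vocabulary.
LEMMAS = {"better": "good", "best": "good", "boys": "boy",
          "playing": "play", "are": "be"}

def lemmatize(token):
    """Look up the dictionary form, falling back to the lowercased token."""
    return LEMMAS.get(token.lower(), token.lower())

tokens = tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']
print([lemmatize(t) for t in tokenize("The boys are playing")])
```

Note how punctuation becomes its own token, and how "are" maps to the dictionary form "be" rather than being crudely truncated.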
A) Tokenization
B) Lemmatization
C) Stemming
Question 2: If "better" becomes "good" and "best" becomes "good", which NLP process is being used?
A) Tokenization
B) Lemmatization
C) Stemming
Question 3: Splitting "Hello, world!" into ["Hello", ",", "world", "!"] is an example of:
A) Tokenization
B) Lemmatization
C) Stemming
CHOOSE THE RIGHT PROCESS
Question 4: Converting the sentence "The boys are playing" into tokens like "The", "boys", "are", "playing" is known as:
A) Tokenization
B) Lemmatization
C) Stemming
Question 5: Which process might incorrectly reduce "university" and "universal" to a common root such as "univers"?
A) Tokenization
B) Lemmatization
C) Stemming
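The pitfall in Question 5 — "university" and "universal" collapsing to "univers" — can be reproduced with a deliberately crude suffix-stripping stemmer. This is only a sketch; real stemmers such as Porter's apply ordered rule sets with conditions:

```python
def naive_stem(word, suffixes=("ity", "al", "ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least a 3-letter stem.
    Crude on purpose: it has no notion of meaning, only spelling."""
    for s in suffixes:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

print(naive_stem("university"))  # univers
print(naive_stem("universal"))   # univers
print(naive_stem("playing"))     # play
```

Two unrelated words end up with the same root — exactly the over-stemming error the quiz asks about, and the reason lemmatization is preferred when correct meaning matters.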
BAG OF WORDS (BoW)
Real-World Scenario
Email spam detection. BoW can be used to
identify spam by analyzing the frequency of
words typically found in spam emails versus
legitimate ones, allowing email systems to
filter out unwanted messages efficiently.
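The spam-filtering idea can be sketched with a bag of words built from collections.Counter. The spam_words set and the scoring rule are invented for illustration; real filters learn weights from labeled data:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; word order is deliberately discarded."""
    return Counter(text.lower().split())

# Hypothetical lexicon of words typical of spam emails
spam_words = {"winner", "free", "prize"}

def spam_score(text):
    """Fraction of tokens that are typical spam words."""
    bow = bag_of_words(text)
    total = sum(bow.values())
    return sum(bow[w] for w in spam_words) / total

print(spam_score("You are a winner claim your free prize now"))  # 3 of 9 tokens
```

The key property (and limitation) of BoW is visible here: "free prize winner" and "winner free prize" get identical representations, because only counts survive.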
N-GRAMS
Why It's Important?
N-grams capture the sequence of 'N' items
(words or letters) from text, preserving
some order and context, which is lost in the
BoW model, improving the model's
understanding of language structure.
Real-World Scenario
Auto-complete features in search engines or
texting apps use n-grams to predict the next
word or phrase you're likely to type based on
the probability of word sequences,
enhancing user experience by speeding up
typing and reducing effort.
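The auto-complete scenario can be sketched with bigrams (n = 2). The tiny corpus below is invented; a real keyboard app would train on billions of word sequences:

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All length-n windows over a token list, preserving order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Count which word follows which in a toy corpus
corpus = "i love data i love nlp i love data".split()
nexts = defaultdict(Counter)
for w1, w2 in ngrams(corpus, 2):
    nexts[w1][w2] += 1

def predict(word):
    """Most frequent word observed after `word` — simple auto-complete."""
    return nexts[word].most_common(1)[0][0]

print(predict("i"))     # love
print(predict("love"))  # data (seen twice, vs nlp once)
```

Unlike BoW, the bigram counts retain local order: "love data" and "data love" are different events, which is exactly the context the slide says BoW loses.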
WORD EMBEDDINGS
Why It's Important?
Word embeddings map words into dense
vectors of real numbers in a way that captures
semantic meaning, relationships, and context,
greatly enhancing the performance of NLP
models on complex tasks.
Real-World Scenario:
Recommendation systems in streaming
services. By using embeddings to understand
the content and context of user reviews and
interactions, these platforms can recommend
movies, shows, or songs that are more aligned
with individual user preferences, improving
personalization and satisfaction.
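The "semantic closeness" that embeddings capture is usually measured with cosine similarity. The 3-dimensional vectors below are hand-written toys; real embeddings (word2vec, GloVe) have hundreds of dimensions and are learned from data:

```python
import math

# Toy "embeddings": similar words get similar vectors by construction here
emb = {
    "movie":  [0.9, 0.1, 0.0],
    "film":   [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(emb["movie"], emb["film"]))    # high: near-synonyms
print(cosine(emb["movie"], emb["banana"]))  # low: unrelated
```

A recommender can use exactly this comparison: items whose embedding is close to the vectors of things you liked are proposed first.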
Term Frequency-Inverse Document
Frequency (TF-IDF)
Why It's Important?
TF-IDF is a statistical measure used to evaluate how
important a word is to a document in a collection or
corpus. It increases proportionally to the number of
times a word appears in the document but is offset
by the frequency of the word across the corpus.
Real-world Implication:
When you search for "data scientist remote jobs" on
a job site, TF-IDF helps find the best matches. It
checks how often "data scientist" appears in a job
post and how unique the term is across all posts.
Posts with higher scores appear first, making your job
hunt quicker and more relevant.
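The definition above translates directly into code: tf is the in-document frequency, idf penalizes terms common across the corpus, and the score is their product. The three mini job posts are invented for illustration:

```python
import math

docs = [
    "data scientist remote jobs".split(),
    "marketing manager jobs".split(),
    "remote data engineer".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "scientist" appears in 1 of 3 posts, "jobs" in 2 of 3,
# so "scientist" scores higher in the first post
print(tf_idf("scientist", docs[0], docs))
print(tf_idf("jobs", docs[0], docs))
```

Note that a term appearing in every document gets idf = log(1) = 0: being everywhere makes it useless for ranking, which is the whole point of the offset.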
ONE-HOT ENCODING
Why It's Important?
Converts categorical data into binary vectors,
enabling computational operations on text.
Essential for NLP tasks in machine learning
models.
Real-World Example:
In customer service chatbots, one-hot encoding
transforms queries into a format the algorithm
understands, aiding in accurately responding to
customer needs like processing returns or
exchanges, thereby enhancing efficiency and
satisfaction.
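A minimal sketch of the encoding itself; the customer-service vocabulary below is a made-up example:

```python
def one_hot(word, vocab):
    """Binary vector with a single 1 in the word's vocabulary slot."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Hypothetical intent vocabulary for a support chatbot
vocab = ["refund", "exchange", "shipping", "return"]
print(one_hot("return", vocab))  # [0, 0, 0, 1]
```

The vector length equals the vocabulary size and every pair of words is equally distant, which is why large vocabularies usually move on to the dense embeddings described earlier.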
NAMED ENTITY RECOGNITION (NER)
Why It's Important?
NER is vital for extracting information from texts,
such as names of people, organizations, locations,
and more. This technique is fundamental in
information retrieval, content categorization, and
question-answering systems, making it easier to
organize and search large datasets.
Real-world Implication:
When customers send messages to their banks via
chat or email, NER can identify personal names,
account numbers, transaction IDs, and other crucial
entities in the text. This enables the automated
system to quickly pull up relevant account
information or transaction history, streamlining
customer service and reducing response times.
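A deliberately naive sketch of the banking scenario using regular expressions. Real NER relies on trained statistical or neural models, not patterns; the example message is invented, and the false positive it produces ("Please", a sentence-initial capital) shows why:

```python
import re

def naive_entities(text):
    """Rough heuristic: capitalized word runs as candidate names,
    long digit runs as candidate account/transaction IDs."""
    names = [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+ ?)+", text)]
    ids = re.findall(r"\b\d{6,}\b", text)
    return names, ids

names, ids = naive_entities(
    "Please send the statement for account 12345678 to Jane Smith"
)
print(names)  # includes 'Jane Smith' but also the false positive 'Please'
print(ids)    # ['12345678']
```

The ID pattern works here, but the name heuristic cannot tell a sentence-opening word from a person — the kind of ambiguity that makes trained NER models necessary.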
Question 1. A document classification system analyzes job descriptions to categorize them into IT, Marketing, and Finance. Which technique is likely used to convert the job descriptions into a format suitable for machine learning models?
A) N-grams
B) BoW
C) One-hot encoding
Correct Answer: B
Question 2. An online library wants to improve its search feature so that rare books on specific topics are more easily
found. Which technique could help prioritize these rare books when relevant search terms are used?
A) BoW
B) TF-IDF
C) Encoding
Question 3. A smartphone keyboard app predicts the next word as a user types to speed up input. What technique underlies this predictive text feature?
A) BoW
B) N-grams
C) One-hot encoding
Correct Answer: B
Question 4. An email filtering system needs to categorize incoming messages into "Urgent", "Important", "Regular", and "Spam". Before applying machine learning, each email's text is transformed into a binary vector. This initial step is called:
A) NER
B) TF-IDF
C) One-hot encoding
Question 5. A news aggregator automatically tags articles with names of mentioned countries, companies, and persons for easier browsing. Which NLP technique is being employed for tagging?
A) NER
B) BoW
C) N-grams
Question 6. To recommend articles based on their content similarity, a website analyzes how unique keywords are across articles compared to how often they appear in each article. This analysis likely uses:
A) Encoding
B) TF-IDF
C) BoW
Question 8. A data analyst wants to examine Twitter feeds to see how often certain policy topics are mentioned over time. Before analysis, tweets are broken down into 2-word phrases to capture context better than single words. This technique is called:
A) N-grams
B) TF-IDF
C) Encoding
Question 9. A customer feedback tool highlights keywords that frequently appear in negative reviews to help a business understand common complaints. This feature is most likely powered by:
A) BoW
B) NER
C) TF-IDF
Question 10. A language learning app highlights named entities in sentences to teach users proper nouns in context. This functionality relies on:
A) NER
B) Encoding
C) N-grams
For example: if you instruct your phone to "set an alarm for 8 in the morning", speech recognition converts it into text. Natural language understanding processes the text, extracts the meaning, and triggers an action to set the alarm at 8 am. Natural language generation then produces a response confirming that the alarm is set.
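The understand-then-respond loop in the alarm example can be sketched end to end. Everything here is a toy: the regex stands in for a trained understanding model, and the template reply stands in for real generation:

```python
import re

def understand(text):
    """Toy NLU: extract the intent and the hour from an alarm command."""
    m = re.search(r"set an alarm for (\d{1,2}) in the morning", text.lower())
    if m:
        return {"intent": "set_alarm", "hour": int(m.group(1))}
    return {"intent": "unknown"}

def generate_response(parsed):
    """Toy NLG: confirm the action in natural language."""
    if parsed["intent"] == "set_alarm":
        return f"Your alarm is set for {parsed['hour']} am."
    return "Sorry, I didn't understand that."

parsed = understand("Set an alarm for 8 in the morning")
print(generate_response(parsed))  # Your alarm is set for 8 am.
```

The structured dictionary in the middle is the important part: understanding turns free text into data an action can consume, and generation turns data back into text.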
NATURAL LANGUAGE GENERATION
Natural language generation (NLG) is the process of generating text that appears to be written by a
human, without the need for a human.
NLG usually works by pulling from large bodies of text, taking sentences that represent the main points, and identifying key concepts, which are then rephrased and summarized in a grammatically accurate manner.
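The "take the sentences that represent the main points" step can be sketched as extractive summarization: score each sentence by how frequent its words are in the whole text and keep the top ones. This sketch only extracts; unlike real NLG, it does not rephrase:

```python
from collections import Counter
import re

def summarize(text, n=1):
    """Keep the n sentences whose words are most frequent overall."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:n]

text = "NLP models process text. NLP models generate text. Bananas are yellow."
print(summarize(text))  # keeps an on-topic sentence, drops the outlier
```

The off-topic sentence loses because its words appear only once, while the repeated vocabulary of the main theme pushes those sentences up — a frequency signal, the same raw ingredient TF-IDF refines.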
What is the difference between syntax and semantics in natural language processing?
a) Syntax refers to the meaning of language, while semantics refers to the structure
b) Syntax refers to the structure of language, while semantics refers to the meaning
c) Syntax and semantics are the same thing
d) Syntax and semantics are not relevant to natural language processing
Answer: b) Syntax refers to the structure of language, while semantics refers to the meaning.
Which of the following is an example of a sequence labeling task in natural language processing?
a) Sentiment analysis
b) Named entity recognition
c) Text classification
d) Language modeling
Answer: b) Named entity recognition.
REAL WORLD APPLICATIONS OF NLP
1 Search Engine Results
Search engine functionality is an example of natural language processing in action. Search engines utilize NLP to propose appropriate results based on previous search history and user intent. Google, for example, anticipates what you'll enter next based on popular queries, while also taking into account the context and detecting the meaning behind what you want to say.
4 Email Filters
Email filters are common NLP examples you can find online across most servers.
For example, spam filters uncover patterns of words or phrases that are linked to spam messages.
5 Smart Assistants
Smart assistants such as Amazon's Alexa use voice recognition to understand everyday phrases and inquiries.
They then use a subfield of NLP called natural language generation to respond to queries.
As NLP evolves, smart assistants are now being trained to provide more than just one-way answers. Examples include
Apple's Siri, Amazon's Alexa, and Google Assistant.
6 Chatbots
Chatbots are an NLP customer service application example.
They can be used to:
• Respond to pre-determined FAQs
• Schedule meetings and appointments
• Book tickets
• Process and track orders
• Cross-sell and upsell
• Onboard new users or members
SENTIMENT ANALYSIS
Sentiment analysis is a technique in natural language processing (NLP)
that determines the emotional tone behind a body of text. This process
involves identifying whether the sentiment is positive, negative, or
neutral, often by analyzing word choice and context. It allows computers
to understand opinions, emotions, and attitudes expressed in written
language.
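A minimal lexicon-based sketch of the idea. The word lists are invented and tiny; real systems learn from labeled data and also model context (negation, sarcasm), which simple word counting ignores:

```python
# Hypothetical sentiment lexicons
POSITIVE = {"love", "great", "excellent", "good"}
NEGATIVE = {"hate", "bad", "terrible", "slow"}

def sentiment(text):
    """Count positive vs negative words and label the overall tone."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great course"))      # positive
print(sentiment("The app is slow and terrible"))  # negative
```

Even this crude version shows the output a business wants from review mining — but a sentence like "not bad at all" would fool it, which is why context-aware models took over.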