NLP Exam Notes
Introduction to NLP
- Natural Language Processing (NLP) is a field of Artificial Intelligence (AI)
that deals with the interaction between computers and human languages. NLP is
used to analyze, understand, and generate natural language text and speech. The
goal of NLP is to enable computers to understand and interpret human language
in a way similar to how humans process it.
Advantages of NLP:
1. Improving Communication: NLP can improve communication by
enabling computers to understand natural language and respond in a way
that is more intuitive for humans.
2. Text Summarization: NLP can be used to summarize large amounts of
text quickly and accurately, allowing users to quickly identify key points.
3. Sentiment Analysis: NLP can be used to analyze the sentiment of text,
allowing businesses to monitor customer feedback and adjust their
strategies accordingly.
4. Personalization: NLP can be used to personalize content for individual
users based on their preferences and behavior.
5. Automates repetitive tasks: NLP techniques can be used to automate
repetitive tasks, such as text summarization, sentiment analysis, and
language translation, which can save time and increase efficiency.
6. NLP helps users to ask questions about any subject and get a direct
response within seconds.
7. NLP aims to give a precise answer to the question asked, without
unnecessary or unwanted information.
8. NLP helps computers to communicate with humans in their languages.
9. It is very time efficient.
10. Many companies use NLP to improve the efficiency and accuracy of
documentation processes and to identify information in large
databases.
Disadvantages of NLP:
1. Requires large amounts of data: NLP systems require large amounts of
data to train and improve their performance, which can be expensive and
time-consuming to collect.
2. Limited ability to understand idioms and sarcasm: NLP systems have a
limited ability to understand idioms, sarcasm, and other forms of
figurative language, which can lead to misinterpretations or errors in the
output.
3. Limited understanding of context: NLP systems have a limited
understanding of context, which can lead to misinterpretations or errors in
the output.
4. Limited ability to understand emotions: NLP systems have a limited
ability to understand emotions and tone of voice, which can lead to
misinterpretations or errors in the output.
5. Bias: NLP systems may reflect the biases of their developers or training
data, leading to inaccurate or unfair results.
6. NLP output may not show the context it was derived from.
7. NLP output can be unpredictable.
8. NLP interfaces may require more keystrokes.
9. NLP systems often cannot adapt to a new domain and have limited
functionality, which is why they are typically built for a single, specific task.
Challenges in NLP
Both NLU and NLG are essential components of NLP and are used in a variety
of applications to improve human-machine interaction and automate tasks that
involve natural language processing. NLU is focused on understanding natural
language input, while NLG is focused on generating natural language output.
Together, these two subfields enable machines to communicate with humans
more naturally and intuitively.
Applications of NLP
- NLP techniques are used in a wide range of applications, including:
a) A penny for your thoughts: This idiom means asking someone to share
their thoughts or opinions about something. In NLP, this idiom can be used
in sentiment analysis to understand the opinions and attitudes of people
toward a particular topic.
c) A piece of cake: This idiom means something very easy to do. In NLP,
this idiom can be used to describe a text that is easy to classify or analyze,
such as a text with a clear sentiment polarity.
Morphology analysis
Morphology analysis in NLP refers to the process of analyzing the
structure of words to identify their morphemes, which are the smallest
units of meaning in a language. Morphology is an important aspect of
NLP because it allows us to understand how words are formed and how
their meaning can be modified by adding prefixes, suffixes, or other
affixes.
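As a rough illustration of splitting a word into morphemes, the sketch below strips at most one prefix and one suffix from a word. The affix lists and the minimum stem length are assumptions made for this example; real morphological analyzers rely on much larger lexicons and language-specific rules.

```python
# Hypothetical affix lists for illustration only.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ing", "ed", "s")
MIN_STEM = 3  # assumed minimum number of letters left in the stem


def split_morphemes(word):
    """Split a word into [prefix?, stem, suffix?] using the toy affix lists."""
    word = word.lower()
    parts = []
    # Strip at most one prefix, as long as enough of the word remains.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            parts.append(prefix)
            word = word[len(prefix):]
            break
    # Strip at most one suffix under the same length condition.
    suffix_found = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            suffix_found = suffix
            word = word[:-len(suffix)]
            break
    parts.append(word)
    if suffix_found:
        parts.append(suffix_found)
    return parts


print(split_morphemes("unkindness"))  # ['un', 'kind', 'ness']
print(split_morphemes("replayed"))    # ['re', 'play', 'ed']
```

A purely string-based splitter like this over-segments words that only look affixed (for example, it treats the "re" in "reading" as a prefix), which is one reason practical systems check candidate stems against a dictionary.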
Tagged Sentence: "The (DT) cat (NN) sat (VBD) on (IN) the (DT) mat
(NN). (.)"
Tagged Sentence: "I (PRP) am (VBP) eating (VBG) a (DT) delicious (JJ)
pizza (NN). (.)"
Some examples of stop words in English include "the", "and", "a", "an",
"in", "of", "to", "is", "that", and "it". However, the list of stop words may
vary depending on the specific NLP task and the characteristics of the
text data being analyzed.
Here's an example of how stop words are removed from text data
during pre-processing:
Original text: The quick brown fox jumps over the lazy dog.
Stop words removed: quick brown fox jumps lazy dog.
In this example, the stop words "The", "over", and "the" have been removed from
the original text to create a cleaner version of the text that is easier to analyze
with NLP techniques.
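The same example can be reproduced with a few lines of Python. This is a minimal sketch: the stop-word set below is a small hand-picked assumption (it includes "over" so that the output matches the example), and the regular expression is a deliberately simple tokenizer.

```python
import re

# Small illustrative stop-word set; real lists are longer and vary by task.
STOP_WORDS = {"the", "and", "a", "an", "in", "of", "to", "is", "that", "it", "over"}


def remove_stop_words(text):
    """Return the words of `text` that are not stop words, in original order."""
    words = re.findall(r"[A-Za-z']+", text)  # keep letters and apostrophes only
    return [word for word in words if word.lower() not in STOP_WORDS]


result = remove_stop_words("The quick brown fox jumps over the lazy dog.")
print(" ".join(result))  # quick brown fox jumps lazy dog
```

The comparison is done on the lowercased word, so "The" and "the" are both removed while the remaining words keep their original casing.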
It's worth noting that not all NLP tasks require the removal of stop
words. In some cases, stop words may be important for understanding
the meaning and context of text data. Therefore, it's important to
carefully consider the specific NLP task and the characteristics of the
text data when deciding whether to remove stop words.
1. Porter stemmer: The Porter stemmer is one of the most widely used
stemmers in NLP. It is a rule-based algorithm that applies a series of rules
to remove common suffixes from words. The Porter stemmer is often
used in information retrieval and text mining applications. Here's an
example:
Original Word: walking
Stemmed Word: walk
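To make the idea of rule-based suffix removal concrete, here is a simplified sketch. It is not the Porter algorithm, which has several ordered steps and conditions on the stem; the rule list and the minimum stem length below are assumptions chosen only to illustrate the principle.

```python
# Simplified suffix-stripping rules (suffix, replacement), tried in order.
# This is an illustrative subset, NOT the full Porter rule set.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # keep a double "s" intact, e.g. "glass"
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def simple_stem(word, min_stem=3):
    """Apply the first matching rule that leaves at least `min_stem` letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem:
                return stem
    return word


for original in ("walking", "walked", "cats", "sing", "glass"):
    print(original, "->", simple_stem(original))
# walking -> walk
# walked -> walk
# cats -> cat
# sing -> sing
# glass -> glass
```

The minimum stem length keeps short words such as "sing" from being reduced to a single letter; the real Porter stemmer handles this with a more careful measure of the stem.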
It's worth noting that stemmers and lemmatizers can produce different results
depending on the context and the specific algorithm used. Therefore, it's
important to choose the appropriate stemmer or lemmatizer based on the
specific NLP task and the characteristics of the text data being analyzed.
# Multi-word expression
Multi-word expressions (MWEs) in NLP refer to phrases or groups of words
that function as a single unit and carry a specific meaning that cannot be easily
inferred from the individual words alone. MWEs can include idioms,
collocations, phrasal verbs, and other fixed expressions that are commonly used
in language.
MWEs present a challenge for NLP tasks such as parsing, machine translation,
and sentiment analysis because their meaning is often not predictable from the
individual words in the expression. For example, the expression "kick the
bucket" means "to die", but this meaning cannot be inferred from the individual
words "kick" and "bucket".
To deal with MWEs in NLP, various approaches have been developed,
including:
MWEs are an important aspect of NLP because they are commonly used in
language and can significantly affect the meaning of the text. Proper handling of
MWEs can improve the accuracy and efficiency of NLP applications and
contribute to a more accurate understanding of natural language text.
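One simple way to handle MWEs, sketched here under the assumption that a lexicon of known expressions is available, is to merge each known expression into a single token before further processing. The lexicon contents and the function name are hypothetical.

```python
# Hypothetical MWE lexicon; each entry is a tuple of lowercase tokens.
MWE_LEXICON = {
    ("kick", "the", "bucket"),
    ("piece", "of", "cake"),
    ("new", "york"),
}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def merge_mwes(tokens):
    """Greedy longest-match merge of known MWEs into underscore-joined tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible expression first, down to two tokens.
        for size in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(token.lower() for token in tokens[i:i + size])
            if candidate in MWE_LEXICON:
                merged.append("_".join(candidate))
                i += size
                break
        else:
            # No expression starts at this position; keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_mwes("He may kick the bucket soon".split()))
# ['He', 'may', 'kick_the_bucket', 'soon']
```

This exact-match approach misses inflected or interrupted forms such as "kicked the bucket", so it is best read as a baseline rather than a complete solution.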
NLTK
Natural Language Toolkit (NLTK) is a popular open-source platform for
building Python programs that work with human language data. It is a
comprehensive library of tools and algorithms for tasks such as tokenization,
stemming, tagging, parsing, and machine learning, as well as a variety of corpus
resources.
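A short usage sketch follows, assuming NLTK is installed (for example with `pip install nltk`). It uses only `wordpunct_tokenize` and `PorterStemmer`, which work without downloading extra data; other NLTK functions, such as `word_tokenize` and `pos_tag`, need additional data packages to be downloaded first.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The cats are walking."

# Regex-based tokenizer: splits into runs of word characters and punctuation.
tokens = wordpunct_tokenize(text)
print(tokens)  # ['The', 'cats', 'are', 'walking', '.']

# Rule-based Porter stemmer applied to each token.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```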
# Text mining: Text mining, also known as text analytics, is the process
of analyzing unstructured text data to extract useful information. It
involves techniques such as natural language processing, machine
learning, and information retrieval to identify patterns and relationships in
text data. Text mining can be used for a variety of applications, such as
information extraction, text classification, and text summarization.
All three subfields of NLP involve the use of machine learning algorithms and
statistical techniques to extract insights from large amounts of text data. By
applying these techniques, businesses and organizations can gain valuable
insights into customer behavior, preferences, and opinions, and use this
knowledge to inform business decisions and improve customer satisfaction.
# Text classification
Text classification is a subfield of natural language processing (NLP) that
involves assigning predefined categories or labels to text based on its content. It
is also known as text categorization or text tagging.
Text classification is used in a variety of applications such as spam filtering,
sentiment analysis, language identification, and topic labeling. It
typically involves training machine learning models that can
automatically assign text to different categories or labels.
There are several approaches to text classification, including rule-based
methods, machine learning methods, and deep learning methods. Rule-based
methods involve defining a set of rules or patterns that can be used to classify
text. Machine learning methods involve training a model on a labeled dataset,
and then using the trained model to classify new text. Deep learning methods
use neural networks to automatically learn features from text and classify it into
different categories.
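The rule-based approach can be sketched as a keyword lookup. The labels and keywords below are hypothetical, chosen only to show the mechanism: each label has a set of trigger words, and the label with the most matches wins.

```python
import re

# Hypothetical keyword rules: label -> words that count as evidence for it.
RULES = {
    "spam": {"free", "winner", "prize"},
    "billing": {"invoice", "payment", "refund"},
}


def classify_by_rules(text, default="other"):
    """Return the label whose keywords match most often, or `default` if none match."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_label, best_hits = default, 0
    for label, keywords in RULES.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


print(classify_by_rules("You are a winner! Claim your free prize."))  # spam
print(classify_by_rules("Please send the invoice again."))            # billing
print(classify_by_rules("Hello there."))                              # other
```

Rules like these are easy to inspect and need no training data, but they have to be written and maintained by hand, which is the main trade-off against the learned approaches.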
Some popular machine learning algorithms used for text classification include
Naive Bayes, Support Vector Machines (SVM), and decision trees. Deep
learning models like Convolutional Neural Networks (CNN) and Recurrent
Neural Networks (RNN) have also shown promising results in text classification
tasks.
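As an illustration of the machine learning approach, here is a minimal multinomial Naive Bayes classifier in plain Python with add-one (Laplace) smoothing. The training sentences and labels are invented for this example, and the whitespace tokenizer is a simplifying assumption; in practice a library implementation and a much larger labeled dataset would be used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Log prior plus log likelihood of each known word, add-one smoothed.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(training)
print(predict("free prize money", model))     # spam
print(predict("monday team meeting", model))  # ham
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that was never seen with a given label from forcing that label's probability to zero.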
Overall, text classification is a crucial task in NLP, as it allows us to
automatically categorize and analyze large amounts of text data, making it
easier to extract insights and make informed decisions.