CL Unit 1


Computational Linguistics Unit 1

1] What is NLP?
 Linguistics: is concerned with language, its formation, syntax, meaning, and different kinds of
phrases (noun or verb)
 Computer Science: is concerned with applying linguistic knowledge, by transforming it into
computer programs with the help of sub-fields such as Artificial Intelligence (Machine Learning
& Deep Learning).
 Natural language processing (NLP) is the intersection of computer science, linguistics and
machine learning.
 Natural Language Processing (NLP) is an aspect of Artificial Intelligence that helps computers
understand, interpret, and utilize human languages.
 The field focuses on communication between computers and humans in natural language and
NLP is all about making computers understand and generate human language.
 Advantages of NLP
 NLP helps users to ask questions about any subject and get a direct response within seconds.
 NLP helps computers to communicate with humans in their languages.
 Many companies use NLP to improve the efficiency and accuracy of documentation processes
and to identify information in large databases.
 Used to process raw and unstructured data from online sources.
 Helps to have a deep understanding of broad natural language.
 Understand semantics (meaning) of tokens used in Natural Languages.
 Provides an easy way of communication using language translation and generation of new
text.
 Natural Language Processing also provides computers with the ability to read text, hear
speech, and interpret it.

2] Discuss a brief history of Natural Language Processing.


 The study of natural language processing generally started in the 1950s.
 In 1950, Alan Turing published an article titled “Computing Machinery and Intelligence” which
proposed what is now called the Turing test as a criterion of intelligence.
 Furthermore, up to the 1980s, most NLP systems were based on complex sets of hand-written
rules.
 Starting in the late 1980s, however, there was a revolution in NLP with the introduction
of machine learning (ML) algorithms for language processing.
 Since this so-called “statistical revolution” of the late 1980s and mid-1990s, much natural
language processing research has relied heavily on ML, and it relies on ML even more today
because of the breakthroughs in the now-famous subfield of ML called deep learning (DL).
 In the 2010s, deep learning took over and deep neural network-style ML methods became
widespread in natural language processing, due to results showing that such techniques can
achieve state-of-the-art results in many natural language tasks, such as language modeling,
parsing and many others.

3] List and explain different NLP applications:


1. Speech To Text (STT):
Speech to text conversion is the process of converting spoken words into written texts.
This process is also often called speech recognition.
2. Text to Speech (TTS)
Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud. It’s sometimes
called “read aloud” technology.
The voice in TTS is computer-generated, and reading speed can usually be sped up or slowed
down.
Some TTS tools also have a technology called optical character recognition (OCR).
OCR allows TTS tools to read text aloud from images.
3. NL Generation:
Natural Language Generation (NLG) is the process of generating descriptions or narratives in
natural language from structured data.
NLG often works closely with Natural Language Understanding (NLU).
4. QA system:
It is used to answer questions in the form of natural language and has a wide range of
applications.
Typical applications include: intelligent voice interaction, online customer service, knowledge
acquisition, personalized emotional chatting, and more.
5. Machine Translation:
Machine Translation (MT) is the task of automatically converting one natural language into
another, preserving the meaning of the input text, and producing fluent text in the output language.
6. Text Summarization:
It is a process of generating a concise and meaningful summary of text from multiple text
resources such as books, news articles, blog posts, research papers, emails, and tweets.

4] Discuss the Challenges of NLP


1. Contextual words and phrases and homonyms
The same words and phrases can have different meanings according to the context of a sentence,
and many words, especially in English, have the exact same pronunciation but totally different
meanings.
For example:
I ran to the store because we ran out of milk.
Homonyms (two or more words that are pronounced the same but have different definitions) can
be problematic for question answering and speech-to-text applications because the spoken input
does not show which word is meant. Usage of their and there, for example, is even a common
problem for humans.
2. Synonyms
Synonyms can lead to issues similar to contextual understanding because we use many different
words to express the same idea. Furthermore, some of these words may convey exactly the same
meaning, while others differ in degree (small, little, tiny, minute), and different people
use synonyms to denote slightly different meanings within their personal vocabulary.

So, for building NLP systems, it’s important to include all of a word’s possible meanings and all
possible synonyms.
3. Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.

Lexical ambiguity: a word that could be used as a verb, noun, or adjective.


Semantic ambiguity: the interpretation of a sentence in context. For example: I saw the boy on the
beach with my binoculars. This could mean that I saw a boy through my binoculars or that the boy
had my binoculars with him.
Syntactic ambiguity: the structure of the sentence above is what creates the confusion of meaning.
The phrase with my binoculars could modify the verb, “saw,” or the noun, “boy.”

4. Errors in text and speech


Misspelled or misused words can create problems for text analysis. Autocorrect and grammar
correction applications can handle common mistakes, but don’t always understand the writer’s
intention.

5. Language differences

In the United States, most people speak English, but if you’re thinking of reaching an international
and/or multicultural audience, you’ll need to provide support for multiple languages.

6. Training data

At its core, NLP is all about analyzing language to better understand it. A human being must be
immersed in a language constantly for a period of years to become fluent in it; even the best AI
must also spend a significant amount of time reading, listening to, and utilizing a language. The
abilities of an NLP system depend on the training data provided to it. If you feed the system bad or
questionable data, it’s going to learn the wrong things, or learn in an inefficient way.

5] Explain the role of Grammar in NLP.


Natural language has an underlying structure usually referred to under the heading of Syntax. The
fundamental idea of syntax is that words group together to form so-called constituents i.e. groups
of words or phrases which behave as a single unit. These constituents can combine together to
form bigger constituents and eventually sentences. So for instance, John, the man, the man with a
hat and almost every man are constituents (called Noun Phrases or NP for short) because they all
can appear in the same syntactic context.
A commonly used mathematical system for modelling constituent structure in Natural Language is
Context-Free Grammar (CFG).
Specifically, a CFG (also sometimes called Phrase-Structure Grammar) consists of four
components:

T, the terminal vocabulary: the words of the language being defined

V, the non-terminal vocabulary (also written N): a set of symbols, also called variables, disjoint
from T

P, a set of productions of the form a -> b, where a is a non-terminal and b is a sequence of one or
more symbols from T ∪ V

S, the start symbol, a member of V

Example context-free grammar


G = (V, T, S, P)
V = {S, NP, VP, PP, Det, Noun, Verb, Aux, Pre}
T = {‘a’, ‘ate’, ‘cake’, ‘child’, ‘fork’, ‘the’, ‘with’}
S=S
P = { S → NP VP
NP → Det Noun | NP PP
PP → Pre NP
VP → Verb NP
Det → ‘a’ | ‘the’
Noun → ‘cake’ | ‘child’ | ‘fork’
Pre → ‘with’
Verb → ‘ate’}
Sample derivation:
S → NP VP
→ Det Noun VP
→ the Noun VP
→ the child VP
→ the child Verb NP
→ the child ate NP
→ the child ate Det Noun
→ the child ate a Noun
→ the child ate a cake
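
The example grammar above can also be checked mechanically. The following is a minimal sketch in plain Python (not NLTK) of a CYK-style recognizer. It relies on the fact that every production in this particular grammar either has exactly two non-terminals on the right or rewrites to a single word. The names BINARY, LEXICAL and recognizes are illustrative and not part of any library.

```python
from functools import lru_cache

# Binary productions of the example grammar: head -> list of (left, right)
BINARY = {
    "S": [("NP", "VP")],
    "NP": [("Det", "Noun"), ("NP", "PP")],
    "PP": [("Pre", "NP")],
    "VP": [("Verb", "NP")],
}
# Lexical productions: pre-terminal -> words it can rewrite to
LEXICAL = {
    "Det": {"a", "the"},
    "Noun": {"cake", "child", "fork"},
    "Pre": {"with"},
    "Verb": {"ate"},
}

def recognizes(sentence, start="S"):
    """Return True if the grammar can derive the whitespace-separated sentence."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # A one-word span is covered by a lexical production
        if j - i == 1 and words[i] in LEXICAL.get(symbol, ()):
            return True
        # Otherwise try every binary production and every split point
        for left, right in BINARY.get(symbol, ()):
            for k in range(i + 1, j):
                if derives(left, i, k) and derives(right, k, j):
                    return True
        return False

    return len(words) > 0 and derives(start, 0, len(words))

print(recognizes("the child ate a cake"))
print(recognizes("the child ate a cake with a fork"))
```

Both calls print True; the second sentence is accepted through the recursive production NP → NP PP.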

Q6] Describe Tokenization process with the help of example.


 Essentially, electronic text is nothing more than a sequence of characters.
 NLP tools, however, generally process text in terms of linguistic units, such as words, clauses,
sentences, paragraphs, and so on.
 Thus, NLP algorithms need to first segment text data into separate tokens that can be
processed by NLP tools.
 Tokenization is the process of breaking down text into words, phrases, symbols, or other
meaningful elements called tokens.
 The input to the tokenizer is Unicode text, and the output is a list of sentences or words.
 In NLTK, we have two types of tokenizers – the word tokenizer and the sentence tokenizer.
 The sent_tokenize function splits the text into sentences, and the word_tokenize function splits
the text into words.
 The punctuation is also treated as a separate token.

Example:

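from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the NLTK Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
text = ("Natural language processing is fascinating. It involves many tasks such as text "
        "classification, sentiment analysis, and more.")
# Split the text into sentences
sentences = sent_tokenize(text)
print(sentences)
# Split the text into word and punctuation tokens
words = word_tokenize(text)
print(words)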
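
To see the idea behind word tokenization without any downloaded models, here is a rough sketch using only Python's re module. It is not how NLTK's word_tokenize works; the function name simple_tokenize is an assumption made for illustration.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```

A pattern this simple splits contractions such as isn't into three pieces, which is one reason dedicated tokenizers are preferred in practice.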

Q7] Explain the Stemming concept with example.


 Stemming is a text preprocessing technique used in natural language processing (NLP) to
reduce words to their root or base form.
 The goal of stemming is to simplify and standardize words, which helps improve the
performance of information retrieval, text classification, and other NLP tasks.
 By transforming words to their stems, NLP models can treat different forms of the same word
as a single entity, reducing the complexity of the text data.
 For example, stemming reduces the words “running” and “runs” to their stem “run”; whether a
derived form such as “runner” is also reduced to “run” depends on the stemming algorithm.
 This allows the NLP model to recognize that these words share a common concept or
meaning, even though their forms are different.
 Stemming algorithms typically work by removing or replacing word suffixes or prefixes, based
on a set of predefined rules or heuristics. Some common stemming algorithms include the
Porter Stemmer, Lancaster Stemmer, and Snowball Stemmer.
 Reasons for using stemming:
 Text simplification: Stemming helps simplify text data by reducing words to their base forms,
making it easier for NLP models to process and analyze the text.
 Improved model performance: By reducing word variations, stemming can lead to better model
performance in tasks such as text classification, sentiment analysis, and information retrieval.
 Standardization: Stemming standardizes words, which helps in comparing and matching text
data across different sources and contexts.
Example:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Initialize the Porter Stemmer
stemmer = PorterStemmer()
# Example sentence
sentence = "The quick brown foxes were jumping over the lazy dogs."
# Tokenize the sentence
words = word_tokenize(sentence)
# Stemming the words
stemmed_words = [stemmer.stem(word) for word in words]
# Print the stemmed words
print("Original words:", words)
print("Stemmed words:", stemmed_words)
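
The rule-based suffix removal described above can be illustrated with a toy stemmer. This is a deliberately simplified sketch and not the Porter algorithm; the suffix list, the min_stem threshold and the name toy_stem are assumptions chosen for the example.

```python
# Suffixes are tried in this order; the first one that matches is removed
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip one known suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "foxes", "dogs", "asked", "running"]:
    print(w, "->", toy_stem(w))
```

The last word comes out as runn rather than run, which shows why real stemmers need extra rules (for example, for doubled consonants).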

Q8] What is Lemmatization? Compare it with stemming.


 Lemmatization is a process that takes into consideration the morphological analysis of the
words and efficiently reduces a word to its base or root form.
 In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of the
word.
 There are different algorithms used to find out how many characters have to be chopped off,
but the algorithms don’t actually know the meaning of the word in the language it belongs to.
 In lemmatization, the algorithms do have this knowledge.
 In fact, you can even say that these algorithms refer to a dictionary to understand the meaning
of the word before reducing it to its root word, or lemma.
 So, a lemmatization algorithm would know that the word better is derived from the word good,
and hence, the lemma is good.
 But a stemming algorithm wouldn’t be able to do the same.
 There could be over-stemming or under-stemming, and the word better could be reduced to
either bet, or bett, or just retained as better.
 But there is no way in stemming that can reduce better to its root word good.
 This is the difference between stemming and lemmatization.
 Because lemmatization involves deriving the meaning of a word from something like a
dictionary, it’s very time consuming.
 So most lemmatization algorithms are slower compared to their stemming counterparts.
Example:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Requires the NLTK WordNet and Punkt data, e.g. nltk.download('wordnet')
text = ("He was running and eating at same time. He has bad habit of swimming after playing long "
        "hours in the Sun.")
lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
# lemmatize() treats each word as a noun by default; pass pos='v' to lemmatize verbs
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
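
The contrast between dictionary lookup and suffix chopping can be sketched in a few lines of plain Python. The tiny LEMMA_DICT and both function names are hypothetical; a real lemmatizer such as WordNet's uses a full lexical database and part-of-speech information.

```python
# Hypothetical mini-dictionary mapping irregular forms to their lemmas
LEMMA_DICT = {"better": "good", "ran": "run", "mice": "mouse"}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself if it is unknown
    word = word.lower()
    return LEMMA_DICT.get(word, word)

def chop_suffix(word):
    # Crude stemming: remove one suffix with no knowledge of meaning
    word = word.lower()
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["better", "mice"]:
    print(w, "| lemma:", toy_lemmatize(w), "| stem:", chop_suffix(w))
```

Here the lookup maps better to good, while the suffix chopper can only produce bett.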

Q9] Explain the concept of Morphology with few examples in English.


 Morphological analysis is a field of linguistics that studies the structure of words.
 It identifies how a word is produced through the use of morphemes.
 A morpheme is a basic unit of language.
 The morpheme is the smallest element of a word that has grammatical function and meaning.
 We can divide morphemes into two broad classes.
 Stems – the core meaningful units, the root of the word.
 Affixes – add additional meanings and grammatical functions to words.
 Affixes are further divided into:
 Prefixes – precede the stem: do / undo
 Suffixes – follow the stem: eat / eats
 Infixes – are inserted inside the stem
 Circumfixes – precede and follow the stem
 There are two broad classes of morphology:
 Inflectional morphology
 Derivational morphology
 After combining with an inflectional morpheme, the meaning and class of the actual stem
usually do not change.
 eat / eats, pencil / pencils
 After combining with a derivational morpheme, the meaning and the class of the actual
stem usually change.
 compute / computer, do / undo, friend / friendly
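
As a rough illustration of splitting a word into prefix, stem and suffix, the sketch below uses a tiny hand-made lexicon. The sets STEMS, PREFIXES and SUFFIXES and the function segment are assumptions for this example only; real morphological analyzers rely on much larger resources and rules.

```python
# Hypothetical toy lexicon built from the examples in the notes
STEMS = {"do", "eat", "pencil", "friend"}
PREFIXES = {"un"}
SUFFIXES = {"s", "ly"}

def segment(word):
    """Return (prefix, stem, suffix) if the word fits the toy lexicon, else None."""
    for stem in STEMS:
        start = word.find(stem)
        if start == -1:
            continue
        prefix = word[:start]
        suffix = word[start + len(stem):]
        # Accept only an empty or known prefix and an empty or known suffix
        if (prefix == "" or prefix in PREFIXES) and (suffix == "" or suffix in SUFFIXES):
            return prefix, stem, suffix
    return None

for w in ["undo", "eats", "pencils", "friendly"]:
    print(w, segment(w))
```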

Q10] Discuss the need for Regular expression in NLP with example.
Many linguistic processing tasks involve pattern matching. For example, we can find
words ending with ed using endswith('ed').
Regular expressions give us a more powerful and flexible method for describing
the character patterns we are interested in.
To use regular expressions in Python, we need to import the re library. We can then search
a list of words (wordlist below stands for any list of strings):
>>> import re
>>> [w for w in wordlist if re.search('ed$', w)]
Example:
Extracting Word Pieces
The re.findall() (“find all”) method finds all (non-overlapping) matches of the given
regular expression. Let’s find all the vowels in a word, then count them:
>>> import re
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16

Q11] Discuss N-grams and their variations such as Bigram and Trigram.
 N-grams are continuous sequences of words or symbols or tokens in a document and are
defined as the neighboring sequences of items in a document.
 They are used most importantly in tasks dealing with text data in NLP (Natural Language
Processing).
 N-gram models are widely used in statistical natural language processing, for example in
speech recognition (where phonemes and sequences of phonemes are modeled with an n-gram
distribution), machine translation, predictive text input, and many other tasks for which the
modeling inputs are n-gram distributions.
 N-grams are defined as the contiguous sequence of n items that can be extracted from a given
sample of text or speech.
 The items can be letters, words, or base pairs, according to the application.
 The N-grams typically are collected from a text or speech corpus (Usually a corpus of long text
dataset).
 N-grams can also be seen as a set of co-occurring words within a given window, computed by
moving the window some k words forward (k can be 1 or more).
 The co-occurring words are called "n-grams," and "n" is a number saying how long a string of
words we have considered in the construction of n-grams.
 Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four
words, 5-grams are five words, etc.
Example:
from nltk import ngrams
from nltk.tokenize import word_tokenize
sentence = "The big cat ate the little mouse who was after fresh cheese"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# Generate trigrams
trigrams = list(ngrams(tokens, 3))
print(trigrams)
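
The sliding-window view of n-grams described above can be written directly in plain Python. This is a minimal sketch that moves the window one token at a time (k = 1); the function name make_ngrams is an assumption and is not the NLTK implementation.

```python
def make_ngrams(tokens, n):
    # Slide a window of size n over the tokens, one position at a time
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The big cat ate the little mouse".split()
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```

When the token list is shorter than n, the function returns an empty list.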

Q12] What is Data Acquisition?


 Data acquisition is the process of gathering and collecting data for use in natural language
processing (NLP) tasks.
 The quality and quantity of the data is critical to the success of any NLP model.
 There are a number of different ways to acquire data for NLP tasks. Some common methods
include:
 Crawling and scraping the web: This involves using web crawlers and scrapers to collect text
data from websites.
 Using social media data: This involves collecting text data from social media platforms such as
Twitter, Facebook, and Reddit.
 Customer reviews: This involves collecting text from customer review websites like Amazon
and Yelp.
 Using public datasets: There are a number of public datasets available that contain text data,
such as TREC (https://trec.nist.gov/) and CORD-19
(https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).
 Generating synthetic data: This involves generating artificial text data using techniques such as
machine translation and text generation.
 Consider the specific NLP task you are trying to solve.
 The type of data you need will vary depending on the task.
 For example, if you are trying to build a sentiment analysis model, you will need data that
includes both positive and negative sentiment.
 Make sure the data is representative of the real world.
 The data you collect should be representative of the type of text that your model will encounter
in the real world.
 For example, if you are building a model to answer questions about customer support, you
should make sure to include data from customer support forums and websites.

Q13] Discuss Text Extraction and Cleanup process.


 Text extraction:
 Text extraction, often referred to as keyword extraction, uses machine learning to automatically
scan text and extract relevant or core words and phrases from unstructured data like news
articles, surveys, and customer service tickets.
 A sub-task of keyword extraction is entity extraction (or entity recognition), used to pull out
important data points, like names, organizations, and email addresses to automatically
populate spreadsheets or databases.
 You can also specify other types of information that need to be extracted, such as product
specifications (model, memory, color, brand, size, material, etc.)
 Use pattern extraction models to extract entities whose structures match a specific pattern, for
example, postal codes, case numbers, email addresses, and so on.
 You can specify the list of key terms and their synonyms that belong to a particular domain. For
example, you can create a list of keywords to track social media messages that pertain to the
latest release of a product or a group of products of your competitor.
 Text Cleaning:
 Normalizing text is the process of standardizing text so that, through NLP, computer models
can better understand human input, with the end goal being to more effectively perform
sentiment analysis and other types of analysis on your customer feedback.
 Specifically, normalizing text with Python and the NLTK library means standardizing
capitalization so that machine models don’t group capitalized words (Hey) as different from
their lowercase counterparts (hey).
 This is called case normalization.
 Remove Unnecessary Whitespaces:
 Most of the text data you collect from the web may contain some extra spaces between words,
before and after a sentence. It is important to remove these before applying any text
processing or cleaning technique to the data.
 Removing Unwanted Data
 Unwanted data refers to certain parts of the text that don’t add any value in analysis and model
building. For example hashtags, HTML tags, emails, URLs, phone numbers, or some special
combination of characters. We can remove these completely from our text data or replace
them with their representative word.
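
The cleanup steps above (case normalization, whitespace removal, and removing or replacing unwanted data) can be combined into one small function. This is a minimal sketch using only the re module; the patterns are simplified assumptions and the placeholder words <url> and <email> are arbitrary choices, not a standard.

```python
import re

def clean_text(text):
    text = text.lower()                                  # case normalization
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags
    text = re.sub(r"https?://\S+", "<url>", text)        # replace URLs with a placeholder
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "<email>", text)  # replace emails
    return " ".join(text.split())                        # collapse extra whitespace

raw = "  Hey   <b>there</b>   friend! See https://example.com/page or mail me@example.com  "
print(clean_text(raw))
```

The tags are removed before the placeholders are inserted, so the placeholders themselves are not stripped by the tag pattern.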

Q14] Explain few Pre-Processing techniques in NLP.


The main goal of text preprocessing is to break down the noisy text into a form that ML models
can digest. It cleans and prepares textual data for further analysis and reporting.
1. Tokenization
Tokenization in NLP is the process of breaking down a piece of text into smaller chunks, called
tokens, such as words, phrases, symbols, or other meaningful elements. It’s a fundamental step in
most NLP tasks, as it helps to standardize text and make it more manageable for further analysis.
2. Stop Word Removal
Words like “is” and “are” are abundant in textual data, appearing so frequently that they don’t need
processing as thoroughly as nouns, other verbs, and adjectives. NLP refers to these as stop
words, which usually don’t add meaning to the data. Stop word removal means removing these
commonly used words from the text you want to process.
3. Stemming
ML practitioners use stemming to clean textual data by removing prefixes and suffixes from words.
The stemming process removes redundancy in the data. For example, assume that your data has
this set of words: asked, asking, and ask. These words are different forms of the root word ask.
The stemming process transforms the words asked and asking into ask.
4. Part-Of-Speech (POS) Tagging:
Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence.
The most common parts of speech are noun, verb, adjective, adverb, pronoun, preposition, and
conjunction.
In simple words, we can say that POS tagging is a task of labelling each word in a sentence with
its appropriate part of speech.
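
Of the techniques listed above, stop word removal is easy to sketch without any library. The stop word set below is a small hypothetical list made up for this example; NLTK and other libraries ship much larger ones.

```python
import re

# Small hypothetical stop word list, for illustration only
STOP_WORDS = {"is", "are", "the", "a", "an", "of", "and", "to", "in"}

def remove_stop_words(text):
    # Lowercase, keep only word tokens, then filter out the stop words
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The cats are sleeping in the garden"))
# ['cats', 'sleeping', 'garden']
```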

Q15] Explain Feature engineering concept used in NLP.


In simple terms, Feature Extraction is transforming textual data into numerical data.
After cleaning and normalizing textual data, we need to transform it into features for
modeling, as the machine cannot compute directly on textual data. So we use a numerical
representation for individual words, as it’s easy for the computer to process numbers.
1. CountVectorizer
A CountVectorizer model is a representation of text that describes the occurrence of words within a
document. We just keep track of word counts and disregard the grammatical details and the word
order. It is called a “bag of words” because any information about the order or structure of words in
the document is discarded. The model is only concerned with how often known words occur in the
document, not where in the document.
2. TF – IDF Vectorizer (Term Frequency – Inverse Document Frequency)
It’s designed to reflect how important a word is to a document in a collection or corpus.
The TF-IDF value increases proportionally to the number of times a word appears in the document
and is offset by the number of documents in the corpus that contain the word, which helps to
adjust for the fact that some words appear more frequently in general.
3. Principal Component Analysis (PCA):
This feature extraction method reduces the dimensionality of large data sets while preserving the
maximum amount of information. Principal Component Analysis emphasizes variation and
captures important patterns and relationships between variables in the dataset.
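
The first two methods above can be sketched in plain Python. This is an illustration only: it assumes one common TF-IDF variant (term frequency as count divided by document length, IDF as log(N / df)), whereas libraries such as scikit-learn use smoothed versions that give different numbers. The sample documents and the function name tf_idf are made up for the example.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

# Bag of words: one Counter of word counts per document (word order is discarded)
bags = [Counter(tokens) for tokens in tokenized]

def tf_idf(word, doc_index):
    tokens = tokenized[doc_index]
    tf = bags[doc_index][word] / len(tokens)       # how frequent the word is in this document
    df = sum(1 for bag in bags if word in bag)     # number of documents containing the word
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)                 # rarer across the corpus -> larger idf
    return tf * idf

print(bags[0]["the"])                       # count of "the" in the first document
print(tf_idf("cat", 0), tf_idf("sat", 0))   # "cat" occurs in one document, "sat" in two
```

In the first document, cat receives a higher weight than sat because sat also appears in the second document.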
Common Feature Types:
Numerical: Values with numeric types (int, float, etc.). Examples: age, salary, height.
Categorical Features: Features that can take one of a limited number of values. Examples: gender
(male, female, X), color (red, blue, green).
Ordinal Features: Categorical features that have a clear ordering. Examples: T-shirt size (S, M, L,
XL).
Binary Features: A special case of categorical features with only two categories. Examples:
is_smoker (yes, no), has_subscription (true, false).
Text Features: Features that contain textual data.

Q16] Discuss the Modeling and Evaluation in NLP.


Modeling is the heart of the pipeline, where models are applied and evaluated using different approaches:
(i) Heuristic Approaches
Heuristic models rely on predefined rules or strategies based on expert knowledge to make
decisions.
Application: Commonly used in simple text-based tasks where rule-based systems can effectively
handle specific patterns or tasks, like keyword matching for sentiment analysis or rule-based
chatbots.
(ii) Machine Learning (ML) Approaches
ML models learn patterns and relationships from data to make predictions or classifications.
Applications:
Support Vector Machines (SVM): Effective for text classification tasks by finding the best
separation between classes in a high-dimensional space.
(iii) Deep Learning (DL) Approaches
DL models use neural networks with multiple layers to learn complex patterns and representations
from raw data.
Applications:
Recurrent Neural Networks (RNNs): Effective for sequence-based tasks like language modelling,
sentiment analysis, or machine translation.
Evaluation:
(i) Intrinsic Evaluation
Intrinsic evaluation focuses on assessing the technical aspects and capabilities of the model in
isolation, without considering its real-world application.
Examples of Intrinsic Metrics:
Accuracy: Measures the ratio of correctly predicted instances to the total instances in the dataset.
Precision and Recall: Assess the model’s performance in binary or multi-class classification tasks.
(ii) Extrinsic Evaluation
Extrinsic evaluation measures the model’s performance in real-world applications or business
contexts, considering its impact and utility in practical scenarios.
Examples of Extrinsic Evaluation Metrics:
Business Metrics: Metrics aligned with specific business goals or outcomes, such as customer
satisfaction scores, revenue impact, or user engagement rates.
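
The intrinsic metrics named above (accuracy, precision and recall) can be computed by hand for a binary task. The sketch below is a minimal illustration with made-up labels; the function name evaluate and the choice of 1 as the positive class are assumptions, and it expects non-empty label lists.

```python
def evaluate(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(evaluate(y_true, y_pred))
```

For these labels there are two true positives, one false positive and one false negative, so all three metrics equal 2/3.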

Q17] What are the Post-Modeling Phases?


The deployment phase in the NLP pipeline marks the transition of the developed model from the
development environment to a production environment, followed by continuous monitoring and
adaptation to ensure sustained performance and relevance.
(i) Deployment
Rolling out the Model: Moving the trained NLP model from the development environment to a
production environment where it can be utilized in real-world applications.
Infrastructure Setup: Configuring the necessary infrastructure, integrating the model into the
existing systems, and ensuring scalability and reliability.
Testing and Validation: Thoroughly testing the deployed model to ensure it functions as expected
and delivers accurate results in the production environment.
(ii) Monitoring
Continuous Performance Oversight: Constantly monitoring the model’s performance, including its
accuracy, efficiency, and response time in real-time or at regular intervals.
Alert Systems: Implementing alert systems or triggers to notify about deviations or anomalies in
the model’s behaviour, ensuring timely interventions.
(iii) Update
Adaptation to Dynamic Data: Adapting the model to changing data patterns or evolving
requirements by periodically updating and retraining the model.
Improvement Iterations: Incorporating feedback, identifying areas for improvement, and fine-tuning
the model to enhance its performance or address changing user needs.
Version Control: Maintaining version control to track model iterations and changes, ensuring
transparency and reproducibility.
