
Natural Language Processing (NLP) & Computational Linguistics

Lecture 3: Language and Technology

Introduction

• Natural Language Processing (NLP) is a subfield of artificial intelligence and computational linguistics that focuses on enabling machines to understand, interpret, and manipulate human language. NLP involves the development of algorithms and models that allow computers to process large amounts of natural language data and perform tasks such as translation, summarization, sentiment analysis, and more.
• In the context of teaching computational linguistics, NLP plays a critical role in helping students understand how human languages can be modeled computationally. It allows students to experiment with real-world data, understand linguistic phenomena, and apply machine learning models to solve language-related problems.
Key Concepts in NLP for Computational Linguistics

1. Tokenization
2. Part-of-Speech Tagging (POS Tagging)
3. Parsing
4. Named Entity Recognition (NER)
5. Machine Translation (MT)
6. Sentiment Analysis
Key Concepts in NLP for Computational Linguistics
1. Tokenization: Tokenization is the process of breaking down a text into individual units (tokens), such as words, phrases, or even characters. This step is fundamental in NLP because it serves as the foundation for many downstream tasks, such as parsing and language modeling.

In NLP, tokenization plays a critical role as the first step in breaking text into manageable and analyzable units, or tokens. These tokens can represent words, phrases, or even characters, depending on the type of analysis or model being used.
Importance
• Tokenization is foundational because almost every downstream task
in NLP—like parsing, part-of-speech tagging, sentiment analysis, and
machine translation—requires the text to be divided into distinct,
meaningful units. Without tokenization, the computational model
wouldn't be able to process and understand the text correctly.
• It helps in:
• Reducing complexity: By breaking down long sentences or
paragraphs into smaller units, tokenization simplifies the text, making
it easier to analyze.
• Maintaining structure: Tokenization ensures the structure of the
language (like word order or phrase separation) is preserved, which is
important for tasks such as syntax parsing and language modeling.
Tokenization Across Different Languages
• Tokenization varies significantly across languages, especially when the
structures of those languages differ. For example:
• English: Tokenization in English is relatively straightforward, as it is usually
word-based and relies on spaces and punctuation as delimiters.
• Chinese: Chinese does not use spaces between words, so tokenization is
more challenging and requires segmenting characters into meaningful
words or phrases. Tools like Jieba are often used for Chinese tokenization.
• Agglutinative Languages: Languages like Turkish or Finnish are
agglutinative, meaning words are formed by adding prefixes and suffixes to
a root. This structure creates longer words with multiple morphemes (units
of meaning), making word-based tokenization complex. Specialized
tokenizers are needed to split these words into smaller meaningful units.
• In learning NLP, one can experiment with tokenization across different languages to understand how diverse linguistic structures shape the task.
• For instance:
• In English, splitting by space usually works well, but handling contractions (like isn't or we're) requires additional processing.
• In Chinese, students may try segmentation tools like Jieba to see how word segmentation operates without spaces (see the sketch below).
• In Turkish, learners can explore morphological analysis to tokenize words into roots and suffixes, understanding the challenges posed by agglutination.
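A minimal sketch of Chinese word segmentation with the Jieba library (the example sentence and the segmentation shown are illustrative; actual output depends on Jieba's dictionary):

```python
# pip install jieba
import jieba

text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Accurate mode: segment the sentence into the most likely words.
words = jieba.lcut(text)
print(words)  # e.g. ['我', '来到', '北京', '清华大学']
```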
Types of Tokenization in English

• Word Tokenization: The most common form of tokenization in English, where the text is split into individual words based on spaces and punctuation.
• Sub-word Tokenization: Words are split into smaller units, often prefixes, suffixes, or root components. This is useful for handling words that are rare or unknown.
• Character Tokenization: The text is split into individual characters. This is less common but can be useful for certain types of language models or when working with unstructured or misspelled data.
Word Tokenization
• In word tokenization, the text is split into tokens at whitespace and
punctuation boundaries. This is straightforward in English, which uses
spaces to separate words.
• Example:
• Consider the sentence: "NLP is challenging, but exciting!"
• Using word tokenization, this sentence would be split into the following tokens:
• "NLP", "is", "challenging", ",", "but", "exciting", "!"
• Punctuation Handling: In the example, punctuation marks like commas
and exclamation marks are treated as separate tokens. This is because
punctuation can affect the meaning of a sentence and should be preserved
in the analysis.
• Contractions: Word tokenization in English must also handle contractions
like "isn't" or "can't". These can be tokenized into two tokens (e.g., "is" and
"n't") to preserve the meaning of the negative form.
• For instance, the sentence "She isn't here" could be tokenized as:
• "She"
• "is"
• "n't"
• "here"
Sub-word Tokenization
• Sub-word tokenization splits words into smaller meaningful units, such as
prefixes, suffixes, or even smaller units called byte pair encodings (BPE), which
are commonly used in modern NLP models like BERT and GPT. This method is
useful for handling rare words, compound words, or unknown words that a
model hasn't seen before.
• Example:
• Consider the word "unexpectedly". A sub-word tokenizer might split it as follows:
• "un"
• "expect"
• "ed"
• "ly"
• This breakdown allows a model to understand the meaning of each part of the
word and handle similar words (like "expectation" or "unexpected") more
effectively.
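A short sketch of sub-word tokenization using a pre-trained WordPiece tokenizer from the Hugging Face transformers library; the exact split depends on the model's learned vocabulary, so the output shown is only indicative:

```python
# pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece marks word-internal pieces with a "##" prefix.
print(tokenizer.tokenize("unexpectedly"))
# e.g. ['unexpected', '##ly'] -- the actual pieces depend on the vocabulary
```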
Linguistic Analysis of Sub-word Tokenization

• Handling Morphology: English morphology involves a variety of affixes (prefixes and suffixes), which can be captured by sub-word tokenization. For instance, "ly" typically indicates an adverb, and "ed" signals past tense.
• Compound Words: Sub-word tokenization helps break down compound words. For example, "notebook" could be split into "note" and "book".
Character Tokenization
• In character tokenization, each character in a sentence is treated as a token. This method is less common for
standard English texts but can be useful for tasks like character-level language modeling, text generation, or
dealing with unstructured text (e.g., misspellings or noisy text).
• Example:
• The sentence "NLP is fun!" would be tokenized as:
• "N", "L", "P", " ", "i", "s", " ", "f", "u", "n", "!"
Linguistic Analysis of Character Tokenization

• Handling Misspellings: Character tokenization is useful when dealing with text that may contain typos or non-standard spelling. For example, "caaaat" can still be understood as a variation of "cat".
• Morphology: Although character tokenization doesn't directly capture morphemes (like prefixes and suffixes), it enables the model to handle variations in word forms at the character level.
Challenges of Tokenization in English
• Tokenizing English text seems simple because words are separated by spaces, but
several linguistic challenges arise:
• Handling Contractions: As seen earlier, contractions like "I'm" (I am) or "he'll"
(he will) require tokenizers to break them into their component parts to capture
the meaning fully.
• Ambiguity with Hyphenation: Words like "well-being" or "state-of-the-art" can
be difficult to tokenize properly because the hyphen sometimes joins words into
a single concept, while at other times it’s a separator.
• Multi-word Expressions: Phrases like "New York" or "San Francisco" are
semantically single units but consist of multiple words. Advanced tokenization
techniques (such as phrase-based models) are sometimes needed to capture
these correctly.
• Punctuation and Special Symbols: Deciding how to handle punctuation (e.g.,
keeping it attached to a word or separating it) can affect the tokenization process.
For instance, should "hello!" be one token or two?
Tokenization Tools

• Several NLP libraries in Python can perform tokenization effectively:
• NLTK: The Natural Language Toolkit provides basic tokenization methods, such as word_tokenize to split text into words and sent_tokenize to split text into sentences.
• spaCy: spaCy offers fast and robust tokenization with built-in linguistic features like part-of-speech tagging and dependency parsing. It handles punctuation, contractions, and multi-word expressions effectively.
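A brief spaCy tokenization sketch (assumes the small English model en_core_web_sm has been installed):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She isn't here, but we're fine!")

print([token.text for token in doc])
# ['She', 'is', "n't", 'here', ',', 'but', 'we', "'re", 'fine', '!']
```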
Computational Tools for Linguistic Analysis
• 1. Natural Language Toolkit (NLTK): NLTK is a comprehensive tool for
analyzing text data and performing linguistic analysis. It includes
functionalities for tokenization, parsing, and semantic analysis,
making it widely used in academia and industry (Bird et al., 2009).
• 2. spaCy: spaCy is an efficient and user-friendly library for NLP tasks
such as part-of-speech tagging, named entity recognition, and
dependency parsing. It is widely adopted for industrial applications
due to its scalability (Honnibal & Montani, 2017).
• 3. Gensim: Gensim specializes in topic modeling and document
similarity analysis. It uses algorithms like Latent Dirichlet Allocation
(LDA) for extracting topics from large datasets, making it suitable for
linguistic and computational text analysis (Řehůřek & Sojka, 2010).
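A minimal sketch of topic modeling with Gensim's LDA implementation; the toy corpus and the number of topics are illustrative:

```python
# pip install gensim
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each document is a list of preprocessed tokens.
docs = [
    ["language", "grammar", "syntax", "parsing"],
    ["translation", "language", "corpus", "alignment"],
    ["market", "price", "stock", "trading"],
    ["stock", "finance", "price", "investment"],
]

dictionary = corpora.Dictionary(docs)               # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
for topic_id, topic in lda.print_topics():
    print(topic_id, topic)
```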
Key Concepts in NLP for Computational Linguistics

2. Part-of-Speech (POS) Tagging: POS tagging assigns part-of-speech labels (such as nouns, verbs, adjectives) to each token in a sentence. This step is crucial for understanding the syntactic structure of sentences.

Part-of-Speech (POS) tagging is a key task in NLP in which each word in a sentence is assigned a label indicating its part of speech, such as noun, verb, adjective, or adverb. This is crucial for understanding the syntactic structure of sentences and plays a foundational role in NLP applications like parsing, machine translation, text generation, and information retrieval.
• POS tagging helps in identifying the grammatical function of each word,
which is essential for:
• Syntactic Parsing: Understanding the structure of sentences by identifying
which words serve as subjects, objects, and predicates.
• Disambiguation: Words in English often have multiple meanings depending
on context (e.g., "bank" as a noun vs. "bank" as a verb). POS tagging helps
disambiguate by recognizing the word's function in a sentence.
• Semantic Analysis: POS tags help in downstream tasks like named entity
recognition (NER) and coreference resolution, where understanding the
role of each word aids in extracting deeper meaning from the text.
• Improving Text Models: In machine learning models, POS tags serve as
features to improve performance on tasks such as text classification or
sentiment analysis.
• Here are some of the most common POS tags used in English:
• Noun (NN): Represents objects, people, places, etc. (e.g., "dog", "city").
• Verb (VB): Represents actions or states of being (e.g., "run", "is").
• Adjective (JJ): Describes nouns (e.g., "big", "beautiful").
• Adverb (RB): Modifies verbs, adjectives, or other adverbs (e.g., "quickly",
"very").
• Pronoun (PRP): Refers to people or objects without naming them (e.g.,
"he", "they").
• Preposition (IN): Shows relationships between nouns (e.g., "on", "in").
• Conjunction (CC): Connects words, phrases, or clauses (e.g., "and", "but").
• Determiner (DT): Introduces nouns (e.g., "the", "a").
• POS tagging typically uses two main approaches:
• Rule-based Tagging: This approach relies on a predefined set of
grammatical rules. For instance, if a word ends in -ing, it is likely a
verb. However, this method can be limited in handling ambiguity and
variability in natural language.
• Statistical or Machine Learning-based Tagging: Modern POS tagging
often uses machine learning models, which are trained on large
datasets of annotated text. These models analyze the context of
words in a sentence to predict their part-of-speech tags. Algorithms
like Hidden Markov Models (HMMs), Conditional Random Fields
(CRFs), or deep learning models such as Recurrent Neural Networks
(RNNs) are commonly used.
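A minimal POS-tagging sketch with NLTK using Penn Treebank tags (requires one-time data downloads; newer NLTK versions may name the tagger data "averaged_perceptron_tagger_eng"):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import pos_tag, word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ...]
# Exact tags depend on the tagger model.
```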
Key Concepts in NLP for Computational Linguistics

3. Parsing (Syntactic and Dependency): Parsing involves analyzing the grammatical structure of a sentence to determine how words relate to one another.

Parsing in linguistics and NLP refers to the process of analyzing the grammatical structure of a sentence to understand how words relate to one another. Parsing allows us to break down and visualize a sentence's structure, helping us to understand how the different components of the sentence interact. There are two main types of parsing: Constituency Parsing and Dependency Parsing.
• Constituency parsing, also known as phrase structure parsing, breaks a sentence down into sub-phrases or constituents. Each constituent is a group of words that functions as a single unit in the sentence, such as a noun phrase (NP), verb phrase (VP), or prepositional phrase (PP). The goal is to represent the sentence as a hierarchical tree structure, where each node represents a constituent or a word.
• Key Concepts in Constituency Parsing:
• Phrase Structure: The sentence is split into nested phrases, such as
noun phrases (NP) or verb phrases (VP).
• Hierarchical Tree: The result is a constituency tree, where the root
represents the sentence (S), and the branches break down into
smaller constituents.
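A small constituency-parsing sketch with NLTK, using a hand-written toy grammar that covers only this one sentence (purely illustrative):

```python
import nltk

# Toy phrase-structure grammar: S -> NP VP, NP -> DT NN, VP -> VBZ NP.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> DT NN
  VP  -> VBZ NP
  DT  -> 'the'
  NN  -> 'dog' | 'cat'
  VBZ -> 'sees'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog sees the cat".split()):
    tree.pretty_print()  # draws the hierarchical constituency tree
```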
• Dependency parsing focuses on the grammatical relationships between
words in a sentence. Unlike constituency parsing, which groups words into
hierarchical phrases, dependency parsing aims to identify which words are
the "heads" of phrases and which words are dependent on those heads.
This forms a directed graph or dependency tree, where the arrows
(dependencies) point from heads to dependents.
• Key Concepts in Dependency Parsing:
• Head: A word that governs or controls other words in the sentence (usually
the main verb or noun).
• Dependent: A word that is subordinate to the head (e.g., a subject
dependent on a verb, an adjective dependent on a noun).
• Dependency Relations: Each relationship between a head and a dependent
is labeled with a grammatical relation such as subject, object, or modifier.
Linguistic Insights from Constituency Parsing
• Hierarchical Structure: Constituency parsing reflects how words
group together to form larger units (e.g., noun phrases, verb phrases)
that serve specific functions in the sentence.
• Syntactic Relations: It shows the role of each word or phrase within
the sentence, helping to understand how sentences are structured
grammatically.
• Constituency parsing is especially useful for understanding phrase-
based syntax, which focuses on how phrases function within
sentences and how they can be nested to create complex
grammatical structures.
Linguistic Insights from Dependency Parsing
• Grammatical Relations: Dependency parsing provides a clearer
picture of syntactic dependencies, such as which noun is the subject
of a verb or which noun is the object of a preposition.
• Head-Dependent Structure: Unlike constituency parsing, dependency
parsing focuses on binary relations between words (e.g., subject-verb,
verb-object), emphasizing syntactic functions rather than grouping
words into phrases.
• Dependency parsing is particularly useful for tasks that require
understanding word-to-word relationships, like information
extraction, where knowing which word is the subject or object of an
action is critical.
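A minimal dependency-parsing sketch with spaCy (assumes en_core_web_sm is installed; the exact relation labels depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse")

for token in doc:
    # token.dep_ is the relation label; token.head is the governing word.
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")
# e.g. 'cat' --nsubj--> 'chased' and 'mouse' --dobj--> 'chased'
```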
Constituency Parsing vs. Dependency Parsing
Feature           | Constituency Parsing                               | Dependency Parsing
Focus             | Hierarchical structure of phrases and sub-phrases  | Direct grammatical relationships between words
Output Structure  | Parse tree with phrases as nodes                   | Dependency tree with head-dependent pairs
Representation    | Breaks sentences into constituents (phrases)       | Shows head-dependent relationships
Usage             | Understanding phrase structure and syntax          | Understanding grammatical dependencies
Use Cases         | Phrase-based syntactic analysis, phrase extraction | Information extraction, syntactic parsing for semantics
Example Tools     | NLTK, Stanford Parser                              | spaCy, Stanford Dependency Parser, MaltParser
Key Concepts in NLP for Computational Linguistics
4. Named Entity Recognition (NER): NER is the task of identifying
named entities in a text, such as people, organizations, locations,
dates, and more. It is an essential step in information extraction.

• Named Entity Recognition (NER) is a key task in NLP and computational linguistics. It involves identifying and classifying words or phrases in a text that represent named entities, such as persons, organizations, locations, dates, quantities, and other specific categories. NER is a fundamental component in information extraction systems and is crucial for applications including text mining, question answering, and machine translation.
Purpose of NER in Linguistic Analysis
• NER is used to extract structured information from unstructured text
by identifying the most important entities within the text. By
detecting who, what, where, and when, NER facilitates a deeper
linguistic and semantic analysis, allowing for tasks such as:
• Identifying key actors (persons, organizations) in a document.
• Mapping events to specific locations or times.
• Building knowledge graphs that represent relationships between
entities.
• Enhancing search algorithms by identifying entity-related information
in search queries
Key Components of NER
• Entity Identification: The first step is identifying entities—words or phrases
in the text that may represent named entities.
• Entity Classification: Once identified, entities are classified into predefined
categories such as person (PER), organization (ORG), location (LOC),
date/time (DATE), quantities (QUANTITY), monetary values (MONEY), etc.
• Consider the sentence: Barack Obama was born in Hawaii and served as
the 44th President of the United States.
• An NER system would process this sentence and output:
• Barack Obama → PERSON
• Hawaii → LOCATION
• 44th → ORDINAL
• President → TITLE
• United States → LOCATION
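A short NER sketch with spaCy (assumes en_core_web_sm; note that label schemes vary by tool, e.g. spaCy tags countries and states as GPE rather than LOCATION):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii and served as the "
          "44th President of the United States.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Barack Obama -> PERSON, Hawaii -> GPE, 44th -> ORDINAL,
#      the United States -> GPE
```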
Categories of Named Entities
• Here are some of the most commonly used categories in NER:
• PERSON (PER): Names of people, fictional characters, or deities (e.g., Barack
Obama, Shakespeare).
• ORGANIZATION (ORG): Institutions, corporations, government bodies, and
associations (e.g., Google, United Nations).
• LOCATION (LOC): Geographical locations such as cities, countries, landmarks, and
continents (e.g., New York, Amazon River).
• DATE/TIME (DATE): Specific dates, years, and times (e.g., July 4, 1776, 5:30 PM).
• MONEY (MONEY): Amounts of currency (e.g., $50, 200 euros).
• PERCENTAGE (PERCENT): Percent values (e.g., 15%, 100%).
• ORDINAL (ORDINAL): Words representing order or ranking (e.g., 1st, third, 44th).
• MISC (MISCELLANEOUS): Various entities that don’t fit into other categories but
are still significant, such as titles (e.g., President).
Importance of NER in Linguistic Analysis
• Extracting Structured Information: NER helps transform unstructured text
into structured data, which can be used for further linguistic analysis,
building databases, or populating knowledge graphs.
• Disambiguation and Reference Resolution: In many languages, the same
word can refer to different things (e.g., "Apple" as a company vs. "apple"
as a fruit). NER helps disambiguate these terms by placing them into the
correct categories.
• Text Summarization: NER assists in summarizing key entities in large
documents, enabling easier extraction of core information like people
involved in a story, places of interest, or specific dates.
• Cross-Linguistic Studies: In linguistics, NER can be applied across different
languages to study how named entities are treated in various linguistic
contexts. This is particularly valuable for multilingual NLP systems.
Approaches to Named Entity Recognition
• NER systems typically rely on one of three main approaches:
• Rule-Based Systems: These systems use hand-crafted rules and patterns (e.g.,
identifying capitalized words followed by titles like "Mr." or "Dr.") to detect and
classify named entities. While accurate for specific domains, rule-based systems
are hard to generalize.
• Statistical and Machine Learning-Based Models: Modern NER models use
machine learning algorithms such as Hidden Markov Models (HMMs),
Conditional Random Fields (CRFs), and more recently, deep learning models like
Recurrent Neural Networks (RNNs) and Transformers. These models are trained
on large, annotated datasets to recognize patterns in how named entities appear
in context.
• Deep Learning-Based Approaches: State-of-the-art NER systems often use neural
networks trained on large datasets. Models like BERT (Bidirectional Encoder
Representations from Transformers) are highly effective because they understand
the contextual meaning of words and sentences, improving accuracy for entity
recognition.
Examples of NER Applications in Linguistic Studies
• Historical Text Analysis: NER has been used to analyze historical texts, identifying
key figures, events, and locations. For example, in the study of Shakespearean
texts, NER can be applied to detect characters (persons), places (locations), and
temporal references (dates) to better understand historical context.
• Political Discourse Analysis: In political science and sociolinguistics, NER is
employed to extract mentions of political figures, organizations, and countries in
speeches, news articles, or social media. This helps researchers track the
frequency and sentiment associated with specific entities (e.g., mentions of
"United Nations" or "European Union").
• Social Media Analysis: NER is often used in Twitter or Facebook posts to identify
entities like companies, celebrities, or events. Linguists use this data to study
language trends, public discourse, or brand perception over time.
• Legal Document Processing: In legal linguistics, NER is used to extract relevant
parties, laws, dates, and jurisdictions from legal documents, helping legal
professionals quickly locate key information.
Studies and Research in NER
• NER for Biomedical Text: NER has been widely studied in the
biomedical domain, where identifying entities such as disease names,
genes, and drug names is essential for text mining in medical
literature. One such system is BioNER, which is trained on biomedical
corpora to extract medical entities.
• Cross-Language NER: A study by Ehrmann et al. (2011)
demonstrated NER in multilingual corpora, focusing on how NER can
be adapted for languages with fewer resources by leveraging cross-
lingual transfer learning techniques.
• NER in Legal and Financial Documents: Systems like FinNER specialize
in extracting named entities from financial texts, identifying entities
like company names, monetary values, and financial instruments,
facilitating automated analysis of financial reports.
Key Concepts in NLP for Computational Linguistics
5. Machine Translation: Machine Translation (MT) is the automatic
translation of text from one language to another using computational
models. It is a crucial subfield of Natural Language Processing (NLP)
and computational linguistics that has been revolutionized by the
advent of deep learning techniques, especially neural networks,
sequence-to-sequence models, and transformer architectures (such as
BERT and GPT-4).
• Tools: Google Translate API, OpenNMT, MarianMT.
Importance of Machine Translation in Linguistic Analysis
• Machine translation plays a central role in cross-lingual
communication, allowing for automatic translation of large amounts
of text in real time. For linguists, MT offers computational tools for
analyzing:
• Translation equivalence across languages.
• Syntactic and semantic differences between languages.
• Morphological variations and idiomatic expressions in multilingual
contexts.
• MT systems are also applied in multilingual text processing, global
content dissemination, and linguistic research, where researchers
investigate how well machine translation handles the intricacies of
different languages.
Evolution of Machine Translation
• Rule-Based Machine Translation (RBMT): Early MT systems relied on rule-
based approaches, where linguists manually encoded grammatical and
syntactic rules to translate between languages. While this approach
provided linguistic transparency, it was difficult to scale due to the
complexity and variation in natural language.
• Statistical Machine Translation (SMT): SMT emerged in the 1990s, focusing
on statistical models that used large parallel corpora (text aligned in two
languages) to learn how words and phrases in one language corresponded
to those in another. SMT improved translation quality but struggled with
long-distance dependencies and complex linguistic structures.
• Neural Machine Translation (NMT): In recent years, Neural Machine
Translation (NMT) has become the dominant paradigm. NMT models,
particularly sequence-to-sequence models and transformers, have
drastically improved translation quality by leveraging deep learning
techniques and contextual representations.
Key Concepts in Machine Translation
• Sequence-to-Sequence Models: These are the backbone of modern
MT. A sequence-to-sequence (Seq2Seq) model consists of two
components:
• Encoder: Reads and encodes the input sentence (in the source language) into
a continuous vector representation.
• Decoder: Generates the output sentence (in the target language) from this
encoded vector.
• The encoder-decoder model works by compressing the information in
the source sentence and then translating it into the target language.
This model architecture can handle variable-length input and output
sentences, making it highly suitable for translation tasks.
• Attention Mechanism: A major advancement in NMT came with the
introduction of the attention mechanism. Attention allows the model
to focus on different parts of the input sentence when generating
each word in the output sentence. This mechanism helps deal with
the challenges of long sentences and improves translation accuracy
by ensuring that important words and phrases are translated with
appropriate context.
• Transformer Model: The transformer architecture, introduced by
Vaswani et al. (2017), replaced the need for recurrent neural
networks (RNNs) in MT. Transformers use self-attention mechanisms
to process entire sentences in parallel, making them more efficient
and capable of handling long-range dependencies. Transformers are
the basis for state-of-the-art models like GPT-4, BERT, T5, and
mBART.
Neural Machine Translation Process
• Input: The source sentence is tokenized and converted into a
numerical format that the model can process.
• Encoding: The encoder reads the input sequence and converts it into
a fixed-size vector (or a series of vectors, as in transformers).
• Attention: The attention mechanism helps the model focus on
relevant parts of the input during the translation process.
• Decoding: The decoder generates the translation, word by word,
using the encoded input and attention weights.
• Output: The model outputs a translated sentence, which is then
converted back into words using a reverse tokenization process.
Example of Transformer-Based Machine Translation

• Consider the translation of the sentence:
• "The weather is nice today." (English)
• Translated to French:
• "Le temps est agréable aujourd'hui."

• A transformer model processes this sentence by first encoding the input (English) and then decoding it into the target language (French), focusing on relevant words during translation using the attention mechanism. The word "weather" is translated to "temps", and "nice" is translated to "agréable", showing how the model maps words between the languages.
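A hedged sketch of the same translation with a pre-trained MarianMT model from the Hugging Face hub (the model name Helsinki-NLP/opus-mt-en-fr refers to a publicly available English-to-French model; the generated French may differ slightly from the example above):

```python
# pip install transformers sentencepiece
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # pre-trained English->French model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer(["The weather is nice today."], return_tensors="pt", padding=True)
output_ids = model.generate(**inputs)  # encoder-decoder generation with attention
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# e.g. "Le temps est agréable aujourd'hui."
```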
Models and Tools for Machine Translation
• Google Translate: One of the most widely used MT systems, powered by
neural networks, particularly transformers. It supports over 100 languages
and has integrated zero-shot learning capabilities (translating between
languages it hasn’t directly been trained on).
• OpenNMT: An open-source neural machine translation framework that
allows users to train and deploy their own NMT models.
• Fairseq: Developed by Facebook AI Research, Fairseq supports various
NMT models, including transformers, and provides state-of-the-art
performance for multilingual translation tasks.
• Marian NMT: A fast, efficient NMT framework designed for production-
scale translation tasks. It supports multi-GPU training and provides various
pre-trained models.
• mBART: A multilingual version of BART, fine-tuned for translation tasks.
mBART is particularly useful for translating between multiple languages.
Studies and Research in Machine Translation
• Several studies have explored the performance and capabilities of machine
translation models:
• Transformer Models for Multilingual Translation: The transformer architecture has been extensively studied for multilingual MT. Research by Liu et al. (2020) introduced mBART, a pre-trained transformer model for multilingual tasks that supports translation between multiple language pairs. mBART demonstrated significant improvements over traditional NMT models, especially in low-resource language translation.
• Handling Low-Resource Languages: A key challenge in MT is low-resource
languages that lack large parallel corpora. Research in this area includes
unsupervised translation and transfer learning techniques, where models
trained on high-resource languages (like English) are adapted to translate
between low-resource languages (like Swahili or Urdu). One such approach
is mT5, which fine-tunes a pre-trained model across multiple languages.
• Zero-Shot Translation: One exciting area of research is zero-shot
translation, where models translate between language pairs they have not
been explicitly trained on. This capability is critical for reducing the need
for massive multilingual datasets. The work by Johnson et al. (2017) on
Google’s Multilingual NMT showed that a single model could learn to
translate between multiple languages, and even handle language pairs
where no direct translation data existed.
• Evaluating Translation Quality: Several studies evaluate MT quality using
metrics like BLEU (Bilingual Evaluation Understudy), TER (Translation Error
Rate), and METEOR. These metrics compare the machine-generated
translation to human translations to measure accuracy. However, MT
evaluation is challenging, as these metrics may not fully capture the
nuances of fluency and adequacy in translation.
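A tiny illustration of BLEU scoring with NLTK (the reference and candidate sentences are made up; real evaluation uses corpus-level BLEU over many sentence pairs):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["le", "temps", "est", "agréable", "aujourd'hui"]]
candidate = ["le", "temps", "est", "beau", "aujourd'hui"]

# Smoothing avoids zero scores when some n-gram orders have no matches.
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smooth))
```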
Challenges in Machine Translation
• Ambiguity and Context: Natural languages are inherently ambiguous,
and words or phrases can have multiple meanings. Understanding
context is critical for disambiguation. For example, the English word
"bank" can mean a financial institution or the side of a river, and its
translation depends on context.
• Idiomatic Expressions: Machine translation often struggles with
idioms and colloquial expressions that do not have direct
translations. For instance, translating "kick the bucket" literally into
another language may not convey the intended meaning of "to die."
• Low-Resource Languages: Many languages have limited parallel data
available for training machine translation models, making it difficult
to build effective systems. Transfer learning and unsupervised
learning are being researched to address these challenges.
• Morphologically Rich Languages: Languages like Finnish or Turkish,
which have complex morphological structures, pose challenges for
MT. Words in these languages can contain multiple morphemes (units
of meaning), and translating them into languages with less complex
morphology requires sophisticated handling of grammatical structure.
• Cultural Nuances: Machine translation systems often miss cultural
and regional nuances in language. For example, in translating
between Chinese and English, cultural contexts behind certain
phrases or expressions may not be adequately conveyed by literal
translation.
Applications of Machine Translation
• Cross-Lingual Communication: MT is widely used for enabling real-time
communication between people who speak different languages. Applications
include chat translation (e.g., in social media or customer service) and subtitling
for videos.
• Document Translation: Machine translation is employed to translate official
documents, contracts, and reports in various sectors, including government, law,
and commerce.
• Multilingual Content: Companies use MT to localize their websites, software, and
marketing materials across different regions, helping them reach global
audiences.
• Academic Research: MT is used by researchers to access scientific papers and
materials written in languages they may not be fluent in. Automated translation
aids in knowledge dissemination across linguistic barriers.
• Language Learning: MT models are used in language learning applications to help
students understand sentences, translate texts, and practice grammar.
• Machine Translation (MT) has evolved from rule-based systems to
sophisticated neural models powered by sequence-to-sequence
architectures and transformers. These advancements have
significantly improved translation accuracy and fluency, making MT an
indispensable tool in linguistic analysis and real-world applications.
While challenges like idiomatic expressions and low-resource
languages remain, ongoing research in unsupervised learning and
contextual understanding promises to address these limitations. By
enabling cross-lingual communication and bridging linguistic divides,
MT continues to play a transformative role in linguistics and beyond.
Key Concepts in NLP for Computational Linguistics
6. Sentiment Analysis: Sentiment analysis is a subtask of text classification
where the goal is to classify text into categories such as positive, negative, or
neutral. This task is commonly used in social media analysis, product
reviews, and market research.
• Sentiment Analysis (SA), also known as opinion mining, is a subfield of
text classification that involves identifying and categorizing opinions
expressed in a piece of text. The primary goal is to classify text into
sentiment categories such as positive, negative, or neutral. In some
applications, finer-grained sentiment categories or emotion detection
(e.g., anger, joy, sadness) may also be used.
• Sentiment analysis has become a powerful tool for linguistic research and
computational linguistics. By analyzing how opinions, attitudes, and
emotions are expressed in text, researchers can gain insights into language
use, social trends, and cultural differences.
Importance of Sentiment Analysis in Linguistic Research
• Sentiment analysis is used in linguistic research to explore:
• Language and Emotion: Understanding how emotions are conveyed
through words, phrases, and grammar.
• Pragmatics and Context: Analyzing how sentiment is expressed
depending on context, tone, and cultural nuances.
• Sociolinguistics: Investigating public sentiment on social issues,
political discourse, or global events.
• Semantic and Lexical Analysis: Studying how specific words, idioms,
or constructions contribute to sentiment.
• Comparative Linguistics: Exploring sentiment expression across
different languages or dialects.
Levels of Sentiment Analysis
• Sentiment analysis can be performed at various levels, depending on the
granularity required:
• Document-Level Analysis:
• Focuses on the overall sentiment of an entire document.
• Example: Classifying a full product review as positive or negative.
• Sentence-Level Analysis:
• Examines the sentiment of individual sentences.
• Example: "I love the design, but the battery life is terrible." → Positive (first clause),
Negative (second clause).
• Aspect-Level Analysis:
• Breaks down sentiment by specific aspects or features mentioned in the text.
• Example: "The camera is amazing, but the phone is too slow." → Positive for
"camera," Negative for "phone."
• Emotion-Based Analysis:
• Goes beyond polarity to detect specific emotions such as joy, anger, or surprise.
• Example: "I’m thrilled with this purchase!" → Emotion: Joy.
How Sentiment Analysis Works
• Sentiment analysis typically involves the following steps:
• Data Preprocessing:
• Tokenization: Splitting text into words or sentences.
• Stopword Removal: Removing irrelevant words (e.g., "and," "is").
• Lemmatization/Stemming: Reducing words to their root forms.
• Feature Extraction:
• Identifying sentiment-indicative features such as keywords, emoticons, punctuation, or n-grams.
• TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec,
GloVe) are commonly used.
• Sentiment Classification:
• Rule-Based Approaches: Using sentiment lexicons (e.g., SentiWordNet) to assign polarity scores to
words.
• Machine Learning Models: Training classifiers (e.g., Logistic Regression, SVM, Naïve Bayes) to
predict sentiment.
• Deep Learning Models: Leveraging neural networks like LSTMs, GRUs, or Transformers for more
context-aware sentiment analysis.
• Post-Processing:
• Aggregating results (e.g., aspect-level sentiment to document-level sentiment).
• Visualizing sentiment trends (e.g., sentiment over time or across different topics).
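A compact sketch of the machine-learning route using scikit-learn, with TF-IDF features feeding a logistic regression classifier (the four-example training set is purely illustrative; real systems train on thousands of labeled texts):

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = positive, 0 = negative.
texts = ["I love this phone", "Great battery and screen",
         "Terrible customer service", "The plot was boring and slow"]
labels = [1, 1, 0, 0]

# TF-IDF features -> logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["The screen is great"]))  # likely [1]
print(clf.predict(["Slow and boring"]))      # likely [0]
```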
Example of Sentiment Analysis

Consider the sentence:
"The movie had stunning visuals, but the plot was predictable."

Steps:
1. Tokenization:
• Tokens: [The, movie, had, stunning, visuals, ",", but, the, plot, was, predictable, "."]
2. Sentiment Lexicon (Rule-Based):
• "stunning": Positive (+2)
• "predictable": Negative (-1)
3. Aspect-Level Analysis:
• Visuals: Positive
• Plot: Negative
4. Overall Sentiment:
• Mixed or Neutral.
Tools for Sentiment Analysis
• Several tools and libraries provide capabilities for sentiment analysis:
• 1. Python Libraries
• NLTK (Natural Language Toolkit): Provides basic sentiment analysis tools
like VADER (Valence Aware Dictionary and sEntiment Reasoner) for social
media text.
• TextBlob: Simple sentiment analysis based on polarity and subjectivity.
• 2. Deep Learning Frameworks
• Transformers (Hugging Face): Pre-trained models like BERT, RoBERTa, and
GPT can be fine-tuned for sentiment analysis.
• 3. Online Tools
• Google Cloud Natural Language API: Provides sentiment analysis with
advanced features like entity sentiment.
• Microsoft Azure Text Analytics: Offers sentiment detection with REST API
integration.
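A quick rule-based example with NLTK's VADER (requires a one-time download of the VADER lexicon; the compound score ranges from -1, most negative, to +1, most positive):

```python
import nltk
nltk.download("vader_lexicon", quiet=True)

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The movie had stunning visuals, but the plot was predictable.")
print(scores)  # a dict of neg/neu/pos proportions plus an overall 'compound' score
```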
Applications in Linguistic Research
• Social Media Analysis: Analyzing public sentiment on platforms like Twitter or
Reddit during significant events (e.g., elections, pandemics).
• Example: Tracking sentiment during COVID-19 to study public fear or optimism.
• Product Review Analysis: Linguists can study language trends in reviews to
understand how consumers express satisfaction or dissatisfaction.
• Example: Analyzing Amazon reviews to identify patterns in sentiment across product
categories.
• Political Discourse: Investigating how sentiment is expressed in speeches,
debates, or media coverage of politicians.
• Study: Sentiment trends in U.S. presidential debates.
• Cross-Linguistic Sentiment Studies: Comparing sentiment expression in different
languages or cultural contexts.
• Example: A study showing that Japanese reviews often use indirect expressions of sentiment,
whereas English reviews tend to be more direct.
• Historical Text Analysis: Analyzing the sentiment in historical documents to
understand public mood during significant events.
• Example: Sentiment trends in newspapers during World War II.
Challenges in Sentiment Analysis
• Ambiguity: Sarcasm and irony are challenging for sentiment analysis
systems to detect accurately.
• Example: "Oh great, another delay!" → Negative sentiment despite the word "great."
• Context Dependence: The meaning of words often depends on context. For example, "light" is positive when praising a laptop's weight but can be negative when describing a film as lacking substance.
• Domain-Specific Vocabulary: Sentiment words can vary by domain. For
instance, "cold" might be negative in a movie review but neutral in a
weather report.
• Multilingual Sentiment Analysis: Challenges arise in detecting sentiment in
languages with complex morphology or idiomatic expressions.
• Mixed Sentiment: Texts often contain both positive and negative
sentiments, making classification difficult.
• Example: "The food was amazing, but the service was terrible."
Studies in Sentiment Analysis
• Public Sentiment on COVID-19:
• Study: Researchers analyzed Twitter data during the pandemic to track sentiment
trends in real time. Findings revealed increased negative sentiment during lockdowns
but a rise in positive sentiment following vaccination news.
• Sentiment in Literature:
• Study: Analyzing sentiment in novels to understand emotional arcs of characters. For
instance, the Hedonometer project tracks sentiment in classic literature like Pride
and Prejudice.
• Cross-Language Sentiment Analysis:
• Study by Chen et al. (2020) explored sentiment in English and Chinese movie
reviews, revealing differences in sentiment expression patterns and cultural nuances.
• Political Sentiment Analysis:
• Study: Jaidka et al. (2019) analyzed sentiment in U.S. presidential speeches to track
changing emotional tones over decades.
• Sentiment Analysis is a powerful tool in linguistic research that
allows researchers to extract, quantify, and analyze emotions and
opinions from text.
• By combining computational techniques with linguistic insights,
sentiment analysis can reveal patterns of language use, cultural
variations, and social trends. While challenges like sarcasm detection
and multilingual analysis persist, advancements in deep learning and
transformer-based models are enabling more nuanced and accurate
sentiment detection.
• This makes sentiment analysis an essential tool for understanding
human expression in digital and linguistic landscapes.
• THANK YOU FOR YOUR PATIENCE!

• ANY QUESTIONS?
