12.1. NLP Intro

Natural Language Processing
Part 1
Dr. Oybek Eraliev
Department of Computer Engineering
Inha University in Tashkent
Email: [email protected]
What is NLP?

• Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate in human language.


NLP Tasks

• Named entity recognition (NER)
• Part-of-speech tagging (POS)
• Speech recognition
• Sentiment analysis
• Natural language generation (NLG)
• Language translation
• Text classification
• Text summarization
NLP Applications

• NLP makes it easier for humans to communicate and collaborate with machines by allowing them to do so in the natural human language they use every day.
• This offers benefits across many industries and applications:
  • Automation of repetitive tasks
  • Improved data analysis and insights
  • Enhanced search
  • Content generation


NLP Approaches

NLP approaches fall into three categories: rules- and heuristics-based NLP, statistical NLP, and deep learning NLP.


NLP Approaches

Rules-based NLP

• The earliest NLP applications were simple if-then decision trees that required preprogrammed rules (see the toy example below).
• They are only able to provide answers in response to specific prompts.
• Because there is no machine learning or AI capability in rules-based NLP, this approach is highly limited and does not scale.
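As a toy illustration of the rules-based idea (a sketch with hypothetical hand-written rules, not taken from the slides), a keyword matcher can only answer the prompts it was explicitly programmed for:

```python
import re

# Hypothetical hand-written rules: each pattern maps to a canned response.
RULES = [
    (re.compile(r"\b(hi|hello)\b", re.I), "Hello! How can I help you?"),
    (re.compile(r"\bopening hours\b", re.I), "We are open 9:00-18:00, Monday to Friday."),
    (re.compile(r"\brefund\b", re.I), "Refunds are processed within 5 business days."),
]

def rules_based_reply(text: str) -> str:
    """Return the response of the first matching rule, or a fallback."""
    for pattern, response in RULES:
        if pattern.search(text):
            return response
    return "Sorry, I don't understand."  # anything outside the rules fails

print(rules_based_reply("Hi, what are your opening hours?"))  # the first rule that matches wins
print(rules_based_reply("Can you summarize this article?"))   # falls through: not scalable
```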


NLP Approaches

Statistical NLP

• Statistical NLP automatically extracts, classifies, and labels elements of text and voice data and then assigns a statistical likelihood to each possible meaning of those elements. This relies on machine learning, enabling a sophisticated breakdown of linguistics such as part-of-speech tagging.
• Statistical NLP introduced the essential technique of mapping language elements, such as words and grammatical rules, to a vector representation so that language can be modeled using mathematical (statistical) methods, including regression or Markov models.
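One concrete way to read "statistical methods, including ... Markov models" is a bigram language model; the toy sketch below (the corpus is made up) estimates the probability of the next word from co-occurrence counts:

```python
from collections import Counter, defaultdict

# Toy corpus; in practice these counts come from a large collection of text.
corpus = ["i love nlp", "i love statistics", "i am learning nlp"]

bigram_counts = defaultdict(Counter)
unigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1
        unigram_counts[prev] += 1

def p_next(prev: str, curr: str) -> float:
    """Maximum-likelihood estimate P(curr | prev) = count(prev curr) / count(prev)."""
    return bigram_counts[prev][curr] / unigram_counts[prev]

print(p_next("i", "love"))    # 2/3: "love" follows "i" in two of the three sentences
print(p_next("love", "nlp"))  # 1/2
```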


NLP Approaches

Deep learning NLP

• Deep learning models have become the dominant mode of NLP by using huge volumes of raw, unstructured data, both text and voice, to become ever more accurate.
• Deep learning can be viewed as a further evolution of statistical NLP, with the difference that it uses neural network models.
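A minimal sketch of the neural-network flavor, assuming PyTorch and made-up sizes: token IDs are embedded, averaged, and scored by a linear layer, with every weight learned from data rather than written by hand.

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Embed token IDs, average them, and score two classes."""
    def __init__(self, vocab_size=1000, embed_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        vectors = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        pooled = vectors.mean(dim=1)         # crude "bag of embeddings"
        return self.classifier(pooled)       # (batch, num_classes) logits

model = TinyTextClassifier()
fake_batch = torch.randint(0, 1000, (4, 6))  # 4 sentences of 6 token IDs each
print(model(fake_batch).shape)               # torch.Size([4, 2])
```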


ML & DL-based NLP Flowchart

Data Acquisition → Text Extraction & Cleanup → Text Preprocessing → Feature Engineering → Model Building → Evaluation → Deployment → Monitoring & Update


Text Cleaning and Preprocessing

• Tokenization
• Stop word removal
• Stemming and lemmatization
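Tokenization is listed above but not expanded on in these slides; as a small sketch, it splits raw text into word-level tokens before any further cleaning:

```python
import re

text = "This is an example of a sentence that has stop words."

# A simple regex tokenizer: words and punctuation become separate tokens.
# (In practice, libraries such as NLTK or spaCy provide more robust tokenizers.)
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['This', 'is', 'an', 'example', 'of', 'a', 'sentence', 'that', 'has', 'stop', 'words', '.']
```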




Text Cleaning and Preprocessing
Stop word removal

Definition: Stop words are common words in a language (e.g., "is", "the", "and", "of") that are often removed during text processing because they provide little meaningful information for many NLP tasks, such as classification or generalization. Removing stop words helps reduce noise and focus on the main content.

Example:

Input: "This is an example of a sentence that has stop words."

Output: "example sentence stop words"
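A small sketch of the example above, assuming NLTK and its English stop-word list:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the stop-word lists on first use
stop_words = set(stopwords.words("english"))

text = "This is an example of a sentence that has stop words."
tokens = re.findall(r"[a-z]+", text.lower())       # simple lowercase word tokenizer
kept = [t for t in tokens if t not in stop_words]  # drop "this", "is", "an", ...
print(" ".join(kept))                              # "example sentence stop words"
```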


Text Cleaning and Preprocessing
Stemming

Definition: Stemming is the process of reducing a word to its root or basic form (stem) by cutting off suffixes, often without understanding its meaning. It is a rule-based, crude approach that can produce words that are not part of the dictionary.

Example:

Input: "running", "runner", "ran"

Output: "run", "runner", "ran"
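The same example with NLTK's PorterStemmer (assumed here; any rule-based stemmer behaves similarly):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runner", "ran", "easily"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# runner -> runner
# ran -> ran
# easily -> easili   (not a dictionary word, as noted above)
```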


Text Cleaning and Preprocessing
Lemmatization

Definition: Lemmatization reduces words to their base or dictionary form (lemma), taking into account the meaning and part of speech of the word. It ensures that the resulting word is correct and meaningful.

Example:

Input: "running", "better", "feet"

Output: "run", "good", "foot"
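The same example with NLTK's WordNetLemmatizer (assumed here); note that the part of speech has to be supplied to get "better" mapped to "good":

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # some NLTK versions also need "omw-1.4"

lemmatizer = WordNetLemmatizer()
# WordNet needs the part of speech: v = verb, a = adjective, n = noun (the default).
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("feet", pos="n"))     # foot
```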


Text Cleaning and Preprocessing
Comparison

Feature            | Stemming                                 | Lemmatization
-------------------|------------------------------------------|--------------------------------------
Approach           | Rule-based (cuts suffixes/prefixes)      | Dictionary-based (meaning-aware)
Output             | May not be a valid word (e.g., "easili") | Always a valid word (e.g., "easily")
Speed              | Faster due to simple rules               | Slower due to more complexity
Example: "running" | "run"                                    | "run"
Example: "better"  | "better"                                 | "good"


Basic Encoding Techniques

One-Hot Encoding

• Represents each word as a binary vector of length equal to the vocabulary size.
• Each vector has a single 1 at the index corresponding to the word and 0 elsewhere.
• Pros: Simple and easy to implement.
• Cons: Results in very sparse and high-dimensional vectors.


Basic Encoding Techniques

One-Hot Encoding (Example)

Vocabulary: ["cat", "dog", "fish", "bird"]

Step 1: Assign an index to each word
• cat → 0
• dog → 1
• fish → 2
• bird → 3

Step 2: Create one-hot vectors
Each word is represented as a binary vector of length equal to the vocabulary size (4 in this case). Only the index corresponding to the word is 1, and all other indices are 0.
• cat → [1, 0, 0, 0]
• dog → [0, 1, 0, 0]
• fish → [0, 0, 1, 0]
• bird → [0, 0, 0, 1]
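A short sketch that reproduces the example above with NumPy (assumed):

```python
import numpy as np

vocabulary = ["cat", "dog", "fish", "bird"]
index = {word: i for i, word in enumerate(vocabulary)}  # Step 1: word -> index

def one_hot(word):
    vector = np.zeros(len(vocabulary), dtype=int)  # Step 2: all zeros ...
    vector[index[word]] = 1                        # ... except the word's own index
    return vector

for word in vocabulary:
    print(word, "->", one_hot(word).tolist())
# cat -> [1, 0, 0, 0], dog -> [0, 1, 0, 0], fish -> [0, 0, 1, 0], bird -> [0, 0, 0, 1]
```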


Basic Encoding Techniques

Bag of Words (BoW)

• Represents text as a vector of word counts or frequencies.
• Ignores word order, treating text as a "bag" of words.
• Pros: Simple and effective for smaller datasets.
• Cons: Loses information about word order and context.


Basic Encoding Techniques

Bag of Words (Example)

Sentence 1: "Hi, I love NLP"
Sentence 2: "I am learning NLP now"

Step 1: Build the Vocabulary
• Create a unique list of all words in the dataset (sentences).
• Vocabulary: ["Hi", "I", "love", "NLP", "am", "learning", "now"]

Step 2: Assign an Index to Each Word
• Hi → 0
• I → 1
• love → 2
• NLP → 3
• am → 4
• learning → 5
• now → 6


Basic Encoding Techniques

Step 3: Represent Each Sentence as a Vector
For each sentence, count the occurrences of each word from the vocabulary.

Sentence 1: "Hi, I love NLP"
Vector: [1, 1, 1, 1, 0, 0, 0]

Sentence 2: "I am learning NLP now"
Vector: [0, 1, 0, 1, 1, 1, 1]
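The same two sentences with scikit-learn's CountVectorizer (assumed; get_feature_names_out requires scikit-learn 1.0 or newer). It lowercases and orders the vocabulary alphabetically, so the columns differ from the slide's ordering while the counts stay the same:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Hi, I love NLP", "I am learning NLP now"]

# The default token pattern drops one-letter tokens such as "I"; widen it to keep them.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['am' 'hi' 'i' 'learning' 'love' 'nlp' 'now']
print(bow.toarray())
# [[0 1 1 0 1 1 0]
#  [1 0 1 1 0 1 1]]
```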


Basic Encoding Techniques

TF-IDF (Term Frequency-Inverse Document Frequency)

• Extends BoW by weighting words based on their importance in a document relative to the entire corpus.
• Highlights rare but significant words.
• Pros: Reduces the impact of common but uninformative words.
• Cons: Still loses context and word order.


Basic Encoding Techniques

TF-IDF (Example)

Document 1: "Hi, I love NLP"
Document 2: "I am learning NLP now"

Step 1: Build the Vocabulary
• Create a unique list of all words in the dataset (documents).
• Vocabulary: ["Hi", "I", "love", "NLP", "am", "learning", "now"]


Basic Encoding Techniques

TF-IDF (Example)

Step 2: Calculate Term Frequency (TF)
TF measures how often a word appears in a document. For simplicity, it is computed as:

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)


Basic Encoding Techniques

TF-IDF (Example)
Step 2: Calculate Term Frequency (TF)

Word     | TF in Document 1 ("Hi, I love NLP") | TF in Document 2 ("I am learning NLP now")
---------|-------------------------------------|-------------------------------------------
Hi       | 1/4 = 0.25                          | 0/5 = 0.0
I        | 1/4 = 0.25                          | 1/5 = 0.2
love     | 1/4 = 0.25                          | 0/5 = 0.0
NLP      | 1/4 = 0.25                          | 1/5 = 0.2
am       | 0/4 = 0.0                           | 1/5 = 0.2
learning | 0/4 = 0.0                           | 1/5 = 0.2
now      | 0/4 = 0.0                           | 1/5 = 0.2


Basic Encoding Techniques

TF-IDF (Example)

Step 3: Calculate Inverse Document Frequency (IDF)
IDF reduces the weight of common words that appear in many documents. It is calculated as:

IDF(t) = ln( (total number of documents) / (number of documents containing t) )


Basic Encoding Techniques

TF-IDF (Example)
Step 3: Calculate Inverse Document Frequency (IDF)

Word     | Documents Containing Word | IDF Value
---------|---------------------------|----------------
Hi       | 1                         | ln(2/1) = 0.69
I        | 2                         | ln(2/2) = 0.0
love     | 1                         | ln(2/1) = 0.69
NLP      | 2                         | ln(2/2) = 0.0
am       | 1                         | ln(2/1) = 0.69
learning | 1                         | ln(2/1) = 0.69
now      | 1                         | ln(2/1) = 0.69


Basic Encoding Techniques

TF-IDF (Example)

Step 4: Calculate TF-IDF
Multiply the TF values by the corresponding IDF values.

Document 1: "Hi, I love NLP"
TF-IDF = [0.25 × 0.69, 0.25 × 0, 0.25 × 0.69, 0.25 × 0, 0 × 0.69, 0 × 0.69, 0 × 0.69]
TF-IDF = [0.1725, 0, 0.1725, 0, 0, 0, 0]

Document 2: "I am learning NLP now"
TF-IDF = [0 × 0.69, 0.2 × 0, 0 × 0.69, 0.2 × 0, 0.2 × 0.69, 0.2 × 0.69, 0.2 × 0.69]
TF-IDF = [0, 0, 0, 0, 0.138, 0.138, 0.138]
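A short sketch that reproduces the worked example with the same unsmoothed formulas (plain Python, assumed); library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF and normalization, so their numbers differ:

```python
import math

documents = [["Hi", "I", "love", "NLP"],
             ["I", "am", "learning", "NLP", "now"]]
vocabulary = ["Hi", "I", "love", "NLP", "am", "learning", "now"]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

for doc in documents:
    vector = [round(tf(t, doc) * idf(t, documents), 4) for t in vocabulary]
    print(vector)
# [0.1733, 0.0, 0.1733, 0.0, 0.0, 0.0, 0.0]   (ln 2 ≈ 0.693, hence slightly above the rounded 0.1725)
# [0.0, 0.0, 0.0, 0.0, 0.1386, 0.1386, 0.1386]
```
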
Distributed Word Representations

These techniques capture semantic meaning and context.

Word Embeddings (e.g., Word2Vec, GloVe, FastText)

• Represent words as dense, fixed-length vectors learned from large corpora.
• Words with similar meanings have similar vector representations.
• Pros: Captures semantic similarity and can be pre-trained.
• Cons: Static embeddings don't account for polysemy.




Distributed Word Representations

• Example: "King - Man + Woman = Queen"

• Step 1: Vectors for Each Word
  Using the Word2Vec model, you retrieve the dense vectors for the words "King", "Man", "Woman", and "Queen". Each vector is a 300-dimensional numerical representation.

• Step 2: Arithmetic on Vectors
  Word embeddings encode semantic information. For instance:
  "King" represents "a male monarch."
  "Man" represents "male."
  "Woman" represents "female."

• When you compute King - Man + Woman, it effectively removes the "male" concept from "King" and adds the "female" concept, resulting in a vector similar to "Queen".
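A hedged sketch of this arithmetic using gensim's downloader and the pretrained "word2vec-google-news-300" vectors (assumed available; it is a large one-time download):

```python
import gensim.downloader as api

# Pretrained 300-dimensional Word2Vec vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

# king - man + woman: positive terms are added, negative terms subtracted.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)] -- "queen" is the nearest word to the resulting vector
```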


Contextualized Word Representations

These are dynamic and depend on the context in which a word appears.

Transformer-based Models (e.g., BERT, GPT, RoBERTa, T5)

• Use attention mechanisms to generate contextualized embeddings for each word in a sentence.
• Pros: State-of-the-art performance on many NLP tasks.
• Cons: High computational cost and requires significant training resources.
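A sketch of what "contextualized" means in practice, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint: the same word receives different vectors in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector BERT assigns to `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]                 # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # find the word's position
    return hidden[tokens.index(word)]

bank_river = embedding_of("he sat on the bank of the river", "bank")
bank_money = embedding_of("she went to the bank to get money", "bank")
# The same word gets a different vector depending on its context.
print(torch.cosine_similarity(bank_river, bank_money, dim=0))  # noticeably below 1.0
```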




Transformers

• The figure shows the Transformer model architecture.
• Most competitive neural sequence transduction models have an encoder-decoder structure.
• Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn).
• Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
Transformers

• The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of the figure, respectively.


Transformers
Encoder

• Sequence models often involve encoder-decoder architectures, especially in tasks like machine translation, text summarization, or speech recognition.
• The encoder processes the input sequence and converts it into a fixed-size representation (context vector) that captures the input's essential information.


Transformers
Encoder

• The input sequence (e.g., a sentence or time-series data) is tokenized and converted into numerical representations (e.g., word embeddings).
• The encoder processes these embeddings step by step (token by token or timestep by timestep) using a sequence model (RNN, LSTM, GRU, or a Transformer encoder for attention-based models).
• The processing involves:
  • Capturing semantic relationships.
  • Understanding temporal or sequential dependencies.


Transformers
Encoder

• The final hidden state (or multiple hidden states in some cases) serves as a context vector.
• This vector summarizes the input sequence's meaning, enabling further processing by the decoder.


Transformers
Decoder

• The decoder generates the output sequence step by step by using the context vector from the encoder.
• The decoder receives the context vector from the encoder as input. This acts as the "knowledge" needed to start generating the output sequence.


Transformers
Decoder

• At each timestep, the decoder:
  • Takes the current input (e.g., the previous output or a start token).
  • Combines it with the encoder's context vector.
  • Uses this combined information to predict the next token in the sequence.
• This is repeated until the entire output sequence is generated (see the toy sketch below).
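A toy sketch of that generation loop (assuming PyTorch; the vocabulary, sizes, and step function are made up and untrained), showing how each step feeds the previously generated token back in together with the encoder's context vector:

```python
import torch
import torch.nn as nn

# Hypothetical pieces, standing in for a trained decoder.
vocab = ["<start>", "<end>", "hi", "i", "love", "nlp"]
context = torch.randn(16)                     # stand-in for the encoder's context vector
embed = nn.Embedding(len(vocab), 8)           # embeds the previously generated token
step = nn.Linear(16 + 8, len(vocab))          # combines context + token embedding into logits

tokens = [vocab.index("<start>")]
for _ in range(10):                           # generate at most 10 tokens
    prev = embed(torch.tensor(tokens[-1]))    # current input = previously generated token
    logits = step(torch.cat([context, prev])) # combine it with the encoder's context vector
    next_id = int(logits.argmax())            # greedy choice of the next token
    tokens.append(next_id)
    if vocab[next_id] == "<end>":             # stop once the end token is produced
        break

print([vocab[t] for t in tokens])             # untrained weights, so the output is arbitrary
```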


Transformers
Decoder

• Like the encoder, the decoder can use RNNs, LSTMs, GRUs, or Transformer decoders to process the sequence.
• Transformer-based decoders also use self-attention to focus on relevant parts of the already-generated sequence.
• The decoder outputs a sequence of predictions, such as words in a translated sentence or characters in speech recognition.


Transformers

• Input
• What it is: The input is a sequence of tokens, e.g., a sentence like "I am learning NLP".
• Tokens are typically numericalized (e.g., ["I", "am", "learning", "NLP"] → [101, 220, 345, 678]).


Transformers

• Input Embedding
• Purpose: Maps each token in the input to a high-dimensional dense vector.
• Example: The word "NLP" might map to a vector of size 512 or 768, e.g., [0.5, 0.3, -0.8, ..., 1.2].


Transformers

• Positional Encoding
• What it is: Adds information about the position of each token in the sequence, since transformers are position-agnostic.
• Why: Unlike RNNs, which inherently process sequences step by step, transformers process tokens in parallel and need positional context.
• How: Adds sinusoidal values (or learnable embeddings) to the input embeddings (a sketch follows below).
• Example:
  • Input embedding for "NLP": [0.5, 0.3, -0.8]
  • Positional encoding: [0.1, -0.2, 0.05]
  • Final encoding: [0.6, 0.1, -0.75]
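A short NumPy sketch of the sinusoidal variant used in the original Transformer paper (sequence length and model dimension here are made up):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]           # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions
    return pe

embeddings = np.random.randn(4, 8)                            # 4 tokens, model dimension 8
encoded = embeddings + sinusoidal_positional_encoding(4, 8)   # simply added, as described above
print(encoded.shape)                                          # (4, 8)
```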


Transformers

• Multi-Head Attention
• What it is: A mechanism that allows the model to focus on different parts of the input sequence simultaneously.
• How: It computes the attention of each token with respect to every other token using:
  • Query (Q): Represents the current word.
  • Key (K): Represents the target words in the sequence.
  • Value (V): Carries the actual information.
• Example: For the input "I am learning NLP", attention can highlight that "learning" strongly attends to "NLP" (they are contextually linked).
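A minimal NumPy sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with random matrices standing in for the learned Q, K, V projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each query attends to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the values

seq_len, d_k = 4, 8                      # e.g., the 4 tokens of "I am learning NLP"
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # 4x4 attention weights, one row per token
print(output.shape)       # (4, 8)
```
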
Transformers

• Add & Norm Layer
• What it is: Combines (adds) the original input with the output of the attention layer and normalizes the result.
• Why: Helps stabilize and accelerate training.


Transformers

• Feed Forward
• What it is: A fully connected neural network applied to each token's embedding independently.
• Purpose: Adds non-linearity and transforms the embeddings into a more expressive space.
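Putting multi-head attention, Add & Norm, and the feed-forward layer together, a compact sketch of one encoder block (assuming PyTorch; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Multi-head self-attention -> Add & Norm -> position-wise feed-forward -> Add & Norm."""
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        attn_out, _ = self.attention(x, x, x)     # Q, K, V all come from x (self-attention)
        x = self.norm1(x + attn_out)              # Add & Norm
        x = self.norm2(x + self.feed_forward(x))  # feed-forward applied to each token, then Add & Norm
        return x

block = EncoderBlock()
tokens = torch.randn(2, 5, 64)                    # a batch of 2 sequences, 5 tokens each
print(block(tokens).shape)                        # torch.Size([2, 5, 64])
```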


Transformers
Attention Mechanism

[Attention-mechanism figures omitted.]

Source: https://arxiv.org/abs/1706.03762

