Lab 5
Lab Outline:
► Feature Extraction
► Feature extraction techniques
► One Hot Encoding
► Bag of Words (BOW)
► N-gram
► TF-IDF
► Word Vectors (Word2Vec)
► Continuous Bag of Words (CBOW)
► Skip-Gram
Feature Extraction
► Feature
► A feature is selected or processed data prepared for use as input to a
machine learning algorithm. Features can be things like the price of a
house, the RGB value of a pixel, or, in our case, the representation of a
word.
► Feature Extraction
► is the process of transforming raw text into a numerical representation
that can be processed by computers.
One Hot Encoding
► Drawbacks
► Size of the input vector scales with the size of the vocabulary
► No notion of relationships between words
► Resulting vectors are sparse (most entries are zero), as the sketch
below illustrates
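A minimal one-hot encoding sketch in plain Python; the toy vocabulary is
illustrative, not from the lab. Note that each vector has exactly one
non-zero entry and its length equals the vocabulary size.

# One-hot encoding over a toy vocabulary (pure Python).
vocab = sorted({"the", "cat", "sat", "on", "hat"})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

for w in ["cat", "hat"]:
    print(w, one_hot(w))
# cat [1, 0, 0, 0, 0]   (sorted vocab: cat, hat, on, sat, the)
# hat [0, 1, 0, 0, 0]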
Bag of Words
► Drawbacks
► No semantic relationship between words
► Not designed to model linguistic knowledge
► Sparsity
► Due to the high number of dimensions
► Curse of dimensionality
► As dimensionality increases, the distance between points becomes
less meaningful
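To make the model concrete, here is a minimal bag-of-words sketch using
scikit-learn's CountVectorizer (assuming scikit-learn is installed); the
two-sentence corpus matches the n-gram example used later.

# Bag-of-words document-term matrix (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

vectorizer = CountVectorizer()          # default: unigram word counts
X = vectorizer.fit_transform(corpus)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['and' 'ate' 'cat' 'dog' 'hat' 'on' 'sat' 'the']
print(X.toarray())
# [[0 0 1 0 1 1 1 2]
#  [1 1 1 1 1 0 0 3]]

The matrix gains one column per vocabulary word, which is why real corpora
produce the very high-dimensional, sparse vectors noted above.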
Bag of N-grams
► N-gram
is a contiguous sequence of word tokens from a text document: bi-grams
(two words), tri-grams (three words), and so on.
► The bag-of-n-grams model counts how many times each such phrase occurs
in a document.
► Example
► Vocab (set of all bi-grams in the corpus) = ['the cat', 'cat sat',
'sat on', 'on the', 'the hat', 'the dog', 'dog ate', 'ate the',
'cat and', 'and the']
► Sent 1: 'the cat sat on the hat' = {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
► Sent 2: 'the dog ate the cat and the hat' = {1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
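The same example can be reproduced by switching CountVectorizer to
bi-grams (a sketch, assuming scikit-learn; the vectorizer sorts its
vocabulary alphabetically, so the column order differs from the slide).

# Bag of bi-grams over the two example sentences.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

# ngram_range=(2, 2) keeps bi-grams only; (1, 2) would also mix in unigrams.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())
# ['and the' 'ate the' 'cat and' 'cat sat' 'dog ate' 'on the' 'sat on'
#  'the cat' 'the dog' 'the hat']
print(X.toarray())
# [[0 0 0 1 0 1 1 1 0 1]
#  [1 1 1 0 1 0 0 1 1 1]]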
Bag of N-grams
► Drawbacks
► Very large vocab set
► No notion of syntactic or semantic similarity between words.
Term Frequency-Inverse Document Frequency
► TF-IDF
► Captures the importance of a word to a document in a corpus.
► Importance increases proportionally with the number of times a word
appears in the document, but is inversely proportional to the frequency
of the word in the corpus.
Term Frequency-Inverse Document Frequency
► One common formulation:
► TF(t, d) = (number of times term t appears in document d) / (total
number of terms in d)
► IDF(t) = log(N / number of documents containing t), where N is the
total number of documents in the corpus
► TF-IDF(t, d) = TF(t, d) × IDF(t)
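A minimal sketch of these formulas in plain Python. Library
implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF
and normalization, so their numbers differ slightly from this direct form.

# TF-IDF computed directly from the definitions above (pure Python).
import math

corpus = [
    "the cat sat on the hat".split(),
    "the dog ate the cat and the hat".split(),
]
N = len(corpus)

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / document frequency of `term`).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# 'the' appears in every document, so its IDF (and TF-IDF) is 0;
# 'sat' appears only in document 0, so it scores higher there.
print(tf_idf("the", corpus[0]))  # 0.0
print(tf_idf("sat", corpus[0]))  # (1/6) * log(2) ≈ 0.1155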
► Drawbacks
► Based on the bag-of-words model, so it does not capture position in
text, semantics, or co-occurrence across documents.
► Thus TF-IDF is only useful as a lexical-level feature.
Legacy Techniques Problem
► The techniques above treat words as independent symbols and ignore
context. Word vectors address this with a distributional idea:
► When a word w appears in a text, its context is the set of words that
appear nearby (within a fixed-size window), as sketched below.
► Use the many contexts of w to build up a representation of w.
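A small sketch of what "context within a fixed-size window" means (pure
Python; the window size of 2 is an illustrative choice).

# Collect context words within a fixed-size window around each
# occurrence of a target word.
def contexts(tokens, target, window=2):
    ctxs = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            ctxs.append(left + right)
    return ctxs

tokens = "the dog ate the cat and the hat".split()
print(contexts(tokens, "cat"))
# [['ate', 'the', 'and', 'the']]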
Word Vectors/ Embeddings
► Advantages
► It can capture rare words
► It captures the similarity of word semantics
► Synonyms such as ‘intelligent’ and ‘smart’ have very similar contexts.
Word2Vec Implementation
► Input: one-hot vector(s) for the context words (CBOW) or for the target
word (skip-gram)
► Output: a probability distribution over the vocabulary for the predicted
word(s)
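A minimal training sketch with gensim (assuming gensim 4.x is installed;
the corpus and hyperparameters are toy values). The sg flag selects
between the two architectures in the outline: sg=0 for CBOW, sg=1 for
skip-gram.

# Train a toy Word2Vec model with gensim (assumed installed, 4.x API).
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the hat".split(),
    "the dog ate the cat and the hat".split(),
    "the dog sat on the cat".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=0,             # 0 = CBOW, 1 = skip-gram
)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity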
Word Vectors Applications
► Sentiment Analysis
► Speech Recognition
► Information Retrieval
► Question Answering