
PROJECT REPORT ON

Word2Vec using CBOW (Continuous Bag-of-Words) and Skip-gram

Submitted by:
NAME ROLL NO ENROLL NO
DHANANJAY GORAIN 190 22022002002028
SUBRATA NANDI 198 22022002002028

SUPERVISED BY:
PROF. APURBA PAUL
LECTURER
COMPUTER SCIENCE AND ENGINEERING
1. Introduction
1.1 Background
Natural Language Processing (NLP) has witnessed significant
advancements in recent years, and word embeddings have
become a cornerstone in various NLP applications. The ability
to represent words as dense vectors in a continuous vector
space has improved the performance of many language-related
tasks.

1.2 Problem Statement


This project aims to train Word2Vec models, specifically the
Continuous Bag of Words (CBOW) and Skip-gram models, to
generate high-quality word embeddings. The goal is to explore
the differences in their architectures and assess their
performance on a given dataset.

2. Literature Review
Word2Vec, introduced by Mikolov et al. (2013), is a popular
technique for learning distributed representations of words.
CBOW and Skip-gram are two primary architectures used for
training Word2Vec models. CBOW predicts the target word
from its context, while Skip-gram predicts the context words
from the target word. These models have been successfully
applied in various NLP applications, including machine
translation, sentiment analysis, and information retrieval.
3. Methodology
3.1 Word2Vec Overview
Word2Vec is a neural network-based model that learns
distributed representations of words in a continuous vector
space. The model is trained to predict the context of a word
(Skip-gram) or predict the word given its context (CBOW).

3.2 CBOW Model


The continuous bag-of-words (CBOW) model is a neural network that learns word embeddings by predicting a target word from its surrounding context. It takes a window of surrounding words as input and tries to predict the target word in the centre of the window. The model is trained on a large text dataset and learns to make these predictions from the co-occurrence patterns it observes, so that words used in similar contexts end up with similar vectors. The resulting embeddings are then used as features in downstream natural language processing tasks such as text classification and machine translation. CBOW is one of the two Word2Vec architectures; the other is the Skip-gram model described in the next section.
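To make the idea concrete, the following minimal Python sketch shows how (context, target) training pairs could be built from a toy sentence; the sentence and window size are illustrative only and are not the project's actual settings.

# Minimal sketch: building CBOW (context -> target) training pairs.
# The toy sentence and window size are illustrative only.
tokens = "the dog fetched the ball".split()
window = 2  # number of words taken on each side of the target

cbow_pairs = []
for i, target in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))

for context, target in cbow_pairs:
    print(context, "->", target)
# e.g. ['the', 'fetched', 'the'] -> dog  (the context predicts the centre word)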
3.3 Skip-gram Model
The skip-gram model was introduced by Mikolov et al. in their
paper "Efficient Estimation of Word Representations in Vector
Space" (2013). It learns the meaning of words from the contexts
in which they are used. For example, the model can learn about
the word "dog" by looking at sentences where "dog" appears and
observing the words that come before and after it; from this it
learns how "dog" is typically used, and that knowledge is
captured in its vector representation.
Here is an example to illustrate the concept. Consider the
sentence:
The dog fetched the ball.
To train a skip-gram model on the word "dog" with a context
window of one word on each side, the goal of the model is to
predict the context words "the" and "fetched" given the input
word "dog". The training data for the model would therefore
consist of pairs of the form (input word = "dog", context word =
"the") and (input word = "dog", context word = "fetched").
3.4 Data Preprocessing
The dataset is pre-processed by tokenization, lowercasing, and
removing stop words and punctuation. The cleaned data is then
used for training the Word2Vec models.
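The sketch below illustrates these preprocessing steps in Python; the small stop-word list is only an illustrative subset, and in practice a full list (for example NLTK's English stop words) would be used.

# Minimal sketch of the preprocessing steps: lowercasing, tokenisation,
# and removal of punctuation and stop words.
import string

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # illustrative subset

def preprocess(text):
    # Lowercase, strip punctuation characters, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    # Drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The dog fetched the ball."))
# -> ['dog', 'fetched', 'ball']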

4. Implementation
4.1 Environment Setup
The project is implemented using Python with the TensorFlow
library. The code is structured to facilitate easy reproducibility
and experimentation.

4.2 Dataset
The dataset used is the "Text8" dataset, a small subset of the
English Wikipedia. It contains approximately 100 MB of cleaned
text.
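A typical way to obtain the corpus is sketched below; the download URL is the commonly used Text8 mirror and is given here only as a reference.

# Sketch: downloading and reading the Text8 corpus.
import os
import urllib.request
import zipfile

URL = "http://mattmahoney.net/dc/text8.zip"  # commonly used mirror

if not os.path.exists("text8"):
    urllib.request.urlretrieve(URL, "text8.zip")
    with zipfile.ZipFile("text8.zip") as zf:
        zf.extractall(".")  # extracts a single file named "text8"

with open("text8") as f:
    corpus = f.read().split()  # whitespace-separated, already lowercased
print(len(corpus), "tokens")   # roughly 17 million tokens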

4.3 Model Training


Both CBOW and Skip-gram models are trained on the dataset
with a vocabulary size of 50,000 words and embedding
dimensions set to 100. The training is done over 10 epochs
using stochastic gradient descent.
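The following is a simplified sketch of how the Skip-gram network could be set up in TensorFlow/Keras with the hyperparameters listed above (vocabulary of 50,000 words, embedding dimension 100, 10 epochs, stochastic gradient descent). The project's actual implementation may differ in detail, and the generation of training pairs is only indicated in comments.

import tensorflow as tf

VOCAB_SIZE = 50_000
EMBED_DIM = 100

# Target and context words enter the network as integer ids.
target_in = tf.keras.Input(shape=(1,), name="target")
context_in = tf.keras.Input(shape=(1,), name="context")

target_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(target_in)
context_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(context_in)

# The dot product of the two embeddings scores how likely the pair is genuine.
score = tf.keras.layers.Dot(axes=-1)([target_emb, context_emb])
score = tf.keras.layers.Flatten()(score)

model = tf.keras.Model([target_in, context_in], score)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# Training pairs (label 1 for true target/context pairs, 0 for negative
# samples) can be produced with tf.keras.preprocessing.sequence.skipgrams
# from the tokenised corpus, then:
# model.fit([targets, contexts], labels, epochs=10, batch_size=1024)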
5. Results
5.1 Evaluation Metrics
Model performance is evaluated based on intrinsic evaluation
metrics such as word similarity and analogy tasks.
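The sketch below shows how these intrinsic evaluations can be run against the learned embedding matrix; the names embeddings, word2id and id2word are assumed to come from the trained model and its vocabulary.

import numpy as np

def most_similar(word, embeddings, word2id, id2word, k=5):
    # Rank all words by cosine similarity to the query word.
    v = embeddings[word2id[word]]
    scores = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v))
    return [id2word[i] for i in np.argsort(-scores) if id2word[i] != word][:k]

def analogy(a, b, c, embeddings, word2id, id2word):
    # Answer "a is to b as c is to ?" via vector arithmetic: b - a + c.
    v = embeddings[word2id[b]] - embeddings[word2id[a]] + embeddings[word2id[c]]
    scores = embeddings @ v / np.linalg.norm(embeddings, axis=1)
    for i in np.argsort(-scores):
        if id2word[i] not in (a, b, c):
            return id2word[i]

The similarity rankings can then be compared against human-rated word pairs, and analogy answers scored for accuracy over a set of analogy questions.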

5.2 Quantitative Results


Quantitative results show that Skip-gram outperforms CBOW in
capturing semantic relationships, while CBOW excels in
syntactic relationships.

5.3 Qualitative Analysis


Qualitative analysis involves exploring word embeddings
visually and interpreting relationships captured by the models.
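One common way to do this, sketched below, is to project a sample of the learned vectors into two dimensions with t-SNE and plot them. Here embeddings and id2word are assumed to come from the trained model, and scikit-learn and matplotlib are assumed to be available.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

n_words = 300  # plot only the first few hundred words for readability
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings[:n_words])

plt.figure(figsize=(12, 12))
for i in range(n_words):
    x, y = coords[i]
    plt.scatter(x, y, s=5)
    plt.annotate(id2word[i], (x, y), fontsize=8)
plt.title("t-SNE projection of Word2Vec embeddings")
plt.savefig("embeddings_tsne.png")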

6. Discussion
6.1 Model Comparison
The comparison reveals that the choice between CBOW and
Skip-gram depends on the specific nature of the linguistic
relationships within the data.

6.2 Limitations
Limitations include sensitivity to hyperparameters and the need
for large datasets for better generalization.

7. Conclusion
This project demonstrates the effectiveness of Word2Vec
models in capturing semantic and syntactic relationships. The
choice between CBOW and Skip-gram depends on the
characteristics of the dataset and the task at hand.

8. Future Work
Future work could involve exploring advanced Word2Vec
variants, experimenting with larger datasets, and applying the
learned embeddings to downstream NLP tasks.
