
PROJECT REPORT ON

Word2Vec using CBOW (Continuous Bag-of-Words) and Skip-gram

Submitted by:
NAME ROLL NO ENROLL NO
DHANANJAY GORAIN 190 22022002002028
SUBRATA NANDI 198 22022002002028

SUPERVISED BY:
PROF. APURBA PAUL
LECTURER
COMPUTER SCIENCE AND ENGINEERING
1. Introduction
1.1 Background
Natural Language Processing (NLP) has witnessed significant
advancements in recent years, and word embeddings have
become a cornerstone in various NLP applications. The ability
to represent words as dense vectors in a continuous vector
space has improved the performance of many language-related
tasks.

1.2 Problem Statement


This project aims to train Word2Vec models, specifically the
Continuous Bag of Words (CBOW) and Skip-gram models, to
generate high-quality word embeddings. The goal is to explore
the differences in their architectures and assess their
performance on a given dataset.

2. Literature Review
Word2Vec, introduced by Mikolov et al. (2013), is a popular
technique for learning distributed representations of words.
CBOW and Skip-gram are two primary architectures used for
training Word2Vec models. CBOW predicts the target word
from its context, while Skip-gram predicts the context words
from the target word. These models have been successfully
applied in various NLP applications, including machine
translation, sentiment analysis, and information retrieval.
3. Methodology
3.1 Word2Vec Overview
Word2Vec is a neural network-based model that learns
distributed representations of words in a continuous vector
space. The model is trained to predict the context of a word
(Skip-gram) or predict the word given its context (CBOW).

3.2 CBOW Model


The continuous bag-of-words (CBOW) model is a neural network that learns word embeddings by predicting a target word from its surrounding context. It takes a window of surrounding words as input and tries to predict the target word in the centre of the window. The model is trained on a large text dataset and learns to make these predictions from the co-occurrence patterns it observes, so that words used in similar contexts end up with similar vectors. The resulting embeddings are then used as features in downstream natural language processing tasks such as text classification and machine translation. CBOW is one of the two Word2Vec architectures; the other is the Skip-gram model described in the next section.
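To make the idea concrete, the following minimal Python sketch shows how (context, target) training pairs could be built from a toy sentence; the sentence and window size are illustrative only and are not the project's actual settings.

# Minimal sketch: building CBOW (context -> target) training pairs.
# The toy sentence and window size are illustrative only.
tokens = "the dog fetched the ball".split()
window = 2  # number of words taken on each side of the target

cbow_pairs = []
for i, target in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))

for context, target in cbow_pairs:
    print(context, "->", target)
# e.g. ['the', 'fetched', 'the'] -> dog  (the context predicts the centre word)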
3.3 Skip-gram Model
The skip-gram model was introduced by Mikolov et al. in their
paper "Efficient Estimation of Word Representations in Vector
Space" (2013). It learns the meaning of words from the contexts
in which they are used. For example, the model can learn about
the word "dog" by looking at sentences where "dog" appears and
observing the words that come before and after it; from this it
learns how "dog" is typically used, and that knowledge is
captured in its vector representation.
Here is an example to illustrate the concept. Consider the
sentence:
The dog fetched the ball.
To train a skip-gram model on the word "dog" with a context
window of one word on each side, the goal of the model is to
predict the context words "the" and "fetched" given the input
word "dog". The training data for the model would therefore
consist of pairs of the form (input word = "dog", context word =
"the") and (input word = "dog", context word = "fetched").
3.4 Data Preprocessing
The dataset is pre-processed by tokenization, lowercasing, and
removing stop words and punctuation. The cleaned data is then
used for training the Word2Vec models.
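The sketch below illustrates these preprocessing steps in Python; the small stop-word list is only an illustrative subset, and in practice a full list (for example NLTK's English stop words) would be used.

# Minimal sketch of the preprocessing steps: lowercasing, tokenisation,
# and removal of punctuation and stop words.
import string

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # illustrative subset

def preprocess(text):
    # Lowercase, strip punctuation characters, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    # Drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The dog fetched the ball."))
# -> ['dog', 'fetched', 'ball']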

4. Implementation
4.1 Environment Setup
The project is implemented using Python with the TensorFlow
library. The code is structured to facilitate easy reproducibility
and experimentation.

4.2 Dataset
The dataset used is the "Text8" dataset, a small subset of the
English Wikipedia. It contains approximately 100 MB of cleaned
text.
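A typical way to obtain the corpus is sketched below; the download URL is the commonly used Text8 mirror and is given here only as a reference.

# Sketch: downloading and reading the Text8 corpus.
import os
import urllib.request
import zipfile

URL = "http://mattmahoney.net/dc/text8.zip"  # commonly used mirror

if not os.path.exists("text8"):
    urllib.request.urlretrieve(URL, "text8.zip")
    with zipfile.ZipFile("text8.zip") as zf:
        zf.extractall(".")  # extracts a single file named "text8"

with open("text8") as f:
    corpus = f.read().split()  # whitespace-separated, already lowercased
print(len(corpus), "tokens")   # roughly 17 million tokens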

4.3 Model Training


Both CBOW and Skip-gram models are trained on the dataset
with a vocabulary size of 50,000 words and embedding
dimensions set to 100. The training is done over 10 epochs
using stochastic gradient descent.
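The following is a simplified sketch of how the Skip-gram network could be set up in TensorFlow/Keras with the hyperparameters listed above (vocabulary of 50,000 words, embedding dimension 100, 10 epochs, stochastic gradient descent). The project's actual implementation may differ in detail, and the generation of training pairs is only indicated in comments.

import tensorflow as tf

VOCAB_SIZE = 50_000
EMBED_DIM = 100

# Target and context words enter the network as integer ids.
target_in = tf.keras.Input(shape=(1,), name="target")
context_in = tf.keras.Input(shape=(1,), name="context")

target_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(target_in)
context_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(context_in)

# The dot product of the two embeddings scores how likely the pair is genuine.
score = tf.keras.layers.Dot(axes=-1)([target_emb, context_emb])
score = tf.keras.layers.Flatten()(score)

model = tf.keras.Model([target_in, context_in], score)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# Training pairs (label 1 for true target/context pairs, 0 for negative
# samples) can be produced with tf.keras.preprocessing.sequence.skipgrams
# from the tokenised corpus, then:
# model.fit([targets, contexts], labels, epochs=10, batch_size=1024)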
5. Results
5.1 Evaluation Metrics
Model performance is evaluated based on intrinsic evaluation
metrics such as word similarity and analogy tasks.
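The sketch below shows how these intrinsic evaluations can be run against the learned embedding matrix; the names embeddings, word2id and id2word are assumed to come from the trained model and its vocabulary.

import numpy as np

def most_similar(word, embeddings, word2id, id2word, k=5):
    # Rank all words by cosine similarity to the query word.
    v = embeddings[word2id[word]]
    scores = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v))
    return [id2word[i] for i in np.argsort(-scores) if id2word[i] != word][:k]

def analogy(a, b, c, embeddings, word2id, id2word):
    # Answer "a is to b as c is to ?" via vector arithmetic: b - a + c.
    v = embeddings[word2id[b]] - embeddings[word2id[a]] + embeddings[word2id[c]]
    scores = embeddings @ v / np.linalg.norm(embeddings, axis=1)
    for i in np.argsort(-scores):
        if id2word[i] not in (a, b, c):
            return id2word[i]

The similarity rankings can then be compared against human-rated word pairs, and analogy answers scored for accuracy over a set of analogy questions.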

5.2 Quantitative Results


Quantitative results show that Skip-gram outperforms CBOW in
capturing semantic relationships, while CBOW excels in
syntactic relationships.

5.3 Qualitative Analysis


Qualitative analysis involves exploring word embeddings
visually and interpreting relationships captured by the models.
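One common way to do this, sketched below, is to project a sample of the learned vectors into two dimensions with t-SNE and plot them. Here embeddings and id2word are assumed to come from the trained model, and scikit-learn and matplotlib are assumed to be available.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

n_words = 300  # plot only the first few hundred words for readability
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings[:n_words])

plt.figure(figsize=(12, 12))
for i in range(n_words):
    x, y = coords[i]
    plt.scatter(x, y, s=5)
    plt.annotate(id2word[i], (x, y), fontsize=8)
plt.title("t-SNE projection of Word2Vec embeddings")
plt.savefig("embeddings_tsne.png")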

6. Discussion
6.1 Model Comparison
The comparison reveals that the choice between CBOW and
Skip-gram depends on the specific nature of the linguistic
relationships within the data.

6.2 Limitations
Limitations include sensitivity to hyperparameters and the need
for large datasets for better generalization.

7. Conclusion
This project demonstrates the effectiveness of Word2Vec
models in capturing semantic and syntactic relationships. The
choice between CBOW and Skip-gram depends on the
characteristics of the dataset and the task at hand.

8. Future Work
Future work could involve exploring advanced Word2Vec
variants, experimenting with larger datasets, and applying the
learned embeddings to downstream NLP tasks.
