Report On Word2vec
Submitted by:
NAME ROLL NO ENROLL NO
DHANANJAY GORAIN 190 22022002002028
SUBRATA NANDI 198 22022002002028
SUPERVISED BY:
PROF. APURBA PAUL
LECTURER
COMPUTER SCIENCE AND ENGINEERING
1. Introduction
1.1 Background
Natural Language Processing (NLP) has witnessed significant
advancements in recent years, and word embeddings have
become a cornerstone in various NLP applications. The ability
to represent words as dense vectors in a continuous vector
space has improved the performance of many language-related
tasks.
2. Literature Review
Word2Vec, introduced by Mikolov et al. (2013), is a popular
technique for learning distributed representations of words.
CBOW and Skip-gram are two primary architectures used for
training Word2Vec models. CBOW predicts the target word
from its context, while Skip-gram predicts the context words
from the target word. These models have been successfully
applied in various NLP applications, including machine
translation, sentiment analysis, and information retrieval.
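To make the architectural difference concrete, the following is a minimal illustrative sketch (not the project code) of how training pairs are formed: CBOW pairs a window of context words with the centre word, while Skip-gram pairs the centre word with each context word individually.

# Illustrative only: build CBOW and Skip-gram training pairs from a toy sentence.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of context words on each side of the centre word

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))                  # CBOW: context -> target
    skipgram_pairs.extend((target, c) for c in context)   # Skip-gram: target -> each context word

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:4])  # pairs generated from the first centre words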
3. Methodology
3.1 Word2Vec Overview
Word2Vec is a neural network-based model that learns
distributed representations of words in a continuous vector
space. The model is trained to predict the context of a word
(Skip-gram) or predict the word given its context (CBOW).
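As a concrete illustration of the Skip-gram variant, the sketch below shows one common way to express it in TensorFlow/Keras with two embedding tables and a dot-product score. The class and parameter names (SkipGramNS, vocab_size, embedding_dim, num_neg) are illustrative choices, not the exact project code.

import tensorflow as tf

class SkipGramNS(tf.keras.Model):
    """Skip-gram scorer with negative sampling (illustrative sketch)."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Separate embedding tables for centre (target) words and context/negative words.
        self.target_emb = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.context_emb = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    def call(self, inputs):
        target, context = inputs                 # target: (batch, 1); context: (batch, 1 + num_neg)
        t = self.target_emb(target)              # (batch, 1, dim)
        c = self.context_emb(context)            # (batch, 1 + num_neg, dim)
        # One dot product per candidate context word; the true context word should score highest.
        return tf.einsum("bij,bkj->bk", t, c)    # (batch, 1 + num_neg)

The scores can then be trained with a sigmoid or softmax cross-entropy loss, treating the true context word as the positive class and the sampled words as negatives.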
4. Implementation
4.1 Environment Setup
The project is implemented using Python with the TensorFlow
library. The code is structured to facilitate easy reproducibility
and experimentation.
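A minimal sketch of the kind of setup cell used, assuming TensorFlow 2.x and NumPy; the hyperparameter values shown here are placeholders, not the values used in the experiments.

import numpy as np
import tensorflow as tf

print("TensorFlow:", tf.__version__)

# Fix random seeds so runs are reproducible.
np.random.seed(42)
tf.random.set_seed(42)

# Placeholder hyperparameters; the actual values depend on the experiment.
EMBEDDING_DIM = 128
WINDOW_SIZE = 2
NUM_NEGATIVE = 5
BATCH_SIZE = 1024
EPOCHS = 5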
4.2 Dataset
The dataset used is the "Text8" dataset, a small subset of the
English Wikipedia. It contains approximately 100 MB of cleaned
text.
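For reference, a small sketch of how Text8 can be fetched and tokenised; the download URL is the commonly used Matt Mahoney mirror and may need adjusting.

import os
import urllib.request
import zipfile

TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"  # commonly used mirror; adjust if it moves

def load_text8(path="text8.zip"):
    """Download (if needed) and return the Text8 corpus as a list of tokens."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(TEXT8_URL, path)
    with zipfile.ZipFile(path) as zf:
        text = zf.read(zf.namelist()[0]).decode("utf-8")
    # Text8 is a single line of lower-cased, space-separated words.
    return text.split()

words = load_text8()
print(f"{len(words):,} tokens; first ten: {words[:10]}")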
6. Discussion
6.1 Model Comparison
The comparison reflects the trade-off commonly reported for Word2Vec: CBOW trains faster and performs well on frequent words, while Skip-gram tends to represent rare words better. The choice between the two therefore depends on the specific nature of the linguistic relationships within the data and on the corpus itself.
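One simple way to make such a comparison concrete is to inspect nearest neighbours under cosine similarity for both models. The sketch below assumes an "embeddings" matrix and "word_to_id" / "id_to_word" mappings produced by training; these names are illustrative.

import numpy as np

def nearest_neighbours(word, embeddings, word_to_id, id_to_word, k=5):
    """Return the k most cosine-similar words to `word` (illustrative helper)."""
    v = embeddings[word_to_id[word]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-9
    sims = embeddings @ v / norms
    best = np.argsort(-sims)[1:k + 1]  # skip index 0, which is the query word itself
    return [(id_to_word[i], float(sims[i])) for i in best]

# e.g. compare nearest_neighbours("king", cbow_vectors, ...) against the Skip-gram vectors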
6.2 Limitations
Limitations include sensitivity to hyperparameters and the need
for large datasets for better generalization.
7. Conclusion
This project demonstrates the effectiveness of Word2Vec
models in capturing semantic and syntactic relationships. The
choice between CBOW and Skip-gram depends on the
characteristics of the dataset and the task at hand.
8. Future Work
Future work could involve exploring advanced Word2Vec
variants, experimenting with larger datasets, and applying the
learned embeddings to downstream NLP tasks.