WORD EMBEDDING Project
A seminar report submitted in partial fulfillment of the requirements for the award of the
degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING (AI&ML)
BY
BOYINA.UTKARSHA (20JG1A4208)
&
KAIRAM.SOUMYA (20JG1A4224)
DR.L.GREESHMA
(Assistant Professor)
WHAT ARE WORD EMBEDDINGS FOR TEXT?
Word embeddings are a type of word representation that allows words with similar meaning
to have a similar representation.
They are a distributed representation for text that is perhaps one of the key breakthroughs
for the impressive performance of deep learning methods on challenging natural language
processing problems.
After completing this project, you will know:
What the word embedding approach for representing text is and how it differs from other
feature extraction methods.
That there are 3 main algorithms for learning a word embedding from text data.
That you can either train a new embedding or use a pre-trained embedding on your natural
language processing task.
In natural language processing (NLP), word embedding is a term used for the representation of words
for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such
that the words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using a set of language modeling and feature
learning techniques where words or phrases from the vocabulary are mapped to vectors of real
numbers. Conceptually, it involves a mathematical embedding from a space with many dimensions per
word to a continuous vector space with a much lower dimension.
Methods to generate this mapping include neural networks,[2] dimensionality reduction on the
word co-occurrence matrix, probabilistic models, explainable knowledge-base methods,[7] and
explicit representation in terms of the context in which words appear.
Machine learning models take vectors (arrays of numbers) as input. When working with
text, the first thing you must do is come up with a strategy to convert strings to numbers
(or to "vectorize" the text) before feeding it to the model. In this section, you will look at
three strategies for doing so.
One-hot encodings
As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the
sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is
(cat, mat, on, sat, the). To represent each word, you will create a zero vector with length
equal to the vocabulary, then place a one in the index that corresponds to the word.
This approach is illustrated in the sketch below.
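As a concrete illustration, here is a minimal one-hot encoding sketch for that sentence (plain Python/NumPy; the variable names are illustrative, not from the original code):

```python
import numpy as np

sentence = "the cat sat on the mat"
vocab = sorted(set(sentence.split()))            # ['cat', 'mat', 'on', 'sat', 'the']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Zero vector of vocabulary length with a 1 at the word's index."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))   # [1 0 0 0 0]
print(one_hot("sat"))   # [0 0 0 1 0]
```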
In contrast to such sparse, hand-crafted encodings, word embeddings are learned from data. The
learning process is either joint with the neural network model on some task, such as
document classification, or is an unsupervised process, using document statistics.
This section reviews three techniques that can be used to learn a word embedding from text
data.
1. EMBEDDING LAYER
An embedding layer, for lack of a better name, is a word embedding that is learned jointly
with a neural network model on a specific natural language processing task, such
as language modeling or document classification.
It requires that document text be cleaned and prepared such that each word is one-hot
encoded. The size of the vector space is specified as part of the model, such as 50, 100, or
300 dimensions. The vectors are initialized with small random numbers. The embedding
layer is used on the front end of a neural network and is fit in a supervised way using the
Backpropagation algorithm.
… when the input to a neural network contains symbolic categorical features (e.g. features
that take one of k distinct symbols, such as words from a closed vocabulary), it is common
to associate each possible feature value (i.e., each word in the vocabulary) with a d-
dimensional vector for some d. These vectors are then considered parameters of the model,
and are trained jointly with the other parameters.
This approach of learning an embedding layer requires a lot of training data and can be
slow, but will learn an embedding both targeted to the specific text data and the NLP task.
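As a minimal sketch of this idea in Keras (the vocabulary size and dimensionality below are assumed example values; a full worked example appears in the CODE section later in this report):

```python
import tensorflow as tf

vocab_size = 10000     # assumed size of the integer-encoded vocabulary
embedding_dim = 100    # dimensionality of the vector space (e.g. 50, 100, or 300)

# The layer's weight matrix has shape (vocab_size, embedding_dim), starts from
# small random values, and is updated by backpropagation together with the rest
# of whatever network it is placed at the front of.
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
```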
2. WORD2VEC
Word2Vec is a statistical method for efficiently learning a standalone word embedding from
a text corpus.
It was developed by Tomas Mikolov, et al. at Google in 2013 as a response to make the
neural-network-based training of the embedding more efficient and since then has become
the de facto standard for developing pre-trained word embeddings.
Additionally, the work involved analysis of the learned vectors and the exploration of vector
math on the representations of words. For example, subtracting the “man-ness” from
“King” and adding “woman-ness” results in the word “Queen”, capturing the analogy
“king is to queen as man is to woman”.
We find that these representations are surprisingly good at capturing syntactic and semantic
regularities in language, and that each relationship is characterized by a relation-specific
vector offset. This allows vector-oriented reasoning based on the offsets between words.
For example, the male/female relationship is automatically learned, and with the induced
vector representations, “King – Man + Woman” results in a vector very close to “Queen.”
— Linguistic Regularities in Continuous Space Word Representations, 2013.
Two different learning models were introduced that can be used as part of the word2vec
approach to learn the word embedding; they are:
1. The Continuous Bag-of-Words (CBOW) model, which learns the embedding by predicting the current word based on its surrounding context.
2. The continuous skip-gram model, which learns the embedding by predicting the surrounding words given a current word.
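As a hedged illustration (not part of the original report), the gensim library can train such a model on a tokenized corpus; the toy sentences below are made up, and a real corpus would need far more text to learn meaningful vectors:

```python
from gensim.models import Word2Vec

# Tiny made-up corpus purely for illustration.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the skip-gram model; sg=0 would select CBOW (gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"].shape)   # (50,)
# On a real corpus this query approximates the "king - man + woman" analogy;
# on this toy corpus the result is not meaningful.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```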
3. GLOVE
The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the
word2vec method for efficiently learning word vectors, developed by Pennington, et al. at
Stanford.
Classical vector space model representations of words were developed using matrix
factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using
global text statistics but are not as good as the learned methods like word2vec at capturing
meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen
example above).
GloVe is an approach to marry both the global statistics of matrix factorization techniques
like LSA with the local context-based learning in word2vec.
Rather than using a window to define local context, GloVe constructs an explicit word-
context or word co-occurrence matrix using statistics across the whole text corpus. The
result is a learning model that may result in generally better word embeddings.
GloVe is a new global log-bilinear regression model for the unsupervised learning of word
representations that outperforms other models on word analogy, word similarity, and named
entity recognition tasks.
— GloVe: Global Vectors for Word Representation, 2014.
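Pre-trained GloVe vectors are distributed as plain-text files, one word per line followed by its vector components. A minimal sketch of loading them into a Python dictionary, assuming the publicly available 100-dimensional glove.6B.100d.txt file has already been downloaded to the working directory:

```python
import numpy as np

embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = vector

print(len(embeddings_index))          # roughly 400,000 entries
print(embeddings_index["king"][:5])   # first 5 of 100 dimensions
```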
1. LEARN AN EMBEDDING
You may choose to learn a word embedding for your problem.
This will require a large amount of text data to ensure that useful embeddings are learned,
such as millions or billions of words.
You have two main options when training your word embedding:
1. Learn it Standalone, where a model is trained to learn the embedding, which is saved and
used as a part of another model for your task later. This is a good approach if you would like
to use the same embedding in multiple models.
2. Learn Jointly, where the embedding is learned as part of a large task-specific model. This
is a good approach if you only intend to use the embedding on one task.
2. REUSE AN EMBEDDING
It is common for researchers to make pre-trained word embeddings available for free, often
under a permissive license so that you can use them on your own academic or commercial
projects.
For example, both word2vec and GloVe word embeddings are available for free download.
These can be used on your project instead of training your own embeddings from scratch.
You have two main options when it comes to using pre-trained embeddings:
1. Static, where the embedding is kept static and is used as a component of your model. This
is a suitable approach if the embedding is a good fit for your problem and gives good
results.
2. Updated, where the pre-trained embedding is used to seed the model, but the embedding
is updated jointly during the training of the model. This may be a good option if you are
looking to get the most out of the model and embedding on your task.
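A sketch of the two options in Keras; embedding_matrix stands in for a (vocab_size, embedding_dim) array of pre-trained vectors (here filled with random values purely so the snippet runs):

```python
import numpy as np
import tensorflow as tf

# Placeholder for pre-trained weights; in practice this matrix would be built
# from the GloVe or word2vec vectors loaded earlier.
vocab_size, embedding_dim = 10000, 100
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")

# Option 1 - Static: the pre-trained weights are frozen during training.
static_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)

# Option 2 - Updated: the pre-trained weights seed the layer and are
# fine-tuned jointly with the rest of the model.
updated_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True)
```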
CODE:
## Setup
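The code listings in the original report were included as images. The following is a minimal sketch of comparable setup code, assuming TensorFlow 2.9+ and the publicly available IMDB movie review archive used by the Keras word-embedding tutorial:

```python
import os
import shutil
import tensorflow as tf

# Download and extract the IMDB movie review dataset.
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
archive = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir=".", cache_subdir="")
dataset_dir = os.path.join(os.path.dirname(archive), "aclImdb")
train_dir = os.path.join(dataset_dir, "train")

# The extracted train/ folder contains an extra 'unsup' directory that is not
# needed for binary sentiment classification.
shutil.rmtree(os.path.join(train_dir, "unsup"), ignore_errors=True)

# Build batched training and validation datasets from the directory structure.
batch_size = 1024
seed = 123
train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset="training", seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset="validation", seed=seed)
```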
Take a look at a few movie reviews and their labels `(1: positive, 0: negative)` from the train
dataset.
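Assuming the train_ds dataset built in the setup sketch above:

```python
# Print the labels and the first characters of a few reviews from one batch.
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(label_batch[i].numpy(), text_batch.numpy()[i][:80])
```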
USING THE EMBEDDING LAYER
Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer
indices (which stand for specific words) to dense vectors (their embeddings). The
dimensionality (or width) of the embedding is a parameter you can experiment with to
see what works well for your problem, much in the same way you would experiment
with the number of neurons in a Dense layer.
When you create an Embedding layer, the weights for the embedding are randomly
initialized (just like any other layer). During training, they are gradually adjusted via
backpropagation. Once trained, the learned word embeddings will roughly encode
similarities between words (as they were learned for the specific problem your model is
trained on).
If you pass an integer to an embedding layer, the result replaces each integer with the
vector from the embedding table:
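For example (a small layer with a 1,000-word vocabulary and 5-dimensional embeddings, chosen here purely for illustration):

```python
import tensorflow as tf

# 1,000-word vocabulary embedded into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

# Each integer index is replaced by its 5-dimensional embedding vector.
result = embedding_layer(tf.constant([1, 2, 3]))
print(result.numpy().shape)   # (3, 5)
```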
For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of
shape (samples, sequence_length), where each entry is a sequence of integers. It
can embed sequences of variable lengths. You could feed into the embedding layer
above batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64,
15) (batch of 64 sequences of length 15).
The returned tensor has one more axis than the input; the embedding vectors are
aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N).
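Continuing the sketch above (where N is 5, the embedding dimensionality of the example layer):

```python
# A (2, 3) batch of integer sequences yields a (2, 3, 5) tensor: one
# embedding vector per token, aligned along the new last axis.
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
print(result.shape)   # (2, 3, 5)
```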
When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of
shape (samples, sequence_length, embedding_dimensionality). To convert from this
sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an
RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the
simplest. The Text Classification with an RNN tutorial is a good next step.
## Text preprocessing
Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a
TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more
about using this layer in the Text Classification tutorial.
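A sketch of this preprocessing step, continuing from the setup code above; the vocabulary size, sequence length, and HTML-stripping standardization are assumed choices:

```python
import re
import string
import tensorflow as tf

def custom_standardization(input_data):
    """Lowercase the text and strip HTML break tags and punctuation."""
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), "")

vocab_size = 10000
sequence_length = 100

# Map raw review strings to padded sequences of integer token indices.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length)

# Build the vocabulary from the training text only (labels are dropped).
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
```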
## Create a classification model

Use the Keras Sequential API to define the sentiment classification model:

The Embedding layer takes the integer-encoded vocabulary and looks up the embedding
vector for each word index. These vectors are learned as the model trains. The vectors
add a dimension to the output array. The resulting dimensions
are: (batch, sequence, embedding).
The GlobalAveragePooling1D layer returns a fixed-length output vector for
each example by averaging over the sequence dimension. This allows the model
to handle input of variable length, in the simplest way possible.
The fixed-length output vector is piped through a fully-connected (Dense) layer
with 16 hidden units.
The last layer is densely connected with a single output node.
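A sketch of a Sequential model matching this description, continuing from the preprocessing code above (embedding_dim = 16 is an assumed example value):

```python
embedding_dim = 16

model = tf.keras.Sequential([
    vectorize_layer,                            # raw strings -> integer indices
    tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    tf.keras.layers.GlobalAveragePooling1D(),   # average over the sequence axis
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),                   # single output node (a logit)
])
```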
Caution: This model doesn't use masking, so the zero-padding is used as part of the
input and hence the padding length may affect the output. To fix this, see the masking
and padding guide.
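A sketch of compiling and training this model with a TensorBoard callback so the metrics can be visualized later (the number of epochs and the log directory are assumed values):

```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=15,
          callbacks=[tensorboard_callback])
```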
With this approach the model reaches a validation accuracy of around 78% (note that the
model is overfitting, since the training accuracy is higher).
Note: Your results may be a bit different, depending on how weights were
randomly initialized before training the embedding layer.
You can look into the model summary to learn more about each layer of the model.
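For example:

```python
# Prints each layer's name, output shape, and number of trainable parameters.
model.summary()
```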
Finally, visualize the model metrics in TensorBoard.
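In a Jupyter or Colab notebook this can be done with the TensorBoard magic commands, pointed at the log directory written by the callback above (these are notebook commands, not plain Python):

```
%load_ext tensorboard
%tensorboard --logdir logs
```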
TensorBoard then displays the training and validation loss and accuracy curves recorded during training.