
WORD EMBEDDING

A seminar report submitted in partial fulfillment of the requirements for the award of the
degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING (AI&ML)

BY

BOYINA.UTKARSHA (20JG1A4208)
&
KAIRAM.SOUMYA (20JG1A4224)

Under the Guidance of

DR.L.GREESHMA
(Assistant Professor)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AI&ML)

GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN

(Affiliated to Jawaharlal Nehru Technological University Kakinada)


MADHURAWADA, VISAKHAPATNAM - 48
(2020-2024)

WHAT ARE WORD EMBEDDINGS FOR TEXT?
Word embeddings are a type of word representation that allows words with similar meaning
to have a similar representation.

They are a distributed representation for text that is perhaps one of the key breakthroughs
for the impressive performance of deep learning methods on challenging natural language
processing problems.
After completing this project, you will know:

 What the word embedding approach for representing text is and how it differs from other
feature extraction methods.
 That there are 3 main algorithms for learning a word embedding from text data.
 That you can either train a new embedding or use a pre-trained embedding on your natural
language processing task.

In natural language processing (NLP), word embedding is a term used for the representation of words
for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such
that the words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using a set of language modeling and feature
learning techniques where words or phrases from the vocabulary are mapped to vectors of real
numbers. Conceptually, it involves a mathematical embedding from a sparse space with one
dimension per word to a continuous vector space with a much lower dimension.
Methods to generate this mapping include neural networks,[2] dimensionality reduction on the
word co-occurrence matrix, probabilistic models, explainable knowledge base methods,[7] and
explicit representation in terms of the context in which words appear.

REPRESENTING TEXT AS NUMBERS

Machine learning models take vectors (arrays of numbers) as input. When working with
text, the first thing you must do is come up with a strategy to convert strings to numbers
(or to "vectorize" the text) before feeding it to the model. In this section, you will look at
three strategies for doing so.

One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the
sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is
(cat, mat, on, sat, the). To represent each word, you will create a zero vector with length
equal to the vocabulary, then place a one in the index that corresponds to the word.
This approach is illustrated in the sketch below.
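As a minimal illustration (not part of the original report's code), the following Python sketch builds these one-hot vectors for the example sentence; the vocabulary list and helper function are purely illustrative.

# A minimal sketch of one-hot encoding for the sentence "The cat sat on the mat".
vocab = ["cat", "mat", "on", "sat", "the"]  # the unique words, in alphabetical order

def one_hot(word, vocab):
    # Zero vector with length equal to the vocabulary, with a 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

for word in "the cat sat on the mat".split():
    print(word, one_hot(word, vocab))
# e.g. "cat" -> [1, 0, 0, 0, 0] and "the" -> [0, 0, 0, 0, 1]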

WORD EMBEDDING ALGORITHMS


Word embedding methods learn a real-valued vector representation for a predefined,
fixed-sized vocabulary from a corpus of text.

The learning process is either joint with the neural network model on some task, such as
document classification, or is an unsupervised process, using document statistics.

This section reviews three techniques that can be used to learn a word embedding from text
data.

1. EMBEDDING LAYER
An embedding layer, for lack of a better name, is a word embedding that is learned jointly
with a neural network model on a specific natural language processing task, such
as language modeling or document classification.
It requires that document text be cleaned and prepared such that each word is one-hot
encoded. The size of the vector space is specified as part of the model, such as 50, 100, or
300 dimensions. The vectors are initialized with small random numbers. The embedding
layer is used on the front end of a neural network and is fit in a supervised way using the
Backpropagation algorithm.

… when the input to a neural network contains symbolic categorical features (e.g. features
that take one of k distinct symbols, such as words from a closed vocabulary), it is common
to associate each possible feature value (i.e., each word in the vocabulary) with a d-
dimensional vector for some d. These vectors are then considered parameters of the model,
and are trained jointly with the other parameters.

— Page 49, Neural Network Methods in Natural Language Processing, 2017.


The one-hot encoded words are mapped to the word vectors. If a multilayer Perceptron
model is used, then the word vectors are concatenated before being fed as input to the
model. If a recurrent neural network is used, then each word may be taken as one input in a
sequence.

This approach of learning an embedding layer requires a lot of training data and can be
slow, but will learn an embedding both targeted to the specific text data and the NLP task.
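As a hedged sketch (not the report's own listing), the following shows how such a jointly trained embedding layer might look in Keras; the vocabulary size, embedding width, and classification head are illustrative assumptions.

import tensorflow as tf

vocab_size = 10_000    # assumed size of the one-hot encoded vocabulary
embedding_dim = 64     # assumed size of the learned vector space

model = tf.keras.Sequential([
    # The embedding weights start as small random numbers and are fit by
    # backpropagation together with the rest of the network.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # e.g. document classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])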

2. WORD2VEC
Word2Vec is a statistical method for efficiently learning a standalone word embedding from
a text corpus.

It was developed by Tomas Mikolov, et al. at Google in 2013 as a response to make the
neural-network-based training of the embedding more efficient, and since then it has become
the de facto standard for developing pre-trained word embeddings.

Additionally, the work involved analysis of the learned vectors and the exploration of vector
math on the representations of words. For example, subtracting the "man-ness" from
"King" and adding "woman-ness" results in the word "Queen", capturing the analogy
"King is to Queen as Man is to Woman".
We find that these representations are surprisingly good at capturing syntactic and semantic
regularities in language, and that each relationship is characterized by a relation-specific
vector offset. This allows vector-oriented reasoning based on the offsets between words.
For example, the male/female relationship is automatically learned, and with the induced
vector representations, “King – Man + Woman” results in a vector very close to “Queen.”

— Linguistic Regularities in Continuous Space Word Representations, 2013.
Two different learning models were introduced that can be used as part of the word2vec
approach to learn the word embedding; they are:

 Continuous Bag-of-Words, or CBOW model.


 Continuous Skip-Gram Model.
The CBOW model learns the embedding by predicting the current word based on its
context. The continuous skip-gram model learns by predicting the surrounding words given
a current word.

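As a rough sketch of training a standalone word2vec embedding, the gensim library (not used in the report itself) exposes both objectives through a single class; the toy corpus and parameters below are placeholders.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens. A real corpus would be far larger.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman"],
]

# sg=0 selects the CBOW objective (predict the current word from its context);
# sg=1 would select the continuous skip-gram objective instead.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

king_vector = model.wv["king"]                      # learned 100-dimensional vector
neighbours = model.wv.most_similar("king", topn=3)  # nearest words in the vector space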

3. GLOVE
The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the
word2vec method for efficiently learning word vectors, developed by Pennington, et al. at
Stanford.

Classical vector space model representations of words were developed using matrix
factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using
global text statistics but are not as good as the learned methods like word2vec at capturing
meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen
example above).

GloVe is an approach to marry both the global statistics of matrix factorization techniques
like LSA with the local context-based learning in word2vec.

Rather than using a window to define local context, GloVe constructs an explicit word-
context or word co-occurrence matrix using statistics across the whole text corpus. The
result is a learning model that may result in generally better word embeddings.

GloVe is a new global log-bilinear regression model for the unsupervised learning of word
representations that outperforms other models on word analogy, word similarity, and named
entity recognition tasks.
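Training GloVe itself requires the Stanford tooling and a co-occurrence matrix, which is beyond this report; the sketch below only shows the common downstream step of reading a pre-trained GloVe file (assumed to be glove.6B.100d.txt, already downloaded) into a word-to-vector dictionary.

import numpy as np

# Each line of a GloVe file is a word followed by its vector components.
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype="float32")

print(f"Loaded {len(embeddings_index)} word vectors.")
print(embeddings_index["king"][:5])  # first five dimensions of the vector for "king"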

USING WORD EMBEDDINGS


You have some options when it comes time to use word embeddings on your natural
language processing project.

This section outlines those options.

1. LEARN AN EMBEDDING
You may choose to learn a word embedding for your problem.

This will require a large amount of text data to ensure that useful embeddings are learned,
such as millions or billions of words.

You have two main options when training your word embedding:

1. Learn it Standalone, where a model is trained to learn the embedding, which is saved and
used as a part of another model for your task later. This is a good approach if you would like
to use the same embedding in multiple models.
2. Learn Jointly, where the embedding is learned as part of a large task-specific model. This
is a good approach if you only intend to use the embedding on one task.
2. REUSE AN EMBEDDING
It is common for researchers to make pre-trained word embeddings available for free, often
under a permissive license so that you can use them on your own academic or commercial
projects.
For example, both word2vec and GloVe word embeddings are available for free download.

These can be used on your project instead of training your own embeddings from scratch.

You have two main options when it comes to using pre-trained embeddings:

1. Static, where the embedding is kept static and is used as a component of your model. This
is a suitable approach if the embedding is a good fit for your problem and gives good
results.
2. Updated, where the pre-trained embedding is used to seed the model, but the embedding
is updated jointly during the training of the model. This may be a good option if you are
looking to get the most out of the model and embedding on your task. Both options are
sketched in the code below.
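A minimal Keras sketch of the two options, assuming an embedding matrix of pre-trained vectors has already been assembled (for example from the GloVe dictionary above); the sizes here are placeholders.

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10_000, 100
# Placeholder: in practice this array would be filled row by row with pre-trained vectors.
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")

# Option 1 - Static: the pre-trained weights are frozen and never updated.
static_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)

# Option 2 - Updated: the pre-trained weights seed the layer and are fine-tuned
# jointly with the rest of the model during training.
updated_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True)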

CODE:

SETUP

DOWNLOAD THE IMDB DATASET


You will use the Large Movie Review Dataset throughout this tutorial. You will train a
sentiment classifier model on this dataset and in the process learn embeddings from
scratch. To read more about loading a dataset from scratch, see the Loading text
tutorial.
Download the dataset using Keras file utility and take a look at the directories.
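The report's code listings appear as screenshots, so the following is a plausible sketch of this step, assuming the standard aclImdb_v1.tar.gz archive hosted on the Stanford AI site.

import os
import tensorflow as tf

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract the archive into the current directory.
dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz", url, untar=True, cache_dir=".", cache_subdir="")

dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
print(os.listdir(dataset_dir))                          # e.g. ['train', 'test', ...]
print(os.listdir(os.path.join(dataset_dir, "train")))   # 'pos', 'neg', 'unsup', ...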

Take a look at a few movie reviews and their labels `(1: positive, 0: negative)` from the train
dataset.
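A sketch of building the training and validation splits and printing a few examples; the batch size and seed are illustrative.

import tensorflow as tf

batch_size = 1024
seed = 123

# Build labelled datasets from the extracted folder (the unlabelled "unsup"
# sub-directory should be deleted first so only 'pos' and 'neg' classes remain).
train_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size,
    validation_split=0.2, subset="training", seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size,
    validation_split=0.2, subset="validation", seed=seed)

# Print a few reviews with their labels (1: positive, 0: negative).
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(label_batch[i].numpy(), text_batch.numpy()[i][:80])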

CONFIGURE THE DATASET FOR PERFORMANCE


There are two important methods you should use when loading data to make sure that
I/O does not become blocking.
.cache() keeps data in memory after it's loaded off disk. This will ensure the dataset
does not become a bottleneck while training your model. If your dataset is too large to fit
into memory, you can also use this method to create a performant on-disk cache, which
is more efficient to read than many small files.
.prefetch() overlaps data preprocessing and model execution while training.
You can learn more about both methods, as well as how to cache data to disk in
the data performance guide.
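Continuing from the datasets built above, a minimal sketch of applying both methods:

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Cache the decoded examples and overlap preprocessing with model execution.
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)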

USING THE EMBEDDING LAYER
Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer
indices (which stand for specific words) to dense vectors (their embeddings). The
dimensionality (or width) of the embedding is a parameter you can experiment with to
see what works well for your problem, much in the same way you would experiment
with the number of neurons in a Dense layer.

When you create an Embedding layer, the weights for the embedding are randomly
initialized (just like any other layer). During training, they are gradually adjusted via
backpropagation. Once trained, the learned word embeddings will roughly encode
similarities between words (as they were learned for the specific problem your model is
trained on).
If you pass an integer to an embedding layer, the result replaces each integer with the
vector from the embedding table:
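For example (with illustrative sizes), a layer that embeds a 1,000-word vocabulary into 5 dimensions returns the corresponding table rows when given integer indices:

import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(1000, 5)

# Each integer index is replaced by its 5-dimensional embedding vector.
result = embedding_layer(tf.constant([1, 2, 3]))
print(result.numpy().shape)   # (3, 5)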

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of
shape (samples, sequence_length), where each entry is a sequence of integers. It
can embed sequences of variable lengths. You could feed into the embedding layer
above batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64,
15) (batch of 64 sequences of length 15).
The returned tensor has one more axis than the input; the embedding vectors are
aligned along the new last axis. Pass it a (2, 3) input batch and the output is
(2, 3, N).
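Continuing with the same illustrative embedding_layer:

# A (2, 3) batch of indices yields a (2, 3, 5) tensor: each index is replaced by
# its 5-dimensional vector along a new last axis.
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
print(result.shape)   # (2, 3, 5)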

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of
shape (samples, sequence_length, embedding_dimensionality). To convert from this
sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an
RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the
simplest. The Text Classification with an RNN tutorial is a good next step.

TEXT PREPROCESSING
Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a
TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more
about using this layer in the Text Classification tutorial.
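A sketch of such a layer, with an assumed standardization step that lowercases the reviews, strips the "<br />" tags found in the IMDB text, and removes punctuation; the vocabulary size and sequence length are illustrative.

import re
import string
import tensorflow as tf

def custom_standardization(input_data):
    # Lowercase, strip HTML line breaks, then drop punctuation.
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), "")

vocab_size = 10_000
sequence_length = 100

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length)

# Build the vocabulary from the review text only (labels dropped), using the
# training dataset created earlier.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)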

CREATE A CLASSIFICATION MODEL


Use the Keras Sequential API to define the sentiment classification model. In this case it
is a "continuous bag of words" style model; a sketch of the model follows the layer
descriptions below.
 The TextVectorization layer transforms strings into vocabulary indices. You
have already initialized vectorize_layer as a TextVectorization layer and built
its vocabulary by calling adapt on text_ds. Now vectorize_layer can be used as
the first layer of your end-to-end classification model, feeding transformed strings
into the Embedding layer.
 The Embedding layer takes the integer-encoded vocabulary and looks up the
embedding vector for each word-index. These vectors are learned as the model
trains. The vectors add a dimension to the output array. The resulting dimensions
are: (batch, sequence, embedding).
 The GlobalAveragePooling1D layer returns a fixed-length output vector for
each example by averaging over the sequence dimension. This allows the model
to handle input of variable length, in the simplest way possible.
 The fixed-length output vector is piped through a fully-connected (Dense) layer
with 16 hidden units.
 The last layer is densely connected with a single output node.
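A sketch of the model described above, reusing vocab_size and vectorize_layer from the preprocessing step; the embedding width of 16 is an illustrative choice.

import tensorflow as tf

embedding_dim = 16

model = tf.keras.Sequential([
    vectorize_layer,                                   # strings -> integer indices
    tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    tf.keras.layers.GlobalAveragePooling1D(),          # average over the sequence axis
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),                          # single logit: positive vs negative
])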
Caution: This model doesn't use masking, so the zero-padding is used as part of the
input and hence the padding length may affect the output. To fix this, see the masking
and padding guide.

COMPILE AND TRAIN THE MODEL


You will use TensorBoard to visualize metrics including loss and accuracy. Create
a tf.keras.callbacks.TensorBoard.
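A sketch of the callback, compilation, and training loop; the log directory and number of epochs are assumptions.

import tensorflow as tf

# Write metrics to a log directory that TensorBoard can read.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"])

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])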

With this approach the model reaches a validation accuracy of around 78%
(note that the model is overfitting, since training accuracy is higher).
Note: Your results may be a bit different, depending on how weights were
randomly initialized before training the embedding layer.
You can look into the model summary to learn more about each layer of the model.
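For example:

model.summary()   # per-layer output shapes and parameter counts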

Finally, visualize the model metrics in TensorBoard.
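In a notebook, the TensorBoard extension can display the logged metrics inline; the commands below assume the "logs" directory used by the callback above.

# Run in a notebook cell:
#   %load_ext tensorboard
#   %tensorboard --logdir logs
#
# Or start the dashboard from a shell:
#   tensorboard --logdir logs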

The TensorBoard dashboard displays graphs of the training and validation accuracy and loss over the epochs.
