WORD EMBEDDING Project
A seminar report submitted in partial fulfillment of the requirements for the award of the
degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING (AI&ML)
BY
BOYINA.UTKARSHA (20JG1A4208)
&
KAIRAM.SOUMYA (20JG1A4224)
DR.L.GREESHMA
(Assistant Professor)
WHAT ARE WORD EMBEDDINGS FOR TEXT?
Word embeddings are a type of word representation that allows words with similar meaning
to have a similar representation.
They are a distributed representation for text that is perhaps one of the key breakthroughs
for the impressive performance of deep learning methods on challenging natural language
processing problems.
After completing this project, you will know:
What the word embedding approach for representing text is and how it differs from other
feature extraction methods.
That there are 3 main algorithms for learning a word embedding from text data.
That you can either train a new embedding or use a pre-trained embedding on your natural
language processing task.
In natural language processing (NLP), word embedding is a term used for the representation of words
for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such
that the words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using a set of language modeling and feature
learning techniques where words or phrases from the vocabulary are mapped to vectors of real
numbers. Conceptually, it involves a mathematical embedding from a space with many dimensions per
word to a continuous vector space with a much lower dimension.
Methods to generate this mapping include neural networks,[2] dimensionality reduction on the
word co-occurrence matrix, probabilistic models, explainable knowledge-base methods,[7] and
explicit representation in terms of the context in which words appear.
Machine learning models take vectors (arrays of numbers) as input. When working with
text, the first thing you must do is come up with a strategy to convert strings to numbers
(or to "vectorize" the text) before feeding it to the model. In this section, you will look at
three strategies for doing so.
One-hot encodings
As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the
sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is
(cat, mat, on, sat, the). To represent each word, you will create a zero vector with length
equal to the vocabulary, then place a one in the index that corresponds to the word.
This approach is illustrated in the sketch below.
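As a concrete illustration, here is a minimal one-hot encoding sketch for that sentence (plain Python/NumPy; the variable names are illustrative, not from the original code):

```python
import numpy as np

sentence = "the cat sat on the mat"
vocab = sorted(set(sentence.split()))            # ['cat', 'mat', 'on', 'sat', 'the']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Zero vector of vocabulary length with a 1 at the word's index."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))   # [1 0 0 0 0]
print(one_hot("sat"))   # [0 0 0 1 0]
```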
In contrast to such sparse, hand-crafted encodings, word embeddings are learned from data. The
learning process is either joint with the neural network model on some task, such as
document classification, or is an unsupervised process, using document statistics.
This section reviews three techniques that can be used to learn a word embedding from text
data.
1. EMBEDDING LAYER
An embedding layer, for lack of a better name, is a word embedding that is learned jointly
with a neural network model on a specific natural language processing task, such
as language modeling or document classification.
It requires that document text be cleaned and prepared such that each word is one-hot
encoded. The size of the vector space is specified as part of the model, such as 50, 100, or
300 dimensions. The vectors are initialized with small random numbers. The embedding
layer is used on the front end of a neural network and is fit in a supervised way using the
Backpropagation algorithm.
… when the input to a neural network contains symbolic categorical features (e.g. features
that take one of k distinct symbols, such as words from a closed vocabulary), it is common
to associate each possible feature value (i.e., each word in the vocabulary) with a d-
dimensional vector for some d. These vectors are then considered parameters of the model,
and are trained jointly with the other parameters.
This approach of learning an embedding layer requires a lot of training data and can be
slow, but will learn an embedding both targeted to the specific text data and the NLP task.
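As a minimal sketch of this idea in Keras (the vocabulary size and dimensionality below are assumed example values; a full worked example appears in the CODE section later in this report):

```python
import tensorflow as tf

vocab_size = 10000     # assumed size of the integer-encoded vocabulary
embedding_dim = 100    # dimensionality of the vector space (e.g. 50, 100, or 300)

# The layer's weight matrix has shape (vocab_size, embedding_dim), starts from
# small random values, and is updated by backpropagation together with the rest
# of whatever network it is placed at the front of.
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
```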
2. WORD2VEC
Word2Vec is a statistical method for efficiently learning a standalone word embedding from
a text corpus.
It was developed by Tomas Mikolov, et al. at Google in 2013 as a response to make the
neural-network-based training of the embedding more efficient and since then has become
the de facto standard for developing pre-trained word embeddings.
Additionally, the work involved analysis of the learned vectors and the exploration of vector
math on the representations of words. For example, subtracting the “man-ness” from
“King” and adding “woman-ness” results in the word “Queen”, capturing the analogy
“king is to queen as man is to woman”.
We find that these representations are surprisingly good at capturing syntactic and semantic
regularities in language, and that each relationship is characterized by a relation-specific
vector offset. This allows vector-oriented reasoning based on the offsets between words.
For example, the male/female relationship is automatically learned, and with the induced
vector representations, “King – Man + Woman” results in a vector very close to “Queen.”
— Linguistic Regularities in Continuous Space Word Representations, 2013.
Two different learning models were introduced that can be used as part of the word2vec
approach to learn the word embedding; they are:
1. The Continuous Bag-of-Words (CBOW) model, which learns the embedding by predicting the current word based on its surrounding context.
2. The continuous skip-gram model, which learns the embedding by predicting the surrounding words given a current word.
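As a hedged illustration (not part of the original report), the gensim library can train such a model on a tokenized corpus; the toy sentences below are made up, and a real corpus would need far more text to learn meaningful vectors:

```python
from gensim.models import Word2Vec

# Tiny made-up corpus purely for illustration.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the skip-gram model; sg=0 would select CBOW (gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"].shape)   # (50,)
# On a real corpus this query approximates the "king - man + woman" analogy;
# on this toy corpus the result is not meaningful.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```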
3. GLOVE
The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the
word2vec method for efficiently learning word vectors, developed by Pennington, et al. at
Stanford.
Classical vector space model representations of words were developed using matrix
factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using
global text statistics but are not as good as the learned methods like word2vec at capturing
meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen
example above).
GloVe is an approach to marry both the global statistics of matrix factorization techniques
like LSA with the local context-based learning in word2vec.
Rather than using a window to define local context, GloVe constructs an explicit word-
context or word co-occurrence matrix using statistics across the whole text corpus. The
result is a learning model that may result in generally better word embeddings.
GloVe is a new global log-bilinear regression model for the unsupervised learning of word
representations that outperforms other models on word analogy, word similarity, and named
entity recognition tasks.
— GloVe: Global Vectors for Word Representation, 2014.
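Pre-trained GloVe vectors are distributed as plain-text files, one word per line followed by its vector components. A minimal sketch of loading them into a Python dictionary, assuming the publicly available 100-dimensional glove.6B.100d.txt file has already been downloaded to the working directory:

```python
import numpy as np

embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = vector

print(len(embeddings_index))          # roughly 400,000 entries
print(embeddings_index["king"][:5])   # first 5 of 100 dimensions
```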
1. LEARN AN EMBEDDING
You may choose to learn a word embedding for your problem.
This will require a large amount of text data to ensure that useful embeddings are learned,
such as millions or billions of words.
You have two main options when training your word embedding:
1. Learn it Standalone, where a model is trained to learn the embedding, which is saved and
used as a part of another model for your task later. This is a good approach if you would like
to use the same embedding in multiple models.
2. Learn Jointly, where the embedding is learned as part of a large task-specific model. This
is a good approach if you only intend to use the embedding on one task.
2. REUSE AN EMBEDDING
It is common for researchers to make pre-trained word embeddings available for free, often
under a permissive license so that you can use them on your own academic or commercial
projects.
For example, both word2vec and GloVe word embeddings are available for free download.
These can be used on your project instead of training your own embeddings from scratch.
You have two main options when it comes to using pre-trained embeddings:
1. Static, where the embedding is kept static and is used as a component of your model. This
is a suitable approach if the embedding is a good fit for your problem and gives good
results.
2. Updated, where the pre-trained embedding is used to seed the model, but the embedding
is updated jointly during the training of the model. This may be a good option if you are
looking to get the most out of the model and embedding on your task.
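A sketch of the two options in Keras; embedding_matrix stands in for a (vocab_size, embedding_dim) array of pre-trained vectors (here filled with random values purely so the snippet runs):

```python
import numpy as np
import tensorflow as tf

# Placeholder for pre-trained weights; in practice this matrix would be built
# from the GloVe or word2vec vectors loaded earlier.
vocab_size, embedding_dim = 10000, 100
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")

# Option 1 - Static: the pre-trained weights are frozen during training.
static_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)

# Option 2 - Updated: the pre-trained weights seed the layer and are
# fine-tuned jointly with the rest of the model.
updated_embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True)
```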
CODE:
## Setup
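The code listings in the original report were included as images. The following is a minimal sketch of comparable setup code, assuming TensorFlow 2.9+ and the publicly available IMDB movie review archive used by the Keras word-embedding tutorial:

```python
import os
import shutil
import tensorflow as tf

# Download and extract the IMDB movie review dataset.
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
archive = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir=".", cache_subdir="")
dataset_dir = os.path.join(os.path.dirname(archive), "aclImdb")
train_dir = os.path.join(dataset_dir, "train")

# The extracted train/ folder contains an extra 'unsup' directory that is not
# needed for binary sentiment classification.
shutil.rmtree(os.path.join(train_dir, "unsup"), ignore_errors=True)

# Build batched training and validation datasets from the directory structure.
batch_size = 1024
seed = 123
train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset="training", seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset="validation", seed=seed)
```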
Take a look at a few movie reviews and their labels `(1: positive, 0: negative)` from the train
dataset.
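Assuming the train_ds dataset built in the setup sketch above:

```python
# Print the labels and the first characters of a few reviews from one batch.
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(label_batch[i].numpy(), text_batch.numpy()[i][:80])
```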
USING THE EMBEDDING LAYER
Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer
indices (which stand for specific words) to dense vectors (their embeddings). The
dimensionality (or width) of the embedding is a parameter you can experiment with to
see what works well for your problem, much in the same way you would experiment
with the number of neurons in a Dense layer.
When you create an Embedding layer, the weights for the embedding are randomly
initialized (just like any other layer). During training, they are gradually adjusted via
backpropagation. Once trained, the learned word embeddings will roughly encode
similarities between words (as they were learned for the specific problem your model is
trained on).
If you pass an integer to an embedding layer, the result replaces each integer with the
vector from the embedding table:
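For example (a small layer with a 1,000-word vocabulary and 5-dimensional embeddings, chosen here purely for illustration):

```python
import tensorflow as tf

# 1,000-word vocabulary embedded into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

# Each integer index is replaced by its 5-dimensional embedding vector.
result = embedding_layer(tf.constant([1, 2, 3]))
print(result.numpy().shape)   # (3, 5)
```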
For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of
shape (samples, sequence_length), where each entry is a sequence of integers. It
can embed sequences of variable lengths. You could feed into the embedding layer
above batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64,
15) (batch of 64 sequences of length 15).
The returned tensor has one more axis than the input; the embedding vectors are
aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N).
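Continuing the sketch above (where N is 5, the embedding dimensionality of the example layer):

```python
# A (2, 3) batch of integer sequences yields a (2, 3, 5) tensor: one
# embedding vector per token, aligned along the new last axis.
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
print(result.shape)   # (2, 3, 5)
```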
When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of
shape (samples, sequence_length, embedding_dimensionality). To convert from this
sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an
RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the
simplest. The Text Classification with an RNN tutorial is a good next step.
## Text preprocessing
Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a
TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more
about using this layer in the Text Classification tutorial.
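A sketch of this preprocessing step, continuing from the setup code above; the vocabulary size, sequence length, and HTML-stripping standardization are assumed choices:

```python
import re
import string
import tensorflow as tf

def custom_standardization(input_data):
    """Lowercase the text and strip HTML break tags and punctuation."""
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), "")

vocab_size = 10000
sequence_length = 100

# Map raw review strings to padded sequences of integer token indices.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length)

# Build the vocabulary from the training text only (labels are dropped).
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
```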
## Create a classification model

Use the Keras Sequential API to define the sentiment classification model:

The Embedding layer takes the integer-encoded vocabulary and looks up the embedding
vector for each word index. These vectors are learned as the model trains. The vectors
add a dimension to the output array. The resulting dimensions
are: (batch, sequence, embedding).
The GlobalAveragePooling1D layer returns a fixed-length output vector for
each example by averaging over the sequence dimension. This allows the model
to handle input of variable length, in the simplest way possible.
The fixed-length output vector is piped through a fully-connected (Dense) layer
with 16 hidden units.
The last layer is densely connected with a single output node.
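A sketch of a Sequential model matching this description, continuing from the preprocessing code above (embedding_dim = 16 is an assumed example value):

```python
embedding_dim = 16

model = tf.keras.Sequential([
    vectorize_layer,                            # raw strings -> integer indices
    tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    tf.keras.layers.GlobalAveragePooling1D(),   # average over the sequence axis
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),                   # single output node (a logit)
])
```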
Caution: This model doesn't use masking, so the zero-padding is used as part of the
input and hence the padding length may affect the output. To fix this, see the masking
and padding guide.
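A sketch of compiling and training this model with a TensorBoard callback so the metrics can be visualized later (the number of epochs and the log directory are assumed values):

```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=15,
          callbacks=[tensorboard_callback])
```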
With this approach the model reaches a validation accuracy of around 78% (note that the
model is overfitting, since the training accuracy is higher).
Note: Your results may be a bit different, depending on how weights were
randomly initialized before training the embedding layer.
You can look into the model summary to learn more about each layer of the model.
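For example:

```python
# Prints each layer's name, output shape, and number of trainable parameters.
model.summary()
```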
Finally, visualize the model metrics in TensorBoard.
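In a Jupyter or Colab notebook this can be done with the TensorBoard magic commands, pointed at the log directory written by the callback above (these are notebook commands, not plain Python):

```
%load_ext tensorboard
%tensorboard --logdir logs
```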
TensorBoard then displays the training and validation loss and accuracy curves recorded during training.