Unit 2 Newml
The Bag of Words (BoW) model is the simplest way of representing text as numbers. As the name suggests, we represent a sentence as a bag-of-words vector (a string of numbers).
We will first build a vocabulary from all the unique words in the above
three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’,
‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
We can now take each of these words and mark their occurrence in the
three movie reviews above with 1s and 0s. This will give us 3 vectors for 3
reviews:
Vector-of-Review1: [1 1 1 1 1 1 1 0 0 0 0]
Vector-of-Review2: [1 1 2 0 0 1 1 0 1 0 0]
Vector-of-Review3: [1 1 1 0 0 0 1 0 0 1 1]
And that’s the core idea behind a Bag of Words (BoW) model.
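To make the idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer. The three review strings are assumed example texts chosen to match the vocabulary above, not necessarily the exact reviews used in this unit; note that CountVectorizer lowercases tokens by default.

# A minimal sketch of the Bag of Words idea using scikit-learn's CountVectorizer.
# The three reviews are assumed example texts consistent with the vocabulary above.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",    # assumed Review 1
    "This movie is not scary and is slow",  # assumed Review 2
    "This movie is spooky and good",        # assumed Review 3
]

vectorizer = CountVectorizer()          # lowercases tokens by default
bow = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one count vector per review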
1. If new sentences contain new words, our vocabulary size would increase, and thereby the length of the vectors would increase too.
We all love watching movies (to varying degrees). I tend to always look at
the reviews of a movie before I commit to watching it. I know a lot of you
do the same! So, I’ll use this example here.
You can see that there are some contrasting reviews about the movie, as well as about its length and pace. Imagine looking at a thousand reviews like these. Clearly, there are a lot of interesting insights we can draw from them and build upon to gauge how well the movie performed.
Word embedding is one such technique in which we represent text using vectors. The most popular forms of word embeddings are:
Let’s first put a formal definition around TF-IDF. Here’s how Wikipedia puts it:
“TF-IDF (term frequency-inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
We will again use the same vocabulary we had built in the Bag-of-Words
model to show how to calculate the TF for Review #2:
Here, the TF of a term in a review is the number of times the term appears in that review divided by the total number of terms in the review. Review 2 contains 8 terms, so, for example:
TF(‘this’) = (number of times ‘this’ appears in Review 2) / (number of terms in Review 2) = 1/8
Similarly,
TF(‘movie’) = 1/8
TF(‘very’) = 0/8 = 0
TF(‘scary’) = 1/8
TF(‘and’) = 1/8
TF(‘long’) = 0/8 = 0
TF(‘not’) = 1/8
TF(‘slow’) = 1/8
TF(‘good’) = 0/8 = 0
We can calculate the term frequencies for all the terms and all the reviews
in this manner:
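As a rough illustration of this calculation, the short Python sketch below computes TF values for a review by dividing each term's count by the total number of terms. The text assumed for Review 2 is a reconstruction consistent with the 1/8 values above.

# A small sketch of the TF calculation described above:
# TF(term, review) = (count of term in review) / (total terms in review).
from collections import Counter

def term_frequencies(review: str) -> dict:
    tokens = review.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Assumed text for Review 2 (8 terms), consistent with the 1/8 values above.
review_2 = "This movie is not scary and is slow"
print(term_frequencies(review_2))
# e.g. {'this': 0.125, 'movie': 0.125, 'is': 0.25, 'not': 0.125, ...}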
IDF is a measure of how important a term is. We need the IDF value because computing the TF alone is not sufficient to understand the importance of words:
IDF(term) = log(number of documents / number of documents containing the term)
We can calculate the IDF values for all the words in Review 2. For example:
IDF(‘this’) = log(3/3) = 0
Similarly,
IDF(‘movie’) = log(3/3) = 0
IDF(‘is’) = log(3/3) = 0
IDF(‘and’) = log(3/3) = 0
We can calculate the IDF values for each word like this. Thus, the IDF
values for the entire vocabulary would be:
Hence, we see that words like “is”, “this”, “and”, etc. have their IDF reduced to 0 and carry little importance, while words like “scary”, “long”, “good”, etc. are words with more importance and thus have a higher value.
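A small sketch of the IDF calculation, assuming the same three example reviews as before and a base-10 logarithm (which matches the log(3/3) = 0 values above):

# A sketch of the IDF calculation used above: IDF(t) = log10(N / n_t),
# where N is the number of reviews and n_t is how many reviews contain t.
import math

reviews = [
    "This movie is very scary and long",    # assumed Review 1
    "This movie is not scary and is slow",  # assumed Review 2
    "This movie is spooky and good",        # assumed Review 3
]
tokenized = [set(r.lower().split()) for r in reviews]
vocabulary = set().union(*tokenized)

idf = {
    term: math.log10(len(reviews) / sum(term in doc for doc in tokenized))
    for term in vocabulary
}
print(idf)  # e.g. idf['this'] == 0.0, idf['scary'] == log10(3/2), roughly 0.18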
We can now compute the TF-IDF score for each word in the corpus. Words with a higher score are more important, and those with a lower score are less important:
TF-IDF(term, document) = TF(term, document) × IDF(term)
We can now calculate the TF-IDF score for every word in Review 2. For example:
TF-IDF(‘this’) = TF(‘this’) × IDF(‘this’) = 1/8 × 0 = 0
Similarly, we can compute the score for each of the remaining words in the review.
Similarly, we can calculate the TF-IDF scores for all the words with respect
to all the reviews:
We have now obtained the TF-IDF scores for our vocabulary. TF-IDF gives larger values to less frequent words, and the score is high when both the IDF and TF values are high, i.e., the word is rare across all the documents combined but frequent within a single document.
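For comparison, here is a sketch using scikit-learn's TfidfVectorizer on the same assumed reviews. Note that TfidfVectorizer applies a smoothed, natural-log IDF and L2-normalizes each row by default, so its numbers will differ slightly from the hand calculation above.

# A sketch of computing TF-IDF scores with scikit-learn. TfidfVectorizer uses a
# smoothed, natural-log IDF and L2-normalises each row by default, so its numbers
# differ slightly from the hand calculation above.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "This movie is very scary and long",    # assumed Review 1
    "This movie is not scary and is slow",  # assumed Review 2
    "This movie is spooky and good",        # assumed Review 3
]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(reviews)

print(tfidf.get_feature_names_out())
print(scores.toarray().round(2))  # rare-but-frequent words get the highest scores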
These models work using context: to learn an embedding, they look at nearby words. If a group of words is always found close to the same words, those words will end up having similar embeddings.
To decide which words count as similar or close to each other, we first fix the window size, which determines which nearby words we want to pick. For example, a window size of 2 implies that for every word, we’ll pick the 2 words before it and the 2 words after it. Let’s see the following example:
With the help of the above table, we can see the word pairs constructed with this method. The highlighted word denotes the word for which we want to find pairs. Here, we don’t care about the distance between words within the window: as long as words are inside the window, we don’t differentiate between words that are 1 word away or more.
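A minimal sketch of how such (center word, context word) pairs can be generated with a window size of 2; the sentence used is a made-up example.

# A sketch of building (centre word, context word) training pairs with a
# window size of 2, as described above. The sentence is a made-up example.
def context_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "we all love watching movies".split()
for center, context in context_pairs(sentence, window=2):
    print(center, "->", context)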
Step 4: Finally, we extract the weights from the hidden layer and use these weights to encode the meaning of the words in the vocabulary.
Word2Vec provides two model architectures:
1. CBOW (Continuous Bag of Words)
2. Skip-Gram
Both of these models are basically shallow neural networks that map word(s) to a target variable which is also a word (or words). These techniques learn weights that act as the word vector representations. Both techniques can be used to implement word embeddings with word2vec.
Most NLP systems treat words as atomic units. Existing systems with the same purpose as word2vec have the disadvantage that there is no notion of similarity between words; they also only work well on small, simple datasets of at most a few billion tokens.
So, in order to train on larger datasets with more complex models, these techniques use a neural network architecture that can be trained on huge datasets with billions of words and vocabularies of millions of words.
For example:
Vector(“King”) - Vector(“Man”) + Vector(“Woman”) ≈ Vector(“Queen”)
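This analogy can be reproduced, for example, with gensim and pretrained Word2Vec vectors; the snippet below is only an illustration, and the 'word2vec-google-news-300' vectors are a large download.

# A sketch of the famous analogy using gensim and pretrained Word2Vec vectors.
# The 'word2vec-google-news-300' vectors are a large download.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# vector("king") - vector("man") + vector("woman") is closest to "queen"
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', ...)]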
The two newly proposed Word2Vec models, i.e., CBOW and Skip-Gram, use a distributed architecture that tries to minimize the computational complexity.
1. First, both the input layer and the target are one-hot encoded vectors of size [1 X V].
2. In this model, we have two sets of weights: one between the input and the hidden layer, and the second between the hidden and the output layer.
8. So, the weights between the hidden layer and the output layer are taken as the word vector representation of the word.
So, the input layer will have 3 [1 X V] vectors and the output layer will have 1 [1 X V] vector. The rest of the architecture is the same as for a 1-context CBOW.
The above-mentioned steps remain the same, but the only thing that changes is the calculation of the hidden activation. Here, instead of just sending the corresponding row of the input-hidden weight matrix to the hidden layer, an average is taken over all the rows corresponding to the context words. We can understand this with the above figure. Therefore, the calculated average vector becomes the hidden activation.
So, if we have three context words for a single target word, we will have three initial hidden activations, which are then averaged element-wise to obtain the final activation.
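The following NumPy sketch illustrates just this averaging step, with made-up sizes and weights; it is not a full CBOW implementation.

# A numpy sketch of the CBOW hidden activation described above: with three
# context words, the hidden layer is the element-wise average of the three
# corresponding rows of the input-to-hidden weight matrix. Sizes are made up.
import numpy as np

V, N = 10, 4                       # vocabulary size, embedding size (assumed)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))     # input-to-hidden weights, shape [V, N]

context_ids = [2, 5, 7]            # indices of the three context words
hidden = W_in[context_ids].mean(axis=0)   # average of the selected rows, shape [N]

print(hidden)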
Advantages of CBOW:
Disadvantages of CBOW:
2. If we want to train a CBOW model from scratch, it can take forever if it is not properly optimized.
Skip-Gram
3. Then, we compute the two separate errors with respect to the two target variables and add the two error vectors element-wise to obtain a final error vector, which is propagated back to update the weights until our objective is met.
4. Finally, the weights between the input and the hidden layer are taken as the word vector representation after training. The loss function (objective) is of the same type as in the CBOW model.
For the above matrix, the sizes of different layers are as follows:
2. The yellow matrix is the weights present between the hidden layer
and the output layer.
4. Then, we convert each row of the blue matrix into its softmax probabilities individually, as shown in the green box.
5. Here, the grey matrix contains the one-hot encoded vectors of the two context words, i.e., the targets.
7. The element-wise sum is taken over all the error vectors to obtain a final error vector.
8. Finally, the calculated error vector is backpropagated to adjust the weights (a rough sketch of these steps follows this list).
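Here is a rough NumPy sketch of the forward pass and error summation for one center word and two context words, with made-up sizes and weights; it only illustrates the steps listed above, not a full training loop.

# A numpy sketch of the skip-gram forward pass and error summation described
# above, for one centre word and two context (target) words. Sizes are made up.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V, N = 10, 4                          # vocabulary size, embedding size (assumed)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))        # input-to-hidden weights
W_out = rng.normal(size=(N, V))       # hidden-to-output weights (the "yellow" matrix)

center_id = 3
context_ids = [2, 5]                  # the two context words (targets)

hidden = W_in[center_id]              # hidden activation for the centre word
scores = hidden @ W_out               # scores over the vocabulary
probs = softmax(scores)               # softmax probabilities (the "green" box)

# One error vector per context word: predicted probabilities minus one-hot target
errors = []
for cid in context_ids:
    one_hot = np.zeros(V)
    one_hot[cid] = 1.0
    errors.append(probs - one_hot)

total_error = np.sum(errors, axis=0)  # element-wise sum, backpropagated to update weights
print(total_error)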
1. The Skip-gram model can capture two semantics for a single word, i.e., two vector representations for the word “Apple”: one for the company and the other for the fruit.
Now that we have a broad idea of both the models involved in the
Word2Vec Technique, which one is better? Of course, which model we
choose from the above two largely depends on the problem statement
we’re trying to solve.
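In practice, both variants can be trained with a library such as gensim, where (in gensim 4.x) the sg parameter selects the architecture: sg=0 for CBOW and sg=1 for Skip-gram. The tiny corpus below is only for illustration.

# A sketch of training both variants with gensim: sg=0 selects CBOW,
# sg=1 selects Skip-gram. The tiny corpus here is only for illustration.
from gensim.models import Word2Vec

corpus = [
    ["this", "movie", "is", "very", "scary", "and", "long"],
    ["this", "movie", "is", "not", "scary", "and", "is", "slow"],
    ["this", "movie", "is", "spooky", "and", "good"],
]

cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["movie"][:5])                  # part of a learned 50-dimensional vector
print(skipgram_model.wv.most_similar("scary", topn=3))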
Lines in 2D Space
Planes in 3D Space
In vector form, a plane can be written as w · x + b = 0, where (w) is a vector normal to the plane and (x) represents points on the plane. This equation highlights the dot product’s role in determining the perpendicular distance from the origin to the plane, encapsulating the plane’s orientation in space.
Hyperplanes in N-Dimensional Space
Through this, we can see how lines, planes, and hyperplanes form the
geometric backbone of many machine learning algorithms. By framing
these constructs in vector notation and highlighting the dot product’s role,
we gain a deeper understanding of how algorithms like SVMs carve out
decision boundaries in multidimensional datasets.
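As a small illustration of these ideas, the NumPy sketch below computes the signed perpendicular distance of a few made-up points from a hyperplane w · x + b = 0 and uses the sign to decide which side of the boundary each point falls on, in the spirit of an SVM decision rule.

# A numpy sketch of how a hyperplane w.x + b = 0 acts as a decision boundary,
# in the SVM spirit described above. The weights and points are made up.
import numpy as np

w = np.array([2.0, -1.0, 0.5])    # normal vector to the hyperplane (assumed)
b = -1.0                          # offset (assumed)

points = np.array([
    [1.0, 0.0, 2.0],
    [-1.0, 2.0, 0.0],
])

# Signed perpendicular distance of each point from the hyperplane
distances = (points @ w + b) / np.linalg.norm(w)
labels = np.sign(points @ w + b)  # which side of the boundary each point falls on

print(distances, labels)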
The projection of a point onto the normal vector (w) is a concept that finds use in evaluating the performance of models, particularly in regression and classification tasks.
With these concepts in mind, let’s explore how they are applied in
machine learning to extract meaningful insights from data and improve
algorithm performance.
Working with projections and distances in this way is not just a geometric exercise but a practical tool for refining the accuracy of our models. This principle aids in optimizing algorithms by ensuring they can quantify and minimize errors effectively.
Data Cleaning
What is scaling?
Feature scaling is one of the most important transformations we need to apply to our data. Machine learning algorithms (mostly regression algorithms) don't perform well when the numerical inputs are on different scales.
When different features are on different scales, applying scaling converts all the features to the same scale. Suppose we have two features, where one feature is measured on a scale from 1 to 10 and the second feature is measured on a scale from 1 to 10,000. If we calculate the mean squared error, the algorithm will mostly be busy optimizing the weights corresponding to the second feature instead of both features. The same applies when the algorithm uses distance calculations such as Euclidean or Manhattan distance: the second feature will dominate the result. So, if we scale the features, the algorithm will give equal priority to both features.
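A quick NumPy illustration of this effect, using made-up values: the feature on the larger scale dominates the Euclidean distance until both features are rescaled.

# A small numpy illustration of the point above: when one feature is on a much
# larger scale, it dominates Euclidean distances until we rescale. Values are made up.
import numpy as np

# Two samples: feature 1 on a 1-10 scale, feature 2 on a 1-10,000 scale
a = np.array([2.0, 3000.0])
b = np.array([9.0, 3050.0])

print(np.linalg.norm(a - b))          # ~50.5, almost entirely due to feature 2

# After min-max scaling both features to [0, 1] (bounds assumed for the demo)
a_scaled = np.array([(2 - 1) / 9, (3000 - 1) / 9999])
b_scaled = np.array([(9 - 1) / 9, (3050 - 1) / 9999])
print(np.linalg.norm(a_scaled - b_scaled))  # now feature 1's difference matters too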
There are two common ways to get all attributes onto the same scale: min-max scaling and standardization.
In min-max scaling, we subtract the minimum value from the actual value and divide by the difference between the maximum and the minimum. Scikit-Learn provides a transformer called MinMaxScaler. It has a feature_range hyperparameter that lets you change the range if you don't want 0 to 1 for any reason.
Class
sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)
Mathematical Explanation
Min-Max Scaling formula:
X_scaled = (X - X_min) / (X_max - X_min)
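A short usage sketch of MinMaxScaler applying this formula column by column; the data values are made up.

# A sketch of min-max scaling with scikit-learn's MinMaxScaler,
# applying X_scaled = (X - X_min) / (X_max - X_min) to each column.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0, 3000.0],
              [9.0, 3050.0],
              [5.0,  100.0]])       # made-up data on two very different scales

scaler = MinMaxScaler()             # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)
print(X_scaled)

# feature_range lets you rescale to another interval, e.g. (-1, 1)
scaler_neg = MinMaxScaler(feature_range=(-1, 1))
print(scaler_neg.fit_transform(X))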