Unit 2

The Bag of Words (BoW) model is a basic text representation method that converts sentences into numerical vectors based on word occurrence, but it has limitations such as ignoring word order and context. The document also discusses the Term Frequency-Inverse Document Frequency (TF-IDF) method, which enhances BoW by measuring word importance in a document, and introduces the Word2Vec model, which uses neural networks to create word embeddings based on context. Overall, these models aim to improve text analysis for machine learning applications.
What is Bag of Words (BoW)?

The Bag of Words (BoW) model is the simplest form of text representation
in numbers. As the term itself suggests, we represent a sentence as a
bag-of-words vector (a vector of word counts).

Let’s take the following three movie reviews as an example:

 Review 1: This movie is very scary and long

 Review 2: This movie is not scary and is slow

 Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above
three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’,
‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.

We can now take each of these words and count how many times it occurs in
the three movie reviews above. This gives us 3 vectors for the 3
reviews:

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the core idea behind a Bag of Words (BoW) model.
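As a quick illustration, here is a minimal sketch of building these count vectors with scikit-learn's CountVectorizer (assuming a recent scikit-learn version; note that it lowercases tokens and orders the vocabulary alphabetically, so the columns will not match the manual ordering above):

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one count vector per review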

Drawbacks of using a Bag-of-Words (BoW) Model

In the above example, the vectors are of length 11. However, we start
facing issues with Bag of Words when we come across new sentences:

1. If the new sentences contain new words, then our vocabulary size
would increase and, thereby, the length of the vectors would
increase too.

2. Additionally, the vectors would also contain many 0s, thereby
resulting in a sparse matrix (which is what we would like to avoid).

3. We retain no information on the grammar of the sentences
nor on the ordering of the words in the text.
Let’s Take an Example to Understand Bag-of-Words (BoW) and TF-IDF

I’ll take a popular example to explain Bag-of-Words (BoW) and TF-IDF in
this article.

We all love watching movies (to varying degrees). I tend to always look at
the reviews of a movie before I commit to watching it. I know a lot of you
do the same! So, I’ll use this example here.

Here’s a sample of reviews about a particular horror movie:

 Review 1: This movie is very scary and long

 Review 2: This movie is not scary and is slow

 Review 3: This movie is spooky and good

You can see that there are some contrasting reviews about the movie as
well as the length and pace of the movie. Imagine looking at a thousand
reviews like these. Clearly, there are a lot of interesting insights we can
draw from them and build upon them to gauge how well the movie
performed.

However, as we saw above, we cannot simply give these sentences to a
machine learning model and ask it to tell us whether a review was positive
or negative. We need to perform certain text preprocessing steps.

Bag-of-Words and TF-IDF are two examples of how to do this. Let’s
understand them in detail.

Creating Vectors from Text

Can you think of some techniques we could use to vectorize a sentence?
The basic requirements would be:

1. It should not result in a sparse matrix, since sparse matrices result in
high computation cost.

2. We should be able to retain most of the linguistic information
present in the sentence.

Word Embedding is one such technique where we can represent the text
using vectors. The more popular forms of word embeddings are:

1. BoW, which stands for Bag of Words

2. TF-IDF, which stands for Term Frequency-Inverse Document Frequency
Now, let us see how we can represent the above movie reviews as
embeddings and get them ready for a machine learning model.

Limitations of Bag of Words

1. No Word Order: It doesn’t care about the order of words, missing
out on how words work together.

2. Ignores Context: It doesn’t understand the meaning of words
based on the words around them.

3. Always Same Length: It always represents text with a vector of the
same fixed length, which can be limiting for different types of text.

4. Lots of Words: It needs to know every word in the vocabulary, which
can be a huge list to handle.

5. No Meanings: It doesn’t understand what words mean, only how
often they appear, so it can’t grasp synonyms or different word
forms.

Term Frequency-Inverse Document Frequency (TF-IDF)

Let’s first put a formal definition around TF-IDF. Here’s how Wikipedia puts
it:

“Term frequency–inverse document frequency is a numerical statistic that
is intended to reflect how important a word is to a document in a
collection or corpus.”

Term Frequency (TF)

Let’s first understand Term Frequency (TF). It is a measure of how
frequently a term, t, appears in a document, d:

TF(t, d) = n / (total number of terms in document d)

Here, in the numerator, n is the number of times the term “t”
appears in the document “d”. Thus, each document and term
would have its own TF value.

We will again use the same vocabulary we had built in the Bag-of-Words
model to show how to calculate the TF for Review #2:

Review 2: This movie is not scary and is slow

Here,

 Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’

 Number of words in Review 2 = 8

 TF for the word ‘this’ = (number of times ‘this’ appears in Review 2) / (number of terms in Review 2) = 1/8

Similarly,

 TF(‘movie’) = 1/8

 TF(‘is’) = 2/8 = 1/4

 TF(‘very’) = 0/8 = 0

 TF(‘scary’) = 1/8

 TF(‘and’) = 1/8

 TF(‘long’) = 0/8 = 0

 TF(‘not’) = 1/8

 TF(‘slow’) = 1/8

 TF( ‘spooky’) = 0/8 = 0

 TF(‘good’) = 0/8 = 0

We can calculate the term frequencies for all the terms and all the reviews
in this manner:
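The TF calculation above can be reproduced with a few lines of plain Python (a minimal sketch; the lowercased vocabulary list is an assumption made to match the hand calculation):

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
vocabulary = ["this", "movie", "is", "very", "scary", "and",
              "long", "not", "slow", "spooky", "good"]

def term_frequency(term, document):
    words = document.lower().split()
    return words.count(term) / len(words)

# TF values for Review 2, matching the hand calculation above
for term in vocabulary:
    print(term, term_frequency(term, reviews[1]))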

Inverse Document Frequency (IDF)

IDF is a measure of how important a term is. We need the IDF value
because computing just the TF alone is not sufficient to understand the
importance of words:

IDF(t) = log(number of documents / number of documents containing the term t)

We can calculate the IDF values for all the words in Review 2:

IDF(‘this’) = log(number of documents/number of documents containing
the word ‘this’) = log(3/3) = log(1) = 0

Similarly,

 IDF(‘movie’) = log(3/3) = 0

 IDF(‘is’) = log(3/3) = 0

 IDF(‘not’) = log(3/1) = log(3) = 0.48

 IDF(‘scary’) = log(3/2) = 0.18

 IDF(‘and’) = log(3/3) = 0

 IDF(‘slow’) = log(3/1) = 0.48

We can calculate the IDF values for each word like this, which gives us the
IDF values for the entire vocabulary.

Hence, we see that words like “is”, “this”, “and”, etc., are
reduced to 0 and have little importance, while words like “scary”,
“long”, “good”, etc. are words with more importance and thus
have a higher value.
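The IDF values above use a base-10 logarithm. Here is a minimal sketch in plain Python (the reviews list is repeated so the snippet is self-contained):

import math

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

def inverse_document_frequency(term, documents):
    # number of documents that contain the term at least once
    docs_containing = sum(1 for d in documents if term in d.lower().split())
    return math.log10(len(documents) / docs_containing)

for term in ["this", "movie", "is", "not", "scary", "and", "slow"]:
    print(term, round(inverse_document_frequency(term, reviews), 2))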

We can now compute the TF-IDF score for each word in the corpus. Words
with a higher score are more important, and those with a lower score are
less important:

TF-IDF(t, d) = TF(t, d) * IDF(t)

We can now calculate the TF-IDF score for every word in Review 2:

TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0

Similarly,

 TF-IDF(‘movie’, Review 2) = 1/8 * 0 = 0

 TF-IDF(‘is’, Review 2) = 1/4 * 0 = 0

 TF-IDF(‘not’, Review 2) = 1/8 * 0.48 = 0.06

 TF-IDF(‘scary’, Review 2) = 1/8 * 0.18 = 0.023

 TF-IDF(‘and’, Review 2) = 1/8 * 0 = 0

 TF-IDF(‘slow’, Review 2) = 1/8 * 0.48 = 0.06

Similarly, we can calculate the TF-IDF scores for all the words with respect
to all the reviews:

We have now obtained the TF-IDF scores for our vocabulary. TF-IDF gives
larger values to words that are less frequent across the corpus, and the score
is high when both the IDF and TF values are high, i.e., the word is rare in
the corpus as a whole but frequent within a particular document.
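For comparison, here is a hedged sketch using scikit-learn's TfidfVectorizer. Note that scikit-learn uses a smoothed IDF by default, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ from the hand-computed values above:

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)   # sparse TF-IDF matrix

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))             # one TF-IDF vector per review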

What is the Word2Vec Model?

The Word2Vec model is used for word representations in vector space. It was
developed by Tomas Mikolov and a research team at Google in 2013. It is a
neural network model that learns word embeddings from a text corpus.

These models work using context: to learn an embedding, the model looks at
nearby words. If a group of words is always found close to the same words,
they will end up having similar embeddings.
To decide which words count as close to each other, we first fix
the window size, which determines which nearby words we want to pick.

For example, a window size of 2 implies that for every word, we’ll
pick the 2 words before and the 2 words after it. Let’s see the following
example:

Sentence: the pink horse is eating

The word pairs are constructed by pairing each word with every other word
inside its window (see the sketch below). Here, we don’t care how far apart
the words inside the window are. As long as words are inside the
window, we don’t differentiate between words that are 1 word away or
more.
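A minimal sketch of generating the (target, context) pairs for a window size of 2, in plain Python:

sentence = "the pink horse is eating".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # every word within `window` positions of the target, excluding the target itself
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs)
# e.g. the target 'horse' is paired with 'the', 'pink', 'is' and 'eating'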

The General Flow of the Algorithm

 Step-1: Initially, we will assign a vector of random numbers to each
word in the corpus.

 Step-2: Then, we will iterate through each word of the document,
grab the vectors of the nearest n words on either side of our target
word, concatenate these vectors, forward propagate the concatenated
vector through a linear layer + softmax function, and try to predict
what our target word was.

 Step-3: In this step, we will compute the error between our
estimate and the actual target word, backpropagate the error, and
then modify not only the weights of the linear layer but also the
vectors (embeddings) of the neighboring words.

 Step-4: Finally, we will extract the weights from the hidden layer
and, by using these weights, encode the meaning of words in the
vocabulary.

The Word2Vec model is not a single algorithm but is composed of the
following two techniques:

 Continuous Bag of Words (CBOW)

 Skip-Gram.
Both of these models are basically shallow neural networks that
map word(s) to a target variable which is also a word (or words). These
techniques learn weights that act as word vector representations. Either
technique can be used to implement word embeddings with word2vec.
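As a usage sketch, the gensim library (an assumption; it is not named in the text) implements both architectures behind a single Word2Vec class, where sg=0 selects CBOW and sg=1 selects Skip-Gram. The toy corpus and parameter values below are illustrative only:

from gensim.models import Word2Vec

# tiny illustrative corpus; real training needs far more text
sentences = [["the", "pink", "horse", "is", "eating"],
             ["the", "brown", "horse", "is", "sleeping"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 -> CBOW

print(model.wv["horse"])                      # the learned 50-dimensional embedding
print(model.wv.most_similar("horse", topn=2)) # nearest words by cosine similarity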

Why was the Word2Vec technique created?

As we know, most NLP systems treat words as atomic units. In
earlier systems with the same purpose as word2vec, there is a
disadvantage that there is no notion of similarity between words. Also,
those systems were designed for smaller, simpler datasets of at most a
few billion words.

So, in order to train on larger datasets with more complex models,
word2vec uses a neural network architecture that can be trained on huge
datasets with billions of words and vocabularies of millions of words.

It also makes it possible to measure the quality of the resulting vector
representations, based on the expectation that similar words tend to lie
close to each other and that words can have multiple degrees of similarity.

Syntactic Regularities: These regularities refer to grammatical
sentence structure.

Semantic Regularities: These regularities refer to the meaning of the
vocabulary symbols arranged in that structure.

With the proposed technique, it was found that the similarity of word
representations goes beyond syntactic regularities and works surprisingly
well for algebraic operations on word vectors.

For example:

Vector(“King”) − Vector(“Man”) + Vector(“Woman”) ≈ Vector(“Queen”)

where “Queen” is the closest resulting word vector.
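This analogy can be reproduced with pretrained vectors, for example via gensim's downloader (a hedged sketch; the model name is an assumption and loading it requires an internet download):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # pretrained 50-dimensional word vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically returns 'queen' as the closest word vector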

The two proposed models in Word2Vec, i.e., CBOW and Skip-Gram, use a
distributed architecture that tries to minimize the computational complexity.

Continuous Bag of Words (CBOW)

The aim of the CBOW model is to predict a target word from the words in its
neighborhood. To predict the target word, this model uses a combination
(sum or average) of the context word vectors. For this, we use a pre-defined
window size surrounding the target word to define the neighboring terms
that are taken into account.

Case-1: Single context word

[Figure: CBOW with a single context word]

We break down the way this model works in the following steps:

1. Firstly, the input and the target are both one-hot encoded vectors of
size [1 X V].

2. In this model, we have two sets of weights: one between the input
and the hidden layer, and the second between the hidden and
output layer.

3. Input-hidden layer matrix size = [V X N], hidden-output layer matrix
size = [N X V], where N is an arbitrary size that defines the size of
our embedding space, i.e., the number of dimensions we choose
to represent our words in. It is a hyperparameter for the neural
network. N is also the number of neurons present in the hidden
layer.

4. There is no activation function between any of the layers in
the model; more specifically, we can refer to this as a linear
activation.

5. The input is multiplied by the weights between the input
and hidden layer; this is known as the hidden activation. It is simply
the corresponding row of the input-hidden matrix, copied.

6. The hidden activation gets multiplied by the weights between the hidden
and output layers, and the output is computed.

7. Then, we compute the error between the output and the target and
backpropagate that error to re-adjust the weights until the error is
minimized.

8. So, the weights between the hidden layer and the output layer are
taken as the word vector representation of the word.
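The forward pass described above can be sketched in a few lines of NumPy. This is a toy illustration only, not a training loop; the values of V, N, and the chosen word index are arbitrary assumptions:

import numpy as np

V, N = 10, 4                       # vocabulary size and embedding size (illustrative)
rng = np.random.default_rng(0)

W_in = rng.normal(size=(V, N))     # input-to-hidden weights  [V X N]
W_out = rng.normal(size=(N, V))    # hidden-to-output weights [N X V]

x = np.zeros(V)                    # one-hot encoded context word
x[3] = 1

h = x @ W_in                       # hidden activation = row 3 of W_in (linear, no activation)
scores = h @ W_out                 # raw scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the target word

print(probs.round(3))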

Case-2: Multiple context words

[Figure: CBOW with multiple context words]

Let’s consider the following matrix representation for a specific example:

[Figure: Matrix representation of CBOW with three context words]

As we can observe in the above figure, the model takes 3 context words and
predicts the probability of a target word.

INPUT: The input can be thought of as three one-hot encoded
vectors in the input layer, shown above in red, blue, and green.

So, the input layer will have 3 [1 X V] vectors, and we have 1 [1 X V] vector
in the output layer. The rest of the architecture is the same as for a
1-context CBOW.

The above-mentioned steps remain the same but the only thing that
changes is the calculation of hidden activation. Here, instead of just
sending the corresponding rows of the input-hidden weight matrix to the
hidden layer, an average is taken over all the corresponding rows of the
matrix. We can understand this with the above figure. Therefore, the
average vector calculated becomes the hidden activation.

So, if for a single target word we have three context words, then we will
have three initial hidden activations, which are averaged element-wise
to obtain the final activation.
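A minimal NumPy sketch of this averaging step (again with arbitrary sizes and illustrative context indices):

import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))
W_out = rng.normal(size=(N, V))

context_ids = [1, 4, 7]                        # indices of the 3 context words (illustrative)
h = W_in[context_ids].mean(axis=0)             # average of the corresponding rows of W_in
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()  # probability of each word being the target
print(probs.argmax())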

Objective Function of CBOW Model

The objective function in CBOW is the negative log-likelihood of a word
given a set of context words, i.e., −log(p(wo | wi)), where p(wo | wi) is the
softmax over the vocabulary:

p(wo | wi) = exp(u(wo) · h) / Σw exp(u(w) · h)

where,

wo: output word

wi: context words

h: the hidden activation computed from the context words, and u(w): the output-layer vector of word w

Advantages of CBOW:

1. Generally, it is supposed to perform better than deterministic methods
due to its probabilistic nature.

2. It does not have huge RAM requirements, so it is low on
memory.

Disadvantages of CBOW:

1. CBOW takes the average of the contexts of a word. For
example, consider the word “apple”, which can be both a fruit and a company;
CBOW takes an average of both contexts and places the word somewhere between
the cluster for fruits and the cluster for companies.

2. If we want to train a CBOW model from scratch, it can take forever
if it is not properly optimized.

Skip-Gram

1. Given a word, the Skip-gram model predicts the context.

2. Skip-gram follows the same topology as CBOW. It just flips CBOW’s
architecture on its head. Therefore, the skip-gram model is the exact
opposite of the CBOW model.
3. In this case, the target word is given as the input, the hidden layer
remains the same, and the output layer of the neural network is replicated
multiple times to accommodate the chosen number of context words.

General Steps involved in the algorithm

1. The input vector given to the skip-gram model is similar to that of a
1-context CBOW model. Note that the calculations up to the hidden
layer activations are the same.

2. The difference is in the target variable. Since we have defined a
context window of 1 on both sides of our target word, we will
get “two” one-hot encoded target variables and “two”
corresponding outputs, represented by the blue section
in the figure below.

3. Then, we compute the two separate errors with respect to the two
target variables, and then we add the two error vectors element-
wise to obtain a final error vector which is propagated back to
update the weights until our objective is met.

4. Finally, the weights present between the input and the hidden layer
are considered as the word vector representation after training. The
loss function or the objective is of the same type as the CBOW
model.

Now, let’s see the architecture of the skip-gram model:

[Figure: Skip-gram model architecture]

For a better understanding, let’s see the matrix-style structure given
below:
[Figure: Matrix-style representation of the skip-gram model]

We break down the way this model works in the following steps:

For the above matrix, the sizes of different layers are as follows:

 Size of input layer: [1 X V]

 Size of input-hidden weight matrix: [V X N]

 Number of neurons present in the hidden layer: N

 Size of hidden-output weight matrix: [N X V]

 Size of output layer: C [1 X V]

In the above example, C is the number of context words = 2, V = 10, and
N = 4.

1. The red row represents the hidden activation corresponding to the
input one-hot encoded vector. It is simply the
corresponding row of the input-hidden matrix.

2. The yellow matrix is the weights between the hidden layer
and the output layer.

3. To obtain the blue matrix, we do the matrix multiplication of the hidden
activation and the hidden-output weights; there will be two rows
calculated, one for each of the two target (context) words.

4. Then, we convert each row of the blue matrix into its softmax
probabilities individually, which is shown in the green box.

5. Here, the grey matrix describes the one-hot encoded vectors of the
two context words, i.e., the targets.

6. The error is calculated by subtracting the first row of the grey
matrix (target) from the first row of the green matrix (output),
element-wise. This is repeated for the next row. Therefore, if we
have n target context words, we will have n error vectors.

7. The element-wise sum is taken over all the error vectors to obtain a
final error vector.
8. Finally, the calculated error vector is backpropagated to adjust the
weights.
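The forward pass and error computation described above can be sketched in NumPy with the same illustrative sizes (V = 10, N = 4, C = 2); the word indices used here are arbitrary assumptions:

import numpy as np

V, N, C = 10, 4, 2
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))             # input-hidden weights
W_out = rng.normal(size=(N, V))            # hidden-output weights (yellow matrix)

x = np.zeros(V); x[3] = 1                  # one-hot input (target) word
h = x @ W_in                               # hidden activation (red row)

scores = h @ W_out                         # same scores are used for each context position
probs = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities (green rows)

targets = np.zeros((C, V))                 # one-hot context words (grey matrix)
targets[0, 1] = 1
targets[1, 5] = 1

errors = probs - targets                   # one error vector per context word (output - target)
final_error = errors.sum(axis=0)           # element-wise sum -> final error vector
print(final_error.round(3))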

Advantages of Skip-Gram Model

1. The Skip-gram model can capture two semantics for a single word, i.e.,
two vector representations for the word “Apple”: one for the company and
the other for the fruit.

2. Generally, Skip-gram with negative sampling performs better than
the other methods.

Tool to Visualize CBOW and Skip-Gram

To visualize CBOW and skip-gram in action, the link given below is an
excellent interactive tool. I would suggest you go through it
for a better understanding.

Visualize CBOW and Skip-Gram

When to use which: CBOW or Skip-Gram?

Now that we have a broad idea of both the models involved in the
Word2Vec Technique, which one is better? Of course, which model we
choose from the above two largely depends on the problem statement
we’re trying to solve.

[Figure: Comparison of CBOW and Skip-Gram]


According to the original paper by Mikolov et al., it is observed that
the Skip-Gram model works well with a small amount of
training data and can better represent rare words or
phrases.

However, the CBOW model is observed to train faster than Skip-Gram
and to better represent more frequent words, which means it
gives slightly better accuracy for frequent words.

Geometric Understanding of Lines, Planes and Hyperplanes

Lines in 2D Space

A line in 2D space is represented by the equation y = mx + c, where m
is the slope and c is the y-intercept. This can be seen as the simplest
form of a decision boundary, separating the plane into two halves. In
vector notation, with a vector w normal to the line and x representing
points on the line, the equation can be expressed as

w · x + b = 0,

and when the line passes through the origin it simplifies to

w · x = 0,

showcasing an early application of the dot product in determining
orthogonality.

Planes in 3D Space

Moving to three dimensions, a plane’s equation is

ax + by + cz + d = 0.

This equation can be intuitively understood as an extension of the 2D line
equation into an additional dimension. Using vector notation, a plane can
be described as

w · x + d = 0,

where w is a vector normal to the plane, and x represents points on the
plane. This equation highlights the dot product’s role in determining the
perpendicular distance from the origin to the plane, encapsulating the
plane’s orientation in space.
Hyperplanes in N-Dimensional Space

Hyperplanes generalize the concept of planes to n-dimensional spaces
and are crucial in separating data in machine learning models. The
equation

w1x1 + w2x2 + ... + wnxn + w0 = 0

can be compactly written using vector notation as

w · x + w0 = 0,

where w is the vector perpendicular to the hyperplane, and x
represents coordinates in n-dimensional space. This formulation is pivotal
in SVMs, where the dot product is used to compute the margin between
classes, illustrating the dot product’s significance in defining relationships
between data points and decision boundaries.
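As a hedged illustration of this idea, a linear SVM fitted on synthetic data exposes exactly such a hyperplane through its coef_ (w) and intercept_ (b) attributes; the data below is made up for the example:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# two well-separated 2D clusters, 20 points each
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = LinearSVC(C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                           # the learned hyperplane w . x + b = 0
print(clf.decision_function(X[:3]))   # signed values of w . x + b for the first points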

Intuitive Similarities Across Dimensions

The progression from lines to hyperplanes showcases a fundamental
similarity: each of these geometric constructs serves to partition the
space in which they exist, creating boundaries that can be leveraged for
classification. The transition from y = mx + c in 2D to w · x + w0 = 0
in n-dimensional space exemplifies how linear algebra scales these
concepts, enabling their application in complex, high-dimensional machine
learning tasks.

Through this, we can see how lines, planes, and hyperplanes form the
geometric backbone of many machine learning algorithms. By framing
these constructs in vector notation and highlighting the dot product’s role,
we gain a deeper understanding of how algorithms like SVMs carve out
decision boundaries in multidimensional datasets.

Half-Spaces: The Building Blocks of Decision Boundaries

In n-dimensional space, half-spaces are defined by hyperplanes. A half-
space can be thought of as one side of a hyperplane; every point in n-
dimensional space lies in one of the two half-spaces created by a given
hyperplane. Formally, for a hyperplane defined by

w · x + b = 0,

the two half-spaces are determined by the inequalities

w · x + b ≥ 0 and w · x + b < 0.

Understanding half-spaces is crucial because they form the fundamental
‘decision regions’ in many machine learning classification algorithms.
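Deciding which half-space a point falls into amounts to checking the sign of w · x + b; a minimal sketch with illustrative values:

import numpy as np

w = np.array([2.0, -1.0])    # normal vector of the separating hyperplane (illustrative)
b = 0.5                      # offset term
x = np.array([1.0, 3.0])     # the point to classify

side = np.sign(w @ x + b)    # +1 -> positive half-space, -1 -> negative half-space
print(side)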

Normal to the Plane: The Direction of Differentiation

The normal vector to a plane is a vector that is perpendicular to every
vector lying on the plane. It essentially defines the plane’s orientation in
n-dimensional space. For the plane’s equation

ax + by + cz + d = 0,

the normal vector would be (a, b, c). This concept extends to
hyperplanes in higher dimensions as well. The normal vector is vital in
optimization problems as it points in the direction of the greatest rate of
increase of a function and thus is intimately related to gradient descent
methods in machine learning.

Distance from a Point to a Plane: Measuring Closeness

The shortest distance from a point to a plane is a metric that quantifies
how ‘close’ a point is to a given plane. For a point p and a plane defined
by the normal vector w and a point a on the plane, the distance d from
the point to the plane is given by the formula

d = |w · (p − a)| / ||w||.

This formula is derived from the projection of the vector (p − a)
onto the normal vector w, and it is a concept that finds use in evaluating
the performance of models, particularly in regression and classification
tasks.
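A minimal NumPy sketch of this distance formula; the plane's normal vector and the points are illustrative assumptions:

import numpy as np

w = np.array([1.0, 2.0, 2.0])   # normal vector of the plane
a = np.array([1.0, 1.0, 0.0])   # a point known to lie on the plane
p = np.array([4.0, 1.0, 4.0])   # the point whose distance we want

# project (p - a) onto the unit normal to get the shortest distance
d = abs(w @ (p - a)) / np.linalg.norm(w)
print(d)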

With these concepts in mind, let’s explore how they are applied in
machine learning to extract meaningful insights from data and improve
algorithm performance.

Distance from a Point to a Plane:

Understanding the shortest distance from a point to a plane, given by
d = |w · (p − a)| / ||w||, is not just a geometric exercise but a practical
tool for refining the accuracy of our models. This principle aids in
optimizing algorithms by ensuring they can quantify and minimize errors
effectively.

Half-Spaces: Defining Decision Boundaries

The concept of half-spaces emerges naturally as we discuss hyperplanes,
illustrating how these geometric constructs divide n-dimensional space
into distinct regions. In machine learning, recognizing which half-space a
point falls into allows us to classify data with greater precision,
underscoring the value of our geometric insights.

Normal to the Plane and Its Significance

The direction of the normal vector to a plane informs us about the
gradient’s direction in optimization techniques like gradient descent,
facilitating the efficient training of models by highlighting the path of
steepest descent.

What is Data Cleaning?

Data cleaning is a crucial step in the machine learning (ML) pipeline, as it
involves identifying and removing any missing, duplicate, or irrelevant
data. The goal of data cleaning is to ensure that the data is accurate,
consistent, and free of errors, as incorrect or inconsistent data can
negatively impact the performance of the ML model. Professional data
scientists usually invest a very large portion of their time in this step
because of the belief that “Better data beats fancier algorithms”.

Data cleaning, also known as data cleansing or data preprocessing, is
a crucial step in the data science pipeline that involves identifying and
correcting or removing errors, inconsistencies, and inaccuracies in the
data to improve its quality and usability. Data cleaning is essential
because raw data is often noisy, incomplete, and inconsistent, which can
negatively impact the accuracy and reliability of the insights derived from
it.

Why is Data Cleaning Important?

Data cleansing is a crucial step in the data preparation process, playing an
important role in ensuring the accuracy, reliability, and overall quality of a
dataset.

For decision-making, the integrity of the conclusions drawn heavily relies
on the cleanliness of the underlying data. Without proper data cleaning,
inaccuracies, outliers, missing values, and inconsistencies can
compromise the validity of analytical results. Moreover, clean data
facilitates more effective modeling and pattern recognition, as algorithms
perform optimally when fed high-quality, error-free input.

Additionally, clean datasets enhance the interpretability of findings, aiding
in the formulation of actionable insights.

Data Cleaning in Data Science

Data clean-up is an integral component of data science, playing a
fundamental role in ensuring the accuracy and reliability of datasets. In
the field of data science, where insights and predictions are drawn from
vast and complex datasets, the quality of the input data significantly
influences the validity of analytical results. Data cleaning involves the
systematic identification and correction of errors, inconsistencies, and
inaccuracies within a dataset, encompassing tasks such as handling
missing values, removing duplicates, and addressing outliers. This
meticulous process is essential for enhancing the integrity of analyses,
promoting more accurate modeling, and ultimately facilitating informed
decision-making based on trustworthy and high-quality data.

Steps to Perform Data Cleaning

Performing data cleaning involves a systematic process to identify and
rectify errors, inconsistencies, and inaccuracies in a dataset. The following
are the essential steps to perform data cleaning.


 Removal of Unwanted Observations: Identify and eliminate
irrelevant or redundant observations from the dataset. The step
involves scrutinizing data entries for duplicate records, irrelevant
information, or data points that do not contribute meaningfully to
the analysis. Removing unwanted observations streamlines the
dataset, reducing noise and improving the overall quality.

 Fixing Structural Errors: Address structural issues in the dataset,
such as inconsistencies in data formats, naming conventions, or
variable types. Standardize formats, correct naming discrepancies,
and ensure uniformity in data representation. Fixing structural errors
enhances data consistency and facilitates accurate analysis and
interpretation.

 Managing Unwanted Outliers: Identify and manage outliers,
which are data points significantly deviating from the norm.
Depending on the context, decide whether to remove outliers or
transform them to minimize their impact on analysis. Managing
outliers is crucial for obtaining more accurate and reliable insights
from the data.

 Handling Missing Data: Devise strategies to handle missing data
effectively. This may involve imputing missing values based on
statistical methods, removing records with missing values, or
employing advanced imputation techniques. Handling missing data
ensures a more complete dataset, preventing biases and
maintaining the integrity of analyses. A combined sketch of these
steps is shown after this list.
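Here is a hedged pandas sketch that touches each of the four steps above; the column names, values, and thresholds are illustrative assumptions, not taken from the text:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 25, 40, 29, 200, np.nan],    # 200 is an implausible outlier
    "salary": [50000, 50000, 64000, np.nan, 52000, 58000],
    "city":   ["NY", "NY", "ny", "LA", "SF", "LA"],
})

df = df.drop_duplicates()                                # remove unwanted/duplicate observations
df["city"] = df["city"].str.upper()                      # fix structural errors (inconsistent naming)
df = df[df["age"].between(0, 120) | df["age"].isna()]    # manage unwanted outliers
df["age"] = df["age"].fillna(df["age"].median())         # handle missing data (imputation)
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df)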

What is Scaling?

Feature scaling is one of the most important transformations we need to
apply to our data. Machine learning algorithms (mostly regression-style
algorithms) don’t perform well when the numerical inputs are on different
scales.

When different features are on different scales, applying scaling converts
all the features to the same scale. Suppose we have two features, where
one feature is measured on a scale from 1 to 10 and the second feature is
measured on a scale from 1 to 100,000. If we calculate the mean squared
error, the algorithm will mostly be busy optimizing the weights
corresponding to the second feature instead of both features. The same
applies when the algorithm uses distance calculations like Euclidean or
Manhattan distance: the second feature will dominate the result. So, if we
scale the features, the algorithm will give equal priority to both features.
There are two common ways to get all attributes to have the same scale:
min-max scaling and standardization.

In min-max scaling, we subtract the minimum value from the actual value
and divide by the maximum minus the minimum. Scikit-Learn provides a
transformer called MinMaxScaler for this. It has a feature_range
hyperparameter that lets you change the range if you don’t want 0 to 1 for
any reason (a code sketch is given at the end of this section):

class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)
Mathematical Explanation

Min-max scaling formula:

x_scaled = (x − min(x)) / (max(x) − min(x))

Consider 2 columns A and B with different scales. After scaling, both the
A and B columns are on the same scale, i.e., between 0 and 1. We can also
change the min and max values: if we select 1 as the min and 2 as the max,
all the scaled values will fall between 1 and 2.

Linear regression, logistic regression, or anything else that involves a
weight matrix is affected by the scale of the input, so scaling will improve
model performance. Tree-based models, on the other hand, are largely
unaffected by feature scaling.
When we use neural networks, remember that it is important to first
normalize the input feature vectors, or else training may be much slower.
Image processing and NLP models also rely on neural networks, so scaling
has to be performed on their features as well.
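Finally, here is a minimal scikit-learn sketch of min-max scaling as described above; the column values for A and B are illustrative, since the original worked example was shown only as an image:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10, 1000],
              [20, 5000],
              [30, 10000]], dtype=float)    # columns A and B on very different scales

scaler = MinMaxScaler()                     # default feature_range=(0, 1)
print(scaler.fit_transform(X))              # every column rescaled to [0, 1]

scaler_1_2 = MinMaxScaler(feature_range=(1, 2))
print(scaler_1_2.fit_transform(X))          # all values now lie between 1 and 2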
