
Machine Learning

( a summary of Google ML )

Machine learning is basically training a selected model on a provided dataset.

Firstly, we use pandas to import the file :

import pandas as pd
iowa_file_path = '../input/home_Data/train.csv'
home_data = pd.read_csv( iowa_file_path )

Then, we create the target object ( normally called y ), which is the thing we want
to predict ( e.g. the prices of houses ).

y = home_data.SalePrice

And the set of features we select from the dataset is called X.

features = [ 'LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'Bedroom' ]
X = home_data [ features ]

For accuracy, we use train_X and train_y to train the model, and we use val_X and
val_y to check.
To achieve such purpose, we import the train_test_split function.

from sklearn.model_selection import train_test_split


train_X, val_X, train_y, val_y = train_test_split( X, y, random_state = 1 )

To train the model, we can use DecisionTreeRegressor.


( Just a reminder, random_state fixes the random seed so you get the same,
reproducible result every run ; it's not giving you different random cases. )

from sklearn.tree import DecisionTreeRegressor


model = DecisionTreeRegressor ( random_state = 1 )
model.fit ( train_X, train_y )

We can get the predictions made by the model using the .predict( ) method.

val_predictions = model.predict ( val_X )

How do we check the accuracy? Compare the predictions with the actual values
( the validation part of the target object y, i.e. val_y ).

from sklearn.metrics import mean_absolute_error


val_mae = mean_absolute_error ( val_predictions, val_y )

Hmmm... what we are getting is pretty inaccurate though...


This is caused by overfitting.

Overfitting occurs when there are too many leaves : the tree essentially memorises
the training dataset, so it is almost 100% accurate on the training data but very
inaccurate on actual new data.

On the flip side, underfitting occurs when there are too few leaves : the whole
dataset is divided into only two or three groups, so the result is very inaccurate.

So, obviously, finding the proper number of leaves is the key, namely finding an
appropriate value for max_leaf_nodes.

mae_collection = []
candidates = [ 5, 20, 100, 500, 2000 ]
for n in candidates:
    new_model = DecisionTreeRegressor ( max_leaf_nodes = n, random_state = 0 )
    new_model.fit ( train_X, train_y )
    val_pred = new_model.predict ( val_X )
    mae_collection.append ( mean_absolute_error ( val_pred, val_y ) )
print ( mae_collection )

In this way, we can find a more suitable max_leaf_nodes, well, it's 100 in this
case.
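
If you'd rather let the code pick the winner for you, here is a tiny sketch reusing
the candidates list and mae_collection from the loop above :

best_n = candidates[ mae_collection.index( min( mae_collection ) ) ]
print ( best_n )   # the max_leaf_nodes value with the lowest validation MAE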
So we're going to modify our model accordingly.

iowa_model = DecisionTreeRegressor ( max_leaf_nodes = 100, random_state = 1 )


iowa_model.fit ( train_X, train_y )
val_predictions = iowa_model.predict ( val_X )
val_mae = mean_absolute_error ( val_predictions, val_y )

Well, using this method, it's a little bit better than just now, but just a little
bit.
It is still quite inaccurate.
HOW?

What is the root cause of overfitting and underfitting?


THE MODEL ITSELF.

So we simply change to a more sophisticated model.


Here, we use Random Forests.

from sklearn.ensemble import RandomForestRegressor


rf_model = RandomForestRegressor ( random_state = 0 )
rf_model.fit ( train_X, train_y )
x_pre = rf_model.predict ( val_X )
rf_mae = mean_absolute_error ( x_pre, val_y )

We are slowly improving the accuracy of our model, but is this the end? Does it
100% match the actual data?
I want to say YES really loud, but the truth is NO.
In fact, it's just a little bit better than just now.
-_-

( Here is the actual mae for this model, if u want to know )

Validation MAE when not specifying max_leaf_nodes: 29,653


Validation MAE for best value of max_leaf_nodes: 27,283
Validation MAE for Random Forest Model: 21857.15912981083

Above is a brief intro to machine learning. Though we didn't manage to build a
model with MAE = 0 ( which every data analyst dreams of ), we've already gained some
basic understanding of it.

Don't worry, getting MAE closer to zero isn't our job, at least not before you
become a data analyst.

-----------------------------------------------------------------------------------

Above is actually a simple application of ML ; let's dive into the concepts now.

There are four categories : Supervised learning, Unsupervised learning,
Reinforcement learning and Generative AI.

Supervised learning is our focus for the time being, because it's more fundamental. By
feeding correctly labeled data into the models, the models can make predictions on new
data, which is what we did just now.

Unsupervised learning means we feed unlabeled data into the model and we just want
to find trends among it. The model will automatically cluster the data into
different groups, and if you understand the dataset very well, you can rename the
groups identified by the model.

Reinforcement learning punishes and rewards the model based on its performance, so that we
can "train" the robots or models we want. ( If it were a game bot, we punish it when it
gets hit, and reward it when it lands a hit or dodges effectively. )

Generative AI is very popular nowadays ; it is capable of generating something from your
input. For example, text-to-image : you write the description of the image, and the AI
generates that image. I don't think I need to go deeper.

But all in all, we are going to talk about supervised learning only, as we are
unable to understand other things yet.

The core concepts are Data, Model, Training, Evaluating and Inference.

Data:

we feed data into the model, as labeled or unlabeled datasets. First of all,
each example is divided into two parts : the Label, which is the target we want ; and the
Features, which are everything excluding the label. Our goal is to find the relationship
between the features and the label ( actually, it is the model's goal, we don't have to do
it ourselves. )

For example, if we are building a weather prediction model, specifically for rainy days,
then the rainfall in cm will be our LABEL ; and all other things, like how long the rain
lasts, or the temperature and humidity, are influencing factors we call FEATURES.
All we want to see is the model learning the relationship between the features and the label,
so that when we feed in new features, it can predict the target label accurately.

( Btw, choosing the dataset is also important ; we should look for one that is large in size
and high in diversity. But normally,
" a large dataset doesn’t guarantee sufficient diversity, and a dataset that is
highly diverse doesn't guarantee sufficient examples", so, lol, just try your best, I
guess)

Model:

Now we are choosing which model is going to be used for prediction. Though it is
important, it is also pretty fixed ; we'll talk about it later.
( Oops, I almost forgot : if we want a numerical value as output, like rainfall
in cm, we use a regression model ; if we want a word of description, we need to use a
classification model instead. )

However, if you like to make things troublesome, you can also use a classification
model for numerical values : just bucket the labeled value we want by setting
thresholds, label the buckets as low, medium and high, feed in this specialised dataset for
classification, and we will get a predicted output of "low, medium or high", if,
like I said, you love getting yourself into trouble.

Training:
To put it simply, after the model makes some predictions, it compares the actual
values with the predicted values, and adjusts itself to be more accurate based on that.

Evaluating:

After we train the model successfully, we need to evaluate the accuracy of the
model, to see whether it is reliable or not.

Normally, we will try some new data, compare the predicted values with the actual
values, find the MAE, and decide whether the model is reliable to use, or whether it is
trash we should leave in the dustbin.

Inference:

If you think this model doesn't belong in the dustbin, then we can use it for
predictions. For example, weather predictions : we feed in unlabeled
Features, like temperature, humidity, atmospheric pressure... and get the predicted amount of
rainfall.

I also don't know why it's called inference instead of simply saying predicted
outputs...

And often, the mae of the model is pretty big, which means it is pretty inaccurate,
well, I guess I know why the weather reports are so unreliable these days.

Ok, now we are about to dive into deeper and detailed concepts, ready?

GO!

1. ML Models

1) Linear Regression

This model is quite mathematical. In the simplest case there is only one feature and one
label, and when we try to find the relationship between them, we find that the dots on the
feature-label graph have a best-fit line. In math, we use y = mx + b.

Similarly, we use y' = b + w1x1 to represent the best-fit line in this model.

y' is the predicted label, b is called the bias ( y-intercept ), w1 is called the weight
( the gradient / slope ), and x1 is the feature input.

y' is predicted by the model, x1 is the given feature input, and b and w1 are
calculated by the model.

You might have already noticed that we use x "1" and w "1", meaning that building such a
model with multiple features is possible.

The model then finds a linear relationship between the label and every feature.

It looks smth like this:

y' = b + w1x1 + w2x2 + w3x3 + w4x4 + w5x5

Alright, you might be a bit overwhelmed. I'll show you one example, so that you'll be
even more overwhelmed. HAHA.
Kidding.

Let's say we have a model that predicts the gas mileage of a car. The possible features
will be :

x1 : the weight of the car in pounds

x2 : engine displacement

x3 : acceleration

x4 : number of cylinders

x5 : horsepower

( each feature xi gets its own learned weight wi )
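
To make the equation concrete, here is a tiny sketch with made-up bias and weights
( in reality the model learns these values, we don't pick them ) :

import numpy as np

b = 45.0                                              # hypothetical bias
w = np.array( [ -0.004, -0.01, -0.2, -0.5, -0.05 ] )  # hypothetical weights w1..w5
x = np.array( [ 3200, 2.5, 12.0, 4, 130 ] )           # one car's features x1..x5 as listed above

y_pred = b + np.dot( w, x )   # y' = b + w1x1 + w2x2 + w3x3 + w4x4 + w5x5
print ( y_pred )              # a predicted mileage, around 21 with these made-up numbers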

Well, I admit that it involves a bit of luck to find features that are all linearly
related, but this is truly the simplest model.

I mean, it will be a bit inaccurate, errr, maybe a lot inaccurate, but this is something
easy for us to understand, isn't it?

Okok, I admit that it's very inaccurate, especially when it comes to a diverse and
large dataset.

But you need to measure this kind of inaccuracy, to prove that you are correct, and
this measurement is called LOSS.

Loss is a numerical metric that describes the inaccuracy of the model, so my first
thought is, something like standard deviation?

Let's take a look at it.

Oh, exactly like what I thought, but it's much simpler than standard deviation.

Loss is based on the difference between the actual values and the predicted values.

There are two types of loss, L1 and L2. ( sounds fancy, but of course it just means loss 1 and loss 2 )

Our goal is to measure the inaccuracy, so we just want to see how far the
points are from the "best-fit" line, and we don't want negative signs appearing during the
calculation.

In this case, we have two methods : one simply takes the absolute value of all the
distances, the other squares them.

We name the "absolute method" L1, and the "square method" L2.

However, L1 and L2 are sums of the absolute / squared values, and we want to see the
average performance.

Here come the Mean Absolute Error ( L1 / n ) and the Mean Squared Error ( L2 / n ).

( Actually, I personally think that MSE is more accurate, but I haven't really tried it
out yet, maybe tomorrow we'll see )

Oh, my bad, when you choose between the two metrics, you only need to focus on one
thing : the OUTLIER.

An outlier is something far out of range, e.g. in a class, there might be an extremely
good student and an extremely bad student.

Do you want to take them into consideration?

If you want to ignore them, use MAE ; If you want to include them in your model,
use MSE then.

In other words, MSE is strongly affected by outlier values, while MAE is much less
sensitive to them.
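
Here is a quick sketch ( made-up numbers ) so you can see the effect for yourself :

import numpy as np

actual = np.array( [ 10, 12, 11, 13, 10 ] )
pred_normal = np.array( [ 11, 11, 12, 12, 11 ] )
pred_outlier = np.array( [ 11, 11, 12, 12, 30 ] )   # one wildly wrong prediction

for name, pred in [ ( "no outlier", pred_normal ), ( "with outlier", pred_outlier ) ]:
    mae = np.mean( np.abs( actual - pred ) )   # L1 / n
    mse = np.mean( ( actual - pred ) ** 2 )    # L2 / n
    print ( name, "MAE:", mae, "MSE:", mse )

# MAE goes from 1.0 to 4.8, but MSE jumps from 1.0 to 80.8 because of the single outlier.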

I really want to paste some pics here, but I couldn't, stupid notebook.

You might take a break if you've read through everything up to here in one go ; you can also
choose to move on if you've just started.

Although we can calculate the bias and weight manually, we'd better leave it to
the model.

But how does it work?

The model will start from zero bias and zero weight, check the loss, and adjust the bias
and weight a little bit to fit better, over and over.

In the end, when the loss has been minimised, the model is said to have converged.

On the 2D loss curve, the curve becomes flat after convergence. ( the loss can't be
reduced any more )

If you are good at math, the loss functions for linear models always produce a
convex surface (3D), where weight is on the x-axis, bias is on the y-axis, and loss
is on the z-axis.

When the model converges, it is because the loss surface is convex and contains a
point where the slope with respect to the weight and bias is almost zero. ( "almost" means
it never truly finds the minimum value, but it can find a very close one )
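
If you're curious what "adjusting a little bit each time" looks like in code, here is a
minimal gradient-descent sketch for a single weight and bias, on made-up data ( real
libraries do this for you ) :

import numpy as np

x = np.array( [ 1.0, 2.0, 3.0, 4.0 ] )
y = np.array( [ 3.1, 4.9, 7.2, 8.8 ] )   # roughly y = 2x + 1

w, b = 0.0, 0.0           # start from zero weight and zero bias
learning_rate = 0.05

for step in range ( 2000 ):
    error = ( w * x + b ) - y
    grad_w = 2 * np.mean( error * x )   # gradient of MSE loss with respect to w
    grad_b = 2 * np.mean( error )       # gradient of MSE loss with respect to b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print ( w, b )   # lands near 2 and 1 once the loss has converged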

We've already briefly explained the important concepts, time to relax~

There are three variables that control different aspects of training, Learning
rate, Batch size and Epochs.

These are called hyperparameters.

Don't get confused : parameters are variables like the weight and bias, calculated by the
model ; hyperparameters are values you can control, from outside the model.

Learning rate :

Learning rate, as the name suggests, controls the rate of learning.

Do you still remember? We mentioned that by changing the parameters a little bit every
time, the model slowly converges.

Learning rate determines how much "a little bit" is.

If it's too low, then it takes forever to finish the training ; however, if it's
too high, the model can't even converge.

So we need to choose a reasonable learning rate, e.g. if the learning rate is 0.01 and the
gradient is 2.5, each parameter changes by 0.01 × 2.5 = 0.025.

An ideal learning rate reduces loss significantly in first few iterations (cycles),
and still can find the minimum point.

Batch size :

It refers to the number of examples the model processes before updating its parameters.

In fact, the dataset might contain millions of examples ; we need to select some of them
for each update.

Normally, we use SGD or mini-batch SGD.

I know what you are thinking about, but sorry, you can't get SGD dollars out of
that. XD

It's the abbreviation of Stochastic Gradient Descent.

It sounds very complicated, but in fact it simply uses a single example per iteration.
( only one dot at a time )

But there will be a lot of "noise", i.e. small fluctuations, throughout the
whole loss curve.

Mini-batch SGD is a compromise between full batch ( everything ) and SGD.

It uses a random group of examples ( 32, 64, etc. ), reducing the fluctuations
effectively.

You might think : then why do we still have SGD? Doesn't mini-batch SGD always give a
less "noisy" curve?

You are right, but sometimes a noisy curve is what we want ; to put it simply, it
shakes the model and helps prevent overfitting, which is widely used in neural networks.

Don't worry, we are not there yet. Or maybe you do have to worry, because you are not
there yet. haha.

Epochs :

It is even simpler : one epoch means all the examples have been processed at least once.

Let's say we have a training set of 1000 examples, with a mini-SGD batch size of
100, thus it takes 10 iterations to complete one epoch.

Still, you need to set how many epochs you want the model to train for.

Generally, more epochs means better accuracy, but also takes more time.

So in many cases, we'll experiment with how many epochs it takes for the model to
converge.
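
Here is a rough sketch ( pure NumPy, made-up data ) of how batch size, iterations and
epochs fit together, using the 1000-example / batch-of-100 arithmetic from above :

import numpy as np

rng = np.random.default_rng( 0 )
X = rng.normal( size = 1000 )                      # 1000 examples, one feature
y = 2 * X + 1 + rng.normal( scale = 0.1, size = 1000 )

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 100                                   # mini-batch SGD
epochs = 5

for epoch in range ( epochs ):
    order = rng.permutation( 1000 )                # shuffle every epoch
    for start in range ( 0, 1000, batch_size ):    # 1000 / 100 = 10 iterations per epoch
        batch = order[ start : start + batch_size ]
        error = ( w * X[ batch ] + b ) - y[ batch ]
        w -= learning_rate * 2 * np.mean( error * X[ batch ] )
        b -= learning_rate * 2 * np.mean( error )
    print ( "epoch", epoch, "w =", round( w, 3 ), "b =", round( b, 3 ) )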
2) Logistic Regression

Unlike Linear Regression, Logistic Regression is a mechanism for calculating
probabilities instead of numeric values.

There are two ways the returned probability can be used :

- as a numeric value ( e.g. 0.932, meaning 93.2% )
- converted into a binary category ( e.g. True / False ; Spam / Not Spam )

As we said earlier, Logistic Regression only returns a probability, so the output should
be a numeric value between 0 and 1.

How do we achieve that?

The answer is Sigmoid function.

Again, it sounds complicated, but actually "Sigmoid" just means S-shaped.

It has the formula :

f(x) = ( 1 + e^-x ) ^-1

As x approaches negative infinity, f(x) approaches 0 ; as x approaches positive infinity,
f(x) approaches 1.

Similarly, it also uses the following equation to represent the linear relationship :

z = b + w1x1+ w2x2 + ... + wNxN


- z is the output of the linear equation, called the log odds
- b is the bias
- w's are the model's learned weights
- x's are the feature values

After that, to yield a probability, we use the Sigmoid function.

y' = ( 1 + e^-z ) ^ -1 ( z is the linear output just now )
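
A tiny sketch of the two steps, log odds first and then the sigmoid squash ( the
weights and inputs are made up ) :

import numpy as np

def sigmoid ( z ):
    return 1.0 / ( 1.0 + np.exp( -z ) )

b = -1.5
w = np.array( [ 0.8, 2.0 ] )    # hypothetical learned weights
x = np.array( [ 1.2, 0.4 ] )    # one example's feature values

z = b + np.dot( w, x )          # the log odds
y_prob = sigmoid( z )           # squashed into the range ( 0, 1 )
print ( z, y_prob )             # about 0.26 and 0.56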

Logistic Regression models are trained using the same process as Linear Regression
models, except for two things :
- they use Log Loss instead of squared loss
- they apply regularization to prevent overfitting

Log Loss :

Log loss is designed to punish confidently wrong predictions as they deserve, for example :

- if the correct answer is 1 and the predicted answer is 0.1, the squared loss is only 0.81 ;
- instead, if we use log loss, - log ( 0.1 ) ≈ 2.3, a much bigger penalty.

So :
- when the correct label is 1, the loss is - log ( y' ) ( y' stands for the predicted probability here ) ;
- when the correct label is 0, the loss is - log ( 1 - y' ).

Combining these together, since the predicted probability is rarely exactly 0 or 1 :

Log Loss = - y * log ( y' ) - ( 1 - y ) * log ( 1 - y' )


- y is the actual label, while y' is the predicted probability
In this way, we are able to calculate the loss of the logistic model.
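
Here is the 0.1-versus-1 example worked out in code, with squared loss shown for
comparison ( natural log, as usual ) :

import numpy as np

y = 1.0        # actual label
y_pred = 0.1   # predicted probability

squared_loss = ( y - y_pred ) ** 2
log_loss = - ( y * np.log( y_pred ) + ( 1 - y ) * np.log( 1 - y_pred ) )

print ( squared_loss )   # 0.81
print ( log_loss )       # about 2.3, a much bigger penalty for being confidently wrong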

Regularization :

When the model learns too many detailed features, it becomes overfitted.

Namely, the model perfectly fits the given dataset, yet very inaccurate for new or
unseen data.

We have two ways of regularization to prevent this from happening.

L2 Regularization ( "Complexity Tax" ) :

The model is penalised if it uses very large numbers to weigh features, so the
weights are smaller but more balanced.

In detail, it squares all the weights, adds them together, and multiplies the sum by a
"Tax Constant" λ.

It is considered part of the loss, and the constant λ controls how strict the
penalty is.
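
A minimal sketch of what the L2 penalty adds to the loss ( the weights, λ and the data
loss are made-up numbers ) :

import numpy as np

weights = np.array( [ 0.2, -1.5, 3.0, 0.7 ] )   # hypothetical learned weights
lam = 0.01                                      # the "tax constant" λ

l2_penalty = lam * np.sum( weights ** 2 )       # λ * ( w1² + w2² + ... )
data_loss = 0.35                                # pretend log loss on the data
total_loss = data_loss + l2_penalty

print ( l2_penalty, total_loss )   # roughly 0.118 and 0.468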

Early Stopping :

It simply stops training before the model goes on to learn overly detailed features.

More specifically, it limits the number of training steps, halting training while the loss
is still decreasing ( before the model has fully converged ).

And that's all for Linear Regression model and Logistic Regression model.

Pretty complex, huh?

Haha, just get familiar with it and everything will be fine.

You might take a small break and review what we've learnt so far.

And of course, I'll be more than happy if you want to move on.

But it's not for me, lol.

3) Classification

In a Logistic Regression model, we use the sigmoid function to convert the raw model
output into a probability value. But what if our goal is not to output a probability but a
category, for example "Spam" or "Not spam" ?

So firstly, we still use the probability output from the Logistic Regression model, then
we apply binary classification to convert it into a prediction of one of two classes.

What happens if we have multiple classes (more than two) ?

Don't worry, Let's go step by step.

How do we convert the numeric probability into something like a boolean value?
We set a threshold for it, called the classification threshold.

For example, if we set 0.5 as the threshold, then 51% is considered "Spam" and 49% is
considered "Not spam".

However, if only 0.01% of the samples fall above the threshold, the dataset is clearly
imbalanced.

Also keep in mind that the predicted probability is not reality, or the Ground Truth
( the actual label ).

So we will encounter four cases :

- Predicted positive and actually positive : True Positive (TP)

- Predicted positive but actually negative : False Positive (FP)

- Predicted negative but actually positive : False Negative (FN)

- Predicted negative and actually negative : True Negative (TN)

Obviously, we want fewer FP and FN, and more TP and TN ; the 2×2 table of these four
counts is called the confusion matrix.

There will be three scenarios of our datasets :

- Separated, where positive examples and negative examples are well differentiated.

- Unseparated, where many positive examples have lower scores than negative
examples.

- Imbalanced, containing very few examples of the positive class in the dataset.

Still, we want to know whether our classifications are accurate enough or not.

So we need to measure the accuracy.

Accuracy measures the proportion of correct classifications, which is
( TP + TN ) / ( TP + TN + FP + FN ).

An ideally perfect model would have an accuracy of 100%, or 1.0, but we could
hardly reach this.

It fairly measures the correctness of the predicted data, thus it is often the
Default Evaluation Metric used.

However, in real-world applications, one of FP or FN is often more costly than the other.

For example, missing important emails ( FP, legitimate mail flagged as spam ) is worse
than seeing some spam unexpectedly ( FN ).

If we want to focus on a certain category, we use the True Positive Rate / False
Positive Rate.

True Positive Rate = TP / (TP + FN), calculating the proportion of all actual
positive values being classified correctly.

False Positive Rate = FP / (TN + FP), the proportion of all actual negative values that
were wrongly classified as positive.

These rates compare the correct ( or wrong ) predictions against the whole actual category.

Precision, instead, looks at the positive predictions only :

Precision = TP / (TP + FP).

Thus, a short summary : Accuracy measures the overall correctness, while TPR and FPR
focus only on the actual positives or negatives respectively. Precision measures how
much the positive predictions are polluted by False Positives.

So we use :
- accuracy when we want the overall model performance, but it shouldn't be used for
imbalanced datasets
- Recall (TPR) when False Negatives are more costly
- FPR when False Positives are more costly
- Precision when it's important for positive predictions to be accurate
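
Putting the four counts and the formulas above together ( the counts are made up ) :

TP, FP, FN, TN = 40, 10, 5, 945   # made-up confusion-matrix counts, heavily imbalanced

accuracy  = ( TP + TN ) / ( TP + TN + FP + FN )   # 0.985, looks great mostly because negatives dominate
recall    = TP / ( TP + FN )                      # True Positive Rate, about 0.89
fpr       = FP / ( FP + TN )                      # False Positive Rate, about 0.01
precision = TP / ( TP + FP )                      # 0.8

print ( accuracy, recall, fpr, precision )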

These statements are based on a single classification threshold only, but in reality
we need to evaluate a model's quality across all possible thresholds ; that's why we
use the Receiver-Operating Characteristic curve and the Area Under the Curve.

Receiver-Operating Characteristic Curve (ROC) :

By trying possible thresholds at selected intervals, we can draw a graph of TPR
against FPR, and this curve is called the ROC curve.

At a high threshold (eg. 0.9), it has a low TPR (few positives caught) and a low
FPR (few false alarms).

Vice versa, at a low threshold it has a high TPR and a high FPR, placing the point near
the top right. ( something like this )

| /
|/
-------- FPR

Basically, it shows the trade-offs between catching more positives and making
mistakes on negatives.

Area Under the Curve (AUC) :

As the name says, it is the area under the ROC curve we just drew. An ideal model gives an
area of 1 ( the full unit square ), meaning the model ranks a randomly chosen positive
example above a randomly chosen negative example 100% of the time.

What we definitely don't want to see is a diagonal ROC : it means the model is just
flipping a coin. (50%)

However, all this works well for evaluating a roughly balanced dataset ; what do we do with
an imbalanced one?

We use the Precision-Recall Curve and calculate the area under that graph instead.
( which focuses on the true positives )

Just take note that the baseline of PR-AUC depends on the class imbalance, i.e. the
overall proportion of positives.

For example, if only 10% of emails are spam, then the baseline precision is 0.1, so any
model with PR-AUC > 0.1 is at least better than flipping a coin.
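
With scikit-learn ( already used in the earlier examples ), both areas can be computed
directly from the true binary labels and the predicted probabilities. A sketch, where
true_labels and probs are placeholder names for arrays you would already have :

from sklearn.metrics import roc_auc_score, average_precision_score

# true_labels : 0/1 ground truth ; probs : predicted probability of the positive class
roc_auc = roc_auc_score( true_labels, probs )
pr_auc = average_precision_score( true_labels, probs )   # a common stand-in for the area under the PR curve

print ( "ROC AUC:", roc_auc )
print ( "PR AUC ( average precision ):", pr_auc )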

The reason we do all this is not to trouble ourselves ; it is to find a better model and a
suitable threshold.

The points on the ROC curve closest to (0, 1), the top-left corner, represent a range of best thresholds.

Still, you can choose among these thresholds based on the relative importance of false
negatives and false positives.

As with Linear Regression models, we also use prediction bias to check how well-calibrated
the predictions are.

Just take the mean of the predictions and the mean of the ground-truth labels ; the
difference between them is the bias.

Ahhh... you might still remember what I said at first...

What happens with multiple classes ?

Well, we won't invent a new technique out of nowhere, we still prefer to use what
we already have.

We can split the classes into two groups : one class ( e.g. A ) versus all the
others ( e.g. B, C, D, E... ).

By repeating this step for every class, we are able to handle a multi-class model
( the one-vs-all approach ).

And one more tip : in real applications, to handle extreme data we use the Z-score.

The Z-score is how many standard deviations a value is from the mean.

For example, if the actual value is 70 while mean = 50 and std = 10, then the Z-score =
( 70 - 50 ) / 10 = 2.0 .

2. Data

1) working with numerical data

Data is very important in ML, most likely, we spend far more time on evaluating,
cleaning and transforming data than building models.

This unit focuses on numerical data, meaning integers or floating-point values that
behave like numbers.

Such as temperature, weight, or the number of deer wintering in a nature preserve...

But postal codes are not included, as doubling a postal code ( e.g. 247440 to
494880 ) has no actual meaning.

The model can't just grab the intended cells in the dataset ; it ingests an array of
floating-point values called a feature vector, and these values often need to be further
processed so that your model can learn from them better.

It might seem a bit odd that we end up training the model on processed data rather than
the raw, actual data.

But trust me, it produces better predictions, as it makes Normalization and
Binning possible.

You should be quite familiar with Normalization already : converting values into a
standard range.

Binning ( also called bucketing ) is converting numerical values into buckets of
ranges.

The next unit further covers Preprocessing, which converts non-numerical data into
floating-point values.

We'll talk about that in later chapters, let's deal with numerical data first.

Before creating feature vectors, we study the numerical data by :

- visualizing the data in plots or graphs
- getting statistics about the data

Visualize your Data

Graphs help to find anomalies or common patterns.

Normally, we recommend pandas for visualization.

I hope you still remember things like pd.read_csv( ) and describe( ).

Statistically Evaluate your Data

This is even simpler, as we are using terms like mean, median and standard
deviation to evaluate the data.

We also want to know the 0th, 25th, 50th, 75th, 100th percentiles. (the 0th is min
and the 100th is max)

Find the outliers

An outlier is a value distant from most other values ; we can often spot one easily in a
graph or plot.

Or statistically, if the delta between the 0th and 25th percentiles differs
significantly from the delta between the 75th and 100th percentiles, the dataset
probably contains outliers.

In detail, we can also compare the std with the mean ; if the std outweighs the mean, the
dataset probably contains outliers or is heavily skewed.

However, sometimes outliers also hide in seemingly well-balanced data, so don't
over-rely on basic statistics.

Now that you've found the outliers : if an outlier is just a mistake, you can simply delete
the examples containing it ;

But if it is a legitimate data point, will your model ultimately need to infer good
predictions on such outliers?
- If yes, keep them in your training set to make better predictions.
- If no, delete the outliers or apply techniques such as clipping.

I'm really impressed by this code.

Let's say we suspect the Thursday data is abnormal, and we want to compare it with the
other days. The dataset records the calories we take in every day, with 50 data points per day.

How do we find the anomalies?

# training_df holds 4 weeks x 7 days x 50 subjects of calorie records, in order
Day_4 = 0
Non_day_4 = 0
count1 = 0
count2 = 0

for week in range(0, 4):
    for day in range(0, 7):
        for subject in range(0, 50):
            # each week occupies 350 rows, each day 50 rows
            position = (week * 350) + (day * 50) + subject
            if day == 4:
                count1 += 1
                Day_4 += training_df["calories"][position]
            else:
                count2 += 1
                Non_day_4 += training_df["calories"][position]

Mean_Day_4 = Day_4 / count1
Mean_Non_day_4 = Non_day_4 / count2

print("The mean of Day 4 is %.0f" % (Mean_Day_4))
print("The mean of Non Day 4 is %.0f" % (Mean_Non_day_4))
print("The number of data points in Day 4 is %d" % (count1))
print("The number of data points in Non Day 4 is %d" % (count2))

output :
The mean of Day 4 is 93
The mean of Non Day 4 is 201
The number of data points in Day 4 is 200
The number of data points in Non Day 4 is 1200
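
The same comparison can be written more idiomatically with pandas boolean masks, assuming
the rows really are ordered week by week, day by day, 50 subjects at a time, with the
default integer row index ( as the positional code above implies ) :

# reconstruct the day-of-week from the row position ( 50 rows per day, 7 days per week )
day_of_week = ( training_df.index // 50 ) % 7
day_4_mask = day_of_week == 4

print ( "Mean of Day 4:", training_df.loc[ day_4_mask, "calories" ].mean() )
print ( "Mean of other days:", training_df.loc[ ~day_4_mask, "calories" ].mean() )
print ( "Rows in Day 4:", day_4_mask.sum() )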

Now, let's talk about one of the most important things in data analysis,
NORMALIZATION.

Normalization is to transform features to be on a similar scale.

It helps the model to converge faster, infer better predictions, avoid "NaN" trap
and learn proper weights.

"NaN" stands for not a number, meaning a floating-point number exceeds the
precision limit.

Without normalization, the model pays too much attention to features with wide
ranges.

For eg, - 0.5 < A < + 0.5, - 5.0 < B < + 5.0, at first, the model assumes tht B is
ten times more "important".

Therefore training will take longer than expected and the resulting model might be
suboptimal.

Because these two ranges are fairly close, the overall damage from not normalizing is
relatively small, but we still recommend normalizing Feature A and Feature B onto the same
scale, maybe -1.0 to +1.0.
But what if the case is -1 < C < +1, +5000 < D < + 1,000,000,000 ?

Now, if you don't normalize, your model will likely be suboptimal and take much
longer to converge or even fail.

BUT don't worry, there are three methods of normalization :

- linear scaling
- Z-score scaling
- log scaling

Actually, the first two methods are mentioned somewhere before, and this section
covers clipping.

Linear Scaling

Well, this might be the easiest, it converts floating-point values from their
natural range into a standard range.

The range is usually 0 to 1 or -1 to +1 ; if it is the former, then
x' = ( x - x_min ) / ( x_max - x_min ).

This method is used when :

- the lower and upper bounds of the data don't change much over time
- there are no or few outliers, and they are not extreme
- the feature is approximately uniformly distributed across its range.

Take note that, most real-world features, sadly, don't meet all of the criteria for
linear scaling.

Z-score scaling is typically a better choice.

Z-score Scaling

First, let's see the math behind, x' = ( x - mean ) / standard deviation.

It means that we are finding how many std in between raw value x and the mean.

And surprisingly, it doesn't change the shape shown on the histogram at all !

That makes it an ideal tool to normalize different ranges of data while keeping the
relationships between values the same.

This, again, doesn't really welcome outliers, so we may combine it with another
technique ( usually clipping ) to handle that situation.

Log Scaling

As the name suggests, it takes the logarithm of the raw value, normally the natural
logarithm ( ln ).

It works for data that conforms to a power law distribution, meaning :

- low values of X have very high values of Y
- and as X increases, Y decreases, so high values of X have very low values of Y.

A good example would be the ratings per movie.

A few movies have lots of ratings while most movies have very few user ratings.
The graph therefore looks like an inverse-proportion curve ; in this case, we use log
scaling to change the distribution.

Taking the log of a bigger number still gives a bigger output, but the scale is much more
balanced.

Clipping

Clipping minimizes the influence of extreme outliers.

Remember where you saw clipping for the first time ? Yes, in DE II : when the output of an
amplifier exceeds its supply voltage, the output is clipped to the supply voltage.

So, similarly, we are not going to eliminate all the outliers ; instead, we clip them to
the same value.

For example, if a dataset has lots of outliers greater than 4.0, we can simply clip all
values above 4.0 to exactly 4.0 .

In Z-scores, perhaps, you can clip Z-scores greater than 3 to become exactly 3 and
less than -3 to become -3.

( This is because about 99.7% of the data lies between -3 and +3. )

In short, linear scaling is used when a feature is uniformly distributed across a
fixed range ;

Z-score scaling is used when the feature distribution does not contain extreme
outliers ;

Log scaling is used when the feature conforms to a power law ;

Clipping is used when the influence of the remaining outliers needs to be reduced.
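
Here is one compact sketch of all four techniques on a pandas column ( the DataFrame and
column names are just placeholders ) :

import numpy as np

values = training_df[ "calories" ]   # any numerical feature column

# linear ( min-max ) scaling into the range 0 to 1
linear_scaled = ( values - values.min() ) / ( values.max() - values.min() )

# Z-score scaling : how many standard deviations away from the mean
z_scores = ( values - values.mean() ) / values.std()

# log scaling ( natural log ), for power-law-shaped, strictly positive features
log_scaled = np.log( values )

# clipping : squash Z-scores beyond ±3 to exactly ±3
clipped = z_scores.clip( lower = -3, upper = 3 )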

WELL, that's a lot of things, I have to rest for a while, hope I won't forget
everything after.
