Machine Learning
Basically, it is about training a model of our choice using the dataset provided.
import pandas as pd
iowa_file_path = '../input/home_Data/train.csv'
home_data = pd.read_csv(iowa_file_path)
Then, we create the target object ( normally called y ), which is the thing we want to predict ( eg. the prices of houses ).
y = home_data.SalePrice
To measure accuracy fairly, we use train_X and train_y to train the model, and val_X and val_y to check it.
To split the data that way, we import the train_test_split function.
We can then get the predictions made by the model using the .predict() method.
How to check the accuracy? Compare the predictions with the actual values ( the target object y ).
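Putting these pieces together, a minimal sketch of the whole flow might look like this ( the feature columns here are just an illustrative guess, not necessarily the ones the course uses ) :
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
# hypothetical feature columns, purely for illustration
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF']
X = home_data[feature_names]
# split the data into a training part and a validation part
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)                 # train on the training split
val_predictions = model.predict(val_X)      # predict on the validation split
print(mean_absolute_error(val_y, val_predictions))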
Overfitting occurs when the tree has too many leaves: the model basically memorises the training dataset, so it is almost 100% accurate on the training data but very inaccurate on new data.
On the flip side, underfitting occurs when there are too few leaves: the whole dataset is divided into only two or three groups, so the result is very inaccurate as well.
So, obviously, finding the proper number of leaves is the key, namely finding an appropriate value for max_leaf_nodes.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
mae_collection = []
for n in [5, 20, 100, 500, 2000]:
    new_model = DecisionTreeRegressor(max_leaf_nodes=n, random_state=0)
    new_model.fit(train_X, train_y)
    val_predictions = new_model.predict(val_X)
    mae_collection.append(mean_absolute_error(val_y, val_predictions))
print(mae_collection)
In this way, we can find a more suitable max_leaf_nodes, well, it's 100 in this
case.
So we're going to modify our model accordingly.
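A quick sketch of that ( reusing the variables from the snippets above ) :
final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)  # the best value we found
final_model.fit(train_X, train_y)
print(mean_absolute_error(val_y, final_model.predict(val_X)))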
Well, using this method, it's a little bit better than just now, but just a little
bit.
It is still quite inaccurate.
HOW?
We are slowly improving the accuracy of our model, but is this the end? Does it
100% match the actual data?
I want to say YES really loud, but the truth is NO.
In fact, it's just a little bit better than just now.
-_-
Don't worry, pushing the MAE closer to zero isn't our job, at least not before you become a data analyst.
-------------------------------------------------------------------------------------------
The above is actually a simple application of ML; let's dive into the concepts now.
Supervised learning is our focus for the time being, cuz it's more fundamental. By feeding labeled data into the models, the models can make predictions on new data, which is what we did just now.
Unsupervised learning means we feed unlabeled data into the model and we just want to find the trend among them. The model will automatically cluster the examples into different groups, and if u understand the dataset very well, u can rename the groups identified by the model.
But all in all, we are going to talk about supervised learning only, as we are
unable to understand other things yet.
The core concepts are Data, Model, Training, Evaluating and Inference.
Data:
we feed data into the model, either labeled or unlabeled datasets. First of all, each example is divided into two parts: the Label, which is the target we want ; and the Features, which are everything except the label. Our goal is to find the relationship between the features and the label ( actually, it is the model's goal, we don't have to do it ourselves. )
For eg, if we are doing a weather prediction model, specifically for rainy days, then the rainfall in cm will be our LABEL ; and all other things like how long it lasts, or the temp and humidity, are influencing factors, which we call FEATURES.
All we want to see is the model learning the relationship between the features and the label, so that when we feed in new features, it can predict the target label accurately.
( Btw, choosing the dataset is also important, we should find one that is large in size and highly diverse. But normally, "a large dataset doesn't guarantee sufficient diversity, and a dataset that is highly diverse doesn't guarantee sufficient examples", so, lol, just try ur best, I guess )
Model:
Now we are choosing which model is going to be used for prediction; though it is important, it is also pretty fixed, so we'll talk about it later.
( Oops, I almost forgot this: if we want a numerical value as output, like rainfall in cm, we use regression models ; if we want a word of description, we need to use a classification model instead. )
However, if u like to make things troublesome, u can also use a classification model for numerical values: take the numerical label we want, set thresholds, relabel the examples as low, medium and high, feed in this specialised dataset for classification, and we will get a predicted output of "low, medium or high", if, like I said, u love to get urself into trouble.
Training:
To put it simply, after the model makes some predictions, it compares the actual value with the predicted value, and adjusts itself to be more accurate based on that.
Evaluating:
After we train the model successfully, we need to evaluate the accuracy of the
model, to see whether it is reliable or not.
Normally, we will try some new data, compare the predicted values with the actual values, find the MAE, and decide whether it is reliable to use, or whether it is trash we should leave in the dustbin.
Inference:
If u think this model doesn't belong in the dustbin, then we might use it for predictions. For eg, weather predictions. Still, we feed in those unlabeled Features, like temp, humidity, atmospheric pressure... and get the predicted amount of rainfall.
I also don't know why it's called inference instead of simply saying predicted
outputs...
And often, the mae of the model is pretty big, which means it is pretty inaccurate,
well, I guess I know why the weather reports are so unreliable these days.
Ok, now we are about to dive into deeper and detailed concepts, ready?
GO!
1. ML Models
1) Linear Regression
This model is quite mathematical: there will be only one feature and one label, and when we try to find the relationship between them, we find that the dots on the feature-label graph have a best-fit line. In math, we use y = mx + b.
Similarly, we use y' = b + w1x1 to represent the best-fit line in this model.
y' is predicted by the model, x1 is the given input feature, and b and w1 are calculated by the model.
You might have already noticed that we write x "1" and w "1", meaning building such a model with multiple features is possible: y' = b + w1x1 + w2x2 + ... + wnxn.
In that case, it finds a linear relationship between the label and every feature.
Alright, u might be a bit overwhelmed, I'll show u one example, so that u'll be
even more overwhelmed. HAHA.
Kidding.
Let's say we have a model that predicts the gas mileage of a car; the possible factors will be :
w2 : engine displacement
w3 : acceleration
w4 : number of cylinders
w5 : horsepower
Well, I admit that it involves a bit of luck to find features that are all linearly related, but this is truly the simplest model.
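To make y' = b + w1x1 + w2x2 + ... feel concrete, here is a tiny sketch with completely made-up weights ( not from any real car dataset ) :
b = 30.0                               # bias, a made-up base mileage
w = [-0.02, -0.5, -1.0, -0.05]         # made-up weights for displacement, acceleration, cylinders, horsepower
x = [150.0, 9.5, 4.0, 110.0]           # one car's feature values, also made up
y_pred = b + sum(wi * xi for wi, xi in zip(w, x))
print(y_pred)                          # the model's predicted gas mileage for this car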
I mean, it will be a bit inaccurate, errr, maybe a lot inaccurate, but this is smth
easy for us to understand, isn't it?
Okok, I admit that it's very inaccurate, especially when it comes to a diverse and
large dataset.
But u need to measure this kind of inaccuracy, to prove that u are correct, and this metric is called LOSS.
Loss is a numerical metric that describes the inaccuracy of the model, so my first thought is, smth like standard deviation?
Oh, pretty much like what I thought, but it's much simpler than standard deviation.
Loss is based on the differences between the actual values and the predicted values.
There are two types of loss, L1 and L2. (sounds stupid, of cuz it is loss1 n loss2)
Our goal is to measure the inaccuracy, so we just want to see how far the points are from the "best-fit" line, and we don't want to see negative signs during the calculation.
In this case, we have two methods: one simply takes the absolute value of all the distances, the other one squares them.
However, L1 and L2 are the sums of the absolute / squared values, and we want to see the average performance.
Here come the Mean Absolute Error ( L1 / n ) and the Mean Squared Error ( L2 / n ).
( Actually, I personally think that MSE is more accurate, but I haven't really tried it out yet, maybe tmr we'll see )
Oh, my bad, when u choose between the two metrics, u only need to focus on one thing, the OUTLIER.
An outlier is smth far out of range, eg. in a class, there might be an extremely-good student and an extremely-bad student.
If you want to ignore them, use MAE ; if you want to include them in your model, use MSE.
In other words, MSE will be affected by the outlier values while MAE is much less affected.
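A quick way to see this with made-up numbers :
def mae(errors): return sum(abs(e) for e in errors) / len(errors)
def mse(errors): return sum(e * e for e in errors) / len(errors)
typical = [1.0, 2.0, 1.5, 2.5]            # typical prediction errors
with_outlier = typical + [20.0]           # the same errors plus one big outlier
print(mae(typical), mae(with_outlier))    # 1.75 vs 5.4, MAE is pulled up a bit
print(mse(typical), mse(with_outlier))    # 3.375 vs 82.7, MSE is pulled up a lot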
I really want to paste some pics here, but I couldn't, stupid notebook.
You might take a break if u read through everything till here just now, u can also choose to move on if u just started.
Although we could calculate the bias and weight manually, we'd better leave it to the model.
The model will start from zero bias and weight, check the loss, and adjust the bias and weight a little bit to fit better.
In the end, when the loss has been minimised, the model is said to have converged.
On the 2D loss curve, the curve becomes flat once the model has converged. ( it can't reduce the loss anymore )
If you are good at math, the loss functions for linear models always produce a
convex surface (3D), where weight is on the x-axis, bias is on the y-axis, and loss
is on the z-axis.
The model can converge because the loss surface is convex and contains a point where the slope with respect to the weight and the bias is almost zero. ( "almost" means it never truly finds the minimum value, but it can find a very close one )
There are three variables that control different aspects of training, Learning
rate, Batch size and Epochs.
Don't get confused: parameters are variables like the weight and bias, calculated by the model ; hyperparameters are values u can control, from outside the model.
Learning rate :
If it's too low, then it takes forever to finish the training ; however, if it's too high, the model can't even converge.
So we need to choose a reasonable learning rate, eg. if the learning rate = 0.01 and the gradient = 2.5, then the weight is changed by 0.01 × 2.5 = 0.025.
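In code, a single update step of that kind might look like this ( a toy sketch, the gradient would normally come from the loss function ) :
learning_rate = 0.01
gradient = 2.5                               # pretend this came from the loss function
weight = 1.0                                 # current weight, arbitrary starting value
weight = weight - learning_rate * gradient   # move against the gradient by 0.025
print(weight)                                # 0.975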
An ideal learning rate reduces loss significantly in first few iterations (cycles),
and still can find the minimum point.
Batch size :
In fact, the dataset might contain millions of examples, and in each iteration we only feed the model a subset of them; the batch size is how many examples we use per iteration.
One option is stochastic gradient descent, or SGD. I know what you are thinking about, but sorry, you can't get SGD dollars out of that. XD
It sounds very complicated, but indeed it simply uses a single example per iteration.
(only one dot at a time, i.e. batch size = 1)
But there will be a lot of "noise", which means small fluctuations, throughout the whole loss curve.
Mini-batch SGD instead uses a random group of examples (32, 64, etc), reducing the fluctuations effectively.
You might think that, then why do we still have SGD? Isn't mini-batch SGD always giving a less 'noisy' graph?
You are right, but sometimes, a noisy graph is what we want; to put it simply, it shakes the model and prevents overfitting, which is widely used in neural networks.
Don't worry, we are not there yet. Or maybe you have to worry, cuz you are not
there yet. haha.
Epochs :
It is even simpler: one epoch means every example in the training set has been processed once.
Let's say we have a training set of 1000 examples, with a mini-batch SGD batch size of 100; thus it takes 10 iterations to complete one epoch.
Still, you need to set how many epochs u want the model to train.
Generally, more epochs means better accuracy, but also takes more time.
So in many cases, we'll experiment with how many epochs it takes for the model to
converge.
2) Logistic Regression
Logistic Regression models are trained using the same process as Linear Regression
models, except for two things :
- uses Log Loss instead of squared loss
- applies regularization to prevent overfitting
Log Loss :
- if the correct answer is 1 and the predicted answer is 0.1, the squared loss is only 0.81 ;
- instead, if we use log loss, -log(0.1) ≈ 2.3, a much bigger penalty. ( see the small check below )
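A small check of those two numbers ( nothing fancy, just the formulas ) :
import math
y_true = 1                                # the correct answer
p = 0.1                                   # the model's predicted probability
squared_loss = (y_true - p) ** 2          # 0.81
log_loss = -math.log(p)                   # about 2.3, a much bigger penalty
print(squared_loss, log_loss)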
Regularization :
When the model learns too many overly detailed features, it becomes overfitted.
Namely, the model perfectly fits the given dataset, yet is very inaccurate for new or unseen data.
With regularization, the model is penalised if it uses very large numbers to weigh features, so the weights come out smaller but more balanced.
In detail, it squares all the weights, adds them together, and multiplies the sum by a "tax constant" λ ( this is the L2 regularization term ).
That term is considered part of the loss, and the constant λ controls how strict the penalty is.
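As a tiny sketch of that penalty term ( the weights and λ here are made up ) :
weights = [0.2, -1.5, 3.0, 0.8]        # made-up model weights
lam = 0.01                             # the "tax constant" λ, a hyperparameter we choose
l2_penalty = lam * sum(w * w for w in weights)
print(l2_penalty)                      # this amount gets added to the loss during training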
Early Stopping :
It simply stops the model before it continues to learn more and more detailed features.
In more detail, it limits the number of training steps, halting training while the loss is still decreasing.
And that's all for Linear Regression model and Logistic Regression model.
You might take a small break and review what we've learnt so far.
3) Classification
In a Logistic Regression model, we use the sigmoid function to convert the raw model output to a probability value. But what if our goal is not to output a probability but a category, for eg, "Spam" or "Not spam" ?
So firstly, we still take the probability output from the Logistic Regression model, then we use binary classification to convert it into a prediction of one of two classes.
How do we convert the numeric probability into smth like a boolean value?
We set a threshold for it, called the classification threshold.
For eg, if we set 0.5 as the threshold, then 51% is considered "Spam" and 49% is considered "Not spam".
However, if only 0.01% of the samples belong to the positive class, the dataset is class-imbalanced, and such metrics need extra care.
For every prediction, there are four possible outcomes :
- Predicted positive aligns with actual positive, which is a True Positive (TP)
- Predicted positive aligns with actual negative, which is a False Positive (FP)
- Predicted negative aligns with actual positive, which is a False Negative (FN)
- Predicted negative aligns with actual negative, which is a True Negative (TN)
Obviously, we want fewer FP n FN, and more TP n TN; the 2x2 table of these four counts is called the confusion matrix.
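Counting them up is straightforward ( a toy sketch with made-up labels ) :
actual    = [1, 0, 1, 1, 0, 0, 1, 0]     # 1 = spam, 0 = not spam
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
print(tp, fp, fn, tn)                    # 3, 1, 1, 3 for these made-up lists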
Depending on the dataset, the scores can be :
- Separated, where positive examples and negative examples are well differentiated.
- Unseparated, where many positive examples have lower scores than negative examples.
- Imbalanced, containing very few examples of the positive class in the dataset.
Still, we want to know whether our classifications are accurate enough or not.
An ideally perfect model would have an accuracy of 100%, or 1.0, but we can hardly ever reach this.
Accuracy = (TP + TN) / (TP + TN + FP + FN); it fairly measures the overall correctness of the predictions, thus it is often the default evaluation metric used.
But it treats every kind of mistake equally, while for eg, missing important emails (FP) is worse than seeing spam unexpectedly (FN).
True Positive Rate (Recall) = TP / (TP + FN), the proportion of all actual positives that are classified correctly.
Thus, a short summary: Accuracy measures the overall correctness, TPR and FPR each focus only on the actually positive or actually negative examples, and Precision measures how many of the predicted positives are actually positive ( TP / (TP + FP) ). ( a small sketch follows the list below )
So we use
- accuracy when we want the overall model performance, but it shouldn't be used for imbalanced datasets.
- Recall (TPR) when False Negatives are more costly
- FPR when False Positives are more costly
- Precision when it's important for positive predictions to be accurate
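Here's the small sketch mentioned above, reusing the tp / fp / fn / tn counts from the confusion-matrix example :
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)               # also called TPR
fpr       = fp / (fp + tn)
print(accuracy, precision, recall, fpr)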
These metrics are based on a single classification threshold only, but in reality, we need to evaluate a model's quality across all possible thresholds; that's why we use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).
At a high threshold (eg. 0.9), the model has a low TPR (few positives caught) and a low FPR (few false alarms), so its point sits near the bottom left.
Vice versa, at a low threshold, it has a high TPR and a high FPR, locating the point near the top right. (smth like this)
TPR
 |      ____
 |    _/
 |   /
 |  /
 | /
 |/
 +----------- FPR
Basically, it shows the trade-offs between catching more positives and making
mistakes on negatives.
As the name says, it is the area under the ROC curve we just drew. Typically, an ideal model gives a perfect square with side length 1 (AUC = 1.0), meaning the model ranks a randomly chosen positive example higher than a randomly chosen negative example 100% of the time.
What we definitely don't want to see is a diagonal ROC curve (AUC = 0.5); it means the model is basically flipping a coin.
For imbalanced datasets, we use the Precision-Recall Curve instead and calculate the area under that graph. (it focuses on the true positives)
Just take note that the baseline of PR-AUC depends on the class imbalance, i.e. the overall proportion of positives.
For eg, if only 10% of emails are spam, then the baseline precision is 0.1, so any model with PR-AUC > 0.1 is at least better than guessing at random.
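If u don't feel like building the curves by hand, scikit-learn can compute both areas from the predicted probabilities ( a sketch with made-up scores; average_precision_score is a close cousin of PR-AUC ) :
from sklearn.metrics import roc_auc_score, average_precision_score
actual = [1, 0, 1, 1, 0, 0, 1, 0]                     # made-up ground truth
scores = [0.9, 0.3, 0.35, 0.8, 0.55, 0.2, 0.7, 0.4]   # made-up predicted probabilities
print(roc_auc_score(actual, scores))                  # ROC-AUC
print(average_precision_score(actual, scores))        # summary of the PR curve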
The points on a ROC curve closest to (0, 1) represent a range of the best thresholds.
Still, u can choose between these values based on the importance of false negatives or false positives.
Like with the Linear Regression model, we also use prediction bias to check how well the predicted values line up with reality.
Just take the mean of the predictions and the mean of the ground-truth labels; the difference between them is the bias.
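A quick sketch of that ( made-up numbers ) :
predictions = [0.8, 0.3, 0.6, 0.7]       # made-up predicted probabilities
labels      = [1, 0, 1, 0]               # made-up ground-truth labels
prediction_bias = sum(predictions) / len(predictions) - sum(labels) / len(labels)
print(prediction_bias)                   # ideally this should be close to zero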
Well, for multi-class problems, we won't invent a new technique out of nowhere, we still prefer to use what we already have.
We can split the classes into two groups: one of the classes (eg. A) and all the others (eg. B, C, D, E...).
By repeating this step until every class has taken a turn, we are able to handle a multi-class model (the so-called one-vs-rest approach).
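scikit-learn has a ready-made wrapper for exactly this trick; a sketch, assuming X and y hold a multi-class dataset :
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# X (features) and y (labels with classes A, B, C, ...) are assumed to exist already
ovr_model = OneVsRestClassifier(LogisticRegression())
ovr_model.fit(X, y)                      # trains one "this class vs the rest" model per class
print(ovr_model.predict(X[:5]))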
And one more tip: in real applications, to handle extreme data, we use the Z-score.
For eg, if the actual value is 70 while the mean = 50 and std = 10, then the Z-score = (70 - 50) / 10 = 2.0 .
2. Data
Data is very important in ML, most likely, we spend far more time on evaluating,
cleaning and transforming data than building models.
This unit focuses on numerical data, meaning integers or floating-point values that behave like numbers.
But postal codes are not included, because doubling a postal code (e.g. 247440 to 494880) has no actual meaning.
The model can't just grab the intended cells in the dataset; it ingests an array of floating-point values called a feature vector, and these values often need to be further processed so that your model can learn from them better.
It seems a bit counterintuitive, as the dataset holds the actual values but the model trains on the processed ones.
You should be quite familiar with Normalization already: converting values into a standard range.
We'll talk about that in later chapters; let's deal with numerical data first.
This is even simpler, as we are using terms like mean, median and standard deviation to evaluate the data.
We also want to know the 0th, 25th, 50th, 75th and 100th percentiles. (the 0th is the min and the 100th is the max)
An outlier is a value distant from most other values; we can easily find it using graphs or plots.
As a quick numeric check, compare the std with the mean: if the std is larger than the mean, the values are spread very widely, which hints at outliers or a skewed dataset.
Now you've found the outliers: if an outlier is just a mistake, you can simply delete the examples containing it ;
But if it is a legit data point, will your model ultimately need to infer good predictions on these outliers?
- If yes, keep them in your training set to make better predictions.
- If no, delete the outliers or apply techniques such as clipping.
Let's say we find the Thursday data abnormal, and we want to compare it with the other days.
The dataset is the calories we take in every day, and we record 50 readings per day.
# samples is assumed to be a list of (day, calories) readings loaded earlier
day_4 = [c for d, c in samples if d == 4]
non_day_4 = [c for d, c in samples if d != 4]
print(f"The mean of Day 4 is {sum(day_4) / len(day_4):.0f}")
print(f"The mean of Non Day 4 is {sum(non_day_4) / len(non_day_4):.0f}")
print(f"The number of readings in Day 4 is {len(day_4)}")
print(f"The number of readings in Non Day 4 is {len(non_day_4)}")
output :
The mean of Day 4 is 93
The mean of Non Day 4 is 201
The number of readings in Day 4 is 200
The number of readings in Non Day 4 is 1200
Now, let's talk about one of the most important things in data analysis,
NORMALIZATION.
It helps the model converge faster, infer better predictions, avoid the "NaN" trap and learn proper weights.
"NaN" stands for Not a Number; the "NaN trap" happens when a value during training exceeds the floating-point precision limit and turns into NaN.
Without normalization, the model pays too much attention to features with wide ranges.
For eg, if - 0.5 < A < + 0.5 and - 5.0 < B < + 5.0, at first the model assumes that B is ten times more "important".
Therefore training will take longer than expected and the resulting model might be
suboptimal.
Since these two ranges are already fairly close, the overall damage from not normalizing is relatively small, but we still recommend normalizing Feature A and Feature B onto the same scale, maybe -1.0 to +1.0.
But what if the case is -1 < C < +1, +5000 < D < + 1,000,000,000 ?
Now, if you don't normalize, your model will likely be suboptimal and take much
longer to converge or even fail.
There are a few common normalization techniques: linear scaling, Z-score scaling, log scaling and clipping; some of them were actually mentioned somewhere before.
Linear Scaling
Well, this might be the easiest: it converts floating-point values from their natural range into a standard range, usually 0 to 1 or -1 to +1; for the 0-to-1 case, x' = ( x - x_min ) / ( x_max - x_min ).
Take note that most real-world features, sadly, don't meet all of the criteria for linear scaling.
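A minimal sketch of the 0-to-1 case ( made-up values ) :
values = [18.0, 25.0, 40.0, 62.0, 90.0]            # made-up raw feature values
x_min, x_max = min(values), max(values)
scaled = [(v - x_min) / (x_max - x_min) for v in values]
print(scaled)                                      # everything now sits between 0.0 and 1.0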
Z-score Scaling
First, let's see the math behind it: x' = ( x - mean ) / standard deviation.
It tells us how many standard deviations the raw value x is away from the mean.
That makes it an ideal tool to normalize features with different ranges while keeping the relationships between values the same.
This, again, doesn't really welcome outliers, so we may combine it with another technique (usually clipping) to handle that situation.
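In code ( made-up values again ) :
values = [50.0, 55.0, 48.0, 70.0, 52.0]            # made-up raw feature values
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]
print(z_scores)                                    # how many std each value is away from the mean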
Log Scaling
As the name suggests, it calculates the logarithm of the raw value, and the base is normally e ( the natural logarithm, ln ).
It works for data that conforms to a power law distribution, meaning the distribution has a long tail :
A few movies have lots of ratings while most movies have very few user ratings.
Thus the graph appears like an inverse-proportion curve; in this case, we'll use Log Scaling to change the distribution.
Logging a bigger number still gives a bigger output, but the values become much more balanced.
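A quick sketch with made-up rating counts :
import math
rating_counts = [3, 12, 80, 1500, 250000]          # made-up number of ratings per movie
log_scaled = [math.log(c) for c in rating_counts]  # natural log, as mentioned above
print(log_scaled)                                  # the huge gaps become much more balanced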
Clipping
Remember where you saw clipping for the first time? Yes, in DE II: when the output of an amplifier is greater than the supply to the amplifier, the output gets clipped to the supply voltage.
So, similarly, we are not going to eliminate all the outliers; instead, we clip them to the same value.
For eg, if a dataset has lots of outliers greater than 4.0, we can simply clip all values above 4.0 to become exactly 4.0 .
With Z-scores, perhaps, you can clip Z-scores greater than +3 to become exactly +3 and those less than -3 to become -3.
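And clipping itself is just one line, reusing the ±3 Z-score rule from above ( made-up Z-scores ) :
z_scores = [-4.2, -1.0, 0.3, 2.1, 5.7]                 # made-up Z-scores with two outliers
clipped = [max(-3.0, min(3.0, z)) for z in z_scores]   # anything beyond ±3 becomes exactly ±3
print(clipped)                                         # [-3.0, -1.0, 0.3, 2.1, 3.0]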
Z-score scaling is used when the feature distribution does not contain extreme outliers.
Log scaling is used when the feature conforms to a power law.
WELL, that's a lot of things, I have to rest for a while, hope I won't forget
everything after.