Machine Learning
Basically, it is about training a model of our choice using the dataset provided.
import pandas as pd
iowa_file_path = '../input/home_Data/train.csv'
home_data = pd.read_csv(iowa_file_path)
Then, we create the target object ( normally called y ), which is the thing we want to predict ( eg. the prices of houses ).
y = home_data.SalePrice
To measure accuracy fairly, we use train_X and train_y to train the model, and val_X and val_y to check it.
To split the data that way, we import the train_test_split function.
We can then get the predictions made by the model using the .predict() method.
How to check the accuracy? Compare the predictions with the actual values ( the target object y ).
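Putting these pieces together, a minimal sketch of the whole flow might look like this ( the feature columns here are just an illustrative guess, not necessarily the ones the course uses ) :
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
# hypothetical feature columns, purely for illustration
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF']
X = home_data[feature_names]
# split the data into a training part and a validation part
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)                 # train on the training split
val_predictions = model.predict(val_X)      # predict on the validation split
print(mean_absolute_error(val_y, val_predictions))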
Overfitting occurs when the tree has too many leaves: the model basically memorises the training dataset, so it is almost 100% accurate on the training data but very inaccurate on new data.
On the flip side, underfitting occurs when there are too few leaves: the whole dataset is divided into only two or three groups, so the result is very inaccurate as well.
So, obviously, finding the proper number of leaves is the key, namely finding an appropriate value for max_leaf_nodes.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
mae_collection = []
for n in [5, 20, 100, 500, 2000]:
    new_model = DecisionTreeRegressor(max_leaf_nodes=n, random_state=0)
    new_model.fit(train_X, train_y)
    val_predictions = new_model.predict(val_X)
    mae_collection.append(mean_absolute_error(val_y, val_predictions))
print(mae_collection)
In this way, we can find a more suitable max_leaf_nodes, well, it's 100 in this
case.
So we're going to modify our model accordingly.
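A quick sketch of that ( reusing the variables from the snippets above ) :
final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)  # the best value we found
final_model.fit(train_X, train_y)
print(mean_absolute_error(val_y, final_model.predict(val_X)))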
Well, using this method, it's a little bit better than just now, but just a little
bit.
It is still quite inaccurate.
HOW?
We are slowly improving the accuracy of our model, but is this the end? Does it
100% match the actual data?
I want to say YES really loud, but the truth is NO.
In fact, it's just a little bit better than just now.
-_-
Don't worry, pushing the MAE closer to zero isn't our job, at least not before you become a data analyst.
-------------------------------------------------------------------------------------------
The above is actually a simple application of ML; let's dive into the concepts now.
Supervised learning is our focus for the time being, cuz it's more fundamental. By feeding labeled data into the models, the models can make predictions on new data, which is what we did just now.
Unsupervised learning means we feed unlabeled data into the model and we just want to find the trend among them. The model will automatically cluster the examples into different groups, and if u understand the dataset very well, u can rename the groups identified by the model.
But all in all, we are going to talk about supervised learning only, as we are
unable to understand other things yet.
The core concepts are Data, Model, Training, Evaluating and Inference.
Data:
we feed data into the model, either labeled or unlabeled datasets. First of all, each example is divided into two parts: the Label, which is the target we want ; and the Features, which are everything except the label. Our goal is to find the relationship between the features and the label ( actually, it is the model's goal, we don't have to do it ourselves. )
For eg, if we are doing a weather prediction model, specifically for rainy days, then the rainfall in cm will be our LABEL ; and all other things like how long it lasts, or the temp and humidity, are influencing factors, which we call FEATURES.
All we want to see is the model learning the relationship between the features and the label, so that when we feed in new features, it can predict the target label accurately.
( Btw, choosing the dataset is also important, we should find one that is large in size and highly diverse. But normally, "a large dataset doesn't guarantee sufficient diversity, and a dataset that is highly diverse doesn't guarantee sufficient examples", so, lol, just try ur best, I guess )
Model:
Now we are choosing which model is going to be used for prediction; though it is important, it is also pretty fixed, so we'll talk about it later.
( Oops, I almost forgot this: if we want a numerical value as output, like rainfall in cm, we use regression models ; if we want a word of description, we need to use a classification model instead. )
However, if u like to make things troublesome, u can also use a classification model for numerical values: take the numerical label we want, set thresholds, relabel the examples as low, medium and high, feed in this specialised dataset for classification, and we will get a predicted output of "low, medium or high", if, like I said, u love to get urself into trouble.
Training:
To put it simply, after the model makes some predictions, it compares the actual value with the predicted value, and adjusts itself to be more accurate based on that.
Evaluating:
After we train the model successfully, we need to evaluate the accuracy of the
model, to see whether it is reliable or not.
Normally, we will try some new data, compare the predicted values with the actual values, find the MAE, and decide whether it is reliable to use, or whether it is trash we should leave in the dustbin.
Inference:
If u think this model doesn't belong in the dustbin, then we might use it for predictions. For eg, weather predictions. Still, we feed in those unlabeled Features, like temp, humidity, atmospheric pressure... and get the predicted amount of rainfall.
I also don't know why it's called inference instead of simply saying predicted
outputs...
And often, the mae of the model is pretty big, which means it is pretty inaccurate,
well, I guess I know why the weather reports are so unreliable these days.
Ok, now we are about to dive into deeper and detailed concepts, ready?
GO!
1. ML Models
1) Linear Regression
This model is quite mathematical: there will be only one feature and one label, and when we try to find the relationship between them, we find that the dots on the feature-label graph have a best-fit line. In math, we use y = mx + b.
Similarly, we use y' = b + w1x1 to represent the best-fit line in this model.
y' is predicted by the model, x1 is the given input feature, and b and w1 are calculated by the model.
You might have already noticed that we write x "1" and w "1", meaning building such a model with multiple features is possible: y' = b + w1x1 + w2x2 + ... + wnxn.
In that case, it finds a linear relationship between the label and every feature.
Alright, u might be a bit overwhelmed, I'll show u one example, so that u'll be
even more overwhelmed. HAHA.
Kidding.
Let's say we have a model that predicts the gas mileage of a car; the possible factors will be :
w2 : engine displacement
w3 : acceleration
w4 : number of cylinders
w5 : horsepower
Well, I admit that it involves a bit of luck to find features that are all linearly related, but this is truly the simplest model.
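To make y' = b + w1x1 + w2x2 + ... feel concrete, here is a tiny sketch with completely made-up weights ( not from any real car dataset ) :
b = 30.0                               # bias, a made-up base mileage
w = [-0.02, -0.5, -1.0, -0.05]         # made-up weights for displacement, acceleration, cylinders, horsepower
x = [150.0, 9.5, 4.0, 110.0]           # one car's feature values, also made up
y_pred = b + sum(wi * xi for wi, xi in zip(w, x))
print(y_pred)                          # the model's predicted gas mileage for this car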
I mean, it will be a bit inaccurate, errr, maybe a lot inaccurate, but this is smth
easy for us to understand, isn't it?
Okok, I admit that it's very inaccurate, especially when it comes to a diverse and
large dataset.
But u need to measure this kind of inaccuracy, to prove that u are correct, and this metric is called LOSS.
Loss is a numerical metric that describes the inaccuracy of the model, so my first thought is, smth like standard deviation?
Oh, pretty much like what I thought, but it's much simpler than standard deviation.
Loss is based on the differences between the actual values and the predicted values.
There are two types of loss, L1 and L2. (sounds stupid, of cuz it is loss1 n loss2)
Our goal is to measure the inaccuracy, so we just want to see how far the points are from the "best-fit" line, and we don't want to see negative signs during the calculation.
In this case, we have two methods: one simply takes the absolute value of all the distances, the other one squares them.
However, L1 and L2 are the sums of the absolute / squared values, and we want to see the average performance.
Here come the Mean Absolute Error ( L1 / n ) and the Mean Squared Error ( L2 / n ).
( Actually, I personally think that MSE is more accurate, but I haven't really tried it out yet, maybe tmr we'll see )
Oh, my bad, when u choose between the two metrics, u only need to focus on one thing, the OUTLIER.
An outlier is smth far out of range, eg. in a class, there might be an extremely-good student and an extremely-bad student.
If you want to ignore them, use MAE ; if you want to include them in your model, use MSE.
In other words, MSE will be affected by the outlier values while MAE is much less affected.
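A quick way to see this with made-up numbers :
def mae(errors): return sum(abs(e) for e in errors) / len(errors)
def mse(errors): return sum(e * e for e in errors) / len(errors)
typical = [1.0, 2.0, 1.5, 2.5]            # typical prediction errors
with_outlier = typical + [20.0]           # the same errors plus one big outlier
print(mae(typical), mae(with_outlier))    # 1.75 vs 5.4, MAE is pulled up a bit
print(mse(typical), mse(with_outlier))    # 3.375 vs 82.7, MSE is pulled up a lot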
I really want to paste some pics here, but I couldn't, stupid notebook.
You might take a break if u read through everything till here just now, u can also choose to move on if u just started.
Although we could calculate the bias and weight manually, we'd better leave it to the model.
The model will start from zero bias and weight, check the loss, and adjust the bias and weight a little bit to fit better.
In the end, when the loss has been minimised, the model is said to have converged.
On the 2D loss curve, the curve becomes flat once the model has converged. ( it can't reduce the loss anymore )
If you are good at math, the loss functions for linear models always produce a
convex surface (3D), where weight is on the x-axis, bias is on the y-axis, and loss
is on the z-axis.
The model can converge because the loss surface is convex and contains a point where the slope with respect to the weight and the bias is almost zero. ( "almost" means it never truly finds the minimum value, but it can find a very close one )
There are three variables that control different aspects of training, Learning
rate, Batch size and Epochs.
Don't get confused: parameters are variables like the weight and bias, calculated by the model ; hyperparameters are values u can control, from outside the model.
Learning rate :
If it's too low, then it takes forever to finish the training ; however, if it's too high, the model can't even converge.
So we need to choose a reasonable learning rate, eg. if the learning rate = 0.01 and the gradient = 2.5, then the weight is changed by 0.01 × 2.5 = 0.025.
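In code, a single update step of that kind might look like this ( a toy sketch, the gradient would normally come from the loss function ) :
learning_rate = 0.01
gradient = 2.5                               # pretend this came from the loss function
weight = 1.0                                 # current weight, arbitrary starting value
weight = weight - learning_rate * gradient   # move against the gradient by 0.025
print(weight)                                # 0.975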
An ideal learning rate reduces loss significantly in first few iterations (cycles),
and still can find the minimum point.
Batch size :
In fact, the dataset might contain millions of examples, and in each iteration we only feed the model a subset of them; the batch size is how many examples we use per iteration.
One option is stochastic gradient descent, or SGD. I know what you are thinking about, but sorry, you can't get SGD dollars out of that. XD
It sounds very complicated, but indeed it simply uses a single example per iteration.
(only one dot at a time, i.e. batch size = 1)
But there will be a lot of "noise", which means small fluctuations, throughout the whole loss curve.
Mini-batch SGD instead uses a random group of examples (32, 64, etc), reducing the fluctuations effectively.
You might think that, then why do we still have SGD? Isn't mini-batch SGD always giving a less 'noisy' graph?
You are right, but sometimes, a noisy graph is what we want; to put it simply, it shakes the model and prevents overfitting, which is widely used in neural networks.
Don't worry, we are not there yet. Or maybe you have to worry, cuz you are not
there yet. haha.
Epochs :
It is even simpler: one epoch means every example in the training set has been processed once.
Let's say we have a training set of 1000 examples, with a mini-batch SGD batch size of 100; thus it takes 10 iterations to complete one epoch.
Still, you need to set how many epochs u want the model to train.
Generally, more epochs means better accuracy, but also takes more time.
So in many cases, we'll experiment with how many epochs it takes for the model to
converge.
2) Logistic Regression
Logistic Regression models are trained using the same process as Linear Regression
models, except for two things :
- uses Log Loss instead of squared loss
- applies regularization to prevent overfitting
Log Loss :
- if the correct answer is 1 and the predicted answer is 0.1, the squared loss is only 0.81 ;
- instead, if we use log loss, -log(0.1) ≈ 2.3, a much bigger penalty. ( see the small check below )
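A small check of those two numbers ( nothing fancy, just the formulas ) :
import math
y_true = 1                                # the correct answer
p = 0.1                                   # the model's predicted probability
squared_loss = (y_true - p) ** 2          # 0.81
log_loss = -math.log(p)                   # about 2.3, a much bigger penalty
print(squared_loss, log_loss)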
Regularization :
When the model learns too many overly detailed features, it becomes overfitted.
Namely, the model perfectly fits the given dataset, yet is very inaccurate for new or unseen data.
With regularization, the model is penalised if it uses very large numbers to weigh features, so the weights come out smaller but more balanced.
In detail, it squares all the weights, adds them together, and multiplies the sum by a "tax constant" λ ( this is the L2 regularization term ).
That term is considered part of the loss, and the constant λ controls how strict the penalty is.
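As a tiny sketch of that penalty term ( the weights and λ here are made up ) :
weights = [0.2, -1.5, 3.0, 0.8]        # made-up model weights
lam = 0.01                             # the "tax constant" λ, a hyperparameter we choose
l2_penalty = lam * sum(w * w for w in weights)
print(l2_penalty)                      # this amount gets added to the loss during training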
Early Stopping :
It simply stops the model before it continues to learn more and more detailed features.
In more detail, it limits the number of training steps, halting training while the loss is still decreasing.
And that's all for Linear Regression model and Logistic Regression model.
You might take a small break and review what we've learnt so far.
3) Classification
In a Logistic Regression model, we use the sigmoid function to convert the raw model output to a probability value. But what if our goal is not to output a probability but a category, for eg, "Spam" or "Not spam" ?
So firstly, we still take the probability output from the Logistic Regression model, then we use binary classification to convert it into a prediction of one of two classes.
How do we convert the numeric probability into smth like a boolean value?
We set a threshold for it, called the classification threshold.
For eg, if we set 0.5 as the threshold, then 51% is considered "Spam" and 49% is considered "Not spam".
However, if only 0.01% of the samples belong to the positive class, the dataset is class-imbalanced, and such metrics need extra care.
For every prediction, there are four possible outcomes :
- Predicted positive aligns with actual positive, which is a True Positive (TP)
- Predicted positive aligns with actual negative, which is a False Positive (FP)
- Predicted negative aligns with actual positive, which is a False Negative (FN)
- Predicted negative aligns with actual negative, which is a True Negative (TN)
Obviously, we want fewer FP n FN, and more TP n TN; the 2x2 table of these four counts is called the confusion matrix.
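Counting them up is straightforward ( a toy sketch with made-up labels ) :
actual    = [1, 0, 1, 1, 0, 0, 1, 0]     # 1 = spam, 0 = not spam
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
print(tp, fp, fn, tn)                    # 3, 1, 1, 3 for these made-up lists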
Depending on the dataset, the scores can be :
- Separated, where positive examples and negative examples are well differentiated.
- Unseparated, where many positive examples have lower scores than negative examples.
- Imbalanced, containing very few examples of the positive class in the dataset.
Still, we want to know whether our classifications are accurate enough or not.
An ideally perfect model would have an accuracy of 100%, or 1.0, but we can hardly ever reach this.
Accuracy = (TP + TN) / (TP + TN + FP + FN); it fairly measures the overall correctness of the predictions, thus it is often the default evaluation metric used.
But it treats every kind of mistake equally, while for eg, missing important emails (FP) is worse than seeing spam unexpectedly (FN).
True Positive Rate (Recall) = TP / (TP + FN), the proportion of all actual positives that are classified correctly.
Thus, a short summary: Accuracy measures the overall correctness, TPR and FPR each focus only on the actually positive or actually negative examples, and Precision measures how many of the predicted positives are actually positive ( TP / (TP + FP) ). ( a small sketch follows the list below )
So we use
- accuracy when we want the overall model performance, but it shouldn't be used for imbalanced datasets.
- Recall (TPR) when False Negatives are more costly
- FPR when False Positives are more costly
- Precision when it's important for positive predictions to be accurate
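Here's the small sketch mentioned above, reusing the tp / fp / fn / tn counts from the confusion-matrix example :
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)               # also called TPR
fpr       = fp / (fp + tn)
print(accuracy, precision, recall, fpr)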
These metrics are based on a single classification threshold only, but in reality, we need to evaluate a model's quality across all possible thresholds; that's why we use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).
At a high threshold (eg. 0.9), the model has a low TPR (few positives caught) and a low FPR (few false alarms), so its point sits near the bottom left.
Vice versa, at a low threshold, it has a high TPR and a high FPR, locating the point near the top right. (smth like this)
TPR
 |      ____
 |    _/
 |   /
 |  /
 | /
 |/
 +----------- FPR
Basically, it shows the trade-offs between catching more positives and making
mistakes on negatives.
As the name says, it is the area under the ROC curve we just drew. Typically, an ideal model gives a perfect square with side length 1 (AUC = 1.0), meaning the model ranks a randomly chosen positive example higher than a randomly chosen negative example 100% of the time.
What we definitely don't want to see is a diagonal ROC curve (AUC = 0.5); it means the model is basically flipping a coin.
For imbalanced datasets, we use the Precision-Recall Curve instead and calculate the area under that graph. (it focuses on the true positives)
Just take note that the baseline of PR-AUC depends on the class imbalance, i.e. the overall proportion of positives.
For eg, if only 10% of emails are spam, then the baseline precision is 0.1, so any model with PR-AUC > 0.1 is at least better than guessing at random.
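If u don't feel like building the curves by hand, scikit-learn can compute both areas from the predicted probabilities ( a sketch with made-up scores; average_precision_score is a close cousin of PR-AUC ) :
from sklearn.metrics import roc_auc_score, average_precision_score
actual = [1, 0, 1, 1, 0, 0, 1, 0]                     # made-up ground truth
scores = [0.9, 0.3, 0.35, 0.8, 0.55, 0.2, 0.7, 0.4]   # made-up predicted probabilities
print(roc_auc_score(actual, scores))                  # ROC-AUC
print(average_precision_score(actual, scores))        # summary of the PR curve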
The points on a ROC curve closest to (0, 1) represent a range of the best thresholds.
Still, u can choose between these values based on the importance of false negatives or false positives.
Like with the Linear Regression model, we also use prediction bias to check how well the predicted values line up with reality.
Just take the mean of the predictions and the mean of the ground-truth labels; the difference between them is the bias.
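A quick sketch of that ( made-up numbers ) :
predictions = [0.8, 0.3, 0.6, 0.7]       # made-up predicted probabilities
labels      = [1, 0, 1, 0]               # made-up ground-truth labels
prediction_bias = sum(predictions) / len(predictions) - sum(labels) / len(labels)
print(prediction_bias)                   # ideally this should be close to zero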
Well, for multi-class problems, we won't invent a new technique out of nowhere, we still prefer to use what we already have.
We can split the classes into two groups: one of the classes (eg. A) and all the others (eg. B, C, D, E...).
By repeating this step until every class has taken a turn, we are able to handle a multi-class model (the so-called one-vs-rest approach).
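scikit-learn has a ready-made wrapper for exactly this trick; a sketch, assuming X and y hold a multi-class dataset :
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# X (features) and y (labels with classes A, B, C, ...) are assumed to exist already
ovr_model = OneVsRestClassifier(LogisticRegression())
ovr_model.fit(X, y)                      # trains one "this class vs the rest" model per class
print(ovr_model.predict(X[:5]))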
And one more tip: in real applications, to handle extreme data, we use the Z-score.
For eg, if the actual value is 70 while the mean = 50 and std = 10, then the Z-score = (70 - 50) / 10 = 2.0 .
2. Data
Data is very important in ML, most likely, we spend far more time on evaluating,
cleaning and transforming data than building models.
This unit focuses on numerical data, meaning integers or floating-point values that behave like numbers.
But postal codes are not included, because doubling a postal code (e.g. 247440 to 494880) has no actual meaning.
The model can't just grab the intended cells in the dataset; it ingests an array of floating-point values called a feature vector, and these values often need to be further processed so that your model can learn from them better.
It seems a bit counterintuitive, as the dataset holds the actual values but the model trains on the processed ones.
You should be quite familiar with Normalization already: converting values into a standard range.
We'll talk about that in later chapters; let's deal with numerical data first.
This is even simpler, as we are using terms like mean, median and standard deviation to evaluate the data.
We also want to know the 0th, 25th, 50th, 75th and 100th percentiles. (the 0th is the min and the 100th is the max)
An outlier is a value distant from most other values; we can easily find it using graphs or plots.
As a quick numeric check, compare the std with the mean: if the std is larger than the mean, the values are spread very widely, which hints at outliers or a skewed dataset.
Now you've found the outliers: if an outlier is just a mistake, you can simply delete the examples containing it ;
But if it is a legit data point, will your model ultimately need to infer good predictions on these outliers?
- If yes, keep them in your training set to make better predictions.
- If no, delete the outliers or apply techniques such as clipping.
Let's say we find the Thursday data abnormal, and we want to compare it with the other days.
The dataset is the calories we take in every day, and we record 50 readings per day.
# samples is assumed to be a list of (day, calories) readings loaded earlier
day_4 = [c for d, c in samples if d == 4]
non_day_4 = [c for d, c in samples if d != 4]
print(f"The mean of Day 4 is {sum(day_4) / len(day_4):.0f}")
print(f"The mean of Non Day 4 is {sum(non_day_4) / len(non_day_4):.0f}")
print(f"The number of readings in Day 4 is {len(day_4)}")
print(f"The number of readings in Non Day 4 is {len(non_day_4)}")
output :
The mean of Day 4 is 93
The mean of Non Day 4 is 201
The number of readings in Day 4 is 200
The number of readings in Non Day 4 is 1200
Now, let's talk about one of the most important things in data analysis,
NORMALIZATION.
It helps the model converge faster, infer better predictions, avoid the "NaN" trap and learn proper weights.
"NaN" stands for Not a Number; the "NaN trap" happens when a value during training exceeds the floating-point precision limit and turns into NaN.
Without normalization, the model pays too much attention to features with wide ranges.
For eg, if - 0.5 < A < + 0.5 and - 5.0 < B < + 5.0, at first the model assumes that B is ten times more "important".
Therefore training will take longer than expected and the resulting model might be
suboptimal.
Since these two ranges are already fairly close, the overall damage from not normalizing is relatively small, but we still recommend normalizing Feature A and Feature B onto the same scale, maybe -1.0 to +1.0.
But what if the case is -1 < C < +1, +5000 < D < + 1,000,000,000 ?
Now, if you don't normalize, your model will likely be suboptimal and take much
longer to converge or even fail.
There are a few common normalization techniques: linear scaling, Z-score scaling, log scaling and clipping; some of them were actually mentioned somewhere before.
Linear Scaling
Well, this might be the easiest: it converts floating-point values from their natural range into a standard range, usually 0 to 1 or -1 to +1; for the 0-to-1 case, x' = ( x - x_min ) / ( x_max - x_min ).
Take note that most real-world features, sadly, don't meet all of the criteria for linear scaling.
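A minimal sketch of the 0-to-1 case ( made-up values ) :
values = [18.0, 25.0, 40.0, 62.0, 90.0]            # made-up raw feature values
x_min, x_max = min(values), max(values)
scaled = [(v - x_min) / (x_max - x_min) for v in values]
print(scaled)                                      # everything now sits between 0.0 and 1.0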
Z-score Scaling
First, let's see the math behind it: x' = ( x - mean ) / standard deviation.
It tells us how many standard deviations the raw value x is away from the mean.
That makes it an ideal tool to normalize features with different ranges while keeping the relationships between values the same.
This, again, doesn't really welcome outliers, so we may combine it with another technique (usually clipping) to handle that situation.
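In code ( made-up values again ) :
values = [50.0, 55.0, 48.0, 70.0, 52.0]            # made-up raw feature values
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]
print(z_scores)                                    # how many std each value is away from the mean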
Log Scaling
As the name suggests, it calculates the logarithm of the raw value, and the base is normally e ( the natural logarithm, ln ).
It works for data that conforms to a power law distribution, meaning the distribution has a long tail :
A few movies have lots of ratings while most movies have very few user ratings.
Thus the graph appears like an inverse-proportion curve; in this case, we'll use Log Scaling to change the distribution.
Logging a bigger number still gives a bigger output, but the values become much more balanced.
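A quick sketch with made-up rating counts :
import math
rating_counts = [3, 12, 80, 1500, 250000]          # made-up number of ratings per movie
log_scaled = [math.log(c) for c in rating_counts]  # natural log, as mentioned above
print(log_scaled)                                  # the huge gaps become much more balanced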
Clipping
Remember where you saw clipping for the first time? Yes, in DE II: when the output of an amplifier is greater than the supply to the amplifier, the output gets clipped to the supply voltage.
So, similarly, we are not going to eliminate all the outliers; instead, we clip them to the same value.
For eg, if a dataset has lots of outliers greater than 4.0, we can simply clip all values above 4.0 to become exactly 4.0 .
With Z-scores, perhaps, you can clip Z-scores greater than +3 to become exactly +3 and those less than -3 to become -3.
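And clipping itself is just one line, reusing the ±3 Z-score rule from above ( made-up Z-scores ) :
z_scores = [-4.2, -1.0, 0.3, 2.1, 5.7]                 # made-up Z-scores with two outliers
clipped = [max(-3.0, min(3.0, z)) for z in z_scores]   # anything beyond ±3 becomes exactly ±3
print(clipped)                                         # [-3.0, -1.0, 0.3, 2.1, 3.0]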
Z-score scaling is used when the feature distribution does not contain extreme outliers.
Log scaling is used when the feature conforms to a power law.
WELL, that's a lot of things, I have to rest for a while, hope I won't forget
everything after.