Supervised Learning
Deep Learning
#3 Supervised Learning
Informatics Study Program
Universitas Atma Jaya Yogyakarta
Odd Semester 2023/2024
Outline
• Classification and Regression
• Classification Algorithms: KNN, Decision Tree, Random Forest,
Gradient Boosting, Logistic Regression, Support Vector Machines,
Neural Networks
• Regression: Linear Regression, Ridge, Lasso
Classification and Regression
Classification
• Predicting class labels: a selection from an existing list of possibilities
• Binary classification: distinguishing between two classes only. Ex: spam email or not
• Multiclass classification: classification among more than two classes. Ex: irises: setosa, versicolor, or virginica
Regression
• The goal is to predict continuous numbers, or floating-point numbers in programming terms (real numbers in mathematical terms)
• An easy way to differentiate between classification and regression tasks is to ask whether there is some kind of continuity in the output
• Example: predicting a person's annual income from their education, their age, and where they live
• The predicted value is a number, and can be any number within a specified range
K-Nearest Neighbors
• The KNN algorithm is "lazy" because it does not learn a function that forms a decision boundary from the data.
• KNN learns from the training set by memorizing it.
• The KNN algorithm works in the following steps (a minimal sketch follows below):
  • Select the number k and the distance measure
  • Find the k nearest neighbors of the new point in the training set
  • Predict the class of the new point by majority vote among those k neighbors
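To make the three steps above concrete, here is a minimal sketch of the KNN procedure in plain NumPy; the toy arrays, the value of k, and the Euclidean distance are illustrative choices, not taken from the slides.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: k and the distance measure (here Euclidean) are chosen up front
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: find the k nearest neighbors in the training set
    nearest = np.argsort(distances)[:k]
    # Step 3: predict by majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Illustrative toy data: two points per class
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> 1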
KNN Illustration
• Based on the chosen distance measure, KNN finds the k points in the training set that are closest to (most similar to) the point we want to classify.
• The class label of the new data point is determined by the majority vote among those k nearest neighbors.
• Choosing the right value of k is very important to avoid overfitting and underfitting.
• We must make sure that the chosen distance measure is appropriate for the features of the dataset.
• Ex: using Euclidean distance requires data standardization so that each feature contributes equally to the distance measurement.
• The advantages of KNN are:
  • Easy to apply to simple cases
  • Quite reliable on unbalanced datasets
  • Simple model training process
• The disadvantages of KNN are:
  • If the training set is large, making predictions takes a long time (every new point is compared against all training data)
  • Very sensitive to outliers when calculating distances between data points
  • Cannot take the relevance or significance of features into account; if there are many unimportant features, they interfere with the learning of the KNN model
Curse of Dimensionality
• KNN is also susceptible to overfitting due to the curse of dimensionality.
• The curse of dimensionality is the phenomenon where, as the feature space grows in number of dimensions, the data becomes increasingly sparse.
• Points that appear close in low dimensions can be too far apart in higher dimensions to provide a good distance estimate.
• Models such as KNN and Decision Trees cannot apply regularization, so we need to use feature selection and feature reduction to help us avoid the curse of dimensionality (a sketch follows below).
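As a sketch of that idea, the pipeline below applies univariate feature selection (SelectKBest) before KNN; the breast cancer dataset, the f_classif score function, the choice of 10 selected features, and n_neighbors=5 are all illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale (so Euclidean distance treats features equally), keep only the
# 10 most informative features, then classify with KNN.
pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('feat_select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))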
KNN Code Ex
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
KNN = KNeighborsClassifier()
pipe_KNN = Pipeline(steps=[('scale', …), ('feat_select', …), ('clf', KNN)])
# params_KNN is the hyperparameter grid (not shown on the slide), e.g. {'clf__n_neighbors': [3, 5, 7]}
GSCV_KNN = GridSearchCV(pipe_KNN, params_KNN, cv=5)
GSCV_KNN.fit(X_train, y_train)
GSCV_KNN.score(X_test, y_test)
Decision Trees
• A Decision Tree is a model that is widely used for classification and regression tasks
• The model learns a hierarchy of if/else questions that lead to a decision
• Example: a model for distinguishing between four classes of animals (eagles, penguins, dolphins, and bears) using the three characteristics "has feathers," "can fly," and "has fins."
• Building a Decision Tree:
  • Learning the sequence of if/else questions that gets us to the correct answer most quickly
  • In a machine learning setting, these questions are called tests (as opposed to test sets)
  • Usually data does not come in the form of binary "yes/no" features as in the animal example, but is instead represented as continuous features, as in 2D datasets
• Building a Decision Tree
  • The algorithm searches over all possible tests and finds the one that is most informative about the target variable
  • This recursive process produces a binary Decision Tree, with each node containing a test
• Building a Decision Tree
  • The recursive partitioning of the data is repeated until each region in the partition (each leaf in the Decision Tree) contains only one target value (one class or one regression value)
  • A leaf that contains data points that all share the same target value is called a pure leaf
• Building a Decision Tree
  • A prediction for a new data point is made by finding the region of the feature-space partition where the point falls, and then predicting the majority target (or the single target, in the case of a pure leaf) in that region
  • The region is found by traversing the tree from the root, going left or right depending on whether each test is satisfied (the learned tests can be inspected as shown below)
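The learned sequence of tests can be printed with scikit-learn's export_text, which makes this root-to-leaf traversal visible; the iris dataset and max_depth value below are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Prints the if/else tests a new point follows from the root down to a leaf
print(export_text(tree, feature_names=list(iris.feature_names)))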
• Building a Decision Tree
  • If the partitioning continues until every leaf is pure, the resulting partition is too detailed: the tree overfits the training data
• Controlling the complexity of the Decision Tree
  • There are two general strategies to prevent overfitting: stopping the building of the tree early (pre-pruning), or building the full tree and then removing or collapsing nodes that contain little information (post-pruning, or simply pruning)
  • Possible criteria for pre-pruning include limiting the maximum depth of the tree or limiting the maximum number of leaves
  • Decision Trees in scikit-learn are implemented in the DecisionTreeRegressor and DecisionTreeClassifier classes; see the pre-pruning sketch below
  • Scikit-learn focuses on pre-pruning (recent versions also offer cost-complexity post-pruning via the ccp_alpha parameter)
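A minimal sketch of pre-pruning with max_depth: the unrestricted tree memorizes the training set, while the depth-limited tree is simpler and typically generalizes better. The breast cancer dataset and the depth of 4 are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Compare training and test accuracy with and without pre-pruning
print("no pruning  train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("max_depth=4 train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))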
Feature importances
• Instead of looking at the entire tree, there are several useful attributes we can derive to summarize how the tree works.
• The most commonly used summary is feature importances, assessing how important each feature is to the decisions the tree makes.
• It is a number between 0 and 1 for each feature, where 0 means "not used at all" and 1 means "predicts the target perfectly."
• Feature importance in trees

import matplotlib.pyplot as plt
import numpy as np

# assumes `cancer` is the loaded breast cancer dataset and `tree` a fitted DecisionTreeClassifier
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")

plot_feature_importances_cancer(tree)
Decision Tree Code Ex
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
DT = DecisionTreeClassifier(random_state=0)
pipe_DT = Pipeline(steps=[('scale', …), ('feat_select', …), ('clf', DT)])
# params_DT is the hyperparameter grid (not shown on the slide), e.g. {'clf__max_depth': [2, 3, 4, 5]}
GSCV_DT = GridSearchCV(pipe_DT, params_DT, cv=5)
GSCV_DT.fit(X_train, y_train)
GSCV_DT.score(X_test, y_test)
Ensembles of Decision Trees
• Ensembles: methods that combine multiple machine learning models to create a more powerful model.
• There are many models in the machine learning literature that fall into this category, but two ensemble models have proven effective on a wide variety of datasets for classification and regression, both of which use Decision Trees as their building blocks: Random Forests and Gradient Boosted Trees
• Random Forest: a collection of Decision Trees, where each tree is slightly different from the others
• The idea behind Random Forests is that each tree may do a relatively good job of predicting, but will likely overfit on part of the data; averaging many such trees reduces the overfitting
• There are two ways in which the trees in a Random Forest are randomized: by selecting the data points used to build each tree, or by selecting the features considered in each split test
• To build each tree, we first take what is called a bootstrap sample of our data
• From n_samples data points, we repeatedly sample points at random with replacement (meaning the same sample can be drawn multiple times), n_samples times
• To illustrate, say we want to create a bootstrap sample of the list ['a', 'b', 'c', 'd'] (see the sketch below)
• A possible bootstrap sample is ['b', 'd', 'd', 'c']. Another possible sample is ['d', 'a', 'd', 'a'].
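A tiny sketch of bootstrap sampling with NumPy, drawing n_samples items with replacement from the original list; the random seed is an illustrative choice, so the exact output will vary.

import numpy as np

rng = np.random.default_rng(0)
data = np.array(['a', 'b', 'c', 'd'])

# Draw len(data) items with replacement; repeated items are expected
bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)  # e.g. something like ['d' 'c' 'a' 'a']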
• Next, a Decision Tree is built on this newly created dataset
• The algorithm randomly selects a subset of the features and searches for the best possible test involving one of those features
• This feature-subset selection is repeated separately at each node, so each node in the tree can make its decision using a different subset of the features
• To make a prediction with a Random Forest, the algorithm first makes a prediction with every tree in the "forest"
  • For regression, the per-tree results are averaged to get the final prediction
  • For classification, a "soft voting" strategy is used: each tree provides a probability for each class, the probabilities are averaged, and the class with the highest averaged probability is predicted (see the sketch below)
• Random Forests provide a much more intuitive decision boundary than any single tree. In any real application the model will use many more trees (often hundreds or thousands), leading to even finer boundaries
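A sketch of the soft-voting idea: average the class probabilities of the individual trees and pick the class with the highest averaged probability, which mirrors what RandomForestClassifier does internally. The dataset and the number of trees are illustrative.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Average the per-tree class probabilities "by hand" ...
per_tree_proba = np.stack([t.predict_proba(X_test) for t in forest.estimators_])
manual_pred = per_tree_proba.mean(axis=0).argmax(axis=1)

# ... and check that it matches the forest's own predictions
print(np.array_equal(manual_pred, forest.predict(X_test)))  # expected: True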
• Random Forests provide feature importances, which are calculated by combining the feature importances of each tree in the Random Forest
• Advantages of Random Forests: reliable, often work well without heavy parameter tuning, and do not require data scaling
• Random Forests tend not to perform well on sparse, very high-dimensional data, such as text data
• Random Forests require more memory and are slower to train and predict than linear models
• Important parameters to adjust are n_estimators, max_features, and possibly pre-pruning options such as max_depth
• Gradient Boosted Trees work by building trees sequentially, where each tree tries to correct the errors of the previous ones
• Gradient Boosted Trees often use very shallow trees, with a depth of one to five, which makes the model smaller in terms of memory and makes predictions faster
• By default there is no randomization in Gradient Boosted Trees; instead, strong pre-pruning is used
• The main idea behind Gradient Boosted Trees is to combine many simple models (known in this context as weak learners), such as shallow trees
• Each tree can only provide good predictions on part of the data, and more trees are added to iteratively improve performance (see the sketch below)
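A sketch of this iterative improvement using staged_predict, which yields the ensemble's predictions after each additional tree, so test accuracy can be tracked as trees are added; the dataset and parameter values are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=100, max_depth=1, learning_rate=0.1, random_state=0)
gbt.fit(X_train, y_train)

# Test accuracy after 1, 25, 50, 75, and 100 trees
staged = list(gbt.staged_predict(X_test))
for n in (1, 25, 50, 75, 100):
    print(n, "trees:", accuracy_score(y_test, staged[n - 1]))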
• The feature importances of a Gradient Boosted Tree model are somewhat similar to those of a Random Forest, although Gradient Boosting completely ignores some of the features
• Advantages: one of the most powerful and widely used models for supervised learning
• Disadvantages: requires careful parameter tuning and may take a long time to train
• The main parameters of the model are n_estimators and learning_rate, which controls how strongly each tree is allowed to correct the errors of the previous trees
• Another important parameter is max_depth (or alternatively max_leaf_nodes), which reduces the complexity of each tree
Random Forest Code Ex
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
RF = RandomForestClassifier(random_state=0)
pipe_RF = Pipeline(steps=[('scale', …), ('feat_select', …), ('clf', RF)])
params_RF = {'feat_select__k': …,
             'clf__n_estimators': [100, 150, 200],
             'clf__criterion': ['gini', 'entropy'],
             'clf__max_depth': [2, 3, 4, 5]}
GSCV_RF = GridSearchCV(pipe_RF, params_RF, cv=5)
GSCV_RF.fit(X_train, y_train)
GSCV_RF.score(X_test, y_test)
GradientBoostedTree Code Ex
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
GBT = GradientBoostingClassifier(random_state=0)
pipe_GBT = Pipeline(steps=[('scale', …), ('feat_select', …), ('clf', GBT)])
params_GBT = {'feat_select__k': …,
              'clf__n_estimators': [100, 150, 200],
              'clf__criterion': ['friedman_mse', 'squared_error'],
              'clf__max_depth': [2, 3, 4, 5],
              'clf__learning_rate': [0.1, 1, 10]}
GSCV_GBT = GridSearchCV(pipe_GBT, params_GBT, cv=5)
GSCV_GBT.fit(X_train, y_train)
GSCV_GBT.score(X_test, y_test)
Logistic Regression
• Logistic Regression is a basic but effective approach to linear, binary classification problems.
• Even though it has 'regression' in its name, Logistic Regression is a classification model, not a regression model.
• As a model for binary classification, Logistic Regression is fit by maximizing the 'likelihood' of the training labels.
Logistic Regression
• Logistic Regression applies a sigmoid function to the result of a linear equation of the input variables and produces a probabilistic output ranging between 0 and 1 (see the sketch below).
• A threshold value is applied to the output to classify new data into a certain class.
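A minimal sketch of this mechanism: sigmoid(w·x + b) gives a probability, and a threshold of 0.5 turns it into a class label. The weights, intercept, and input values below are illustrative numbers, not fitted coefficients.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])   # illustrative coefficients
b = 0.25                    # illustrative intercept
x_new = np.array([0.8, 0.3])

prob = sigmoid(w @ x_new + b)   # probability of the positive class (about 0.70 here)
label = int(prob >= 0.5)        # apply the threshold
print(prob, label)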
• Logistic Regression has a parameter called C that determines the strength of regularization.
• A higher value of C means less regularization: high C values try to fit the training set as well as possible (in other words, they "trust" the training set).
• A low value of C makes the model place more emphasis on finding a coefficient vector (w) that is close to zero, minimizing the influence of individual features (see the sketch below).
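A small sketch of the effect of C: with strong regularization (low C) the learned coefficients are pulled toward zero, and with weak regularization (high C) they grow larger. The dataset and the C values are illustrative.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

for C in (0.01, 1, 100):
    logreg = LogisticRegression(C=C, max_iter=5000)
    logreg.fit(scaler.transform(X_train), y_train)
    print(f"C={C}: mean |w| = {np.abs(logreg.coef_).mean():.3f}, "
          f"test accuracy = {logreg.score(scaler.transform(X_test), y_test):.3f}")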
• The advantages of Logistic Regression are:
  • Simple and easy to implement
  • Reliable for datasets with few to moderately many features
  • The coefficients are easy to interpret
• The disadvantages of Logistic Regression are:
  • Sensitive to outliers and imbalanced datasets
  • Not reliable on complex datasets
  • Interpreting the coefficients becomes difficult when there are correlated features
LogisticRegression Code Ex
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
LogReg = LogisticRegression()
pipe_LogReg = Pipeline(steps=[('scale', …), ('feat_select', …), ('clf', LogReg)])
# note: the 'l1' penalty requires a compatible solver, e.g. LogisticRegression(solver='liblinear')
params_LogReg = {'feat_select__k': …,
                 'clf__C': [0.1, 1, 10],
                 'clf__penalty': ['l1', 'l2']}
GSCV_LogReg = GridSearchCV(pipe_LogReg, params_LogReg, cv=5)
GSCV_LogReg.fit(X_train, y_train)
GSCV_LogReg.score(X_test, y_test)
Kernelized Support Vector Machine
• Kernelized SVMs are extensions of linear SVMs that allow more complex models, which are not defined solely by hyperplanes in the input space
• One way to make a linear model more flexible is to add more features, for example by adding interactions or polynomials of the input features (see the sketch below)
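A sketch of the feature-expansion idea: a linear SVM on the original two features cannot separate the classes well, but after adding polynomial and interaction features it usually can. The make_moons dataset, the polynomial degree, and the other settings are illustrative choices, not the example used in the slides.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM on the original features vs. on polynomially expanded features
linear_only = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
expanded = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3), LinearSVC(max_iter=10000))

print("original features:", linear_only.fit(X_train, y_train).score(X_test, y_test))
print("expanded features:", expanded.fit(X_train, y_train).score(X_test, y_test))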
• Illustration: a two-class classification dataset in which the classes are not linearly separable
• Linear models and nonlinear features
• Illustration: the decision boundary created by a linear SVM on this dataset
Kernelized Support Vector Machines
• After adding the expanded features, the linear SVM model is no longer linear as a function of the original features: its decision boundary is not a line, but more like an ellipse
• The kernel trick is a mathematical trick that allows our classifier to learn in a higher-dimensional space without actually computing the new representation, which may be very large
• It works by directly calculating the distance (more precisely, the scalar product) between data points in the expanded feature representation, without ever actually computing the expansion
• Two ways to map your data to a higher-dimensional space with an SVM:
  • The polynomial kernel, which computes all possible polynomials of the original features up to a certain degree (such as feature1 ** 2 * feature2 ** 5)
  • The radial basis function (RBF) kernel, also known as the Gaussian kernel
• During training, the SVM learns how important each training data point is for representing the decision boundary between the two classes
• Usually only a subset of the training points is important for determining the decision boundary: those located on the borders between the classes (the support vectors)
• To make a prediction for a new point, the distance to each support vector is measured
• The classification decision is based on these distances and on the importance of each support vector learned during training (stored in the dual_coef_ attribute of the SVC); see the sketch below
• https://youtu.be/Q7vT0--5VII
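A small sketch of fitting an RBF-kernel SVC and inspecting the support vectors and their learned importances (dual_coef_) described above; the dataset and the C and gamma settings are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
model.fit(X_train, y_train)

svc = model.named_steps['svc']
print("support vectors per class:", svc.n_support_)
print("dual_coef_ shape:", svc.dual_coef_.shape)  # learned importance of each support vector
print("test accuracy:", model.score(X_test, y_test))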
• The gamma parameter controls the width of the Gaussian kernel
  • It determines the scale of what it means for points to be close together
• The C parameter is a regularization parameter, similar to the one used in linear models
  • It limits the importance of each point (or more precisely, its dual_coef_)
• Illustration: decision boundaries of an RBF-kernel SVM for different values of gamma and C
• A small value of gamma means a large radius for the Gaussian kernel, so many points are considered close together
  • This is reflected in very smooth decision boundaries on the left of the figure, and boundaries that focus more on individual points further to the right
• A low gamma value means the decision boundary varies slowly, resulting in a model of low complexity, while a high gamma value results in a more complex model
• A small value of C means a very restricted model, in which each data point can have only very limited influence
  • At the top left of the figure the decision boundary looks almost linear, with misclassified points having almost no influence on the line
  • Increasing C, as shown at the bottom right, allows these points to have a stronger influence on the model and makes the decision boundary bend to classify them correctly
• Strengths, weaknesses, and parameters
  • SVMs are powerful models and perform well on a variety of datasets
  • SVMs allow complex decision boundaries, even if the data has only a few features. They work well on low-dimensional and high-dimensional data (i.e., few and many features), but do not scale very well with the number of samples
  • Running an SVM on data with 10,000 samples may work fine, but working with datasets of 100,000 samples or more can be challenging in terms of runtime and memory usage
  • Another disadvantage of SVMs is that they require careful data pre-processing and parameter tuning
• Additionally, SVM models are difficult to examine; it may be hard to understand why a certain prediction was made, and hard to explain the model to non-experts
• The important parameters of kernelized SVMs are the regularization parameter C, the choice of kernel, and the kernel-specific parameters (such as gamma for the RBF kernel)
SVC Code Ex
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
SVMClf = SVC()
pipe_SVM = Pipeline(steps=[('scale', …), ('feat_select', …), ('clf', SVMClf)])
# params_SVM is the hyperparameter grid (not shown on the slide), e.g. {'clf__C': [0.1, 1, 10], 'clf__gamma': [0.01, 0.1, 1]}
GSCV_SVM = GridSearchCV(pipe_SVM, params_SVM, cv=5)
GSCV_SVM.fit(X_train, y_train)
GSCV_SVM.score(X_test, y_test)
Regression
• Regression in supervised learning is a method for modeling the relationship between the independent variables (X) and the dependent variable (y) by finding a mathematical equation.
• The goal of regression is to predict the value of the dependent variable from the values of the independent variables.
• An easy way to differentiate between classification and regression tasks: ask whether there is some kind of continuity in the output
• Example: predicting a person's annual income from their education, their age, and where they live
• The predicted value is a quantity, and can be any number within a specified range
Regression Illustration
Linear Regression
• Linear Regression is one of the simplest and most frequently used regression methods.
• It works by looking for the best straight line (linear relationship) that describes the relationship between the independent variables (X) and the dependent variable (y).
• Linear Regression calculates the parameters w and b that minimize the Mean Squared Error between the predictions and the actual regression targets (see the sketch below).
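A minimal sketch of fitting w and b by minimizing the Mean Squared Error with scikit-learn's LinearRegression; the synthetic data (true slope 0.5, intercept 1.0 plus noise) is illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] + 1.0 + rng.normal(scale=0.3, size=60)

lr = LinearRegression().fit(X, y)
print("w:", lr.coef_, "b:", lr.intercept_)                   # learned slope and intercept
print("training MSE:", mean_squared_error(y, lr.predict(X)))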
• Advantages: simple and easy to understand, can give accurate results on linear datasets, and has a relatively fast computation time.
• Disadvantages: cannot handle non-linear relationships, and cannot handle multicollinearity (strong correlation between two or more independent variables in the regression model, which makes the estimated coefficients unstable and hard to interpret).
Lasso
• The Lasso and Ridge regression techniques were developed to overcome multicollinearity and overfitting problems in linear regression models.
• Both techniques work by imposing a penalty on the regression objective.
• Lasso uses the L1 penalty, creating a simpler model by focusing on the most important features and discarding minor ones: some coefficients become exactly zero.
• The Lasso penalty also pushes the remaining coefficients w toward zero (see the sketch below).
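A small sketch of the L1 penalty's effect: with Lasso, several coefficients become exactly zero, so only a subset of the features is used. The diabetes dataset and the alpha value are illustrative choices.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression

X, y = load_diabetes(return_X_y=True)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("non-zero coefficients, plain linear regression:", np.sum(ols.coef_ != 0))
print("non-zero coefficients, Lasso:", np.sum(lasso.coef_ != 0))  # only a few features remain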
• Advantages: can produce simpler models, selects the most important features, and helps avoid overfitting.
• Disadvantages: may discard features that do influence the dependent variable (especially among correlated features) and may produce an unstable model.
Ridge
• Ridge uses the L2 penalty, shrinking the regression coefficients and addressing multicollinearity to produce a more stable model.
• The coefficient values are made as small as possible: all entries of w should be close to zero (see the sketch below).
• Each feature should have as little effect on the outcome as possible (which means a small slope when graphed), while still predicting well.
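A small sketch of the L2 penalty's effect: as Ridge's alpha (the penalty strength) grows, the coefficients shrink toward zero, but none of them become exactly zero. The dataset and alpha values are illustrative.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: mean |w| = {np.abs(ridge.coef_).mean():.2f}, "
          f"non-zero coefficients = {np.sum(ridge.coef_ != 0)}")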
• Advantages: Ridge can mitigate multicollinearity and create a more stable model.
• Disadvantages: cannot discard insignificant features (no coefficient becomes exactly zero) and therefore tends to build more complicated models.
Regression Code Ex
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
RR = Ridge()
pipe_RR = Pipeline(steps=[('scale', …), ('feat_select', …), ('reg', RR)])
# params_RR is the hyperparameter grid (not shown on the slide), e.g. {'reg__alpha': [0.1, 1, 10]}
GSCV_RR = GridSearchCV(pipe_RR, params_RR, cv=5, scoring='neg_mean_absolute_error')
GSCV_RR.fit(X_train, y_train)
GSCV_RR.score(X_test, y_test)