Scikit Learn
1.1. What is machine learning? UNIT 01
Modern definition
‣ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” – Mitchell, 1997 (p.2)
‣ “Programming computers to optimize a performance criterion using example data or past experience.” – Alpaydin, 2010
‣ “Computational methods using experience to improve performance or to make accurate predictions.” – Mohri, 2012
Mathematical definition
‣ Suppose that the x-axis is invested advertising expenses (feature) while the y-axis is sales (target).
‣ Question about prediction – what are the sales when an arbitrary advertising expense is given?
‣ Linear regression
• 𝑦 = 𝑤𝑥 + 𝑏, with w and b as parameters
• ‘w’ is commonly used as an abbreviation of ‘weight.’
‣ Since the optimal values are unknown in the beginning, start with arbitrary values and then reach the optimal values by gradually improving the performance.
• In the figure, the fitted line improves from f1 to f2 to f3 (f1 → f2 → f3).
• The optimal model is f3, where w=0.5 and b=2.0.
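The fit in the figure can be reproduced with scikit-learn's LinearRegression. The data below is hypothetical, constructed so that sales follow y = 0.5x + 2.0 exactly as in the slide:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data consistent with the figure: sales = 0.5 * expenses + 2.0
X = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])  # advertising expenses (feature)
y = 0.5 * X.ravel() + 2.0                           # sales (target)

model = LinearRegression()
model.fit(X, y)  # the optimizer estimates w and b from the data

w, b = model.coef_[0], model.intercept_
print(w, b)  # recovers w ≈ 0.5, b ≈ 2.0
```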
[Venn diagram: Machine Learning at the intersection of Artificial Intelligence, Pattern Recognition, Statistics, Data Mining, Data Science, Databases, and Computational Neuroscience, with Deep Learning as a subset of Machine Learning]
Machine Learning
‣ Supervised learning: the target pattern is given.
‣ Unsupervised learning: the target pattern must be found.
‣ Reinforcement learning: policy optimization.
Machine Learning
‣ Problem Definition: understanding the business and defining the problem.
‣ Data Preparation: searching and pre-processing the data; feature engineering.
‣ Modeling and Optimization: training the model on the data (train / validate).
Type / Algorithm & Method
‣ Unsupervised learning: Clustering; MDS, t-SNE; PCA, NMF; Association analysis.
‣ Supervised learning: Linear regression; Logistic regression; Tree, Random Forest, AdaBoost, XGBoost; Naïve Bayes; KNN; Support Vector Machine (SVM); Neural Network.
Hyperparameters
‣ Can be set manually by the practitioner.
‣ Can be tuned to optimize the machine learning performance.
Ex k (the number of neighbors) in the KNN algorithm.
Ex Learning rate in a neural network.
Ex Maximum depth in a tree algorithm.
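As a sketch, the hyperparameters named above are passed as arguments when the estimator object is created (n_neighbors plays the role of k):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are set manually at object creation time.
knn = KNeighborsClassifier(n_neighbors=5)   # k in the KNN algorithm
tree = DecisionTreeClassifier(max_depth=3)  # maximum depth in a tree algorithm

print(knn.get_params()['n_neighbors'])  # 5
print(tree.get_params()['max_depth'])   # 3
```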
1.2. Python scikit-learn library for machine learning UNIT 01
Mechanism of scikit-learn
Scikit-learn is characterized by an intuitive, easy interface with a high-level API.
Instance → Fit → Predict / Transform
Estimator
Training: .fit
Prediction: .predict
Classifier / Regressor
DecisionTreeClassifier / LinearRegression
KNeighborsClassifier / KNeighborsRegressor
GradientBoostingClassifier / GradientBoostingRegressor
GaussianNB … / Ridge …
Scikit-Learn Library
About the Scikit-Learn library
‣ It is a representative Python machine learning library.
‣ To import a machine learning algorithm as a class:
from sklearn.<family> import <MachineLearningAlgorithm>
Ex from sklearn.linear_model import LinearRegression
‣ Hyperparameters are specified when the machine learning object is instantiated:
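A minimal illustration of the import pattern and of passing hyperparameters at instantiation (the specific values such as alpha=0.5 are arbitrary examples, not recommendations):

```python
# Import pattern: from sklearn.<family> import <MachineLearningAlgorithm>
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are given when the object is instantiated.
reg = Ridge(alpha=0.5)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=4)

print(reg.alpha)      # 0.5
print(clf.criterion)  # entropy
```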
Practicing scikit-learn
‣ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets.
It also features some artificial data generators.
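A minimal loading example, assuming the slides use the breast cancer data set (its 569 observations match the counts quoted below):

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.data.shape)          # (569, 30): 569 observations, 30 features
print(cancer.target.shape)        # (569,)
print(list(cancer.target_names))  # ['malignant', 'benign']
```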
Line 5
• This data becomes x (independent variable, data).
Line 6
• This data becomes y (dependent variable, actual value).
Line 1
• Provides details about the data.
• The help shows that the default value of test_size is 0.25.
Line 7
• From the total of 569 observations, split the data into training and evaluation sets, e.g., at 7:3 or 8:2; 75:25 is the default ratio.
‣ Use train_test_split() to split the data for making and evaluating the model.
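A sketch of the split, again assuming the breast cancer data set; with the default test_size=0.25 the resulting sizes match the 426/143 counts noted below:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Default test_size is 0.25, i.e. a 75:25 split.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

print(X_train.shape)  # (426, 30)
print(X_test.shape)   # (143, 30)
```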
Line 11
• 426 observations (75%) out of the total 569 are in the training set.
Line 13
• 143 observations (25%) out of the total 569 are in the test set.
‣ When instantiating, pass the model’s hyperparameters as arguments. A hyperparameter is an option that must be set by a human and strongly affects model performance.
Line 1-5
• Loading the test data set
Line 1-8
• Instantiating the estimator and setting its hyperparameters
• Initializing the model to use entropy for branching
fit
‣ Use the fit method of the instantiated estimator for training. For a supervised learning algorithm, pass the training data and the label data together as arguments.
predict
‣ Once the estimator has been trained with fit, the predict method can be applied. predict returns the model’s estimates for the input data.
Line 2
• These are estimated values, so they may differ from the actual values for X_test.
Measure the accuracy by comparing the two.
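The full instantiate → fit → predict cycle can be sketched as follows, assuming the breast cancer data set and the entropy criterion mentioned above (the exact accuracy depends on the split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# fit: pass training data and labels together (supervised learning).
model = DecisionTreeClassifier(criterion='entropy', random_state=0)
model.fit(X_train, y_train)

# predict: returns the model's estimates for the input data.
y_pred = model.predict(X_test)

# Compare predictions with the actual values to measure accuracy.
accuracy = (y_pred == y_test).mean()
print(round(accuracy, 2))
```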
Line 57
• The data frame shows the cases where the predicted value and the actual value differ.
Line 66
• 133 of the 143 test observations are predicted correctly (133/143 ≈ 93%).
‣ The model showed 93% accuracy, which is a fairly good result. In practice, a process of improving data quality is required during pre-processing, and standardization is one of the options. The following is a brief summary of standardization.
• Standardization rescales the data toward the standard normal distribution.
Another term for standardization is z-transformation, and the standardized value is also referred to as the z-score. For example, about 94% accuracy can be obtained in KNN wine classification after standardization.
• Standardization is widely used in data pre-processing in general, not only for KNN. The equation is:
z = (x − μ) / σ (μ: mean, σ: standard deviation)
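The z-score formula above is what StandardScaler computes per column; a small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up numbers).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# z = (x - mean) / standard deviation, applied per column
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # ≈ [0. 0.]
print(X_std.std(axis=0))   # [1. 1.]
```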
Line 35
• Data frame before standardization
Line 39
• The differences among column values are huge before standardization.
Line 40
• After standardization, the column values do not significantly deviate from 0.
• Better performance can be expected than before standardization.
transform
‣ Feature processing is done with transform, which returns the processed result.
Line 3-3
• Output before pre-processing
Line 3-7
• Pre-processing – Apply scaling
Line 3-9
• Result check after pre-processing
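The fit/transform separation matters because the scaling parameters must be learned from the training data only and then reused; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learn mean and std from training data only
X_train_std = scaler.transform(X_train)  # apply the learned scaling
X_test_std = scaler.transform(X_test)    # reuse the same parameters for test data

print(X_test_std)  # 2.5 is the training mean, so it maps to 0
```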
fit_transform
‣ fit and transform are combined in fit_transform.
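The equivalence can be checked directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

a = StandardScaler().fit(X).transform(X)  # fit, then transform
b = StandardScaler().fit_transform(X)     # both steps combined

print(np.allclose(a, b))  # True
```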
1.3. Preparation and division of data set UNIT 01
[Figure: machine learning modeling process through division of training and test data sets – the model is built on the training set and its performance is evaluated on the test set]
[Figure: five fits of increasing model complexity on the same data, illustrating underfitting (left) and overfitting (right)]
‣ Even if machine learning finds the optimal solution for the data distribution, a wide margin of error can remain because the model has a small capacity. This phenomenon is referred to as underfitting; the linear model in the leftmost figure above is an example.
‣ An easy alternative is to use a higher-degree polynomial, which is a non-linear model.
‣ The rightmost figure above uses a 12th-order polynomial.
‣ The model capacity is larger, and there are 13 parameters to estimate.
y = w12 x¹² + w11 x¹¹ + w10 x¹⁰ + ⋯ + w1 x + w0
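The 13-parameter count can be verified with scikit-learn's PolynomialFeatures, which expands a single feature into all polynomial terms up to the given degree:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(5.0).reshape(-1, 1)  # a single feature

# A 12th-order polynomial in one variable has 13 terms: w0, w1*x, ..., w12*x^12.
poly = PolynomialFeatures(degree=12)
X_poly = poly.fit_transform(x)

print(X_poly.shape[1])  # 13 columns, one per parameter
```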
Overfitting
‣ A 12th-order polynomial curve approximates the training set almost perfectly.
‣ However, an issue occurs when predicting new data.
• The region around the red bar at x₀ should be predicted, but the red dot is predicted instead.
‣ The reason is the large capacity of the model.
• Fitting the noise during the learning process → overfitting
‣ Model selection is required to choose a model of adequate size.
Cross-Validation:
1) Split the data into a training set and a testing set.
2) Further subdivide the training set into a smaller training and a validation set.
3) Train the model with the smaller training set.
4) Evaluate the errors with the validation set.
5) Repeat from step 2) several times.
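The steps above are automated by cross_val_score; a sketch using the iris data as an example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=0)

# cv=5: split into 5 folds, train on 4 and validate on the 5th, rotating.
scores = cross_val_score(model, iris.data, iris.target, cv=5)

print(len(scores))  # 5 validation scores
print(scores.mean())
```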
k-fold cross-validation
‣ Subdivide the training data set into 𝑘 equal parts; each part is used as the validation set in turn while the remaining parts are used for training.
Leave-one-out cross-validation
‣ Leave only one observation out for validation and train on the rest; apply sequentially over all observations. This is more time-consuming.
[Figure: k-fold cross-validation]
1.4. Data pre-processing for making a good training data set UNIT 01
Ex In the case of the iris data, ignore the fourth row, whose label is missing, as shown in the table below.

    x1            x2           x3            x4           y
    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1   5.1           3.5          1.4           0.2          setosa
2   4.9           3.0          1.4           0.2          setosa
3   4.7           3.2          1.3           0.2          setosa
4   4.6           3.1          1.5           0.2          (missing)
5   5.0           3.6          1.4           0.2          setosa
6   5.4           3.9          1.7           0.4          setosa
    x1            x2           x3            x4           y
    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1   5.1           3.5          1.4           0.2          setosa
2   4.9           3.0          1.4           0.2          setosa
3   4.7           3.2          1.3           0.2          setosa
4   4.6           3.1          1.5           0.2          unknown
5   5.0           3.6          1.4           0.2          setosa
6   5.4           3.9          1.7           0.4          setosa
    x1            x2           x3            x4           y
    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1   5.1           3.5          1.4           0.2          setosa
2   4.9           3.0          1.4           0.2          setosa
3   4.7           3.2          1.3           0.2          setosa
4   5.843         3.1          1.5           0.2          setosa
5   5.0           3.6          1.4           0.2          setosa
6   5.4           3.9          1.7           0.4          setosa
‣ The omitted value is changed to NaN. Here this is not problematic because the data set is small, but it is extremely inconvenient to manually find missing values in a huge data frame.
Line 17
• The number of missing values can be counted.
Line 18
• axis=0 is default, so the row with the NaN value is deleted.
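Both operations can be sketched on a tiny data frame (the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [5.1, 4.9, 4.7],
                   'x2': [3.5, np.nan, 3.2]})

print(df.isnull().sum())   # number of missing values per column
print(df.dropna(axis=0))   # axis=0 (default): drop rows that contain NaN
```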
Imputation
‣ It is sometimes hard to delete a training sample or an entire column, because that loses too much useful data. In that case, estimate the missing values from the other training samples in the data set by interpolation. The most commonly used method is mean imputation, which replaces the missing value with the overall average of the column. In scikit-learn, use the SimpleImputer class.
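A minimal SimpleImputer sketch with made-up values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[5.1, 3.5],
              [4.9, np.nan],
              [4.7, 3.2]])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

print(X_imputed[1, 1])  # (3.5 + 3.2) / 2 = 3.35
```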
Line 45
• Check that the imputed value is the average of the column.
‣ The features in the table may or may not have an inherent order.
Size is ordered but color is not; thus size is on an ordinal scale while color is on a nominal scale.
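An ordinal feature can be encoded with a mapping that preserves its order; the clothing data below is hypothetical, in the spirit of the slide's table:

```python
import pandas as pd

# Hypothetical data: 'size' is ordinal, 'color' is nominal.
df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'L', 'S']})

# Map the ordinal feature to integers that preserve its order.
size_mapping = {'S': 1, 'M': 2, 'L': 3}
df['size'] = df['size'].map(size_mapping)

print(df['size'].tolist())  # [2, 3, 1]
```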
Line 62
• Change the class label from strings to integers.
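The string-to-integer conversion of class labels is what scikit-learn's LabelEncoder does; a sketch with iris species names:

```python
from sklearn.preprocessing import LabelEncoder

species = ['setosa', 'versicolor', 'virginica', 'setosa']

le = LabelEncoder()
encoded = le.fit_transform(species)  # strings -> integers (classes sorted alphabetically)

print(list(encoded))      # [0, 1, 2, 0]
print(list(le.classes_))  # ['setosa', 'versicolor', 'virginica']
```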
The encoding is done with integers, so pass in the integer-encoded iris ‘species’ values.
Use the get_dummies() function of pandas to convert every unique value of a categorical variable into a new dummy variable.
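A minimal get_dummies sketch on the species column:

```python
import pandas as pd

df = pd.DataFrame({'species': ['setosa', 'versicolor', 'virginica']})

# One dummy column per unique value of the categorical variable.
dummies = pd.get_dummies(df['species'])

print(list(dummies.columns))  # ['setosa', 'versicolor', 'virginica']
print(dummies.shape)          # (3, 3)
```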
Use the sklearn library to conveniently perform one-hot encoding. The result is given as a sparse matrix in the linear-algebra sense.
In a sparse matrix, most entries are 0; the opposite concept is a dense matrix.
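A sketch of OneHotEncoder, whose output is sparse by default and can be converted to a dense matrix with toarray():

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

species = np.array([['setosa'], ['versicolor'], ['virginica'], ['setosa']])

enc = OneHotEncoder()                 # output is a sparse matrix by default
encoded = enc.fit_transform(species)

dense = encoded.toarray()  # convert the sparse result to a dense matrix
print(dense)               # row 0 is [1, 0, 0]: setosa
```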
OneHotEncoder
Line 82
• Entry (0, 0) is 1, so row 0 is setosa (rows 0–49 are all setosa).
Row index  Species     setosa  versicolor  virginica  Sparse matrix expression
0          setosa      1       0           0          (0,0)
1          setosa      1       0           0          (1,0)
…          setosa      1       0           0
49         setosa      1       0           0          (49,0)
50         versicolor  0       1           0          (50,1)
51         versicolor  0       1           0          (51,1)
…          versicolor  0       1           0
99         versicolor  0       1           0          (99,1)
100        virginica   0       0           1          (100,2)
101        virginica   0       0           1          (101,2)
…          virginica   0       0           1
149        virginica   0       0           1          (149,2)
Using hold-out in practice to split the data set into a training set and a test set
‣ df_wine contains measurements of wines produced in the Vinho Verde region, adjacent to the Atlantic Ocean in the northwest of Portugal. The grade, taste, and acidity of 1,599 red wine samples and 4,898 white wine samples were measured and analyzed to create the data. If the data is not found at the following path, it can be downloaded directly from the UCI repository and read from a local file.
Line 85
• When the wine data set of the UCI machine learning repository is not accessible,
• uncomment the following line and read the data set from a local path:
• df_wine = pd.read_csv('wine.data', header=None)
‣ Data splitting is possible using the train_test_split function provided in the model_selection module of scikit-learn. First, convert the features from index 1 to 13 to a NumPy array and assign them to the variable X. train_test_split returns four arrays, so assign them to appropriately named variables.
‣ Randomly split X and y into training and test data sets. With test_size=0.3, 30% of the samples are assigned to X_test and y_test.
‣ If the class label array y is passed to the stratify parameter, the class ratio of the original data set is preserved in both the training and the test data sets.
‣ The most widely used ratios in practice are 6:4, 7:3, or 8:2 depending on the size of the data set. For large data sets, splitting at 9:1 or even 9.9:0.1 is common and suitable.
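A stratified split can be sketched as follows; scikit-learn's built-in wine data is used here in place of the df_wine file, purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# sklearn's built-in wine data stands in for df_wine in this sketch.
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# stratify=y keeps the class proportions of the original data in both splits.
print(np.bincount(y) / len(y))
print(np.bincount(y_train) / len(y_train))
```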
# MaxAbsScaler divides each feature by its maximum absolute value, so the maximum absolute value of each feature becomes 1.
# Every feature is thereby scaled to the range [-1, 1].
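A minimal MaxAbsScaler sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1.0, -10.0],
              [2.0,   5.0],
              [-4.0,  20.0]])

# Each feature is divided by its maximum absolute value (4 and 20 here).
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
print(np.abs(X_scaled).max(axis=0))  # [1. 1.]
```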