
Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

1
1.1. What is machine learning? UNIT 01

What is machine learning?


What is machine learning?

‣ A statistical model that learns from data.


‣ A rather simple model can make complex predictions.

2
1.1. What is machine learning? UNIT 01

Samuel’s definition in the early phase of artificial intelligence


‣ “Programming computers to learn from experience should eventually eliminate the need for much of this detailed
programming effort.” - Samuel, 1959

Modern definition
‣ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E.” – Mitchell, 1997 (p.2)
‣ “Programming computers to optimize a performance criterion using example data or past experience.”
–Alpaydin, 2010
‣ “Computational methods using experience to improve performance or to make accurate predictions.” – Mohri, 2012

3
1.1. What is machine learning? UNIT 01

Mathematical definition
‣ Suppose that the x-axis (feature) is the invested advertising expense while the y-axis (target) is sales.
‣ Question about prediction: what are the sales when an arbitrary advertising expense is given?
‣ Linear regression, with w and b as parameters:

𝑦 = 𝑤𝑥 + 𝑏

• ‘w’ is commonly used as an abbreviation of ‘weight.’

[Figure: scatter plot of sales vs. advertising expense (feature axis spanning 2 to 10) with three candidate lines f1, f2, f3.]

‣ Since the optimal value is unknown at the beginning, start with an arbitrary value and then reach the optimal value by gradually improving the performance.
• In the figure, the fit starts from f1 and improves as f1 → f2 → f3.
• The optimal value is f3, where w=0.5 and b=2.0.

4
1.1. What is machine learning? UNIT 01

Statistics and machine learning from a data analysis perspective

[Figure: Venn diagram of overlapping fields: pattern recognition, artificial intelligence, statistics, machine learning, deep learning, data mining, data science, databases, and computational neuroscience.]

Connections between machine learning and related fields of study

5
1.1. What is machine learning? UNIT 01

Data mining and machine learning from a data analysis perspective

[Figure: Venn diagram of overlapping fields: pattern recognition, artificial intelligence, statistics, machine learning, deep learning, data mining, data science, databases, and computational neuroscience.]

Connections between machine learning and related fields of study

6
1.1. What is machine learning? UNIT 01

Types of machine learning according to methods of supervision

Machine Learning

• Supervised: the target pattern is given.
• Unsupervised: the target pattern must be found.
• Reinforcement: a policy is optimized.

7
1.1. What is machine learning? UNIT 01

Machine learning workflow

[Diagram: machine learning workflow]

• Problem Definition: understanding the business and defining the problem.
• Data Preparation: collecting raw data, searching and pre-processing the data, feature engineering.
• Modeling and Optimization: training the model on the data (train / validate / test), measured with performance metrics.
• Model performance evaluation: leading to enhanced model performance and application to real life.

8
1.1. What is machine learning? UNIT 01

Machine learning types:

Type: Unsupervised learning
• Clustering
• MDS, t-SNE
• PCA, NMF
• Association analysis

Type: Supervised learning
• Linear regression
• Logistic regression
• Tree, Random Forest, AdaBoost, XGBoost
• Naïve Bayes
• KNN
• Support vector machine (SVM)
• Neural network

9
1.1. What is machine learning? UNIT 01

Parameters vs. Hyperparameters


Parameters
‣ Learned from data by training and not manually set by the practitioner.
‣ Contain the data pattern.

Ex Coefficients of linear regression.


Ex Weights of neural network.

Hyperparameters
‣ Can be set manually by the practitioner.
‣ Can be tuned to optimize the machine learning performance.

Ex k (the number of neighbors) in the KNN algorithm.
Ex Learning rate in neural network.
Ex Maximum depth in Tree algorithm.
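
As a minimal sketch of this distinction (not from the original slides; the data here is illustrative), the coefficients of a fitted LinearRegression are parameters learned from the data, while k in KNN is a hyperparameter chosen by the practitioner:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 5.2, 6.9, 9.1])

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)            # parameters: learned weight w and bias b

knn = KNeighborsClassifier(n_neighbors=3)   # hyperparameter: k, set manually before training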

10
Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

11
1.2. Python scikit-learn library for machine learning UNIT 01

Features of the scikit-learn library


Features
‣ Integrated library interface by applying the façade design pattern
‣ Provides various machine learning algorithms, along with model selection and data pre-processing functions
‣ Simple and efficient tool to analyze predicted data
‣ Based on NumPy, SciPy and matplotlib
‣ Easily accessible and can be reused in many different situations
‣ Highly compatible with different libraries
‣ Does not support GPU
‣ Open source and can be used for commercial purposes

12
1.2. Python scikit-learn library for machine learning UNIT 01

Mechanism of scikit-learn
Scikit-learn is characterized by an intuitive, easy-to-use interface built around a high-level API.

Instance → fit → predict / transform

13
1.2. Python scikit-learn library for machine learning UNIT 01

Estimator, Classifier, Regressor


‣ An estimator refers to an object that can fit a model and infer certain features of new data based on the training data.
‣ A classifier refers to a class that implements a classification algorithm, while a regressor refers to a class that implements a regression algorithm.

Estimator

Training: .fit
Prediction: .predict

Classifier Regressor

DecisionTreeClassifier LinearRegression
KNeighborsClassifier KNeighborsRegressor
GradientBoostingClassifier GradientBoostingRegressor
GaussianNB … Ridge …

14
1.2. Python scikit-learn library for machine learning UNIT 01

Scikit-Learn Library
About the Scikit-Learn library
‣ It is a representative Python machine learning library.
‣ To import a machine learning algorithm as class:
from sklearn.<family> import <machine learning algorithm>
Ex from sklearn.linear_model import LinearRegression
‣ Hyperparameters are specified when the machine learning object is instantiated:

Ex myModel = KNeighborsClassifier(n_neighbors=10) # KNN with k = 10

15
1.2. Python scikit-learn library for machine learning UNIT 01

About the Scikit-Learn library


‣ To train a supervised learning model: myModel.fit(X_train, Y_train)
‣ To train an unsupervised learning model: myModel.fit(X_train)
‣ To predict using an already trained model: myModel.predict(X_test)
‣ To import a preprocessor as class: from sklearn.preprocessing import <a preprocessor>
‣ To split the dataset into a training set and a testing set:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
‣ To calculate a performance metric (accuracy): metrics.accuracy_score(Y_test, Y_pred)
‣ To cross validate and do hyperparameter tuning at the same time:
myGridCV = GridSearchCV(estimator, parameter_grid, cv=k)
myGridCV.fit(X_train, Y_train)
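
Putting the calls above together, a hypothetical end-to-end sketch might look as follows (the data set and parameter grid are illustrative assumptions, not taken from the slides):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

# Cross validate and tune the hyperparameter k at the same time
parameter_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
myGridCV = GridSearchCV(KNeighborsClassifier(), parameter_grid, cv=5)
myGridCV.fit(X_train, Y_train)

Y_pred = myGridCV.predict(X_test)              # predict with the best model found
print(metrics.accuracy_score(Y_test, Y_pred))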

16
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn
‣ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets.
It also features some artificial data generators.

‣ Import the data with load_breast_cancer().

‣ The returned Bunch is a container object exposing keys as attributes.


Bunch objects are sometimes used as an output for functions and methods.
They extend dictionaries by enabling values to be accessed by key, bunch["value_key"], or by attribute, bunch.value_key.
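
A short sketch of loading the data and inspecting the Bunch (the exact key list may vary with the scikit-learn version):

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()    # returns a Bunch object
print(cancer.keys())             # e.g. 'data', 'target', 'target_names', 'feature_names', ...
print(cancer['data'].shape)      # access by key ...
print(cancer.data.shape)         # ... or by attribute: (569, 30)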

17
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 5
• This data becomes x (independent variable, data).

18
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 6
• This data becomes y (dependent variable, actual value).

19
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 1
• Provides details about the data.
• The help shows that the default value of test_size is 0.25.

20
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 7
• From the total of 569 observations, divide the data into training and evaluation sets at a ratio such as 7:3 or 8:2.
7.5:2.5 (test_size=0.25) is the default.

21
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn
‣ Use train_test_split() to split the data for making and evaluating the model.
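
A minimal sketch of the split described above, using the breast cancer data and the default test_size of 0.25:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)   # test_size defaults to 0.25

print(X_train.shape)   # (426, 30): 75% of the 569 observations
print(X_test.shape)    # (143, 30): 25% of the 569 observations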

22
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 11
• 426 observations (75%) out of the total 569 are in the training set.
Line 13
• 143 observations (25%) out of the total 569 are in the test set.

23
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn
‣ When instantiating, pass the model's hyperparameters as arguments. A hyperparameter is an option that must be set by a human and strongly affects the model performance.

Line 1-5
• Loading the data set
Line 1-8
• Instantiating the estimator and setting its hyperparameters
• The model is initialized to use entropy as the branching criterion

24
1.2. Python scikit-learn library for machine learning UNIT 01

fit
‣ Call the fit method on the instantiated estimator to train it. For a supervised learning algorithm, pass the training data and the label data together as arguments.

predict
‣ An instantiated estimator that has completed training via fit can then be used with the predict method. 'predict' returns the model's estimates for the input data.

Line 2
• These are estimated values, so they may differ from the actual values for X_test.
Measure the accuracy by comparing the two.
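
The slides' code appears only as screenshots; a sketch consistent with the annotations (breast cancer data, a tree that branches on entropy, then fit and predict) might be:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Instantiate the estimator with a hyperparameter: entropy as the branching criterion
model = DecisionTreeClassifier(criterion='entropy', random_state=0)
model.fit(X_train, y_train)       # training data and label data passed together
y_pred = model.predict(X_test)    # estimated values for X_test

# Measure the accuracy by comparing predictions with the actual values
print(metrics.accuracy_score(y_test, y_pred))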

25
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

26
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

27
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 57
• The data frame shows the rows where the predicted value and the actual value differ.

28
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

29
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

30
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 66
• 133/143 correct predictions (≈ 93% accuracy)

31
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

‣ The model showed 93% accuracy, which is already quite good. In practice, however, a step that raises accuracy is needed during data pre-processing, and standardization is one of the options. The following is a brief summary of standardization.
• Standardization rescales the data toward the standard normal distribution.
Another term for standardization is z-transformation, and the standardized value is also referred to as the z-score. About 94% accuracy would be obtained from KNN wine classification through standardization.
• Standardization is widely used in data pre-processing in general, not only for KNN, and the equation is as follows.

𝑧 = (𝑥 − 𝜇) / 𝜎   (𝜇: mean, 𝜎: standard deviation)

• Standardization is available as the StandardScaler class in scikit-learn.
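
A sketch of standardization with StandardScaler, here combined with KNN on the wine data bundled with scikit-learn (load_wine is an assumption; the slides' own wine example is not shown):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # learn mean and std from the training data only
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))
print(accuracy_score(y_test, y_pred))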

32
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

33
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 35
• Data frame before standardization

34
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 39
• The differences among column values are huge before standardization.

35
1.2. Python scikit-learn library for machine learning UNIT 01

Practicing scikit-learn

Line 40
• After standardization, the column values do not significantly deviate from 0.
• Better performance would be possible compared to before standardization.

36
1.2. Python scikit-learn library for machine learning UNIT 01

transform
‣ ‘transform’ performs the feature processing and returns the result.

Line 3-3
• Output before pre-processing

37
1.2. Python scikit-learn library for machine learning UNIT 01

transform
‣ ‘transform’ performs the feature processing and returns the result.

Line 3-7
• Pre-processing – Apply scaling

38
1.2. Python scikit-learn library for machine learning UNIT 01

transform
‣ ‘transform’ performs the feature processing and returns the result.

Line 3-9
• Result check after pre-processing

39
1.2. Python scikit-learn library for machine learning UNIT 01

transform
‣ ‘transform’ performs the feature processing and returns the result.

40
1.2. Python scikit-learn library for machine learning UNIT 01

fit_transform
‣ fit and transform are combined into fit_transform.

Line 4-1 & 4-5


• Before & After
Line 4-3
• Combination of fit and transform
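
A minimal sketch showing the equivalence (illustrative data):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                    # fit: learn the mean and standard deviation
X_scaled = scaler.transform(X)   # transform: apply them

X_scaled2 = StandardScaler().fit_transform(X)   # combined call on the same data
print(np.allclose(X_scaled, X_scaled2))         # True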

41
1.2. Python scikit-learn library for machine learning UNIT 01

Major scikit modules

Classification | Module | Embedded functions

Data example
• sklearn.datasets: data sets for practice

Feature processing
• sklearn.preprocessing: pre-processing techniques (one-hot encoding, normalization, scaling, etc.)
• sklearn.feature_selection: techniques to search for and select the features that have a significant impact on the model
• sklearn.feature_extraction: feature extraction from source data; the supporting API for image feature extraction is in the submodule image, while the supporting API for text feature extraction is in the submodule text

Dimension reduction
• sklearn.decomposition: algorithms related to dimension reduction (PCA, NMF, Truncated SVD, etc.)

Validation, hyperparameter tuning, data separation
• sklearn.model_selection: validation, hyperparameter tuning, data separation, etc. (cross_validate, GridSearchCV, train_test_split, learning_curve, etc.)

Model evaluation
• sklearn.metrics: techniques to measure and evaluate model performance (accuracy, precision, recall, ROC curve, etc.)

42
1.2. Python scikit-learn library for machine learning UNIT 01

Major scikit modules

Classification | Module | Embedded functions

Machine learning algorithms
• sklearn.ensemble: ensemble algorithms (random forest, AdaBoost, bagging, etc.)
• sklearn.linear_model: linear algorithms (linear regression, logistic regression, SGD, etc.)
• sklearn.naive_bayes: Naive Bayes algorithms (Bernoulli NB, Gaussian NB, multinomial NB, etc.)
• sklearn.neighbors: nearest-neighbor algorithms (k-NN, etc.)
• sklearn.svm: support vector machine algorithms
• sklearn.tree: decision tree algorithms
• sklearn.cluster: unsupervised learning (clustering) algorithms (k-means, DBSCAN, etc.)

Utility
• sklearn.pipeline: serial composition of feature processing and machine learning algorithms, etc.

43
Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

44
1.3. Preparation and division of data set UNIT 01

Preparation and division of data set


Chapter objectives
‣ Be able to understand the meaning and ripple effects of overfitting and generalization, and design data set division to address these issues.
‣ Be able to properly divide the training data set and test data set for applying machine learning techniques according to the analysis purpose and the features of the data set.
‣ Be able to divide the training data set and validation data set, decide whether cross validation is necessary for the given problem and technique, and choose an appropriate k value for cross validation. Be able to divide the data set and perform sampling by considering the prediction results based on the data features and the distribution of the classified variables.
‣ Be able to analyze the differences among various sampling methods for data set division and apply appropriate sampling methods.

45
1.3. Preparation and division of data set UNIT 01

Necessity of data set division


‣ When analyzing data with machine learning, especially when applying a supervised learning-based model, do not analyze the overall data set as a whole; instead, divide it into training and evaluation (test) data sets.

[Figure: the overall data set is divided into a training data set (on which K-fold cross validation is performed if necessary) used to build the model, and a test data set used for performance evaluation; together they yield the final model.]

Machine learning modeling process through division of training and test data sets

46
1.3. Preparation and division of data set UNIT 01

Overfitting and generalization of modeling


‣ Strictly speaking, the data included in the provided training data set can be considered values obtained by chance, so a new data set obtained to predict the values of new objective variables (or response variables) is not the same as the existing training data set.
‣ Thus, the chance that the patterns of the training data and the new data agree perfectly is extremely low. When a machine learning model reflects the training data set too strongly, overfitting occurs: the model over-emphasizes the training data set pattern, while generalization, i.e., accurate prediction on new data, underperforms.
‣ To prevent such issues, the data set is generally divided into a training data set and a test data set. Measuring how accurately the model trained on the training data set predicts the objective variables (or response variables) of the test data set provides the standard for model performance evaluation.

[Figure: polynomial fits of degree 1, 2, 3, 4, and 12 to the same data (y vs. x).]

Overfitting and underfitting

47
1.3. Preparation and division of data set UNIT 01

Overfitting and generalization of modeling


[Figure: polynomial fits of degree 1, 2, 3, 4, and 12 to the same data (y vs. x).]

Overfitting and underfitting

‣ Even if machine learning finds the optimal solution for the data distribution, a wide margin of error can remain because the model has a small capacity. This phenomenon is referred to as underfitting, and the linear-equation model in the leftmost figure above is an example.
‣ An easy alternative is to use higher-degree polynomials, which are non-linear equations.
‣ The rightmost figure above is fitted with a 12th-order polynomial.
‣ The model capacity is larger, and there are 13 parameters to estimate:

𝑦 = 𝑤₁₂𝑥¹² + 𝑤₁₁𝑥¹¹ + 𝑤₁₀𝑥¹⁰ + ⋯ + 𝑤₁𝑥 + 𝑤₀
48
1.3. Preparation and division of data set UNIT 01

Overfitting
‣ A 12th-order polynomial curve approximates the training set almost perfectly.
‣ However, an issue occurs when predicting new data.
• The region around the red bar at 𝑥₀ should be predicted, but the red dot is predicted instead.
‣ The reason is the large capacity of the model.
• It absorbs the noise during the learning process → overfitting
‣ Model selection is required to choose a model of adequate size.

[Figure: 12th-order fit mispredicting at 𝑥₀. Caption: Inaccurate prediction in overfitting]

49
1.3. Preparation and division of data set UNIT 01

Overfitting and generalization of modeling


‣ As the flexibility of the machine learning technique increases (in other words, flexibility increases with the order of the polynomial, as the chance of exactly matching the given data patterns rises),
‣ the root mean squared error (RMSE) of the training data set shows a monotone decreasing trend, while the RMSE of the test data set declines at first as the polynomial order rises but increases after a certain point.
‣ Summing up, the figure shows an overfitting trend in which the training data set pattern is reflected too strongly beyond the 4th-order polynomial. If there were no RMSE index calculated on test data, the RMSE of the training data would keep declining as the polynomial order rises, leading to the selection of an overfitted model.
‣ Thus, test data is required.

[Figure: RMSE of the training set and test set vs. flexibility (degree of the polynomial expression). Caption: Comparison of the root mean squared errors of training and test data]

50
1.3. Preparation and division of data set UNIT 01

Method and process of data set division


Cross-Validation:
‣ The data should be split into a training set and a testing set.
‣ In principle, the testing set should be used only once! It should not be reused!
‣ If the training set is used also for evaluation, the errors can be unrealistically small.
‣ We would like to evaluate realistic errors while training by splitting the training data into two.

[Diagram: the data is split into Training Data and Testing Data; the model is trained and cross-validated on the training part, then evaluated once on the testing part.]

Cross-Validation and Hyperparameter optimization:


‣ As we can repeatedly evaluate errors while training, it is also possible to tune the hyperparameters.

51
1.3. Preparation and division of data set UNIT 01

Cross-Validation:
1) Split the data into a training set and a testing set.
2) Further subdivide the training set into a smaller training and a validation set.
3) Train the model with the smaller training set.
4) Evaluate the errors with the validation set.
5) Repeat a few times from the step 2).
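
A sketch of the steps above (the data set and model are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out a testing set, to be used only once at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Steps 2-5: repeatedly subdivide the training set and evaluate on the validation part
for seed in range(3):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print(model.score(X_val, y_val))   # validation accuracy for this split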

52
1.3. Preparation and division of data set UNIT 01

Cross-Validation method: k-Fold

[Figure: the training data is split into k folds; in each round a different fold serves as the validation set and the remaining folds as training data.]

‣ Subdivide the training dataset into 𝑘 equal parts. Then, apply sequentially.

53
1.3. Preparation and division of data set UNIT 01

Cross-Validation method: Leave One Out (LOO)

[Figure: in each round, a single observation is held out for validation and all remaining observations form the training set.]

‣ Leave only one observation out for validation and apply sequentially; this is more time-consuming.

54
1.3. Preparation and division of data set UNIT 01

Cross-Validation method: k-fold cross validation

‣ Subdivide the data into n = k folds; n = 10 is used most of the time.

[Figure: 10 rounds of repeated measurement; in round i, the i-th fold serves as the validation set while the remaining folds form the training set. Example per-round accuracies: 93%, 90%, 91%, …, 95%.]

Final accuracy = the average over all rounds (Round 1, Round 2, …, Round 10)
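
In scikit-learn, the whole procedure can be expressed with cross_val_score (a sketch; the data set is an illustrative assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)   # 10-fold cross validation
print(scores)          # one accuracy value per round
print(scores.mean())   # final average accuracy over the 10 rounds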

55
Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

56
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


Data cleansing for machine learning-based data analysis processes missing values and noise to eliminate discrepancies in the collected data.
‣ Missing value processing is done as follows.
‣ First, import the iris data for a quick examination.

57
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing

58
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


1) Ignore the record (row)
• In data classification, ignore the record if the class label is missing.

Ex In the case of the iris data, ignore the fourth row as shown on the table below.

x1 x2 x3 x4 y
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

• ‘Ignore the record’ is extremely inefficient if missing values frequently occur.

59
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


2) Insert a value for the missing value
• Enter a placeholder value like ‘unknown’, or enter a representative value such as the overall mean, the median, or the mean of the records belonging to the same class.

x1 x2 x3 x4 y
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 unknown
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

60
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


2) Insert the missing value
• The average value of Sepal.Length for iris is 5.843 as provided earlier, so insert 5.843.
x1 x2 x3 x4 y
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

x1 x2 x3 x4 y
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.843 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

61
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


3) Manual entry
• A person in charge (or an expert) should check the data and modify it into an appropriate value.
• It requires a lot of time but provides high reliability.

62
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


‣ Identify the missing values in the table-type data below and process them appropriately.
• In Python, a missing value is represented as np.nan or a null value.
• ‘NaN’ is an abbreviation of ‘Not a Number.’
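
The slides' table is a screenshot; a hypothetical small data frame with omitted cells would look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, 6.0, 11.0],
                   'C': [3.0, np.nan, 12.0],
                   'D': [4.0, 8.0, np.nan]})
print(df)   # omitted values appear as NaN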

63
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing

‣ The omitted values are changed to NaN. Here this is manageable because the data is small, but it is extremely inconvenient to find missing values manually in a huge data frame.

64
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


‣ isnull() returns a data frame of booleans showing whether each cell holds a value (False) or is missing (True). Then, sum() is used to obtain the number of missing values.
‣ It is mandatory to check the number of missing values when importing data.

Line 17
• The number of missing values can be counted.
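
A sketch of the check on the hypothetical data frame from above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'C': [3.0, np.nan, 12.0],
                   'D': [4.0, 8.0, np.nan]})

print(df.isnull())         # True where a value is missing
print(df.isnull().sum())   # number of missing values per column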

65
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value


‣ Completely delete a certain training sample (row) or feature (column). Use dropna().
‣ help(df.dropna) shows axis=0 is default.

Line 18
• axis=0 is default, so the row with the NaN value is deleted.

66
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value


‣ To reflect the deletion immediately in the object, do not omit the inplace=True option.

67
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value


‣ Delete the row with missing value.

68
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value

69
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value


‣ To delete only rows in which all values are NaN, use how='all'.

70
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value

71
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value

72
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value


‣ Use thresh to keep only rows with at least a given number of non-NaN values (e.g., thresh=3 drops rows with fewer than 3 valid values).

73
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value

74
1.4. Data pre-processing for making a good training data set UNIT 01

Removing the training sample or feature with missing value


‣ To delete rows with NaN in a certain column, use subset.
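
A sketch summarizing the dropna() variants discussed above (illustrative data frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, 6.0, np.nan],
                   'C': [3.0, np.nan, np.nan],
                   'D': [4.0, 8.0, np.nan]})

df.dropna()               # axis=0 (default): drop rows that contain any NaN
df.dropna(axis=1)         # drop columns that contain any NaN
df.dropna(how='all')      # drop only rows in which every value is NaN
df.dropna(thresh=3)       # keep only rows with at least 3 non-NaN values
df.dropna(subset=['C'])   # drop rows with NaN in column 'C'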

75
1.4. Data pre-processing for making a good training data set UNIT 01

Imputation
‣ It is sometimes hard to delete a training sample or a certain column because too much useful data would be lost. In that case, estimate the missing values from the other training samples in the data set, i.e., interpolate them. The most commonly used method is mean imputation, which replaces the missing value with the overall mean of the column. In scikit-learn, use the SimpleImputer class.

‣ Impute using df.values.
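
A sketch of mean imputation with SimpleImputer (illustrative data frame):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, np.nan, 11.0],
                   'C': [3.0, np.nan, 12.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = imputer.fit_transform(df.values)   # each NaN replaced by its column mean
print(imputed)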

76
1.4. Data pre-processing for making a good training data set UNIT 01

Imputation

77
1.4. Data pre-processing for making a good training data set UNIT 01

Imputation

Line 45
• Check that each imputed value is the mean of its column.

‣ For the strategy parameter, median and most_frequent can also be set.

78
1.4. Data pre-processing for making a good training data set UNIT 01

Review on the scikit-learn estimator API


‣ In the previous section, the missing values of the data set were imputed using the SimpleImputer class of scikit-learn.
‣ The SimpleImputer class is a transformer class of scikit-learn that is used for data conversion.
‣ The two main methods of such estimators are fit and transform.
‣ Use the fit method to learn the parameters from the training data.
‣ Use the transform method to convert the data using the learned parameters.
‣ The data array to be transformed must have the same number of features as the data used to fit the model.

79
1.4. Data pre-processing for making a good training data set UNIT 01

Categorical data processing


‣ Data is generally classified into categorical scales and continuous scales depending on its features.
‣ In the liberal arts and social sciences, questionnaires are mainly used to collect data.
‣ Categorical scale
• A scale that distinguishes data into different categories; it is classified into the nominal scale and the ordinal scale.
‣ Continuous scale
• A scale that measures continuous data according to the purpose of the survey; it is classified into the interval scale and the ratio scale.
‣ Real data sets can include more than one categorical feature. As explained earlier, categorical data divides into ordered and unordered features. The ordinal scale can be seen as an ordered categorical scale whose values can be arranged in sequence.

80
1.4. Data pre-processing for making a good training data set UNIT 01

Categorical data processing

‣ The data in the table have both ordered and unordered features.
Size is ordered, but color is not. Thus, size is on the ordinal scale while color is on the nominal scale.

81
1.4. Data pre-processing for making a good training data set UNIT 01

Categorical data processing


‣ Convert the ordered data into numerical values. The reason for changing text data into numerical data is to allow a computer to perform arithmetic operations.
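
The table in the slides is a screenshot; a hypothetical table with a nominal feature (color) and an ordinal feature (size) could be encoded like this:

import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1],
                   ['red', 'L', 13.5],
                   ['blue', 'XL', 15.3]],
                  columns=['color', 'size', 'price'])

# Encode the ordered feature with an explicit mapping (XL > L > M)
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
print(df)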

82
1.4. Data pre-processing for making a good training data set UNIT 01

Categorical data processing

83
1.4. Data pre-processing for making a good training data set UNIT 01

Class label encoding


‣ The class refers to the y value, i.e., the column holding the actual values.
‣ Create a mapping to convert the class label from strings to integers.
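
A sketch of such a mapping built with enumerate (the label column is an illustrative assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({'classlabel': ['class1', 'class2', 'class1']})

# enumerate assigns each unique label an integer index
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
print(class_mapping)                                     # {'class1': 0, 'class2': 1}
df['classlabel'] = df['classlabel'].map(class_mapping)   # strings to integers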

84
1.4. Data pre-processing for making a good training data set UNIT 01

Class label encoding


‣ ‘enumerate’ creates an object with an index.

85
1.4. Data pre-processing for making a good training data set UNIT 01

Class label encoding


‣ ‘enumerate’ creates an object with an index.

Line 62
• Change the class label from strings to integers.

86
1.4. Data pre-processing for making a good training data set UNIT 01

Class label encoding


‣ Since the method in the previous slide is rather inconvenient, scikit-learn supports LabelEncoder for easy conversion.
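
A minimal sketch of the same conversion with LabelEncoder (illustrative labels):

from sklearn.preprocessing import LabelEncoder

y = ['class1', 'class2', 'class1']
le = LabelEncoder()
y_encoded = le.fit_transform(y)           # array([0, 1, 0])
print(le.inverse_transform(y_encoded))    # back to the original strings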

87
1.4. Data pre-processing for making a good training data set UNIT 01

Class label encoding


‣ ‘enumerate’ creates an enumeration: an object whose items carry an index.

88
1.4. Data pre-processing for making a good training data set UNIT 01

Application of one-hot encoding to unordered feature


There are cases where categorical data cannot be used directly in a machine learning algorithm such as regression analysis; in such cases, a conversion is required so that a computer can recognize it.
‣ Use dummy variables expressed as 0 or 1. The 0 or 1 does not indicate magnitude; it shows whether a certain feature is present or not.
‣ If a certain feature is present, it is expressed as 1; if not, as 0. One-hot encoding is this conversion of categorical data into a one-hot vector consisting of 0s and 1s that a computer can process.
‣ Practice with the iris.target object.

89
1.4. Data pre-processing for making a good training data set UNIT 01

The encoding is done with integers, so insert the iris ‘species’ value.

90
1.4. Data pre-processing for making a good training data set UNIT 01

The encoding is done with integers, so insert the iris ‘species’ value.

91
1.4. Data pre-processing for making a good training data set UNIT 01

Use the get_dummies() function of pandas to convert every unique value of a categorical variable into a new dummy variable.
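
A sketch applying get_dummies() to the iris species labels:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
species = pd.Series(iris.target_names[iris.target])   # 'setosa', 'versicolor', 'virginica'
dummies = pd.get_dummies(species)    # one dummy variable per unique value
print(dummies.head())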

92
1.4. Data pre-processing for making a good training data set UNIT 01

Use the sklearn library to conveniently process one-hot encoding. The result is given as a sparse matrix in the linear-algebra sense: in a sparse matrix, most entries are 0. The opposite concept is a dense matrix.

[Figure: example of a sparse matrix with 35 entries, of which only 9 are non-zero.]

93
1.4. Data pre-processing for making a good training data set UNIT 01

OneHotEncoder

94
1.4. Data pre-processing for making a good training data set UNIT 01

OneHotEncoder

95
1.4. Data pre-processing for making a good training data set UNIT 01

OneHotEncoder

96
1.4. Data pre-processing for making a good training data set UNIT 01

Conversion to the sparse matrix

Line 82
• (0, 0) is 1, so row 0 is setosa (rows up to index 49 are setosa)
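
A sketch of one-hot encoding the iris target with OneHotEncoder; by default the result is a sparse matrix printed as (row, column) value pairs:

from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

iris = load_iris()
y = iris.target.reshape(-1, 1)            # OneHotEncoder expects a 2-D array

encoder = OneHotEncoder()                 # returns a sparse matrix by default
onehot_sparse = encoder.fit_transform(y)
print(onehot_sparse)                      # e.g. (0, 0) 1.0 for the first setosa row
print(onehot_sparse.toarray()[:3])        # convert to a dense array to inspect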

97
1.4. Data pre-processing for making a good training data set UNIT 01

Refer to the figure below for easier understanding.

98
1.4. Data pre-processing for making a good training data set UNIT 01

Refer to the figure below for easier understanding.

99
1.4. Data pre-processing for making a good training data set UNIT 01

Refer to the figure below for easier understanding.

Row index | Species    | setosa | versicolor | virginica | Sparse matrix expression
0         | setosa     | 1      | 0          | 0         | (0, 0)
1         | setosa     | 1      | 0          | 0         | (1, 0)
…         | setosa     | 1      | 0          | 0         |
49        | setosa     | 1      | 0          | 0         | (49, 0)
50        | versicolor | 0      | 1          | 0         | (50, 1)
51        | versicolor | 0      | 1          | 0         | (51, 1)
…         | versicolor | 0      | 1          | 0         |
99        | versicolor | 0      | 1          | 0         | (99, 1)
100       | virginica  | 0      | 0          | 1         | (100, 2)
…         | virginica  | 0      | 0          | 1         |
149       | virginica  | 0      | 0          | 1         | (149, 2)

100
1.4. Data pre-processing for making a good training data set UNIT 01

Using hold-out in real life that splits the data set into training data set and test set
‣ df_wine contains measurements of wines produced in Vinho Verde, a region adjacent to the Atlantic Ocean in northwest Portugal. The grade, taste, and acidity of 1,599 red wine samples and 4,898 white wine samples were measured and analyzed to create the data. If the data is not found at the following path, you can download it directly from the UCI repository and import it from a local copy.

101
1.4. Data pre-processing for making a good training data set UNIT 01

Using hold-out in real life that splits the data set into training data set and test set

Line 85
• When the wine data set of the UCI machine learning repository is not accessible,
• uncomment the following code and read the data set from a local path:
• df_wine = pd.read_csv('wine.data', header=None)

102
1.4. Data pre-processing for making a good training data set UNIT 01

Using hold-out in real life that splits the data set into training data set and test set

103
1.4. Data pre-processing for making a good training data set UNIT 01

Using hold-out in real life that splits the data set into training data set and test set
‣ Data splitting is possible with the train_test_split function in scikit-learn's model_selection module. First, convert the features at indices 1 to 13 into a NumPy array and assign it to the variable X. The train_test_split function returns four arrays, so assign them to appropriately named variables.

‣ Randomly split X and y into training and test data sets. With test_size=0.3, 30% of the samples are assigned to X_test and y_test.
‣ Regarding the stratify parameter: if the class label array y is passed, the class ratios in the training and test data sets are kept identical to those of the original data set.
‣ The most widely used ratios in practice are 6:4, 7:3, or 8:2, depending on the size of the data set. For large data sets, splitting the training and test data sets 9:1 or even 9.9:0.1 is common and suitable.
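
A sketch of the hold-out split described above (the UCI URL is the repository's usual path; if it is unreachable, read a local copy as noted earlier):

import pandas as pd
from sklearn.model_selection import train_test_split

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       'wine/wine.data')
df_wine = pd.read_csv(url, header=None)

X = df_wine.iloc[:, 1:].values   # features at indices 1 to 13
y = df_wine.iloc[:, 0].values    # class label at index 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)   # stratify keeps the class ratios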

104
1.4. Data pre-processing for making a good training data set UNIT 01

Arranging the scale between features (variables)


‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised Learning)) for detailed
code.

105
1.4. Data pre-processing for making a good training data set UNIT 01

Arranging the scale between features (variables)


‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised Learning)) for detailed
code.

106
1.4. Data pre-processing for making a good training data set UNIT 01

Arranging the scale between features (variables)


‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised Learning)) for detailed
code.

# MaxAbsScaler divides each feature by its maximum absolute value, so the maximum absolute value of each feature becomes 1.
# Each feature is thereby mapped into the [-1, 1] range.
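
A brief sketch comparing the scalers (illustrative data; refer to the practical code for the full example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler

X = np.array([[1.0, -10.0], [2.0, 0.0], [3.0, 30.0]])

print(MinMaxScaler().fit_transform(X))     # each feature rescaled to [0, 1]
print(StandardScaler().fit_transform(X))   # each feature to mean 0, standard deviation 1
print(MaxAbsScaler().fit_transform(X))     # each feature divided by its max absolute value: [-1, 1]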

107
