Machine Learning Lab Guide


Artificial Intelligence Technology and Application

Machine Learning
Lab Guide
Teacher Version

Huawei Technologies CO., LTD.


Contents

1 Feature Engineering on Banks' Private Credit Data
1.1 Introduction
1.1.1 About This Lab
1.1.2 Objectives
1.1.3 Case Background
1.2 Data Preprocessing
1.2.1 Background
1.2.2 Procedure
1.3 Feature Selection
1.3.1 Background
1.3.2 Feature Selection Methods
1.3.3 Filter
1.3.4 Wrapper
1.3.5 Embedded
1.3.6 Variable Removal
1.4 Feature Construction
1.4.1 Background
1.4.2 Polynomial Feature Construction
2 Real-Time Recommendation Practice for Retail Products
2.1 Introduction
2.1.1 About This Lab
2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
2.2.2 Recommending Products Based on the Product Popularity
2.2.3 Recommending Products Based on Collaborative Filtering
3 Private Credit Default Prediction
3.1 Introduction
3.1.1 Objectives
3.1.2 Background
3.2 Procedure
3.2.1 Reading Data
3.2.2 Viewing Missing Values
3.2.3 Splitting the Dataset
3.2.4 Standardizing Data (Preprocessing Data)
3.2.5 Handling the Class Imbalance Issue (Preprocessing)
3.2.6 Performing Grid Search (Modeling)
3.2.7 Verifying Performance (Evaluation)
3.2.8 Saving the Model
4 Survival Prediction of the Titanic
4.1 Introduction
4.1.1 About This Lab
4.1.2 Objectives
4.1.3 Datasets and Frameworks
4.2 Procedure
4.2.1 Importing Related Libraries
4.2.2 Importing Datasets
4.2.3 Preprocessing Data
4.2.4 Building a Model
5 Linear Regression
5.1 Introduction
5.1.1 About This Lab
5.1.2 Objectives
5.2 Procedure
5.2.1 Preparing Data
5.2.2 Defining Related Functions
5.2.3 Starting Iteration
5.3 Thinking and Practices
5.3.1 Question 1
5.3.2 Question 2
6 Flower Category Analysis
6.1 Introduction
6.1.1 About This Lab
6.2 Experiment Code
6.2.1 Importing Related Libraries
6.2.2 Importing a Dataset
6.2.3 Splitting the Dataset
6.2.4 Performing Modeling
6.2.5 Effect After Data Preprocessing
7 Emotion Recognition of Customer Evaluations in the Retail Industry
7.1 Introduction
7.1.1 Objectives
7.1.2 Case background
7.2 Procedure
7.2.1 Data Management
7.2.2 Data Reading
7.2.3 Data Processing
7.2.4 Model Training
8 Boston Housing Price Forecast
8.1 Introduction
8.1.1 About This Lab
8.1.2 Objectives
8.1.3 Experiment Dataset and Framework
8.2 Procedure
8.2.1 Introducing the Dependency
8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data
8.2.3 Splitting and Preprocessing the Dataset
8.2.4 Performing Modeling on the Dataset by Using Various Regression Models
8.2.5 Adjusting Grid Search Hyperparameters
9 E-commerce Website User Group Analysis
9.1 Introduction
9.1.1 About This Lab
9.2 Experiment Code
9.2.1 Using sklearn for Modeling
9.2.2 Selecting the Optimal k Value
Machine Learning Lab Guide-Teacher Version Page 1

1 Feature Engineering on Banks' Private Credit Data

1.1 Introduction
1.1.1 About This Lab
Feature engineering is the process of extracting features from raw data. Data and features determine
the upper limit of machine learning, while models and algorithms only help approach this upper limit.
Feature engineering and construction aim to make the extracted features represent the essential
characteristics of the data as fully as possible, so that a model built on these features predicts
well on unseen datasets.

1.1.2 Objectives
Upon completion of this task, you will be able to:
• Master the Python-based feature selection method.
• Master the Python-based feature extraction method.
• Master the Python-based feature construction method.

1.1.3 Case Background


With the development of online financial services, bank H plans to evaluate customer risks by using
online approval to reduce labor costs and improve approval efficiency. Online approval requires a
stricter and more accurate risk control model to control corporate financial risks. Therefore,
algorithm engineer A needs to complete feature engineering on historical customer credit data before
constructing the credit risk model. Engineer A needs to complete the following operations:
• Data preprocessing
• Feature selection
• Feature construction

1.2 Data Preprocessing


1.2.1 Background
The raw data collected by the back-end server of the bank may have problems such as missing values,
garbled characters, redundant fields, and inconsistent data formats. To improve the data quality,
engineer A needs to cleanse the data first.

1.2.2 Procedure
1.2.2.1 Importing Data
Code:

import pandas as pd
df=pd.read_csv('./credit.csv',index_col=0)
df.head()

Output:

1.2.2.2 Processing Missing Values


Step 1 View the missing values.
The missing values in the data may be caused by machine faults, manual input errors, or service
attributes. The method for processing the missing values varies with the cause.
missingno is a tool for visualizing missing values. You can run the following command to view the
missing-value distribution in the data:
Code:

import missingno # Import the missingno package.


missingno.matrix(df)

Output:

As shown in the figure above, the Nation, Marriage_State, Highest Education, House_State, Industry,
Title, and Duty fields contain a large number of missing values. In Pandas, isnull() flags the
missing values in data, and isnull().sum() counts the missing values in each field. Dividing the
counts by the number of rows gives the missing rate of each field.
Code:

df_missing = pd.DataFrame(df.isnull().sum()/df.shape[0],columns=['missing_rate']).reset_index()
df_missing.sort_values(by='missing_rate',ascending=False)[:15]

Output:

Step 2 Fill the missing values with the mode.


Pandas provides fillna() to fill missing values and mode() to compute the mode of a field. Use a for
loop to process the fields that contain missing values and fill each of them with its mode.
# Define the list of fields with missing values.

missing_col = ['Title','Industry','House_State','Nation','Marriage_State','Highest Education','Duty']

# Use the for loop to process the missing values in the multiple fields.
# mode() returns a Series (a field may have several modes), so take the first value.

for col in missing_col:
    df[col] = df[col].fillna(df[col].mode()[0])

After the processing is complete, check the missing rate of each field.

df_missing_2 = pd.DataFrame(df.isnull().sum()/df.shape[0],columns=['missing_rate']).reset_index()
df_missing_2.sort_values(by='missing_rate',ascending=False)[:15]

The following are methods for handling missing values:


1. Direct deletion: delete non-critical service fields whose missing rate is greater than 80%.
2. Data filling: fill the missing values with fixed values, statistical indicators (such as the mean,
median, or mode), or values predicted by an algorithm.
3. Separate processing: treat samples with missing values as a separate category.
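The three strategies listed above can be sketched on a small table. This is a minimal illustration in plain Python, not the lab's own code: the table contents, the Remark field, and the -1 category code are hypothetical, and None stands for a missing value.

```python
from statistics import mode

# Hypothetical toy table: field name -> values, None marks a missing value.
table = {
    "Title":    [1, None, 1, 2, 1, 2],
    "Remark":   [None, None, None, None, None, 7],  # 5 of 6 values missing
    "Industry": [3, 3, None, 4, 3, 4],
}

def missing_rate(values):
    """Share of entries that are missing."""
    return sum(v is None for v in values) / len(values)

# 1. Direct deletion: drop fields whose missing rate is greater than 80%.
THRESHOLD = 0.8
kept = {name: vals for name, vals in table.items() if missing_rate(vals) <= THRESHOLD}

# 2. Data filling: replace missing entries with the mode of the observed values.
def fill_with_mode(values):
    most_common = mode([v for v in values if v is not None])
    return [most_common if v is None else v for v in values]

filled = {name: fill_with_mode(vals) for name, vals in kept.items()}

# 3. Separate processing: keep "missing" as its own category code.
MISSING_CODE = -1  # assumed code that does not collide with real categories
as_category = {
    name: [MISSING_CODE if v is None else v for v in vals]
    for name, vals in kept.items()
}

print(sorted(kept), filled["Title"], as_category["Industry"])
```

Which strategy fits depends on why the values are missing. For example, if a missing value itself carries service meaning, keeping it as a separate category preserves that signal.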
----End

1.3 Feature Selection


1.3.1 Background
If excessive features remain after data preprocessing, the model may be unstable and have poor
generalization capability, and the computing complexity increases sharply. Therefore, engineer A
needs to preliminarily filter out features that are not important to the prediction result.

1.3.2 Feature Selection Methods


The following are methods for feature selection:
• Filter: filters features based on the statistical indicators of each feature and on the statistical
relationship between each feature and the target variable.
• Wrapper: builds models on different feature subsets and uses the model precision as the evaluation
indicator for each subset.
• Embedded: evaluates feature weights during model training and scores the importance of the
features.

1.3.3 Filter
Step 1 Analyze the crosstab.
Apply the crosstab() method to draw a crosstab by using the variable House_State and the target
variable Target as an example.

cross_table = pd.crosstab(df.House_State,columns = df.Target,margins=True)


cross_table_rowpct = cross_table.div(cross_table['All'],axis = 0)
cross_table_rowpct

In the output, the default rate is 0.019 when House_State is set to 1, and is 0.045 when House_State
is set to 2. If the default rates are considered the same, the variable House_State does not affect the
default prediction.
The crosstab analysis can only be used for preliminary judgment and analysis. The chi-square test is
further needed to determine whether the numerical difference has statistical significance.
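To make the idea of statistical significance concrete, the sketch below computes the classical Pearson chi-square statistic for a 2x2 contingency table by hand. The counts are hypothetical and were chosen only to reproduce default rates of roughly 0.02 and 0.045; they are not taken from credit.csv. Note also that sklearn.feature_selection.chi2 derives its statistic from feature values summed per class, so its numbers are not expected to match this contingency-table calculation.

```python
# Hypothetical 2x2 contingency table (illustrative counts, not from credit.csv):
# rows = House_State 1 and 2, columns = non-default and default.
observed = [
    [980, 20],   # House_State = 1 -> default rate 0.020
    [955, 45],   # House_State = 2 -> default rate 0.045
]

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi_square_statistic(observed)

# 3.841 is the 5% critical value of the chi-square distribution with
# (2 - 1) * (2 - 1) = 1 degree of freedom.
CRITICAL_VALUE_DF1_5PCT = 3.841
is_significant = stat > CRITICAL_VALUE_DF1_5PCT
print(round(stat, 3), is_significant)
```

With these illustrative counts the statistic exceeds the critical value, so the difference in default rates would be treated as statistically significant rather than as random variation.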

Step 2 Perform the chi-square test.


Separate the independent variables and the dependent variable in the raw data, and select the
categorical variables from the independent variables.
The Target field is the target variable and is assigned to y. The remaining columns are assigned to
X as independent variables. X_category contains the categorical variables.

X = df.drop('Target',axis=1)
y = df['Target']
X_category=df[['Nation','Birth_Place','Gender','Marriage_State','Highest Education',
               'House_State','Work_Years','Title','Duty','Industry']]

Import the chi-square test function chi2 from sklearn.feature_selection and use chi2() to calculate
the chi-square value between each categorical variable and the target variable.

from sklearn.feature_selection import chi2

# Store the results under new names so that the imported chi2() function is not overwritten.
chi2_values, p_values = chi2(X_category, y)
dict_feature = {}
for i, j in zip(X_category.columns.values, chi2_values):
    dict_feature[i] = j

ls = sorted(dict_feature.items(),key=lambda item:item[1],reverse=True)
ls

Step 3 Test the continuous variable correlation.


If two continuous independent variables are highly correlated, delete one of the two independent
variables or extract common information from the two independent variables.

nominal_features = ['Nation','Birth_Place','Gender','Marriage_State','Highest Education',
                    'House_State','Work_Years','Unit_Kind','Title',
                    'Occupation','Duty','Industry']
numerical_features = [col_ for col_ in df.columns if col_ not in nominal_features ]
numerical_features.pop(0) # Delete the first element from the list.
X_num = df[numerical_features]

The method parameter of corr() specifies how the correlation coefficient is calculated. The options
are as follows:
• pearson: Pearson correlation coefficient, which measures linear correlation.
• kendall: Kendall rank correlation coefficient, which is used for ordinal (ranked) variables.
• spearman: Spearman rank correlation coefficient, which is mainly used for correlation analysis of
nonlinear (monotonic) and non-normally distributed data.

import matplotlib.pyplot as plt


import seaborn as sns
corr_matrix = X_num.corr(method='pearson')
plt.figure(figsize=(25, 15))
sns.heatmap(corr_matrix, annot=True) # Display the correlations between variables as a heatmap.
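The practical difference between the pearson and spearman options can be seen on a relationship that is monotonic but not linear. The following is a from-scratch sketch on hypothetical data, written in plain Python for illustration; in the lab itself, corr(method=...) performs the calculation. The ranks() helper assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y grows with x, but cubically rather than linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(round(pearson_r, 3), round(spearman_r, 3))
```

Because the ranks of x and y agree exactly, the Spearman coefficient reaches 1, while the Pearson coefficient stays below 1 because the relationship is not a straight line.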

Calculate the correlation coefficient between continuous independent variables and select the
combination of independent variables whose correlation coefficient is greater than 0.8.

cols_pair = []
for index_ in corr_matrix.index:
    for col_ in corr_matrix.columns:
        if corr_matrix.loc[index_,col_] >= 0.8 and index_!=col_ and (col_,index_) not in cols_pair:
            cols_pair.append((index_,col_))
cols_pair

----End

1.3.4 Wrapper
In the wrapper selection method, different feature subsets are used for modeling, and the model
precision is used as the evaluation indicator for the feature subsets. A base model is selected to
perform multi-round training. After each round of training, the features with the smallest weight
coefficients are removed, and the next round of training is performed based on the new feature set.
The RFE() method of the feature_selection submodule in sklearn is invoked. The logistic regression
model LogisticRegression() is used as the base model and is passed to RFE() as a parameter.
RFE parameters and methods:
estimator: base training model, which is a logistic regression model in this example.
n_features_to_select: number of features to retain.
fit(X,y): trains the model and performs the feature selection.

from sklearn.feature_selection import RFE


from sklearn.linear_model import LogisticRegression
x_rfe=RFE(estimator=LogisticRegression(), n_features_to_select=20).fit(X, y)
print(x_rfe.n_features_ )
print(x_rfe.support_ )
print(x_rfe.ranking_ )
print(x_rfe.estimator_ )

Output:

20
[ True True False True True True False True True True True False
False True False True True False True True True True True True
True False False False True False]
[ 1 1 9 1 1 1 10 1 1 1 1 6 3 1 11 1 1 8 1 1 1 1 1 1
1 5 4 7 1 2]
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)

After fitting, the RFE object provides the following attributes:
• n_features_: number of selected features, that is, the value of the n_features_to_select
parameter passed to the RFE() method.
• support_: Boolean mask with one entry per feature, in column order. True indicates that the
feature is retained, and False indicates that the feature is removed.
• ranking_: feature ranking. ranking_[i] is the ranking of the ith feature. The value 1 indicates a
selected feature. The 20 features ranked 1 are the selected, optimal features.
• estimator_: the base model fitted on the selected features.
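The mask and ranking are easier to read once they are mapped back to feature names. The sketch below uses hypothetical names and values for a five-feature case with three retained features; the lists only mimic the layout of support_ and ranking_ and are not output from the lab data.

```python
from itertools import compress

# Hypothetical feature names and RFE results for a five-feature dataset
# with n_features_to_select=3 (values are illustrative only).
feature_names = ["Age", "Gender", "Year_Income", "Nation", "Std_Cred_Limit"]
support = [True, False, True, False, True]   # same layout as x_rfe.support_
ranking = [1, 3, 1, 2, 1]                    # same layout as x_rfe.ranking_

# Retained features: positions where the mask is True.
selected = list(compress(feature_names, support))

# Eliminated features, ordered from the last one removed (rank 2) to the
# first one removed (highest rank).
eliminated = sorted(
    (rank, name) for name, rank in zip(feature_names, ranking) if rank > 1
)

print(selected)
print(eliminated)
```

With the real objects from this lab, the same selection can be written as X.columns[x_rfe.support_], because a pandas column index accepts a Boolean mask.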

1.3.5 Embedded
The embedded method trains a machine learning model to obtain the weight coefficient of each
feature, and selects features in descending order of the weight coefficients.
Common embedded methods are based on either of the following:
• Linear model and regularization
• Feature selection of a tree model
In a tree model, the importance of a feature is measured by how much the splits on that feature
reduce node impurity. In this experiment, the random forest is used to calculate the importance of
each feature.
The RandomForestClassifier class in the sklearn.ensemble submodule is invoked, and the model is
trained by using the fit(X,y) method.

from sklearn.ensemble import RandomForestClassifier


rfc=RandomForestClassifier()
rfc.fit(X,y)

After the model training is complete, the weight evaluation value of each feature is printed.

cols=[i for i in X.columns]



sorted_feature=sorted(zip(map(lambda x:round(x,4),rfc.feature_importances_),cols),reverse=True)
sorted_feature

Output:

[(0.1315, 'Ast_Curr_Bal'),
(0.1286, 'Age'),
(0.0862, 'Year_Income'),
(0.0649, 'Std_Cred_Limit'),
(0.043, 'ZX_Max_Account_Number'),
(0.0427, 'Highest Education'),
(0.0416, 'ZX_Link_Max_Overdue_Amount'),
(0.0374, 'ZX_Max_Link_Banks'),
(0.0355, 'Industry'),
(0.0354, 'ZX_Max_Overdue_Duration'),
(0.0311, 'ZX_Total_Overdu_Months'),
(0.0305, 'Marriage_State'),
(0.0305, 'Duty'),
(0.0292, 'Couple_Year_Income'),
(0.0279, 'ZX_Credit_Max_Overdu_Amount'),
(0.0246, 'ZX_Max_Overdue_Account'),
(0.0241, 'ZX_Max_Credit_Banks'),
(0.0221, 'ZX_Max_Credits'),
(0.0205, 'Birth_Place'),
(0.0195, 'Loan_Curr_Bal'),
(0.0173, 'L12_Month_Pay_Amount'),
(0.015, 'ZX_Credit_Max_Overdue_Duration'),
(0.013, 'Title'),
(0.0097, 'ZX_Credit_Total_Overdue_Months'),
(0.0096, 'Nation'),
(0.0084, 'Gender'),
(0.0079, 'Work_Years'),
(0.0064, 'ZX_Max_Overdue_Credits'),
(0.0059, 'House_State'),
(0.0, 'Couple_L12_Month_Pay_Amount')]
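The scores above come from feature_importances_, which in scikit-learn's random forest is an impurity-based measure: the impurity reductions achieved by the splits on each feature are accumulated over all trees and then normalized. As a minimal sketch of the quantity being accumulated, the code below computes the Gini impurity decrease of a single split on hypothetical labels. It illustrates the idea only and does not reproduce scikit-learn's internal bookkeeping.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Gini impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node with four non-default (0) and four default (1) samples.
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes only partially.
weak_split = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])

# A split that separates the classes completely.
perfect_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print(weak_split, perfect_split)
```

A feature whose splits produce large decreases, as in the second case, receives a high score. A feature that is never used for a split contributes nothing, which is consistent with a score of 0.0.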

1.3.6 Variable Removal


Based on the results of the preceding three feature selection methods, the variables that have
little impact on the model are removed.

del_cols = ['Gender','House_State','Couple_Year_Income','Loan_Curr_Bal','ZX_Max_Credit_Banks',
            'ZX_Max_Overdue_Credits','ZX_Credit_Max_Overdu_Amount','ZX_Credit_Max_Overdue_Duration']
df_select = df.drop(del_cols,axis=1)
df_select.head()

1.4 Feature Construction


1.4.1 Background
Feature selection evaluates the importance of each feature to model construction and removes
variables that have little impact to reduce dimensions. Feature construction manually derives new,
meaningful features from raw data. On the one hand, a new variable can be constructed by combining
several features based on service understanding; on the other hand, variables can be split into
different time windows according to a time attribute.
Engineer A has completed the preliminary filtering of features, removed some variables that have
little impact on the model, and now attempts to construct some new features to improve the model
precision.

1.4.2 Polynomial Feature Construction


Polynomial feature construction explores the impact of combined variables on the target variable by
multiplying existing features together. The PolynomialFeatures() method in the sklearn.preprocessing
submodule is used for feature interaction. The variables with the highest scores in the model, that
is, Ast_Curr_Bal, Age, Year_Income, and Std_Cred_Limit, are selected to construct polynomial
features.
PolynomialFeatures(degree=3): sets the maximum degree of the generated terms to 3, that is,
constructs all products of the input variables whose powers sum to at most 3 (for example,
Age^2 Std_Cred_Limit).

from sklearn.preprocessing import PolynomialFeatures


poly_feature = df[['Ast_Curr_Bal','Age','Year_Income','Std_Cred_Limit']] # Select the fields used to construct the polynomial features.
poly_trans = PolynomialFeatures(degree = 3)
ptf = poly_trans.fit(poly_feature) # Invoke the fit() method to construct the polynomial feature.
poly_feature = poly_trans.transform(poly_feature) # Convert data.
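To see which terms a degree-3 expansion of four fields contains, the monomials can be enumerated by hand. This is a plain-Python sketch of the enumeration, not scikit-learn's implementation; polynomial_terms() is a hypothetical helper, and the naming style simply imitates the column names shown in this lab's output.

```python
from collections import Counter
from itertools import combinations_with_replacement

fields = ["Ast_Curr_Bal", "Age", "Year_Income", "Std_Cred_Limit"]

def polynomial_terms(names, degree):
    """List every monomial of total degree 0..degree, lowest degree first."""
    terms = ["1"]  # degree-0 term: the constant (bias) column
    for d in range(1, degree + 1):
        # Each multiset of d names corresponds to one monomial of degree d.
        for combo in combinations_with_replacement(names, d):
            powers = Counter(combo)  # e.g. ('Age', 'Age', 'X') -> Age: 2, X: 1
            terms.append(" ".join(
                name if power == 1 else f"{name}^{power}"
                for name, power in powers.items()
            ))
    return terms

terms = polynomial_terms(fields, degree=3)
print(len(terms))
print(terms[:6])
```

Four fields expanded to degree 3 give 1 + 4 + 10 + 20 = 35 columns. The constant column "1" has zero variance, so its correlation with any other variable is undefined and appears as NaN when correlations are calculated.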

To check the correlation between the newly generated variable and the target variable, construct a
dataset containing the target variable and the newly generated variable first.

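# get_feature_names_out() requires scikit-learn 1.0 or later; earlier versions use get_feature_names().
poly_features=pd.DataFrame(poly_feature,columns=poly_trans.get_feature_names_out(['Ast_Curr_Bal','Age','Year_Income','Std_Cred_Limit']))
poly_features['Target']=y.values # Assign by position because poly_features has a new default index.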
poly_features.head()

The corr() function is used to calculate the correlation coefficient between the newly generated
variable and the target variable.

poly_corrs = poly_features.corr()['Target'].sort_values()
print("five features with the smallest correlation coefficients: \n",poly_corrs.head(5))
print("five features with the largest correlation coefficients: \n",poly_corrs.tail(5))

Output:

Five features with the smallest correlation coefficients:


Age^3 -0.010601
Age^2 -0.009275
Age^2 Std_Cred_Limit -0.008064
Age -0.007356
Age Std_Cred_Limit -0.006834
Name: Target, dtype: float64
Five features with the largest correlation coefficients:
Year_Income^3 -0.001910
Ast_Curr_Bal Age -0.001114
Ast_Curr_Bal 0.002849
Target 1.000000
1 NaN
Name: Target, dtype: float64
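In the output above, Target correlates perfectly with itself, and the constant column 1 has an undefined (NaN) correlation because it has zero variance. The following sketch, on a made-up DataFrame with a hypothetical feature f, shows one possible way to drop both entries and rank the remaining features by absolute correlation.

```python
import pandas as pd

toy = pd.DataFrame({
    '1': [1.0, 1.0, 1.0, 1.0],    # constant bias column, as produced by PolynomialFeatures
    'f': [0.1, 0.4, 0.35, 0.8],   # hypothetical feature
    'Target': [0, 0, 1, 1],
})

corrs = toy.corr()['Target']

# Remove the target itself and the constant column, then rank by absolute correlation.
ranked = corrs.drop(['Target', '1']).abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value treats strong negative correlations as equally informative as strong positive ones.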

2 Real-Time Recommendation Practice for Retail Products

2.1 Introduction
2.1.1 About This Lab
Mr. Zhao works in the AI algorithm department of e-commerce platform company A and is responsible
for product recommendation for online businesses. In the modern world of the Internet and
e-commerce, people are overwhelmed by data. Much of it is useful, but users cannot realistically
extract the information they are interested in by themselves. To help users find product information,
a recommendation system measures similarities between users and products and makes suggestions to
customers based on those similarities. A recommendation system is beneficial in:
• Helping users find the right products.
• Increasing user engagement. For example, Google News saw a 40% increase in hits due to
recommendations.
• Helping product providers deliver products to the right users. At Amazon, 35% of products are
sold through recommendations.
• Helping personalize the recommended content. At Netflix, most rented movies are
recommended ones.

2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
Step 1 Import the required packages.
Functions in the NumPy library are used to perform basic operations on arrays. Pandas provides many
data processing methods and time series operation methods. Matplotlib and seaborn are used for plotting.

# Import module packages required by the project.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2 Read data.

electronics_data=pd.read_csv("./data/ratings_Electronics.csv",names=['userId', 'productId','Rating','timestamp'])

Step 3 Check the data overview.


View the format of the read data. You can use the head() function to check the first five rows of the
data to get a rough understanding of the data content.

electronics_data.head()

Step 4 View the data size.



You can further view the data size (the number of samples and the number of features in the data) by
using the shape attribute.

electronics_data.shape

Step 5 View the data type.


After checking the data size, view the data type of each column by using the dtypes attribute to
facilitate subsequent data calculation.

electronics_data.dtypes

According to the result, only Rating and timestamp fall into the numeric type and can be used for
mathematical calculation. If userId and productId need to be used for mathematical calculation,
convert the types of them. In addition, you can use the info() function to view the general information
about the data.

electronics_data.info()

The result contains the number of data samples, the column names, the data types, and the memory
usage. The info() function displays all of this information by default, but you can set an item to False
to hide it. For example, you can run the following command to hide the memory usage:

electronics_data.info(memory_usage =False)

Step 6 View the product ratings from users.


Product ratings are important data that can reflect users' preference. The data is critical to an efficient
recommendation system. You can use the describe function to check the data overview of the numeric
type. To view only the preliminary data analysis of Rating, add the corresponding column name in
square brackets to the end of the command.

electronics_data.describe()['Rating']

The result contains the count, average value, standard deviation, minimum value, quartiles, and
maximum value of the data, and the product rating is generally about 4. You can use the min() and
max() functions to print the minimum and maximum values of the rating.

print('Minimum rating is: %d' %(electronics_data.Rating.min()))


print('Maximum rating is: %d' %(electronics_data.Rating.max()))

The print() function is used here to display the results. According to the result, the highest rating
is 5. Together with the average rating of about 4, this indicates that users' ratings are generally high.

Step 7 View the missing values of the data.


The most important factors that affect data quality are missing values and abnormal values. The
ratings all fall within the normal range, so the remaining check is for missing values: use the isnull()
function to mark the null entries, and then use the sum() function to count the number of null values
in each column.

print('Number of missing values across columns: \n',electronics_data.isnull().sum())

Step 8 Count the unique users and products.


A user can rate multiple products. Similarly, a product can be rated by different users. To determine
the number of distinct products and users, count the unique values of productId and userId.

print("Total data ")


print("-"*50)

print("\nTotal no of ratings :",electronics_data.shape[0])


print("Total No of Users :", len(np.unique(electronics_data.userId)))
print("Total No of products :", len(np.unique(electronics_data.productId)))

Step 9 Delete time information.


You can use the drop() function to delete the product rating time.
axis: drops the named columns when it is set to 1, and drops the rows with the given index labels
when it is set to 0.
inplace: modifies the DataFrame in place instead of returning a new one when it is set to True.

electronics_data.drop(['timestamp'], axis=1,inplace=True)
electronics_data.head()

Step 10 Analyze the rating data.


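Count the number of ratings given by each user, sort the users by that count, and view the result.
groupby(): groups the rows by the values of the specified column.
sort_values(): sorts the values.
ascending: sorts in ascending order when set to True; ascending=False sorts in descending order.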

no_of_rated_products_per_user = electronics_data.groupby(by='userId')['Rating'].count().sort_values(ascending=False)
no_of_rated_products_per_user.head()

no_of_rated_products_per_user.describe()

After counting the ratings given by each user, use the quantile() function to compute the quantiles
of these counts, and plot them.

quantiles = no_of_rated_products_per_user.quantile(np.arange(0,1.01,0.01), interpolation='higher')


plt.figure(figsize=(10,10))
plt.title("Quantiles and their Values")
quantiles.plot()
# quantiles with 0.05 difference
plt.scatter(x=quantiles.index[::5], y=quantiles.values[::5], c='orange', label="quantiles with 0.05 intervals")
# quantiles with 0.25 difference
plt.scatter(x=quantiles.index[::25], y=quantiles.values[::25], c='m', label = "quantiles with 0.25 intervals")
plt.ylabel('No of ratings by user')
plt.xlabel('Quantile')
plt.legend(loc='best')
plt.show()

print('\n No of rated product more than 50 per user : {}\n'.format(sum(no_of_rated_products_per_user >= 50)) )
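The interpolation='higher' argument used above makes quantile() return an observed value instead of interpolating between two neighbors. The following sketch illustrates the difference on five made-up rating counts (not taken from the lab dataset).

```python
import pandas as pd

# Hypothetical numbers of ratings given by five users.
counts = pd.Series([1, 2, 3, 4, 60])

# The 0.9 quantile falls between the 4th and 5th sorted values.
q90_linear = counts.quantile(0.9)                          # default: linear interpolation
q90_higher = counts.quantile(0.9, interpolation='higher')  # take the upper neighbor

print(q90_linear, q90_higher)
```

With 'higher', every plotted quantile corresponds to a rating count that some user actually has, which suits count data such as the number of ratings per user.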

----End

2.2.2 Recommending Products Based on the Product Popularity


Ranking the products by the number of ratings they receive gives a measure of product popularity,
which can be used directly to recommend the most popular products.

Step 1 Sort products.


Similar to user sorting, products can be filtered based on the rating data to keep only products that
have been rated at least 50 times.

new_df=electronics_data.groupby("productId").filter(lambda x:x['Rating'].count() >=50)


no_of_ratings_per_product = new_df.groupby(by='productId')['Rating'].count().sort_values(ascending=False)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = plt.gca()
plt.plot(no_of_ratings_per_product.values)
plt.title('# RATINGS per Product')
plt.xlabel('Product')
plt.ylabel('No of ratings per product')
ax.set_xticklabels([])
plt.show()
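The filtering in Step 1 can be checked on a small example. The sketch below uses a made-up table and a threshold of 2 instead of 50 (both are assumptions for illustration), and also shows an equivalent transform-based mask that tends to be faster on large tables.

```python
import pandas as pd

toy = pd.DataFrame({
    'productId': ['p1', 'p1', 'p1', 'p2', 'p3', 'p3'],
    'Rating':    [5, 4, 3, 2, 5, 1],
})
min_ratings = 2  # the lab uses 50

# Keep only the rows of products that have at least min_ratings ratings.
kept = toy.groupby('productId').filter(lambda g: g['Rating'].count() >= min_ratings)

# Equivalent approach: compute each row's group size and build a Boolean mask.
mask = toy.groupby('productId')['Rating'].transform('count') >= min_ratings
kept_fast = toy[mask]

counts = kept.groupby('productId')['Rating'].count().sort_values(ascending=False)
print(counts)
```

Both approaches keep the original rows and index, so either result can be passed to the later steps unchanged.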

Step 2 Sort the products by the average rating.


Calculate the average rating of each product, and then sort the products based on the average rating.

# Calculate the average rating of each product.


new_df.groupby('productId')['Rating'].mean().head()

new_df.groupby('productId')['Rating'].mean().sort_values(ascending=False).head()

# Obtain the rankings of the products sorted by the number of rating times.
new_df.groupby('productId')['Rating'].count().sort_values(ascending=False).head()

ratings_mean_count = pd.DataFrame(new_df.groupby('productId')['Rating'].mean())
ratings_mean_count['rating_counts'] = pd.DataFrame(new_df.groupby('productId')['Rating'].count())
ratings_mean_count.head()

The result shows that the product with the highest average rating is rated by 1051 users.

ratings_mean_count['rating_counts'].max()# View the maximum number of rating times.

Step 3 Visualize the result.


Analyze the product rankings and display the result in a chart. Specifically, use a histogram first to
display the distribution of the number of ratings that each product receives.
hist(): draws a histogram.
bins: number of bins (buckets) in the histogram.

plt.figure(figsize=(8,6))# Set the image size.


plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['rating_counts'].hist(bins=50)

plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['Rating'].hist(bins=50)

plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
sns.jointplot(x='Rating', y='rating_counts', data=ratings_mean_count, alpha=0.4)

Sort the products by the number of users who rate the products, to obtain the product popularity.

popular_products = pd.DataFrame(new_df.groupby('productId')['Rating'].count())
most_popular = popular_products.sort_values('Rating', ascending=False)
most_popular.head(30).plot(kind = "bar")

----End

2.2.3 Recommending Products Based on Collaborative Filtering


Recommending products based only on popularity gives every user the same list and is not enough to
meet actual requirements. Therefore, the widely used collaborative filtering method is used to
implement personalized recommendation.

Step 1 Create a table of relationships between products and users.


Select 10,000 samples and use pivot_table() to create a table of relationships between products and
users.

new_df1=new_df.head(10000)
ratings_matrix = new_df1.pivot_table(values='Rating', index='userId', columns='productId', fill_value=0)
ratings_matrix.head()
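The following sketch shows what pivot_table() produces on four made-up ratings (hypothetical users u1 to u3 and products p1 to p3). It is only an illustration of the table layout, not part of the lab data.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId':    ['u1', 'u1', 'u2', 'u3'],
    'productId': ['p1', 'p2', 'p1', 'p3'],
    'Rating':    [5, 3, 4, 2],
})

# Rows are users, columns are products, and each cell holds the rating.
# fill_value=0 marks user-product pairs that have no rating.
matrix = toy.pivot_table(values='Rating', index='userId', columns='productId', fill_value=0)
print(matrix)
```

Note that a 0 here means "not rated" rather than a low rating, and that pivot_table() averages the ratings if the same user rates the same product more than once.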

You can use the shape attribute to view the table size, and then transpose the table so that each row
represents a product. The values in the table are the product ratings from users.

ratings_matrix.shape

X = ratings_matrix.T
X.head()

X.shape# View the data size.

Step 2 Decompose the table.


You can use the truncated SVD algorithm to reduce the dimensions of the table, so that each product
is represented by 10 latent features.

from sklearn.decomposition import TruncatedSVD # Import the SVD algorithm.


# Construct an SVD model to combine the features (that is, the columns) into 10 important combined features.
SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)# Transform the table.
decomposed_matrix.shape# View the size of the table after conversion.

Step 3 Build a correlation coefficient matrix.



Calculate product similarities to implement a product-based recommendation system.


corrcoef(): calculates the Pearson correlation coefficient between each pair of rows.

correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape

Step 4 Recommend products based on the product similarities.


Randomly select a product, select products whose coefficient of correlation with the selected one is
greater than 0.65, and recommend these products to users who like the selected one.

X.index[20]# Select the 20th product.

# Look up the row position of the selected product in the table.


i = "9984984354"
product_names = list(X.index)
product_ID = product_names.index(i)
product_ID

# Take the row of correlation coefficients between the selected product and all products.


correlation_product_ID = correlation_matrix[product_ID]
correlation_product_ID.shape

#Select products whose coefficient of correlation with the 20th product is greater than 0.65.
Recommend = list(X.index[correlation_product_ID > 0.65])
# Delete the 20th product.
Recommend.remove(i)
Recommend[0:10]# Recommend products ranked ahead to the users who like the 20th product.

As shown in the result, there are eight products whose coefficient of correlation with the 20th product
(9984984354) is greater than 0.65. You can also select other products to view their similar products.
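The four steps above can be wrapped in one helper so that other products are easy to try. The following is a minimal sketch under stated assumptions: recommend_similar() is a hypothetical name, the toy matrix is made up, and 3 components are used instead of 10 because the toy table has only six users.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

def recommend_similar(product_matrix, product_id, n_components=3, threshold=0.65):
    """Return the products whose latent-feature correlation with product_id exceeds threshold."""
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    latent = svd.fit_transform(product_matrix)   # one row of latent features per product
    corr = np.corrcoef(latent)                   # product-by-product correlation matrix
    position = list(product_matrix.index).index(product_id)
    similar = product_matrix.index[corr[position] > threshold]
    return [p for p in similar if p != product_id]

# Rows are products and columns are users, like X in this lab.
toy = pd.DataFrame(
    [[5, 4, 0, 0, 1, 0],
     [5, 4, 0, 0, 1, 0],   # p2 is rated exactly like p1
     [0, 0, 5, 4, 0, 1],
     [1, 0, 4, 5, 0, 0],
     [0, 1, 0, 0, 5, 4]],
    index=['p1', 'p2', 'p3', 'p4', 'p5'],
    columns=['u1', 'u2', 'u3', 'u4', 'u5', 'u6'],
)

recs = recommend_similar(toy, 'p1')
print(recs)
```

Fixing random_state makes the randomized SVD repeatable, which helps when comparing thresholds.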

----End

3 Private Credit Default Prediction

3.1 Introduction
Under the impact of the Internet, financial institutions face both internal and external pressure.
On one hand, they encounter fierce competition and performance pressure from large financial and
technology enterprises; on the other hand, more and more criminal groups use artificial intelligence
(AI) technologies to commit crimes more efficiently. These risks are hidden in each transaction phase,
and if they are not prevented, the losses may be irreparable. Therefore, financial institutions place
increasingly high requirements on risk management accuracy and approval efficiency.
This experiment works through the problem step by step, from problem statement, breakdown, and
priority ranking to solution design, key point analysis, and summary and suggestions. It aims to
cultivate project implementation thinking and to implement private credit default prediction from
scratch.

3.1.1 Objectives
Upon completion of this task, you will be able to:
• Understand the significance of credit default prediction.
• Master the development process of big data mining projects.
• Master the common algorithms for private credit default prediction.
• Understand the importance of data processing and feature engineering.
• Master the common methods for data preprocessing and feature engineering.
• Master the algorithm principles of logistic regression and XGBoost, and understand the key
parameters.

3.1.2 Background
The case in this document is for reference only. The actual procedure may vary. For details, see the
corresponding product documents.
The company has just set up a project team for private credit default prediction. Engineer A was
appointed as the offline development PM of the project. This project aims to:
• Identify high-risk customers efficiently and accurately using new technologies.
• Make risk models data-based by using scientific methods.
• Provide objective risk measurement.
• Reduce subjective judgments.
• Improve risk management efficiency.
• Save labor costs.
The ultimate goal is to productize the results, so that front-end operating departments can identify
transactions with credit default risks in a timely manner to avoid corporate losses.

3.2 Procedure
3.2.1 Reading Data
First, import the dataset. This document uses the third-party library pandas to import the dataset.

import pandas as pd
# Use pd.read_csv to read the dataset. (The dataset is stored in the current directory, so a relative path can be used.)
# In ./credit.csv, ./ indicates the current directory.
# Use the forward slash (/) as the path separator, as in the Linux operating system (OS).
# Python also accepts forward slashes in the Windows OS, whereas a backslash (\) in an ordinary string
# may be interpreted as an escape character.
# The forward slash is on the same key of the keyboard as the question mark (?).
data=pd.read_csv('./credit.csv')
# An auxiliary module warnings can be imported.
import warnings
warnings.filterwarnings('ignore')
# This module can help filter many redundant and annoying warnings.
# After data reading, some simple operations can be performed, for example:
# Run the following command to view all data.
data
# Run the following command to view the first 10 rows of data.
data.head(10)
# Run the following command to view the length and width of data in the matrix format.
data.shape

3.2.2 Viewing Missing Values


# Check the data missing status in a visualized manner.
# The third-party library missingno is used.
import missingno
missingno.matrix(data)

# Many values are missing and need to be filled.
# There are many filling methods. The missing values can be filled with the mean, the median, or the mode.
# Numeric features can be discrete or continuous.
# Filling a discrete feature with the mean may create a value that does not exist in the data.
# Therefore, the mode is used for simplicity.
# isnull() marks each value as True if it is missing and as False otherwise.
# In Python, True counts as 1 and False as 0, so sum() returns the number of missing values in a column.
# The features with at least one missing value are placed in the missname list.
missname=[i for i in data if data[i].isnull().sum()>0]
for i in missname:
    # mode() returns a Series because there may be ties; [0] selects the first mode.
    # fillna() is used to fill the missing values with the mode.
    data[i]=data[i].fillna(data[i].mode()[0])

3.2.3 Splitting the Dataset


Then, split the dataset. Before splitting the dataset, remove the customer ID column (Cust_No), which
carries no predictive information and can interfere with the model, and remove Target (the result),
which must not be used as an input to the model.

X=data.drop(['Cust_No','Target'],axis=1)
y=data['Target']

X is equivalent to an independent variable in mathematics, and y is equivalent to a dependent variable.


Import the dataset splitting function to split the dataset.

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.1,shuffle=True)

X_train is the training set, and y_train contains the labels of the training set. X_test is the test set,
and y_test contains the labels of the test set. test_size=0.1 indicates that the ratio of the training set
to the test set is 9:1. shuffle=True indicates that the samples are shuffled before the split.
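When one class is rare, as with defaulters in this lab, a purely random split can leave very few positive samples in the test set. One option, shown below as a sketch on made-up data, is the stratify argument of train_test_split, which keeps the class ratio the same in both splits. The array names and the random_state value are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples, 10 of which belong to the rare class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.1,
    shuffle=True,
    stratify=y_toy,      # preserve the 10% share of class 1 in both splits
    random_state=42,     # fixed seed so that the split is repeatable
)
print(y_tr.sum(), y_te.sum())
```

Setting random_state also makes later comparisons between models easier, because every run uses the same split.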

3.2.4 Standardizing Data (Preprocessing Data)


After the dataset is split, standardize the data.

from sklearn.preprocessing import StandardScaler


# Fit the scaler on the training set only so that no information from the test set leaks into training.
std_scaler=StandardScaler().fit(X_train)
X_train_std=std_scaler.transform(X_train)
X_test_std=std_scaler.transform(X_test)

Standardization rescales each feature to zero mean and unit variance so that features measured on
different scales contribute comparably to the model. It does not change the shape of a feature's
distribution, so it does not make the data normally distributed. In the preceding commands, the
standardization class StandardScaler() is first instantiated. The fit() method then computes the
average value and standard deviation of each feature in the training set, and transform() applies the
same scaling to both the training set and the test set.
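StandardScaler computes z = (x - mean) / std for each feature. The sketch below checks this on a few made-up numbers, fitting the scaler on the training rows and reusing the same statistics for a test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

scaler = StandardScaler().fit(train)   # learns the mean and std of the training rows
train_std = scaler.transform(train)
test_std = scaler.transform(test)      # reuses the training mean and std

print(train_std.mean(axis=0), train_std.std(axis=0))
print(test_std)
```

After scaling, both training features have mean 0 and standard deviation 1, while the test row is expressed relative to the training statistics and is therefore not centered itself.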

3.2.5 Handling the Class Imbalance Issue (Preprocessing)


Next, handle the class imbalance issue, that is, the large difference between the number of positive
samples and the number of negative samples. In this dataset, the number of defaulters (represented
by 1) is very small, and the number of non-defaulters (represented by 0) is very large. As a result, a
model trained on the raw data tends to classify people as non-defaulters. Check the current class
ratio first.

from collections import Counter


Counter(y_train)
# Use collections in the standard library to query the results.
# Import the third-party library imblearn.
from imblearn import over_sampling
fixtool=over_sampling.SMOTE()
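# fit_sample() has been removed from recent imbalanced-learn releases; fit_resample() is the current method.
X_train_fix,y_train_fix=fixtool.fit_resample(X_train_std,y_train)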
# X_train_fix and y_train_fix are the corrected data.
# Next, check the number of samples.
Counter(y_train_fix)
# Check the corrected y_train_fix instead of the original y_train.
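SMOTE balances the classes by creating synthetic minority samples on the line segments between existing minority samples and their neighbors. The following NumPy sketch illustrates only that interpolation idea on made-up points. It is a simplification and not the imblearn implementation: it uses just the single nearest neighbor, and synth_point() is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three made-up minority-class samples with two features each.
minority = np.array([[1.0, 1.0],
                     [1.2, 0.9],
                     [0.8, 1.1]])

def synth_point(X, i, rng):
    """Create one synthetic sample between row i and its nearest neighbor."""
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                      # exclude the sample itself
    j = int(np.argmin(dist))              # index of the nearest neighbor
    lam = rng.random()                    # random position along the segment, in [0, 1)
    return X[i] + lam * (X[j] - X[i])

new_points = np.array([synth_point(minority, i, rng) for i in range(len(minority))])
print(new_points)
```

Because each new point lies between two real minority samples, oversampling this way adds variety instead of duplicating rows. It should be applied to the training set only, as in the lab code above.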

3.2.6 Performing Grid Search (Modeling)


from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Declare the logistic regression algorithm and set max_iter (the maximum number of solver iterations) to 500.
# Use cross-validation to evaluate the model. cv=5 indicates that the dataset is split into five equal parts;
# each part is used once for validation while the other four parts are used for training.
lr_model = LogisticRegression(solver='liblinear',max_iter=500)
cv_scores = cross_val_score(lr_model,X_train_fix,y_train_fix,scoring='roc_auc',cv=5)
# Apply grid search to find the optimal parameters through traversal.
# Import the grid search module.
from sklearn.model_selection import GridSearchCV
# C is the inverse of the regularization strength. A smaller C means stronger regularization.
c_range=[0.001,0.01,0.1,1.0]
# solvers indicates the optimization method.
solvers=['liblinear','lbfgs','newton-cg','sag']
# Combine the regularization coefficient with the optimization method using the dictionary method.
tuned_parameters=dict(solver=solvers,C=c_range)
# Declare the logistic regression algorithm.
lr_model=LogisticRegression(solver='liblinear',max_iter=500)
# Declare the grid search algorithm and describe the cross verification method.
grid=GridSearchCV(lr_model,tuned_parameters,cv=5,scoring='roc_auc')
# Perform training.
grid.fit(X_train_fix,y_train_fix)
# Check the best mean cross-validated ROC AUC (the scoring metric selected above).
print(grid.best_score_)
# Check which parameters are optimal.
print(grid.best_params_)
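The same grid search pattern can be tried quickly on synthetic data before running it on the credit dataset. The sketch below is an illustration under assumptions: the data comes from make_classification, and the grid is smaller than the one in the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the credit dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {'C': [0.01, 0.1, 1.0], 'solver': ['liblinear', 'lbfgs']}
search = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=5, scoring='roc_auc')
search.fit(X_toy, y_toy)

# cv_results_ holds one entry per parameter combination (3 values of C x 2 solvers).
n_candidates = len(search.cv_results_['params'])
print(n_candidates, search.best_params_, round(search.best_score_, 3))
```

The number of model fits is the number of combinations multiplied by cv, so the grid is best kept small when the dataset is large.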

3.2.7 Verifying Performance (Evaluation)


Then, verify the result.

from sklearn.metrics import accuracy_score,precision_score,recall_score,roc_auc_score


# Use the obtained optimal parameters for modeling.
lr_model=LogisticRegression(C=grid.best_params_['C'],solver=grid.best_params_['solver'],max_iter=500)
lr_model.fit(X_train_fix,y_train_fix)
# Construct a function to return the verification result.
def scoree(model,X,y,name=None):
    # Use the predict() method to predict the class labels.
    y_predict=model.predict(X)
    # Use predict_proba() to obtain the probability of class 1, which the AUC calculation ranks.
    y_score=model.predict_proba(X)[:,1]
    if name:
        print(name,':')
    print('accuracy score is:{}'.format(accuracy_score(y,y_predict)))
    print('precision score is:{}'.format(precision_score(y,y_predict)))
    print('recall score is:{}'.format(recall_score(y,y_predict)))
    print('auc:{}'.format(roc_auc_score(y,y_score)))

# Output the performance data of the training set.


scoree(lr_model,X_train_fix,y_train_fix)
# Output the performance data of the test set.
scoree(lr_model,X_test_std,y_test)
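The AUC measures how well a model ranks positive samples above negative ones, so it is normally computed from continuous scores such as predicted probabilities. The sketch below uses five made-up labels and scores to show that passing hard 0/1 predictions instead can give a different, coarser value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.2, 0.6, 0.55, 0.7, 0.9])   # made-up predicted probabilities of class 1
labels = (scores >= 0.5).astype(int)            # hard predictions at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

Of the six positive-negative pairs, five are ranked correctly by the scores. Thresholding turns several of those pairs into ties, which lowers the value.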

3.2.8 Saving the Model


Save the model.

# Import the joblib library


import joblib
# The dump function is used to save models. Pass the trained model and the file name to the dump function
# to save the model.
# The .pkl suffix is a common convention; joblib does not require a particular suffix.
joblib.dump(lr_model,'lr_model.pkl')
# Load the saved model again.
loadmodel=joblib.load('lr_model.pkl')
# After the model is loaded, use the model for prediction directly.
loadmodel.predict(X_test_std)
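A quick way to confirm that a saved model was restored correctly is to compare the predictions made before and after the round trip. The sketch below does this with a tiny made-up model and a temporary directory. The file name toy_model.joblib is an arbitrary choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset and model.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_model.joblib')
    joblib.dump(model, path)        # save the model to disk
    restored = joblib.load(path)    # load it back

same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

A model file should be loaded with the same library versions that saved it, and only from a trusted source, because loading executes pickled code.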

4 Survival Prediction of the Titanic

4.1 Introduction
4.1.1 About This Lab
This experiment predicts whether passengers on the Titanic survived based on the Titanic
datasets.

4.1.2 Objectives
Upon completion of this task, you will be able to:
• Use the publicly available Titanic datasets as the model input data.
• Build, train, and evaluate machine learning models.
• Understand the overall process of building a machine learning model.

4.1.3 Datasets and Frameworks


This experiment is based on train.csv and test.csv. train.csv contains the Survived column, which
records whether each passenger survived. test.csv has no target, that is, no result, and can be used as
a real-world dataset. Involved parameters are as follows:
• PassengerId: passenger ID
• Pclass: cabin class (class 1/2/3)
• Name: passenger name
• Sex: gender
• Age: age
• SibSp: number of siblings and spouses aboard
• Parch: number of parents and children aboard
• Ticket: ticket No.
• Fare: ticket price
• Cabin: cabin No.
• Embarked: port of boarding

4.2 Procedure
4.2.1 Importing Related Libraries
import pandas as pd
import numpy as np

import random as rnd

import seaborn as sns


import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression


from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

4.2.2 Importing Datasets


Step 1 Read data.

train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')
combine = [train_df, test_df]

Step 2 View data.

print(train_df.columns.values)

The first five rows of data are displayed.

train_df.head()

The last five rows of data are displayed.

train_df.tail()

The data overview helps check whether some data is missing and what the data type is.

train_df.info()
test_df.info()

The related numeric-type information of the data helps check the average value and other statistics.

train_df.describe()

The character-type information helps check the number of unique values, the most frequent value,
and its frequency.

train_df.describe(include=['O'])

Step 3 Check the survival probability corresponding to each feature based on statistics.

train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived',


ascending=False)

The intuitive data shows that passengers in class 1 cabins are more likely to survive.

train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived',


ascending=False)

The survival rate varies with the number of siblings and spouses aboard.

train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Grouping by gender reveals an obvious imbalance: the survival rate of female passengers is much
higher than that of male passengers.

g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

As shown in the preceding figure, most young passengers died.

# The size parameter was renamed to height in seaborn 0.9.
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)


grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

The following figure shows the survival probability determined based on the cabin and age.

----End

4.2.3 Preprocessing Data


As the datasets have missing values, combine the datasets, and fill the missing values with data.

Step 1 Combine the datasets.

data=pd.concat([train_df,test_df],ignore_index=True)

Step 2 Check for missing values.

data.isnull().sum()

Step 3 Fill the missing values with data.


Process the datasets by using different methods as required. For example, fill the Fare and Embarked
parameters having few missing values with the mode.

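# Assign the result back instead of calling fillna(inplace=True) on a column, which may not update data in recent pandas versions.
data['Embarked']=data['Embarked'].fillna(data['Embarked'].mode()[0])
data['Fare']=data['Fare'].fillna(data['Fare'].mode()[0])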

Use the average age value.

data['Age']=data['Age'].fillna(data['Age'].mean())

Delete less significant columns. Before doing so, save the Survived column as Target.

Target=data['Survived']
data=data.drop(['Cabin','Name','Ticket','Survived'],axis=1)

Check whether missing values still exist.

data.isnull().sum()

Step 4 Convert data.


Convert some character-type data into numeric-type data for model input. To do so, check the number
of types first.

data['Sex'].value_counts()

Use the replace() function to replace each character-type value with a numeric-type value.

data['Sex']=data['Sex'].replace(['male','female'],[0,1])
data['Embarked']=data['Embarked'].replace(['S','C','Q'],[0,1,2])
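Replacing S, C, and Q with 0, 1, and 2 implies an order between the ports that does not really exist. One common alternative, shown here as a sketch on three made-up rows, is one-hot encoding with pd.get_dummies(), which creates one indicator column per port. The lab itself uses the integer codes above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# A two-valued feature can safely be mapped to 0/1.
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# A feature with three unordered values gets one indicator column per value.
encoded = pd.get_dummies(toy, columns=['Embarked'])
print(encoded)
```

Tree-based models are usually less sensitive to this choice than linear models such as logistic regression.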

test.csv cannot be used for training or evaluation as it does not contain Target. The first 891 rows of
the combined data come from train.csv (with Target) and need to be extracted.

X=data[:891]
y=Target[:891]

----End

4.2.4 Building a Model


This section describes how to build a model. To build a model, split the training set and test set.

Step 1 Split the dataset.

from sklearn.model_selection import train_test_split


train_x,test_x,train_y,test_y=train_test_split(X,y)

Step 2 Train a model.


The logistic regression algorithm, random forest algorithm, and AdaBoost are used for training.

from sklearn.linear_model import LogisticRegression


from sklearn.ensemble import RandomForestClassifier
from sklearn import ensemble
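# Train on the training split and score on the held-out test split.
model=LogisticRegression(max_iter=1000)
model.fit(train_x,train_y)
print('logR',model.score(test_x,test_y))
model=RandomForestClassifier()
model.fit(train_x,train_y)
print('RFC',model.score(test_x,test_y))
model=ensemble.AdaBoostClassifier()
model.fit(train_x,train_y)
print('AdaBoost',model.score(test_x,test_y))

Compare the three accuracy scores. Because they are computed on data that the models did not see
during training, they estimate how well each model generalizes. Scoring on the training data instead
would favor the random forest, which can fit the training set almost perfectly.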
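Accuracy measured on the data used for training can be far more optimistic than accuracy on unseen data, especially for a random forest. The sketch below illustrates the gap on synthetic data from make_classification (an assumption standing in for the Titanic features) and adds 5-fold cross-validation as a more stable estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the prepared Titanic features.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = forest.score(X_tr, y_tr)   # accuracy on data the model has seen
test_acc = forest.score(X_te, y_te)    # accuracy on held-out data

# Mean accuracy over five train/validation splits.
cv_acc = cross_val_score(RandomForestClassifier(random_state=0), X_toy, y_toy, cv=5).mean()
print(train_acc, test_acc, cv_acc)
```

Comparing models by held-out or cross-validated scores gives a fairer ranking than comparing their training scores.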

Step 3 Predict data.

model.predict(data[891:])

----End

5 Linear Regression

5.1 Introduction
5.1.1 About This Lab
This experiment uses the basic Python code and the simplest data to reproduce how a linear
regression algorithm iterates and fits the existing data distribution.
The NumPy and Matplotlib modules are used in the experiment. NumPy is used for calculation, and
Matplotlib is used for drawing.

5.1.2 Objectives
Upon completion of this task, you will be able to:
• Be familiar with basic Python statements.
• Master the procedure for implementing linear regression.

5.2 Procedure
5.2.1 Preparing Data
Define ten sample points that roughly follow a linear relationship.
Convert the data into an array format so that the data can be directly calculated when multiplication
and addition are used.
Code:

# Import the required modules NumPy for calculation and Matplotlib for drawing.
import numpy as np
import matplotlib.pyplot as plt
#This code is used only for Jupyter Notebook.
%matplotlib inline

# Define data and convert the list into an array.


x=[3,21,22,34,54,34,55,67,89,99]
x = np.array(x)
y = [1,10,14,34,44,36,22,67,79,90]
y = np.array(y)

# Display the data through the scatter chart.


plt.scatter (x,y)

Output:

Figure 5-1 Scatter chart

5.2.2 Defining Related Functions


Model function: defines the linear regression model wx+b.
Loss function: calculates the mean square error.
Optimization function: calculates the partial derivatives of the loss with respect to w and b, and
updates w and b by using the gradient descent method.
Code:

# The basic linear regression model is wx+b. In this example, the model is ax+b as a two-dimensional space is used.
def model(a,b,x):
    return a*x+b

# The mean square error loss function is the most commonly used loss function in the linear regression model.
def loss_function(a,b,x,y):
    num=len(x)
    predict=model(a,b,x)
    return (0.5/num)*(np.square(predict-y)).sum()
# The optimization function mainly uses the partial derivatives to update a and b.
def optimize(a,b,x,y):
    num=len(x)
    predict=model(a,b,x)
    da = (1.0/num) * ((predict -y)*x).sum()
    db = (1.0/num) * ((predict -y).sum())
    a = a - Lr*da
    b = b - Lr*db
    return a, b

# Perform function iteration to return a and b.



def iterate(a,b,x,y,times):
for i in range(times):
a,b = optimize(a,b,x,y)
return a,b
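The partial derivatives in optimize() follow from the loss L(a,b) = 1/(2n) * sum((a*x_i + b - y_i)^2), which gives dL/da = 1/n * sum((a*x_i + b - y_i) * x_i) and dL/db = 1/n * sum(a*x_i + b - y_i). A quick way to check such formulas is to compare them with finite differences. The following is a minimal sketch in plain Python; the helper names analytic_grad and numeric_grad and the step size eps are illustrative and not part of the lab.

```python
# Data from section 5.2.1, kept as plain lists so that no third-party package is needed.
x = [3, 21, 22, 34, 54, 34, 55, 67, 89, 99]
y = [1, 10, 14, 34, 44, 36, 22, 67, 79, 90]
n = len(x)

def loss(a, b):
    """Half mean squared error, the same quantity that loss_function() returns."""
    return (0.5 / n) * sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))

def analytic_grad(a, b):
    """Partial derivatives of the loss, as used in optimize()."""
    da = sum((a * xi + b - yi) * xi for xi, yi in zip(x, y)) / n
    db = sum((a * xi + b - yi) for xi, yi in zip(x, y)) / n
    return da, db

def numeric_grad(a, b, eps=1e-6):
    """Central finite differences (hypothetical helper, used only for checking)."""
    da = (loss(a + eps, b) - loss(a - eps, b)) / (2 * eps)
    db = (loss(a, b + eps) - loss(a, b - eps)) / (2 * eps)
    return da, db

print(analytic_grad(0.5, 0.5))
print(numeric_grad(0.5, 0.5))
```

The two printed pairs should agree to several decimal places. A large mismatch would point to a sign or scaling error in the derivative code.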

5.2.3 Starting Iteration


Step 1 Initialize the iterative optimization model.
Code:

# Initialize and display parameters.


a = np.random.rand(1)
print(a)
b = np.random.rand(1)
Lr = 1e-4
a,b = iterate(a,b,x,y,1)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 5-2 First iteration

Step 2 Perform the second iteration and display the parameter values, loss values, and visualization effect.
Code:

a,b = iterate(a,b,x,y,2)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 5-3 Second iteration

Step 3 Perform the third iteration and display the parameter values, loss values, and visualization effect.
Code:

a,b = iterate(a,b,x,y,3)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 5-4 Third iteration

Step 4 Perform the fourth iteration and display the parameter values, loss values, and visualization effect.
Code:

a,b = iterate(a,b,x,y,4)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 5-5 Fourth iteration

Step 5 Perform the fifth iteration and display the parameter values, loss values, and visualization effect.
Code:

a,b = iterate(a,b,x,y,5)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 5-6 Fifth iteration

Step 6 Perform the 10000th iteration and display the parameter values, loss values, and visualization effect.
Code:

a,b = iterate(a,b,x,y,10000)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 5-7 10000th iteration

----End

5.3 Thinking and Practices


5.3.1 Question 1
If the raw data is modified, does the loss value necessarily converge to zero?

5.3.2 Question 2
What is the function of Lr, and what happens when its value is modified?
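For Question 2, Lr is the learning rate: it scales every gradient step. The sketch below reruns the gradient descent of this lab in plain Python with three values of Lr, starting from a = b = 0 instead of a random point so that the runs are repeatable. The function name run and the chosen values are illustrative assumptions.

```python
x = [3, 21, 22, 34, 54, 34, 55, 67, 89, 99]
y = [1, 10, 14, 34, 44, 36, 22, 67, 79, 90]
n = len(x)

def loss(a, b):
    return (0.5 / n) * sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))

def run(lr, steps=100):
    """Run gradient descent from a = b = 0 and return the final loss."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        da = sum((a * xi + b - yi) * xi for xi, yi in zip(x, y)) / n
        db = sum((a * xi + b - yi) for xi, yi in zip(x, y)) / n
        a, b = a - lr * da, b - lr * db
    return loss(a, b)

start = loss(0.0, 0.0)
print("initial loss:", start)
for lr in (1e-6, 1e-4, 1e-3):
    print("Lr =", lr, "final loss:", run(lr))
```

With a very small Lr the loss falls slowly, with Lr = 1e-4 it falls quickly, and with Lr = 1e-3 the updates overshoot and the loss grows without bound. For this particular data the stability limit is roughly 6e-4 (an estimate based on the curvature of the loss, dominated by the mean of x squared).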

6 Flower Category Analysis

6.1 Introduction
6.1.1 About This Lab
This experiment uses a small dataset: the open-source Iris data provided by scikit-learn. The Iris
prediction project is a simple classification task. Working through it shows the basic usage and data
processing methods of the machine learning library sklearn.

6.2 Experiment Code


6.2.1 Importing Related Libraries
import numpy as np
import matplotlib.pyplot as plt

6.2.2 Importing a Dataset


The dataset is the built-in data of sklearn. Therefore, no external dataset needs to be imported.

from sklearn.datasets import load_iris


data=load_iris()
x = data.data
y = data.target

According to the preceding code, x is specified as a feature, and y as a label. The dataset includes a
total of 150 samples and four features: sepal length, sepal width, petal length, and petal width.

6.2.3 Splitting the Dataset


Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split


train_X,test_X,train_y,test_y=train_test_split(x,y)

View the data size after the splitting.


Data size before the splitting:

x.shape

Data size after the splitting:

train_X.shape
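When called without extra arguments, train_test_split holds out 25% of the samples and shuffles them differently on every run. The sketch below makes the split explicit and repeatable. The random_state value and the stratify argument are additions for illustration and are not used in the lab code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
x, y = data.data, data.target

# test_size=0.25 matches the library default.
# random_state fixes the shuffle; stratify=y keeps the class proportions equal in both parts.
train_X, test_X, train_y, test_y = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)

print(x.shape, train_X.shape, test_X.shape)
```

For the 150 Iris samples this yields 112 training samples and 38 test samples, each with four features.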

6.2.4 Performing Modeling


6.2.4.1 Logistic Regression
Import the algorithm model to be used.

from sklearn.linear_model import LogisticRegression


from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier

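Logistic regression is used for modeling first. In scikit-learn, LogisticRegression handles multiclass
problems with either the one-vs-the-rest (OvR) scheme or a multinomial (softmax) formulation, depending
on the library version and solver. It does not use the one-vs-one (OvO) scheme.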

model = LogisticRegression()
model.fit(train_X,train_y)
print('Logistic Regression:',model.score(test_X,test_y))

As the output shows, logistic regression performs well on this dataset.


6.2.4.2 SVM
Use the Support Vector Machine (SVM) for classification. SVC trains its multiclass models with the
one-vs-one (OvO) scheme. By default, only the shape of its decision function follows the
one-vs-the-rest (OvR) convention.

model = svm.SVC()
model.fit(train_X,train_y)
print('SVM:',model.score(test_X,test_y))
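The two multiclass schemes named in this section differ in how many binary problems they build. The sketch below counts them in plain Python; the function names n_ovr and n_ovo are illustrative.

```python
from itertools import combinations

classes = ["setosa", "versicolor", "virginica"]  # the three Iris classes

def n_ovr(k):
    """One-vs-the-rest: one binary problem per class."""
    return k

def n_ovo(k):
    """One-vs-one: one binary problem per unordered pair of classes."""
    return k * (k - 1) // 2

# The OvO problems for Iris, listed explicitly.
pairs = list(combinations(classes, 2))
print(pairs)
print(n_ovr(3), n_ovo(3))    # both schemes need three binary problems for three classes
print(n_ovr(10), n_ovo(10))  # the gap widens as the number of classes grows
```

With only three classes the two schemes need the same number of binary classifiers, so the choice matters little for Iris. With ten classes OvO already needs 45 classifiers against 10 for OvR.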

6.2.4.3 Decision Tree


Use the decision tree algorithm.

model=DecisionTreeClassifier()
model.fit(train_X,train_y)
print('Decision Tree',model.score(test_X,test_y))

6.2.4.4 K-Nearest Neighbors Algorithm


Use the k-nearest neighbors algorithm.

model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_X,train_y)
print('KNN:',model.score(test_X,test_y))

Three neighbors are set for the k-nearest neighbors algorithm. Another number of neighbors may give
better accuracy.
Therefore, a loop over candidate values is used to find the optimal number of neighbors.

t=[]
for i in range(1,11):
    model=KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X,train_y)
    print('neighbor:{},acc:{}'.format(i,model.score(test_X,test_y)))
    t.append(model.score(test_X,test_y))
plt.plot([i for i in range(1,11)],t)

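In the run shown in the figure above, the k-nearest neighbors algorithm has the optimal effect when
there is one nearest neighbor. Because train_test_split shuffles the data randomly, the best value may
differ in your run.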
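Choosing the number of neighbors by its score on the test set lets the test set influence the model, so the reported accuracy may be optimistic. A common alternative is to choose k by cross-validation on the training set only and to use the test set once at the end. The following sketch assumes the same Iris data; the names cv_scores, best_k, and final, the five folds, and the random_state value are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

# Mean 5-fold cross-validation accuracy on the training set for each candidate k.
cv_scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, train_X, train_y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# Fit once with the chosen k and evaluate on the untouched test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(train_X, train_y)
print(best_k, final.score(test_X, test_y))
```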

6.2.5 Effect After Data Preprocessing


Consider data standardization before modeling.

from sklearn.preprocessing import StandardScaler


std=StandardScaler()
train_X_std=std.fit_transform(train_X)
# Reuse the mean and standard deviation learned from the training set; do not refit on the test set.
test_X_std=std.transform(test_X)
print('after',train_X_std.std(axis=0),train_X_std.mean(axis=0))

After standardization, the standard deviation of each feature is 1, and the mean value is approximately 0.

Then, use the SVM to perform modeling after the standardization. Pass the standardized training set and
test set in place of the original ones.

model = svm.SVC()
model.fit(train_X_std,train_y)
print('SVM:',model.score(test_X_std,test_y))

As the output shows, the SVM accuracy also improves after the standardization.
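A scaler is normally fitted on the training data only, and the test data is then transformed with those same statistics. The plain-Python sketch below shows the arithmetic for a single feature. The numbers are made up for illustration, and pstdev is used because StandardScaler divides by the population standard deviation.

```python
from statistics import mean, pstdev

train = [4.9, 5.8, 6.3, 7.1, 5.0]  # toy values for one feature (made up)
test = [5.5, 6.9]

# "Fit": learn the statistics from the training set only.
mu = mean(train)
sigma = pstdev(train)

# "Transform": apply the same statistics to both sets.
train_std = [(v - mu) / sigma for v in train]
test_std = [(v - mu) / sigma for v in test]

# For comparison: refitting on the test set hides its real offset from the training data.
test_refit = [(v - mean(test)) / pstdev(test) for v in test]

print(train_std)
print(test_std)
print(test_refit)
```

Refitting on a two-value test set maps it to -1 and 1 whatever the original values were, which shows why the training statistics should be reused.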

7 Emotion Recognition of Customer Evaluations in the Retail Industry

7.1 Introduction
Emotion analysis is a classification technology based on natural language processing (NLP) that is
used to extract the emotional content of texts. Compared with recommendations and precision
marketing, users prefer to view or listen to the personal experience and feedback of users of the same
type. For example, evaluations from users who have purchased similar products and comparison results
from users who have used similar products are valuable to both users and enterprises. This experiment
works through the problem step by step from the perspectives of problem statement, breakdown,
priority ranking, solution design, key point analysis, and summary and suggestions. It aims to cultivate
project implementation thinking and to implement an evaluation emotion analysis project from
scratch.

7.1.1 Objectives
Upon completion of this task, you will be able to:
 Clarify the function and business value of emotion analysis.
 Understand the differences between conventional machine learning and deep learning in
emotion analysis methods.
 Clarify label extraction methods for emotion analysis.
 Master deep learning-based emotion analysis methods.
 Understand future applications of emotion analysis.

7.1.2 Case Background


The case in this document is for reference only. The actual procedure may vary. For details, see the
corresponding product documents. Data engineer A works in the market data analysis department of
a high-tech company. The company plans to develop home appliance services, such as smart TVs and
smart readers, but it does not know the current state of the market or how users evaluate such products.
Therefore, the company wants the data department to deliver a market data survey report as soon
as possible. Engineer A considers using NLP technology to analyze users' evaluation tendencies and
evaluation keywords for competitors' products of the same type, and to build an emotion prediction
model that predicts users' emotion tendency based on texts.

7.2 Procedure
7.2.1 Data Management
The following information is involved:
 Id: ID
 reviews.rating: score
 reviews.text: text evaluation
 reviews.title: evaluation keywords
 reviews.username: name of the evaluator
This dataset contains 21 attribute fields and 34,657 data samples. The experiment aims to analyze
customer evaluation data. Therefore, this document describes only the data attributes required in this
experiment.

7.2.2 Data Reading


After obtaining the provided data files, you need to read and view the data over Python by performing
the following steps:

Step 1 Import common library files such as sklearn, pandas, and numpy.
sklearn is a powerful third-party machine learning library for Python. It covers every stage from data
preprocessing to model training. Most classes in the sklearn library are either estimators or
transformers. An estimator is equivalent to a modeling tool, and is used to predict data.
Common estimator methods include fit(x,y) and predict(x). A transformer is used to process data,
such as reducing dimensions and standardizing data. Common transformer methods include
transform(x) and fit_transform(x,y).

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import nltk.classify.util
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
%matplotlib inline

Step 2 Read data from a local disk.

temp = pd.read_csv(r"C:/Users/swx592904.CHINA/Desktop/1XXX /1429_1.csv",encoding='utf-8',engine='python')

Step 3 Visualize the data.


Visualize the first five rows of data and view the data attribute columns.

temp.head()

Output:

This experiment uses only the reviews.rating, reviews.text, reviews.username, and reviews.title
attribute columns. Therefore, extract these four columns from the dataset into a new DataFrame named
permanent to facilitate the subsequent experiment procedure.

permanent = temp[['reviews.rating' , 'reviews.text' , 'reviews.title' , 'reviews.username']]

View the missing values.

print(permanent.isnull().sum())
permanent.head()

Output:

The reviews.rating attribute column is indispensable to emotion analysis. The dataset contains 34,657
data samples, which is a large volume. Therefore, you can set aside the data samples with the
reviews.rating value missing. Specifically, extract the data without the reviews.rating value
and name it check, and extract the data with the reviews.rating value and name it senti.

# .copy() makes each subset an independent DataFrame, so columns can be added later without a SettingWithCopyWarning.
check = permanent[permanent["reviews.rating"].isnull()].copy()
senti= permanent[permanent["reviews.rating"].notnull()].copy()

With respect to score processing, this experiment defines data samples with the reviews.rating value
greater than or equal to 4 as positive (pos) and those with the reviews.rating value less than 4 as negative
(neg), and stores the result in a new column named senti.
replace(x,y): replaces x with y.

senti["senti"] = senti["reviews.rating"]>=4
senti["senti"] = senti["senti"].replace([True , False] , ["pos" , "neg"])

Visualize the data after identifying the samples as positive or negative.

senti["senti"].value_counts().plot.bar()

Output:

The output shows that the data is unbalanced.

7.2.3 Data Processing


A regular expression is used to check whether a string matches a pattern. The re module has been part of
Python since version 1.5 and provides full regular expression support:
 re.sub(): replaces the matches of a pattern in a string.
re.sub(pattern, repl, string, count=0, flags=0)
 pattern: pattern string in the regular expression.
 repl: replacement string, which can also be a function.
 string: original character string to be searched and replaced.
 count: maximum number of replacements after pattern matching. The default value is 0,
indicating that all matches are replaced.
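The parameters of re.sub() described above can be tried on a short string. In this sketch the sample sentence and the variable names are made up for illustration.

```python
import re

text = "Great TV!! Works 100% fine."

# count=0 (the default): every run of non-letter characters is replaced.
all_replaced = re.sub(r"[^A-Za-z]+", "_", text)

# count=1: only the first match is replaced.
first_only = re.sub(r"[^A-Za-z]+", "_", text, count=1)

# repl can be a function that receives the match object and returns the replacement.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(all_replaced)
print(first_only)
print(doubled)
```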
Data sampling uses pandas.DataFrame.sample to randomly select several rows of data.

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

 n: number of rows to be extracted.


 frac: proportion of rows to be extracted. If frac is set to 0.8, 80% of the rows need to be
extracted.
 replace: indicates whether extraction is performed with replacement. The value True indicates
that extraction is performed with replacement.
 random_state: seed of the random number generator. If random_state is set to None, the
selection differs from run to run. An integer value makes the selection reproducible.
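The sample() parameters above can be combined with drop() to form a reproducible split. The following sketch uses a small made-up DataFrame; it assumes that the index values are unique, because drop() removes rows by index label.

```python
import pandas as pd

# Ten made-up rows standing in for the review data.
split = pd.DataFrame({
    "Summary_Clean": ["text %d" % i for i in range(10)],
    "senti": ["pos"] * 8 + ["neg"] * 2,
})

# frac=0.8 selects 80% of the rows; random_state makes the selection repeatable.
train = split.sample(frac=0.8, random_state=200)

# The rows that were not sampled form the test set.
test = split.drop(train.index)

print(len(train), len(test))
```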

Step 1 Import related packages.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk

Step 2 Preprocess data.


Text data includes spaces, punctuation marks, and digits. This experiment focuses on the analysis of
English text. Therefore, you need to delete the information other than letters. Define the cleanup()
function, which uses the lower() function to convert uppercase letters into lowercase ones, uses a
regular expression to replace each run of non-letter characters (including '\n', '\r', '\t', and ' ') with
a single space, and strips leading and trailing spaces. After apply() is used, the cleaned reviews.text
attribute is saved as the Summary_Clean column.

cleanup_re = re.compile('[^a-z]+')
def cleanup(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    return sentence
senti["Summary_Clean"] = senti["reviews.text"].apply(cleanup)
check["Summary_Clean"] = check["reviews.text"].apply(cleanup)
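It can help to call cleanup() on a couple of values before applying it to a whole column. The sketch below repeats the function so that it runs on its own; the sample inputs are made up. Note that because the function calls str() first, a missing review (NaN) turns into the literal word "nan", so you may want to drop or fill missing texts beforehand.

```python
import re

cleanup_re = re.compile('[^a-z]+')

def cleanup(sentence):
    # Same logic as in the lab: stringify, lowercase, collapse non-letters to one space, trim.
    sentence = str(sentence).lower()
    return cleanup_re.sub(' ', sentence).strip()

cleaned = cleanup("Great TV!!\nWorks 100% fine.\t")
missing = cleanup(float("nan"))

print(repr(cleaned))
print(repr(missing))
```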

Step 3 Generate a training set and a test set.


Obtain ["Summary_Clean","senti"] from the senti dataset and save it as the split dataset.

split = senti[["Summary_Clean" , "senti"]]

Output:

Use 80% of data in split as the training set through split.sample(), remove the data that has been used
in the training set train from split through drop(), and use the remaining data as the test set test.

train=split.sample(frac=0.8,random_state=200)
test=split.drop(train.index)

Output:

7.2.4 Model Training


7.2.4.1 Model Selection
Emotion analysis of customer evaluations is essentially a classification problem, which can be solved
by using a classification model. Practice has proved that a Naive Bayes model based on all words
performs well in solving some problems, while a model using a word subset performs well in solving
other problems. Logistic regression (LR), multinomial NB, and Bernoulli NB are selected based on
comprehensive consideration.
7.2.4.2 Model Calculation and Evaluation
The Naive Bayes classifier usually uses three models: Gaussian model, multinomial model, and Bernoulli
model. The three models respectively correspond to functions GaussianNB(), MultinomialNB(), and
BernoulliNB() in sklearn.

from sklearn.naive_bayes import GaussianNB


clf = GaussianNB()
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB ()

 GaussianNB() is used when a feature is a continuous variable.


 MultinomialNB() is used when a feature is a discrete variable.
 BernoulliNB() is used when a feature is a discrete variable and the feature can be set only to 1
or 0.
The model calculation phase consists of the following steps:

Step 1 Import libraries.



from wordcloud import STOPWORDS


from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Step 2 Build and evaluate a model.


Convert the data in the training set, test set, and verification set into a list and create indexes.

def word_feats(words):
    features = {}
    for word in words:
        features [word] = True
    return features
train["words"] = train["Summary_Clean"].str.lower().str.split()
test["words"] = test["Summary_Clean"].str.lower().str.split()
check["words"] = check["Summary_Clean"].str.lower().str.split()
train.index = range(train.shape[0])
test.index = range(test.shape[0])
check.index = range(check.shape[0])
prediction = {}

Set all words in train["words"] to True and add neg or pos to the end of a sentence based on the
scoring criteria.

train_naive = []
test_naive = []
check_naive = []
for i in range(train.shape[0]):
    train_naive = train_naive +[[word_feats(train["words"][i]) , train["senti"][i]]]
for i in range(test.shape[0]):
    test_naive = test_naive +[[word_feats(test["words"][i]) , test["senti"][i]]]
for i in range(check.shape[0]):
    check_naive = check_naive +[word_feats(check["words"][i])]
classifier = NaiveBayesClassifier.train(train_naive)
print("NLTK Naive bayes Accuracy : {}".format(nltk.classify.util.accuracy(classifier , test_naive)))
classifier.show_most_informative_features(5)

Use a trained classifier to attach emotion labels to the test set and verification set to predict whether
words in the test set and verification set are positive or negative.

y =[]
only_words= [test_naive[i][0] for i in range(test.shape[0])]
for i in range(test.shape[0]):
    y = y + [classifier.classify(only_words[i] )]
prediction["Naive"]= np.asarray(y)

Output:

y1 = []
for i in range(check.shape[0]):
    y1 = y1 + [classifier.classify(check_naive[i] )]
check["Naive"] = y1

Output:

The original dataset check does not contain reviews.rating data. As shown in the preceding figure,
the classifier trained on the training set predicts whether each review is negative or positive.

from sklearn.naive_bayes import MultinomialNB


stopwords = set(STOPWORDS)
stopwords.remove("not")

Use the CountVectorizer class to perform vectorization, then invoke the TfidfTransformer class to
construct the term frequency-inverse document frequency (TF-IDF) vector, which measures the
importance of words. The resulting training set, test set, and verification set are X_train_tfidf,
X_test_tfidf, and checktfidf, respectively.
The main idea of TF-IDF is as follows: If a word or phrase has a high term frequency (TF) in one article
but appears rarely in other articles, the word or phrase is considered to have a good class
distinguishing capability. TF-IDF tends to filter out commonly used words and retain important words.
The CountVectorizer class converts words in the text into a TF matrix, and uses the fit_transform()
function to count the occurrences of each word. In general, you can use
CountVectorizer to extract features and then use TfidfTransformer to calculate the weight of each
feature.
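The weighting idea can be reproduced by hand on a tiny corpus. The sketch below is written in plain Python and uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each document vector to unit length, which to my understanding matches the default settings of TfidfTransformer. The three example sentences and the function names idf and tfidf are made up for illustration.

```python
import math

docs = [
    "good product good price",
    "bad product",
    "good screen",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in tokenized)  # number of documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    """Raw term count times idf, then scaled to unit Euclidean length."""
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in tokenized:
    print(tfidf(doc))
```

Words that occur in fewer documents, such as "bad" or "screen" here, receive a larger idf than words that occur in several documents, such as "good" or "product".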

# Recent scikit-learn versions require stop_words to be a list rather than a set.
count_vect = CountVectorizer(min_df=2 ,stop_words=list(stopwords) , ngram_range=(1,2))
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(train["Summary_Clean"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_new_counts = count_vect.transform(test["Summary_Clean"])
X_test_tfidf = tfidf_transformer.transform(X_new_counts)
checkcounts = count_vect.transform(check["Summary_Clean"])
checktfidf = tfidf_transformer.transform(checkcounts)

Retain important words based on tfidf_transformer to construct the MultinomialNB model.

model1 = MultinomialNB().fit(X_train_tfidf , train["senti"])


prediction['Multinomial'] = model1.predict_proba(X_test_tfidf)[:,1]
print("Multinomial Accuracy : {}".format(model1.score(X_test_tfidf , test["senti"])))
check["multi"] = model1.predict(checktfidf)

Output:

Retain important words based on tfidf_transformer to construct the BernoulliNB model.

from sklearn.naive_bayes import BernoulliNB


model2 = BernoulliNB().fit(X_train_tfidf,train["senti"])
prediction['Bernoulli'] = model2.predict_proba(X_test_tfidf)[:,1]
print("Bernoulli Accuracy : {}".format(model2.score(X_test_tfidf , test["senti"])))
check["Bill"] = model2.predict(checktfidf)

Output:

Retain important words based on tfidf_transformer to construct the LR model.

from sklearn import linear_model


logreg = linear_model.LogisticRegression(solver='lbfgs' , C=1000)
logistic = logreg.fit(X_train_tfidf, train["senti"])
prediction['LogisticRegression'] = logreg.predict_proba(X_test_tfidf)[:,1]
print("Logistic Regression Accuracy : {}".format(logreg.score(X_test_tfidf , test["senti"])))
check["log"] = logreg.predict(checktfidf)

Output:

In comparison, the LR model has higher accuracy than the other two models.

Step 3 Verify the model.


Select the LR model for verification.

# get_feature_names() was removed in scikit-learn 1.2. On versions earlier than 1.0, call get_feature_names() instead.
words = count_vect.get_feature_names_out()
feature_coefs = pd.DataFrame( data = list(zip(words, logistic.coef_[0])), columns = ['feature', 'coef'])
feature_coefs.sort_values(by="coef")
def format(x):
    if x == 'neg':
        return 0
    if x == 0:
        return 0
    return 1
vfunc = np.vectorize(format)
test.senti = test.senti.replace(["pos" , "neg"] , [True , False] )
def test_sample(model, sample):
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = model.predict(sample_tfidf)[0]
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f" % (result.upper(), prob[0], prob[1]))
test_sample(logreg, "The product was good and easy to use")
test_sample(logreg, "the whole experience was horrible and product is worst")
test_sample(logreg, "product is not good")

Output:

The classifier accurately provides the positive probability and negative probability of each sentence.

Step 4 Build a word cloud.

from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)
mpl.rcParams['font.size']=12 #10
mpl.rcParams['savefig.dpi']=100 #72
mpl.rcParams['figure.subplot.bottom']=.1
def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=300,
        max_font_size=40,
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(senti["Summary_Clean"])

Output:

----End

8 Boston Housing Price Forecast

8.1 Introduction
8.1.1 About This Lab
This experiment uses a small dataset: the open-source Boston housing price data provided by
scikit-learn. The Boston housing price forecast project is a simple regression task. Working through
it shows the basic usage and data processing methods of the machine learning library sklearn.

8.1.2 Objectives
Upon completion of this task, you will be able to:
 Use the publicly available Boston housing price dataset as the model input data.
 Build, train, and evaluate machine learning models.
 Understand the overall process of building a machine learning model.
 Master the application of machine learning model training, grid search, and evaluation
indicators.
 Master the application of related APIs.

8.1.3 Experiment Dataset and Framework


This experiment is based on the Boston housing price dataset, which contains 506 samples with 13
features. Each data record contains detailed information about the house and its surroundings. To be
specific, the dataset includes the following features:
 CRIM: per capita crime rate by town
 ZN: proportion of residential land zoned for lots over 25,000 sq.ft
 INDUS: proportion of non-retail business acres per town
 CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
 NOX: nitric oxides concentration (parts per 10 million)
 RM: average number of rooms per dwelling
 AGE: proportion of owner-occupied units built prior to 1940
 DIS: weighted distances to five Boston employment centers
 RAD: index of accessibility to radial highways
 TAX: full-value property-tax rate per $10,000
 PTRATIO: pupil-teacher ratio by town
 B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
 LSTAT: % lower status of the population
The target is the median value of owner-occupied homes, in units of $1000.
The sklearn framework provides the Boston housing price data, functions such as dataset splitting,
standardization, and evaluation, and implementations of various common machine learning algorithms.
In addition, XGBoost, an optimized implementation of the gradient boosted decision tree (GBDT), is
used as an ensemble algorithm.

8.2 Procedure
8.2.1 Introducing the Dependency
Code:

#Prevent unnecessary warnings.


import warnings
warnings.filterwarnings("ignore")

#Introduce the basic package of data science.


import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import seaborn as sns
##Set attributes to prevent garbled characters in Chinese.
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

#Introduce machine learning, preprocessing, model selection, and evaluation indicators.


from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split


from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

#Import the Boston dataset used this time. Note: load_boston was removed in scikit-learn 1.2, so this lab requires an earlier version.


from sklearn.datasets import load_boston

#Introduce algorithms.
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression, ElasticNet
#SVR is the regression counterpart of the SVC classifier.
from sklearn.svm import SVR
#Ensemble algorithms.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data
Step 1 Load the Boston housing price dataset and display related attributes.
Code:

#Load the Boston house price dataset.


boston = load_boston()

#x features, and y labels.


x = boston.data
y = boston.target

#Display related attributes.


print('Feature column name')
print(boston.feature_names)
print("Sample data volume: %d, number of features: %d"% x.shape)
print("Target sample data volume: %d"% y.shape[0])

Output:

Feature column names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'], sample
quantity: 506, feature quantity: 13, target sample quantity: 506

Step 2 Convert the data into the data frame format.


Code:

x = pd.DataFrame(boston.data, columns=boston.feature_names)
x.head()

Output:

Figure 8-1 Information about the first five samples

Step 3 Visualize the label distribution.


Code:

sns.distplot(tuple(y), kde=False, fit=st.norm)

Output:

Figure 8-2 Target data distribution

----End

8.2.3 Splitting and Preprocessing the Dataset


Code:

#Segment the data.


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=28)
#Standardize the dataset.
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
x_train[0:100]

Output:

array([[-0.35451414, -0.49503678, -0.15692398, ..., -0.01188637, 0.42050162, -0.29153411],
[-0.38886418, -0.49503678, -0.02431196, ..., 0.35398749, 0.37314392, -0.97290358],
[0.50315442, -0.49503678, 1.03804143, ..., 0.81132983, 0.4391143, 1.18523567],
...,
[-0.34444751, -0.49503678, -0.15692398, ..., -0.01188637, 0.4391143, -1.11086682],
[-0.39513036, 2.80452783, -0.87827504, ..., 0.35398749, 0.4391143, -1.28120919],
[-0.38081287, 0.41234349, -0.74566303, ..., 0.30825326, 0.19472652, -0.40978832]])

8.2.4 Performing Modeling on the Dataset by Using Various Regression Models
Code:

#Set the model name.


names = ['LinearRegression',
         'Ridge',
         'Lasso',
         'Random Forest',
         'GBDT',
         'ElasticNet',
         'XgBoost']

#Define the model.

# cv sets the number of cross-validation folds.
models = [LinearRegression(),
          RidgeCV(alphas=(0.001,0.1,1),cv=3),
          LassoCV(alphas=(0.001,0.1,1),cv=5),
          RandomForestRegressor(n_estimators=10),
          GradientBoostingRegressor(n_estimators=30),
          ElasticNet(alpha=0.001,max_iter=10000),
          XGBRegressor()]
# Output the R2 scores of all regression models.

#Define the R2 scoring function.
def R2(model,x_train, x_test, y_train, y_test):
    model_fitted = model.fit(x_train,y_train)
    y_pred = model_fitted.predict(x_test)
    score = r2_score(y_test, y_pred)
    return score

#Traverse all models to score.
for name,model in zip(names,models):
    score = R2(model,x_train, x_test, y_train, y_test)
    # score is a single number per model, so it is printed directly.
    print("{}: {:.6f}".format(name,score))

Output:
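Each score reported for the models above is the coefficient of determination, R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The plain-Python sketch below computes it by hand; the function name r2 and the sample values are illustrative only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [24.0, 21.6, 34.7, 33.4, 36.2]  # toy target values, in units of $1000
mean_y = sum(y_true) / len(y_true)

print(r2(y_true, y_true))                       # perfect predictions
print(r2(y_true, [mean_y] * len(y_true)))       # always predicting the mean
print(r2(y_true, [0.0] * len(y_true)))          # worse than predicting the mean
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0.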

8.2.5 Adjusting Grid Search Hyperparameters


Step 1 Build a model.
Code:

'''
'kernel': kernel function
'C': SVR regularization factor
'gamma': 'rbf', 'poly' and 'sigmoid' kernel function coefficient, which affects the model performance
'''
parameters = {
'kernel': ['linear', 'rbf'],
'C': [0.1, 0.5,0.9,1,5],
'gamma': [0.001,0.01,0.1,1]
}

#Use grid search and perform cross validation.


model = GridSearchCV(SVR(), param_grid=parameters, cv=3)
model.fit(x_train, y_train)

Output:

Step 2 Obtain the optimal parameters.


Code:

print("Optimal parameter list:", model.best_params_)


print("Optimal model:", model.best_estimator_)
print("Optimal R2 value:", model.best_score_)

Output:

Optimal parameter list: {'C': 5, 'gamma': 0.1, 'kernel': 'rbf'}


Optimal model: SVR(C=5, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
Optimal R2 value: 0.797481706635164
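The grid search in Step 1 tries every combination of the listed values and cross-validates each one. The sketch below enumerates the same grid with the standard library to show how much work that is. The variable names candidates, n_fits, and distinct_linear are illustrative.

```python
from itertools import product

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 0.5, 0.9, 1, 5],
    'gamma': [0.001, 0.01, 0.1, 1],
}

# Every combination of the listed values is one candidate model.
keys = list(parameters)
candidates = [dict(zip(keys, values))
              for values in product(*(parameters[k] for k in keys))]

cv_folds = 3
n_fits = len(candidates) * cv_folds  # fits performed during the search itself

# The linear kernel ignores gamma, so only the C values distinguish its candidates.
distinct_linear = {c['C'] for c in candidates if c['kernel'] == 'linear'}

print(len(candidates), n_fits, len(distinct_linear))
```

The grid holds 40 candidates, which means 120 fits with three folds, plus one final refit of the best candidate on the whole training set. Twenty of the candidates use the linear kernel but only five of them are really different. GridSearchCV also accepts a list of dictionaries for param_grid, so the linear kernel could be given its own grid without gamma.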

Step 3 Visualize the output.


Code:

##Perform visualization.
ln_x_test = range(len(x_test))
y_predict = model.predict(x_test)

#Set the canvas.


plt.figure(figsize=(16,8), facecolor='w')
#Draw with a red solid line.
plt.plot (ln_x_test, y_test, 'r-', lw=2, label=u'Value')
#Draw with a green solid line.
plt.plot (ln_x_test, y_predict, 'g-', lw = 3, label=u'Estimated value of the SVR algorithm, $R^2$=%.3f' %
(model.best_score_))
#Display in a diagram.
plt.legend(loc ='upper left')
plt.grid(True)
plt.title(u"Boston Housing Price Forecast (SVM)")
plt.xlim(0, 101)
plt.show()

Output:

Figure 8-3 Visualized result

----End

9 E-commerce Website User Group Analysis

9.1 Introduction
9.1.1 About This Lab
This experiment performs modeling based on the k-means algorithm by using the virtual dataset
automatically generated by sklearn to obtain user categories. It is a clustering experiment that shows
how to select the optimal k value and how to observe the effect in a visualized manner.

9.2 Experiment Code


9.2.1 Using sklearn for Modeling
Step 1 Import libraries.

import numpy as np
import matplotlib.pyplot as plt

Step 2 Create a dataset.


Create virtual data for the algorithm model.

from sklearn.datasets import make_blobs


X, y = make_blobs(n_samples=2000,centers=2,n_features=2)

The built-in make_blobs tool of sklearn creates the virtual data by drawing points from a normal
(Gaussian) distribution around each center. Parameter settings are as follows:
 n_samples: set to 2000, indicating that 2000 sample points are set.
 centers: set to 2, indicating that the data actually has two centers.
 n_features: set to 2, indicating the number of features.
For ease of illustration in the coordinate system, only two features are used.
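To see how these parameters shape the generated data, the returned arrays can be inspected directly. The following is a minimal sketch; random_state=0 is an assumption added only to make the sketch repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs

# random_state is fixed here only so that the sketch is repeatable (assumption).
X, y = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=0)

print(X.shape)       # one row per sample, one column per feature
print(y.shape)       # one label per sample, naming the center it was drawn from
print(np.unique(y))  # the distinct center labels
```

X holds the coordinates that are passed to k-means. y holds the true center of each sample, which the clustering algorithm never sees.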

Step 3 Apply the k-means algorithm.

from sklearn.cluster import KMeans


y_pred = KMeans(n_clusters=5).fit_predict(X)

n_clusters=5: indicates that five clusters are requested. However, the data actually has only two
categories, so k-means is forced to split the true groups into smaller clusters.
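The effect of requesting more clusters than the data contains can be checked by counting the samples assigned to each cluster. The following is a minimal sketch; the fixed random_state values and n_init=10 are assumptions added for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two true centers; the seed is an assumption used for repeatability.
X, y = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)

cluster_sizes = {}
for k in (2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Count how many samples fall into each of the k clusters.
    cluster_sizes[k] = np.bincount(labels, minlength=k)
    print(k, cluster_sizes[k])
```

k-means always returns as many clusters as requested, so with k=5 the 2000 samples are divided among five clusters even though only two groups were generated.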

Step 4 Visualize the output.



import matplotlib.pyplot as plt


plt.figure(figsize=(10,10))
plt.scatter(X[:, 0], X[:, 1])

plt.figure(figsize=(10,10))
plt.scatter(X[:, 0], X[:, 1],c=y_pred)

Output:

Different data is generated each time. Therefore, the output diagram may be different from that in
the lab. To generate the same data, add the random_state parameter during data generation.

X, y = make_blobs(n_samples=2000,centers=2,n_features=2,random_state=3)

In this example, random_state is set to 3. With a fixed random_state, the same data is generated on
every run.
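The reproducibility can be verified by generating the data several times and comparing the arrays. The following is a minimal sketch; the second seed value (4) is an arbitrary choice for comparison.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two calls with the same seed, and one call with a different seed.
X1, y1 = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)
X2, y2 = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)
X3, y3 = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=4)

same_seed_equal = np.array_equal(X1, X2)       # identical seed
different_seed_equal = np.array_equal(X1, X3)  # different seed

print(same_seed_equal, different_seed_equal)
```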

Step 5 Use more features for comparison.

X, y = make_blobs(n_samples=2000,centers=3,n_features=10,random_state=30)

In this example, ten features are used to generate data, random_state is set to 30, and there are three
categories in theory.

y_pred = KMeans(n_clusters=5).fit_predict(X)
plt.figure(figsize=(10,10))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])

plt.subplot(122)
plt.scatter(X[:, 0], X[:, 1],c=y_pred)

----End

9.2.2 Selecting the Optimal k Value


In the preceding steps, the k value is manually set. In actual environments, the number of centers is
unknown. Therefore, you need to find the optimal k value.

import random
centers=random.randint(1,30)
n_features=random.randint(1,30)
X, y = make_blobs(n_samples=2000,centers=centers,n_features=n_features)

First, generate two random integers between 1 and 30: one for the number of centers (so that the
true number of centers in the data is unknown) and one for the number of features.

temp=[]
for i in range(1,50):
    model=KMeans(n_clusters=i)
    model.fit(X)
    temp.append(model.inertia_)

Then, perform k-means clustering in a loop, once for each candidate k value from 1 to 49. The inertia_
attribute returns the sum of squared distances from each sample to its nearest cluster center.
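To make the meaning of inertia_ concrete, the value can be recomputed by hand as the sum of squared Euclidean distances between each sample and the center of the cluster it is assigned to. The following is a minimal sketch; the dataset size, the seeds, and n_init=10 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small illustrative dataset (assumption, not the data used in the lab).
X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center of the cluster that each sample was assigned to.
assigned_centers = model.cluster_centers_[model.labels_]
# Sum the squared Euclidean distances between samples and their centers.
manual_inertia = np.sum((X - assigned_centers) ** 2)

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point rounding. A smaller inertia means that the samples lie closer to their cluster centers.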

plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 50), temp, 'o')
plt.plot(np.arange(1, 50), temp, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

Visualize the result by using a visualization tool.

The result varies each time because of the random numbers. In the figure for this run, the turning
point (elbow), after which the inertia decreases only slowly, appears at 21 clusters. Therefore, 21 is
the optimal k value for this run.
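Reading the turning point from a figure can be subjective. As a complementary check that is not part of the original lab, the silhouette score can be computed for each candidate k value, and the k value with the highest score can be selected. The following is a minimal sketch; the four well-separated centers, the sample count, and the seeds are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical, well-separated centers chosen for illustration only.
true_centers = np.array([[-10, -10], [0, 0], [10, 10], [10, -10]])
X, _ = make_blobs(n_samples=400, centers=true_centers,
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 10):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all samples, in the range [-1, 1].
    scores[k] = silhouette_score(X, labels)

# Select the candidate with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette score compares how close each sample is to its own cluster with how close it is to the nearest other cluster, so higher values indicate better-separated clusters. On overlapping data the score and the elbow may suggest different k values, so it is reasonable to inspect both.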
