Machine Learning
Lab Guide
Teacher Version
1.1 Introduction
1.1.1 About This Lab
Feature engineering is the process of extracting features from raw data. Data and features determine
the upper limit of machine learning, while models and algorithms only help to approach this upper
limit. Feature engineering and feature construction aim to make the extracted features represent the
essential characteristics of the data as well as possible, so that a model built on these features
predicts well on unknown datasets.
1.1.2 Objectives
Upon completion of this task, you will be able to:
Master the Python-based feature selection method.
Master the Python-based feature extraction method.
Master the Python-based feature construction method.
1.2.2 Procedure
1.2.2.1 Importing Data
Code:
import pandas as pd
df=pd.read_csv('./credit.csv',index_col=0)
df.head()
Output:
As shown in the figure above, the Nation, Marriage_State, Highest Education, House_State, Industry,
Title, and Duty fields contain a large number of missing values. In Pandas, isnull() can determine the
missing values in data, and isnull().sum() can count the number of missing values and further check
the rates of the missing values in the fields.
Code:
df_missing = pd.DataFrame(df.isnull().sum()/df.shape[0],columns=['missing_rate']).reset_index()
df_missing.sort_values(by='missing_rate',ascending=False)[:15]
Output:
Use a for loop to fill the missing values in these fields, as shown in the sketch below. After the
processing is complete, check the missing rate of each field again.
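The loop itself is not shown in the guide; a minimal sketch, assuming each field's missing values are filled with its most frequent value:
cols_missing = ['Nation','Marriage_State','Highest Education','House_State','Industry','Title','Duty']
for col_ in cols_missing:
    # Fill missing values with the field's mode (an assumed strategy).
    df[col_] = df[col_].fillna(df[col_].mode()[0])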
df_missing_2 = pd.DataFrame(df.isnull().sum()/df.shape[0],columns=['missing_rate']).reset_index()
df_missing_2.sort_values(by='missing_rate',ascending=False)[:15]
1.3.3 Filter
Step 1 Analyze the crosstab.
Apply the crosstab() method to draw a crosstab by using the variable House_State and the target
variable Target as an example.
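The crosstab call itself is not shown in the guide; a minimal sketch (the normalize argument is an assumption):
pd.crosstab(df['House_State'], df['Target'], normalize='index')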
In the output, the default rate is 0.019 when House_State is set to 1 and 0.045 when House_State is
set to 2. If these default rates were considered the same, the variable House_State would not affect
the default prediction.
The crosstab analysis can only be used for preliminary judgment and analysis. The chi-square test is
further needed to determine whether the numerical difference has statistical significance.
X = df.drop('Target',axis=1)
y = df['Target']
X_category = df[['Nation','Birth_Place','Gender','Marriage_State','Highest Education','House_State','Work_Years','Title','Duty','Industry']]
Import the chi-square test package chi2 of sklearn.feature_selection and use chi2() to calculate the
chi-square values of each categorical variable and target variable.
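The chi2() call itself is not shown here; a minimal sketch that also builds the dict_feature dictionary used in the next step (the fillna(0) call is an assumption to guard against remaining missing values):
from sklearn.feature_selection import chi2
# chi2() returns the chi-square statistic and the p-value for each feature.
chi2_values, p_values = chi2(X_category.fillna(0), y)
dict_feature = dict(zip(X_category.columns, chi2_values))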
ls = sorted(dict_feature.items(),key=lambda item:item[1],reverse=True)
ls
nominal_features = ['Nation','Birth_Place','Gender','Marriage_State','Highest Education','House_State','Work_Years','Unit_Kind','Title','Occupation','Duty','Industry']
numerical_features = [col_ for col_ in df.columns if col_ not in nominal_features ]
numerical_features.pop(0) # Delete the first element from the list.
X_num = df[numerical_features]
The method parameter indicates the method for calculating the correlation coefficient. The options
are as follows:
pearson: Pearson correlation coefficient.
kendall: Kendall rank correlation coefficient, which is used for ordered categorical variables.
spearman: Spearman correlation coefficient, which is mainly used for correlation analysis of
non-linearly and non-normally distributed data.
Calculate the correlation coefficient between continuous independent variables and select the
combination of independent variables whose correlation coefficient is greater than 0.8.
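The computation of corr_matrix is not shown in the guide; a minimal sketch using the Pearson method:
corr_matrix = X_num.corr(method='pearson')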
cols_pair = []
for index_ in corr_matrix.index:
    for col_ in corr_matrix.columns:
        if corr_matrix.loc[index_,col_] >= 0.8 and index_ != col_ and (col_,index_) not in cols_pair:
            cols_pair.append((index_,col_))
cols_pair
----End
1.3.4 Wrapper
In the wrapper selection method, different feature subsets are used for modeling, the model precision
is used as the evaluation indicator for the feature subsets, and a base model is selected to perform
multiple rounds of training. After each round of training, the features with the smallest weight
coefficients are removed, and the next round of training is performed on the new feature set. The
RFE() method of the feature_selection submodule in sklearn is invoked. The logistic regression model
LogisticRegression() is used as the base model, and its parameters are passed to RFE(); a sketch of
the call follows the parameter list below.
The key parameters and methods of RFE() are as follows:
estimator: base training model, which is a logistic regression model in this example.
n_features_to_select: indicates the number of retained features.
fit(X,y): fits and trains the model.
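A minimal sketch of the RFE() invocation (retaining 20 features to match the output below; filling missing values with 0 is an assumption):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
rfe.fit(X.fillna(0), y)
print(rfe.n_features_)
print(rfe.support_)
print(rfe.ranking_)
print(rfe.estimator_)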
Output:
20
[ True True False True True True False True True True True False
False True False True True False True True True True True True
True False False False True False]
[ 1 1 9 1 1 1 10 1 1 1 1 6 3 1 11 1 1 8 1 1 1 1 1 1
1 5 4 7 1 2]
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)
The fitted RFE() object provides the following attributes:
n_features_: number of selected features, that is, the value of the n_features_to_select
parameter passed to RFE().
support_: indicates, at each feature's position, whether the feature is selected.
True indicates that the feature is retained, and False indicates that the feature is removed.
ranking_: indicates the feature ranking. ranking_[i] corresponds to the ranking of the i-th feature.
The value 1 indicates an optimal feature. The selected features are the 20 features
whose ranking is 1, namely, the optimal features.
estimator_: returns the fitted base model.
1.3.5 Embedded
The embedded method uses a machine learning model for training to obtain weight coefficients of
features, and selects features in descending order of the weight coefficients.
Common embedded methods are based on either of the following:
Linear model and regularization
Feature selection of a tree model
In a tree model, the importance of a feature is related to the depth at which the feature is used for
splitting: features used closer to the root contribute more. In this experiment, a random forest is
used to calculate feature importance.
The random forest classification method in the sklearn.ensemble submodule is invoked to train the
model by using the fit(X,y) method.
After the model training is complete, the weight evaluation value of each feature is printed.
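The training call is not shown here; a minimal sketch (n_estimators, random_state, and the missing-value handling are assumptions):
from sklearn.ensemble import RandomForestClassifier
cols = X.columns  # feature names paired with the importances below
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X.fillna(0), y)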
sorted_feature=sorted(zip(map(lambda x:round(x,4),rfc.feature_importances_),cols),reverse=True)
sorted_feature
Output:
[(0.1315, 'Ast_Curr_Bal'),
(0.1286, 'Age'),
(0.0862, 'Year_Income'),
(0.0649, 'Std_Cred_Limit'),
(0.043, 'ZX_Max_Account_Number'),
(0.0427, 'Highest Education'),
(0.0416, 'ZX_Link_Max_Overdue_Amount'),
(0.0374, 'ZX_Max_Link_Banks'),
(0.0355, 'Industry'),
(0.0354, 'ZX_Max_Overdue_Duration'),
(0.0311, 'ZX_Total_Overdu_Months'),
(0.0305, 'Marriage_State'),
(0.0305, 'Duty'),
(0.0292, 'Couple_Year_Income'),
(0.0279, 'ZX_Credit_Max_Overdu_Amount'),
(0.0246, 'ZX_Max_Overdue_Account'),
(0.0241, 'ZX_Max_Credit_Banks'),
(0.0221, 'ZX_Max_Credits'),
(0.0205, 'Birth_Place'),
(0.0195, 'Loan_Curr_Bal'),
(0.0173, 'L12_Month_Pay_Amount'),
(0.015, 'ZX_Credit_Max_Overdue_Duration'),
(0.013, 'Title'),
(0.0097, 'ZX_Credit_Total_Overdue_Months'),
(0.0096, 'Nation'),
(0.0084, 'Gender'),
(0.0079, 'Work_Years'),
(0.0064, 'ZX_Max_Overdue_Credits'),
(0.0059, 'House_State'),
(0.0, 'Couple_L12_Month_Pay_Amount')]
del_cols = ['Gender','House_State','Couple_Year_Income','Loan_Curr_Bal','ZX_Max_Credit_Banks','ZX_Max_Overdue_Credits','ZX_Credit_Max_Overdu_Amount','ZX_Credit_Max_Overdue_Duration']
df_select = df.drop(del_cols,axis=1)
df_select.head()
To check the correlation between the newly generated variable and the target variable, construct a
dataset containing the target variable and the newly generated variable first.
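The construction of poly_trans and poly_feature is not shown in the excerpt; a minimal sketch, assuming second-degree polynomial features built from four continuous variables:
from sklearn.preprocessing import PolynomialFeatures
poly_trans = PolynomialFeatures(degree=2)
poly_feature = poly_trans.fit_transform(df_select[['Ast_Curr_Bal','Age','Year_Income','Std_Cred_Limit']].fillna(0))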
poly_features = pd.DataFrame(poly_feature, columns=poly_trans.get_feature_names(['Ast_Curr_Bal','Age','Year_Income','Std_Cred_Limit']))
poly_features['Target']=y
poly_features.head()
The corr() function is used to calculate the correlation coefficient between the newly generated
variable and the target variable.
poly_corrs = poly_features.corr()['Target'].sort_values()
print("five features with the smallest correlation coefficients: \n",poly_corrs.head(5))
print("five features with the largest correlation coefficients: \n",poly_corrs.tail(5))
Output:
2.1 Introduction
2.1.1 About This Lab
Mr. Zhao works in the AI algorithm department of e-commerce platform company A and is responsible
for product recommendation for online businesses. In the modern world of the Internet and e-
commerce, people are overwhelmed by data. However, it is difficult for users to extract the
information they are interested in from this data. To help users find product information, a
recommendation system can compute similarities between users and products and provide suggestions
to customers based on these similarities. A recommendation system is beneficial in:
Helping users find the right products.
Increasing user engagement. For example, Google News saw a 40% increase in hits due to
recommendations.
Helping item providers deliver items to the right users. At Amazon, 35% of products are
sold through recommendations.
Helping personalize the recommended content. In Netflix, most rented movies are
recommended ones.
2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
Step 1 Import the required packages.
Functions in the NumPy library are used to perform basic operations on arrays. Pandas provides many
data processing methods and time series operations.
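The import statements are not shown in the excerpt; a minimal sketch (matplotlib and seaborn are assumed because they are used for the plots later in this section):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns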
electronics_data=pd.read_csv("./data/ratings_Electronics.csv",names=['userId', 'productId','Rating','timestamp'])
electronics_data.head()
You can further view the data size (the number of samples and the number of features) by
using the shape attribute.
electronics_data.shape
electronics_data.dtypes
According to the result, only Rating and timestamp are numeric and can be used for
mathematical calculation. If userId and productId need to be used for mathematical calculation,
convert their types first. In addition, you can use the info() function to view general information
about the data.
electronics_data.info()
The result contains the number of data samples, feature type, data type, and data storage size. The
info function can display the preceding information by default, but you can set an item to False to
hide the item. For example, you can run the following command to hide the data storage size:
electronics_data.info(memory_usage =False)
electronics_data.describe()['Rating']
The result contains the average value, maximum value, minimum value, standard deviation, and
quartile of the data, and the product rating is generally about 4. You can use the min() and max()
functions to print the maximum and minimum value of the rating.
You can also use the print() function to print the result or the value of a parameter. According to the
result, the highest rating is 5, indicating that users' ratings on the product are generally high.
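A minimal sketch of the check described above:
print('Minimum rating:', electronics_data['Rating'].min())
print('Maximum rating:', electronics_data['Rating'].max())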
electronics_data.drop(['timestamp'], axis=1,inplace=True)
electronics_data.head()
no_of_rated_products_per_user = electronics_data.groupby(by='userId')['Rating'].count().sort_values(ascending=False)
no_of_rated_products_per_user.head()
no_of_rated_products_per_user.describe()
After obtaining the number of ratings per user sorted in descending order, compute the quantiles
by using the quantile() function and display them in a chart.
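The quantile code is not shown; a minimal sketch (the quantile levels are an assumption):
quantiles = no_of_rated_products_per_user.quantile(np.arange(0, 1.01, 0.01))
plt.figure(figsize=(10, 6))
plt.plot(quantiles)
plt.title('Quantiles of ratings per user')
plt.show()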
print('\n No of rated product more than 50 per user : {}\n'.format(sum(no_of_rated_products_per_user >= 50)) )
----End
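The construction of new_df used below is not shown in the excerpt; a minimal sketch, assuming it keeps only the ratings from users who rated at least 50 products:
active_users = no_of_rated_products_per_user[no_of_rated_products_per_user >= 50].index
new_df = electronics_data[electronics_data['userId'].isin(active_users)]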
new_df.groupby('productId')['Rating'].mean().sort_values(ascending=False).head()
# Obtain the product rankings sorted by the number of ratings.
new_df.groupby('productId')['Rating'].count().sort_values(ascending=False).head()
ratings_mean_count = pd.DataFrame(new_df.groupby('productId')['Rating'].mean())
ratings_mean_count['rating_counts'] = pd.DataFrame(new_df.groupby('productId')['Rating'].count())
ratings_mean_count.head()
The result shows that the product with the highest average rating is rated by 1051 users.
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['Rating'].hist(bins=50)
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
sns.jointplot(x='Rating', y='rating_counts', data=ratings_mean_count, alpha=0.4)
Sort the products by the number of users who rate the products, to obtain the product popularity.
popular_products = pd.DataFrame(new_df.groupby('productId')['Rating'].count())
most_popular = popular_products.sort_values('Rating', ascending=False)
most_popular.head(30).plot(kind = "bar")
----End
new_df1=new_df.head(10000)
ratings_matrix = new_df1.pivot_table(values='Rating', index='userId', columns='productId', fill_value=0)
ratings_matrix.head()
You can use the shape attribute to view the table size, and then transpose the table. The data in the
table is the product ratings from users.
ratings_matrix.shape
X = ratings_matrix.T
X.head()
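# decomposed_matrix is not defined in the excerpt; a minimal sketch, assuming a
# truncated SVD of the product-user matrix X (the number of components is an assumption).
from sklearn.decomposition import TruncatedSVD
SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)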
correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
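# The product selection is not shown in the excerpt; a minimal sketch, assuming the
# "20th product" is the one at index position 19 of X.
i = X.index[19]
product_ID = list(X.index).index(i)
correlation_product_ID = correlation_matrix[product_ID]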
#Select products whose coefficient of correlation with the 20th product is greater than 0.65.
Recommend = list(X.index[correlation_product_ID > 0.65])
# Delete the 20th product.
Recommend.remove(i)
Recommend[0:10]# Recommend products ranked ahead to the users who like the 20th product.
As shown in the result, there are eight products whose coefficient of correlation with the 20th product
(9984984354) is greater than 0.65. You can also select other products to view their similar products.
----End
3.1 Introduction
Under the impact of the Internet, financial institutions face both internal and external challenges.
On one hand, they encounter fierce competition and performance pressure from large financial and
technology enterprises; on the other hand, more and more criminal groups use artificial intelligence
(AI) technologies to commit crimes more efficiently. These risks are hidden in every transaction
phase, and if they are not prevented, the losses will be irreparable. Therefore, financial
institutions have increasingly high requirements on risk management accuracy and approval efficiency.
This experiment discusses the problem and walks through the practice step by step from the
perspectives of problem statement, breakdown, priority ranking, solution design, key point analysis,
and summary and suggestions, cultivating project implementation thinking and building an analysis of
private credit default prediction from scratch.
3.1.1 Objectives
Upon completion of this task, you will be able to:
Understand the significance of credit default prediction.
Master the development process of big data mining projects.
Master the common algorithms for private credit default prediction.
Understand the importance of data processing and feature engineering.
Master the common methods for data preprocessing and feature engineering
Master the algorithm principles of logistic regression and XGBoost, and understand the key
parameters.
3.1.2 Background
The case in this document is for reference only. The actual procedure may vary. For details, see the
corresponding product documents.
The company has just set up a project team for private credit default prediction. Engineer A was
appointed as the offline development PM of the project. This project aims to:
Identify high-risk customers efficiently and accurately using new technologies.
Make risk modes data-based by using scientific methods.
Provide objective risk measurement.
Reduce subjective judgments.
Improve risk management efficiency.
Save labor costs.
The ultimate goal is to productize the results, so that front-end operating departments can identify
transactions with credit default risks in a timely manner to avoid corporate losses.
3.2 Procedure
3.2.1 Reading Data
First, import the dataset. This document uses the third-party module Pandas to import the dataset.
import pandas as pd
# Use pd.read_csv to read the dataset. (The dataset is stored in the current directory so that it can be read directly.)
# ./credit.csv refers to the current directory. The forward slash (/) is used, as in Linux file paths.
# Windows paths use the backslash (\), but the forward slash should be used here, as in the Linux OS.
# Note: the forward slash (/) shares a key with the question mark (?) on the keyboard.
data=pd.read_csv('./credit.csv')
# An auxiliary module warnings can be imported.
import warnings
warnings.filterwarnings('ignore')
# This module can help filter many redundant and annoying warnings.
# After data reading, some simple operations can be performed, for example:
# Run the following command to view all data.
data
# Run the following command to view the first 10 rows of data.
data.head(10)
# Run the following command to view the length and width of data in the matrix format.
data.shape
X=data.drop(['Cust_No','Target'],axis=1)
y=data['Target']
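The split code is not shown in the excerpt; a minimal sketch of the call described below (random_state and the fillna(0) call are assumptions):
from sklearn.model_selection import train_test_split
# Assumption: remaining missing values are filled with 0; the guide's own preprocessing is not shown.
X = X.fillna(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=0)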
X_train is the training set, and y_train contains its labels. X_test is the test set, and y_test
contains its labels. test_size=0.1 indicates that the ratio of the training set to the test set is
9:1. shuffle indicates that the data is shuffled before splitting.
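The standardization commands referred to in the next paragraph are not shown in the excerpt; a minimal sketch (the variable names are assumptions):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)                      # learn the mean and standard deviation
X_train_std = scaler.transform(X_train)  # transform the training set
X_test_std = scaler.transform(X_test)    # transform the test set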
Standardization rescales the data to zero mean and unit variance, the scale of a standard normal
distribution. In natural environments, truly random data tends to follow a normal distribution with
an aggregation point; a completely uniform distribution is not natural but artificially made. In the
preceding commands, the standardization function StandardScaler() is first declared. The fit function
then computes the mean and standard deviation of the dataset, and transform is used to transform the
data.
The proportion of non-defaulters in the data is very large. Therefore, due to class imbalance, the
model tends to classify everyone as a non-defaulter. Check the current class ratio first.
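A minimal sketch of the check (the class-balanced sets X_train_fix and y_train_fix used below are assumed to come from a resampling step that is not shown in this excerpt):
print(y_train.value_counts(normalize=True))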
# Declare the logistic regression algorithm and set max_iter (the maximum number of training iterations) to 500.
# Use cross-validation to evaluate the model: cv=5 indicates that the dataset is split into five equal parts.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr_model = LogisticRegression(solver='liblinear',max_iter=500)
cv_scores = cross_val_score(lr_model,X_train_fix,y_train_fix,scoring='roc_auc',cv=5)
# Apply grid search to find the optimal parameters through traversal.
# Import the grid search module.
from sklearn.model_selection import GridSearchCV
# C indicates the regularization coefficient.
c_range=[0.001,0.01,0.1,1.0]
# solvers indicates the optimization method.
solvers=['liblinear','lbfgs','newton-cg','sag']
# Combine the regularization coefficient with the optimization method using the dictionary method.
tuned_parameters=dict(solver=solvers,C=c_range)
# Declare the logistic regression algorithm.
lr_model=LogisticRegression(solver='liblinear',max_iter=500)
# Declare the grid search algorithm and describe the cross verification method.
grid=GridSearchCV(lr_model,tuned_parameters,cv=5,scoring='roc_auc')
# Perform training.
grid.fit(X_train_fix,y_train_fix)
# Check the optimal accuracy.
print(grid.best_score_)
# Check which parameters are optimal.
print(grid.best_params_)
4.1 Introduction
4.1.1 About This Lab
This experiment predicts whether passengers on the Titanic survived, based on the Titanic datasets.
4.1.2 Objectives
Upon completion of this task, you will be able to:
Use the Titanic datasets open to the Internet as the model input data.
Build, train, and evaluate machine learning models
Understand the overall process of building a machine learning model.
4.2 Procedure
4.2.1 Importing Related Libraries
import pandas as pd
import numpy as np
# matplotlib and seaborn are used for the plots later in this section.
import matplotlib.pyplot as plt
import seaborn as sns
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')
combine = [train_df, test_df]
print(train_df.columns.values)
train_df.head()
The data overview helps check whether some data is missing and what the data type is.
train_df.info()
test_df.info()
The related numeric-type information of the data helps check the average value and other statistics.
train_df.describe()
The character-type summary shows the number of unique values, the most frequent value, and its
frequency.
train_df.describe(include=['O'])
Step 3 Check the survival probability corresponding to each feature based on statistics.
The intuitive data shows that passengers in class 1 cabins are more likely to survive.
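A minimal sketch of the statistic behind this observation, using Pclass as an example:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)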
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
The following figure shows the survival distribution determined based on age.
----End
data=pd.concat([train_df,test_df],ignore_index=True)
data.isnull().sum()
data['Embarked'].fillna(str(data['Embarked'].mode()[0]),inplace=True)
data['Fare'].fillna(int(data['Fare'].mode()[0]),inplace=True)
data['Age'].fillna(data['Age'].mean(),inplace=True)
Delete the less significant columns. Before doing so, save the Survived column as Target.
Target=data['Survived']
data=data.drop(['Cabin','Name','Ticket','Survived'],axis=1)
data.isnull().sum()
data['Sex'].value_counts()
Obtain each character-type value and replace it with a numeric value.
data['Sex']=data['Sex'].replace(['male','female'],[0,1])
data['Embarked']=data['Embarked'].replace(['S','C','Q'],[0,1,2])
test.csv cannot be used for training or testing because it does not contain Target. train.csv
contains 891 samples (with Target), which need to be extracted.
X=data[:891]
y=Target[:891]
----End
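The model training code is not shown in the excerpt; a minimal sketch (the algorithm choice and its parameters are assumptions):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))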
model.predict(data[891:])
----End
5 Linear Regression
5.1 Introduction
5.1.1 About This Lab
This experiment uses the basic Python code and the simplest data to reproduce how a linear
regression algorithm iterates and fits the existing data distribution.
The NumPy and Matplotlib modules are used in the experiment. NumPy is used for calculation, and
Matplotlib is used for drawing.
5.1.2 Objectives
Upon completion of this task, you will be able to:
Be familiar with basic Python statements.
Master the procedure for implementing linear regression.
5.2 Procedure
5.2.1 Preparing Data
Randomly create ten data points that follow an approximately linear relationship.
Convert the data into an array format so that multiplication and addition can be applied to it
directly.
Code:
# Import the required modules NumPy for calculation and Matplotlib for drawing.
import numpy as np
import matplotlib.pyplot as plt
#This code is used only for Jupyter Notebook.
%matplotlib inline
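# The data-creation code is not shown in the excerpt. A minimal sketch: ten points with an
# approximately linear relationship (the exact values, the initial a and b, and the
# learning rate Lr used by optimize() below are assumptions).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9])
a = 0       # initial slope
b = 0       # initial intercept
Lr = 0.01   # learning rate
plt.scatter(x, y)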
Output:
# The basic linear regression model is wx+b. In this example, the model is ax+b as a two-dimensional space is used.
def model(a,b,x):
    return a*x+b

# The mean square error loss function is the most commonly used loss function in the linear regression model.
def loss_function(a,b,x,y):
    num=len(x)
    predict=model(a,b,x)
    return (0.5/num)*(np.square(predict-y)).sum()

# The optimization function mainly uses the partial derivatives to update a and b.
def optimize(a,b,x,y):
    num=len(x)
    predict=model(a,b,x)
    da = (1.0/num) * ((predict-y)*x).sum()
    db = (1.0/num) * ((predict-y).sum())
    a = a - Lr*da
    b = b - Lr*db
    return a, b

def iterate(a,b,x,y,times):
    for i in range(times):
        a,b = optimize(a,b,x,y)
    return a,b
Output:
Step 2 Perform the second iteration and display the parameter values, loss values, and visualization effect.
Code:
a,b = iterate(a,b,x,y,2)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Step 3 Perform the third iteration and display the parameter values, loss values, and visualization effect.
Code:
a,b = iterate(a,b,x,y,3)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Step 4 Perform the fourth iteration and display the parameter values, loss values, and visualization effect.
Code:
a,b = iterate(a,b,x,y,4)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Step 5 Perform the fifth iteration and display the parameter values, loss values, and visualization effect.
Code:
a,b = iterate(a,b,x,y,5)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Step 6 Perform the 10000th iteration and display the parameter values, loss values, and visualization effect.
Code:
a,b = iterate(a,b,x,y,10000)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
----End
5.3.2 Question 2
What is the function of Lr? What happens when Lr is modified?
6.1 Introduction
6.1.1 About This Lab
This experiment uses a dataset with a small sample quantity. The dataset includes the open-source
Iris data provided by scikit-learn. The Iris prediction project is a simple classification model. By using
this model, you can understand the basic usage and data processing methods of the machine learning
library sklearn.
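The data-loading code referred to below is not shown in this excerpt; a minimal sketch:
from sklearn.datasets import load_iris
iris = load_iris()
x = iris.data    # features
y = iris.target  # labels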
According to the preceding code, x is specified as a feature, and y as a label. The dataset includes a
total of 150 samples and four features: sepal length, sepal width, petal length, and petal width.
x.shape
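# The split code is not shown in the excerpt; a minimal sketch
# (the split ratio and random_state are assumptions).
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=0)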
train_X.shape
Logistic regression is used for modeling first. The one-vs-rest (OvR) multiclass method is used for
logistic regression by default.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(train_X,train_y)
print('Logistic Regression:',model.score(test_X,test_y))
from sklearn import svm
model = svm.SVC()
model.fit(train_X,train_y)
print('SVM:',model.score(test_X,test_y))
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(train_X,train_y)
print('Decision Tree:',model.score(test_X,test_y))
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_X,train_y)
print('KNN:',model.score(test_X,test_y))
Three neighbors are set for the k-nearest neighbors algorithm. A different number of neighbors can be
tried for better accuracy. Therefore, a loop is used to find the optimal number of neighbors.
t=[]
for i in range(1,11):
    model=KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X,train_y)
    print('neighbor:{},acc:{}'.format(i,model.score(test_X,test_y)))
    t.append(model.score(test_X,test_y))
plt.plot([i for i in range(1,11)],t)
As shown in the figure above, the k-nearest neighbors algorithm has the optimal effect when there is
one nearest neighbor.
After standardization, the standard deviation is 1, and the mean value is infinitely close to 0.
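The standardization code is not shown in the excerpt; a minimal sketch:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_X_std = scaler.fit_transform(train_X)
test_X_std = scaler.transform(test_X)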
model = svm.SVC()
model.fit(train_X_std,train_y)
print('SVM:',model.score(test_X_std,test_y))
The SVM is then used for modeling after standardization; the standardized training set and test set
are passed in under their new names.
As shown above, the SVM precision is also improved after standardization.
7.1 Introduction
Emotion analysis is a classification technology based on natural language processing (NLP) and is
usually used to extract the emotional content of texts. Compared with related recommendations and
precision marketing, users prefer to view or listen to the personal experience and feedback of
similar users. For example, evaluations from users who have purchased similar products, and
comparison results from users who have used them, can bring value to both users and enterprises.
This experiment discusses the problem and walks through the practice step by step from the
perspectives of problem statement, breakdown, priority ranking, solution design, key point analysis,
and summary and suggestions, cultivating project implementation thinking and building an analysis of
the evaluation emotion analysis project from scratch.
7.1.1 Objectives
Upon completion of this task, you will be able to:
Clarify the function and business value of emotion analysis.
Understand the differences between conventional machine learning and deep learning in
emotion analysis methods.
Clarify label extraction methods for emotion analysis.
Master deep learning-based emotion analysis methods.
Understand future applications of emotion analysis.
7.2 Procedure
7.2.1 Data Management
The following information is involved:
Id: ID
reviews.rating: score
reviews.text: text evaluation
reviews.title: evaluation keywords
reviews.username: name of the evaluator
This dataset contains 21 attribute fields and 34,657 data samples. The experiment aims to analyze
customer evaluation data. Therefore, this document describes only the data attributes required in this
experiment.
Step 1 Import common library files such as sklearn, pandas, and numpy.
sklearn is a powerful third-party machine learning library for Python. It covers everything from data
preprocessing to model training. Most functions in the sklearn library are classified into estimators
and transformers. An estimator is equivalent to a modeling tool and is used to predict data.
Common estimator methods include fit(x,y) and predict(x). A transformer is used to process data,
for example, to reduce dimensions or standardize data. Common transformer methods include
transform(x) and fit_transform(x).
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import nltk.classify.util
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
%matplotlib inline
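# The data-loading code is not shown in the excerpt; a minimal sketch
# (the file path is an assumption).
temp = pd.read_csv('./reviews.csv')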
temp.head()
Output:
This experiment uses only the reviews.rating, reviews.text, reviews.username, and reviews.title
attribute columns. Therefore, extract these four columns from the dataset and name the extracted
DataFrame permanent to facilitate the subsequent steps.
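The extraction itself is not shown; a minimal sketch:
permanent = temp[['reviews.rating', 'reviews.text', 'reviews.username', 'reviews.title']]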
print(permanent.isnull().sum())
permanent.head()
Output:
The reviews.rating attribute column is indispensable to emotion analysis. The dataset contains 34,657
data samples. The data volume is large. Therefore, you can delete the data samples with the
reviews.rating value missing. Specifically, you can extract the data without the reviews.rating value
and name the data senti, and extract the data with the reviews.rating value and name the data check.
check = permanent[permanent["reviews.rating"].isnull()]
senti= permanent[permanent["reviews.rating"].notnull()]
With respect to score processing, this experiment defines data samples with a reviews.rating value
greater than or equal to 4 as positive (pos) and those with a reviews.rating value less than 4 as
negative (neg), and stores the result in a new column named senti.
replace(x,y): replaces x with y.
senti["senti"] = senti["reviews.rating"]>=4
senti["senti"].value_counts().plot.bar()
Output:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
cleanup_re = re.compile('[^a-z]+')
def cleanup(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    return sentence

senti["Summary_Clean"] = senti["reviews.text"].apply(cleanup)
check["Summary_Clean"] = check["reviews.text"].apply(cleanup)
Output:
Use 80% of data in split as the training set through split.sample(), remove the data that has been used
in the training set train from split through drop(), and use the remaining data as the test set test.
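The split DataFrame is not defined in the excerpt; a minimal sketch, assuming it holds the cleaned text and the senti label:
split = senti[["Summary_Clean", "senti"]]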
train=split.sample(frac=0.8,random_state=200)
test=split.drop(train.index)
Output:
def word_feats(words):
    features = {}
    for word in words:
        features[word] = True
    return features

train["words"] = train["Summary_Clean"].str.lower().str.split()
test["words"] = test["Summary_Clean"].str.lower().str.split()
check["words"] = check["Summary_Clean"].str.lower().str.split()
train.index = range(train.shape[0])
test.index = range(test.shape[0])
check.index = range(check.shape[0])
prediction = {}
prediction = {}
Map every word in train["words"] to True and attach the neg or pos label to each sample based on the
scoring criteria.
train_naive = []
test_naive = []
check_naive = []
for i in range(train.shape[0]):
    train_naive = train_naive + [[word_feats(train["words"][i]), train["senti"][i]]]
for i in range(test.shape[0]):
    test_naive = test_naive + [[word_feats(test["words"][i]), test["senti"][i]]]
for i in range(check.shape[0]):
    check_naive = check_naive + [word_feats(check["words"][i])]

classifier = NaiveBayesClassifier.train(train_naive)
print("NLTK Naive bayes Accuracy : {}".format(nltk.classify.util.accuracy(classifier, test_naive)))
classifier.show_most_informative_features(5)
Use a trained classifier to attach emotion labels to the test set and verification set to predict whether
words in the test set and verification set are positive or negative.
y = []
only_words = [test_naive[i][0] for i in range(test.shape[0])]
for i in range(test.shape[0]):
    y = y + [classifier.classify(only_words[i])]
prediction["Naive"] = np.asarray(y)
Output:
y1 = []
for i in range(check.shape[0]):
    y1 = y1 + [classifier.classify(check_naive[i])]
check["Naive"] = y1
Output:
The original dataset check does not contain reviews.rating data. As shown in the preceding figure,
after the classifier is built on the training set, it predicts whether each review is negative or
positive.
Use the CountVectorizer class to perform vectorization, invoke the TfidfTransformer class to perform
preprocessing, construct the term frequency (TF) vector, and calculate the importance of words. The
training set, test set, and verification set are obtained, which are X_train_tfidf, X_test_tfidf, and
checktfidf, respectively.
The main idea of TF-IDF is as follows: if a word or phrase has a high frequency in one article but a
low frequency in other articles, the word or phrase is considered to have a good class-distinguishing
capability. TF-IDF tends to filter out commonly used words and retain important words.
The CountVectorizer class converts the words in the text into a term-frequency matrix, and its
fit_transform() function counts the number of times each word appears. In general, you can use
CountVectorizer to extract features and then use TfidfTransformer to calculate the weight of each
feature.
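The vectorization code is not shown in the excerpt; a minimal sketch (min_df and ngram_range are assumptions):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vect = CountVectorizer(min_df=2, ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(train["Summary_Clean"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_new_counts = count_vect.transform(test["Summary_Clean"])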
X_test_tfidf = tfidf_transformer.transform(X_new_counts)
checkcounts = count_vect.transform(check["Summary_Clean"])
checktfidf = tfidf_transformer.transform(checkcounts)
Output:
Output:
Output:
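The model-training code is not shown in the excerpt; a minimal sketch of the logistic regression model referred to below (the other two models in the comparison are omitted here, and the C value is an assumption):
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logistic = logreg.fit(X_train_tfidf, train["senti"])
prediction["Logistic"] = logistic.predict(X_test_tfidf)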
In comparison, the LR model has higher accuracy than the other two models.
words = count_vect.get_feature_names()
feature_coefs = pd.DataFrame( data = list(zip(words, logistic.coef_[0])), columns = ['feature', 'coef'])
feature_coefs.sort_values(by="coef")
def format(x):
    if x == 'neg':
        return 0
    if x == 0:
        return 0
    return 1

vfunc = np.vectorize(format)
test.senti = test.senti.replace(["pos" , "neg"] , [True , False] )
def test_sample(model, sample):
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = model.predict(sample_tfidf)[0]
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f" % (result.upper(), prob[0], prob[1]))

test_sample(logreg, "The product was good and easy to use")
test_sample(logreg, "the whole experience was horrible and product is worst")
test_sample(logreg, "product is not good")
Output:
The classifier accurately provides the positive probability and negative probability of each sentence.
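The show_wordcloud() helper is not defined in the excerpt; a minimal sketch, assuming the third-party wordcloud package is installed:
from wordcloud import WordCloud, STOPWORDS

def show_wordcloud(data, title=None):
    # Build a word cloud from the cleaned review text.
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white',
                          max_words=200, max_font_size=40).generate(str(data))
    plt.figure(figsize=(8, 8))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()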
show_wordcloud(senti["Summary_Clean"])
Output:
----End
8.1 Introduction
8.1.1 About This Lab
This experiment uses a dataset with a small sample quantity. The dataset includes the open-source
Boston housing price data provided by scikit-learn. The Boston housing price forecast project is a
simple regression model. By using this model, you can understand the basic usage and data processing
methods of the machine learning library sklearn.
8.1.2 Objectives
Upon completion of this task, you will be able to:
Use the Boston housing price dataset open to the Internet as the model input data.
Build, train, and evaluate machine learning models
Understand the overall process of building a machine learning model.
Master the application of machine learning model training, grid search, and evaluation
indicators.
Master the application of related APIs.
8.2 Procedure
8.2.1 Introducing the Dependency
Code:
#Introduce algorithms.
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression, ElasticNet
#Compared with SVC, it is the regression form of SVM.
from sklearn.svm import SVR
#Integrate algorithms.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data
Step 1 Load the Boston housing price dataset and display related attributes.
Code:
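# The loading code is not shown in the excerpt; a minimal sketch.
# Note: load_boston is available in older scikit-learn versions only (it was removed in 1.2).
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
print("Feature column names: {}, sample quantity: {}, feature quantity: {}, target sample quantity: {}".format(
    boston.feature_names, boston.data.shape[0], boston.data.shape[1], boston.target.shape[0]))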
Output:
Feature column names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'], sample
quantity: 506, feature quantity: 13, target sample quantity: 506
x = pd.DataFrame(boston.data, columns=boston.feature_names)
x.head()
Output:
Output:
----End
Code:
# The enclosing function definition is not shown in this excerpt; the name and
# signature below are assumptions.
from sklearn.metrics import r2_score

def evaluate_model(model, x_train, y_train, x_test, y_test):
    model_fitted = model.fit(x_train, y_train)
    y_pred = model_fitted.predict(x_test)
    score = r2_score(y_test, y_pred)
    return score
Output:
'''
'kernel': kernel function
'C': SVR regularization factor
'gamma': 'rbf', 'poly' and 'sigmoid' kernel function coefficient, which affects the model performance
'''
parameters = {
'kernel': ['linear', 'rbf'],
'C': [0.1, 0.5,0.9,1,5],
'gamma': [0.001,0.01,0.1,1]
}
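# The grid-search call is not shown in the excerpt; a minimal sketch over the
# parameters defined above (cv and the x_train/y_train variables are assumptions;
# they come from a train/test split that is not shown here).
from sklearn.model_selection import GridSearchCV
model = GridSearchCV(SVR(), param_grid=parameters, cv=3)
model.fit(x_train, y_train)
print(model.best_params_)
print(model.best_score_)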
Output:
Output:
##Perform visualization.
ln_x_test = range(len(x_test))
y_predict = model.predict(x_test)
Output:
----End
9.1 Introduction
9.1.1 About This Lab
This experiment performs modeling based on the k-means algorithm by using the virtual dataset
automatically generated by sklearn to obtain user categories. It is a clustering experiment, which can
find out the method for selecting the optimal k value and observe the effect in a visualized manner.
import numpy as np
import matplotlib.pyplot as plt
The built-in make_blobs tool of sklearn is used to create the virtual data, which conforms to a
normal distribution. Parameter settings are as follows (a code sketch follows the list):
n_samples: set to 2000, indicating that 2000 sample points are set.
centers: set to 2, indicating that the data actually has two centers.
n_features: set to 2, indicating the number of features.
For ease of illustration in the coordinate system, only two features are used.
n_clusters=5: indicates that five data clusters are expected. However, there are only two data
categories.
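The generation and clustering code is not shown here; a minimal sketch matching the parameters described above:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, y = make_blobs(n_samples=2000, centers=2, n_features=2)
y_pred = KMeans(n_clusters=5).fit_predict(X)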
plt.figure(figsize=(10,10))
plt.scatter(X[:, 0], X[:, 1],c=y_pred)
Output:
Different data is generated each time. Therefore, the output diagram may be different from that in
the lab. To generate the same data, add the random_state parameter during data generation.
X, y = make_blobs(n_samples=2000,centers=2,n_features=2,random_state=3)
In this example, random_state is set to 3. In this way, the same data is generated every time the
code runs.
X, y = make_blobs(n_samples=2000,centers=3,n_features=10,random_state=30)
In this example, ten features are used to generate data, random_state is set to 30, and there are three
categories in theory.
y_pred = KMeans(n_clusters=5).fit_predict(X)
plt.figure(figsize=(10,10))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.subplot(122)
plt.scatter(X[:, 0], X[:, 1],c=y_pred)
----End
import random
centers=random.randint(1,30)
n_features=random.randint(1,30)
X, y = make_blobs(n_samples=2000,centers=centers,n_features=n_features)
First, generate two random numbers ranging from 1 to 30 (indicating that the number of true centers
in the data is unknown), and use a random number of features.
temp=[]
for i in range(1,50):
    model=KMeans(n_clusters=i)
    model.fit(X)
    temp.append(model.inertia_)
Then, perform k-means clustering for each candidate k by using a loop. The .inertia_ attribute
returns the sum of squared distances from the samples to their nearest cluster center.
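The elbow plot referred to below is not shown; a minimal sketch:
plt.figure(figsize=(10, 6))
plt.plot(range(1, 50), temp)
plt.xlabel('k')
plt.ylabel('inertia')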
The result varies each time because of the random numbers. As shown in the preceding figure, the
turning point (elbow) appears at the position corresponding to the value 21. Therefore, 21 is the
optimal k value.