
Faculty of Mechanical Engineering and Robotics
Department of Robotics and Mechatronics

Python for machine learning and data science

Topic: Regression and classification

Aim of the exercise: Learn the basic implementation of regression and classification models using scikit-learn.

Course supervisor: Ziemowit Dworakowski, [email protected]
Laboratory author: Adam Machynia, [email protected]
1. Introduction

During this laboratory, you will learn the basic implementation and use of regression and classification models
in Python. We are using the same red wine dataset as before, which can be found at
https://archive.ics.uci.edu/dataset/186/wine+quality. Next time, you will implement your own ideas for the
project datasets, based on the presented examples.

2. Regression

Regression aims to predict a target value based on some features. We will use several models for this
purpose, all of which are available in the scikit-learn library. To start, let's use simple linear regression.
After importing LinearRegression from sklearn.linear_model, we create a model simply by calling
LinearRegression(). Then, to train the model, we use the fit() function, which takes the training data and
their labels as arguments. Once the model is trained, we can use the predict() function to make predictions.
Importing, creating, and training a model, and making predictions for the training and validation sets, are
shown in the listing below. As you can see, with scikit-learn it is very straightforward.

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)

y_pred_train = reg.predict(x_train)
y_pred = reg.predict(x_val)

To evaluate the obtained predictions, we will use the root mean squared error (RMSE), calculated
according to the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}_n - y_n\right)^2},$$

where $\hat{y}_n$ is the predicted value, $y_n$ is the real value, and $N$ is the number of samples. RMSE can be
calculated from scratch as follows.

import numpy as np

rmse = np.sqrt(np.mean((y_pred_train - y_train)**2))

One can also use the mean_squared_error function from sklearn.metrics.
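For example, a minimal sketch (the squared=False argument, available in the scikit-learn versions current at the time of writing, makes the function return the root of the error directly):

from sklearn.metrics import mean_squared_error

# squared=False returns RMSE instead of MSE
rmse_train = mean_squared_error(y_train, y_pred_train, squared=False)
rmse_val = mean_squared_error(y_val, y_pred, squared=False)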

Task 1: Split the data into training, validation, and test sets. Take the quality column as the target value y,
and do not forget to drop it from the rest of the data. Implement linear regression and calculate the RMSE
for the training and validation sets.
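A possible starting point (a sketch; the 60/20/20 proportions and the DataFrame name data are assumptions):

from sklearn.model_selection import train_test_split

x = data.drop(columns=['quality'])
y = data['quality']

# first split off the training set, then divide the remainder into validation and test
x_train, x_tmp, y_train, y_tmp = train_test_split(x, y, test_size=0.4, random_state=1)
x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=1)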

Task 2: A simple linear model can be easily extended to a weighted linear model just by using an additional
parameter sample_weight. Check how to use this parameter and implement weighted linear regression.
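As a hint, a minimal sketch of the API (the uniform weights below are only a placeholder that reproduces the plain model; substitute your own weighting idea):

import numpy as np
from sklearn.linear_model import LinearRegression

weights = np.ones(len(y_train))  # placeholder: equal weight for every sample
reg = LinearRegression()
reg.fit(x_train, y_train, sample_weight=weights)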

Now, let’s consider a more sophisticated model: a neural network. To use it, we need to import
MLPRegressor from sklearn.neural_network.

from sklearn.neural_network import MLPRegressor

# simply with almost default parameters


reg = MLPRegressor(random_state=1, max_iter=500)
reg.fit(x_train, y_train)
y_pred_train = reg.predict(x_train)
y_pred = reg.predict(x_val)

# and then try some customization, for example:


reg = MLPRegressor(hidden_layer_sizes=(10, 5), random_state=1,
                   max_iter=1000, solver='lbfgs')

As neural networks are more complex, we should consider some parameters for the customization of our
model.

• The hidden_layer_sizes parameter takes a tuple in which each number denotes the number of
neurons in a particular hidden layer. For example, (15, 10, 5) stands for 15 neurons in the first
hidden layer, 10 neurons in the second, and 5 neurons in the third.
• random_state allows for reproducibility.
• max_iter is the maximal number of epochs, used when convergence is not reached earlier.
• solver sets the algorithm for weight optimization.

These are only selected parameters; you can find descriptions of all of them in the scikit-learn
documentation.

Task 3: Implement a neural network regressor. Make predictions and calculate the RMSE for the training
and validation sets. Then repeat the step, trying your own ideas for the model's parameter setup.

As you can see, once a model is created, training and making predictions look the same for different
models, which keeps the code consistent and makes it easy to examine different models. However, not all
regression methods have a direct implementation. Still, they can often be implemented quite easily with
scikit-learn. If you would like to learn how to cleverly implement polynomial regression using a linear
regression model, take a look here: https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions.
Generally, handling nonlinear models as linear models operating on nonlinear functions of the features can
be beneficial, as it retains the fast performance of linear methods while allowing more complex problems
to be solved.

Task 4: Based on the provided link, implement polynomial regression.
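As a hint, a sketch following the linked documentation: nonlinear features are generated first, and an ordinary linear model is fitted on top of them (degree=2 is an arbitrary assumption):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2)
x_train_poly = poly.fit_transform(x_train)  # fit the feature expansion on training data
x_val_poly = poly.transform(x_val)          # reuse the same expansion for validation

reg = LinearRegression()
reg.fit(x_train_poly, y_train)
y_pred = reg.predict(x_val_poly)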

3. Classification

As classification aims to assign a class label to each sample, we will modify our task. Suppose that we
want to predict whether a wine is good or not. Let's assume that a wine is good if its quality is higher than
five. We can then prepare the training labels as follows. Do not forget to do the same for the other subsets,
or perform the thresholding in advance.

y_train = data_train['quality'] > 5.5
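For instance, the labels for the other subsets can be prepared the same way (a sketch, assuming analogous data_val and data_test frames from your split):

y_val = data_val['quality'] > 5.5
y_test = data_test['quality'] > 5.5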

Having the labels arranged, we can pick the first classification model. Let’s start with the logistic regression
classifier. As with regression, after creating the model we use the fit() function for training and predict()
for prediction.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train, y_train)

y_pred_train = clf.predict(x_train)
y_pred = clf.predict(x_val)

For now, we will use only the accuracy metric to check the performance of the models. In general, this is
not enough; we will discuss this issue during lab 9.

from sklearn.metrics import accuracy_score

acc_train = accuracy_score(y_train, y_pred_train)


acc_val = accuracy_score(y_val, y_pred)

Task 5: Implement the logistic regression model, train it, and make predictions on the training and
validation sets. Check its accuracy.

Task 6: Use the logistic regression model, but only with selected features. Compare two scenarios, each
using four different features:

1) 'alcohol', 'volatile acidity', 'total sulfur dioxide', 'sulphates',
2) 'pH', 'free sulfur dioxide', 'residual sugar', 'fixed acidity'.

Compare the accuracy.
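A minimal sketch for the first scenario (feature subsets are simply column selections on the DataFrames):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

features_1 = ['alcohol', 'volatile acidity', 'total sulfur dioxide', 'sulphates']

clf = LogisticRegression()
clf.fit(x_train[features_1], y_train)
acc_val = accuracy_score(y_val, clf.predict(x_val[features_1]))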

Various other models can be created and used similarly. Each has its own specific parameters that should
be tuned, e.g. the kernel of an SVM, the number of neighbors for KNN, or the maximal depth for decision
trees.

from sklearn import svm


from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree

clf = svm.SVC(kernel='rbf')
clf = MLPClassifier()
clf = KNeighborsClassifier(n_neighbors=5)
clf = tree.DecisionTreeClassifier(max_depth=3)

Task 7: Implement the SVM model. Try different kernels using the kernel parameter.
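A loop sketch for trying the standard kernel options:

from sklearn import svm
from sklearn.metrics import accuracy_score

# compare validation accuracy across the built-in kernels
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVC(kernel=kernel)
    clf.fit(x_train, y_train)
    print(kernel, accuracy_score(y_val, clf.predict(x_val)))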

Task 8: Implement a multi-layer perceptron. Set your own parameter configuration; some of the
parameters were described in the section covering regression.

Task 9: Implement some other classifiers: KNN, a decision tree, or others.

Task 10: You have probably obtained around 70-75% accuracy without much effort. Try to improve the
models' performance by tuning their parameters.
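One systematic approach is an exhaustive grid search over candidate parameter values; below is a sketch with a hypothetical grid for KNN (GridSearchCV cross-validates on the training set):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# hypothetical grid; adapt it to whichever model you are tuning
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
                    scoring='accuracy', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)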

Task 11: Pick the model (with tuned parameters) that seems to work best for you. Now train it on both
the training and validation sets and check its final performance on the test subset.
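A sketch of this final step, assuming pandas objects from the earlier split and a tuned model stored in clf:

import pandas as pd
from sklearn.metrics import accuracy_score

# merge training and validation subsets for the final fit
x_full = pd.concat([x_train, x_val])
y_full = pd.concat([y_train, y_val])

clf.fit(x_full, y_full)
acc_test = accuracy_score(y_test, clf.predict(x_test))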

4. Pipelines

The scikit-learn module implements so-called pipelines, which facilitate creating processing routines
when we want to execute the same processing path for different data sets, e.g. training and validation.
This might include, for example, outlier removal or feature scaling. The Pipeline takes as an argument a
list of tuples, each consisting of a name and a transform. The name is a string, and the transform must
implement the fit and transform methods. The final transform, serving as the estimator, only requires the
fit method. Let’s take a look at the example. Here we put standardization and the MLP into one pipeline,
so the data passed to the clf.fit() method is first standardized and then used to train the neural network.

from sklearn.pipeline import Pipeline


from sklearn.preprocessing import StandardScaler

clf = Pipeline([("scaler", StandardScaler()),
                ("mlp", MLPClassifier(max_iter=500))])

clf.fit(x_train_subset, y_train)

Task 12: Implement a pipeline consisting of two steps: standardization and a neural network. Select
a reasonable size of the net and compare its performance with and without feature scaling while using
the stochastic gradient descent ('sgd') solver for learning.
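A possible comparison sketch (the hidden layer size and iteration limit below are arbitrary assumptions):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

clf_scaled = Pipeline([("scaler", StandardScaler()),
                       ("mlp", MLPClassifier(hidden_layer_sizes=(10,), solver='sgd',
                                             max_iter=500, random_state=1))])
clf_raw = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd',
                        max_iter=500, random_state=1)

# fit both variants and compare validation accuracy
for name, clf in [("with scaling", clf_scaled), ("without scaling", clf_raw)]:
    clf.fit(x_train, y_train)
    print(name, accuracy_score(y_val, clf.predict(x_val)))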

5. Additional tasks and challenges

Task 13: Compare regression performance (with one selected model) for all features and for selected
subsets of features. First choose two features, then four, then try your own ideas. Remember that the
correlation matrix is useful for feature selection in regression.
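For instance, feature-target correlations can be inspected directly with pandas (a sketch, assuming data holds the full DataFrame):

# correlation of every feature with the target, sorted by absolute strength
corr = data.corr()['quality'].drop('quality')
print(corr.abs().sort_values(ascending=False))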

Task 14: Implement linear regression and a neural network regressor with the pH feature as the target.
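The target swap itself is short (a sketch; whether to keep quality among the features is your choice):

y = data['pH']
x = data.drop(columns=['pH'])  # optionally drop 'quality' as well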

Task 15: Implement KNN and decision tree classifiers. Try adjusting their parameters.

6. Tasks for project datasets

For your project datasets, discuss specific tasks with the course instructor. However, in general, all teams
should complete the following tasks.

Task 1: Perform the classification or regression task outlined in your project scope using the chosen model.
Draw conclusions.

Task 2: Repeat the previous task using different models.


• For regression: linear regressor and neural network.
• For classification: SVM and neural network.
Also, experiment with various sets of features and compare the results.

Task 3: Explore some other models that weren't mentioned in Task 2.

Task 4: Prepare your data and code for further testing:


• Ensure clarity and transparency in your code.
• Organize it to facilitate easy modification of scoring metrics and model parameters.

Task 5: Encapsulate your processing path using a pipeline.

Task 6: Describe the implementation of the model used to solve the project task in a dedicated report
section.
