PythonForML2023 Laboratory07 08 Regression Classification Update2
During this laboratory, you will learn the basic implementation and use of regression and classification models in Python. We are using the same red wine dataset, which can be found at https://archive.ics.uci.edu/dataset/186/wine+quality. Next time, you will implement your own ideas, based on the presented examples, for the project datasets.
2. Regression
Regression aims to predict a target value based on some features. We will use several models for this purpose, all of which are available in the scikit-learn library. To start, let's use a very simple linear regression. After importing LinearRegression from sklearn.linear_model, we create a model simply by calling LinearRegression(). Then, to train our model, we use the function fit(), which takes the training data and their labels as arguments. Once the model is trained, we can use the function predict() to make predictions. Importing, creating, and training models, as well as making predictions for the training and validation sets, are shown in the listing below. As you can see, with scikit-learn it is very straightforward.
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)
y_pred_train = reg.predict(x_train)
y_pred = reg.predict(x_val)
For the evaluation of the obtained predictions, we will use the root mean squared error (RMSE), which is calculated according to the following equation:

RMSE = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2},

where \hat{y}_n is the predicted value, y_n is the real value, and N is the number of samples. RMSE can be calculated from scratch as follows.
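The from-scratch listing is missing from this version of the handout; below is a minimal sketch, assuming y_val holds the validation labels and y_pred the predictions obtained above:

import numpy as np

# square the errors, average them, and take the root, as in the equation above
rmse = np.sqrt(np.mean((y_pred - y_val) ** 2))

Alternatively, scikit-learn provides mean_squared_error in sklearn.metrics, from which the root can be taken.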
Task 1: Split the data into training, validation, and test sets. Take the quality as the target value y, and do not forget to drop it from the rest of the data. Implement linear regression and calculate the RMSE for the training and validation sets.
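One possible approach (the split ratios, random_state, and the DataFrame name df are illustrative assumptions) applies scikit-learn's train_test_split twice:

from sklearn.model_selection import train_test_split

y = df['quality']                    # target
x = df.drop(columns=['quality'])     # features without the target
# 60/20/20 split: first separate the test set, then split the rest into train and validation
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=42)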
Task 2: A simple linear model can easily be extended to a weighted linear model just by using the additional parameter sample_weight. Check how to use this parameter and implement weighted linear regression.
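As an illustration only, a minimal sketch in which the weighting scheme (up-weighting high-quality wines) is an arbitrary assumption:

import numpy as np
from sklearn.linear_model import LinearRegression

# give samples with quality >= 7 twice the weight of the others
weights = np.where(y_train >= 7, 2.0, 1.0)
reg_w = LinearRegression()
reg_w.fit(x_train, y_train, sample_weight=weights)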
Now, let's consider a more sophisticated model: a neural network. To use it, we need to import MLPRegressor from sklearn.neural_network.
As neural networks are more complex, we should consider some parameters for the customization of our model:
• Parameter hidden_layer_sizes takes a tuple where each number denotes the number of neurons
in a particular hidden layer. For example, (15, 10, 5) stands for 15 neurons in the first hidden layer,
10 neurons in the second hidden layer, and 5 neurons in the third hidden layer.
• random_state allows for reproducibility.
• max_iter is the maximal number of epochs, in case convergence is not reached earlier.
• solver sets the algorithm for weight optimization.
These are only selected parameters; you can find the description of all of them in the scikit-learn documentation.
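Putting these parameters together, a minimal sketch of creating and training such a regressor (the particular values are only an example):

from sklearn.neural_network import MLPRegressor

# three hidden layers (15, 10, and 5 neurons), Adam optimizer, fixed seed for reproducibility
reg = MLPRegressor(hidden_layer_sizes=(15, 10, 5), solver='adam', max_iter=500, random_state=42)
reg.fit(x_train, y_train)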
Task 3: Implement a neural network regressor. Make predictions and calculate the RMSE for the training and validation sets. Repeat the previous step, trying your own ideas for the model parameter setup.
As you can see, once the model is created, training and making predictions look the same for different models, which helps keep the code consistent and makes examining different models easy. However, not all regression methods have a direct implementation. Still, they may often be implemented quite easily with scikit-learn. If you would like to learn how to cleverly implement polynomial regression using a linear regression model, take a look here: https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions. Generally, handling nonlinear models as linear models operating on nonlinear functions of the features might be beneficial, as it keeps the fast performance of linear methods while allowing more complex problems to be solved.
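A minimal sketch of that approach, with degree 2 as an arbitrary choice:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# expand the inputs with all polynomial terms up to degree 2,
# then fit an ordinary linear model on the expanded features
poly = PolynomialFeatures(degree=2)
x_train_poly = poly.fit_transform(x_train)
reg = LinearRegression()
reg.fit(x_train_poly, y_train)
y_pred = reg.predict(poly.transform(x_val))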
3. Classification
As classification aims to assign a class label to each sample, we will modify our task. Suppose that we want to predict whether a wine is good or not. Let's assume that a wine is good if its quality is higher than five. So, we can simply prepare the labels for training as follows. Do not forget to do the same for the other subsets, or to perform the thresholding in advance.
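The listing is missing from this version of the handout; a minimal sketch of the thresholding (overwriting y_train, so that the later listings can keep using the same name, is an assumption here):

# good wine: quality above five -> label 1, otherwise 0
y_train = (y_train > 5).astype(int)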
Having the labels arranged, we may pick the first classification model. Let's start with the logistic regression classifier. Similarly to the case of regression, after creating the model we use the function fit() for training and predict() for prediction.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train, y_train)
y_pred_train = clf.predict(x_train)
y_pred = clf.predict(x_val)
For now, we will use just the accuracy metric to check the performance of the models. Generally, this is not enough, and we will discuss this issue during lab 9.
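For example, accuracy can be computed with scikit-learn's accuracy_score, assuming y_val has been thresholded in the same way as y_train:

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_val, y_pred)   # fraction of correctly classified samples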
Task 5: Implement the logistic regression model, train it, and make predictions on the training and validation sets. Check its accuracy.
Task 6: Use the logistic regression model, but only with some selected features. Compare the two scenarios by picking four different features.
Various other models might be created and used similarly. Each of them has its own specific parameters that should be tuned, e.g. the kernel of the SVM, the number of neighbors for KNN, or the maximal depth for decision trees.
from sklearn import svm, tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

clf = svm.SVC(kernel='rbf')
clf = MLPClassifier()
clf = KNeighborsClassifier(n_neighbors=5)
clf = tree.DecisionTreeClassifier(max_depth=3)
Task 7: Implement the SVM model. Try different kernels using the kernel parameter.
Task 8: Implement a multi-layer perceptron. Set your own parameter configuration. Some of the parameters were described in the section covering regression.
Task 10: Probably, you've easily obtained around 70-75% accuracy. Try improving the models' performance by configuring their parameters.
Task 11: Pick the model (with tuned parameters) that seems to work best for you. Now train it on both the training and validation sets and check its final performance on the test subset, as sketched below.
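A possible sketch of combining the subsets, assuming pandas objects and an already configured classifier clf:

import pandas as pd

# merge the training and validation data for the final fit
x_full = pd.concat([x_train, x_val])
y_full = pd.concat([y_train, y_val])
clf.fit(x_full, y_full)
print(clf.score(x_test, y_test))   # accuracy on the held-out test set (y_test thresholded like y_train)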
4. Pipelines
The scikit-learn module implements so-called pipelines, which facilitate creating processing routines when we want to execute the same processing path for different data sets, e.g. training and validation. This might include, for example, outlier removal or feature scaling. The Pipeline takes a list of tuples as an argument, each tuple constructed from a name and a transform. The name is a string, and a transform must implement the fit and transform methods. The final transform, serving as the estimator, only requires the fit method. Let's take a look at an example. Here we put the standardization and the MLP into one pipeline, so the data input to the clf.fit() method is first standardized, and then the neural network is trained.
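The construction listing is missing from this version of the handout; a minimal sketch consistent with the description above (the step names 'scaler' and 'mlp' are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

clf = Pipeline([
    ('scaler', StandardScaler()),   # transform step: standardize the features
    ('mlp', MLPClassifier()),       # final step: the estimator
])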
clf.fit(x_train_subset, y_train)
Task 12: Implement a pipeline consisting of two steps: standardization and a neural network. Select a reasonable size of the net and compare its performance with and without feature scaling while using the stochastic gradient descent ('sgd') algorithm for learning.
Task 13: Compare the regression performance (with one selected model) for all features and for a selected subset of features. Firstly, choose two features, then four, and then suggest your own concepts. Remember that for feature selection in regression, the correlation matrix is useful.
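For instance, assuming the whole dataset is in a DataFrame df, the correlations with the target can be inspected as follows:

# absolute correlations with the quality target, strongest first
print(df.corr()['quality'].abs().sort_values(ascending=False))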
Task 14: Implement a linear regression and a neural network regressor with the pH feature as the target.
Task 15: Implement KNN and decision tree classifiers. Try adjusting their parameters.
For your project datasets, discuss specific tasks with the course instructor. However, in general, all teams
should complete the following tasks.
Task 1: Perform the classification or regression task outlined in your project scope using the chosen model.
Draw conclusions.
Task 6: Describe the implementation of the model used to solve the project task in a dedicated report
section.