Preface
Machine learning has become mainstream in recent years and has
revolutionized every aspect of our lives. From facial recognition on
social media platforms to route suggestions by digital maps, it is at
the heart of countless technological breakthroughs.
With the increase in the use of machine learning, there has been a
surge in demand for individuals with relevant skills. Perhaps you
would like to add machine learning features to your projects, or you
are required to pick up some machine learning skills for your job.
Whatever your reason is for learning machine learning, this book
aims to cover the major concepts in a step-by-step fashion and
provide you with an excellent base to explore further.
The book uses a hands-on approach and includes many code
examples for you to try out. In addition, you’ll get a chance to
practice what you learned with three hands-on projects.
You can download all the code examples, images, and data files
used in this book at https://www.learncodingfast.com/machine-
learning .
I sincerely hope you find this book useful. If you have any questions
or suggestions regarding the book, feel free to reach out to me at
[email protected] .
Python Series
Learn Python in One Day and Learn It Well (2nd Edition)
Learn Python in One Day and Learn It Well (2nd Edition) –
Workbook
The linear regression algorithm aims to find the best possible values
for the coefficients “a” and “b” by performing calculations on the data
provided. Once the calculations are done, the linear regression
algorithm returns a model, including the values of “a” and “b”. Our
machine can then use the equation y = a + bx (with known values of
“a” and “b”) to make predictions.
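As a concrete illustration (this example is not from the book itself), here is a minimal sketch of fitting a linear regression model with scikit-learn on made-up data:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 2 + 3x
X = np.array([[1], [2], [3], [4]])   # feature values as a 2D array
y = np.array([5, 8, 11, 14])         # target values

model = LinearRegression()
model.fit(X, y)                      # the algorithm computes "a" and "b"

print(model.intercept_)              # the learned value of "a"
print(model.coef_)                   # the learned value of "b"
print(model.predict([[5]]))          # use y = a + bx to predict y for x = 5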
Machine learning models range from simple to complex. Some
complex models involve neural networks and are part of a subset of
machine learning known as deep learning. These models are
inspired by the structure of our brains and involve networks with
multiple connected layers. We will not be covering deep learning
models in this book.
The first cell will likely take a few seconds to run as we need to
connect to the online resources. Next, click on the second code cell,
add the following code, and run it by pressing Shift-Enter:
print('Jupyter Notebook is fun and easy to use')
Here, we first declare and initialize three Python lists ( list1 , list2
, and list3 ).
Notice that list1 and list2 are very similar, except that list1 has
one set of square brackets while list2 has two.
list1 is a one-dimensional (1D) list with four elements.
list2 , on the other hand, is a two-dimensional (2D) list with one
element. This element - [1, 2, 3, 4] - is a list itself and consists of
four elements.
list3 is also a 2D list, with three nested lists of four elements each.
After declaring the lists, we pass them to the NumPy array()
function to convert them to ndarrays. We then print the data types of
the resulting arrays. If you run the code above, you’ll get the
following output:
<class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'>
(4,) tells us that arr1 is a 1D array with four elements, while (1, 4)
tells us that arr2 is a 2D array with one nested array; this nested
array has four elements.
Finally, the last line (3, 4) tells us that arr3 is a 2D array with three
nested arrays; each nested array has four elements.
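The code for this example is not reproduced above; the following is a minimal sketch consistent with the description and output (the exact values in list3 are an assumption):
import numpy as np

list1 = [1, 2, 3, 4]                  # 1D list with four elements
list2 = [[1, 2, 3, 4]]                # 2D list with one nested list
list3 = [[1, 2, 3, 4],                # 2D list with three nested lists
         [5, 6, 7, 8],                # (values assumed for illustration)
         [9, 10, 11, 12]]

arr1 = np.array(list1)
arr2 = np.array(list2)
arr3 = np.array(list3)

print(type(arr1), type(arr2), type(arr3))
print(arr1.shape, arr2.shape, arr3.shape)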
The example above shows how to use the array() function to
convert Python built-in structures to ndarrays. Next, let’s learn to
create NumPy arrays from scratch. To do that, we can use different
predefined functions in the NumPy library. An example is the
linspace() function.
Here, we use the slice 1:6 to select elements from index 1 (i.e., the
element 'b' ) to 5. This gives us the following output:
['b' 'c' 'd' 'e' 'f']
If we add a step of 2 to the slice, we’ll get every 2nd element from
index 1 to 5. In other words,
print(arr3[1:6:2])
gives us
['b' 'd' 'f']
The slice before the comma selects the rows, while the slice after it
selects the columns.
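The code for this example is not shown above; a sketch consistent with the output below (the third row of arr4 is an assumption):
arr4 = np.array([[1, 2, 3, 4],
                 [10, 20, 30, 40],
                 [100, 200, 300, 400]])   # third row assumed for illustration
print(arr4[0:2, 2:4])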
Therefore, arr4[0:2, 2:4] selects elements from rows 0 to 1
(because of the slice 0:2 ) and columns 2 to 3 (because of the slice
2:4 ). If we run the code above, we’ll get the following output
[[ 3 4]
[30 40]]
The first line shows the original array, which is not changed by the
reshape() method. The remaining lines show the reshaped array,
which is a (4, 2) array consisting of 4 rows and 2 columns.
It is necessary to know how to reshape 1D arrays into 2D arrays
because certain methods in the Scikit-Learn library require 2D arrays
as input. We’ll learn more about this library in Chapter 5.
A useful trick when reshaping arrays to two dimensions is to pass
(n, -1) or (-1, n) to the reshape() method and let it infer the
shape for us. For instance, for the example above, we can pass (4,
-1) to reshape arr2 to a 2D array with four rows and an unspecified
number of columns (denoted by -1). When we do that, reshape()
infers the number of columns and returns a (4, 2) array to us.
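A minimal sketch of this, assuming arr2 holds eight values:
arr2 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
reshaped_array = arr2.reshape((4, -1))   # reshape() infers the 2 columns
print(reshaped_array)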
We can also pass (1, -1) to convert a 1D array to a 2D array with
one row and an unspecified number of columns. An example is
shown below:
arr3 = np.array([1, 2, 3, 4, 5])
reshaped_array_2 = arr3.reshape((1, -1))
print(arr3)
print(reshaped_array_2)
If you run the code above, you’ll get the following output:
[1 2 3 4 5]
[[1 2 3 4 5]]
Here, arr1 can be viewed as a table with three rows and two
columns, while arr2 is a table with one row and two columns.
As the two arrays have the same number of columns, we can
concatenate them along axis 0. We do that on the third line by
passing a tuple of the two arrays - (arr1, arr2) - to the function. If
you run the code above, you’ll get the following output:
[[1 2]
[3 4]
[5 6]
[7 8]]
The rows in arr1 and arr2 have been combined to give us a new
array with four rows and two columns.
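The code that produced this result is not shown above; a sketch consistent with the description and output:
import numpy as np

arr1 = np.array([[1, 2], [3, 4], [5, 6]])   # three rows, two columns
arr2 = np.array([[7, 8]])                   # one row, two columns
combined_array = np.concatenate((arr1, arr2), axis=0)
print(combined_array)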
If we want to add columns to this array ( combined_array ), we need
to concatenate it with an array that has 4 rows (i.e., 4 nested arrays).
An example is shown below:
arr3 = np.array([[9], [10], [11], [12]])
combined_array = np.concatenate((combined_array, arr3), axis=1)
print(combined_array)
Wrapper Functions
Some functions in NumPy are wrappers around corresponding
methods in the ndarray class. An example is the NumPy reshape()
function. This function is a wrapper around the reshape() method in
the ndarray class.
A wrapper is a function whose main purpose is to call another
function or method. In other words, if you look under the hood of the
NumPy reshape() function, you’ll see that it calls the reshape()
method in the ndarray class.
To use methods with wrapper functions, we have two options.
Suppose we have an array
arr4 = np.array([1, 2, 3, 4, 5, 6])
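The rest of this example is not shown above; presumably the two options are to call the wrapper function or to call the method directly, along these lines:
# Option 1: use the NumPy wrapper function
print(np.reshape(arr4, (2, 3)))
# Option 2: call the ndarray method directly
print(arr4.reshape((2, 3)))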
series1 = pd.Series(list1)
print(series1, end='\n\n')
If you run the code above, you’ll get the following output:
[1, 2, 3, 4, 5]
0 1
1 2
2 3
3 4
4 5
dtype: int64
P 1
Q 2
R 3
S 4
T 5
dtype: int64
A B
0 1 4
1 2 5
2 3 6
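The code that produced the second and third outputs above is not shown; a sketch consistent with them (the variable names are assumptions):
series2 = pd.Series(list1, index=['P', 'Q', 'R', 'S', 'T'])
print(series2, end='\n\n')

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)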
Here, we specify the row labels of df3 as ['A', 'B'] and the
column labels as ['1st', '2nd', '3rd', '4th', '5th'] .
If you run the code above, you’ll get the following output:
The first three values in this Series are True because the
corresponding rows in classData satisfy the boolean condition
classData['ModuleID'] == 'CS101' .
Last but not least, classData.iloc[:, :-1] selects all the columns
except the last column from all the rows in classData . This is
because the slice after the comma ( :-1 ) selects columns from
index 0 to -1, including column 0 but excluding column -1 (which is
the last column). Therefore, we get a DataFrame with all the rows
and every column except the last.
classData['AverageGPA'] = classData['AverageGPA']*0.1
If you display the first five rows of classData now, you’ll get the
following output:
The values for the first two rows in myData.isnull() are True
because the values for the first two rows in myData are missing.
We can find the total number of missing values in myData using
myData.isnull().sum() .
This tells us that there are 2 missing values in column A for myData .
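A minimal sketch of such a DataFrame and the two calls (the values are made up to match the description):
import numpy as np
import pandas as pd

myData = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5]})
print(myData.isnull())        # True where a value is missing
print(myData.isnull().sum())  # 2 missing values in column A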
Referring back to our classData DataFrame, if we run the statement
classData.isnull().sum()
In both cases, the dropna() method does not change the original
DataFrame. Instead, it returns a new DataFrame. Therefore, we
need to assign the resulting DataFrame to a variable, as shown in
the examples above.
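For instance, the assignment pattern described above looks like this:
myData = myData.dropna()   # assign the returned DataFrame back to the variable
print(myData)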
to_numpy()
Next, we have the to_numpy() method. As the name suggests, this
method converts a pandas Series or DataFrame to a NumPy array.
For instance, the example below converts myData2 to a NumPy
array.
myData2 = pd.DataFrame([[1, 2], [3, 4]], columns = ['A', 'B'])
myArr = myData2.to_numpy()
print(type(myData2))
print(myData2)
print(type(myArr))
print(myArr)
If you run the code above, you’ll get the following output:
<class 'pandas.core.frame.DataFrame'>
A B
0 1 2
1 3 4
<class 'numpy.ndarray'>
[[1 2]
[3 4]]
corr()
Last but not least, we have the corr() method. This method gives
us the pairwise correlation coefficients of columns in a DataFrame.
Correlation is a statistical measure that indicates the extent to which
two variables are related. If two variables move in the same
direction, they have a positive correlation. In contrast, if they move in
opposite directions, they have a negative correlation.
Correlation coefficients range from -1 to 1, with -1 indicating a perfect
negative correlation, 0 indicating no correlation, and 1 indicating a
perfect positive correlation.
A perfect positive correlation occurs when an increase in one
variable coincides with an increase of a fixed amount in the other.
For instance, if we have the following DataFrame:
myData3 = pd.DataFrame({'A':[1, 4, 7, 10], 'B':[1, 2, 3, 4], 'C':
[2, 12, 1, 5]})
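The call that produces the correlation matrix is presumably along these lines:
print(myData3.corr())   # A and B move together in a perfectly linear way, so their coefficient is 1.0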
Here, we declare and initialize two lists sp_x and sp_y for the x and y
values of the scatter plot, respectively. Next, we pass the two lists to
the scatter() function and call the show() function to display the
chart.
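The code being described is not shown here; a sketch with made-up values:
import matplotlib.pyplot as plt

sp_x = [1, 2, 3, 4, 5]   # hypothetical x values
sp_y = [2, 4, 5, 4, 6]   # hypothetical y values
plt.scatter(sp_x, sp_y)
plt.show()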
Calling show() is optional in a Jupyter Notebook as Jupyter
automatically displays the plot when we execute the code.
Therefore, we’ll omit this command in subsequent examples. If you
are not using Jupyter Notebook and the chart does not display, you’ll
need to use the show() function.
If you run the code above, you’ll get the following plot:
In the example above, we declare a list h_x with values from 0 to 10.
Next, we use the hist() function to plot a histogram, specifying the
number of bins as 5. With bins = 5 , the range of h_x (0 to 10) is
divided into 5 equal-width bins.
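A sketch of the code being described (assuming h_x simply contains the integers 0 to 10):
h_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.hist(h_x, bins=5)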
The first bin is from 0 to 2 (including 0 but excluding 2), the second is
from 2 (inclusive) to 4 (exclusive), and so on, while the last bin is
from 8 (inclusive) to 10 (inclusive). The height of each bar represents
the frequency (i.e., the number of elements within each interval).
The code above gives us the following histogram:
Figure 4.3: Histogram
Specifying the number of bins is optional when plotting a histogram
in Matplotlib. In this example, if we do not specify the number of bins,
the function plots one bar for each number from 0 to 10.
Here, we first pass l_x and l_y to a built-in Python function called
zip() . This function pairs the corresponding elements in l_x and
l_y as tuples and returns a zip object, which we assign to a variable
called zipped .
zipped consists of the following tuples: (7, 98) , (1, 2) , (4, 32) ,
(8, 128) , (5, 15) , (2, 28) , and (3, 18) , which are not sorted.
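The sorting step itself is not shown above; a sketch of one way to do it:
zipped = zip(l_x, l_y)
sorted_pairs = sorted(zipped)                 # sort the tuples by their x values
l_x = [pair[0] for pair in sorted_pairs]      # sorted x values
l_y = [pair[1] for pair in sorted_pairs]      # y values in the matching order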
We can now pass l_x and l_y to the plot() function again:
plt.plot(l_x, l_y)
4.3.1 plot()
In the section above, we learned to plot charts using the pyplot
interface. If our data is stored in a pandas Series or DataFrame, in
addition to the pyplot interface, we can use the plot() method in
pandas to plot our charts.
This method uses Matplotlib in the background by default and is very
similar to the pyplot functions discussed above.
However, there are two major differences. Firstly, pyplot functions do
not label the charts we plot, but the pandas plot() method does.
Secondly, instead of having separate functions for different types of
charts, the pandas plot() method uses the kind parameter.
Let’s look at an example.
In the “Plotting a Scatter Plot” section above, we used
plt.scatter(sp_x, sp_y) to plot a scatter plot for sp_x and sp_y .
If you run this example, you’ll get a scatter plot that is very similar to
Figure 4.1 above. However, the plot() method labels the axes of a
scatter plot using labels of the columns used to plot the chart.
Therefore, this new scatter plot's x and y axes will be labeled “A” and
“B”, respectively.
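The pandas version of that scatter plot presumably looks something like this (the DataFrame values are assumptions):
df = pd.DataFrame({'A': sp_x, 'B': sp_y})
df.plot(kind='scatter', x='A', y='B')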
Next, let’s look at how we can use the pandas plot() method to plot
a bar chart and a line graph:
Plotting a Bar Chart
b_x = ['0-3', '4-6', '7-9']
b_y = [20, 50, 30]
df = pd.DataFrame({'A':b_x, 'B':b_y})
df.plot(kind='bar', x='A', y='B')
Plotting a Line Graph
l_x = [7, 1, 4, 8, 5, 2, 3]
l_y = [98, 2, 32, 128, 15, 28, 18]
df = pd.DataFrame({'A':l_x, 'B':l_y})
df = df.sort_values(['A'])
df.plot(kind='line', x='A', y='B')
4.3.2 hist()
Besides the plot() method, pandas comes with another useful
method - hist() - for plotting histograms. This method plots a
histogram for every numerical column in a DataFrame. Let’s look at
an example:
df = pd.DataFrame({'A':[2, 3, 1, 1, 4, 4], 'B':[3, 4, 4, 1, 2, 2]})
df.hist()
plt.figure(figsize=(10, 5))
plt.legend(loc='best')
plt.grid()
plt.xticks([0, 2, 4, 6, 8, 10])
plt.yticks(range(0, 13))
plt.xlabel('Age')
plt.ylabel('Number of Tries')
This gives a scatter plot that looks very similar to Figure 4.9, except
that its x and y axes are labeled “A” and “B”, respectively.
Next, let’s look at how we can plot two charts on a single figure:
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
df = pd.DataFrame({'A':[1, 2, 3, 4], 'B':[4, 2, 3, 1]})
axs[0].scatter(df['A'], df['B'])
axs[1].plot(df['A'], df['B'])
axs[0].set_title('Chart 1')
axs[0].set_xlabel('Age')
axs[0].set_ylabel('Number of Tries')
axs[0].set_xticks([1, 2, 3, 4])
axs[0].set_yticks([1, 2, 3, 4])
If you print the value of X_scaled now, you’ll get the following output:
[[-1.41421356 1.32550825]
[ 0. -1.19926937]
[ 1.41421356 0.56807496]
[ 0. -0.69431384]]
The variance for the first column is not exactly 1 due to the way our
computers represent floating-point numbers. Other than that, the
transform() method has successfully standardized the two columns
in our DataFrame.
The fit_transform() method
In the example above, we called the fit() and transform()
methods separately. Alternatively, we can call the fit_transform()
method. Calling this method is equivalent to calling fit() and then
transform() , but fit_transform() is often optimized and runs
faster than calling the other two methods separately.
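A minimal sketch contrasting the two approaches with StandardScaler (the dataset X is assumed to exist):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Option 1: fit, then transform
scaler.fit(X)
X_scaled = scaler.transform(X)

# Option 2: fit and transform in one step
X_scaled = scaler.fit_transform(X)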
Predictors and the predict() method
Finally, some estimators are able to make predictions when given a
dataset. These estimators are known as predictors and perform the
prediction using a method called predict() .
This method makes predictions based on the parameters calculated
by the fit() method. For instance, the predict() method in the
LinearRegression class uses the coefficients calculated by the
fit() method of the same object to make predictions.
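For instance, the fit-then-predict pattern looks like this (the variable names are assumptions):
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)            # calculate the coefficients
predictions = lr.predict(X_test)    # predict using those coefficients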
The code below replaces the missing values for the Years ,
Strength , and Height columns:
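That code is not reproduced here; a sketch of what it presumably looks like, assuming the DataFrame is named df and mean imputation is used:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
df[['Years', 'Strength', 'Height']] = imputer.fit_transform(
    df[['Years', 'Strength', 'Height']])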
If you print the DataFrame now, you’ll get the following output:
Next, let’s encode the Weight column (which is a feature) using the
OrdinalEncoder class:
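That code is not reproduced here; a sketch under the assumption that the DataFrame is named df:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
df[['Weight']] = oe.fit_transform(df[['Weight']])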
If you read the documentation for this class, you’ll see a constructor
parameter called sparse , which is True by default. This means the
methods in the OneHotEncoder class return a sparse matrix by
default.
A sparse matrix is a matrix that consists of very few non-zero
elements. For instance,
[[0, 0, 0, 4, 0],
[0, 1, 0, 0, 0]]
is a sparse matrix as it only has two non-zero elements. Storing this
matrix as a regular 2D array is a waste of space since most
elements are just 0s. Therefore, sklearn compresses the matrix into
a special data structure that only stores the non-zero values.
Most methods in sklearn accept this sparse structure as input.
However, if you want to work with a regular 2D array instead, you
can pass sparse = False to the OneHotEncoder() constructor when
initializing an object. When you do that, the fit_transform()
method returns a regular NumPy array.
Let’s look at an example:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(dtype=int, sparse = False, drop='first')
color_encoded = ohe.fit_transform(df[['Color']])
df2 = pd.DataFrame(color_encoded, columns =
ohe.get_feature_names())
df = pd.concat((df, df2), axis = 1)
df
Even though the values for the second feature differ by 100% (2 vs.
1), this difference does not contribute much to the distance. In
contrast, the values for the first feature (with a difference of only 1% -
8000 vs. 8080) contribute much more to it. This bias occurs because
the first feature has larger values.
To prevent such biases from occurring, we can scale both features.
There are two possible options for doing so: normalization and
standardization.
Normalization scales a feature to values between 0 and 1 inclusive
using the formula x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the feature.
The code above should require no further explanation. If you run the
code, you’ll get the following output:
Figure 5.8: Features scaled to a range of 0 to 1
We have successfully scaled Years , Strength , and Height to a
range of 0 to 1.
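For reference, the omitted scaling code presumably looks something like this (assuming the DataFrame is named df):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Years', 'Strength', 'Height']] = scaler.fit_transform(
    df[['Years', 'Strength', 'Height']])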
5.3.1 Pipeline
A pipeline allows us to chain different machine learning tasks by
combining multiple estimators into a composite estimator. For
instance, we can use a pipeline to chain a SimpleImputer estimator
with a MinMaxScaler estimator. (Recall that any object with a fit()
method is an estimator.)
All but the last estimator in the pipeline must have the transform()
method (i.e., they must also be transformers), while the last
estimator only needs to have the fit() method.
Let’s look at an example.
The code below may appear a bit jumbled up if you are reading on a
very small screen (e.g., on your mobile phone) or using a large font.
If that’s the case, try switching to landscape mode or using a smaller
font size.
data = pd.DataFrame([[1], [4], [np.nan], [8], [11]], columns=['A'])
pl = Pipeline([
('imp', SimpleImputer(strategy="mean")),
('scaler', MinMaxScaler())
])
print(pl.fit_transform(data))
5.3.2 ColumnTransformer
Next, let’s move on to column transformers.
A column transformer is similar to a pipeline, except that it is only for
data transformation. Therefore, all estimators in a column
transformer must have the transform() method. In addition, a
column transformer does not pass the output of each method call as
input to the next. Instead, it concatenates the results of the method
calls and returns the transformed dataset.
Let’s look at an example:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
('imp', SimpleImputer(strategy="mean"), ['A']),
('scaler', MinMaxScaler(), ['A'])
])
Here, we first import the ColumnTransformer class from the
sklearn.compose module. Next, we create the data DataFrame
using the same values as the previous example.
After that, we instantiate a ColumnTransformer object and assign it
to ct .
The ColumnTransformer() constructor accepts a list of transformers
as input, with each transformer represented as a (name, transformer,
columns) tuple. “name” is a user-chosen string for referencing the
transformer, “transformer” is a transformer object, and “columns” is a
list of columns.
In the example above, the first tuple is ('imp',
SimpleImputer(strategy="mean"), ['A']) .
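The result discussed next presumably comes from a call like the following:
print(ct.fit_transform(data))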
As you can see, this result is different from the result returned by the
pl pipeline in the previous section. While pl passed the result of one
method call as input to the next and returned an array with one
column, ct performs the two method calls independently and returns
an array with two columns.
Pipelines and column transformers are both helpful in simplifying our
workflow. Which one to use depends on the specifics of the task we
want to perform.
One feature of a column transformer is that it allows us to specify the
column(s) that we want a transformer to work on. Therefore, we can
use it to apply different transformations to different columns. Let’s
use a dataset with three columns to demonstrate this.
data = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B':['Apple', 'Orange',
'Apple', 'Banana', 'Apple'], 'C':[11, 12, 13, 14, 15]})
ct2 = ColumnTransformer([
('encode', OrdinalEncoder(), ['B']),
('normalize', MinMaxScaler(), ['A'])
], remainder='passthrough')
print(ct2.fit_transform(data))
The order of the columns in this array follows the order of the
columns in the transformers list. Columns that are not transformed
are added to the right of the transformed columns.
The first column above is the result of the “encode” transformation
for column B , while the second is the result of the “normalize”
transformation for column A . The last column is from column C ,
which is added to the result because we specify
remainder='passthrough' when instantiating our
ColumnTransformer object. We’ll have a chance to use column
transformers in our project later.
as the output. We get this result because four out of the six labels in
the true list are predicted correctly.
Precision and Recall
Classification accuracy is very easy to calculate and interpret.
However, one issue with this score is that it can be misleading if we
have imbalanced data.
Suppose our dataset consists of 95 dogs and 5 cats. A model that
blindly classifies all animals as dogs has an accuracy of 0.95.
However, this model did not achieve such accuracy by detecting any
meaningful pattern in the dataset. As such, it is unlikely to perform
well on a new dataset, especially one that contains more cats.
Besides looking at classification accuracy, we can look at precision
and recall . To understand precision and recall, we need to discuss
the confusion matrix. Referring to the two lists in the previous
example:
true = ['Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog']
pred = ['Cat', 'Cat', 'Cat', 'Dog', 'Cat', 'Cat']
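The computation of precision and recall is not shown above; a sketch using sklearn.metrics, with 'Cat' assumed to be the positive class:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(true, pred, pos_label='Cat')
recall = recall_score(true, pred, pos_label='Cat')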
print(precision)
print(recall)
Here, the true values are 2.5, 1.6, 5.1, and 6.8, with a mean of 4. To
calculate the R2 score, we do the following:
Find the sum of the squared differences between the true values and
their mean (which equals 4)
differences between true values and mean = [-1.5, -2.4, 1.1, 2.8]
squared differences = [2.25, 5.76, 1.21, 7.84]
sum of square differences = 2.25 + 5.76 + 1.21 + 7.84 = 17.06
Find the sum of the squared differences between the true and
predicted values
differences between true values and predicted values = [0.4, 0.2,
-0.5, -1.1]
squared differences = [0.16, 0.04, 0.25, 1.21]
sum of squared differences = 0.16 + 0.04 + 0.25 + 1.21 = 1.66
R2 = 1 - (1.66/17.06) = 0.902696
If you find the calculations above daunting, do not worry. What is
important is to understand that, in general, the higher the R2 score,
the better. The best possible R2 score is 1, which happens when the
sum of squared differences between the true and predicted values is
zero. It is also possible for R2 to be negative. This occurs when the
model performs worse than the mean of the data.
The calculations above can be done using the sklearn.metrics
module. This module comes with two functions -
mean_squared_error() and r2_score() - for calculating the RMSE
and R 2 score, respectively. Both functions require us to pass the
true values first, followed by the predicted values:
from sklearn.metrics import r2_score, mean_squared_error
pred = [2.1, 1.4, 5.6, 7.9]
true = [2.5, 1.6, 5.1, 6.8]
RMSE = mean_squared_error(true, pred, squared=False)
r2 = r2_score(true, pred)
print(RMSE)
print(r2)
The function randomly splits the two arrays and returns four arrays. If
we do an 80-20 split, the function may return the following arrays:
Training Subset
X = [1, 2, 4, 5, 6, 7, 9, 10]
y = [23, 11, 45, 12, 65, 43, 13, 12]
Test Subset
X = [3, 8]
y = [31, 69]
Fold 2
X = [2, 6, 8], y = [11, 65, 69]
Fold 3
X = [3, 4, 7], y = [31, 45, 43]
y = a0 + a1x1 + a2x2 + a3x3 + … + anxn
where y is the target variable, and x 1 , x 2 , …, x n are the predictor
variables or features.
Linear regression that only involves one feature is known as simple
linear regression, while that with multiple features is known as
multiple linear regression.
This gives us the following output, which indicates that there are no
missing values:
Floor Area (sqft) 0
Value ($1000) 0
dtype: int64
Here, we first select the feature and target variable in our dataset
and assign them to X and y , respectively. Most sklearn methods
expect features to be passed as a 2D array. Therefore, we select the
feature as a DataFrame using two sets of square brackets.
Next, we pass X and y to the train_test_split() function. To
specify the amount of data to use for testing, we use the test_size
parameter. For this example, we specify test_size as 0.2, which
means we want to use 20% of our data for the test set.
Finally, we specify the random_state parameter. This parameter
controls the shuffling applied to the data before the split. The number
42 has no special meaning; you can choose another integer if you
desire. However, if you want to produce the same output shown in
this chapter, you need to use the same number 42. Else, your data
will be shuffled differently, resulting in a different split and different
outputs.
The train_test_split() function returns four arrays in the
following order:
- the training subset of the features
- the test subset of the features
- the training subset of the target variable, and
- the test subset of the target variable
We assign these four arrays to X_train , X_test , y_train , and
y_test , respectively.
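Putting the description above together, the call presumably looks like this:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=42)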
You may get slightly different results due to differences in how our
systems handle floating-point numbers.
A simple linear relationship can be represented by the equation
y = a0 + a1x1
intercept_ and coef_ give the values of a 0 and a 1 , respectively.
plt.scatter(X_train['Floor Area (sqft)'], y_train, s=10)
y_pred = lr.predict(X_train)
plt.plot(X_train['Floor Area (sqft)'], y_pred)
Here, we first plot a scatter plot for the training subset using the
scatter() function.
Next, we get the predicted values for X_train using the predict()
method and store the results in a variable called y_pred . We then
pass X_train['Floor Area (sqft)'] and y_pred to the plot()
function to plot the regression line. If you run the code above, you’ll
get the following output:
Figure 6.5: Plotting the regression line
Step 7: Evaluate the Model
After we get the parameters of our model, we need to evaluate the
model’s performance:
RMSE = mean_squared_error(y_train, y_pred, squared=False)
r2 = r2_score(y_train, y_pred)
print(RMSE)
print(r2)
11.426788012892116
0.8827389714759885
These scores are reasonably good, but not excellent. We can further
evaluate the model on the test set to see how well it generalizes to
unseen data.
To do that, we pass X_test to the predict() method.
Note that we should NOT pass X_test to the fit() method. If we do
that, lr will calculate the model parameters again using the test set,
which is not what we want. The code below shows how to evaluate
the model on the test set:
y_pred_test = lr.predict(X_test)
RMSE = mean_squared_error(y_test, y_pred_test, squared=False)
r2 = r2_score(y_test, y_pred_test)
print(RMSE)
print(r2)
which shows that our model generalizes well to unseen data. In fact,
the model performs slightly better on the test set than the training
set.
y = a0 + a1x + a2w
The more features we have, the harder it is to figure out the features
we need. Fortunately, sklearn provides a class called
PolynomialFeatures that can create these features for us. Let’s
learn to do that now.
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2)
X_train_transformed = poly_features.fit_transform(X_train)
print(X_train_transformed[0])
If you run the code above, you’ll get the following output:
70.91161404547069
[ 0. -0.18658297 0.00047343]
If you run the code above, you’ll get [53.85545111] as the output.
The discrepancy between this value and our calculated value of
53.7125 is due to rounding errors.
To visualize how well our model fits the data, let's plot the regression
curve.
# Plotting the scatter plot
plt.scatter(X_train['Floor Area (sqft)'], y_train, s=10)
6.6 Pipeline
In the previous section, we learned to train a polynomial regression
model.
To do that, we need to do two things. First, we need to create the
additional features using the PolynomialFeatures class. After
creating the features, we pass them to the linear regression
algorithm to train our model.
The following section shows how we can combine the two steps
using a pipeline.
To do that, we use the code below:
from sklearn.pipeline import Pipeline
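The pipeline itself is not reproduced here; a sketch of what it presumably looks like (the step names are assumptions, and the prediction call whose output is shown below is omitted):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('lr', LinearRegression())])
pipeline.fit(X_train, y_train)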
If you run the code above, you’ll get the following output:
[53.85545111]
8.522835594847725
0.9347660644643525
6.425083465019807
0.9706560666441122
6.7 Cross-Validation
Besides simplifying our workflow, another advantage of using a
pipeline is that it helps prevent data leakage when we do
cross-validation. Data leakage occurs when information from outside the
training set (such as the test set) is used to train the model.
When we discussed data preprocessing previously, we mentioned
that we should only call the transform() method on the test set and
not the fit() or fit_transform() method. This ensures that data
from the test set is not used to train the model, which is the correct
approach.
The same applies when we do cross-validation.
If we need to do any data preprocessing on our dataset, we should
call the fit() or fit_transform() method on the k-1 training folds
and only call the transform() or predict() method on the
validation fold.
Doing so manually is tedious. Fortunately, we can use a pipeline.
If we use a pipeline for cross-validation, sklearn applies the
fit_transform() and fit() methods on the training folds. For the
validation fold, it only applies the transform() and predict()
methods.
The code below shows how we can use a pipeline for cross-
validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=3,
scoring="neg_root_mean_squared_error")
neg_rmse = scores.mean()
-neg_rmse
If you run the code above, you’ll get the following output:
8.615541039554307
Glucose 0
BMI 0
Outcome 0
dtype: int64
Now, we need to split the dataset into training and test sets:
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
Here, we first use iloc to select all but the last column in df and use
to_numpy() to convert the resulting DataFrame to a NumPy array; we
then assign the array to X . Next, we use iloc again to select the last
column, convert it to a NumPy array, and assign the array to y .
(Refer to Chapter 3 Section 3.8.3 if you need help with the code
above.)
Most methods in sklearn accept both pandas DataFrames and
NumPy arrays as input. In this example, we show how to use NumPy
arrays, which are typically faster than DataFrames.
After selecting our features and target variable, we pass X and y to
the train_test_split() method to split our dataset into training
and test subsets. Now, we can train our decision tree using the
fit() method:
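The training code is not reproduced here; it presumably looks something like this (the random_state value is an assumption):
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)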
To plot the tree, we pass the tree classifier ( clf ) and two
parameters - feature_names and class_names - to the function.
feature_names stores the names of the features in the dataset and
is passed as a list of strings, sorted based on the order of the
columns in the dataset.
class_names is also passed as a list of strings. These names
correspond to the classes in the dataset, sorted in ascending
numerical order.
Our dataset has two classes - 0 and 1 , with 0 indicating the absence
of diabetes and 1 indicating otherwise. Therefore, we use the names
'No' and 'Yes' for the two classes, respectively, and pass
class_names=['No', 'Yes'] to the plot_tree() function.
If you run the code above, you’ll get the following output. Do not
worry if you can’t read the tree; we will not be using it:
clf.set_params(max_depth = 3)
Figure 7.5: Decision tree for the diabetes dataset with three
levels
We can now use the predict() method to predict the classes of new
instances:
clf.predict([[90, 20], [200, 30]])
Here, we pass two instances - [90, 20] and [200, 30] - to the
predict() method. The first instance (Glucose = 90, BMI = 20)
satisfies the left-most branch of the tree, while the second (Glucose
= 200, BMI = 30) satisfies the right-most.
If you run the code above, you’ll get array([0, 1]) as the output.
To see how well our decision tree performs, let’s do a cross-
validation on the training set:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_train, y_train, cv=5,
scoring="accuracy")
accuracy = scores.mean()
accuracy
This gives us 0.725 as the output, which is slightly better than the
accuracy for the decision tree model.
In the next section, we’ll see if we can improve this score using
another classification algorithm - the Support Vector Machine
algorithm.
Non-Linear Kernel
In the example above, our data was linearly separable (i.e., the
points can be separated by a straight line). If the data is more
convoluted and a straight line cannot separate the two classes
well, we need to use a non-linear SVM model.
Recall that to solve a similar problem of non-linearity in regression,
we went with polynomial regression instead of linear regression. The
idea is the same here. To separate data that is not linearly
separable, we can map the data to a higher-dimensional space.
Figure 7.8: SVM for non-linearly separable data
In the figure above, the dataset on top (represented by x ) is not
separable by a straight line. However, we can map it to a 2D space by
adding a new dimension x 2 to it. For instance, the first data point x =
-3 can be mapped to x = -3 and x 2 = 9 . When we do that, the
dataset becomes linearly separable.
To map our dataset to a higher dimension, we used the
PolynomialFeatures class in Chapter 6 for polynomial regression.
Unfortunately, this approach tends to be computationally expensive
and infeasible for the SVM algorithm.
For SVM, we need to use what is known as a kernel trick. Without
going into the mathematics behind it, what we need to know is that a
kernel trick provides us with an efficient and less expensive way to
transform data into higher dimensions.
To use the kernel trick, we pass a kernel hyperparameter to the
SVM model when instantiating it. The kernels available in sklearn are
'linear' , 'poly' , 'rbf' , 'sigmoid' , and 'precomputed' ; the
default is 'rbf' .
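For instance, a sketch of instantiating an SVM classifier with a polynomial kernel:
from sklearn.svm import SVC

svm_clf = SVC(kernel='poly')   # pass 'linear', 'rbf', etc. to use other kernels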
We’ll demonstrate how to use this hyperparameter when we train an
SVM model later.
One-vs-Rest (OvR)
The first is the One-vs-Rest method. Suppose we need to classify
our dataset into three classes: 1, 2, and 3. The OvR method splits the
problem into the following binary classification problems:
- Class-1 vs. Not-Class-1 (i.e., Class-2 and Class-3 are combined
into Not-Class-1)
- Class-2 vs. Not-Class-2
- Class-3 vs. Not-Class-3
A binary classifier is trained on each problem, and a decision score
is generated for each classifier. The output of the classifier with the
highest score is selected. For instance, suppose the first classifier
predicts Class-1 with a score of 0.15, the second predicts Class-2
with a score of 0.21, and the third predicts Class-3 with a score of
0.98, the class with the highest score (i.e., Class-3) is selected.
One-vs-One (OvO)
Next, we have the One-vs-One method. This method involves
training a binary classifier for each pair of classes and selecting the
class with the most votes. With reference to the example above, the
classifiers would classify:
- Class-1 vs. Class-2
- Class-1 vs. Class-3
- Class-2 vs. Class-3
If the first classifier predicts that an instance belongs to Class-2, the
second predicts it belongs to Class-1, and the third predicts Class-2,
our final prediction would be Class-2 (majority wins). In the event of
a tie, the OvO method uses a decision function based on the
confidence levels of the underlying binary classifiers to make a
prediction.
sklearn provides two classes - OneVsOneClassifier and
OneVsRestClassifier - for the OvO and OvR methods, respectively.
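A minimal sketch of wrapping a binary classifier in one of these classes (the variable names are assumptions):
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr_clf = OneVsRestClassifier(SVC())
ovr_clf.fit(X_train, y_train)
print(ovr_clf.predict(X_test))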
After creating the pipeline, we can use its fit() method to train our
model:
pipeline.fit(X_train, y_train)
If you run the code above, you’ll get array([0, 1]) as the output.
Next, let’s evaluate the model on the training subset using cross-
validation:
scores = cross_val_score(pipeline, X_train, y_train, cv=5,
scoring="accuracy")
accuracy = scores.mean()
accuracy
This gives us 0.745 as the output, which is slightly better than both
the random forest and decision tree models. Since the SVM model
performs the best, let’s select this model and test how well it
performs on the test set:
from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy
Here, we get 0.74 as the output, which is very close to the accuracy
we got on the training set. This score indicates that our SVM model
generalizes well to the test set.
Chapter 8 - Clustering
In the last two chapters, we dealt with datasets in which the label of
each instance is provided.
For instance, the housing.csv dataset includes the price of each
house, and the diabetes.csv dataset includes the diabetic status of
each person.
We mentioned earlier that learning from datasets with labels is
known as supervised learning. In this chapter, let’s move on to
unsupervised learning.
In unsupervised learning, we only have access to the features of an
instance and not its label. Therefore, the goal of unsupervised
learning is not to predict the labels. Instead, the goal is to discover
hidden patterns in the dataset.
Based on the distances shown above, we’ll assign (2, 2), (1,1), and
(0, 0) to C1, and (3, 2) and (3, 1) to C2. Now, we need to calculate
the mean of the data points in each cluster and use that to update
the centroids.
The centroid of C1 changes to (1, 1), while that of C2 stays the
same. Let’s calculate the distances of our data points from the new
centroids and reassign them.
Most of the points remain in the same cluster except for (2, 2). We
need to recalculate the mean of the data points in the two clusters
and update the centroids.
We repeat the process of updating the centroids and reassigning
data points until the clusters no longer change. When that happens,
we say that the algorithm has converged and the algorithm ends.
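Letting sklearn run this procedure on the same five points typically produces the same grouping; a minimal sketch (sklearn picks its own starting centroids, so the intermediate steps differ from the walkthrough above):
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 2], [1, 1], [0, 0], [3, 2], [3, 1]])
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(points)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # final centroids after convergence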
df = pd.read_csv('clustering.csv')
print(df.isnull().sum())
df.head()
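The loop that trains the models is not shown above; based on the description that follows, it presumably looks like this (the random_state value is an assumption):
from sklearn.cluster import KMeans

inertias = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(df)
    inertias.append(kmeans.inertia_)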
plt.figure(figsize=(12, 4))
plt.grid()
plt.plot(range(2, 20), inertias)
plt.xticks(range(0, 20))
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
After instantiating the KMeans object, we use it to train our model and
append the resulting inertia (stored in the inertia_ attribute) to the
inertias list.
After training all the models, the for loop ends. Outside the loop, we
use the plot() method to plot the inertias on the y-axis against the
number of clusters on the x-axis.
If you run the code above, you’ll get the following elbow plot.
Here, we use df.loc[[0]] to select the first row in df and pass this
array to the predict() method. We also pass a new point (-2, 10) as
a 2D array to the method. If you run the code above, you’ll get the
following output:
[1]
[0]
The first row in df is assigned to cluster 1, while the point (-2, 10) is
assigned to cluster 0.
If you want to get the clusters of all the points in df , you can use the
labels_ attribute:
df['Cluster'] = kmeans.labels_
df.head()
plt.scatter(kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1], color = 'black', s = 50, marker =
"s")
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend()
Here, we first specify the size of our chart and declare a list called
markers to store the markers for our clusters. Next, we use a for
loop (that runs 3 times) to plot the clusters.
The first time the loop runs, i equals 0 . The condition
df['Cluster'] == i becomes df['Cluster'] == 0 .
This plots all the data points in df with a Cluster value of 0 , using
'x' as the marker.
After plotting the points for the first cluster, the for loop runs two
more times to plot the other two clusters, changing the label and
marker for each plot.
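The loop being described is not shown above; a sketch consistent with the description (the figure size and the marker list are assumptions, matching the markers mentioned below):
plt.figure(figsize=(12, 8))
markers = ['x', '.', '+']
for i in range(3):
    cluster = df[df['Cluster'] == i]
    plt.scatter(cluster['X1'], cluster['X2'],
                marker=markers[i], label='Cluster ' + str(i))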
After plotting all three clusters, the for loop ends. Outside the for
loop, we use the cluster_centers_ attribute to plot the centroids of
our clusters.
cluster_centers_ stores the centroids as a NumPy array.
cluster_centers_[:, 0] gives us the first column of this array (i.e.,
the x-values for the centroids), while cluster_centers_[:, 1] gives
us the second (i.e., the y-values).
If you run the code above, you’ll get the following chart:
Figure 8.7: Scatter plot showing the different clusters and
centroids
Clusters 0, 1, and 2 are plotted using 'x' , '.' , and '+' ,
respectively, while the centroids are plotted using squares.
Chapter 9 - Advanced Topics in Machine
Learning
We’ve covered a lot in the previous chapters.
Before we proceed to the projects in the following three chapters,
let’s discuss a few advanced topics in machine learning. Specifically,
we’ll be discussing dimensionality reduction, overfitting and
underfitting, and hyperparameter tuning in this chapter.
We’ll start with dimensionality reduction.
This list consists of two dictionaries. For the first dictionary, the
possible combinations are
C = 1, kernel = 'linear'
C = 2, kernel = 'linear'
C = 3, kernel = 'linear'
print(cancer.feature_names)
print(cancer.target_names)
Part of the output for the statements above are shown below:
['mean radius' 'mean texture' ... 'worst fractal dimension']
['malignant' 'benign']
The output above tells us that for the first instance in the dataset,
mean radius = 1.799e+01 ,
mean texture = 1.038e+01 , and so on. In addition, the class for
this instance is 0 , which corresponds to the 'malignant' label.
After downloading the dataset, we are ready to apply PCA. Before
we do that, let’s split the dataset into training and test subsets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
Now, we need to create two pipelines. The first pipeline does not
include the dimensionality reduction step, while the second does.
This allows us to compare the performance of the original dataset
with the reduced dataset.
To create a pipeline without PCA, we use the code below:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
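The pipeline itself is not reproduced here; presumably something like the following (the step names are chosen to match the parameter names shown later, and the random_state value is an assumption):
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('SVM', SVC(random_state=0))])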
To perform PCA in sklearn, we need to import the PCA class from the
sklearn.decomposition module. We use the n_components
parameter to specify the number of components or percentage of
variance we wish to retain.
Here, we specify the number of components as 5. Alternatively, we
can specify the percentage of variance. To do that, we assign a
value between 0 and 1 to n_components . For instance, if we wish to
retain 95% of the variance, we pass n_components = 0.95 to the
constructor.
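Based on the description that follows, the second pipeline presumably looks like this (the step names and random_state value are assumptions):
from sklearn.decomposition import PCA

pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('SVM', SVC(random_state=0))])
pca_pipeline.fit(X_train, y_train)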
After creating the pipeline, we assign it to a variable called
pca_pipeline and use its fit() method to train our model. Let’s
evaluate the performance of the model now:
scores = cross_val_score(pca_pipeline, X_train, y_train, cv=5,
scoring="accuracy")
print(scores.mean())
If you run the code above, you’ll get the following output:
[0.43430767 0.19740115 0.09351771 0.06677661 0.05642452]
This output tells us that the first PCA component explains 43.4% of
the variance in the feature set, the second explains 19.7%, and so
on.
If you run the code above, you’ll get the following output:
Best Score = 0.9714767050075519
Best Parameters = {'SVM__C': 1, 'SVM__kernel': 'linear',
'pca__n_components': 0.95}
Answer
1 and 2, respectively
Task 4 - Remove invalid rows
Since there are only 3 such rows, let’s drop them from our
dataset. Use boolean arrays to select rows with non-negative values
(i.e., values greater than or equal to 0) for both columns and assign
the resulting DataFrame back to df . Verify that there are 497 rows
left using the shape attribute.
Expected Output
(497, 7)
Next, select all the features in df (i.e., all columns except Rating ) as
a DataFrame and assign it to X . In addition, select the Rating
column as a Series and assign it to y .
Now, let’s split X and y into training and test subsets using the
train_test_split() function. Allocate 20% for the test set, use a
random_state of 0 , and assign the resulting arrays to X_train ,
X_test , y_train , and y_test .
Last but not least, concatenate tf_num and tf_cat (which are both
NumPy arrays) along axis 1 and assign the final array to
X_train_transformed .
Verify that you have done the steps above correctly by printing the
value of X_train_transformed[0] .
Expected Output
[-1.14922 1.1215984 -1.13561632 0.0732169 0.01224327 1. 0. 0. 0.]
Next, compute the RMSE and R2 score metrics for the training set
using these functions and print their values.
Hint: You need to get the predicted values for X_train_transformed
first.
Expected Output
0.313351772053805
0.8579549928483164
After completing Task 10, the training and evaluation of our model is
complete.
The following tasks involve using the ColumnTransformer and
Pipeline classes to simplify our workflow. We’ll get the same model.
The main purpose is to illustrate how to use the two classes.
Task 11 - Use a column transformer and pipeline to simplify our
workflow
First, let’s import the ColumnTransformer and Pipeline classes from
the sklearn.compose and sklearn.pipeline modules, respectively.
Next, instantiate a Pipeline object and assign it to a variable called
num_preprocessing .
You can refer to Chapter 5, Section 5.3.2 for two examples of using a
ColumnTransformer object. In that section, we used built-in
transformers (such as a SimpleImputer object) for the
transformations. Alternatively, we can use a pipeline, such as the
num_preprocessing pipeline created above.
Each number in this array gives us the pixel’s intensity and ranges
from 0 (black) to 15 (white). For instance, the first number 0 tells us
that the first pixel for this row is black, while the fourth number 13
tells us that the fourth pixel is almost white.
Our job is to build a model to predict the class of an instance. There
are a total of 10 classes, one for each digit from 0 to 9.
Task 1 - Import the libraries and load the dataset
First, let’s import NumPy, pandas, and pyplot using the np , pd , and
plt aliases and import the load_digits() function from
sklearn.datasets .
Answer
40 components
Task 8 - Evaluate the best estimator on the test set
Last but not least, let’s evaluate the performance of the best
estimator on the test set. Import the accuracy_score() function from
sklearn.metrics and use it to determine how well our model
generalizes to the test set.
Print the value of the resulting accuracy score.
Expected Output
0.9833333333333333
This score is very similar to the score we got on the training set in
Step 6. Hence, our model generalizes well to the test set.
With that, Project 2 is complete, and we are ready to move on to the
last project.
Project 3 - Clustering
This last project uses the same dataset as Project 2 and is also
concerned with classifying instances in the dataset. However, it uses
a different technique to reduce the dimensionality of the data.
Task 1 - Import the libraries, load the dataset, and split it into
training and test subsets
First, let’s repeat what we did for tasks 1 and 4 in the previous
project. We need to import the necessary libraries and modules, load
the dataset, and split it into training and test subsets.
Task 2 - Elbow plot
Next, we need to train different KMeans(random_state = 0) models
to cluster the instances in X_train .
Use one of the following n_clusters values - 10, 15, 20, 25, 30, 35,
40, 45, 50 – to train each model and plot the inertias as an elbow
plot.
Hint: You can refer to Chapter 8, Section 8.3 for help.
Expected Output
Figure P3.1: Elbow plot for the digits dataset
Task 3 - Select number of clusters
Select the number of clusters based on the elbow plot above. We’ll
use 15 for the tasks below.
Task 4 - Get the distances of the first instance from the
centroids
Instantiate a KMeans object using random_state=0 and
n_clusters=15 and assign it to a variable called kmeans .
Expected Output
[40.27473174 48.89414703 47.03967829 52.34630829 55.2135072
43.87537153 43.99011546 44.39907444 54.96211714 56.20646651
27.03319993 52.79477229 50.82501222 49.00040622 43.86789918]
Task 5 - Verify the cluster of the first instance
Our k-means model has 15 centroids as there are 15 clusters. Each
element in the array above gives us the distance of X_train[0] to
one of the centroids.
This output shows that even though we only used 15 features for this
project, the best estimator performs almost as well as the best
estimator in Project 2, which uses 40 features.
Task 8 - Evaluate the best estimator on the test set
Last but not least, let’s evaluate the performance of our best
estimator on the test set using the accuracy_score() function in
sklearn.metrics and print the value of the resulting accuracy score.
Expected Output
0.9805555555555555
Here, we see that our model generalizes well to the test set. In fact,
it performs slightly better on the test set than the training set.
With that, we’ve come to the end of this project and also the end of
the book.
Congratulations! I sincerely hope you’ve found the book useful and
have enjoyed the learning process.
You can find the suggested solutions for the three projects in
Appendices A, B, and C. In addition, you can download the CSV
files, completed notebooks, and images shown in this book at
https://www.learncodingfast.com/machine-learning .
Thank you once again for your support!
Appendix A - Suggested Solution for Project
1
In the suggested solution below, code from different cells within the
same task are separated by a line with two hyphens (--).
Task 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('reviews.csv')
df.head()
Task 2
df.isnull().sum()
--
df.hist(figsize=(10, 10))
--
df.describe()
Task 3
cond1 = df['Price'] < 0
df[cond1]
--
cond2 = df['Days since Last Update'] < 0
df[cond2]
Task 4
cond1 = df['Price'] >= 0
cond2 = df['Days since Last Update'] >= 0
df = df[cond1 & cond2]
df.shape
Task 5
from sklearn.model_selection import train_test_split
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size
= 0.2, random_state = 0)
Task 6
X_train.corr()
Task 7
columns = ['Category', 'No Of Reviews', 'No Of Installs', 'Size',
'Price', 'Days since Last Update']
i = 0
Task 8
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
imp = SimpleImputer(strategy='mean')
tf_num = imp.fit_transform(X_train[num_col])
scaler = StandardScaler()
tf_num = scaler.fit_transform(tf_num)
Task 9
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_transformed, y_train)
print(model.coef_)
print(model.intercept_)
Task 10
from sklearn.metrics import mean_squared_error, r2_score
y_train_pred = model.predict(X_train_transformed)
rmse = mean_squared_error(y_train, y_train_pred, squared=False)
r2 = r2_score(y_train, y_train_pred)
print(rmse)
print(r2)
--
test_tf_num = imp.transform(X_test[num_col])
test_tf_num = scaler.transform(test_tf_num)
test_tf_cat = ohe.transform(X_test[cat_col])
X_test_transformed = np.concatenate((test_tf_num, test_tf_cat),
axis=1)
y_test_pred = model.predict(X_test_transformed)
Task 11
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
num_preprocessing = Pipeline(
[('imp', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())])
full_preprocessing = ColumnTransformer(
[('num', num_preprocessing, num_col),
('cat', OneHotEncoder(sparse=False, drop='first'), cat_col)])
final_pipeline = Pipeline(
[('pre', full_preprocessing),
('model', LinearRegression())])
Task 12
final_pipeline.fit(X_train, y_train)
Task 13
y_train_pred = final_pipeline.predict(X_train)
rmse = mean_squared_error(y_train, y_train_pred, squared=False)
r2 = r2_score(y_train, y_train_pred)
print(rmse)
print(r2)
--
y_test_pred = final_pipeline.predict(X_test)
rmse = mean_squared_error(y_test, y_test_pred, squared=False)
r2 = r2_score(y_test, y_test_pred)
print(rmse)
print(r2)
Task 14
from sklearn.model_selection import cross_val_score
Appendix B - Suggested Solution for Project 2
In the suggested solution below, code from different cells within the
same task are separated by a line with two hyphens (--).
Task 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
Task 2
some_digit = X[0].reshape((8, 8))
plt.imshow(some_digit, cmap='gray')
Task 3
print(y[0])
Task 4
from sklearn.model_selection import train_test_split
Task 5
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
pipeline = Pipeline(
[('scaler', StandardScaler()),
('pca', PCA(n_components = 0.95)),
('svm', SVC(random_state = 0))])
Task 6
from sklearn.model_selection import GridSearchCV
print(grid_search.best_score_)
print(grid_search.best_params_)
Task 7
best_model = grid_search.best_estimator_
print(best_model.named_steps['pca'].explained_variance_ratio_)
--
print(best_model.named_steps['pca'].n_components_)
Task 8
from sklearn.metrics import accuracy_score
y_test_pred = best_model.predict(X_test)
print(accuracy_score(y_test, y_test_pred))
Appendix C - Suggested Solution for Project
3
In the suggested solution below, code from different cells within the
same task are separated by a line with two hyphens (--).
Task 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
--
from sklearn.model_selection import train_test_split
Task 2
from sklearn.cluster import KMeans
inertias = []
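(The rest of this cell is not shown; a sketch consistent with the task description:)
for k in range(10, 51, 5):
    kmeans = KMeans(random_state=0, n_clusters=k)
    kmeans.fit(X_train)
    inertias.append(kmeans.inertia_)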
Task 5
X_train0_2D = X_train[0].reshape((1, -1))
print(kmeans.predict(X_train0_2D))
Task 6
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline(
[('cluster', KMeans(random_state=0, n_clusters=15)),
('scaler', StandardScaler()),
('svm', SVC(random_state=0))])
Task 7
from sklearn.model_selection import GridSearchCV
print(grid_search.best_score_)
print(grid_search.best_params_)
Task 8
from sklearn.metrics import accuracy_score
best_model = grid_search.best_estimator_
y_test_pred = best_model.predict(X_test)
print(accuracy_score(y_test, y_test_pred))