Machine Learning
Contents:
ML getting started
ML mean median mode
ML standard deviation
ML percentiles
ML data distribution
ML normal data distribution
ML scatter plot
ML linear regression
ML polynomial regression
ML multiple regression
ML scale
ML train/test
ML decision tree
ML confusion matrix
ML hierarchical clustering
ML getting started
Where To Start?
In this tutorial we will go back to mathematics and study
statistics, and how to calculate important numbers based on
data sets.
Data Set
In the mind of a computer, a data set is any collection of data. It
can be anything from an array to a complete database.
Example of an array:
[99,86,87,88,111,86,103,87,94,78,77,85,86]
What if we could predict whether a car had an AutoPass, just by
looking at the other values?
Data Types
To analyze data, it is important to know what type of data we
are dealing with.
Numerical
Categorical
Ordinal
Numerical data are numbers, and can be split into two
numerical categories:
Discrete Data
numbers that are limited to integers. Example: The
number of cars passing by.
Continuous Data
numbers that can take any value within a range. Example:
The price of an item, or the size of an item.
By knowing the data type of your data source, you will be able
to know what technique to use when analyzing them.
You will learn more about statistics and analyzing data in the
next chapters.
ML Mean Median Mode
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Mean
The mean value is the average value.
To calculate the mean, find the sum of all values, and divide the
sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
Median
The median value is the value in the middle, after you have
sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
It is important that the numbers are sorted before you can find
the median.
The NumPy module has a method for this.
Use the NumPy median() method to find the middle value:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
If there are two numbers in the middle, divide the sum of those
numbers by two.
77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103
Here the two middle numbers are 86 and 87, so the median is
(86 + 87) / 2 = 86.5.
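For example, numpy.median() applied to the twelve values above returns the mean of the two middle numbers, 86 and 87:

```python
import numpy

# Even number of values: the median is the mean of the two middle values
speed = [77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103]
print(numpy.median(speed))  # 86.5
```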
Mode
The Mode value is the value that appears the most number of
times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86
The SciPy module has a method for this. Learn about the SciPy
module in our SciPy Tutorial.
Use the SciPy mode() method to find the number that appears
the most:
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
ML Standard Deviation
Standard deviation is a number that describes how spread out
the values are. A low standard deviation means that most of
the numbers are close to the mean (average) value.
speed = [86,87,88,86,87,85,86]
The standard deviation is:
0.9
Meaning that most of the values are within the range of 0.9
from the mean value, which is 86.4.
Let us do the same with a selection of numbers with a wider
range:
speed = [32,111,138,28,59,77,97]
The standard deviation is:
37.85
Meaning that most of the values are within the range of 37.85
from the mean value, which is 77.4.
As you can see, a higher standard deviation indicates that the
values are spread out over a wider range.
The NumPy module has a method to calculate the standard
deviation:
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
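The same method applied to the wider-range list from above confirms the 37.85 figure:

```python
import numpy

# Standard deviation of the wider-range data set
speed = [32, 111, 138, 28, 59, 77, 97]
x = numpy.std(speed)
print(x)  # approximately 37.85
```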
ML Percentiles
Percentiles are used in statistics to give you a number that
describes the value that a given percent of the values are
lower than.
Example: Let us say we have an array of the ages of all the
people that live in a street:
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,36,27,61,31]
What is the 75th percentile? The answer is 48, meaning that 75%
of the people are 48 or younger.
Use the NumPy percentile() method to find the percentiles:
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
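The same method works for any percentile; for example, the age that 90% of the people in the street are younger than:

```python
import numpy

# 90th percentile of the same ages array
ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 36, 27, 61, 31]
x = numpy.percentile(ages, 90)
print(x)
```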
ML data distribution
Data Distribution
Earlier in this tutorial we have worked with very small amounts
of data in our examples, just to understand the different
concepts.
In the real world, the data sets are much bigger, but it can be
difficult to gather real world data, at least at an early stage of a
project.
To create big data sets for testing, we can use the Python
module NumPy, which comes with a number of methods to
create random data sets of any size.
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Histogram
To visualize the data set we can draw a histogram with the data
we collected.
We will use the Python module Matplotlib to draw a histogram.
Learn about the Matplotlib module in our Matplotlib Tutorial.
Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
Histogram Explained
We use the array from the example above to draw a histogram
with 5 bars.
The first bar represents how many values in the array are
between 0 and 1.
The second bar represents how many values are between 1 and
2. And so on.
Since the values are drawn uniformly between 0 and 5, the five
bars end up with roughly equal heights (about 50 values each).
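A sketch of the same idea without plotting: numpy.histogram() returns the per-bar counts directly, so you can inspect how many values fall into each of the five bins:

```python
import numpy

# Count how many of 250 uniform values fall into each of 5 bins
x = numpy.random.uniform(0.0, 5.0, 250)
counts, edges = numpy.histogram(x, 5)
print(counts)  # five counts, roughly 50 each
print(edges)   # six bin edges spanning the data range
```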
ML normal data distribution
Normal Data Distribution
In probability theory this kind of data distribution is known as
the normal data distribution, or the Gaussian data distribution,
after the mathematician Carl Friedrich Gauss who came up with
the formula of this data distribution.
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
Histogram Explained
We use the array from the numpy.random.normal() method,
with 100000 values, to draw a histogram with 100 bars. We
specify that the mean value is 5.0 and the standard deviation
is 1.0, so the values concentrate around 5.0 and are rarely
further than 1.0 from the mean.
ML Scatter Plot
Scatter Plot
A scatter plot is a diagram where each value in the data set is
represented by a dot.
The x array represents the age of each car. The y array
represents the speed of each car.
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
What we can read from the diagram is that the two fastest cars
were both 2 years old, and the slowest car was 12 years old.
Note: It seems that the newer the car, the faster it drives, but
that could be a coincidence, after all we only registered 13 cars.
Random Data Distributions
In Machine Learning the data sets can contain thousands-, or
even millions, of values.
You might not have real world data when you are testing an
algorithm, you might have to use randomly generated values.
The first array will have the mean set to 5.0 with a
standard deviation of 1.0.
The second array will have the mean set to 10.0 with a
standard deviation of 2.0:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
ML linear regression
Regression
The term regression is used when you try to find the
relationship between variables.
Linear Regression
Linear regression uses the relationship between the data-points
to draw a straight line through all them.
This line can be used to predict future values.
How Does it Work?
Python has methods for finding a relationship between data-
points and to draw a line of linear regression. We will show you
how to use these methods instead of going through the
mathematical formula.
In the example below, the x-axis represents age, and the y-axis
represents speed. We have registered the age and speed of 13
cars as they were passing a tollbooth. Let us see if the data we
collected could be used in a linear regression:
Import scipy and draw the line of Linear Regression:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
Example Explained
Import the modules you need:
import matplotlib.pyplot as plt
from scipy import stats
Create the arrays that represent the values of the x and y axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Execute a method that returns some important key values of
Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to
return a new value. This new value represents where on the
y-axis the corresponding x value will be placed:
def myfunc(x):
    return slope * x + intercept
Run each value of the x array through the function. This will
result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of linear regression:
plt.plot(x, mymodel)
Display the diagram:
plt.show()
R for Relationship
It is important to know how the relationship between the
values of the x-axis and the values of the y-axis is; if there is
no relationship, the linear regression can not be used to predict
anything.
This relationship, the coefficient of correlation, is called r. The
r value ranges from -1 to 1, where 0 means no relationship,
and 1 (and -1) means 100% related.
How well does my data fit in a linear regression?
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)
Predict Future Values
Now we can use the information we have gathered to predict
future values.
Example: Let us try to predict the speed of a 10 year old car:
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
speed = myfunc(10)
print(speed)
Bad Fit?
Let us create an example where linear regression would not be
the best method to predict future values.
These values for the x- and y-axis should result in a very bad fit
for linear regression:
import matplotlib.pyplot as plt
from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
And the r for relationship? You should get a very low r value:
from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)
ML Polynomial Regression
Polynomial Regression
If your data points clearly will not fit a linear regression (a
straight line through all data points), your data might be ideal
for polynomial regression.
The x-axis represents the hours of the day and the y-axis
represents the speed:
Start by drawing a scatter plot:
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
plt.scatter(x, y)
plt.show()
Then draw the line of Polynomial Regression:
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Example Explained
Import the modules you need.
import numpy
import matplotlib.pyplot as plt
Create the arrays that represent the values of the x and y axis:
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
NumPy has a method that lets us make a polynomial model:
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
Then specify how the line will display; we start at position 1,
and end at position 22:
myline = numpy.linspace(1, 22, 100)
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of polynomial regression:
plt.plot(myline, mymodel(myline))
Display the diagram:
plt.show()
R-Squared
It is important to know how well the relationship between the
values of the x- and y-axis is; if there is no relationship, the
polynomial regression can not be used to predict anything.
The relationship is measured with a value called the r-squared.
The r-squared value ranges from 0 to 1, where 0 means no
relationship and 1 means 100% related.
How well does my data fit in a polynomial regression?
import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))
Predict Future Values
Now we can use the information we have gathered to predict
future values.
Example: Let us try to predict the speed of a car that passes the
tollbooth at around the time 17:00:
import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
speed = mymodel(17)
print(speed)
Bad Fit?
Let us create an example where polynomial regression would
not be the best method to predict future values.
These values for the x- and y-axis should result in a very bad fit
for polynomial regression:
import numpy
import matplotlib.pyplot as plt
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(2, 95, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
And the r-squared value? You should get a very low r-squared
value:
import numpy
from sklearn.metrics import r2_score
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))
ML multiple regression
Multiple Regression
Multiple regression is like linear regression, but with more than
one independent value, meaning that we try to predict a value
based on two or more variables.
In Python we have modules that will do the work for us. Start
by importing the Pandas module:
import pandas
The Pandas module allows us to read csv files and return a
DataFrame object.
(The examples below assume a file, "data.csv", that lists the
weight, engine volume, and CO2 emission of a number of cars.)
df = pandas.read_csv("data.csv")
Then make a list of the independent values and call this
variable X. Put the dependent values in a variable called y:
X = df[['Weight', 'Volume']]
y = df['CO2']
From the sklearn module we will use the LinearRegression()
method to create a linear regression object. This object has a
method called fit() that takes the independent and dependent
values as parameters and fills the regression object with data
that describes the relationship:
regr = linear_model.LinearRegression()
regr.fit(X, y)
Now we have a regression object that is ready to predict CO2
values based on a car's weight and volume.
Predict the CO2 emission of a car where the weight is 2300kg,
and the volume is 1300cm3:
import pandas
from sklearn import linear_model
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
Coefficient
The coefficient is a factor that describes the relationship with
an unknown variable.
In this case, we can ask for the coefficient value of weight
against CO2, and for volume against CO2. The answer(s) we get
tells us what would happen if we increase, or decrease, one of
the independent values.
Print the coefficient values of the regression object:
import pandas
from sklearn import linear_model
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
Result:
[0.00755095 0.00780526]
Result Explained
The result array represents the coefficient values of weight and
volume.
Weight: 0.00755095
Volume: 0.00780526
These values tell us that if the weight increases by 1kg, the CO2
emission increases by 0.00755095g.
And if the engine size (Volume) increases by 1cm3, the CO2
emission increases by 0.00780526g.
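A quick sanity check of what the weight coefficient implies, using nothing but the coefficient value above: increasing the weight by 1000kg should raise the predicted CO2 by about 7.55g.

```python
# Coefficient of Weight from the regression result above
weight_coef = 0.00755095

# Predicted change in CO2 when weight increases by 1000kg
delta_co2 = 1000 * weight_coef
print(delta_co2)
```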
Example: Predict the CO2 emission of a car where the weight is
3300kg, and the volume is 1300cm3:
import pandas
from sklearn import linear_model
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedCO2 = regr.predict([[3300, 1300]])
print(predictedCO2)
Result:
[114.75968007]
We have predicted that a car with 1.3 liter engine, and a weight
of 3300 kg, will release approximately 115 grams of CO2 for
every kilometer it drives.
ML scale
Scale Features
When your data has different values, and even different
measurement units, it can be difficult to compare them. What
is kilograms compared to meters? Or altitude compared to
time?
Take a look at the table below: it is the same data set that we
used in the multiple regression chapter, but this time the
volume column contains values in liters instead of cm3 (1.0
instead of 1000).
There are several methods for scaling data. In this tutorial we
will use a method called standardization:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean
and s is the standard deviation.
If you take the weight column from the data set above, the
first value is 790, and the scaled value will be -2.1.
If you take the volume column from the data set above, the
first value is 1.0, and the scaled value will be:
(1.0 - 1.61) / 0.38 = -1.59
(the mean and standard deviation are rounded here; -1.59 is
computed from the unrounded values).
Now you can compare -2.1 with -1.59 instead of comparing 790
with 1.0.
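The formula can be checked on a small made-up column (the values here are illustrative, not the data.csv values): computing z by hand gives the same result as sklearn's StandardScaler, since the scaler applies exactly z = (x - u) / s with the population standard deviation.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Hypothetical column of three weights
col = numpy.array([[790.0], [1100.0], [929.0]])

# Manual standardization: z = (x - u) / s
u = col.mean()
s = col.std()  # population standard deviation, as StandardScaler uses
manual = (col - u) / s

# sklearn's StandardScaler does the same thing
scaled = StandardScaler().fit_transform(col)

print(numpy.allclose(manual, scaled))  # True
```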
You do not have to do this manually, the Python sklearn
module has a method called StandardScaler() which returns a
Scaler object with methods for transforming data sets.
Scale all values in the Weight and Volume columns:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
print(scaledX)
When the data set is scaled, you will have to use the scale when
you predict values:
Predict the CO2 emission from a 1.3 liter car that weighs 2300
kilograms:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
scaledX = scale.fit_transform(X)
regr = linear_model.LinearRegression()
regr.fit(scaledX, y)
scaled = scale.transform([[2300, 1.3]])
predictedCO2 = regr.predict([scaled[0]])
print(predictedCO2)
ML train/test
To measure if the model is good enough, we can use a method
called Train/Test.
What is Train/Test
Train/Test is a method to measure the accuracy of your model.
It is called Train/Test because you split the data set into two
sets: a training set and a testing set.
Our data set illustrates 100 customers in a shop, and their
shopping habits.
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
plt.scatter(x, y)
plt.show()
Split Into Train/Test
The training set should be a random selection of 80% of the
original data. The testing set should be the remaining 20%:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
Display the training set:
plt.scatter(train_x, train_y)
plt.show()
The training set resembles the original data set.
Display the testing set:
plt.scatter(test_x, test_y)
plt.show()
The testing set also looks like the original data set.
Fit the Data Set
What does the data set look like? In my opinion, the best fit
would be a polynomial regression, so let us draw a line of
polynomial regression.
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
myline = numpy.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
R2
Remember R2, also known as R-squared?
It measures the relationship between the x axis and the y axis,
and the value ranges from 0 to 1, where 0 means no
relationship and 1 means totally related.
The sklearn module has a method called r2_score() that will
help us find this relationship.
How well does my training data fit in a polynomial regression?
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
Now we want to test the model with the testing data as well, to
see if it gives us the same result.
Let us find the R2 score when using testing data:
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
Note: The result 0.809 shows that the model fits the testing set
as well, and we are confident that we can use the model to
predict future values.
Predict Values
Now that we have established that our model is OK, we can
start predicting new values.
How much money will a buying customer spend, if she or he
stays in the shop for 5 minutes?
print(mymodel(5))
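The snippet above assumes mymodel from the earlier examples is still defined. A self-contained version of the whole pipeline (same seed as above; a degree-4 polynomial fit is assumed here) would be:

```python
import numpy

# Recreate the data set (seed 2, as in this chapter)
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

# Fit a polynomial model to the training set
# (a degree-4 polynomial is assumed)
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# Predicted spending for a customer staying 5 minutes
print(mymodel(5))
```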
ML decision tree
Decision Tree
In this chapter we will show you how to make a "Decision
Tree". A Decision Tree is a Flow Chart, and can help you make
decisions based on previous experience.
Luckily our example person has registered every time there was
a comedy show in town, and registered some information
about the comedian, and also registered if he/she went or not.
Age Experience Rank Nationality Go
36 10 9 UK NO
42 12 4 USA NO
23 4 6 N NO
52 4 4 USA NO
43 21 8 USA YES
44 14 5 UK NO
66 3 7 N YES
35 14 9 UK YES
52 13 7 N YES
35 5 9 N YES
24 3 5 USA NO
18 3 7 UK YES
45 9 9 UK YES
Now, based on this data set, Python can create a decision tree
that can be used to decide if any new shows are worth
attending.
Read and print the data set:
import pandas
df = pandas.read_csv("data.csv")
print(df)
To make a decision tree, all data has to be numerical.
We have to convert the non numerical columns 'Nationality'
and 'Go' into numerical values.
Pandas has a map() method that takes a dictionary with
information on how to convert the values.
{'UK': 0, 'USA': 1, 'N': 2} means convert the values 'UK' to 0,
'USA' to 1, and 'N' to 2.
Change string values into numerical values:
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)
Then we have to separate the feature columns from the target
column. The feature columns are the columns that we try to
predict from, and the target column is the column with the
values we try to predict.
X is the feature columns, y is the target column:
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)
Now we can create the actual decision tree, fit it with our
details. Start by importing the modules we need:
Create and display a Decision Tree:
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
df = pandas.read_csv("data.csv")
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
tree.plot_tree(dtree, feature_names=features)
plt.show()
Result Explained
The decision tree uses your earlier decisions to calculate the
odds of you wanting to go see a comedian or not.
Rank
Rank <= 6.5 means that every comedian with a rank of 6.5
or lower will follow the True arrow (to the left), and the
rest will follow the False arrow (to the right).
gini = 0.497 refers to the quality of the split, and is always
a number between 0.0 and 0.5, where 0.0 would mean all
of the samples got the same result, and 0.5 would mean
that the split is done exactly in the middle.
samples = 13 means that there are 13 comedians left at
this point in the decision, which is all of them since this is
the first step.
value = [6, 7] means that of these 13 comedians, 6 will get
a "NO", and 7 will get a "GO".
Gini
There are many ways to split the samples; we use the GINI
method in this tutorial.
The Gini method uses this formula:
Gini = 1 - (x/n)² - (y/n)²
Where x is the number of positive answers ("GO"), n is the
number of samples, and y is the number of negative answers
("NO").
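The Gini formula, 1 - (x/n)² - (y/n)², can be checked by hand for the root node above (13 samples, value = [6, 7]); it reproduces the gini = 0.497 figure:

```python
# Root node: n = 13 samples, 7 "GO" (positive) and 6 "NO" (negative)
n, go, no = 13, 7, 6

gini = 1 - (go / n) ** 2 - (no / n) ** 2
print(round(gini, 3))  # 0.497
```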
The next step contains two boxes, one box for the comedians
with a 'Rank' of 6.5 or lower, and one box with the rest.
True - 5 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 5 means that there are 5 comedians left in this
branch (5 comedians with a Rank of 6.5 or lower).
value = [5, 0] means that 5 will get a "NO" and 0 will get a
"GO".
Nationality
Nationality <= 0.5 means that the comedians with a
nationality value of less than 0.5 will follow the arrow to
the left (which means everyone from the UK), and the
rest will follow the arrow to the right.
gini = 0.219 means that about 22% of the samples would
go in one direction.
samples = 8 means that there are 8 comedians left in this
branch (8 comedians with a Rank higher than 6.5).
value = [1, 7] means that of these 8 comedians, 1 will get a
"NO" and 7 will get a "GO".
Age
Age <= 35.5 means that comedians at the age of 35.5 or
younger will follow the arrow to the left, and the rest will
follow the arrow to the right.
gini = 0.375 means that about 37.5% of the samples would
go in one direction.
samples = 4 means that there are 4 comedians left in this
branch (4 comedians from the UK).
value = [1, 3] means that of these 4 comedians, 1 will get a
"NO" and 3 will get a "GO".
gini = 0.0 means all of the samples got the same result.
samples = 4 means that there are 4 comedians left in this
branch (4 comedians not from the UK).
value = [0, 4] means that of these 4 comedians, 0 will get a
"NO" and 4 will get a "GO".
True - 2 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 2 means that there are 2 comedians left in this
branch (2 comedians at the age 35.5 or younger).
value = [0, 2] means that of these 2 comedians, 0 will get a
"NO" and 2 will get a "GO".
Experience
Experience <= 9.5 means that comedians with 9.5 years of
experience, or less, will follow the arrow to the left, and
the rest will follow the arrow to the right.
gini = 0.5 means that 50% of the samples would go in one
direction.
samples = 2 means that there are 2 comedians left in this
branch (2 comedians older than 35.5).
value = [1, 1] means that of these 2 comedians, 1 will get a
"NO" and 1 will get a "GO".
True - 1 Comedian Ends Here:
gini = 0.0 means all of the samples got the same result.
samples = 1 means that there is 1 comedian left in this
branch (1 comedian with 9.5 years of experience or less).
value = [0, 1] means that 0 will get a "NO" and 1 will get a
"GO".
gini = 0.0 means all of the samples got the same result.
samples = 1 means that there is 1 comedian left in this
branch (1 comedian with more than 9.5 years of
experience).
value = [1, 0] means that 1 will get a "NO" and 0 will get a
"GO".
Predict Values
We can use the Decision Tree to predict new values.
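As a sketch, the comedian table from this chapter can be rebuilt inline (so the example runs without data.csv) and used to ask the tree about a new show. Since a fully grown decision tree fits its training data perfectly, its training accuracy is 1.0:

```python
import pandas
from sklearn.tree import DecisionTreeClassifier

# The comedian data set from this chapter, with Nationality and Go
# already converted to numbers (UK=0, USA=1, N=2; YES=1, NO=0)
df = pandas.DataFrame({
    'Age':         [36, 42, 23, 52, 43, 44, 66, 35, 52, 35, 24, 18, 45],
    'Experience':  [10, 12,  4,  4, 21, 14,  3, 14, 13,  5,  3,  3,  9],
    'Rank':        [ 9,  4,  6,  4,  8,  5,  7,  9,  7,  9,  5,  7,  9],
    'Nationality': [ 0,  1,  2,  1,  1,  0,  2,  0,  2,  2,  1,  0,  0],
    'Go':          [ 0,  0,  0,  0,  1,  0,  1,  1,  1,  1,  0,  1,  1],
})

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier().fit(X, y)

# Should I go see a 40 year old American comedian with
# 10 years of experience and a rank of 7? (1 = GO, 0 = NO)
print(dtree.predict([[40, 10, 7, 1]]))
```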
ML confusion matrix
What is a confusion matrix?
It is a table that is used in classification problems to assess
where errors in the model were made.
The rows represent the actual classes the outcomes should
have been. While the columns represent the predictions we
have made. Using this table it is easy to see which predictions
are wrong.
We can make a confusion matrix from logged data. In this
example, NumPy is used to generate actual and predicted
values:
import numpy
actual = numpy.random.binomial(1, 0.9, size=1000)
predicted = numpy.random.binomial(1, 0.9, size=1000)
In order to create the confusion matrix we need to import
metrics from the sklearn module:
from sklearn import metrics
Once metrics is imported, we can use the confusion matrix
function on our actual and predicted values:
confusion_matrix = metrics.confusion_matrix(actual, predicted)
To create a more interpretable visual display, we need to
convert the table into a confusion matrix display:
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [False, True])
Finally to display the plot we can use the functions plot() and
show() from pyplot.
cm_display.plot()
plt.show()
Example in full:
import numpy
import matplotlib.pyplot as plt
from sklearn import metrics
actual = numpy.random.binomial(1, 0.9, size=1000)
predicted = numpy.random.binomial(1, 0.9, size=1000)
confusion_matrix = metrics.confusion_matrix(actual, predicted)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [False, True])
cm_display.plot()
plt.show()
Results Explained
The Confusion Matrix created has four different quadrants:
True Negative (Top-Left Quadrant)
False Positive (Top-Right Quadrant)
False Negative (Bottom-Left Quadrant)
True Positive (Bottom-Right Quadrant)
True means that the values were accurately predicted, False
means that there was an error or wrong prediction.
Now that we have made a Confusion Matrix, we can calculate
different measures to quantify the quality of the model. First,
let's look at Accuracy.
Created Metrics
The matrix provides us with many useful metrics that help us to
evaluate our classification model.
Accuracy
Accuracy measures how often the model is correct.
How to Calculate
(True Positive + True Negative) / Total Predictions
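A tiny made-up example of the formula (15 predictions: 7 true positives, 5 true negatives, 2 false positives, 1 false negative), checked against sklearn's accuracy_score:

```python
from sklearn import metrics

# Made-up outcomes: 7 TP, 5 TN, 2 FP, 1 FN
actual    = [1]*7 + [0]*5 + [0]*2 + [1]*1
predicted = [1]*7 + [0]*5 + [1]*2 + [0]*1

# (True Positive + True Negative) / Total Predictions
accuracy = (7 + 5) / 15
print(accuracy)  # 0.8

print(metrics.accuracy_score(actual, predicted))  # 0.8
```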
Precision
Of the positives predicted, what percentage is truly positive?
How to Calculate
True Positive / (True Positive + False Positive)
Sensitivity (Recall)
Of all the positive cases, what percentage are predicted
positive?
How to Calculate
True Positive / (True Positive + False Negative)
F-score
F-score is the "harmonic mean" of precision and sensitivity.
How to Calculate
2 * ((Precision * Sensitivity) / (Precision + Sensitivity))
This score does not take into consideration the True Negative
values:
Specificity
How well is the model at predicting negative results?
How to Calculate
True Negative / (True Negative + False Positive)
Example
All calculations put together (using the actual and predicted
values from above):
from sklearn import metrics
Accuracy = metrics.accuracy_score(actual, predicted)
Precision = metrics.precision_score(actual, predicted)
Sensitivity_recall = metrics.recall_score(actual, predicted)
Specificity = metrics.recall_score(actual, predicted, pos_label=0)
F1_score = metrics.f1_score(actual, predicted)
print({"Accuracy":Accuracy,"Precision":Precision,"Sensitivity_recall":Sensitivity_recall,"Specificity":Specificity,"F1_score":F1_score})
ML hierarchical clustering