6.lab Activity
06
Introduction to Machine Learning, Mean Median Mode, Standard
Theory:
Machine Learning
Machine Learning is making the computer learn from studying data and statistics.
Machine Learning is a step in the direction of artificial intelligence (AI).
Machine Learning is a program that analyses data and learns to predict the outcome.
Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to
a complete database.
Example of an array:
[99,86,87,88,111,86,103,87,94,78,77,85,86]
By looking at the array, we can guess that the average value is probably around 80 or 90, and we
are also able to determine the highest value and the lowest value, but what else can we do?
And by looking at the database we can see that the most popular color is white, and the oldest car
is 17 years, but what if we could predict if a car had an AutoPass, just by looking at the other
values?
That is what Machine Learning is for! Analyzing data and predicting the outcome!
In Machine Learning it is common to work with very large data sets. In this lab
we will try to make it as easy as possible to understand the different concepts of
machine learning, and we will work with small easy-to-understand data sets.
Data Types
To analyze data, it is important to know what type of data we are dealing with.
We can split the data types into three main categories:
Numerical
Categorical
Ordinal
Numerical data are numbers, and can be split into two numerical categories:
Discrete-Data
- counted data that are limited to integers. Example: The number of cars passing by.
Continuous-Data
- measured data that can be any number. Example: The price of an item, or the size of an
item
Categorical data are values that cannot be measured up against each other. Example: a color
value, or any yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. Example:
school grades where A is better than B and so on.
By knowing the data type of your data source, you will be able to know what technique to
use when analyzing them.
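As a small illustration of the difference between ordinal and categorical data, the sketch below sorts school grades using an explicit ranking and counts color values. The names grade_rank, grades, and colors, and the data in them, are made up for this example.

```python
from collections import Counter

# Ordinal data: grades have a meaningful order (A is better than B, and so on).
# The ranking below is an assumption made for this illustration.
grade_rank = {"A": 1, "B": 2, "C": 3, "D": 4}
grades = ["B", "A", "D", "C", "A"]
sorted_grades = sorted(grades, key=lambda g: grade_rank[g])
print(sorted_grades)

# Categorical data: colors cannot be ranked against each other,
# but we can still count how often each one occurs.
colors = ["white", "red", "white", "blue"]
color_counts = Counter(colors)
print(color_counts.most_common(1))
```

Sorting makes sense for the grades because they carry an order; for the colors, counting is the natural operation.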
Mean
The mean value is the average value.
To calculate the mean, find the sum of all values, and divide the sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
Example
Use the NumPy mean() method to find the average speed:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
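To connect the formula to the NumPy result, the same mean can be computed by hand with plain Python. This is only a sketch of the arithmetic described above; the variable name mean_speed is chosen for this illustration.

```python
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Sum of all values divided by the number of values.
total = sum(speed)
count = len(speed)
mean_speed = total / count

print(total, count)
print(round(mean_speed, 2))
```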
Median
The median value is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
It is important that the numbers are sorted before you can find the median.
Example
Use the NumPy median() method to find the middle value:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
If there are two numbers in the middle, divide the sum of those numbers by two.
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103
(86 + 87) / 2 = 86.5
Example
Using the NumPy module:
import numpy
speed = [99,86,87,88,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
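The two rules above (take the middle value, or average the two middle values) can be sketched in plain Python. The function name manual_median is hypothetical and exists only for this illustration; in practice numpy.median() does the same job.

```python
def manual_median(values):
    # The values must be sorted before the middle can be found.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: a single middle value exists.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
even_speed = [99,86,87,88,86,103,87,94,78,77,85,86]

print(manual_median(odd_speed))
print(manual_median(even_speed))
```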
Mode
The Mode value is the value that appears the most number of times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86
Example
Use the SciPy mode() method to find the number that appears the most:
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
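SciPy's mode() returns a result object that holds both the mode and how often it occurs. As a cross-check that needs only the standard library, the sketch below counts the occurrences with collections.Counter; the variable names are chosen for this illustration.

```python
from collections import Counter

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Count how many times each value appears.
counts = Counter(speed)

# most_common(1) returns a list holding the (value, count) pair
# of the most frequent value.
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)
```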
What is Standard Deviation?
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
Example: this time we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is 0.9.
Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is 37.85.
Meaning that most of the values are within the range of 37.85 from the mean value, which is
77.4.
As you can see, a higher standard deviation indicates that the values are spread out over a wider
range.
Example
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
Example
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
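To show what the number means, the sketch below computes the standard deviation by hand: the square root of the average squared distance from the mean. By default numpy.std() divides by the number of values (the population form), and this sketch does the same. The function name population_std is hypothetical.

```python
import math

def population_std(values):
    # Step 1: the mean of the values.
    mean = sum(values) / len(values)
    # Step 2: the average squared distance from the mean (the variance).
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # Step 3: the square root of the variance.
    return math.sqrt(variance)

narrow = [86,87,88,86,87,85,86]
wide = [32,111,138,28,59,77,97]

print(round(population_std(narrow), 2))
print(round(population_std(wide), 2))
```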
Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given
percent of the values are lower than.
Example: Let's say we have an array of the ages of all the people that live in a street.
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.
The NumPy module has a method for finding the specified percentile:
Example
Use the NumPy percentile() method to find the percentiles:
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
Example
What is the age that 90% of the people are younger than?
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
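For readers curious about where these numbers come from, here is a minimal sketch of a percentile calculation that interpolates linearly between the two nearest sorted values. This is meant to mirror NumPy's default behaviour, but treat that as an assumption; the function name manual_percentile is hypothetical.

```python
import math

def manual_percentile(values, percent):
    ordered = sorted(values)
    # Position of the requested percentile in the sorted list (may be fractional).
    rank = (len(ordered) - 1) * percent / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

print(manual_percentile(ages, 75))
print(manual_percentile(ages, 90))
```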
Data Distribution
Earlier in this lab we have worked with very small amounts of data in our examples, just to
understand the different concepts.
In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at
least at an early stage of a project.
To create big data sets for testing, we use the Python module NumPy, which comes with a
number of methods to create random data sets, of any size.
Example
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Histogram
To visualize the data set we can draw a histogram with the data we collected.
Example
Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
Histogram Explained
We use the array from the example above to draw a histogram with 5 bars.
The first bar represents how many values in the array are between 0 and 1.
The second bar represents how many values are between 1 and 2.
Etc.
Note: The array values are random numbers and will not show the exact same
result on your computer.
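The bar heights can also be inspected as plain numbers. The sketch below uses numpy.histogram() to count how many values fall into each of the 5 ranges. The fixed seed is an assumption added here only so the output is repeatable; it is not part of the original example.

```python
import numpy

# A seeded generator makes the random values repeatable (assumption for this sketch).
rng = numpy.random.default_rng(seed=0)
x = rng.uniform(0.0, 5.0, 250)

# Count the values in 5 equal-width bins covering 0 to 5.
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

Each printed count corresponds to the height of one bar in the histogram.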
Example
Create an array with 100000 random numbers, and display them using a
histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
Normal Data Distribution
We have now learned how to create a completely random array, of a given size, and between
two given values.
In this section we will learn how to create an array where the values are concentrated around a
given value.
In probability theory this kind of data distribution is known as the normal data distribution, or
the Gaussian data distribution, after the mathematician Carl Friedrich Gauss who came up with
the formula of this data distribution.
Example
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
Note: A normal distribution graph is also known as the bell curve because of its
characteristic bell shape.
Histogram Explained
We use the array from the numpy.random.normal() method, with 100000 values, to draw a
histogram with 100 bars.
We specify that the mean value is 5.0, and the standard deviation is 1.0.
Meaning that the values should be concentrated around 5.0: in a normal distribution, about 68%
of the values lie within 1.0 (one standard deviation) of the mean.
And as you can see from the histogram, most values are between 4.0 and 6.0, with a top at
approximately 5.0.
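The statement that most values lie between 4.0 and 6.0 can be checked numerically. The sketch below measures the share of values within one and two standard deviations of the mean. The seed and the variable names are assumptions made for this illustration, and the exact shares vary slightly from run to run without a seed.

```python
import numpy

# Seeded generator so the result is repeatable (assumption for this sketch).
rng = numpy.random.default_rng(seed=0)
x = rng.normal(5.0, 1.0, 100000)

# Share of values between 4.0 and 6.0 (within one standard deviation).
within_one = numpy.mean(numpy.abs(x - 5.0) <= 1.0)
# Share of values between 3.0 and 7.0 (within two standard deviations).
within_two = numpy.mean(numpy.abs(x - 5.0) <= 2.0)

print(round(float(within_one), 3))
print(round(float(within_two), 3))
```

For a normal distribution these shares are roughly 68% and 95%.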
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same
length, one for the values of the x-axis, and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Example
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Scatter Plot Explained
What we can read from the diagram is that the two fastest cars were both 2 years old, and the
slowest car was 12 years old.
Note: It seems that the newer the car, the faster it drives, but that could be a coincidence, after all
we only registered 13 cars.
In Machine Learning the data sets can contain thousands, or even millions, of values.
You might not have real world data when you are testing an algorithm, so you might have to use
randomly generated values.
Let us create two arrays that are both filled with 1000 random numbers from a normal data
distribution.
The first array will have the mean set to 5.0 with a standard deviation of 1.0.
The second array will have the mean set to 10.0 with a standard deviation of 2.0:
Example
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
Scatter Plot Explained
We can see that the dots are concentrated around the value 5 on the x-axis, and 10 on the y-axis.
We can also see that the spread is wider on the y-axis than on the x-axis.
Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome
of future events.
Linear Regression
Linear regression uses the relationship between the data-points to draw the straight line that
fits them best.
Python has methods for finding a relationship between data-points and for drawing a line of linear
regression. We will show you how to use these methods instead of going through the mathematical
formula.
In the example below, the x-axis represents age, and the y-axis represents speed. We have
registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we
collected could be used in a linear regression:
Example
Start by drawing a scatter plot:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Example
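Import scipy and draw the line of Linear Regression:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
# Fit the line; slope and intercept define it.
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
# Compute the y value on the line for every x value.
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()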
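Example Explained
Import the modules you need:
import matplotlib.pyplot as plt
from scipy import stats
Create the arrays that represent the values of the x and y axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Execute a method that returns some important key values of Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to return a new value. This new value
represents where on the y-axis the corresponding x value will be placed:
def myfunc(x):
    return slope * x + intercept
Run each value of the x array through the function. This will result in a new array with new
values for the y-axis:
mymodel = list(map(myfunc, x))
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of linear regression:
plt.plot(x, mymodel)
Display the diagram:
plt.show()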
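Although this lab skips the formula, it may help to see that the slope and intercept are nothing mysterious. The sketch below computes them with the ordinary least-squares formulas in plain Python, which is the kind of fit stats.linregress() performs. The variable names sxy and sxx are chosen for this illustration.

```python
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# How x and y vary together, and how x varies on its own.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

# Least-squares slope and intercept of the line y = slope * x + intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))
```

The negative slope says that, in this data set, speed drops by roughly 1.75 for each additional year of age.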
R for Relationship
It is important to know how strong the relationship between the values of the x-axis and the
values of the y-axis is. If there is no relationship, linear regression cannot be used to predict
anything.
This relationship, the coefficient of correlation, is called r.
The r value ranges from -1 to 1, where 0 means no relationship, and 1 (and -1) means 100%
related.
Python and the SciPy module will compute this value for you; all you have to do is feed it
the x and y values.
Example
How well does my data fit in a linear regression?
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)
Note: The result -0.76 shows that there is a relationship, not perfect, but it
indicates that we could use linear regression in future predictions.
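For reference, the r value can also be computed by hand. The sketch below uses the standard Pearson correlation formula in plain Python; the variable names are chosen for this illustration.

```python
import math

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products of deviations from the means.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Pearson correlation coefficient: always between -1 and 1.
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))
```

The sign of r matches the sign of the slope: here it is negative because speed tends to fall as age rises.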
Now we can use the information we have gathered to predict future values.
To do so, we need the same myfunc() function from the example above:
def myfunc(x):
    return slope * x + intercept
Example
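Predict the speed of a 10-year-old car:
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
# Evaluate the fitted line at age 10.
speed = myfunc(10)
print(speed)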
The example predicted a speed of 85.6, which we could also read from the diagram.
Bad Fit?
Let us create an example where linear regression would not be the best method to predict future
values.
Example
These values for the x- and y-axis should result in a very bad fit for linear
regression:
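import matplotlib.pyplot as plt
from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()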
Example
You should get a very low r value.
from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)
The result, 0.013, indicates a very weak relationship, and tells us that this data set is not
suitable for linear regression.
Task 1: Mean, Median, and Mode
2. Calculate and print the mean, median, and mode of the dataset.
1. Calculate and print the standard deviation of the dataset created in Task 1.
2. Generate a new dataset of 1000 values that follows a normal distribution (mean=50,
std_dev=10).
Group No.