Ultimate Step by Step Guide To Machine Learning Using Python Predictive
Click on ‘New’ and then ‘Python 3’ in the top right. Once you do that, a new
browser window will open with your new project and Python environment
ready to execute. You can rename your project as ‘My First Python Project’
by clicking on ‘Untitled’ at the top of the screen (highlighted below for
reference).
Done? Alright, you are ready to go!
3. Data Types
Like most programming languages, Python has three main data types:
1) Booleans
2) Numbers
3) Strings
3.1 Booleans
Boolean values are True or False. You can use the following comparison
operators, which evaluate to Boolean values:
1) Is equal to: ==
2) Is greater than: >
3) Is less than: <
4) Is greater than or equal to: >=
5) Is less than or equal to: <=
Try out this code in the Python notebook you created above as follows:
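A minimal sketch of what that notebook cell might look like:

```python
# Comparison operators evaluate to the Boolean values True or False
print(5 == 5)   # True
print(5 > 3)    # True
print(2 < 1)    # False
print(4 >= 4)   # True
print(4 <= 3)   # False
```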
3.2 Numbers
In Python, numbers can be integers i.e. whole numbers like 1, 2, 3 or floats
i.e. numbers with decimals like 1.1, 2.3, 4.5 etc. You can use the basic
mathematical operators like:
1) Plus: +
2) Minus: -
3) Multiplication: *
4) Division: /
5) Parenthesis: () to enforce precedence in operations
Try out this code in the Python notebook you created above as follows:
You can also check the type of a number by using the type() function in
Python. See below:
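A minimal sketch covering both the arithmetic operators and the type() check:

```python
# Basic arithmetic with integers and floats
print(2 + 3)        # 5
print(10 - 4.5)     # 5.5
print(3 * 2)        # 6
print(7 / 2)        # 3.5 - division always returns a float
print((1 + 2) * 3)  # parentheses enforce precedence: 9

# Checking the type of a number
print(type(3))      # <class 'int'>
print(type(3.5))    # <class 'float'>
```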
3.3 Strings
Strings are sequences of characters – for example, words or sentences. In
Python, strings are designated by single (') or double (") quotes. Just like
Booleans and Numbers above, there are certain operators and functions you
can use on strings in Python.
You can use + sign to concatenate two strings. See below:
You can also change strings to upper and lower case using the upper() and
lower() functions in Python, use the len() function to determine the number
of characters in a string, and use the count() function to count how many
times a character appears. See below:
Another useful function is the replace() that lets you replace one character
with another in a string. See below:
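The string operations above can be sketched in one cell as follows:

```python
word = "Python"
greeting = "Hello, " + word        # concatenation with the + sign
print(greeting)                    # Hello, Python
print(greeting.upper())            # HELLO, PYTHON
print(greeting.lower())            # hello, python
print(len(greeting))               # number of characters: 13
print(greeting.count("o"))         # occurrences of "o": 2
print(greeting.replace("o", "0"))  # Hell0, Pyth0n
```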
4. Data Structures
Now that we have learned about data types in Python, let’s go over how
Python organizes these data types into different types of data structures. We
will cover the following data structures in this book:
1) Tuples
2) Lists
3) Dictionaries
4) Sets
5) Custom Objects
4.1 Tuples
Tuples are ordered sequences of data, represented as comma-separated values
contained within parentheses. For example:
(1, 'two', 3.5, False)
As you can see from the above example, Tuples can contain data of all types
at the same time.
Tuples are immutable – as in data within Tuples cannot be changed. For
example, look at this sequence of code:
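A minimal sketch of that sequence, with the error caught so the cell keeps running:

```python
my_tuple = (1, 'two', 3.5, False)
try:
    my_tuple[1] = 4   # attempt to change the second element
    changed = True
except TypeError as err:
    changed = False
    print(err)        # 'tuple' object does not support item assignment
```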
Python throws an error when we try to assign the value of 4 to the second
index in the tuple, indicating that tuple objects do not support item
assignment.
Tuples can also contain other data structures, including other tuples – this
concept is known as ‘nesting’. Imagine visualizing it as a tree as follows:
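A small sketch of nested tuples, with indexing used to drill down the “branches”:

```python
nested = (1, ('a', 'b'), (2.5, (True, False)))
print(nested[1])        # the inner tuple ('a', 'b')
print(nested[2][1][0])  # drill two levels down: True
```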
4.2 Lists
Lists are also sequenced data structures, with data separated by commas
within square brackets (as opposed to parentheses for tuples). For example:
[1, 'two', 3.5, False]
Unlike tuples, lists are mutable i.e. data in lists can be changed. See the
following sequence of code:
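A minimal sketch of that sequence:

```python
my_list = [1, 'two', 3.5, False]
my_list[2] = 4   # replace 3.5 at index 2 - lists are mutable, no error raised
print(my_list)   # [1, 'two', 4, False]
```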
Python allows us to change the list by replacing the value 3.5 with 4 in index
2 position and does not throw an error like in the case of tuples.
Just like tuples, lists can hold multiple data types at the same time including
other data structures like tuples and lists i.e. nesting as outlined above.
4.3 Dictionaries
Dictionaries are like tuples and lists in that they help order and sequence the
data. However, the main difference is that they are represented by data
separated by commas in curly braces and the index for dictionaries can be
string labels. See code example below:
{'leader': 'Abraham Lincoln', 'fighter': 'Mike Tyson', 'city': 'Toronto'}
In the case of dictionaries, the index or keys are immutable i.e. cannot be
changed. However, the values represented by the labels can be changed. See
the code sample below:
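A minimal sketch of changing a dictionary value via its key:

```python
people = {'leader': 'Abraham Lincoln', 'fighter': 'Mike Tyson', 'city': 'Toronto'}
people['leader'] = 'Winston Churchill'   # values can be changed via their key
print(people['leader'])                  # Winston Churchill
```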
In the above example, Python allows us to change the leader in the dictionary
to Winston Churchill – and we used the string ‘leader’ as our index as
opposed to 0 to access the first data element in the structure.
4.4 Sets
Sets are another example of data structures like tuples, lists and dictionaries –
the key difference being that sets are not ordered i.e. they do not have
indexes. The reason behind that is that the sets only contain unique elements.
They are represented by data elements separated by commas in curly
brackets. See code example below:
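A minimal sketch of duplicate removal, with 3.5 deliberately repeated:

```python
my_set = {1, 'two', 3.5, 3.5, False}
print(my_set)   # the duplicate 3.5 appears only once; order is not guaranteed
```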
In the above example, even though 3.5 was duplicated in the curly brackets,
when the set was created, the duplicates were removed.
Sets can have values added or removed from them using ‘add’ and ‘remove’
functions. See code example below:
You can also perform additional mathematical functions on sets by showing
the common elements in two sets by using the ‘&’ function. See code
example below:
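A minimal sketch, with 1 and 3.5 deliberately placed in both sets:

```python
set1 = {1, 'two', 3.5}
set2 = {1, 3.5, 'five'}
print(set1 & set2)   # common elements only: {1, 3.5}
```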
As you would remember, set1 had data elements 1 and 3.5 that are also
present in set2. So the ‘&’ function found the commonality and returned the
common data elements between the two sets.
Now let’s try combining the two sets by using the ‘union’ function. See code
example below:
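A minimal sketch of combining the same two sets with union:

```python
set1 = {1, 'two', 3.5}
set2 = {1, 3.5, 'five'}
print(set1.union(set2))   # {1, 3.5, 'two', 'five'} - duplicates consolidated
```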
Notice how while combining the two data sets, Python consolidated the
common elements 1 and 3.5 and did not repeat them – as duplicates are not
allowed in sets.
4.5 Custom Objects
While tuples, lists, dictionaries and sets are built in data structures in Python,
sometimes you may find the need to create custom objects or data structures
in Python with their own attributes. Suppose you wanted to create a custom
object in Python called ‘Human’ with its own specific attributes for
reusability in your code. Consider this code sample:
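A minimal sketch of such a class (the height and weight values match the description below):

```python
class Human:
    def __init__(self, height, weight):
        # 'self' refers to the instance being created
        self.height = height   # height in inches
        self.weight = weight   # weight in pounds

person = Human(60, 180)   # create a Human with the desired attribute values
print(person.height)      # 60
print(person.weight)      # 180
```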
Notice the following in the above lines of code:
- Custom class Human was defined using the class keyword, and its
‘__init__’ initialization method was defined using the def keyword
(notice the two underscores on each side of init)
- In the initialization function, ‘self’ is an instance of the object being
created along with the two attributes ‘height’ and ‘weight’ that define
the human object in this instance
- After the custom class has been defined, we use that to initialize a new
Human object and pass it the attribute values for height (60 inches) and
weight (180 pounds).
- When we print out the object attributes we assigned, we get the
desired output
5. Data Traversing
Now that we have introduced different types of data structures in Python,
let’s see how we navigate and traverse through them. We can do so in the
following ways:
1) If then else statements
2) Loops
3) Functions
5.1 If Then Else Statements
We will start with if then else statements. These conditional structures are a
common presence in virtually all programming languages. They make
use of Boolean operators introduced earlier in this book and depending on the
outcome of the Boolean condition, the path forward is defined. Consider this
sample of code:
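A minimal sketch matching the description that follows:

```python
wizard = 'Gandalf'
if wizard == 'Sauron':   # False, so the else branch runs
    print('oh no')
else:
    print('keep going')
```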
In the above example, we set the wizard variable to String value of ‘Gandalf’.
We then used the Boolean equals operator (==) to check if the wizard value is
‘Sauron’ instead. Clearly the result of that condition will be false, so Python
will not print “oh no” and instead it picks the else path and prints “keep
going”.
In the syntax, notice how there is a colon at the end of both if and else lines
of code. Also, notice how the dependent output is indented to show
dependency on the if and else conditions.
Now, what if you wanted to have multiple if conditions in your code? Check
out this sample of code:
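A minimal sketch of the multi-condition version described below:

```python
wizard = 'Sauron'
elf = 'Frodo'
helpComing = True

if wizard == 'Sauron' and elf == 'Frodo':
    print('oh no')        # both conditions are True, so this branch runs
elif helpComing == True:
    print('keep going')
else:
    print('all is lost')
```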
Notice that we changed the wizard variable to ‘Sauron’ now and we also
introduced a new elf variable ‘Frodo’. We have also added to the if condition
by using the Boolean operator AND – basically checking if both the wizard
== Sauron and elf == Frodo conditions are true. Since that is the case, Python
prints “oh no” and skips the remaining conditions in the code.
What happens if we change the elf to ‘Sam’? Let’s find out:
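The same sketch with the elf changed to 'Sam':

```python
wizard = 'Sauron'
elf = 'Sam'
helpComing = True

if wizard == 'Sauron' and elf == 'Frodo':
    print('oh no')
elif helpComing == True:
    print('keep going')   # elf is no longer 'Frodo', so this branch runs
else:
    print('all is lost')
```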
Since our first if condition is dependent on wizard being ‘Sauron’ and elf
being ‘Frodo’, it is no longer true. So Python skips that condition and goes to
the condition branch elif (short for else if). We know that helpComing is set
to True, so Python will print “keep going” as the output.
5.2 Loops
Just like if then else statements, for loops and while loops are a common
presence in virtually all programming languages.
For loops are used to traverse data structures based on specific conditions.
See code example below:
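A minimal sketch of the odd-to-even conversion described below:

```python
odds = [1, 3, 5, 7, 9]
for i, value in enumerate(odds):   # enumerate yields (index, element) pairs
    odds[i] = value + 1            # lists are mutable, so update in place
print(odds)                        # [2, 4, 6, 8, 10]
```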
In the above example, we started with a list of odd numbers and we wanted to
convert them to even numbers. We start out with using the ‘enumerate’
function in Python which essentially produces the index for each data
element in the list. We then iterate through the list by assigning the index to i
variable and adding 1 to each element in the list until we are done going
through the entire list as follows:
list[0] = 1 + 1 = 2
list[1] = 3 + 1 = 4
list[2] = 5 + 1 = 6
and so on…
The outcome is a list of even numbers: [2, 4, 6, 8, 10]. Remember earlier in
the book we mentioned that, lists are mutable and that’s why we were able to
change the odd numbers to even numbers in the list – this function will not
have worked for tuples which are immutable.
Now let’s try traversing a set – if you recall, sets are shown in curly braces
and do not have indices. So what do we do in that case? Not to worry, Python
has an answer for that as well. Check out this sample code:
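A minimal sketch of iterating over a set without indices:

```python
metals = {'silver', 'gold', 'bronze'}
for metal in metals:
    print(metal)   # each element is visited exactly once, no index needed
```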
Notice how, without using an index, Python assigns the value of each data
element to the loop variable and lets you iterate through the entire length of
the set? That’s the elegance and simplicity of Python!
Now let’s look at another example of loops in Python – while loops. While
loops are great for traversing data structures where the size of the data
structures is not known, and you would like to traverse until a specific
condition is met. Consider this sample code:
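A minimal sketch matching the description below; the exact list of metals in the book’s screenshot is an assumption, chosen so that ‘gold’ is found on the third hop with ‘bronze’ and ‘diamond’ later in the list:

```python
metals = ['silver', 'copper', 'gold', 'bronze', 'diamond']
i = 0
while metals[i] != 'gold':   # keep hopping until we hit gold
    i = i + 1
print('found gold on hop', i + 1)   # found gold on hop 3
```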
In the above example, we were trying to find ‘gold’ but don’t know how
many hops it will take to hit gold. So, we set our index to 0 and set the
Boolean condition in the while loop to keep going while the metal is not
equal to gold. We finally find gold after 3 total hops and end the loop before
hitting ‘bronze’ or ‘diamond’ which are later in the list.
5.3 Functions
In Python, you have the option to define your own functions if there is a
specific set of operations you expect to repeat on a data structure. For
example, suppose you always wanted to return multiples of a specific number
anytime a list of numeric values is provided to you. You can define a function
as follows:
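A minimal sketch of such a function; this version returns a new list (a list comprehension) rather than mutating the input, which is one reasonable way to implement what the book describes:

```python
def multiples(x, list1):
    # return a new list with every element multiplied by x
    return [x * n for n in list1]

print(multiples(4, [13, 15, 17, 19]))   # [52, 60, 68, 76]
```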
In the above example, we defined a function called ‘multiples’ using the def
keyword in Python. We define the function as taking two parameters ‘x’ –
that being the number that we will multiply the list with and ‘list1’ – that
being the list we will convert to multiples of x.
Once we are done defining the function, we try it out by passing it a list of
values of 13, 15, 17 and 19 and ask the multiples function to convert the list
by multiplying it by 4. The outcome is 52, 60, 68 and 76 – all multiples of 4
as desired. Now you can re-use this custom defined function later in your
code as well – as required.
6. Data Exploration and Analysis
Now that we have covered different data types, structures and how to use
them, we will take it one step further and cover how we work with data files
and data frames. We will also introduce two foundational Python libraries:
Pandas and Numpy.
For the purposes of this and upcoming chapters, we will use a sample
fictional dataset that contains house sales prices based on various house
features such as:
1) Year Built – Year house was built
2) Square Footage – Size of the house in square feet
3) House Type – Whether the house is a detached, semi-detached or
townhouse
4) Garage Size – Number of cars that the garage can accommodate
5) Fireplaces – Whether house has a fireplace or not
6) Pool – Whether house has a pool or not
7) Sale Price – Actual sale price of the house based on the above features
You can download this sample dataset as a .csv file from my website by
following this link: House Sales Data
6.1 Explore and Clean Data with Pandas
Now that we have selected our dataset to work with, let’s first start with
reading the train.csv file that we downloaded earlier. For that we will use a
popular Python library called ‘Pandas’. Pandas are cute furry animals, but in
Python the name also refers to a library that allows us to work with data as
organized data frames. This library also has built-in functions that we can
perform on these data frames.
We start with importing this library as follows:
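The book reads the downloaded file directly with pd.read_csv('train.csv'). Since that file is not reproduced here, this sketch builds a tiny in-memory CSV instead; the column names are my assumption, based on the feature list above:

```python
import pandas as pd
from io import StringIO

# Stand-in for the downloaded train.csv file
csv_data = StringIO(
    "YearBuilt,SquareFootage,HouseType,GarageSize,Fireplaces,Pool,SalePrice\n"
    "1995,2400,Detached,2,Yes,No,850000\n"
    "2005,1500,Townhome,1,No,No,520000\n"
)
df = pd.read_csv(csv_data)   # with the real file: pd.read_csv('train.csv')
print(df.shape)              # (2, 7) - the real dataset has 1000 rows
```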
In the above code sample, we used the pandas function ‘read_csv’ by passing
it the path to the train.csv file and stored it into our data frame called ‘df’.
We are now ready to use pandas’ data exploratory functions, so we can
understand our dataset better.
Let’s start by looking at the first 10 records of the data set. For that we use
the pandas head() function, passing 10 as the number of rows to display
(head() shows the first 5 by default). See code sample below:
Looks like our dataset has 1000 rows and 7 columns. Ok great, now let’s find
out what are data types in each column of the data frame. For that we will use
the pandas dtypes attribute. See code sample below:
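A minimal sketch of both exploratory steps on a stand-in frame (column names assumed from the feature list above):

```python
import pandas as pd

df = pd.DataFrame({
    'YearBuilt': [1995, 2005], 'SquareFootage': [2400.0, 1500.0],
    'HouseType': ['Detached', 'Townhome'], 'GarageSize': [2, 1],
    'Fireplaces': ['Yes', 'No'], 'Pool': ['No', 'No'],
    'SalePrice': [850000, 520000],
})
print(df.head(10))   # first 10 rows of the data frame
print(df.dtypes)     # a mix of int64, float64 and object (string) columns
```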
Looks like we have a mix of integers, floats and strings (objects) in our
dataset. Good to know and this will come in handy as we transform our data
for further processing later in this book.
Let’s further describe this data set using the describe() function to understand
it better. See code sample below:
After dropping the rows that contain null values, we now have a smaller data
set to work with, consisting of 980 rows, 7 columns and no null values!
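A minimal sketch on a stand-in frame; the dropna() step that removes rows with null values is my assumption about how the book shrinks the dataset from 1000 to 980 rows:

```python
import pandas as pd

df = pd.DataFrame({
    'SquareFootage': [2400.0, 1500.0, None, 1800.0],
    'SalePrice': [850000, 520000, 610000, None],
})
print(df.describe())   # count, mean, std, min, quartiles and max per column
df = df.dropna()       # drop rows containing null values
print(df.shape)        # only the complete rows remain: (2, 2)
```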
6.2 Find Outliers with Numpy and Scipy
Before we go further though, let’s introduce a few more foundational libraries
in Python.
1) Numpy is a library that is used for performing numeric operations on
data structures like lists and matrices
2) Scipy is a library that leverages Numpy data structures for advanced
algebra and calculus operations
3) Scikit-learn (also known as Sklearn) is a foundational Python library
for building predictive models
To leverage the numeric calculations (Numpy), statistical analysis (Scipy)
and predictive modeling (Scikit-learn) functions in the above libraries, we
need to replace categorical string values in our dataset with numeric values.
Since this is a common issue, Sklearn library has built-in functions that allow
you to do exactly that in very few lines of code. See below:
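A minimal sketch using Scikit-learn’s LabelEncoder, which is one common way to do this (the book’s exact code is not reproduced here). LabelEncoder assigns codes in alphabetical order, which matches the encodings described below:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'HouseType': ['Detached', 'Semi-Detached', 'Townhome'],
    'Fireplaces': ['Yes', 'No', 'No'],
    'Pool': ['No', 'Yes', 'No'],
})
le = LabelEncoder()
for col in ['HouseType', 'Fireplaces', 'Pool']:
    df[col] = le.fit_transform(df[col])   # replace strings with numeric codes
print(df)   # HouseType -> 0, 1, 2; Yes -> 1, No -> 0
```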
Notice how for Fireplaces and Pool, before we had ‘Yes’ and ‘No’ values and
they have now been replaced with 1 and 0 respectively. Also, for House Type
different house types have been re-coded as follows 0, 1, 2.
Our data types for different columns in the dataset are also numeric:
We will use Numpy and SciPy libraries to find outliers in our data set to
clean it further. For that we will use Z-score method.
Z-score basically indicates how far from the mean a data point is. Typically, a
Z-score of 3 or more is a good indicator of an outlier. Let’s look for these
outliers in our data set.
In the above code, we followed the following steps to calculate and remove
our outliers:
1) We imported Numpy and Scipy libraries
2) We then calculated Z scores for each of the numeric values in our data
frame using the Scipy stats.zscore function
3) We then re-formed the data frame with data that has Z score of less
than 3 to eliminate all the outliers
4) Finally, we printed out the shape of this updated data set
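The four steps above can be sketched as follows on a stand-in frame (numeric columns assumed, since the Z-score method applies only to numeric data):

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'SquareFootage': [2400.0, 1500.0, 1800.0, 2100.0],
    'SalePrice': [850000.0, 520000.0, 610000.0, 700000.0],
})
z = np.abs(stats.zscore(df))    # absolute Z-score of every numeric value
df = df[(z < 3).all(axis=1)]    # keep only rows where every |Z| is below 3
print(df.shape)                 # no outliers in this tiny sample: (4, 2)
```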
Good news! Z-Score function did not find any outliers and our dataset still
has the same size – 980 rows and 7 columns.
6.3 Visualize Data with Matplotlib and Seaborn
Now that we have removed all null values, replaced string values and
checked for outliers, let’s find out which features of the house have the
biggest impact on house sales price. Easiest way to see that is by visualizing
your data. For that we will use two additional Python libraries:
1) Matplotlib is the foundational library for creating graphs and plots in
Python
2) Seaborn is a more advanced data visualization library and is based on
the Matplotlib as its foundation
Let’s use the above two libraries to visualize our data and see which features
impact the house sale prices the most. We will first look at the distribution of
the house sale prices as follows:
In the above snapshot, we covered the following steps:
1) We first imported the matplotlib and seaborn libraries
2) We then called the distribution plot function called ‘distplot’ and
passed it the Sale Price column to generate the distribution plot
visualization
The distribution plot shows us that most sale prices fall between $500,000
and $1 million, with a smaller number reaching up to $2 million.
Now, let’s create a new data visualization called box plots – to see house
prices by different house types. As you may recall, we re-coded the house
types in the previous chapter as follows:
House Type (String Value)    House Type (Numeric Encoding)
Detached                     0
Semi-Detached                1
Town-home                    2
Let’s see how we generate a box plot visualization using the following code:
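A minimal sketch of the box plot, using a stand-in frame with the numeric house type encodings above:

```python
import matplotlib
matplotlib.use('Agg')   # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.DataFrame({
    'HouseType': [0, 0, 1, 1, 2, 2],   # 0=Detached, 1=Semi-Detached, 2=Town-home
    'SalePrice': [1200000, 2000000, 750000, 780000, 450000, 600000],
})
ax = sns.boxplot(x='HouseType', y='SalePrice', data=df)
plt.savefig('price_by_house_type.png')
```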
Based on the above visual, we see that detached house prices can range from
$800,000 to $2,000,000, semi-detached houses are just below the $800,000
mark and townhomes are in the $400,000 to under $800,000 range.
While the above visualizations give us a good idea of impact of individual
variables on the house sales price, it is even better to be able to see the
relative impact of all variables on the house sales price.
For that we will use the correlation matrix and heatmap visualization – see
snapshot below:
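A minimal sketch of the correlation matrix and heatmap on a stand-in frame (columns assumed from the feature list):

```python
import matplotlib
matplotlib.use('Agg')   # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.DataFrame({
    'SquareFootage': [2400, 1500, 1800, 2100],
    'HouseType': [0, 2, 1, 0],
    'GarageSize': [2, 1, 1, 2],
    'SalePrice': [850000, 450000, 610000, 790000],
})
corr = df.corr()                    # pairwise correlation matrix
ax = sns.heatmap(corr, annot=True)  # gradient color scale shown on the right
plt.savefig('correlation_heatmap.png')
```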
From the above heatmap, we can see that Square Footage and House Type
are the two distinctive features that have a material impact on the house sales
price (based on their distinctive colors on the gradient scale on the right).
7. Building Predictive Models
Now that we have selected a dataset and cleaned it for null values and
outliers, we are ready to build different predictive models. But what is a
predictive model? In the simplest terms, predictive models use the past to
predict the future. They analyze current state factors, correlations and data
relationships to determine how things will play out based on these
dependences.
We will build two different kind of predictive models in this chapter using
the dataset we have been working with so far:
1) Linear Regression
2) Decision Tree Regression
We will then test and compare the accuracy of these predictive models.
Regression: Before we go too far, we just mentioned a new term – regression.
Why would predictive models, that predict the future, be referencing
regression? Regression is a statistical technique and the term was first used
by Francis Galton ([6]) in 1877 when describing how the heights of
descendants tended to regress towards the mean as opposed to increasing
with each new generation. This term then evolved into the statistical concept
of describing the relationship between dependent variables like house sale
price and independent variables like house square footage and number of
garages.
There is a wide variety of different machine learning and predictive
algorithms and I will go over some of them at a higher level in the next
chapter. However, I will go over in more detail why Linear Regression and
Decision Tree are suitable algorithms for this specific problem.
To train our models, we will first split our data into train and test data sets as
per below:
In the above code sample, we used the train_test_split function in the Scikit-
learn library. As you can see, we defined ‘y’ as the target / dependent variable
‘SalePrice’ and ‘X’ as all the independent variables that will be used to
predict the price.
We will take a moment here to describe the purpose of the additional
parameters in the train_test_split function:
1) test_size – defines what percentage of your data will be treated as test
dataset. For this example, we used 50%
2) random_state – is used as an input into random number generation
during the split. For our example, we used 80
3) shuffle – is used to determine whether data should be shuffled before
splitting. For our purpose, we set that to true
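The split described above can be sketched as follows on a stand-in frame, using the same parameter values (test_size=0.5, random_state=80, shuffle=True):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'SquareFootage': [1500, 1800, 2100, 2400, 2700, 3000],
    'GarageSize': [1, 1, 2, 2, 3, 3],
    'SalePrice': [500000, 600000, 700000, 800000, 900000, 1000000],
})
y = df['SalePrice']                 # target / dependent variable
X = df.drop('SalePrice', axis=1)    # all independent variables
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=80, shuffle=True)
print(X_train.shape, X_test.shape)  # half the rows in each split
```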
Now we are ready to build, train and test the predictive models!
7.1 Linear Regression Model
A linear regression model assumes a linear relationship between a dependent
or target variable e.g. Sale Price for the house and independent variables like
square footage, number of garages etc.
It can be illustrated as an equation as follows:
Y = cX + I
Where
Y = dependent variable aka Sale Price of the House
X = independent variable e.g. Square footage
c = coefficient – that will be calculated by the linear regression model
I = intercept – that will be calculated by the linear regression model as well
The above is a case of Simple Linear Regression, where you only have one
dependent and one independent variable.
When you have multiple independent variables, like in our case, we use the
equation below and this can be defined as Multiple Linear Regression
Y = I + (c1X1) + (c2X2) + (c3X3) + …
We can initialize, train and test the Linear Regression Model as follows:
In the above code snapshot, we covered the following steps:
1) We initialized the Linear Regression Model from the Scikit-Learn
library
2) We then fit (or trained) the model using the training data set
3) We made the prediction using the test data set
4) We tested the accuracy of the model using two methods:
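The four steps above can be sketched as follows on a stand-in frame. The book’s evaluation code is not reproduced here, so the R-squared score and mean absolute error are shown as two plausible evaluation methods for a regression model:

```python
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'SquareFootage': [1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600],
    'GarageSize': [1, 1, 2, 2, 3, 3, 3, 4],
    'SalePrice': [500000, 600000, 700000, 800000,
                  900000, 1000000, 1100000, 1200000],
})
y = df['SalePrice']
X = df.drop('SalePrice', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=80, shuffle=True)

model = LinearRegression()                  # 1) initialize the model
model.fit(X_train, y_train)                 # 2) fit (train) on the training set
predictions = model.predict(X_test)         # 3) predict on the test set
print(metrics.r2_score(y_test, predictions))            # evaluation method 1
print(metrics.mean_absolute_error(y_test, predictions)) # evaluation method 2
```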
7.2 Decision Tree Regression Model
The above is an overly simplified illustration of how a decision tree model
arrives at a sale price prediction, but it gets the point across.
Below is how we will initiate, train, test and evaluate a Decision Tree
Regression model in Python.
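A minimal sketch on a stand-in frame, following the same four steps; since accuracy_score applies only to classification, the R-squared score is used here as the regression evaluation metric:

```python
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    'SquareFootage': [1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600],
    'GarageSize': [1, 1, 2, 2, 3, 3, 3, 4],
    'SalePrice': [500000, 600000, 700000, 800000,
                  900000, 1000000, 1100000, 1200000],
})
y = df['SalePrice']
X = df.drop('SalePrice', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=80, shuffle=True)

model = DecisionTreeRegressor(random_state=80)  # 1) initialize the model
model.fit(X_train, y_train)                     # 2) fit on the training set
predictions = model.predict(X_test)             # 3) predict on the test set
print(metrics.r2_score(y_test, predictions))    # 4) evaluate the model
```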
The above is a very basic illustration of how you would initialize a Decision
Tree Regression model. In the above sample code, we covered the following
steps:
1) We initialized the Decision Tree Regression Model from the Scikit-
Learn library
2) We then fit (or trained) the model using the training data set
3) We made the prediction using the test data set
4) We tested the accuracy of the model using the metrics.r2_score
function in the Scikit-Learn library (the accuracy_score function
applies only to classification, not regression) and got a score of 67% -
that is an improvement over the Linear Regression model result we got
earlier
Typically finding the best fitting model for your dataset is a trial and error
exercise and we work with the one that is giving us the highest degree of
accuracy based on our training and test datasets.
8. Understanding Machine Learning
In the previous chapter, I described two machine learning models: Linear
Regression and Decision Tree analysis. I also mentioned that there are many
other machine learning models that can be applied based on the nature of the
problem at hand. In this chapter, we will explain additional machine learning
models and algorithms in simple terms – with a goal for the reader to be able
to understand them at a high level and know when to apply them given the
nature of the data and their data science use case.
At the highest level, machine learning models can be classified into three
categories:
1) Supervised Machine Learning Models
2) Unsupervised Machine Learning Models
3) Deep Learning Models
8.1 Supervised Machine Learning Models
In case of supervised machine learning models, the model is provided some
direction in terms of how to classify the data and it uses those instructions to
learn before making its predictions. In the house sales example that we used
in the previous chapter, that was an example of supervised learning as we
told the model what the different types of houses were and even which house
features have an impact on the sales price. We then split the data set into
training and test sets, that the model used to make its prediction.
Examples of supervised machine learning models include:
1) Regression analysis
2) Classification analysis
Regression analysis – in case of regression analysis, given several factors,
the model is expected to predict a number. In case of the house sales price
example, the model considered several house features to predict the house
sales price. Note that above was an example of linear regression where there
is a straight-line correlation between dependent variable (e.g. house sales
price) and independent variables (e.g. house type, number of garages etc.).
There are certainly situations where that straight-line correlation does not
exist and typically in those situations, polynomial regression analysis is used
– or a curved line instead of a straight line.
In other words, all the machine learning model is doing when performing
regression analysis is it is trying to fit a line between scattered data points to
find a correlation between independent and dependent variables. That can be
illustrated visually as follows:
(Figure: Linear Regression – straight-line fit vs. Polynomial Regression – curved fit)