Learning Predictive Analytics With Python - Sample Chapter
Ashish Kumar
the batch of 2012-13. He is a data science enthusiast with extensive work experience
in the field. As a part of his work experience, he has worked with tools, such as
Python, R, and SAS. He has also implemented predictive algorithms to glean
actionable insights for clients from transport and logistics, online payment, and
healthcare industries. Apart from data science, he is enthusiastic about and adept at
financial modelling and operational research. He is a prolific writer and has authored
several online articles and short stories, apart from running his own analytics blog.
He also works pro bono for a couple of social enterprises and freelances his data
science skills.
He can be contacted on LinkedIn at https://goo.gl/yqrfo4, and on Twitter at
https://twitter.com/asis64.
Preface
Social media and the Internet of Things have resulted in an avalanche of data. The
data is powerful, but not in its raw form; it needs to be processed and modelled, and
Python is one of the most robust tools out there to do so. It has an array of
packages for predictive modelling and a suite of IDEs to choose from. Learning to
predict who would win, lose, buy, lie, or die with Python is an indispensable skill
set to have in this data age.
This book is your guide to getting started with predictive analytics using Python as the
tool. You will learn how to process data and make predictive models out of it.
Balanced weightage has been given to both the statistical and mathematical concepts
and to implementing them in Python using libraries such as pandas, scikit-learn, and
NumPy. Starting with the basics of predictive modelling, you will see
how to cleanse your data of impurities and make it ready for predictive modelling.
You will also learn more about the best predictive modelling algorithms, such as
linear regression, decision trees, and logistic regression. Finally, you will see what
the best practices in predictive modelling are, as well as the different applications of
predictive modelling in the modern world.
Appendix, A List of Links, contains a list of sources which have been directly or
indirectly consulted or used in the book. It also contains the link to the folder
which contains datasets used in the book.
The breakneck speed at which social media and the Internet of Things have grown
is reflected in the huge silos of data we humans generate: data about where we
live, where we come from, what we like, what we buy, how much money we spend,
where we travel, and so on. Whenever we interact with a social media or Internet
of Things website, we leave a trail, which these websites gleefully log as their data.
Every time you buy a book on Amazon, receive a payment through PayPal, write a
review on Yelp, post a photo on Instagram, or check in on Facebook, apart from
generating business for these websites, you are creating data for them.
Harvard Business Review (HBR) says "Data is the new oil" and that "Data Scientist
is the sexiest job of the 21st century". So, why is data so important and how
can we realize its full potential? Broadly, data is used in two ways:
Data is as abundant as oil once was, but in contrast to oil, data is a
non-depleting resource. In fact, one can argue that it is reusable, in the
sense that each dataset can be used in more than one way and multiple times.
A more detailed comparison of oil and data is provided in the following table:
[Table: Data versus Oil]
Algorithms, on the other hand, are the blueprints of a model. They are responsible
for creating mathematical equations from the historical data. They analyze the data,
quantify the relationship between the variables, and convert it into a mathematical
equation. There is a variety of them: Linear Regression, Logistic Regression,
Clustering, Decision Trees, Time-Series Modelling, Naïve Bayes Classifiers, Natural
Language Processing, and so on. These models can be classified under two classes:
The selection of a particular algorithm for a model depends largely on the kind
of data available. The focus of this book will be to explain methods of handling
various kinds of data and to illustrate the implementation of some of these models.
Statistical tools
There are many statistical tools available today that come with inbuilt
methods to run basic statistical chores. The arrival of open-source, robust tools like R
and Python has made them extremely popular in industry and academia alike.
Apart from that, Python's packages are well documented; hence, debugging is easier.
Python has a number of libraries, especially for running the statistical, cleaning,
and modelling chores. It has emerged as the first among equals when it comes to
choosing a tool for the purpose of implementing predictive modelling. As the
title suggests, Python will be the choice for this book as well.
Historical data
Our machinery (model) is built and operated on this oil called data. In general,
a model is built on the historical data and works on future data. Additionally,
a predictive model can be used to fill missing values in historical data by
interpolating the model over sparse historical data. In many cases, during modelling
stages, future data is not available. Hence, it is a common practice to divide the
historical data into training (to act as historical data) and testing (to act as future
data) through sampling.
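For illustration, here is a minimal sketch of such a split using scikit-learn; the filename and the column name are hypothetical, chosen only to show the shape of the step:
# Splitting historical data into training and testing sets.
# The file name and the 'outcome' column are hypothetical.
import pandas as pd
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions

data = pd.read_csv('historical_data.csv')    # hypothetical historical dataset
X = data.drop('outcome', axis=1)             # predictor variables
y = data['outcome']                          # output variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Here, the testing set plays the role of the future data on which the model will eventually be used.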
As discussed earlier, the data might or might not have an output variable. However,
one thing it promises to be is messy. It needs to undergo a lot of cleaning and
manipulation before it can be of any use for the modelling process.
Mathematical function
Most of the data science algorithms have underlying mathematics behind them. In
many of the algorithms, such as regression, a mathematical equation (of a certain
type) is assumed and the parameters of the equations are derived by fitting the data
to the equation.
For example, the goal of linear regression is to fit a linear model to a dataset and find
the equation parameters of the following equation:
Y = β0 + β1.X1 + β2.X2 + ... + βn.Xn
The purpose of modelling is to find the best values for the coefficients. Once
these values are known, the previous equation can be used to predict the output.
The equation above, which can also be thought of as a linear function of the Xi's
(the input variables), is the linear regression model.
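As a minimal sketch of this idea with scikit-learn, on randomly generated data (the true coefficients below are chosen purely for illustration):
# Fitting a linear regression model; the data is randomly generated for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = np.random.rand(100, 3)                       # 100 observations of X1, X2, X3
Y = 2 + 1.5*X[:, 0] - 0.5*X[:, 1] + 3*X[:, 2]    # a known linear relationship
model = LinearRegression()
model.fit(X, Y)
print(model.intercept_)                          # estimate of beta0
print(model.coef_)                               # estimates of beta1, beta2, beta3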
Another example is logistic regression. There, too, we have a mathematical
equation, or a function of the input variables, with some differences. The defining
equation for logistic regression is as follows:
P = e^(a + b*x) / (1 + e^(a + b*x)) = 1 / (1 + e^-(a + b*x))
Here, the goal is to estimate the values of a and b by fitting the data to this equation.
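A minimal sketch of this estimation with scikit-learn, on synthetic data; note that scikit-learn's LogisticRegression applies regularization by default, so the estimates are approximate:
# Estimating the logistic regression parameters a (intercept) and b (coefficient)
# on synthetic, illustrative data.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
x = np.random.rand(200, 1)                        # a single input variable
p_true = 1.0 / (1 + np.exp(-(-2 + 5*x)))          # true probabilities with a = -2, b = 5
y = (np.random.rand(200, 1) < p_true).ravel().astype(int)   # binary outcomes
model = LogisticRegression()
model.fit(x, y)
print(model.intercept_)                           # estimate of a
print(model.coef_)                                # estimate of b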
Any supervised algorithm will have an equation or function similar to that of the
model above. For unsupervised algorithms, an underlying mathematical function
or criterion (which can be formulated as a function or equation) serves the purpose.
The mathematical equation or function is the backbone of a model.
Business context
All the effort that goes into predictive analytics and all its worth, which accrues to
data, is because it solves a business problem. A business problem can be anything
and it will become more evident in the following examples:
Tricking the users of the product/service to buy more from you by increasing
the click through rates of the online ads
• Banking
• Social media
• Retail
• Transport
• Healthcare
• Policing
• Education
• E-commerce
• Human resource
By what quantum the proposed solution made life better for the business is all
that matters. That is the reason predictive analytics is becoming an indispensable
practice in management consulting.
In short, predictive analytics sits at the sweet spot where statistics, algorithms,
technology, and business sense intersect. Think about it: a mathematician, a
programmer, and a business person rolled into one.
Fig. 1.2: Task matrix: split of time spent on data cleaning and modelling and their final contribution to the model
Many of the data cleaning and exploration chores can be automated because
they are mostly alike, irrespective of the data. The part that needs a
lot of human thinking is the implementation of a model, which is what makes
up the bulk of this book.
What does it do?
Let's say you have searched for a person who works at a particular organization
and LinkedIn throws up a list of search results. You click on one of them and you
land on their profile. In the middle-right section of the screen, you will find a
panel titled "People Also Viewed"; it is essentially a list of people who either work at
the same organization as the person whose profile you are currently viewing or who
have the same designation and belong to the same industry.
Isn't it cool? You might have searched for these people separately if not for this
feature. This feature increases the efficacy of your search results and saves your time.
How is it done?
Are you wondering how LinkedIn does it? The rough blueprint is as follows:
• LinkedIn leverages its search history data to do this. The model underneath
this feature plunges into a treasure trove of search history data and looks at
whom people searched for next after finding the person they were
looking for.
• This event of searching for a particular second person after searching for a
particular first person has some probability, which is calculated using
all the data for such searches. The profiles with the highest probability of
being searched next (based on the historical data) are shown in the "People Also
Viewed" section.
• This probability comes under the ambit of a broad set of rules called
Association Rules. These are widely used in retail analytics, where we
are interested in knowing which groups of products sell together. In other
words, what is the probability of buying a particular second product given
that the consumer has already bought the first product?
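As a simplified, hypothetical sketch of the underlying idea, the conditional probability of a second profile being viewed given a first one can be estimated from logged pairs of consecutive searches; the data and column names below are made up:
# A simplified, hypothetical sketch: estimating P(second profile | first profile)
# from logged pairs of consecutive profile views. Data and column names are made up.
import pandas as pd

views = pd.DataFrame({
    'first_profile':  ['A', 'A', 'A', 'B', 'B'],
    'second_profile': ['X', 'X', 'Y', 'X', 'Z']})
pair_counts = views.groupby(['first_profile', 'second_profile']).size()
first_counts = views.groupby('first_profile').size()
cond_prob = pair_counts.div(first_counts, level='first_profile')   # P(second | first)
print(cond_prob.sort_values(ascending=False))    # most likely "People Also Viewed" candidates first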
How is it done?
• The historical data in this case consists of information about people who
visited a certain website/app and whether or not they clicked on the published ad.
One or a combination of classification models, such as Decision Trees
and Support Vector Machines, is used in such cases to determine whether a
visitor will click on the ad, given the visitor's profile information.
• One problem with standard classification algorithms in such cases is that the
click-through rates are very small numbers, of the order of less than 1%.
The resulting dataset used for classification has very sparse positive
outcomes. The data needs to be downsampled to enrich it with positive
outcomes before modelling.
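A minimal sketch of such downsampling with pandas, on made-up impression data:
# Downsampling the non-clickers so that positive outcomes are better represented.
# The data here is made up: roughly 1% positive outcomes.
import numpy as np
import pandas as pd

np.random.seed(1)
ads = pd.DataFrame({'age': np.random.randint(18, 70, 10000),
                    'clicked': (np.random.rand(10000) < 0.01).astype(int)})
clicked = ads[ads['clicked'] == 1]
not_clicked = ads[ads['clicked'] == 0]
not_clicked_sample = not_clicked.sample(n=10 * len(clicked))   # keep ~10 non-clickers per clicker
balanced = pd.concat([clicked, not_clicked_sample])
print(balanced['clicked'].mean())               # the positive rate is now close to 1/11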
Logistic regression is one of the most standard classifiers for situations with
binary outcomes. In banking, whether or not a person will default on a loan can be
predicted using logistic regression, given their credit history.
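A minimal, hypothetical sketch of such a default model with scikit-learn; the features, data, and decision rule below are made up:
# Predicting loan default with logistic regression; features and data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(2)
X = np.random.rand(500, 2)                 # e.g. normalized income and a past-delinquency score
y = (X[:, 1] > 0.8).astype(int)            # a toy rule standing in for real default labels
model = LogisticRegression().fit(X, y)
new_applicant = np.array([[0.4, 0.9]])
print(model.predict_proba(new_applicant))  # [P(no default), P(default)] for this applicant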
How is it done?
• A decision tree model is created using the historical data. The prediction of
the model foretells whether a crime will occur in an area at a given date
and time in the future.
• The model is recalibrated every day to include the crimes that
happened during that day.
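A minimal sketch of such a model with scikit-learn, on entirely made-up data (real features would include location, time, weather, past incidents, and so on):
# A decision tree on made-up data: predicting whether a crime occurs
# for a given area, day of the week, and hour.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

np.random.seed(3)
X = np.column_stack([np.random.randint(0, 50, 1000),    # area code
                     np.random.randint(0, 7, 1000),     # day of the week
                     np.random.randint(0, 24, 1000)])   # hour of the day
y = np.random.randint(0, 2, 1000)                       # 1 if a crime occurred, 0 otherwise
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
print(tree.predict([[12, 5, 22]]))                      # predict for area 12, Saturday, 10 p.m.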
The good news is that the police are using such techniques to predict crime
scenes in advance so that they can prevent the crimes from happening. The bad news is that
certain terrorist organizations are using such techniques to target the locations that
will cause the maximum damage with minimal effort on their side. The good
news, again, is that this strategic behavior of terrorists has been studied in detail and
is being used to form counter-terrorism policies.
How is it done?
• Clustering performs well in such cases if the columns contributing the
most to the separation of activities are also included while calculating
the distance matrix for clustering. Such columns can be found using a
technique called Singular Value Decomposition.
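A minimal sketch of this idea with scikit-learn, using TruncatedSVD as a stand-in for the SVD step and made-up data:
# Reducing made-up high-dimensional data with SVD and then clustering it.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

np.random.seed(4)
X = np.random.rand(300, 20)                   # 300 observations with 20 columns
svd = TruncatedSVD(n_components=3)            # keep the 3 directions that explain the most variance
X_reduced = svd.fit_transform(X)
clusters = KMeans(n_clusters=4).fit_predict(X_reduced)
print(clusters[:10])                          # cluster labels of the first 10 observations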
• Bill James, using historical data, concluded that the older metrics used to rate
a player, such as stolen bases, runs batted in, and batting average, were not
very useful indicators of a player's performance in a given match. He instead
relied on metrics like on-base percentage and slugging percentage as
better predictors of a player's performance.
• The chief statistician behind the algorithms, Bill James, compiled the performance
data of all the baseball league players and sorted them by these
metrics. Surprisingly, the players who had high values for these statistics also
came at cheaper prices.
This way, they gathered an unbeatable team that didn't have individual stars
commanding hefty prices, but that as a team was an indomitable force. Since then, these
algorithms and their variations have been used in a variety of real and fantasy
leagues to select players. Variants of these algorithms are also being used by
venture capitalists to optimize and automate the due diligence with which they select
prospective start-ups to fund.
Anaconda
Anaconda is a popular Python distribution consisting of more than 195 popular
Python packages. Installing Anaconda automatically installs many of the packages
discussed in the preceding section, but they can be accessed only through an
IDE called Spyder (more on this later in this chapter), which is itself installed as
part of the Anaconda installation. Anaconda also installs IPython Notebook; when you click
on the IPython Notebook icon, it opens a browser tab and a Command Prompt.
Download the suitable installer, double-click on the .exe file, and it will install
Anaconda. Two of the features that you must check after the installation are:
• IPython Notebook
• Spyder IDE
Search for them using the "Start" icon's search box if they don't appear in the list of programs
and files by default. We will be using IPython Notebook extensively, and the code in
this book will work best when run in IPython Notebook.
IPython Notebook can be opened by clicking on the icon. Alternatively, you can
use the Command Prompt to open IPython Notebook. Just navigate to the directory
where you have installed Anaconda and then write ipython notebook, as shown in
the following screenshot:
On the system used for this book, Anaconda was installed in the
C:\Users\ashish directory. One can open a new Notebook in IPython by
clicking on the New Notebook button on the dashboard that opens up.
In this book, we have used IPython Notebook extensively.
Standalone Python
You can download a Python version that is stable and compatible with the OS on
your system. The most stable version of Python is 2.7.0, so installing this version is
highly recommended. You can download it from https://www.python.org/ and
install it.
There are some Python packages that you need to install on your machine before
you start predictive analytics and modelling. This section consists of a demo of
installation of one such library and a brief description of all such libraries.
Installing pip
The following steps demonstrate how to install pip. Follow closely!
1. Navigate to the webpage shown in the following screenshot. The URL
address is https://pypi.python.org/pypi/pip:
2. Download the pip-7.0.3.tar.gz file and unzip it in the folder where Python
is installed. If you have Python v2.7.0 installed, this folder should be
C:\Python27:
6. Now, the current directory is set to the directory where the setup file for pip
(setup.py) resides. Write the following command to install pip:
python setup.py install
Once pip is installed, it is very easy to install all the required Python packages to
get started.
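For example, assuming the pip installation above succeeded and pip is on your path, pandas can be installed from the Command Prompt with a single command:
pip install pandas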
4. Finally, to confirm that the package has installed successfully, write the
following command:
python -c "import pandas"
If this doesn't throw up an error, then the package has been installed successfully.
The various methods in pandas will be explained in this book as and when we
use them.
To get an overview, navigate to the official page of pandas here:
http://pandas.pydata.org/index.html
• It can be used to plot all kinds of common plots, such as histograms, stacked
and unstacked bar charts, scatterplots, heat maps, box plots, power
spectra, error charts, and so on
• It can be used to edit and manipulate all the plot properties, such as the title, axes
properties, color, scale, and so on
To get an overview, navigate to the official page of matplotlib at:
http://matplotlib.org
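For instance, a minimal matplotlib sketch that plots a histogram of randomly generated data:
# A minimal matplotlib sketch: a histogram of randomly generated values.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.randn(1000)            # 1,000 random numbers
plt.hist(values, bins=30)                 # histogram with 30 bins
plt.title('Histogram of random values')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()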
• It gives a very concise method to predict the outcome based on a model
and to measure the accuracy of the outcomes
To get an overview, navigate to the official page of scikit-learn
here: http://scikit-learn.org/stable/index.html
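For instance, a minimal sketch of predicting outcomes and measuring their accuracy with a scikit-learn classifier, on made-up data:
# Predicting outcomes and measuring accuracy with scikit-learn, on made-up data.
import numpy as np
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions
from sklearn.linear_model import LogisticRegression

np.random.seed(5)
X = np.random.rand(400, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)                  # a toy outcome variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)                      # predicted outcomes
print(model.score(X_test, y_test))                       # accuracy of the predictions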
Python packages other than these, if used in this book, will be situation-based and
can be installed using the method described earlier in this section.
IDLE is widely popular as an IDE for beginners; it is simple to use and works well
for simple tasks. Some of the issues with IDLE are bad output reporting, absence of
line numbering options, and so on. As a result, advanced practitioners move on to
better IDEs.
IPython Notebook: IPython Notebook is a powerful computational environment
where code, execution, results, and media can co-exist in one single document.
There are two components of this computing environment:
The IPython documents are stored with the .ipynb extension in the directory where
it is installed on the computer.
Some of the features of IPython Notebook are as follows:
• Inline figure rendering of matplotlib plots, which can be saved in multiple
formats (JPEG, PNG).
• The notebooks can be saved as HTML files and .ipynb files. These notebooks
can be viewed in browsers, and this has developed into a popular tool for
illustrated blogging in Python. A notebook in IPython looks as shown in the
following screenshot:
An IPython Notebook
Python kernel and code editor with line numbering in the same screen
In this book, IPython Notebook and Spyder have been used extensively. IDLE
has been used from time to time, and some people use other environments, such
as PyCharm. Readers of this book are free to use such editors if they are more
comfortable with them. However, they should make sure that all the required
packages work fine in those environments.
Summary
The following are some of the takeaways from this chapter:
• Data is powerful, but not in its raw form. It needs to be processed
and modelled.
• Organizations across the world and across domains are using data to
solve critical business problems. Knowledge of statistical algorithms,
statistical tools, business context, and the handling of historical data is vital to
solving these problems using predictive modelling.
• Python is a robust tool to handle, process, and model data. It has an array of
packages for predictive modelling and a suite of IDEs to choose from.
Let us enter the battlefield, where Python is our weapon. We will start using it from
the next chapter. In the next chapter, we will learn how to read data in various
formats and do some basic processing.