Introduction to ML Using Python
In this tutorial, let us get introduced to the world of Machine Learning (ML) with Python. Machine
learning primarily studies the design of algorithms that can learn from experience. To learn, these
algorithms need data with certain attributes, in which they try to find meaningful predictive
patterns. Broadly, ML tasks can be categorized as concept learning, clustering, predictive modeling,
and so on. The ultimate goal of ML algorithms is to be able to make correct decisions without any
human intervention.
Overview of contents
1. Installing the Python and SciPy platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.
To check whether the Python environment is installed successfully, run the script below to test it.
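A minimal sketch of such a test script, printing the version of Python and of each library used in this tutorial:
# check the versions of Python and the key libraries
import sys
print('Python: {}'.format(sys.version))
import scipy
print('scipy: {}'.format(scipy.__version__))
import numpy
print('numpy: {}'.format(numpy.__version__))
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
import pandas
print('pandas: {}'.format(pandas.__version__))
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
If any of these imports fail, install the missing library before continuing.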
We are going to use the heart disease dataset from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Heart+Disease), a dataset that is widely used in
machine learning and statistics tutorials. The dataset contains 1025 observations of patients.
There are thirteen columns of diagnostic measurements for each patient. The fourteenth column
is the target, stating whether disease is present (yes or no).
2.1 Import Libraries
First, let's import all of the modules, functions, and objects that are needed for this machine
learning project.
# Load libraries
import pandas
import numpy
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
2.2 Load Dataset
We will use pandas to load the data and to explore it with both descriptive statistics and
data visualization. Note that we are specifying the names of each column when loading the
data. This will help later when we explore the data.
# Load dataset
# heart.csv is assumed to be a copy of the heart disease data, downloaded from the
# UCI repository page above and saved (without a header row) in the working directory
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
         'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
dataset = pandas.read_csv("heart.csv", names=names)
If your copy of the file already includes a header row, drop the names argument so that the
header line is not read in as a row of data. pandas.read_csv() can also read directly from a
URL, so the same method works with the address of a raw CSV file in place of the local file name.
3. Summarize the Dataset
Now it is time to take a look at the data. In this step we are going to look at it in a few
different ways:
1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.
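As a quick sketch of the first two items, we can print the dimensions of the dataset and peek at the first rows:
# shape: dimensions of the dataset as (rows, columns); we expect (1025, 14)
print(dataset.shape)
# head: peek at the first 20 rows
print(dataset.head(20))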
Next we can take a look at a summary of each attribute. This includes the count, mean, min
and max values, as well as some percentiles.
# descriptions of all attributes
print(dataset.describe())
# describe the single field thalach (maximum heart rate achieved)
print(dataset.thalach.describe())
# frequency table for thalach
print(dataset.thalach.value_counts())
Let’s now take a look at the number of instances (rows) that belong to each class. We can
view this as an absolute count.
# class distribution
print(dataset.groupby('target').size())
4. Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
1. Univariate plots, to better understand each attribute.
2. Multivariate plots, to better understand the relationships between attributes.
We start with some univariate plots, that is, plots of each individual variable.
Given that the input variables are numeric, we can create box and whisker plots of each.
Boxplots summarize the distribution of each attribute, drawing a line for the median (middle
value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The
whiskers give an idea of the spread of the data, and dots outside the whiskers show candidate
outliers (values lying more than 1.5 times the spread of the middle 50% of the data beyond
the box).
# box and whisker plots (a 4x4 layout leaves room for all fourteen attributes)
dataset.plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False)
plt.show()
We can also create a histogram of each input variable to get an idea of its distribution.
Histograms group data into bins and give a count of the number of observations in each bin.
From the shape of the bins we can quickly get a feeling for whether an attribute is Gaussian,
skewed, or even exponentially distributed. Histograms can also help us see possible outliers.
# histograms
dataset.hist()
plt.show()
Density Plots
Density plots are another way of getting a quick idea of the distribution of each attribute.
They look like an abstracted histogram with a smooth curve drawn through the top of each
bin, much like your eye tries to do with a histogram.
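A minimal sketch, reusing the pandas plotting interface from the boxplot example (kernel density estimation requires scipy to be installed):
# density plot for each attribute
dataset.plot(kind='density', subplots=True, layout=(4,4), sharex=False)
plt.show()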
Next, let's look at the interactions between the variables, both through a correlation matrix
and through scatterplots of all pairs of attributes, which can help us spot structured
relationships between input variables.
Correlation Matrix Plot
Correlation gives an indication of how related the changes in two variables are. If two
variables change in the same direction, they are positively correlated. If they change in
opposite directions (one goes up as the other goes down), then they are negatively correlated.
You can calculate the correlation between each pair of attributes. This is called a correlation
matrix. You can then plot the correlation matrix and get an idea of which variables have a high
correlation with each other.
This is useful to know, because some machine learning algorithms like linear and logistic
regression can have poor performance if there are highly correlated input variables in your
data.
# calculate the correlation between each pair of attributes
correlations = dataset.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0, 14, 1)  # one tick per attribute (14 columns)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as
the top right. This is useful, as we can see two different views of the same data in one plot.
We can also see that each variable is perfectly positively correlated with itself (as you
would expect) along the diagonal from top left to bottom right.
Scatterplot Matrix
A scatterplot shows the relationship between two variables as dots in two dimensions, one
axis for each attribute. You can create a scatterplot for each pair of attributes in your data.
Drawing all these scatterplots together is called a scatterplot matrix.
Scatter plots are useful for spotting structured relationships between variables, like whether
you could summarize the relationship between two variables with a line. Attributes with
structured relationships may also be correlated and good candidates for removal from your
dataset.
Like the correlation matrix plot, the scatterplot matrix is symmetrical. This is useful for
looking at the pair-wise relationships from different perspectives. Because there is little
point in drawing a scatterplot of each variable against itself, the diagonal shows histograms
of each attribute.
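Using the scatter_matrix function imported earlier, a minimal sketch (with fourteen attributes this produces a large grid, so a larger figure size helps):
# scatter plot matrix of all pairs of attributes
scatter_matrix(dataset, figsize=(14, 14))
plt.show()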
5. Summary
We hope this section has helped you visualize and get a sense of how the variables in a
dataset are distributed and how they relate to one another.