
INTRODUCTION TO ML USING PYTHON

In this tutorial let us get introduced to the world of Machine Learning (ML) with Python. Machine
Learning primarily studies the design of algorithms that can learn from experience. To learn, they need
data with certain attributes, based on which the algorithms try to find meaningful predictive
patterns. Broadly, ML tasks can be categorized as concept learning, clustering, predictive modeling,
etc. The ultimate goal of ML algorithms is to be able to make correct decisions without any human
intervention.

Overview of contents
1. Installing the Python and SciPy platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.

1. Downloading, Installing and Starting Python SciPy

1.1 Install SciPy Libraries


We assume that Python version 2.7 or 3.5+ is already installed on your machine. There are 5 key
libraries that you will need to install. Below is the list of the Python SciPy libraries required.
They are:
scipy
numpy
matplotlib
pandas
sklearn
If any of these are missing, use the pip install command, for example as shown below.
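
A minimal sketch of the install command (run it in a terminal or command prompt, not inside the Python interpreter; note that the sklearn library is published on PyPI under the name scikit-learn):

pip install scipy numpy matplotlib pandas scikit-learn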

1.2 Start Python and Check Versions

To check whether the Python environment is installed successfully, run the script below; it prints
the version of each library.

Open a command line and start the Python interpreter:

python
# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

2. Load the Data

We are going to use the heart disease dataset
(https://archive.ics.uci.edu/ml/datasets/Heart+Disease), which is widely used as an
introductory dataset in machine learning and statistics. The dataset contains 1025
observations of patients. There are thirteen columns of patients' diagnostic measurements,
and the fourteenth column is the target, stating whether disease is present (yes or no).

2.1 Import libraries

First, let's import all of the modules, functions and objects that we will need for this machine
learning project.

# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

2.2 Load Dataset

We will use pandas to load the data and to explore it with both descriptive statistics
and data visualization. Note that we specify the name of each column when loading
the data; this will help later when we explore the data.

# Load dataset (this sketch assumes a local copy of the heart disease data saved as heart.csv in the working directory)
url = "heart.csv"
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang',
         'oldpeak', 'slope', 'ca', 'thal', 'target']
# If your copy of the file already contains a header row, drop the names argument (or pass header=0)
dataset = pandas.read_csv(url, names=names)

The heart disease data is not bundled with Python, so download the CSV file into your working
directory first (for example from the UCI repository linked above). If you prefer, you can pass a
direct URL to read_csv in place of the local file name.

3. Summarize the Dataset

Now it is time to take a look at the data. In this step we are going to look at the data in a
few different ways:
1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.

3.1 Dimensions of Dataset


We can get a quick idea of how many instances (rows) and how many attributes (columns)
the data contains with the shape property.
# shape
print(dataset.shape)

# Print only the column names in the dataset
print(dataset.columns.values)

3.2 Peek at the Data


It is also always a good idea to actually eyeball your data.
# head
print(dataset.head(20))
# tail
print(dataset.tail(20))

3.3 Statistical Summary

Now we can take a look at a summary of each attribute. This includes the count, mean, standard
deviation, min and max values, as well as some percentiles.
# descriptions
print(dataset.describe())
# Describe a single field, thalach
print(dataset.thalach.describe())
print(dataset.thalach.value_counts())  # frequency table
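
For a continuous measurement such as thalach, a raw value_counts() table has one row per distinct value and can be very long. As a small optional sketch, binning the values first with pandas.cut (a standard pandas function) gives a more readable frequency table; the choice of 5 equal-width bins is just an illustrative assumption:

# frequency table of thalach after grouping values into 5 equal-width bins
print(pandas.cut(dataset.thalach, bins=5).value_counts().sort_index())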

3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can
view this as an absolute count.
# class distribution
print(dataset.groupby('target').size())
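
If you also want the breakdown as proportions rather than absolute counts, normalized value counts give the same split as fractions (assuming the class column is named 'target', as in the names list above); a minimal sketch:

# class distribution as proportions of the whole dataset
print(dataset['target'].value_counts(normalize=True))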

4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
Univariate plots help us to better understand each attribute.
Multivariate plots help us to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.

Box and whisker plots

Boxplots summarize the distribution of each attribute, drawing a line at the median (middle
value) and a box spanning the 25th and 75th percentiles (the middle 50% of the data). The
whiskers give an idea of the spread of the data, and dots beyond the whiskers show
candidate outliers (values that lie more than 1.5 times the interquartile range beyond the box).
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False)
plt.show()
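
To make the whisker rule concrete, here is a small sketch that computes the 1.5 × IQR outlier bounds for one attribute; the choice of the chol (cholesterol) column is just an example:

# candidate-outlier bounds for one attribute using the 1.5 * IQR rule described above
q1 = dataset['chol'].quantile(0.25)
q3 = dataset['chol'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print('chol outlier bounds:', lower, upper)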
We can also create a histogram of each input variable to get an idea of its distribution.
Histograms group data into bins and give you a count of the number of observations in
each bin. From the shape of the bins we can quickly get a feeling for whether an attribute is
Gaussian, skewed or even exponentially distributed. Histograms can also help us see possible
outliers.
# histograms
dataset.hist()
plt.show()

Density Plots

Density plots are another way of getting a quick idea of the distribution of each attribute. The
plots look like an abstracted histogram with a smooth curve drawn through the top of each
bin, much like your eye tries to do with the histograms.

# Univariate density plots
import matplotlib.pyplot as plt
import pandas
dataset.plot(kind='density', subplots=True, layout=(4,4), sharex=False)
plt.show()

As a second, self-contained example, the same kind of plot on the Pima Indians Diabetes dataset:

# Univariate density plots (Pima Indians Diabetes dataset)
import matplotlib.pyplot as plt
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First we will look at a correlation matrix plot, and then at scatterplots of all pairs of attributes.
Both can be helpful for spotting structured relationships between input variables.

Correlation Matrix Plot

Correlation gives an indication of how related the changes are between two variables. If two
variables change in the same direction, they are positively correlated. If they change in opposite
directions (one goes up, one goes down), then they are negatively correlated.

You can calculate the correlation between each pair of attributes. This is called a correlation
matrix. You can then plot the correlation matrix and get an idea of which variables have a high
correlation with each other.

This is useful to know, because some machine learning algorithms like linear and logistic
regression can have poor performance if there are highly correlated input variables in your
data.

# correlation matrix
import numpy
correlations = dataset.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0, len(names), 1)  # one tick per column (14 for this dataset)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as
the top right. This is useful as we can see two different views of the same data in one plot.
We can also see that each variable is perfectly positively correlated with itself (as you
would expect) along the diagonal from top left to bottom right.
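
To read off the numbers behind the plot, you can also inspect the correlation matrix directly. For example (assuming the class column is named 'target', as above), the correlation of every attribute with the target, sorted from strongest positive to strongest negative:

# correlation of each attribute with the target
print(correlations['target'].sort_values(ascending=False))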

Scatterplot Matrix

A scatterplot shows the relationship between two variables as dots in two dimensions, one
axis for each attribute. You can create a scatterplot for each pair of attributes in your data.
Drawing all these scatterplots together is called a scatterplot matrix.

Scatter plots are useful for spotting structured relationships between variables, like whether
you could summarize the relationship between two variables with a line. Attributes with
structured relationships may also be strongly correlated with each other, which can make some
of them good candidates for removal from your dataset.

# scatter plot matrix


scatter_matrix(dataset)
plt.show()

Like the correlation matrix plot, the scatterplot matrix is symmetrical. This is useful for looking at
the pair-wise relationships from different perspectives. Because there is little point in drawing
a scatterplot of each variable against itself, the diagonal shows a histogram of each attribute.
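
If you prefer density curves instead of histograms on the diagonal, scatter_matrix accepts a diagonal argument (a standard pandas option, not something specific to this tutorial):

# scatter plot matrix with density estimates on the diagonal
scatter_matrix(dataset, diagonal='kde')
plt.show()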

5. Summary

We hope this section has helped you visualize, and get a sense of, how the variables in a
dataset are distributed.
