Introduction to ML Using Python
In this tutorial, let us get introduced to the world of Machine Learning (ML) with Python. Machine
learning primarily studies the design of algorithms that can learn from experience. To learn, these
algorithms need data with certain attributes, in which they try to find meaningful predictive
patterns. Broadly, ML tasks can be categorized as concept learning, clustering, predictive modeling,
and so on. The ultimate goal of ML algorithms is to be able to make correct decisions without any
human intervention.
Overview of contents
1. Installing the Python and SciPy platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.
To check whether the Python environment is installed successfully, run the script below to test it.
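A minimal sketch of such a test script, printing the version of Python and of each library used in this tutorial:
# check the versions of Python and the key libraries
import sys
print('Python: {}'.format(sys.version))
import scipy
print('scipy: {}'.format(scipy.__version__))
import numpy
print('numpy: {}'.format(numpy.__version__))
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
import pandas
print('pandas: {}'.format(pandas.__version__))
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
If any of these imports fail, install the missing library before continuing.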
We are going to use the heart disease dataset from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Heart+Disease), a dataset that is widely used in
machine learning and statistics tutorials. The dataset contains 1025 observations of patients.
There are thirteen columns of diagnostic measurements for each patient. The fourteenth column
is the target, stating whether disease is present (yes or no).
2.1 Import Libraries
First, let's import all of the modules, functions, and objects that are needed for this machine
learning project.
# Load libraries
import pandas
import numpy
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
2.2 Load Dataset
We will use pandas to load the data and to explore it with both descriptive statistics and
data visualization. Note that we are specifying the names of each column when loading the
data. This will help later when we explore the data.
# Load dataset
# heart.csv is assumed to be a copy of the heart disease data, downloaded from the
# UCI repository page above and saved (without a header row) in the working directory
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
         'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
dataset = pandas.read_csv("heart.csv", names=names)
If your copy of the file already includes a header row, drop the names argument so that the
header line is not read in as a row of data. pandas.read_csv() can also read directly from a
URL, so the same method works with the address of a raw CSV file in place of the local file name.
3. Summarize the Dataset
Now it is time to take a look at the data. In this step we are going to look at it in a few
different ways:
1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.
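As a quick sketch of the first two items, we can print the dimensions of the dataset and peek at the first rows:
# shape: dimensions of the dataset as (rows, columns); we expect (1025, 14)
print(dataset.shape)
# head: peek at the first 20 rows
print(dataset.head(20))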
Next we can take a look at a summary of each attribute. This includes the count, mean, min
and max values, as well as some percentiles.
# descriptions of all attributes
print(dataset.describe())
# describe the single field thalach (maximum heart rate achieved)
print(dataset.thalach.describe())
# frequency table for thalach
print(dataset.thalach.value_counts())
Let’s now take a look at the number of instances (rows) that belong to each class. We can
view this as an absolute count.
# class distribution
print(dataset.groupby('target').size())
4. Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
1. Univariate plots, to better understand each attribute.
2. Multivariate plots, to better understand the relationships between attributes.
We start with some univariate plots, that is, plots of each individual variable.
Given that the input variables are numeric, we can create box and whisker plots of each.
Boxplots summarize the distribution of each attribute, drawing a line for the median (middle
value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The
whiskers give an idea of the spread of the data, and dots outside the whiskers show candidate
outliers (values lying more than 1.5 times the spread of the middle 50% of the data beyond
the box).
# box and whisker plots (a 4x4 layout leaves room for all fourteen attributes)
dataset.plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False)
plt.show()
We can also create a histogram of each input variable to get an idea of its distribution.
Histograms group data into bins and give a count of the number of observations in each bin.
From the shape of the bins we can quickly get a feeling for whether an attribute is Gaussian,
skewed, or even exponentially distributed. Histograms can also help us see possible outliers.
# histograms
dataset.hist()
plt.show()
Density Plots
Density plots are another way of getting a quick idea of the distribution of each attribute.
They look like an abstracted histogram with a smooth curve drawn through the top of each
bin, much like your eye tries to do with a histogram.
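A minimal sketch, reusing the pandas plotting interface from the boxplot example (kernel density estimation requires scipy to be installed):
# density plot for each attribute
dataset.plot(kind='density', subplots=True, layout=(4,4), sharex=False)
plt.show()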
Next, let's look at the interactions between the variables, both through a correlation matrix
and through scatterplots of all pairs of attributes, which can help us spot structured
relationships between input variables.
Correlation Matrix Plot
Correlation gives an indication of how related the changes in two variables are. If two
variables change in the same direction, they are positively correlated. If they change in
opposite directions (one goes up as the other goes down), then they are negatively correlated.
You can calculate the correlation between each pair of attributes. This is called a correlation
matrix. You can then plot the correlation matrix and get an idea of which variables have a high
correlation with each other.
This is useful to know, because some machine learning algorithms like linear and logistic
regression can have poor performance if there are highly correlated input variables in your
data.
# calculate the correlation between each pair of attributes
correlations = dataset.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0, 14, 1)  # one tick per attribute (14 columns)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as
the top right. This is useful, as we can see two different views of the same data in one plot.
We can also see that each variable is perfectly positively correlated with itself (as you
would expect) along the diagonal from top left to bottom right.
Scatterplot Matrix
A scatterplot shows the relationship between two variables as dots in two dimensions, one
axis for each attribute. You can create a scatterplot for each pair of attributes in your data.
Drawing all these scatterplots together is called a scatterplot matrix.
Scatter plots are useful for spotting structured relationships between variables, like whether
you could summarize the relationship between two variables with a line. Attributes with
structured relationships may also be correlated and good candidates for removal from your
dataset.
Like the correlation matrix plot, the scatterplot matrix is symmetrical. This is useful for
looking at the pair-wise relationships from different perspectives. Because there is little
point in drawing a scatterplot of each variable against itself, the diagonal shows histograms
of each attribute.
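Using the scatter_matrix function imported earlier, a minimal sketch (with fourteen attributes this produces a large grid, so a larger figure size helps):
# scatter plot matrix of all pairs of attributes
scatter_matrix(dataset, figsize=(14, 14))
plt.show()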
5. Summary
We hope this section has helped you visualize and get a sense of how the variables in a
dataset are distributed and how they relate to one another.