
Visualization Using Python

This document discusses using the matplotlib and seaborn libraries in Python to analyze and visualize data with one and two numerical variables. Boxplots can be used to find outliers in a single numerical column. Matplotlib allows bivariate analysis through scatter plots of two numerical columns or line plots joining the scatter points. Seaborn builds on matplotlib and is easier to use for basic plotting. Formulas can also be used to precisely identify outliers found through visualization.

Finding outliers using a boxplot

Boxplot: A box plot (or box-and-whisker plot) summarizes a data set visually using a five-point summary:

    Lower bound   lowest value
    Q1            25th percentile
    Q2            median, 50th percentile
    Q3            75th percentile
    Upper bound   highest value

A boxplot is a univariate plot.
A box plot can only be applied to a single numerical column.

[Figure: box plot with outliers marked beyond the lower bound and the upper bound; Q1, Q2 (median), Q3 and the IQR are labelled]

The vertical lines in the box show Q1, Q2 (median) and Q3.
The whiskers at the ends show the lower bound and the upper bound.
In the boxplot, the width of the box shows the IQR.
To find outliers visually, Python offers two visualization libraries: matplotlib and Seaborn.

matplotlib: matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.

Install matplotlib:

    pip install matplotlib

    import matplotlib

Most of the matplotlib utilities lie under the pyplot submodule and are usually imported under the plt alias:

    import matplotlib.pyplot as plt

matplotlib's interface is mostly modeled on MATLAB.
In matplotlib we have complete control over the plot.
It is the more difficult way.

Seaborn: Seaborn is a library that uses matplotlib underneath to plot graphs.

Install Seaborn:

    pip install seaborn

    import seaborn as sns

Seaborn gives a higher-level interface and nicer defaults than matplotlib.
It is the easier way.

Finding outliers visually: we use the Seaborn boxplot.

    sns.boxplot(x=data["SepalWidthCm"])
    <AxesSubplot:xlabel='SepalWidthCm'>

[Figure: box plot of SepalWidthCm, x-axis from 2.0 to 4.5]

Visually we see one point less than the lower bound and 3 points more than the upper bound.
But one plotted point can be a duplicate of many data points.
So using a boxplot we can check whether outliers exist, and then using formulae we can find the outliers.
Formulae give the exact outliers.
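The formula-based check can be sketched as below. This is a minimal illustration, assuming the usual 1.5 x IQR rule (lower bound = Q1 - 1.5 IQR, upper bound = Q3 + 1.5 IQR). The helper name iqr_outliers and the sample values are made up for the example and are not taken from the Iris data.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Every value outside [lower, upper] is reported, including duplicates
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical sample, not the Iris data
sample = [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 4.4]
lower, upper, outliers = iqr_outliers(sample)
print(outliers)
```

Unlike the plot, the list keeps duplicates, so it shows how many rows are outliers and not just where they sit.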

Using the Seaborn boxplot we can plot each feature variable separately.
We can also use the pandas function called boxplot():

    data.boxplot()
    <AxesSubplot:>

[Figure: box plots of SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm side by side]

This function applies to all numerical columns.
This shows outliers column-wise.

Use the pandas boxplot to find out if outliers exist.
Use formulae to find the exact outliers.

A boxplot is a univariate numerical plot used to find outliers and visualize the spread.
From a boxplot we can see the spread, and so get a feel for the standard deviation also.
Data analysis using 2 numerical variables with matplotlib (bivariate analysis)

matplotlib is a Python library. We can use the pyplot submodule to draw plots.

    import matplotlib.pyplot as plt

We can use the plot() function to draw a scatter plot or a line plot (a line plot joins all the scatter points).
There is also a scatter() function to draw a scatter plot.

Scatter plot: A scatter plot is a diagram where each value in the data set is represented by a dot.

The matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length, one for the values on the x-axis and one for the values on the y-axis.

We can use plt.scatter(x, y).
We can use plt.plot(x, y) with additional parameters.

This is a bivariate plot.
Both the columns have to be numerical columns.

[Figure: scatter of column 2 (y-axis) against column 1 (x-axis); numerical vs numerical]

Line plot: This is a plot that results from a line joining each value in the dataset.
This is a bivariate plot.
Both the columns have to be numerical columns.

[Figure: line plot joining the data points]

We use plt.plot(x, y) from matplotlib to draw the line plot.
By default plt.plot() draws a line plot.

In matplotlib we have to specify everything.
To give the x label, y label and title we use:

    plt.xlabel("string")
    plt.ylabel("string")
    plt.title("string")

ex:

    plt.plot(data["SepalLengthCm"], data["SepalWidthCm"])
    plt.xlabel("SepalLengthCm")
    plt.ylabel("SepalWidthCm")
    plt.title("SL vs SW")

    Text(0.5, 1.0, 'SL vs SW')

[Figure: line plot titled "SL vs SW", x-axis SepalLengthCm, y-axis SepalWidthCm]

Additional parameters

linestyle defines the style of the line.
    default: solid line, "-"
    if we change to a dotted line: ":"
    if we change to a space, i.e. linestyle=" ", no line is drawn
We use this to get a scatter plot.
But here the values are not marked, so we need to mark them to see the scatter plot.

Modifications to make to get a scatter plot:

    plt.plot(data["SepalLengthCm"], data["SepalWidthCm"], linestyle=" ", marker="+")
    plt.xlabel("SepalLengthCm")
    plt.ylabel("SepalWidthCm")
    plt.title("SL vs SW")

[Figure: scatter plot titled "SL vs SW", x-axis SepalLengthCm]

Various options for marker:
    "d"  diamond
    "s"  square
    "o"  circle

Various options for color:
    "k"  black
    "r"  red

Marker size: markersize=5
Line size: linewidth=7
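A small sketch of these parameters in use. The x and y lists are made-up stand-ins for two numerical columns, and the Agg backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical values standing in for two numerical columns
x = [5.1, 4.9, 6.3, 5.8, 7.1]
y = [3.5, 3.0, 3.3, 2.7, 3.0]

plt.figure()
# linestyle=" " draws no connecting line; marker="d" marks each point with a diamond
(points,) = plt.plot(x, y, linestyle=" ", marker="d", color="r", markersize=5)
plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Scatter drawn with plot()")
```

In a notebook or interactive session, plt.show() would display the figure.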

So far we have learned to get a scatter plot using different combinations of parameters.
We are not getting any info from this plot, as we don't know what these points mean.
We can plot scatter plots for various combinations to understand and analyze the data better.

Ex:
We can split the dataset into 3 datasets depending on species:

    setosa = data.loc[data["Species"] == "Iris-setosa"]
    versicolor = data.loc[data["Species"] == "Iris-versicolor"]
    virginica = data.loc[data["Species"] == "Iris-virginica"]

To add labels we need to call one more function, legend(), so that the labels are shown on the plot.

    plt.plot(setosa["SepalLengthCm"], setosa["SepalWidthCm"],
             linestyle=" ", marker="o", color="g", markersize=3,
             label="Setosa")
    plt.plot(versicolor["SepalLengthCm"], versicolor["SepalWidthCm"],
             linestyle=" ", marker="o", color="r", markersize=7,
             label="Versicolor")
    plt.plot(virginica["SepalLengthCm"], virginica["SepalWidthCm"],
             linestyle=" ", marker="o", color="k", markersize=7,
             label="Virginica")
    plt.xlabel("SepalLengthCm")
    plt.ylabel("SepalWidthCm")
    plt.title("SL vs SW")
    plt.legend()

[Figure: scatter plot titled "SL vs SW" with one color per species, x-axis SepalLengthCm]
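The split-then-plot pattern can be tried on a tiny table. The sketch below uses a made-up six-row DataFrame, assuming the same column names as Iris.csv, and loops over the species instead of writing three separate plot calls.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with the column layout assumed for Iris.csv
data = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

colors = {"Iris-setosa": "g", "Iris-versicolor": "r", "Iris-virginica": "k"}

plt.figure()
for species, color in colors.items():
    subset = data.loc[data["Species"] == species]  # rows of one species
    plt.plot(subset["SepalLengthCm"], subset["SepalWidthCm"],
             linestyle=" ", marker="o", color=color, label=species)

plt.xlabel("SepalLengthCm")
plt.ylabel("SepalWidthCm")
plt.title("SL vs SW")
legend = plt.legend()  # without this call the labels are never drawn
legend_labels = [text.get_text() for text in legend.get_texts()]
```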
Assignment: Do scatter plots using all the combinations and find the 2 important variables that can help in separating the 3 different species.

Features: SL, SW, PL, PW

Number of combinations, i.e. 4C2 = 6:

    SL  SW
    SL  PL
    SL  PW
    SW  PL
    SW  PW
    PL  PW

Conclusion: PetalLengthCm and PetalWidthCm are the features that are most important.

[Figure: scatter plot by species, x-axis PetalLengthCm]
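The six pairs can also be listed programmatically. A minimal sketch using the standard library, with the short feature names from the notes:

```python
from itertools import combinations

features = ["SL", "SW", "PL", "PW"]

# All unordered pairs of two different features: 4C2 = 6
pairs = list(combinations(features, 2))
for x_name, y_name in pairs:
    print(x_name, "vs", y_name)
```

Each pair could then be passed to a plotting call to produce the scatter plots for the assignment.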
We have another matplotlib function called scatter() that plots a scatter plot directly.

    plt.scatter(setosa["PetalWidthCm"], setosa["SepalWidthCm"],
                marker="o", color="g", label="setosa")

    <matplotlib.collections.PathCollection at 0x...>

[Figure: scatter plot of setosa, x-axis from 0.1 to 0.6, y-axis from 2.5 to 4.5]
Using Seaborn for bivariate data analysis

With Seaborn we can do a scatter plot.
As parameters we mention the data we are going to use.

    sns.scatterplot(data=data, x="SepalLengthCm", y="SepalWidthCm")

[Figure: scatter plot, x-axis SepalLengthCm]

From this we are not able to analyze anything.

We can add the class variable as another dimension.
We use the parameter hue for this.
Here the 3D plot is represented in 2D.

Here hue takes categorical data (the class variable).

We can directly plot two numerical columns (feature variables) for all the different classes,
i.e. here we plot SepalLengthCm vs SepalWidthCm for setosa, versicolor and virginica.

    sns.scatterplot(data=data, x="SepalLengthCm", y="SepalWidthCm", hue="Species")
    <AxesSubplot:xlabel='SepalLengthCm', ylabel='SepalWidthCm'>

[Figure: scatter plot colored by species (setosa, versicolor, virginica), x-axis SepalLengthCm]

When we are using SepalLengthCm and SepalWidthCm we can only separate setosa.

Here we can write if conditions to separate the species, i.e.
1) if SepalLengthCm is < 6 and PetalWidthCm < 0.8 then it is setosa
2) if PetalWidthCm < 1.5 and > 0.8 then versicolor, but we have a few mistakes
3) if PetalWidthCm > 1.5 then virginica, but we have a few mistakes
4) So here SepalLengthCm and PetalWidthCm are more important than SepalWidthCm

This is called Exploratory Data Analysis (EDA).
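The if conditions can be written as a small function. This is only a sketch: the function name classify is hypothetical, and the thresholds (6, 0.8, 1.5) are approximate values read off the plots, so some flowers will be misclassified.

```python
def classify(sepal_length_cm, petal_width_cm):
    """Rule-of-thumb species guess; thresholds are approximate, not fitted."""
    if sepal_length_cm < 6 and petal_width_cm < 0.8:
        return "Iris-setosa"
    if 0.8 < petal_width_cm < 1.5:
        return "Iris-versicolor"   # a few virginica fall in this range too
    if petal_width_cm > 1.5:
        return "Iris-virginica"    # a few versicolor fall in this range too
    return "unknown"               # values the three rules do not cover

print(classify(5.1, 0.2))
print(classify(6.0, 1.3))
print(classify(6.5, 2.0))
```

The "unknown" branch makes the gaps in the rules visible, for example a petal width of exactly 1.5.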
Univariate analysis of data

Since it is univariate, we use one variable.
We will use the PDF (probability density function).
In Seaborn we have a function called distplot(),
i.e.

    sns.distplot(data["PetalLengthCm"])

[Figure: histogram with density curve, x-axis PetalLengthCm, y-axis Density]

If we smooth the histogram we get the PDF.

Similarly we can check the PDF for all feature variables, but since we do not know which data corresponds to which species we will not be able to classify.
So first we divide the data and then do the analysis with the PDF.

    setosa = data.loc[data["Species"] == "Iris-setosa"]
    versicolor = data.loc[data["Species"] == "Iris-versicolor"]
    virginica = data.loc[data["Species"] == "Iris-virginica"]

    sns.distplot(setosa["PetalLengthCm"])
    sns.distplot(versicolor["PetalLengthCm"])
    sns.distplot(virginica["PetalLengthCm"])

[Figure: three density curves, one per species, x-axis PetalLengthCm, y-axis Density]

With just PetalLengthCm we can separate setosa from the other flowers:
if PetalLengthCm is between 1 and 2 then setosa.
As we analyze the other features we see that there is a lot of overlapping.

Conclusion:
    most important: PetalLengthCm
    next: PetalWidthCm
    next: SepalLengthCm
    worst: SepalWidthCm

Inform the client that SepalLengthCm and SepalWidthCm are not useful.
We can use if conditions to create a model to distinguish the species.
We can also make a machine learning model using this.
Note: distplot() will be removed soon, so instead we will use another function, displot().

displot(): the default is a histogram.
If we add the parameter kind="kde" we get the PDF instead of the histogram.

    sns.displot(data["PetalLengthCm"], kind="kde")

[Figure: density curve, x-axis PetalLengthCm, y-axis Density]

The default is a histogram:

    sns.displot(data=data, x="PetalLengthCm", hue="Species")

[Figure: histogram colored by species, x-axis PetalLengthCm, y-axis Count]

    sns.displot(data=data, x="PetalLengthCm", hue="Species", kind="kde")

[Figure: density curves colored by species, x-axis PetalLengthCm, y-axis Density]

As we analyze each feature variable we see that there is a lot of overlapping.

Conclusion:
    most important: PetalLengthCm
    next: PetalWidthCm
    next: SepalLengthCm
    worst: SepalWidthCm


To see visually whether data is balanced or not

The Iris data set is a balanced data set:

    data = pd.read_csv(r"C:\Iris.csv")

    data["Species"].value_counts()

    Iris-setosa        50
    Iris-versicolor    50
    Iris-virginica     50
    Name: Species, dtype: int64

If we want to see visually whether categorical data is balanced or not, we can use a bar graph.

matplotlib command:

    plt.bar(data["Species"].unique(), [50, 50, 50])

[Figure: bar graph with bars for setosa, versicolor and virginica]

    plt.bar(data["Species"].unique(), data["Species"].value_counts())

[Figure: bar graph with bars for setosa, versicolor and virginica]

Bar graph: 1-dimensional plot
    x-axis: unique values of the categorical data
    y-axis: count of each category

[Figure: three bars of height 50 for setosa, versicolor and virginica; balanced]

If the bars are of the same height (even) then the data is balanced.
If the bars are uneven then the data is unbalanced.

[Figure: bars of uneven height; unbalanced]
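A sketch of the balance check on a made-up species column (nine rows, not the real Iris data). Taking both the labels and the heights from the same value_counts() result is a choice made here so that each bar is guaranteed to line up with its own count.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical column with three rows per class
species = pd.Series(["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
                    name="Species")

counts = species.value_counts()       # index = category, value = number of rows
is_balanced = counts.nunique() == 1   # balanced when every class has the same count

plt.figure()
bars = plt.bar(counts.index, counts.values)
plt.ylabel("count")
print(counts.to_dict(), is_balanced)
```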
Seaborn command to plot a bar graph
This is more beautiful, with colors differentiating the species.

    sns.barplot(x=data["Species"].unique(), y=data["Species"].value_counts())

[Figure: colored bar graph, three bars of height 50, y-axis count]

We can reduce or increase the figure size using figsize, which takes the shape in tuple format:

    plt.figure(figsize=(width, height))

ex:

    plt.figure(figsize=(5, 5))
    sns.barplot(x=data["Species"].unique(), y=data["Species"].value_counts())

[Figure: the same bar graph at 5 x 5 size]

The face color (or paper color) can be changed:

    plt.figure(figsize=(5, 5), facecolor="k")

[Figure: the same bar graph on a black background]
Histogram: A histogram is a graph showing frequency distributions. It is a graph showing the number of observations within each given interval.

A histogram is used on a numerical data column.
A histogram is a univariate analysis.

Bar graph vs histogram:
    Bar graph: univariate analysis on categorical data
    Histogram: univariate analysis on numerical data

x-axis: numerical data
y-axis: number of observations within each interval (frequency of data)

[Figure: histogram; the bar height is the frequency (bin height), the bar width is the bin width, and the x-axis shows the numerical data with the bin edges]
If we are given a dataset:
1. Sort the values in ascending order
2. Decide how many bins we want
3. Find the max value and the min value
   bin width = (max - min) / no. of bins

ex: 6 5 4 3 1 2 7 8 9

1. 1 2 3 4 5 6 7 8 9 (sorted)
2. no. of bins = 4
3. max = 9, min = 1

   bin width = (9 - 1) / 4 = 2

   bin edges: 1, 1+2, 1+2+2, 1+2+2+2, 1+2+2+2+2
            = 1 3 5 7 9

   1 <= values < 3 : 2
   3 <= values < 5 : 2
   5 <= values < 7 : 2
   7 <= values <= 9 : 3
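The same arithmetic in plain Python, as a sketch. It assumes the common convention that every bin includes its left edge and excludes its right edge, except the last bin, which also includes the maximum.

```python
values = [6, 5, 4, 3, 1, 2, 7, 8, 9]
n_bins = 4

lo, hi = min(values), max(values)
bin_width = (hi - lo) / n_bins                     # (9 - 1) / 4
edges = [lo + i * bin_width for i in range(n_bins + 1)]

counts = [0] * n_bins
for v in values:
    # Index of the bin that v falls into; the maximum is clamped into the last bin
    index = min(int((v - lo) // bin_width), n_bins - 1)
    counts[index] += 1

print(bin_width)  # 2.0
print(edges)      # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts)     # [2, 2, 2, 3]
```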
If we choose a smaller number of bins then the bin width will be larger.
With a larger number of bins the bin width will be smaller.

When we want to put the PDF over the histogram, we make the number of bins very high, so the bin width becomes very small, and we use KDE (kernel density estimation) to plot the PDF.

    import pandas as pd

    data = pd.read_csv(r"C:\Iris.csv")

    import matplotlib.pyplot as plt

    plt.hist(data["SepalLengthCm"], bins=6, rwidth=0.97)

    (array([16., 36., 37., 33., 21.,  7.]),
     array([4.3, 4.9, 5.5, 6.1, 6.7, 7.3, 7.9]),
     <BarContainer object of 6 artists>)

[Figure: histogram of SepalLengthCm with 6 bins, x-axis from 4.5 to 8.0, y-axis from 10 to 35]

rwidth: it gives a space between each bin.
By default the bars are at full width (no gap).

The output will be two arrays and the histogram plot:
    bin heights
    bin edges
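To see the return values on something small, here is a sketch that calls plt.hist() on a made-up list of nine integers and unpacks the three results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

values = [6, 5, 4, 3, 1, 2, 7, 8, 9]  # made-up data

plt.figure()
# rwidth=0.9 shrinks each bar to 90% of its bin, leaving a visible gap
heights, edges, bars = plt.hist(values, bins=4, rwidth=0.9)

print(heights.tolist())  # [2.0, 2.0, 2.0, 3.0]
print(edges.tolist())    # [1.0, 3.0, 5.0, 7.0, 9.0]
print(len(bars))         # 4
```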

We can use Seaborn to plot a histogram.
Default bins here: 9

    import seaborn as sns

    sns.histplot(data=data, x="SepalLengthCm")
    <AxesSubplot:xlabel='SepalLengthCm', ylabel='Count'>

[Figure: histogram of SepalLengthCm, x-axis from 4.5 to 8.0, y-axis Count from 10 to 35]

In the Seaborn histogram we can see the x-axis and y-axis labels and clear lines showing each bin.

Here we see overlapping, but we do not know which data corresponds to which Iris species.

When we use hue="Species", we add another dimension to the 1D histogram and do bivariate analysis.
Seaborn gives the plot vs species in a single line of code, instead of creating sub-datasets as in matplotlib.

    sns.histplot(data=data, x="SepalLengthCm", bins=7, hue="Species")
    <AxesSubplot:xlabel='SepalLengthCm', ylabel='Count'>

[Figure: histogram of SepalLengthCm colored by species, x-axis from 4.5 to 8.0, y-axis Count from 10 to 35]

In matplotlib, if we want to plot a similar plot:

    setosa = data.loc[data["Species"] == "Iris-setosa"]
    versicolor = data.loc[data["Species"] == "Iris-versicolor"]
    virginica = data.loc[data["Species"] == "Iris-virginica"]

    plt.hist(setosa["PetalWidthCm"], bins=6, rwidth=0.98, label="setosa")
    plt.hist(versicolor["PetalWidthCm"], bins=6, rwidth=0.98, label="versicolor")
    plt.hist(virginica["PetalWidthCm"], bins=6, rwidth=0.98, label="virginica")
    plt.legend()
We can plot a histogram using displot, as the default plot of displot is a histogram.

    plt.figure(figsize=(15, 10))
    sns.displot(data=data, x="SepalWidthCm", bins=10, hue="Species")

[Figure: histogram of SepalWidthCm colored by species (Iris-setosa, Iris-versicolor, Iris-virginica)]
Violin plot: A violin plot is a statistical representation of numerical data. It is similar to a box plot, with the addition of the PDF on each side.

[Figure: vertical violin plot marking UB = Q3 + 1.5 IQR, Q3, Q2 (median), Q1 and LB = Q1 - 1.5 IQR, with the PDF drawn on each side]

This can be done horizontally also.

[Figure: horizontal violin plot marking LB, Q1, Q2 (median), Q3 and UB]

The PDF is symmetrical on both sides of the box plot.
This is a univariate analysis.
This uses numerical data.
With the addition of hue (categorical data) we can make it act like a bivariate analysis, with numerical data vs categorical data.

[Figure: violin plots of a numerical variable for each category; bivariate analysis]
    sns.violinplot(data=data, x="PetalLengthCm", size=10)
    <AxesSubplot:xlabel='PetalLengthCm'>

[Figure: horizontal violin plot, x-axis PetalLengthCm]

    sns.violinplot(data=data, x="PetalLengthCm", y="Species", hue="Species", size=10)
    <AxesSubplot:xlabel='PetalLengthCm', ylabel='Species'>

[Figure: horizontal violin plots for Iris-setosa, Iris-versicolor and Iris-virginica, x-axis PetalLengthCm]

We can change the axis:

    sns.violinplot(data=data, x="Species", y="PetalLengthCm", hue="Species", size=10)
    <AxesSubplot:xlabel='Species', ylabel='PetalLengthCm'>

[Figure: vertical violin plots for Iris-setosa, Iris-versicolor and Iris-virginica, y-axis PetalLengthCm]
Count plot: This is similar to a bar graph.
Takes categorical data.
Univariate analysis.
Unlike the bar graph, we do not need to mention the x-axis and y-axis; we just need to mention the data.

    sns.countplot(x=data["Species"])
    <AxesSubplot:xlabel='Species', ylabel='count'>

[Figure: count plot with three bars of height 50 for Iris-setosa, Iris-versicolor and Iris-virginica]

If we just give the data, it will identify the categorical data and plot based on the unique values.
Pie chart: A pie chart is a circular statistical graph which is divided into slices to illustrate numerical proportion.
This is used on discrete numerical data (i.e. limited values).
This is a univariate analysis.

ex: if we have marks data

            math  hin  bio  eng
    Name     60    20   10   90

    plt.pie([60, 20, 10, 90], labels=["math", "hin", "bio", "eng"])

A lot of info about the axis is printed here.

[Figure: pie chart with slices math, hin, bio and eng]
Here, to remove all the info we get after running this command, we use plt.show():

    plt.pie([60, 20, 10, 90], labels=["math", "hin", "bio", "eng"])
    plt.show()

[Figure: pie chart with slices math, hin, bio and eng]

We can print the percentage on the pie chart using the parameter autopct (auto percentage),
here autopct="%0.1f%%".
Here 0.1 represents 1 digit after the decimal point, so 0.0 is no digit after the decimal point.

    plt.pie([60, 20, 10, 90], labels=["math", "hin", "bio", "eng"],
            autopct="%0.1f%%")
    plt.show()

[Figure: pie chart with percentages: math 33.3%, hin 11.1%, bio 5.6%, eng 50.0%]

We can also make the wedges move away from the center (explode) using the explode parameter.

    plt.pie([60, 20, 10, 90], labels=["math", "hin", "bio", "eng"],
            autopct="%0.1f%%", explode=[0.1, 0, 0, 0])
    plt.title("Marks")
    plt.show()

[Figure: pie chart titled "Marks" with the math wedge pulled out: math 33.3%, hin 11.1%, bio 5.6%, eng 50.0%]

We can increase the size using the radius parameter,
i.e.

    plt.pie([60, 20, 10, 90], labels=["math", "hin", "bio", "eng"],
            autopct="%0.1f%%", explode=[0.1, 0, 0, 0], radius=3)
    plt.title("Marks")
    plt.show()

[Figure: larger pie chart titled "Marks": math 33.3%, hin 11.1%, bio 5.6%, eng 50.0%]

A pie chart cannot be used for continuous data.
It can be used for discrete numerical data or categorical data.
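Putting the pie chart parameters together, a sketch using the marks from the example. With autopct set, plt.pie() is assumed here to return three lists (wedges, label texts, percentage texts), which the code unpacks to read back the printed percentages.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

marks = [60, 20, 10, 90]
subjects = ["math", "hin", "bio", "eng"]

plt.figure()
wedges, label_texts, pct_texts = plt.pie(
    marks,
    labels=subjects,
    autopct="%0.1f%%",       # one digit after the decimal point
    explode=[0.1, 0, 0, 0],  # push the first wedge away from the center
)
plt.title("Marks")

percentages = [t.get_text() for t in pct_texts]
print(percentages)  # ['33.3%', '11.1%', '5.6%', '50.0%']
```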
Regression plot: A regression plot, as the name suggests, creates a regression line between 2 parameters and helps to visualize their linear relationship.
This is bivariate analysis.
This is numerical data vs numerical data.
This is a scatter plot with a regression line on it.

regplot: we will learn more about how to calculate the regression line in machine learning.

    sns.regplot(data=data, x="SepalLengthCm", y="SepalWidthCm")
    <AxesSubplot:xlabel='SepalLengthCm', ylabel='SepalWidthCm'>

[Figure: scatter plot with regression line, x-axis SepalLengthCm from 4.5 to 8.0, y-axis SepalWidthCm from 2.0 to 4.5]
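For a rough idea of what the regression line is, the sketch below fits a straight line by least squares with NumPy and draws it over a scatter plot. This is an illustration of the idea, not how Seaborn computes it internally; regplot also draws a confidence band, which is left out here. The points are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical points lying close to a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Degree-1 least-squares fit returns the slope and the intercept
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

plt.figure()
plt.scatter(x, y)                  # the data points
plt.plot(x, predicted, color="r")  # the fitted regression line
plt.xlabel("x")
plt.ylabel("y")
```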
Summarization of plots learned so far

Numerical (univariate):
    box plot
    histogram
    violin plot
    dist plot
    pie chart (discrete data)

Categorical (univariate):
    bar graph
    count plot
    pie chart

Numerical vs numerical:
    scatter plot
    line plot
    regression plot

Numerical vs categorical (Seaborn gives this functionality):
    box plot by adding hue
    violin plot by adding hue
    histogram by adding hue
