Data Visualization Graphs for ML

Uploaded by

omvati343
Copyright
© © All Rights Reserved

Introduction to Matplotlib

• Matplotlib is a low-level graph plotting library in Python
that serves as a visualization utility.
• Matplotlib was created by John D. Hunter.
• Matplotlib is open source and we can use it freely.
• Matplotlib is mostly written in Python; a few segments
are written in C, Objective-C and JavaScript for platform
compatibility.
Installation of Matplotlib
• Python distributions such as Anaconda (which also ships the
Spyder IDE) already have Matplotlib installed in them.
• The import matplotlib statement is used to import the
library.
• The version of library installed can be checked using
__version__ attribute.

import matplotlib
print(matplotlib.__version__)
Pyplot
• Most of the Matplotlib utilities lie under the pyplot submodule, and
are usually imported under the plt alias:

• Syntax:
import matplotlib.pyplot as plt
Marker Description
'o' Circle
'*' Star
'.' Point
',' Pixel
'x' X
'X' X (filled)
'+' Plus
'P' Plus (filled)
's' Square
'D' Diamond
'd' Diamond (thin)
'p' Pentagon
'H' Hexagon
'h' Hexagon
'v' Triangle Down
'^' Triangle Up
'<' Triangle Left
'>' Triangle Right
'1' Tri Down
'2' Tri Up
'3' Tri Left
'4' Tri Right
'|' Vline
'_' Hline
Color Syntax Description
'r' Red
'g' Green
'b' Blue
'c' Cyan
'm' Magenta
'y' Yellow
'k' Black
'w' White
Line Syntax Description
'-' Solid line
':' Dotted line
'--' Dashed line
'-.' Dashed/dotted line
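The marker, color and line codes listed above can be combined into a single format string passed to the plot function. The following is a minimal sketch; the data values are made up for illustration, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt

# Illustrative values only; they are not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 6, 5]

fig, ax = plt.subplots()
# Format string: 'o' = circle marker, '--' = dashed line, 'r' = red
(line,) = ax.plot(x, y, "o--r")
ax.set_xlabel("x")
ax.set_ylabel("y")
# Call fig.savefig("demo.png"), or plt.show() in an interactive session, to view it.
```

Changing the string to, for example, "s:b" would draw square markers joined by a dotted blue line.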


Bar Chart
Trainer: Ms. Nidhi Grover Raheja
What is Bar Graph?
• A bar graph is a graph that shows data with rectangular bars whose
heights are proportional to the values that they represent.
• The bars in the graph can be shown vertically or horizontally.
• Bar graphs are also known as bar charts; they are a pictorial representation
of grouped data.
• A bar graph is an excellent tool for representing data that are independent of
one another and that do not need to be in any specific order.
• The bars give a visual display for comparing quantities in different
categories.
• A bar graph has two axes, horizontal and vertical, also called the x-axis
and y-axis, along with a title, labels, and a scale range.
Properties of Bar Graph
Some properties that make a bar graph unique and different from
other types of graphs are given below:

1. All rectangular bars should have equal width and equal
space between them.
2. The rectangular bars can be drawn horizontally or vertically.
3. The height of each rectangular bar is proportional to the value it
represents.
4. The rectangular bars must be on a common base.
Uses of Bar Graph
A bar graph is mostly used in mathematics and statistics. Some of the uses of
the bar graph are as follows:

1. Comparisons between different variables are easy and convenient.
2. It is the easiest diagram to prepare and does not require too much effort.
3. It is the most widely used method of data representation. Therefore, it is
used by various industries.
4. It is used to compare data sets that are independent of one
another.
5. It helps in studying patterns over long periods of time.
Parts of a Bar Graph
1. Title: The title explains what the graph is about.
2. Scale: The scale is the set of numbers that shows the units used on the bar graph.
3. Labels: Both the side and the bottom of the bar graph have a label that tells
what kind of data is shown. For a vertical bar graph, the x-axis describes what
each bar represents and the y-axis shows the numeric value of each bar.
4. Bars: The bars measure the data values.
5. Key: Explains any additional information included in the graph. Many times,
this will show different colors or symbols used to represent different
categories. It is needed only when data about more than one category are
shown in the graph.
6. Data values: Data values are the actual numbers for each data point.
When and how to use Bar Charts for Visual
Analysis
Bar charts are versatile and can answer many questions in visual analysis. They can
highlight the largest or smallest number in a set of data or show relationships
between values.
A good bar chart will follow these rules:
• The base starts at zero
• The axes are labeled clearly
• Colors are consistent and defined
• The bar chart does not display too many bars
When creating a bar chart, do not:
• Make each bar a different width
• Cram too many bars into subcategories
• Leave the axes unlabeled
Types of Bar Graphs
Bar Graphs are mainly classified into 4 types:

1. Vertical Bar Graph
2. Horizontal Bar Graph
3. Grouped Bar Graph
4. Stacked Bar Graph
Vertical Bar Graphs
• When the given data is represented vertically in a graph or chart with the help
of rectangular bars that show the measure of data, such graphs are known as
vertical bar graphs.
Horizontal Bar Graphs
• When the given data is represented horizontally by using rectangular bars
that show the measure of data, such graphs are known as horizontal bar graphs.
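As a rough illustration of the vertical and horizontal bar graphs described above, the sketch below draws both with Matplotlib. The category names and values are hypothetical, and the Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical categories and values, for illustration only.
categories = ["A", "B", "C", "D"]
values = [12, 7, 15, 9]

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(8, 3))

# Vertical bars: equal width, common base at zero, height proportional to the value.
vertical_bars = ax_v.bar(categories, values, width=0.6, color="b")
ax_v.set_title("Vertical bar graph")
ax_v.set_xlabel("Category")
ax_v.set_ylabel("Value")

# Horizontal bars: the same data with the axes swapped.
horizontal_bars = ax_h.barh(categories, values, height=0.6, color="g")
ax_h.set_title("Horizontal bar graph")
ax_h.set_xlabel("Value")
ax_h.set_ylabel("Category")

fig.tight_layout()
# Call fig.savefig("bars.png"), or plt.show() in an interactive session, to view it.
```

Both axes are labeled and every bar starts from zero, in line with the rules for a good bar chart given earlier.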
Scatter Plot
What is Scatter Plot?
• Scatter plots are probably the simplest kind of graph, and provide a
great way to visually look for relationships between two variables.
• They are an incredibly powerful chart type, allowing viewers to
immediately understand a relationship or trend.
• Like most other graph or chart types, a scatter plot has an X and a Y
axis. The X axis is the horizontal line with the independent variable and
the Y axis is the vertical line with the dependent variable.
• An even scale is created on both axes, and then a mark or dot is made
at the point that represents the intersection of the two coordinates.
Types of Scatter Plot
• A scatter plot helps find the relationship between two variables. This
relationship is referred to as a correlation.
• Based on the correlation, scatter plots can be classified as follows.
Scatter Plot for Positive Correlation
Scatter Plot for Negative Correlation
Scatter Plot for Null Correlation
Scatter Plot for Positive Correlation
• A scatter plot with increasing values of both variables can be said to
have a positive correlation.
• The scatter plot for the relationship between the time spent studying for an
examination and the marks scored can be referred to as having a positive
correlation.
Scatter Plot for Negative Correlation
• A scatter plot with an increasing value of one variable and a decreasing
value of the other variable can be said to have a negative correlation.
• For example, a scatter plot of the amount of wheat produced against the
respective price of wheat shows a negative correlation.
Scatter Plot for Null Correlation
• A scatter plot with no clear increasing or decreasing trend in
the values of the variables is said to have no correlation.
• Here the points are distributed randomly across the graph.
• For example, a scatter plot of the number of birds on a tree versus
the time of the day does not show any correlation.
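The three cases above can also be told apart numerically with the Pearson correlation coefficient, which is close to +1 for a positive correlation, close to -1 for a negative correlation, and near 0 when there is no clear trend. The sketch below is one way to compute it in plain Python; the helper name pearson_r and all of the data values are hypothetical and chosen only to mirror the three examples.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data, for illustration only.
hours_studied = [1, 2, 3, 4, 5, 6]
marks_scored = [35, 42, 50, 61, 68, 80]      # rises with hours studied
wheat_produced = [10, 20, 30, 40, 50, 60]
wheat_price = [30, 27, 25, 20, 18, 15]       # falls as production rises
hour_of_day = [6, 8, 10, 12, 14, 16]
birds_on_tree = [5, 2, 6, 3, 5, 3]           # no clear trend

r_positive = pearson_r(hours_studied, marks_scored)
r_negative = pearson_r(wheat_produced, wheat_price)
r_none = pearson_r(hour_of_day, birds_on_tree)
print(round(r_positive, 2), round(r_negative, 2), round(r_none, 2))
```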
Trend Line
• A trend line, often referred to as a line of best fit, is a line that is used
to represent the behavior of a set of data and to determine whether there is a
certain pattern in it.
• A trend line is an analytical tool used most often in conjunction with a
scatter plot (a two-dimensional graph of ordered pairs) to see if there
is a relationship between two variables.
Main purposes of a Trend Line
1. Determining if a set of points exhibits a positive trend, a negative
trend, or no trend at all.
2. Predicting unknown or future data points.
Trend Line Example
• This graph shows temperatures over the course of ten days. If you
were attempting to predict the temperature on the 11th day based
on this graph, a good estimate would be 70.5 degrees.
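A trend line of this kind can be fitted with ordinary least squares and then extended to predict the next point. The sketch below uses made-up temperatures for ten days; they are not the values from the slide's graph, so the prediction it produces is not expected to match the 70.5 degrees quoted above. The helper name fit_trend_line is hypothetical.

```python
def fit_trend_line(xs, ys):
    """Least-squares line y = slope * x + intercept through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical temperatures for days 1 to 10 (illustration only).
days = list(range(1, 11))
temperatures = [61.0, 62.5, 62.0, 64.0, 65.5, 65.0, 67.0, 68.5, 68.0, 70.0]

slope, intercept = fit_trend_line(days, temperatures)
day_11_estimate = slope * 11 + intercept

# A positive slope indicates a positive trend; a negative slope a negative trend.
print(f"slope={slope:.2f}, estimate for day 11={day_11_estimate:.1f}")
```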
[Figure: trend lines showing a positive trend, a negative trend, and no trend]
Box Plots and Violin Plots
Percentile
• A percentile shows where a specific value or data point stands
relative to the other values in the dataset.
• It is the point below which a particular percentage of the
values lies.
• For example, consider a dataset with the values [10, 30, 40, 70, 95].
Three of the five values (10, 30 and 40) are less than 70, so 70 is greater than
60% of the total values in the dataset and is therefore called
the 60th percentile.
• This simply means that 60% of the total values in the dataset are
below 70.
• Similarly, only one of the five values (10) is less than 30, so 30 is the
20th percentile under this definition. If values equal to 30 are counted as
well ("at or below"), 40% of the values qualify, which is why 30 is sometimes
reported as the 40th percentile.
Percentile Calculation
The formula for calculating percentile is as follows:
Percentile(X) = (Number of Values Less than “X” in the dataset / Total
Number or Total Count of Values in the dataset) × 100
where,
X = value for which the percentile is to be calculated.
Let us consider a small dataset example for illustration. If we consider a dataset :
data = [55, 43, 60, 68, 22, 15, 76, 88, 92, 96]
Firstly, we need to sort this data from smallest value to highest value. The sorted data will be :
sorted_data = [15, 22, 43, 55, 60, 68, 76, 88, 92, 96]
Now, suppose we want to find the percentile of the value 88. Let X = 88
Number of Values less than X in the dataset(sorted_data) = 7
Total Number of Values in the dataset(sorted_data) = 10
Therefore, according to the formula for percentile above,
Percentile(88) = (7 / 10) × (100)
Percentile(88) = 70
So, 88 is the 70th percentile in the dataset. This means that 70% of the values in the dataset are below 88.
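The calculation above can be sketched as a short function that applies the same formula (count of values strictly less than X, divided by the total count, times 100). The function name percentile_of is a hypothetical choice; sorting is not needed for the count itself and is shown only to mirror the worked steps.

```python
def percentile_of(values, x):
    """Percentage of values in the dataset that are strictly less than x."""
    count_below = sum(1 for v in values if v < x)
    return 100 * count_below / len(values)

data = [55, 43, 60, 68, 22, 15, 76, 88, 92, 96]
sorted_data = sorted(data)  # [15, 22, 43, 55, 60, 68, 76, 88, 92, 96]

print(percentile_of(sorted_data, 88))  # 70.0, since 7 of the 10 values are below 88
```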
What is Box and Whiskers Plot?
• A box plot is a diagram used to display the distribution of data.
• A box plot indicates the position of the minimum, maximum and median values
along with the position of the lower and upper quartiles.
• From this, the range, interquartile range and skewness of the data can be observed.
• Box plots are a useful way to compare two or more sets of data visually.
• In statistics, a box plot is used to provide a visual summary of data.
• The distribution of data is shown through the positions of the median and the
quartiles.
• From this, the spread and skew of the data can also be seen.
• Side-by-side box plots allow for two or more data sets to be compared in a graphical
form.
How to Read a Box Plot?
To read a box plot:
1. Read the minimum value in line with the first line.
2. Read the maximum value in line with the last line.
3. Read the lower quartile which is in line with the start of the box.
4. Read the upper quartile which is in line with the end of the box.
5. Read the median which is in line with the line inside the box.
Note:
• The spread of data refers to how spread out the numbers in the data are. Both
the range and interquartile range are used to describe the spread of data.
• The larger the range, the more spread the whole data is.
• The larger the interquartile range, the more spread the middle 50% of data is.
How to Construct a Box Plot?
To construct a box plot:
• Draw lines to indicate the position of the lower and upper quartiles.
• Connect these lines to make a box.
• Draw a line inside the box to indicate the position of the median.
• Draw lines to indicate the position of the minimum and maximum
and connect these lines to the box.
Example:
For example, construct a box plot for the following data:
Minimum = 7. Lower Quartile (Q1) = 10. Median (Q2) = 12. Upper
Quartile (Q3) = 16. Maximum = 20.

The positions of the minimum and maximum are shown with lines,
called whiskers. These whiskers are connected to the box portion of the
box plot.
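When the five summary values are already known, as in this example, Matplotlib can draw the box directly from them with the Axes.bxp method instead of computing them from raw data. The following is a minimal sketch; the label text is a hypothetical choice, outlier drawing is switched off because the example has none, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt

# Five-number summary from the example above.
summary = [{
    "label": "Example data",  # hypothetical tick label
    "whislo": 7,    # minimum (end of lower whisker)
    "q1": 10,       # lower quartile (start of the box)
    "med": 12,      # median (line inside the box)
    "q3": 16,       # upper quartile (end of the box)
    "whishi": 20,   # maximum (end of upper whisker)
    "fliers": [],   # no outliers in this example
}]

fig, ax = plt.subplots()
artists = ax.bxp(summary, showfliers=False)
ax.set_ylabel("Value")
# Call fig.savefig("boxplot.png"), or plt.show() in an interactive session, to view it.
```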
How to Construct a Box Plot from a List of
Data
• To construct a box plot from a list of data, first calculate the first,
second and third quartiles.
• These quartiles are found at the (n+1)/4, (n+1)/2 and 3(n+1)/4 positions,
where n is the number of data points in an ordered list.
• Plot these quartiles along with the minimum and maximum points
using lines and connect them to make a box.
Example
• For example, construct a box plot from the data in the list 1, 3,
5, 6, 6, 7, 9. There are 7 numbers in the list, so n = 7.
• Q1 is found at position (n+1)/4, which for n=7 is position (7+1)/4.
This equals 2 and so, Q1 is found at the second number in the
list. So, Q1 = 3.
• Q2 is found at position (n+1)/2, which for n=7 is position (7+1)/2.
This equals 4 and so, Q2 is found at the fourth number in the
list. So, Q2 = 6.
• Q3 is found at position 3(n+1)/4, which for n=7 is position 3×(7+1)/4.
This equals 6 and so, Q3 is found at the sixth number in the list.
So, Q3 = 7.
Box Plot Created!!
• The minimum is the smallest number in the list, which is 1.
• The maximum is the largest number in the list, which is 9.

A box plot is constructed by labelling the minimum and maximum points at the
whiskers of the plot. The lower and upper quartiles are plotted at the positions
of the start and end of the box. The median (Q2) is labelled with a line inside
the box.
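The worked example for the list 1, 3, 5, 6, 6, 7, 9 can be sketched in code using the same (n+1)/4, (n+1)/2 and 3(n+1)/4 positions. The function name five_number_summary is hypothetical, and this sketch assumes the positions come out as whole numbers, as they do here; lists where they do not would need an interpolation rule that the slides do not cover.

```python
def five_number_summary(values):
    """Minimum, Q1, Q2, Q3 and maximum using the (n+1)/4 position rule."""
    ordered = sorted(values)
    n = len(ordered)

    def value_at(position):
        # Positions are 1-based. Assumption: they are whole numbers.
        if position != int(position):
            raise ValueError("position is not a whole number; interpolation needed")
        return ordered[int(position) - 1]

    q1 = value_at((n + 1) / 4)
    q2 = value_at((n + 1) / 2)
    q3 = value_at(3 * (n + 1) / 4)
    return ordered[0], q1, q2, q3, ordered[-1]

print(five_number_summary([1, 3, 5, 6, 6, 7, 9]))  # (1, 3, 6, 7, 9)
```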
How to Compare Box Plots?
• To compare side-by-side box plots:
1. Compare the location of the median to compare the averages of the
data.
2. Compare the lengths from whisker to whisker (the range), which is
the spread of the data.
3. Compare the lengths of the boxes (the interquartile range), which is
the spread of the middle 50% of data.
4. Consider outliers and the skewness of the data.
Consistent means that the results are less spread out.
In the example, school 2 has the smaller range and interquartile range,
so it has the most consistent results.
• The image here compares the box plot with the probability density
function (PDF) of a standard normal distribution and shows how the IQR,
median, Q1, Q3, minimum and maximum are plotted.
• One important thing to notice here is the outliers. Overall, the outliers
make up around 0.7% (0.35% + 0.35%) of the total dataset.
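The 0.35% figure for each tail can be checked with Python's statistics.NormalDist. The sketch below assumes the common convention that the whiskers end at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR and that anything beyond them is an outlier; the slides do not state this rule explicitly, so treat the factor 1.5 as an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

q1 = standard_normal.inv_cdf(0.25)   # lower quartile, about -0.674
q3 = standard_normal.inv_cdf(0.75)   # upper quartile, about +0.674
iqr = q3 - q1

# Assumed outlier rule: beyond 1.5 * IQR from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

fraction_below = standard_normal.cdf(lower_fence)
fraction_above = 1 - standard_normal.cdf(upper_fence)
outlier_fraction = fraction_below + fraction_above

print(f"{fraction_below:.2%} + {fraction_above:.2%} = {outlier_fraction:.2%}")
```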
What is a violin plot?

A violin plot is a hybrid of a box plot and a kernel density plot, which shows
peaks in the data. It is used to visualize the distribution of numerical
data. Unlike a box plot, which can only show summary statistics, a violin plot
depicts both summary statistics and the density of each variable.
Violin plots have many of the same summary statistics as box plots:
• The white dot represents the median.
• The thick gray bar in the center represents the interquartile range.
• The thin gray line represents the rest of the distribution, except for
points that are determined to be “outliers” using a method that is
a function of the interquartile range.
• On each side of the gray line is a kernel density estimation
to show the distribution shape of the data.
• Wider sections of the violin plot represent a higher probability that
members of the population will take on the given value; the skinnier
sections represent a lower probability.
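A basic violin plot can be drawn with Matplotlib's violinplot method, as in the sketch below. The two groups are randomly generated for illustration, and the Agg backend is used only so the code runs without a display. Note that Matplotlib's own violinplot draws the density outline with optional median and extrema lines; the white dot and gray bars described above are a styling used by some other plotting libraries and are not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated groups, for illustration only.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=200)
group_b = rng.normal(loc=60, scale=10, size=200)

fig, ax = plt.subplots()
# Wider regions of each violin correspond to a higher estimated density.
parts = ax.violinplot([group_a, group_b], showmedians=True, showextrema=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("Value")
# Call fig.savefig("violin.png"), or plt.show() in an interactive session, to view it.
```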
