0 Boxplot

Uploaded by

darkvaderkx007
Copyright
© © All Rights Reserved

A box plot (also known as a box-and-whisker plot) uses boxes and lines to depict
the distributions of one or more groups of numeric data.


Box limits indicate the range of the central 50% of the data, with a
central line marking the median value.
A box plot is a graphical representation of the distribution of a dataset. It
displays key summary statistics such as the median, quartiles, and
potential outliers in a concise and visual manner.

A box plot is a type of chart that depicts a group of numerical data
through their quartiles.
A quartile is each of the four equal groups into which a population can be
divided according to the distribution of values of a particular variable.
Elements of Box Plot
A box plot gives a five-number summary of a set of data, which consists of:
• Minimum – It is the minimum value in the dataset excluding the
outliers.
• First Quartile (Q1) – 25% of the data lies below the First (lower)
Quartile.
• Median (Q2) – It is the mid-point of the dataset. Half of the
values lie below it and half above.
• Third Quartile (Q3) – 75% of the data lies below the Third
(Upper) Quartile.
• Maximum – It is the maximum value in the dataset excluding the
outliers.
Note: The box plot shown in the above diagram is a perfect plot with no
skewness. The plots can have skewness and the median might not be at the center
of the box.

The area inside the box (50% of the data) is known as the Interquartile
Range (IQR). The IQR is calculated as:
IQR = Q3 - Q1

Outliers are the data points that lie below the lower limit or above the upper
limit. With the commonly used 1.5 × IQR rule, the limits are calculated as:
Lower limit = Q1 - 1.5 × IQR
Upper limit = Q3 + 1.5 × IQR

The values below the lower limit or above the upper limit are considered
outliers, and the minimum and maximum values are taken from the points
that lie within these limits.
How to create a box plot?
Let us take some sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches –
100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110.
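To draw a box plot for the given data, we first arrange the data
in ascending order and then find the minimum, first quartile, median,
third quartile and the maximum.

Sorted data: 100, 110, 110, 110, 120, 120, 130, 140, 140, 150, 170, 220.
Since there are 12 values, the median is the mean of the two central values:
(120 + 130) / 2 = 125.

To find the First Quartile we take the first six values and find their median:
(110 + 110) / 2 = 110. Similarly, the Third Quartile is the median of the last
six values: (140 + 150) / 2 = 145.

Note: If the total number of values is odd, we exclude the median while
calculating Q1 and Q3. Here, since there were two central values, we included
them. Now, we need to calculate the Interquartile Range.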
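The remaining steps for the cricket data can be sketched in Python. This is a minimal sketch, not a definitive implementation: the function name `five_number_summary` is made up for illustration, and the outlier limits assume the commonly used 1.5 × IQR rule, which is an assumption here. The quartiles use the median-of-halves method described in the note above.

```python
from statistics import median

# Runs scored by the team in the 12 matches (data from the example above).
runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]


def five_number_summary(values):
    """Return box plot statistics using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]          # values before the median position
    upper_half = data[half + n % 2:]  # skips the median itself when n is odd

    q1, q2, q3 = median(lower_half), median(data), median(upper_half)
    iqr = q3 - q1

    # Assumption: outlier limits follow the common 1.5 * IQR rule.
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr

    inside = [v for v in data if lower_limit <= v <= upper_limit]
    outliers = [v for v in data if v < lower_limit or v > upper_limit]

    return {
        "minimum": min(inside),  # smallest value that is not an outlier
        "q1": q1,
        "median": q2,
        "q3": q3,
        "maximum": max(inside),  # largest value that is not an outlier
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }


summary = five_number_summary(runs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions, the score of 220 lies above the upper limit, so the upper whisker would end at 170 and 220 would be drawn as a separate outlier point.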
What is a Histogram?
A histogram is a graphical representation of the frequency distribution of
a continuous series using rectangles.

The x-axis of the graph represents the class interval, and the y-axis shows the
various frequencies corresponding to different class intervals.

A histogram is a two-dimensional diagram in which the width of the rectangles
shows the width of the class intervals, and the length of the rectangles depicts the
corresponding frequency.

There are no gaps between two consecutive rectangles because histograms are
drawn when data are in the form of the frequency distribution of a continuous
series.

No histogram can be drawn for a data set in the form of a discrete series. This
distinguishes histograms from bar graphs, which can be plotted for both
discrete and continuous series.

The major difference between a histogram and a bar graph is that the former is
two-dimensional; i.e., both the width and length of the rectangles are used for
comparison, whereas the latter is one-dimensional, which means only the length
of the rectangles is used for comparison. A histogram is used to determine the
value of the Mode of a data set in the form of a continuous series.

Types of Histogram

Histograms of Frequency Distribution are of two types:

• Histogram of Equal Class Intervals
• Histogram of Unequal Class Intervals

1. Histogram of Equal Class Intervals:


When histograms are drawn based on data with equal class intervals,
they are known as Histograms of equal class intervals.
The histogram of equal class intervals includes rectangles of equal
width; the length of each rectangle is proportional to the frequency
of its class interval.

Example of Histogram of Equal Class Intervals:


Present the following information in the form of a Histogram:
2. Histogram of Unequal Class Intervals:

When histograms are drawn based on the data with unequal class intervals,
they are known as Histograms of unequal class intervals.

A histogram of unequal class intervals includes rectangles of different
widths. Therefore, before drawing a histogram in the case of unequal class
intervals, the frequency distribution has to be adjusted.
Solution
1. It can be seen clearly that the given class intervals are unequal. So, before
plotting the histogram, frequencies have to be adjusted.
2. Determine the class with the smallest interval, i.e., 10-15. Thus, the lowest
class interval in the given frequency distribution is 5.
3. Formulate the Adjusted Table as shown below:

In the above table, the class interval is calculated as the difference between the
upper-class limit and lower-class limit, i.e.,
15-10=5, 20-15=5, 25-20=5, 30-25=5, 40-30=10, 60-40=20, and 80-60=20.
4. Plotting Histogram:
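The frequency adjustment from step 3 can be sketched in Python. The class limits below are the ones listed in the solution, but the frequencies are hypothetical placeholders, since the original table is not reproduced here. The sketch assumes the usual adjustment rule, adjusted frequency = frequency × (lowest class interval / class interval), and the function name `adjusted_frequencies` is made up for illustration.

```python
# Class limits taken from the solution above.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 40), (40, 60), (60, 80)]

# Hypothetical frequencies, used only to illustrate the calculation.
frequencies = [4, 10, 16, 8, 12, 8, 4]


def adjusted_frequencies(class_limits, freqs):
    """Scale each frequency by (lowest class interval / its class interval)."""
    widths = [upper - lower for lower, upper in class_limits]
    lowest = min(widths)  # 5 for the classes above
    return [f * lowest / w for f, w in zip(freqs, widths)]


adjusted = adjusted_frequencies(classes, frequencies)
for (lower, upper), f, adj in zip(classes, frequencies, adjusted):
    print(f"{lower}-{upper}: frequency={f}, adjusted frequency={adj}")
```

Classes that already have the lowest class interval keep their frequency unchanged. Wider classes are scaled down so that the area of each rectangle, rather than its length alone, reflects the frequency.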

What are bins in a histogram?


Bins (or intervals) are consecutive, non-overlapping intervals of a variable. The
data points are grouped into these bins, and the height of the bar in the histogram
represents the frequency of data points in each bin.
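As a rough illustration of binning, the sketch below counts how many values fall into each of several equal-width bins and prints a simple text histogram. The data, the bin settings, and the function name `bin_counts` are hypothetical, and each bin is assumed to include its lower limit and exclude its upper limit.

```python
def bin_counts(values, start, width, n_bins):
    """Count the values in each of n_bins consecutive, non-overlapping bins.

    Bin i covers the half-open interval [start + i*width, start + (i+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical marks of ten students.
marks = [12, 15, 21, 22, 25, 31, 33, 34, 38, 47]
counts = bin_counts(marks, start=10, width=10, n_bins=4)

for i, count in enumerate(counts):
    lower = 10 + i * 10
    print(f"{lower}-{lower + 10}: {'#' * count}")
```

Changing `width` changes how coarse or fine the histogram looks, which is why the choice of bins affects how the distribution is perceived.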
How can outliers be identified using a histogram?
Outliers can be identified as data points that fall into bins far from where most of
the data are concentrated. They often appear as isolated bars on the far left or
right of the histogram.
Can histograms be used for comparing multiple datasets?
While histograms are primarily used for a single dataset, multiple histograms can
be plotted side by side or overlaid (using transparency) to compare distributions.
However, other plots like box plots or density plots might be more suitable for
direct comparison.
Scatter Plot/Scatter Diagram
What is a Scatter Diagram?
The Scatter Diagram Method is a simple and attractive way of studying
correlation: a bivariate distribution is represented diagrammatically to
determine the nature of the correlation between the variables.

This method gives the investigator/analyst a visual idea of the nature of the
association between the two variables. It is the simplest method of studying the
relationship between two variables as there is no need to calculate any numerical
value.

How to draw a Scatter Diagram?


The two steps required to draw a Scatter Diagram or Dot Diagram are as follows:
1. Plot the values of the given variables (say X and Y) along the X-axis and
Y-axis, respectively.
2. Show these plotted values on the graph by dots. Each of these dots represents
a pair of values.

Interpretation of Scatter Diagram


After observing the pattern of dots, one can know the presence or absence of
correlation and its type. Besides, it also gives an idea of the nature and intensity of
the relationship between the two variables.

The scatter diagram can be interpreted in the following ways:

1. Perfect Positive Correlation


If the points of the scatter diagram fall on a straight line and have a
positive (upward) slope, then the correlation is said to be perfectly positive;
i.e., r = +1.
2. Perfect Negative Correlation
If the points of the scatter diagram fall on a straight line and have a
negative (downward) slope, then the correlation is said to be perfectly
negative; i.e., r = -1.

3. Positive Correlation
When the points of the scatter diagram cluster around a straight line
(upward slope from left to right), then the correlation is said to be positive.
4. Negative Correlation
When the points of the scatter diagram cluster around a straight line
(downward/negative slope), then the correlation is said to be negative.

5. No Correlation
When the points of the scatter diagram are scattered in a haphazard manner, then
there is zero or no correlation.
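Although the scatter diagram itself needs no calculation, the values r = +1 and r = -1 mentioned above can be checked numerically. The sketch below assumes that r refers to Pearson's correlation coefficient. The data sets and the function name `pearson_r` are hypothetical and serve only as an illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]    # points on a line with upward slope
falling = [10 - 3 * v for v in x]  # points on a line with downward slope
haphazard = [2, 5, 1, 5, 2]        # no upward or downward trend

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_haphazard = pearson_r(x, haphazard)

print("rising:", round(r_rising, 3))
print("falling:", round(r_falling, 3))
print("haphazard:", round(r_haphazard, 3))
```

Points clustered around a line, rather than lying exactly on it, would give values strictly between -1 and +1.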
How to interpret a Scatter Diagram?
While interpreting a scatter diagram, the following points should be taken into
consideration:
Dense or Scattered Points: If the plotted points are close to each other, then
the analyst can expect a high degree of correlation between the two variables.
However, if the plotted points are widely scattered, then the analyst can expect
a poor correlation between the variables.

Trend or No Trend: If the points plotted on the scatter diagram show any
trend either upward or downward, then it can be said that the variables are
correlated. However, if the plotted points do not show any trend, then it can be
said that the variables are uncorrelated.

Upward or Downward Trend: If the plotted points show an upward trend
rising from the lower left-hand corner of the graph to the
upper right-hand corner, then the correlation is positive. It means that the two
variables move in the same direction. However, if the plotted points show a
downward trend from the upper left-hand corner of the graph to the lower
right-hand corner, then the correlation is negative. It means that the two
variables move in opposite directions.

Perfect Correlation: If the points plotted on the scatter diagram lie on a
straight line and have a positive slope, then it can be said that the correlation is
perfect and positive. However, if the points plotted lie on a straight line and
have a negative slope, then it can be said that the correlation is perfect and
negative.

Merits of Scatter Diagram


1. Simplicity: Scatter Diagram is a simple and non-mathematical method to
study correlation between two variables.

2. First Step: It is the first step of investigating the relationship between two
variables.

3. Easily Understandable: One can easily understand and interpret scatter
diagrams. Besides, with just a single glance at the diagram, one can easily tell
the presence or absence of correlation.
4. Not Affected by Extreme Items: The size of extreme values does not affect
the scatter diagram. It is a quality which is not present in most mathematical
methods.

Demerits of Scatter Diagram


1. Rough Measure: Scatter diagram only gives a rough idea of the degree and
nature of correlation between the given two variables. Therefore, it is only a
qualitative expression rather than a quantitative expression.

2. Non-mathematical Method: Unlike other methods of correlation, the Scatter
Diagram Method does not indicate the exact numerical value of correlation.

3. Unsuitable for Large Observations: If there are more than two variables, it
becomes difficult to draw a scatter diagram.

What is a scatter diagram?


A scatter diagram, also known as a scatter plot, is a graphical representation of
the relationship between two quantitative variables. Each point on the scatter
diagram represents an observation in the dataset, with one variable plotted on
the x-axis and the other on the y-axis.
How can you determine the direction of the relationship using a scatter
diagram?
Positive Relationship: If the points tend to rise from left to right, the relationship
is positive, indicating that as one variable increases, the other variable also
increases.
Negative Relationship: If the points tend to fall from left to right, the relationship
is negative, indicating that as one variable increases, the other variable
decreases.
No Relationship: If the points are randomly scattered with no discernible
pattern, there is no clear relationship between the variables.

What are outliers in a scatter diagram?


Outliers are points that lie far away from the overall pattern of the data. They
may indicate unusual observations or errors in the data.
Can scatter diagrams show non-linear relationships?
Yes, scatter diagrams can show non-linear relationships. If the points form a
pattern that is curved or follows a non-linear trend, it indicates a non-linear
relationship between the variables.
How can you enhance a scatter diagram to better interpret correlation?
You can enhance a scatter diagram by:
• Adding a Trend Line: A line of best fit (regression line) helps in
visualizing the overall trend.
• Color-Coding Points: Use different colors to represent different groups or
categories within the data.
• Annotating Outliers: Mark or label outliers to highlight unusual
observations.
• Using Jitter: Add slight random noise to the points to better visualize
dense clusters (helpful for large datasets).
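The trend line mentioned in the first point can be computed with ordinary least squares. The sketch below is one possible way to do it, using hypothetical data and a made-up function name `fit_trend_line`. Plotting libraries and statistics packages offer equivalent built-in tools.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_trend_line(hours, scores)
print(f"trend line: y = {slope:.2f} * x + {intercept:.2f}")

# Points on the fitted line, which could be drawn over the scatter diagram.
line_points = [(x, slope * x + intercept) for x in hours]
```

A positive slope corresponds to an upward trend in the diagram and a negative slope to a downward trend.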
CORRELATION HEATMAP
A correlation heatmap is a graphical tool that displays the correlation
between multiple variables as a color-coded matrix. It's like a color chart
that shows us how closely related different variables are.
A correlation heatmap is a heatmap that shows a 2D correlation matrix between
two discrete dimensions, using colored cells to represent the data, usually on a
monochromatic scale.
The values of the first dimension appear as the rows of the table, while those of
the second dimension appear as the columns.
The color of each cell reflects the strength of the correlation between the
corresponding pair of variables. This makes correlation heatmaps ideal for data
analysis, since they make patterns easily readable and highlight the differences
and variation in the same data.
A correlation heatmap, like a regular heatmap, is accompanied by a colorbar, which
makes the data easily readable and comprehensible.
The following steps show how a correlation heatmap can be produced:
1. Import all required modules first.
2. Import the file where your data is stored.
3. Plot a heatmap.
4. Display it using matplotlib.
For plotting, the heatmap() method of the seaborn module will be used.

Syntax: heatmap(data, vmin, vmax, center, cmap, ...)
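A minimal sketch of these steps is given below, assuming pandas, seaborn and matplotlib are installed. The DataFrame, its column names and the output file name are hypothetical stand-ins for your own data. The sketch builds the data in memory instead of importing a file, and saves the figure instead of displaying it so that it also runs without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data; in practice this could come from pd.read_csv(...).
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "practice_tests": [0, 1, 1, 2, 3, 3],
    "hours_of_tv": [6, 5, 5, 3, 2, 1],
    "exam_score": [50, 55, 61, 68, 74, 80],
})

# Pairwise correlation matrix of the numeric columns.
corr = df.corr()

# vmin/vmax fix the color scale to the possible range of a correlation,
# center=0 places the midpoint of the colormap at zero correlation,
# annot=True writes the coefficient inside each cell.
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm", annot=True)
ax.set_title("Correlation heatmap")

plt.tight_layout()
plt.savefig("correlation_heatmap.png")  # or plt.show() in an interactive session
```

A diverging colormap such as "coolwarm" is used here so that positive and negative correlations get different colors. A monochromatic colormap would also work if only the strength of the correlation matters.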
