Data Visualization
Data Science
Data Science Process Stages:
Stage 4. Data Exploration
● Data exploration means exploring, visualizing, describing, and analysing the
characteristics of a dataset, such as its size, quantity, and accuracy, in order to
better understand the nature of the data and identify areas or patterns to
dig into further.
By using visual elements like charts, graphs, and maps, data visualization tools
provide an accessible way to explore and understand the nature, spread, trends,
outliers, and patterns in data.
1. Histogram
A histogram is a plot that shows the frequency distribution of a variable.
The histogram gives an insight into the underlying distribution of the variable, outliers, skewness, etc.
In Python, draw a histogram by calling the hist() function of the matplotlib library (matplotlib.pyplot.hist).
Example. Draw a histogram of the following data (a list of prices of commonly
sold items at AllElectronics).
The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
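The example above can be sketched in Python. This is a minimal sketch, not a definitive solution: the bin ranges 1-10, 11-20, and 21-30 are an assumption (the example does not fix a bin width), and the helper names bin_counts and plot_price_histogram are hypothetical. The counting uses only the standard library; the plotting helper imports matplotlib when it is called.

```python
# Prices of commonly sold items, copied from the example above (already sorted).
prices = [
    1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
    14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
    20, 20, 20, 20, 20, 20, 20,
    21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30,
]


def bin_counts(values, edges):
    """Count the values in each bin (edges[i], edges[i + 1]]."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(edges) - 1):
            if edges[i] < value <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Assumed bins: 1-10, 11-20, 21-30.
edges = [0, 10, 20, 30]
counts = bin_counts(prices, edges)
print(counts)


def plot_price_histogram(values):
    """Draw the same histogram with matplotlib (requires matplotlib)."""
    import matplotlib.pyplot as plt

    # Edges at x.5 so the integer prices 1-10, 11-20, 21-30 share one bar each.
    plt.hist(values, bins=[0.5, 10.5, 20.5, 30.5], edgecolor="black")
    plt.xlabel("Price")
    plt.ylabel("Count")
    plt.title("Prices of commonly sold items")
    plt.show()


# To draw the figure, call: plot_price_histogram(prices)
```

With these assumed bins, the tallest bar is the middle one (11-20). A different bin width would give a differently shaped histogram for the same data.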
Box and Whisker plot
A box plot is constructed from the interquartile range (IQR) and the minimum and maximum values. The IQR is the distance between the 3rd
quartile and the 1st quartile. The length of the box is equal to the IQR.
In Python, draw a box plot by calling boxplot() from the seaborn library.
● For example, the scores of 15 students in the first data science exam were
as follows: {77, 45, 66, 80, 90, 34, 89, 95, 66, 67, 45, 88, 80, 72, 70}
● Drawing the box and whisker plot for this dataset requires 5 values: the minimum,
the lower quartile, the median, the upper quartile, and the maximum. Start with the median:
1. Sort the values in ascending order.
34, 45, 45, 66, 66, 67, 70, 72, 77, 80, 80, 88, 89, 90, 95
2. The count of values is odd (15), so the median is the middle (8th) value, 72.
3. Find the lower quartile and the upper quartile. The lower quartile is the median of the 7 values below the median (34, 45, 45, 66, 66, 67, 70), which is 66. The upper quartile is the median of the 7 values above the median (77, 80, 80, 88, 89, 90, 95), which is 88.
4. Find the extreme values (the range values), that is, the minimum and the maximum.
- The minimum value is 34.
- The maximum value is 95.
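The steps above can be sketched in Python with the standard library. This is a minimal sketch under one assumption: the quartiles are taken as the medians of the lower and upper halves, with the overall median left out when the count is odd, as in the worked example. The function name five_number_summary is hypothetical.

```python
from statistics import median

# Exam scores of the 15 students from the example above.
scores = [77, 45, 66, 80, 90, 34, 89, 95, 66, 67, 45, 88, 80, 72, 70]


def five_number_summary(values):
    """Return min, Q1, median, Q3, and max using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


summary = five_number_summary(scores)
print(summary)
```

Other quartile conventions exist (plotting libraries often interpolate between values), so a library-drawn box may differ slightly from a hand calculation on other datasets.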
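The solution:
[Figure: box and whisker plot of the exam scores. Each of the four sections (lower whisker, lower half of the box, upper half of the box, upper whisker) covers 25% of the data.]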
The box and whisker plot shows that roughly 50% of the students have scores between 66 and 88 points.
In addition, about 75% scored 88 points or lower, and about 75% scored 66 points or higher. So, if your
test result falls somewhere in the lower whisker, you may need to study more.
Question: A clothing shop has two stores. The following datasets show the sales recorded in each
store each month. Interpret the results of the box and whisker plot for both stores.
Store 1:
250, 360, 90, 189, 580, 350, 200, 180, 266, 410, 190, 170.
Store 2:
520, 320, 450, 500, 120, 500, 630, 420, 210, 600, 230, 140.
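One possible way to approach the question is to compute the five values for each store and then draw both boxes side by side. The sketch below assumes the median-of-halves rule for the quartiles; the names box_values and plot_store_boxplots are hypothetical, and the plotting helper needs seaborn and matplotlib only when it is called.

```python
from statistics import median

# Monthly sales for each store, copied from the question above.
store_1 = [250, 360, 90, 189, 580, 350, 200, 180, 266, 410, 190, 170]
store_2 = [520, 320, 450, 500, 120, 500, 630, 420, 210, 600, 230, 140]


def box_values(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


for name, sales in (("Store 1", store_1), ("Store 2", store_2)):
    low, q1, med, q3, high = box_values(sales)
    print(f"{name}: min={low}, Q1={q1}, median={med}, Q3={q3}, max={high}, IQR={q3 - q1}")


def plot_store_boxplots(first, second):
    """Draw both box plots side by side (requires seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.boxplot(data=[first, second])
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["Store 1", "Store 2"])
    ax.set_ylabel("Monthly sales")
    plt.show()


# To draw the figure, call: plot_store_boxplots(store_1, store_2)
```

Under this quartile rule, Store 2 has the higher median (435 against 225) and the wider box (IQR 290 against 170.5), so its typical monthly sales are higher but also more spread out. The quartiles drawn by a plotting library may differ slightly because it may interpolate between values.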
Scatter plots
A scatter plot represents the relationship between two continuous variables as
individual data points in a 2D graph.
● For example, smoking is known to be correlated with lung cancer. Since, smoking
18*50 = 900
√900 = 30
r =30/30= 1
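The arithmetic above reads like the last step of a Pearson correlation coefficient calculation, although the data behind it are not shown here. The sketch below assumes that 18 and 50 are the sums of squared deviations of the two variables and that the numerator 30 is the sum of the products of their deviations. The data points are hypothetical, chosen only so that these sums come out the same; they are not the original data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: the deviations give sxx = 18, syy = 50, sxy = 30.
xs = [1, 4, 7]
ys = [0, 5, 10]

r = pearson_r(xs, ys)
print(r)
```

A value of r = 1 means the points lie exactly on an upward-sloping straight line, which is the case for these illustrative points.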
Correlation
Causation
● Causation means that changes in one variable bring about changes in the
other; there is a cause-and-effect relationship between the variables. The two
variables are correlated with each other and there is also a causal link
between them.
Correlation Vs Causation
A classic example: ice cream sales and drowning accidents tend to rise and fall together.
Does this mean that an increase in ice cream sales is a direct cause of
increased drowning accidents?
In other words: can we use ice cream sales to predict drowning accidents?
It is likely that these two variables are only incidentally correlated with each other.
Factors that can directly contribute to drowning accidents include:
● Unskilled swimmers
● Waves
● Lack of supervision
● Alcohol (mis)use