Data Visualization

The document discusses key concepts in data exploration including data visualization techniques like histograms, box plots, and scatter plots. It also covers descriptive statistics and correlation. Histograms, box plots, and scatter plots are useful for exploring patterns in data. Descriptive statistics and correlation help analyze relationships between variables.

Uploaded by

yasmine hussein
Copyright
© All Rights Reserved

Principles and Practices of Data Science
Data Science Process Stages:
Stage 4. Data Exploration
● Data exploration means exploring, visualizing, describing, and analysing the
characteristics of a dataset, such as its size, quantity, and accuracy, in order to
better understand the nature of the data and identify areas or patterns to
dig into further.

● It uses statistical techniques and data visualizations to examine the
data at a high level.

● Businesses determine which data is most important and which may
distort the analysis and therefore should be removed.

● It helps reduce the time spent on less valuable analysis by
selecting the right path forward from the start.
Data Visualization
Data visualization is the graphical representation of information and data.

By using visual elements like charts, graphs, and maps, data visualization tools
provide an accessible way to explore and understand the nature, spread, trends,
outliers, and patterns in data.
1. Histogram
A histogram is a plot that shows the frequency distribution of a numeric variable.

The histogram gives insight into the underlying distribution of the variable, including its outliers and skewness.

(In Python) To draw a histogram, call the hist() function of the matplotlib library.
Example 1. Draw a histogram of the following data (a list of prices of commonly
sold items at AllElectronics).

The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

The frequency table shows the prices in the first column and the
frequency in the second column.
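
The frequency table and the histogram for the AllElectronics prices can be sketched in Python as follows. This is a minimal sketch, not the original slide's code: the names prices, frequency, and plot_histogram and the bin width of 10 are assumptions made for illustration.

```python
from collections import Counter

# Prices of commonly sold items at AllElectronics (copied from the example).
prices = [
    1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
    15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
    20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
    25, 25, 25, 25, 25, 28, 28, 30, 30, 30,
]

# Frequency table: price -> number of occurrences, ordered by price.
frequency = dict(sorted(Counter(prices).items()))

# Print the table with the price in the first column and the frequency in the second.
for price, count in frequency.items():
    print(f"{price:>5} {count:>9}")


def plot_histogram(values, bin_width=10):
    """Draw a histogram with equal-width bins (bin_width is an assumed choice)."""
    # Imported here so the frequency table above works without matplotlib installed.
    import matplotlib.pyplot as plt

    # Bin edges 0, 10, 20, ... extended far enough to cover the largest value.
    edges = list(range(0, max(values) + bin_width + 1, bin_width))
    plt.hist(values, bins=edges, edgecolor="black")
    plt.xlabel("Price")
    plt.ylabel("Frequency")
    plt.title("Prices of commonly sold items")
    plt.show()
```

Calling plot_histogram(prices) opens the plot. A different bin_width changes how coarse the picture is, so it is worth trying a few values.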
Example 2. Given the grades of 25 math students, draw a histogram for
this dataset.
Solution:
Box and Whisker Plot
● A box plot is a graphical representation of numerical data that can be used to
understand the variability of the data and the existence of outliers.

● A box plot is built from the following descriptive statistics:

1. Lower quartile, median, and upper quartile

2. Lowest and highest values

- The box plot is constructed using the IQR and the minimum and maximum values. The IQR (interquartile
range) is the distance between the 3rd quartile and the 1st quartile. The length of the box equals the IQR.

(In Python) To draw a box plot, call boxplot() from the seaborn library.

● For example, the scores of 15 students in the first data science exam were
as follows: {77, 45, 66, 80, 90, 34, 89, 95, 66, 67, 45, 88, 80, 72, 70}
● To draw the box and whisker plot for this dataset, 5 values are required.
Start with the median:
1. Order the values in ascending order.

34, 45, 45, 66, 66, 67, 70, 72, 77, 80, 80, 88, 89, 90, 95

2. In this case the count of values is odd, so the median is the middle value, 72.

34, 45, 45, 66, 66, 67, 70, [72], 77, 80, 80, 88, 89, 90, 95
3. Find the lower quartile and the upper quartile.

- Lower quartile: the median of the values below the median, here 66.

- Upper quartile: the median of the values above the median, here 88.

34, 45, 45, [66], 66, 67, 70, 72, 77, 80, 80, [88], 89, 90, 95

4. Find the extreme values (the range): the minimum and the maximum value.
- The minimum value is 34.

- The maximum value is 95.

[34], 45, 45, 66, 66, 67, 70, 72, 77, 80, 80, 88, 89, 90, [95]
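The solution:

[Figure: box and whisker plot of the exam scores. Each of the four sections (the two whiskers and the two halves of the box) holds 25% of the data.]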

The box and whisker plot shows that 50% of the students have scores between 66 and 88 points.

In addition, 75% scored lower than 88 points, and 50% have test results above 72 (the median). So, if your
test result falls somewhere in the lower whisker, you may need to study more.
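
The five values used in the worked example can be computed with a short Python sketch. It follows the hand method from the steps above (the quartiles are the medians of the lower and upper halves). The function names five_number_summary and draw_box_plot are assumptions made for illustration. Plotting libraries such as seaborn usually compute quartiles by interpolation, so the box edges they draw may differ slightly from the hand-calculated values.

```python
from statistics import median

# Exam scores of the 15 students from the example.
scores = [77, 45, 66, 80, 90, 34, 89, 95, 66, 67, 45, 88, 80, 72, 70]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (middle value skipped when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


def draw_box_plot(values):
    """Draw a horizontal box plot of the values with seaborn."""
    # Imported here so the summary above works without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.boxplot(x=values)
    plt.xlabel("Exam score")
    plt.show()


minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1  # length of the box
print(minimum, q1, med, q3, maximum, iqr)
```

The same function can be applied to any list of numbers, and draw_box_plot(scores) shows the plot itself.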

Question: A clothing shop has two stores. The following datasets show the monthly sales recorded
in each store. Interpret the results of the box and whisker plot for both stores.

Store 1:

250, 360, 90, 189, 580, 350, 200, 180, 266, 410, 190, 170.

Store 2:

520, 320, 450, 500, 120, 500, 630, 420, 210, 600, 230, 140.
Scatter plots
A scatter plot represents the relationship between two continuous variables as
individual data points in a 2D graph.

For example, in a scatter plot of engine size against price, we can interpret that there is
a linear relationship between engine size and price: cars with bigger engines
tend to be costlier than cars with small engines.
Descriptive Statistics

● Descriptive statistics summarize or describe features of a dataset, such as
measures of central tendency or measures of spread.

- Univariate analysis: examines only one variable and is used to
identify characteristics of a single trait without analyzing any relationships
or causation. For example, the mean age of university students.

- Bivariate analysis: examines the relationship between two variables
and attempts to link them through correlation. For example, testing the
relationship between a student's age and test score.
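
As a small univariate illustration, the sketch below summarizes the 15 exam scores from the box plot example with measures of central tendency and spread. It uses only Python's standard statistics module. The choice of measures and the name summary are assumptions made for illustration.

```python
from statistics import mean, median, stdev

# Exam scores of the 15 students from the box plot example.
scores = [77, 45, 66, 80, 90, 34, 89, 95, 66, 67, 45, 88, 80, 72, 70]

summary = {
    "count": len(scores),
    "mean": mean(scores),                 # central tendency
    "median": median(scores),             # central tendency
    "range": max(scores) - min(scores),   # spread
    "stdev": stdev(scores),               # spread (sample standard deviation)
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```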
Correlation
● When we look at two variables over time, correlation asks: if one variable changes, how does
the other variable change?

● For example, smoking is known to be correlated with lung cancer, since smoking
increases the chances of lung cancer.

● Correlation means a mutual connection between two or more sets of
data. In statistics, correlation is measured on bivariate data, that is, on two
random variables.

● The correlation coefficient measures the correlation in bivariate data. It
denotes how strongly two random variables are correlated with each other
and ranges from -1 to +1.
The correlation coefficient r is computed as:

r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )

- Example: find the correlation coefficient
between X and Y.

● Start by finding the sample means x̄ and ȳ.

● Calculate the distance of each data point from its mean.

● Complete the top (numerator) of the coefficient equation.

● Complete the bottom (denominator) of the coefficient equation.

● Multiply the results of the two expressions together:

18 * 50 = 900

● Then take the square root of the product:

√900 = 30

● Plug in the numbers for the numerator and denominator:

r = 30/30 = 1
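
The steps of the calculation can be sketched in Python. The X and Y values below are hypothetical: they were chosen only so that the intermediate sums match the numbers in the worked example (18, 50, and 30), and they are not taken from the original table. The function name pearson_r is also an assumption.

```python
from math import sqrt

# Hypothetical data, chosen so the intermediate sums equal 18, 50, and 30.
x = [1, 4, 7]
y = [5, 10, 15]


def pearson_r(xs, ys):
    """Correlation coefficient: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    # Step 1: sample means.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Step 2: distance of each data point from its mean.
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    # Step 3: numerator, the sum of the products of the distances.
    numerator = sum(a * b for a, b in zip(dx, dy))
    # Step 4: denominator, the square root of the product of the squared-distance sums.
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    denominator = sqrt(sum_sq_x * sum_sq_y)
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable has no variation")
    return numerator / denominator


r = pearson_r(x, y)
print(r)
```

A value of r close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.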
Causation
● Causation means that changes in one variable bring about changes in the
other; there is a cause-and-effect relationship between the variables. The two
variables are correlated with each other and there is also a causal link
between them.
Correlation Vs Causation
A classic example:

● During the summer, the sale of ice cream at a beach increases.

● At the same time, drowning accidents also increase.

Does this mean that the increase in ice cream sales is a direct cause of
the increase in drowning accidents?

In other words: can we use ice cream sales to predict drowning accidents?

The answer is: probably not.

It is likely that these two variables are correlated only because both rise during the summer, not because one causes the other.

What causes drowning then?

● Unskilled swimmers
● Waves
● Lack of supervision
● Alcohol (mis)use