3rd Session. Slides

Uploaded by

Shathika Ananth
Copyright
© © All Rights Reserved
Data Viz

Statistics Analytics
Graphical
EDA
Session #3
“The greatest value of a picture is when it
forces us to notice what we never expected
to see.”

John Tukey
CSV
Spreadsheet
A Quick
Refresher
On
Statistics
Statistics

• Descriptive
  • Measures of Central Tendency: Mean, Median, Mode
  • Measures of Dispersion: Range, Standard Deviation, Variance, IQR
  • Measures of Distribution: Skewness, Kurtosis
• Inferential
  • Hypothesis Testing
  • Regression Analysis
56 , 1 , 1 , 7 , 10

Mean = (56 + 1 + 1 + 7 + 10) / 5 = 15

[Bar chart of the five values with the mean, 15, marked]
56 , 1 , 1 , 7 , 10

Median 1, 1, 7, 10, 56

Median = 7
56 , 1 , 1 , 7 , 10

Mode
Value – Count
56 – 1
1 – 2
7 – 1
10 – 1

Mode = 1
56 , 1 , 1 , 7 , 10

Mean 15
Median 7
Mode 1

Which is the best measure of central tendency?
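The three measures above can be checked with Python's standard `statistics` module. This is a minimal sketch on the slide data, not part of the original deck; the variable names are illustrative.

```python
import statistics

data = [56, 1, 1, 7, 10]  # the five values used on the slides

mean = statistics.mean(data)      # (56 + 1 + 1 + 7 + 10) / 5
median = statistics.median(data)  # middle value of the sorted list 1, 1, 7, 10, 56
mode = statistics.mode(data)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
```

One way to read the question on the slide: the single large value, 56, pulls the mean above four of the five observations, while the median is much less affected by it.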
56 , 1 , 1 , 7 , 10

Range = 56 - 1 = 55
56 , 1 , 1 , 7 , 10

Variance

[Bar chart of the five values with the mean, 15, marked, built up over several slides to show each value's squared deviation from the mean]

Squared deviations from the mean: 1681, 196, 196, 64, 25

Variance = Mean(1681, 196, 196, 64, 25) = (1681 + 196 + 196 + 64 + 25) / 5 = 432.4
56 , 1 , 1 , 7 , 10

Standard Deviation = √432.4 ≈ 20.8

• Measure of how dispersed the data is in relation to the mean.
• Small standard deviation indicates data are clustered tightly around the mean.
• Large standard deviation indicates data are more spread out.
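The range, variance and standard deviation worked out on the slides can be reproduced step by step. This is a minimal sketch; it assumes the population form (divide by N), which is what the 432.4 on the slides corresponds to, and shows the sample form (divide by N - 1) only for comparison.

```python
import math
import statistics

data = [56, 1, 1, 7, 10]
mean = statistics.mean(data)  # 15

value_range = max(data) - min(data)  # 56 - 1

# Squared deviation of each value from the mean, in data order.
squared_devs = [(x - mean) ** 2 for x in data]

# Population variance: the mean of the squared deviations (divide by N).
variance = sum(squared_devs) / len(data)
std_dev = math.sqrt(variance)

# Sample variance divides by N - 1 instead, so it is larger on the same data.
sample_variance = statistics.variance(data)

print(value_range, squared_devs, variance, round(std_dev, 1), sample_variance)
```

`statistics.pvariance(data)` and `statistics.pstdev(data)` give the same population figures in one call each.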
IQR (Interquartile Range)

Represents the middle 50% of the data

[Figure: IQR shown around the median]
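As a rough illustration, the IQR of the same five values can be computed with `statistics.quantiles`. The slides do not say which quartile convention they use, so the `method="inclusive"` choice here is an assumption; other conventions give different quartiles on a sample this small.

```python
import statistics

data = [56, 1, 1, 7, 10]

# quantiles(n=4) returns the three cut points Q1, Q2 (the median) and Q3.
# method="inclusive" is an assumed convention; the default "exclusive"
# method produces a different Q3 for these five values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # width of the middle 50% of the data

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
```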
Box & Whisker Plot (Box Plot)

[Figure: box plot with the median marked]
Skewness

Measure of the asymmetry of the distribution of a variable about its mean

Right-Skewed Data:
• In business, right-skewed data could indicate a small number of high-value customers or transactions.
• Strategies might focus on retaining these high-value entities while trying to shift more customers towards higher value.

Left-Skewed Data:
• If the distribution of purchase amounts is left-skewed, it suggests that most customers are making relatively high-value purchases.
• This can indicate a strong, high-value customer base.
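The sign of the skewness coefficient tells the two cases apart. The sketch below uses the population moment coefficient (third central moment divided by the 1.5 power of the variance), which is one common definition and an assumption here, since the slides give no formula. The left-skewed list is simply a mirrored copy of the slide data, made up for illustration.

```python
import statistics

def skewness(values):
    """Population moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [56, 1, 1, 7, 10]          # slide data: one large value, long right tail
left_skewed = [-x for x in right_skewed]  # mirrored copy (illustrative), long left tail

print(skewness(right_skewed))  # positive: tail on the right
print(skewness(left_skewed))   # negative: tail on the left
```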
Kurtosis

• Measure of the "tailedness" of the distribution of a variable
• Shape of a distribution's tails in relation to its overall shape
• In business terms: do sales spike, or do they stay stable?
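To make the "spike or stable" reading concrete, the sketch below compares excess kurtosis for two invented sales series. Both series and the function name are hypothetical. The formula is the population excess kurtosis (fourth central moment over squared variance, minus 3), one common convention that the slides do not specify.

```python
import statistics

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical daily unit sales, invented for illustration.
stable_sales = [98, 99, 100, 101, 102, 100, 99, 101]
spiky_sales = [100, 100, 100, 100, 100, 100, 100, 400]  # one spike day

print(excess_kurtosis(stable_sales))  # negative: light tails
print(excess_kurtosis(spiky_sales))   # positive: heavy tail from the spike
```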


Correlation

Statistical measure that describes the strength and direction of a relationship between two variables
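One widely used correlation measure is the Pearson coefficient, which ranges from -1 to +1: the sign gives the direction and the magnitude gives the strength of a linear relationship. The slides do not name a specific coefficient, so Pearson is an assumption here, and the used-car figures below are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical used-car records, invented for illustration.
age_years = [1, 2, 3, 4, 5]
price_lakh = [9.0, 8.1, 7.3, 6.0, 5.2]
mileage_km = [12000, 25000, 33000, 47000, 60000]

r_age_price = pearson_r(age_years, price_lakh)      # strong negative relationship
r_age_mileage = pearson_r(age_years, mileage_km)    # strong positive relationship
print(r_age_price, r_age_mileage)
```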
Exploratory Data Analysis (EDA)

Crucial process of performing initial investigations on data to discover patterns and to check assumptions with the help of summary statistics and graphical representations
Exploratory Data Analysis

• Meta Data
• Uni-Variate Analysis
• Multivariate Analysis
• Detect Missing Values
• Detect Outliers
• Feature Engineering
• Business Insights
Types of Analytics (in order of increasing Value and Complexity)

• Descriptive: What happened
• Diagnostic: Why something happened
• Predictive: What will happen
• Prescriptive: What should be done
Analysis Vs Analytics

• Analysis: Past and Present Focus; Looks at Historical Data
• Analytics: Future Oriented; Anticipates Trends
Univariate Analysis

• Explores each variable on its own in a data set
• Descriptive statistics describe and summarize data.
• Central tendency of the values.
• Check for the variability in the data
• Check the shape of data
• Check for missing values
• Check for outliers
• Check Skewness and Kurtosis
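Several items on this checklist can be gathered into one per-variable summary. The sketch below is one possible shape for such a helper. The function name, the use of `None` for a missing value, and the sample column are all assumptions made for illustration.

```python
import statistics

def univariate_summary(column):
    """Summarise a single variable; None marks a missing value."""
    present = [v for v in column if v is not None]
    return {
        "count": len(present),
        "missing": len(column) - len(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "std_dev": statistics.pstdev(present),  # population form
        "min": min(present),
        "max": max(present),
    }

# Hypothetical column: the slide data plus one missing entry.
summary = univariate_summary([56, 1, None, 1, 7, 10])
for name, value in summary.items():
    print(name, value)
```

A large gap between the mean and the median, or a maximum far from the rest of the values, is a prompt to look at the distribution's shape and at possible outliers.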
Multivariate Analysis

• Involves examining relationships between two or more variables simultaneously
• Correlation Analysis
• Inferential Analysis
• Interaction Analysis
• Visualize relationships using: Scatterplots, Heatmaps, Pair-Plots etc.
• Check for missing values and outliers
Graphical/Visual Analysis Of Statistical Measures
✓ What are the central tendencies?
  ✓ Mean
  ✓ Median
  ✓ Mode(s)
✓ What are the dispersion measures?
  ✓ Range
  ✓ IQR
  ✓ Standard Deviation
  ✓ Variance
✓ What are the shapes of the distributions?
  ✓ Uniform
  ✓ Symmetric or skewed?
• Are there any missing values? How do you treat them?
• Are there outliers or extreme values? How do you treat them?
Metadata

Filename: Catpics#1.jpg
Owner: Bella
Created: 1st May 2024
Camera: ………
……..
Missing
Data
A Few Reasons for Missing Values

• Past data might get corrupted due to improper maintenance.

• Observations are not recorded for certain fields.

• There might be a failure in recording the values due to human error.

• The user has intentionally not provided the values.

• Item nonresponse: the participant refused to respond.


Missing Values Treatment

• Deletion
• Imputation

Impute Missing Values
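The two treatments can be sketched on a small column. The slides name only "Deletion" and "Imputation"; filling with the mean or the median is one common imputation choice and an assumption here, as is the sample data.

```python
import statistics

# Hypothetical column with two missing entries (None), invented for illustration.
prices = [4.5, None, 6.0, 5.5, None, 30.0]

# Deletion: drop the entries where the value is missing.
deleted = [p for p in prices if p is not None]

# Imputation: fill each gap with a statistic of the observed values.
fill_mean = statistics.fmean(deleted)
fill_median = statistics.median(deleted)
mean_imputed = [fill_mean if p is None else p for p in prices]
median_imputed = [fill_median if p is None else p for p in prices]

print(deleted)
print(mean_imputed)
print(median_imputed)
```

In this example the extreme value 30.0 pulls the mean fill well above the typical prices, which is why the median is often preferred when the column is skewed or contains outliers.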
Outlier Analysis
Detection and Treatment
Outlier
What are Anomalies?

• They are abnormal observations that lie far away from other values.

• They are easy to identify when the observations are just a bunch of numbers in one dimension.

• But when there are thousands of observations or multiple dimensions, we need more clever ways to detect those values.

Generally, an anomaly is an outcome or value that deviates from what is expected, but the exact criteria for what determines an anomaly can vary from situation to situation.
Most Common Causes Of Outliers In A Dataset

• Errors
• Natural
• Intentional
[Figure: iPhone Sales - Website on 1st April. x-axis: Time, from 2:00 AM to 12:00 PM the next day in 2-hour steps; y-axis: # of Units Sold, 0 to 1000]
Applications of Anomaly Detection

Fraud Detection In Credit Card Transactions

• Credit card fraud is a socially relevant problem and poses a great threat to businesses all around the world.

• In order to detect fraudulent transactions made by the wrongdoer, Outlier Detection Algorithms are applied.

[Figure labels: ₹ 1,50,000; ₹ 10,000; Anomaly]
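The slides do not say which outlier detection algorithm is meant. As one simple illustration, the sketch below flags transaction amounts using the modified z-score, which is built on the median and the median absolute deviation (MAD) so that a single extreme amount does not hide itself by inflating the spread. The amounts, the function name and the 3.5 cut-off (a commonly quoted rule of thumb) are assumptions; real fraud detection uses many more features than the amount.

```python
import statistics

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD.

    Assumes MAD is non-zero, i.e. fewer than half the values equal the median.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical transaction amounts in rupees, invented for illustration.
amounts = [9500, 10000, 10500, 9800, 10200, 9900, 10100, 150000]

scores = modified_z_scores(amounts)
flagged = [a for a, z in zip(amounts, scores) if abs(z) > 3.5]
print(flagged)
```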
Outliers

• Univariate Outlier: Univariate outliers can be found when looking at a distribution of values in a single feature space.
• Multivariate Outlier: Multivariate outliers can be found in an n-dimensional space (of n features).
• Central rectangle spans the first quartile to the third quartile (the interquartile range or IQR).
• Segment inside the rectangle shows the median.
• “Whiskers” above and below the box show the locations of the minimum and maximum, excluding any outliers.
• Outliers are marked beyond the whiskers.
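A common convention for deciding which points fall "beyond the whiskers" is the 1.5 × IQR rule: anything below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is marked as an outlier. The slides do not state the rule they use, so both the 1.5 factor and the inclusive quartile method below are assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# The five values from the earlier slides.
outliers = iqr_outliers([56, 1, 1, 7, 10])
print(outliers)
```

With Q1 = 1 and Q3 = 10 the fences sit at -12.5 and 23.5, so only 56 is marked as an outlier.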
Google Facets

• Link: https://pair-code.github.io/facets/index.html
• Load the Dataset in LMS : “Session#3_Dataset _Used_Car.xlsx”
• Perform the following Univariate Analysis and Multivariate Analysis
Disclaimer

A few of the graphs and visualizations used in the presentation are not my own and have been taken from the following sources.
Sources: Google Images, The Economist, William S. Cleveland and Robert McGill 1984, Data-To-Viz, Oracle, Datavizcatalogue, blogs.sas.com, steema.com, Consumer Reports, The Visual Display of Quantitative Information by E. Tufte, Python Graph Gallery, Lucidchart, Storytelling with Data by Cole Nussbaumer Knaflic, Good Charts by Scott Berinato
