Week - 1 Day - 1 Descriptive Statistics
Day- 1
Descriptive Statistics
Sabyasachi Parida
Agenda
● Introduction to EDA
○ What is EDA?
○ Python Libraries for EDA
● Descriptive Statistics
○ Basic Descriptive Statistics
■ Mean, Median, Mode
■ Variance, Standard Deviation
■ Percentiles and Quartiles
○ Data Distribution
● Data Summarization
○ Summary statistics with Pandas (describe())
○ Grouping and aggregating data
● Data Cleaning and Preparation
○ Handling Missing Values
■ Identifying missing data
■ Imputation methods (mean, median, mode, interpolation)
■ Dropping missing values
○ Dealing with Outliers
■ Identifying outliers using IQR, Z-score
○ Data Transformation
■ Scaling and normalization
■ Log transformation
■ Encoding categorical variables
● Data Visualization
○ Univariate Analysis
■ Histograms
■ Bar plots
■ Box plots
○ Bivariate Analysis
■ Scatter plots
■ Line plots
■ Heatmaps
○ Multivariate Analysis
■ Pair plots
■ Correlation matrices
Introduction to EDA
What is EDA?
Exploratory Data Analysis (EDA) is a crucial step in the data science process that involves summarizing
and visualizing the main characteristics of a dataset.
The goal of EDA is to gain insights and understanding of the data before proceeding to more formal
modeling and hypothesis testing.
Python Libraries for EDA
● Pandas
● NumPy
● Matplotlib
● Seaborn
● Plotly
Descriptive Statistics
Basic Descriptive Statistics
Mean:
● The mean, often referred to as the average, is a measure of central tendency that summarizes a set
of numbers by identifying the central point within that set.
● It is calculated by dividing the sum of all the values in a dataset by the number of values.
● The mean provides a single value that represents the center of the data distribution.
Median:
The median is a measure of central tendency that represents the middle value in a dataset when the
values are arranged in ascending or descending order.
Unlike the mean, the median is not affected by extreme values (outliers), making it a more robust
measure for datasets with skewed distributions or outliers.
Mode:
The mode is a measure of central tendency that represents the most frequently occurring value in a
dataset.
Unlike the mean and median, which are measures of average and middle value respectively, the mode
identifies the value that appears most often.
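The three measures of central tendency above can be computed directly with Pandas. A minimal sketch on a made-up sample (the values are illustrative; 120 plays the role of an outlier):

```python
import pandas as pd

s = pd.Series([30, 35, 35, 40, 45, 50, 120])  # 120 is an extreme value

mean_val = s.mean()      # sum / count -> pulled upward by the outlier
median_val = s.median()  # middle value when sorted -> robust to the outlier
mode_val = s.mode()[0]   # most frequently occurring value

print(mean_val, median_val, mode_val)  # 50.71..., 40.0, 35
```

Note how the mean (≈50.7) sits well above the median (40) here: a single extreme value shifts the mean but not the median.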
Variance and Standard Deviation
Variance:
● Variance is a statistical measure that quantifies the spread or dispersion of a set of data points
around their mean.
● It indicates how much the individual data points in a dataset differ from the mean value of the
dataset.
● A higher variance indicates that the data points are more spread out from the mean, while a lower
variance indicates that they are closer to the mean.
Standard Deviation:
It quantifies how much the individual data points in a dataset deviate from the mean (average) of the
dataset.
The standard deviation is the square root of the variance, providing a measure of dispersion that is in the
same units as the original data, making it easier to interpret.
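A minimal NumPy sketch of both measures, using an illustrative sample. Note the `ddof` parameter: dividing by n gives the population variance, while dividing by n − 1 (Bessel's correction) gives the sample variance.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance (ddof=0): mean of squared deviations from the mean.
var = np.var(data)           # -> 4.0
std = np.sqrt(var)           # -> 2.0, in the same units as the data
# Sample variance divides by n-1 instead of n:
sample_var = np.var(data, ddof=1)
```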
Percentiles and Quartiles
A percentile is a measure used in statistics to indicate the value below which a given percentage of observations
in a group of observations falls.
Quartiles are statistical measures that divide a dataset into four equal parts, each containing 25% of the data
points.
First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of the data points lie.
Second Quartile (Q2): Also known as the median or the 50th percentile, it is the value below which 50% of the
data points lie.
Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of the data points lie.
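Quartiles are just the 25th, 50th, and 75th percentiles, so `np.percentile` computes all three at once. A small sketch (be aware that different software uses slightly different interpolation rules between data points; NumPy's default is linear interpolation):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Q1, Q2 (median), Q3 in one call
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # 3.25 5.5 7.75
```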
Example and Test
Data Distribution
Normal Distribution:
A symmetric bell-shaped distribution where the mean, median, and mode are equal and located at the center of the distribution.
Uniform Distribution:
All values in the dataset occur with equal probability, resulting in a flat and constant distribution.
Exponential Distribution:
Models the time between independent events occurring at a constant average rate; it is right-skewed with a long tail.
Binomial Distribution:
Gives the probability of observing a given number of successes in a fixed number of independent trials, each with the same probability of success.
Poisson Distribution:
Gives the probability of a given number of events occurring in a fixed interval, when events occur independently at a constant average rate.
Data Summarization
Summary Statistics with Pandas
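A minimal sketch of `describe()` and `groupby()` aggregation on a made-up dataset (the `dept` and `salary` column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["A", "A", "B", "B", "B"],
    "salary": [30, 50, 40, 60, 80],
})

# One-line summary: count, mean, std, min, quartiles, max
summary = df["salary"].describe()

# Grouped aggregation: one row per department
by_dept = df.groupby("dept")["salary"].agg(["mean", "count"])
print(by_dept)
```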
Data Cleaning and Preparation
Handling Missing Values
■ Identifying missing data
● How to identify missing data?
● Is null really missing data?
■ Imputation methods
● Mean, median, mode
● Forward fill / backward fill
● Interpolation
○ Linear / polynomial / spline / Gaussian / nearest neighbor, etc.
■ Dropping missing values
● When to drop / not drop?
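The identification, imputation, and dropping steps above can be sketched with Pandas on a small illustrative series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Identify: count missing entries per column
n_missing = df["x"].isna().sum()                 # -> 2

# Impute: several strategies side by side
mean_filled  = df["x"].fillna(df["x"].mean())    # mean imputation -> 3.0 in gaps
ffilled      = df["x"].ffill()                   # carry last valid value forward
interpolated = df["x"].interpolate("linear")     # fill gaps along a straight line

# Drop: remove rows containing any missing value
dropped = df.dropna()
```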
Dealing with Outliers
Example
InterQuartile Range (IQR)
Interquartile Range (IQR) is a measure of statistical dispersion, or how spread out the values in a data set are.
It is defined as the range between the first quartile (Q1) and the third quartile (Q3), and it captures the middle 50% of
the data.
IQR = Q3 − Q1
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
The IQR is a robust measure of variability that is less affected by outliers and skewed data than the range or standard
deviation.
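The IQR fences above translate directly into a few lines of Pandas. A sketch on a made-up sample where 95 is the obvious outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```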
Z-Score
The Z-score, also known as the standard score, is a statistical measure that describes the position of a raw score in terms of its distance from the mean of the dataset, measured in standard deviations: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
Why scale or normalize features?
Consistency: Different features in your dataset might have different units or scales. For instance, one feature could be in dollars while another could be in inches. Normalizing or scaling ensures that all features contribute equally to the model.
Improved Convergence: Many machine learning algorithms, particularly those that involve gradient descent (e.g., linear regression, neural networks), converge faster when the features are scaled.
Distance-Based Algorithms: Algorithms such as K-Nearest Neighbors (KNN) and clustering algorithms (e.g., K-Means) rely on distance metrics. Features on different scales can distort these distances.
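Z-scores also give a simple outlier test: points more than 2 or 3 standard deviations from the mean are commonly flagged. A sketch on illustrative data:

```python
import numpy as np

data = np.array([10.0, 12.0, 12.0, 13.0, 12.0, 11.0, 50.0])

# Standardize: subtract the mean, divide by the standard deviation
z = (data - data.mean()) / data.std()   # population std (ddof=0)

# Rule of thumb: |z| > 2 (or 3, for stricter filtering) marks an outlier
outliers = data[np.abs(z) > 2]
print(outliers)  # [50.]
```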
Normalization and Standardization
Normalization is a data preprocessing technique used to adjust the values of numeric columns in a dataset to a common scale, typically between 0 and 1.
Min-Max normalization scales the data to a fixed range, usually [0, 1]: x' = (x − min) / (max − min).
Standardization (Z-score scaling) instead rescales the data to have zero mean and unit standard deviation: x' = (x − μ) / σ.
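Both rescalings are one-liners in NumPy (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same thing for whole DataFrames). A sketch on illustrative values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max normalization: maps min -> 0 and max -> 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std()
```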
Log Transformation
Log transformation is a data transformation technique that can be applied to make highly skewed data
more normally distributed.
It is particularly useful for handling data with a long tail and for stabilizing the variance of a dataset.
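A minimal sketch on an illustrative right-skewed sample. `np.log1p` computes log(1 + x), which stays defined when the data contains zeros; `np.expm1` inverts it:

```python
import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])  # long right tail
logged = np.log1p(skewed)   # compresses the tail: values now span ~0.7 to ~6.9

# The transformation is invertible, so original values can be recovered
recovered = np.expm1(logged)
```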
Encoding Categorical Variables
Encoding categorical variables is an essential step in data preprocessing for machine learning models, as
most models require numerical input.
1. Label Encoding
2. One-Hot Encoding
3. Ordinal Encoding
4. Target Encoding
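The first two techniques can be sketched with plain Pandas (the `color` column is illustrative). Label encoding assigns an integer per category; one-hot encoding creates a binary column per category:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: each category gets an integer code (alphabetical by default)
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Label encoding implies an ordering that may not exist, so it suits tree models or genuinely ordinal data; one-hot encoding avoids that at the cost of extra columns.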
Label Encoding
One-Hot Encoding
Data Visualization
Univariate Analysis
Histogram
A histogram is a graphical representation of the distribution of a dataset. It is used to
visualize the frequency of data points within specified ranges (bins).
Used for Continuous Data.
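A minimal Matplotlib sketch on simulated continuous data (the Agg backend is set so the plot renders to a file without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")           # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=1000)   # simulated continuous data

fig, ax = plt.subplots()
ax.hist(data, bins=20, edgecolor="black")      # 20 bins of equal width
ax.set_xlabel("value")
ax.set_ylabel("frequency")
fig.savefig("histogram.png")
```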
Bar Plot:
A bar plot (or bar chart) is a graphical representation of categorical data with rectangular bars.
The length of each bar is proportional to the value or frequency of the category it represents.
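A sketch with made-up categorical data: `value_counts()` tallies the categories and `ax.bar` draws one bar per category:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Frequency of each category in an illustrative sample
counts = pd.Series(["cat", "dog", "cat", "bird", "dog", "cat"]).value_counts()

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)    # bar height = category frequency
ax.set_ylabel("frequency")
fig.savefig("bar_plot.png")
```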
Box Plot:
A box plot, also known as a box-and-whisker plot, is a standardized way of displaying the distribution of
data based on a five-number summary: minimum, first Quartile (Q1), median (Q2), third quartile (Q3),
and maximum.
It helps in visualizing the spread and skewness of the data, as well as identifying outliers.
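A sketch on illustrative data with one planted outlier. Matplotlib draws the whiskers at 1.5 × IQR by default, so points beyond them appear as individual "flier" markers:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

data = np.append(np.arange(1, 21), 60)   # values 1..20 plus an outlier at 60

fig, ax = plt.subplots()
result = ax.boxplot(data)                # whiskers at 1.5 * IQR by default
fig.savefig("box_plot.png")

# Points drawn beyond the whiskers
outlier_points = result["fliers"][0].get_ydata()
print(outlier_points)  # [60]
```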
Bivariate Analysis
Scatter Plot
● A scatter plot is a type of plot used to visualize the relationship between two continuous variables.
● Each point on the plot represents a single observation in the dataset, with one variable plotted on
the x-axis and the other variable plotted on the y-axis.
● Scatter plots are useful for identifying patterns, trends, and relationships between variables, such
as correlation or clustering.
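A sketch with simulated data where y depends roughly linearly on x, so the points fall along an upward-sloping band:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2 * x + rng.normal(0, 1, 100)   # roughly linear relationship plus noise

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("scatter.png")

r = np.corrcoef(x, y)[0, 1]         # strong positive correlation
```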
Line Plot
● A line plot is a type of plot that displays data points connected by straight lines.
● It is commonly used to visualize trends and patterns over time or any other ordered dimension.
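A sketch with made-up monthly values plotted against a date index, the typical use case for trends over time:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

dates = pd.date_range("2024-01-01", periods=12, freq="MS")       # month starts
sales = [5, 7, 8, 12, 15, 14, 18, 20, 19, 23, 25, 30]            # made-up values

fig, ax = plt.subplots()
ax.plot(dates, sales, marker="o")   # points connected by straight lines
ax.set_xlabel("month")
ax.set_ylabel("sales")
fig.savefig("line_plot.png")
```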
Heat Maps
A heatmap is a graphical representation of data where values in a matrix are represented as colors.
Heatmaps are particularly useful for visualizing the magnitude of relationships between two variables in
a dataset.
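A minimal sketch using plain Matplotlib (`imshow` maps each matrix cell to a color; `seaborn.heatmap` offers the same with cell annotations). The matrix values are illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

matrix = np.array([[1.0, 0.8, 0.1],
                   [0.8, 1.0, 0.3],
                   [0.1, 0.3, 1.0]])

fig, ax = plt.subplots()
im = ax.imshow(matrix, cmap="coolwarm", vmin=-1, vmax=1)  # cell value -> color
fig.colorbar(im)
fig.savefig("heatmap.png")
```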
Multivariate Analysis
Pair Plot
A pair plot, also known as a scatterplot matrix, is a grid of scatterplots that allows you to visualize pairwise relationships between multiple variables in
a dataset.
It provides a quick way to explore the correlation or association between variables by plotting each variable against every other variable.
Pair plots are particularly useful for understanding the relationships between multiple variables and identifying patterns or trends in the data.
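Pandas ships a scatterplot-matrix helper, so no extra library is needed (seaborn's `pairplot` is the common alternative). A sketch on simulated data where column `c` is deliberately tied to `a`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
})
df["c"] = df["a"] * 2 + rng.normal(scale=0.1, size=50)  # strongly tied to "a"

# One scatterplot per variable pair, histograms on the diagonal
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
```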
Correlation Matrices:
What is Correlation?
Correlation is a statistical measure that describes the strength and direction of a relationship between
two variables.
Types of Correlation:
❖ Pearson Correlation - captures linear relationships
❖ Spearman Correlation - captures monotonic relationships (rank-based)
● Each cell in the matrix represents the correlation coefficient between two variables, ranging from
-1 to 1.
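A sketch of a correlation matrix with Pandas on simulated data: `y` is built to depend linearly on `x`, while `z` is independent noise, so the matrix shows one strong coefficient and one near zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 3 * x + rng.normal(scale=0.5, size=200),  # linearly related to x
    "z": rng.normal(size=200),                     # unrelated noise
})

corr = df.corr(method="pearson")   # method="spearman" for rank correlation
print(corr.round(2))
```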