Week - 1 Day - 1 Descriptive Statistics
Day- 1
Descriptive Statistics
Sabyasachi Parida
Agenda
● Introduction to EDA
○ What is EDA?
○ Python Libraries for EDA
● Descriptive Statistics
○ Basic Descriptive Statistics
■ Mean, Median, Mode
■ Variance, Standard Deviation
■ Percentiles and Quartiles
○ Data Distribution
● Data Summarization
○ Summary statistics with Pandas (describe())
○ Grouping and aggregating data
● Data Cleaning and Preparation
○ Handling Missing Values
■ Identifying missing data
■ Imputation methods (mean, median, mode, interpolation)
■ Dropping missing values
○ Dealing with Outliers
■ Identifying outliers using IQR, Z-score
○ Data Transformation
■ Scaling and normalization
■ Log transformation
■ Encoding categorical variables
● Data Visualization
○ Univariate Analysis
■ Histograms
■ Bar plots
■ Box plots
○ Bivariate Analysis
■ Scatter plots
■ Line plots
■ Heatmaps
○ Multivariate Analysis
■ Pair plots
■ Correlation matrices
Introduction to EDA
What is EDA?
Exploratory Data Analysis (EDA) is a crucial step in the data science process that involves summarizing
and visualizing the main characteristics of a dataset.
The goal of EDA is to gain insights and understanding of the data before proceeding to more formal
modeling and hypothesis testing.
Python Libraries for EDA
● Pandas
● NumPy
● Matplotlib
● Seaborn
● Plotly
Descriptive Statistics
Basic Descriptive Statistics
Mean:
● The mean, often referred to as the average, is a measure of central tendency that summarizes a set
of numbers by identifying the central point within that set.
● It is calculated by dividing the sum of all the values in a dataset by the number of values.
● The mean provides a single value that represents the center of the data distribution.
Median:
The median is a measure of central tendency that represents the middle value in a dataset when the
values are arranged in ascending or descending order.
Unlike the mean, the median is not affected by extreme values (outliers), making it a more robust
measure for datasets with skewed distributions or outliers.
Mode:
The mode is a measure of central tendency that represents the most frequently occurring value in a
dataset.
Unlike the mean and median, which are measures of average and middle value respectively, the mode
identifies the value that appears most often.
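The three measures of central tendency above can be computed directly with Pandas. A minimal sketch on a made-up sample (the values are illustrative; 120 plays the role of an outlier):

```python
import pandas as pd

s = pd.Series([30, 35, 35, 40, 45, 50, 120])  # 120 is an extreme value

mean_val = s.mean()      # sum / count -> pulled upward by the outlier
median_val = s.median()  # middle value when sorted -> robust to the outlier
mode_val = s.mode()[0]   # most frequently occurring value

print(mean_val, median_val, mode_val)  # 50.71..., 40.0, 35
```

Note how the mean (≈50.7) sits well above the median (40) here: a single extreme value shifts the mean but not the median.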
Variance and Standard Deviation
Variance:
● Variance is a statistical measure that quantifies the spread or dispersion of a set of data points
around their mean.
● It indicates how much the individual data points in a dataset differ from the mean value of the
dataset.
● A higher variance indicates that the data points are more spread out from the mean, while a lower
variance indicates that they are closer to the mean.
Standard Deviation:
It quantifies how much the individual data points in a dataset deviate from the mean (average) of the
dataset.
The standard deviation is the square root of the variance, providing a measure of dispersion that is in the
same units as the original data, making it easier to interpret.
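A minimal NumPy sketch of both measures, using an illustrative sample. Note the `ddof` parameter: dividing by n gives the population variance, while dividing by n − 1 (Bessel's correction) gives the sample variance.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance (ddof=0): mean of squared deviations from the mean.
var = np.var(data)           # -> 4.0
std = np.sqrt(var)           # -> 2.0, in the same units as the data
# Sample variance divides by n-1 instead of n:
sample_var = np.var(data, ddof=1)
```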
Percentiles and Quartiles
A percentile is a measure used in statistics to indicate the value below which a given percentage of observations
in a group of observations falls.
Quartiles are statistical measures that divide a dataset into four equal parts, each containing 25% of the data
points.
First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of the data points lie.
Second Quartile (Q2): Also known as the median or the 50th percentile, it is the value below which 50% of the
data points lie.
Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of the data points lie.
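Quartiles are just the 25th, 50th, and 75th percentiles, so `np.percentile` computes all three at once. A small sketch (be aware that different software uses slightly different interpolation rules between data points; NumPy's default is linear interpolation):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Q1, Q2 (median), Q3 in one call
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # 3.25 5.5 7.75
```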
Example and Test
Data Distribution
Normal Distribution:
A symmetric bell-shaped distribution where the mean, median, and mode are equal and located at the center of the distribution.
Uniform Distribution:
All values in the dataset occur with equal probability, resulting in a flat and constant distribution.
Exponential Distribution:
Models the time between independent events occurring at a constant average rate; it is right-skewed with a long tail.
Binomial Distribution:
Gives the probability of observing a given number of successes in a fixed number of independent trials, each with the same probability of success.
Poisson Distribution:
Gives the probability of a given number of events occurring in a fixed interval, when events occur independently at a constant average rate.
Data Summarization
Summary Statistics with Pandas
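A minimal sketch of `describe()` and `groupby()` aggregation on a made-up dataset (the `dept` and `salary` column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["A", "A", "B", "B", "B"],
    "salary": [30, 50, 40, 60, 80],
})

# One-line summary: count, mean, std, min, quartiles, max
summary = df["salary"].describe()

# Grouped aggregation: one row per department
by_dept = df.groupby("dept")["salary"].agg(["mean", "count"])
print(by_dept)
```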
Data Cleaning and Preparation
Handling Missing Values
■ Identifying missing data
● How to identify missing data?
● Is null really missing data?
■ Imputation methods
● Mean, median, mode
● Forward fill / backward fill
● Interpolation
○ Linear / polynomial / spline / Gaussian / nearest neighbor, etc.
■ Dropping missing values
● When to drop / not drop?
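The identification, imputation, and dropping steps above can be sketched with Pandas on a small illustrative series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Identify: count missing entries per column
n_missing = df["x"].isna().sum()                 # -> 2

# Impute: several strategies side by side
mean_filled  = df["x"].fillna(df["x"].mean())    # mean imputation -> 3.0 in gaps
ffilled      = df["x"].ffill()                   # carry last valid value forward
interpolated = df["x"].interpolate("linear")     # fill gaps along a straight line

# Drop: remove rows containing any missing value
dropped = df.dropna()
```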
Dealing with Outliers
Example
InterQuartile Range (IQR)
Interquartile Range (IQR) is a measure of statistical dispersion, or how spread out the values in a data set are.
It is defined as the range between the first quartile (Q1) and the third quartile (Q3), and it captures the middle 50% of
the data.
IQR = Q3 − Q1
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
The IQR is a robust measure of variability that is less affected by outliers and skewed data than the range or standard
deviation.
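The IQR fences above translate directly into a few lines of Pandas. A sketch on a made-up sample where 95 is the obvious outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```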
Z-Score
The Z-score, also known as the standard score, is a statistical measure that describes the position of a raw score in terms of its distance from the mean of the dataset, measured in standard deviations: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
Why scale or normalize features?
Consistency: Different features in your dataset might have different units or scales. For instance, one feature could be in dollars while another could be in inches. Normalizing or scaling ensures that all features contribute equally to the model.
Improved Convergence: Many machine learning algorithms, particularly those that involve gradient descent (e.g., linear regression, neural networks), converge faster when the features are scaled.
Distance-Based Algorithms: Algorithms such as K-Nearest Neighbors (KNN) and clustering algorithms (e.g., K-Means) rely on distance metrics. Features on different scales can distort these distances.
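Z-scores also give a simple outlier test: points more than 2 or 3 standard deviations from the mean are commonly flagged. A sketch on illustrative data:

```python
import numpy as np

data = np.array([10.0, 12.0, 12.0, 13.0, 12.0, 11.0, 50.0])

# Standardize: subtract the mean, divide by the standard deviation
z = (data - data.mean()) / data.std()   # population std (ddof=0)

# Rule of thumb: |z| > 2 (or 3, for stricter filtering) marks an outlier
outliers = data[np.abs(z) > 2]
print(outliers)  # [50.]
```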
Normalization and Standardization
Normalization is a data preprocessing technique used to adjust the values of numeric columns in a dataset to a common scale, typically between 0 and 1.
Min-Max normalization scales the data to a fixed range, usually [0, 1]: x' = (x − min) / (max − min).
Standardization (Z-score scaling) instead rescales the data to have zero mean and unit standard deviation: x' = (x − μ) / σ.
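Both rescalings are one-liners in NumPy (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same thing for whole DataFrames). A sketch on illustrative values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max normalization: maps min -> 0 and max -> 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std()
```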
Log Transformation
Log transformation is a data transformation technique that can be applied to make highly skewed data
more normally distributed.
It is particularly useful for handling data with a long tail and for stabilizing the variance of a dataset.
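A minimal sketch on an illustrative right-skewed sample. `np.log1p` computes log(1 + x), which stays defined when the data contains zeros; `np.expm1` inverts it:

```python
import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])  # long right tail
logged = np.log1p(skewed)   # compresses the tail: values now span ~0.7 to ~6.9

# The transformation is invertible, so original values can be recovered
recovered = np.expm1(logged)
```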
Encoding Categorical Variables
Encoding categorical variables is an essential step in data preprocessing for machine learning models, as
most models require numerical input.
1. Label Encoding
2. One-Hot Encoding
3. Ordinal Encoding
4. Target Encoding
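The first two techniques can be sketched with plain Pandas (the `color` column is illustrative). Label encoding assigns an integer per category; one-hot encoding creates a binary column per category:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: each category gets an integer code (alphabetical by default)
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Label encoding implies an ordering that may not exist, so it suits tree models or genuinely ordinal data; one-hot encoding avoids that at the cost of extra columns.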
Label Encoding
One-Hot Encoding
Data Visualization
Univariate Analysis
Histogram
A histogram is a graphical representation of the distribution of a dataset. It is used to
visualize the frequency of data points within specified ranges (bins).
Used for Continuous Data.
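A minimal Matplotlib sketch on simulated continuous data (the Agg backend is set so the plot renders to a file without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")           # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=1000)   # simulated continuous data

fig, ax = plt.subplots()
ax.hist(data, bins=20, edgecolor="black")      # 20 bins of equal width
ax.set_xlabel("value")
ax.set_ylabel("frequency")
fig.savefig("histogram.png")
```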
Bar Plot:
A bar plot (or bar chart) is a graphical representation of categorical data with rectangular bars.
The length of each bar is proportional to the value or frequency of the category it represents.
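A sketch with made-up categorical data: `value_counts()` tallies the categories and `ax.bar` draws one bar per category:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Frequency of each category in an illustrative sample
counts = pd.Series(["cat", "dog", "cat", "bird", "dog", "cat"]).value_counts()

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)    # bar height = category frequency
ax.set_ylabel("frequency")
fig.savefig("bar_plot.png")
```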
Box Plot:
A box plot, also known as a box-and-whisker plot, is a standardized way of displaying the distribution of
data based on a five-number summary: minimum, first Quartile (Q1), median (Q2), third quartile (Q3),
and maximum.
It helps in visualizing the spread and skewness of the data, as well as identifying outliers.
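A sketch on illustrative data with one planted outlier. Matplotlib draws the whiskers at 1.5 × IQR by default, so points beyond them appear as individual "flier" markers:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

data = np.append(np.arange(1, 21), 60)   # values 1..20 plus an outlier at 60

fig, ax = plt.subplots()
result = ax.boxplot(data)                # whiskers at 1.5 * IQR by default
fig.savefig("box_plot.png")

# Points drawn beyond the whiskers
outlier_points = result["fliers"][0].get_ydata()
print(outlier_points)  # [60]
```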
Bivariate Analysis
Scatter Plot
● A scatter plot is a type of plot used to visualize the relationship between two continuous variables.
● Each point on the plot represents a single observation in the dataset, with one variable plotted on
the x-axis and the other variable plotted on the y-axis.
● Scatter plots are useful for identifying patterns, trends, and relationships between variables, such
as correlation or clustering.
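A sketch with simulated data where y depends roughly linearly on x, so the points fall along an upward-sloping band:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2 * x + rng.normal(0, 1, 100)   # roughly linear relationship plus noise

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("scatter.png")

r = np.corrcoef(x, y)[0, 1]         # strong positive correlation
```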
Line Plot
● A line plot is a type of plot that displays data points connected by straight lines.
● It is commonly used to visualize trends and patterns over time or any other ordered dimension.
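A sketch with made-up monthly values plotted against a date index, the typical use case for trends over time:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

dates = pd.date_range("2024-01-01", periods=12, freq="MS")       # month starts
sales = [5, 7, 8, 12, 15, 14, 18, 20, 19, 23, 25, 30]            # made-up values

fig, ax = plt.subplots()
ax.plot(dates, sales, marker="o")   # points connected by straight lines
ax.set_xlabel("month")
ax.set_ylabel("sales")
fig.savefig("line_plot.png")
```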
Heat Maps
A heatmap is a graphical representation of data where values in a matrix are represented as colors.
Heatmaps are particularly useful for visualizing the magnitude of relationships between two variables in
a dataset.
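A minimal sketch using plain Matplotlib (`imshow` maps each matrix cell to a color; `seaborn.heatmap` offers the same with cell annotations). The matrix values are illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

matrix = np.array([[1.0, 0.8, 0.1],
                   [0.8, 1.0, 0.3],
                   [0.1, 0.3, 1.0]])

fig, ax = plt.subplots()
im = ax.imshow(matrix, cmap="coolwarm", vmin=-1, vmax=1)  # cell value -> color
fig.colorbar(im)
fig.savefig("heatmap.png")
```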
Multivariate Analysis
Pair Plot
A pair plot, also known as a scatterplot matrix, is a grid of scatterplots that allows you to visualize pairwise relationships between multiple variables in
a dataset.
It provides a quick way to explore the correlation or association between variables by plotting each variable against every other variable.
Pair plots are particularly useful for understanding the relationships between multiple variables and identifying patterns or trends in the data.
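Pandas ships a scatterplot-matrix helper, so no extra library is needed (seaborn's `pairplot` is the common alternative). A sketch on simulated data where column `c` is deliberately tied to `a`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
})
df["c"] = df["a"] * 2 + rng.normal(scale=0.1, size=50)  # strongly tied to "a"

# One scatterplot per variable pair, histograms on the diagonal
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
```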
Correlation Matrices:
What is Correlation?
Correlation is a statistical measure that describes the strength and direction of a relationship between
two variables.
Types of Correlation:
❖ Pearson Correlation - captures linear relationships
❖ Spearman Correlation - captures monotonic relationships (rank-based)
● Each cell in the matrix represents the correlation coefficient between two variables, ranging from
-1 to 1.
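A sketch of a correlation matrix with Pandas on simulated data: `y` is built to depend linearly on `x`, while `z` is independent noise, so the matrix shows one strong coefficient and one near zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 3 * x + rng.normal(scale=0.5, size=200),  # linearly related to x
    "z": rng.normal(size=200),                     # unrelated noise
})

corr = df.corr(method="pearson")   # method="spearman" for rank correlation
print(corr.round(2))
```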