
Week 1, Day 1
Descriptive Statistics
Sabyasachi Parida
Agenda
● Introduction to EDA
  ○ What is EDA?
  ○ Python Libraries for EDA
● Descriptive Statistics
  ○ Basic Descriptive Statistics
    ■ Mean, Median, Mode
    ■ Variance, Standard Deviation
    ■ Percentiles and Quartiles
  ○ Data Distribution
● Data Summarization
  ○ Summary statistics with Pandas (describe())
  ○ Grouping and aggregating data
● Data Cleaning and Preparation
  ○ Handling Missing Values
    ■ Identifying missing data
    ■ Imputation methods (mean, median, mode, interpolation)
    ■ Dropping missing values
  ○ Dealing with Outliers
    ■ Identifying outliers using IQR, Z-score
  ○ Data Transformation
    ■ Scaling and normalization
    ■ Log transformation
    ■ Encoding categorical variables
● Data Visualization
  ○ Univariate Analysis
    ■ Histograms
    ■ Bar plots
    ■ Box plots
  ○ Bivariate Analysis
    ■ Scatter plots
    ■ Line plots
    ■ Heatmaps
  ○ Multivariate Analysis
    ■ Pair plots
    ■ Correlation matrices
Introduction to EDA

What is EDA?

Exploratory Data Analysis (EDA) is a crucial step in the data science process that involves summarizing
and visualizing the main characteristics of a dataset.

The goal of EDA is to gain insights and understanding of the data before proceeding to more formal
modeling and hypothesis testing.

Python Libraries for EDA

● Pandas
● NumPy
● Matplotlib
● Seaborn
● Plotly

Descriptive Statistics

Basic Descriptive Statistics

Mean:

● The mean, often referred to as the average, is a measure of central tendency that summarizes a set
of numbers by identifying the central point within that set.
● It is calculated by dividing the sum of all the values in a dataset by the number of values.
● The mean provides a single value that represents the center of the data distribution.

Median:

The median is a measure of central tendency that represents the middle value in a dataset when the
values are arranged in ascending or descending order.

Unlike the mean, the median is not affected by extreme values (outliers), making it a more robust
measure for datasets with skewed distributions or outliers.

Mode:

The mode is a measure of central tendency that represents the most frequently occurring value in a
dataset.

Unlike the mean and median, which are measures of average and middle value respectively, the mode
identifies the value that appears most often.
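To make these three measures concrete, here is a minimal pandas sketch; the values are made up for illustration:

import pandas as pd

s = pd.Series([2, 3, 3, 5, 7, 9, 100])  # hypothetical sample with one extreme value

print(s.mean())    # arithmetic mean, pulled upward by the extreme value 100
print(s.median())  # middle value (5), unaffected by the extreme value
print(s.mode())    # most frequent value(s), here 3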

Variance and Standard Deviation

Variance:

● Variance is a statistical measure that quantifies the spread or dispersion of a set of data points
around their mean.
● It indicates how much the individual data points in a dataset differ from the mean value of the
dataset.
● A higher variance indicates that the data points are more spread out from the mean, while a lower
variance indicates that they are closer to the mean.

Standard Deviation:

Standard deviation is a measure of the amount of variation or dispersion in a set of values.

It quantifies how much the individual data points in a dataset deviate from the mean (average) of the
dataset.

The standard deviation is the square root of the variance, providing a measure of dispersion that is in the
same units as the original data, making it easier to interpret.
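A minimal sketch with pandas on hypothetical values; note that pandas uses the sample estimator (ddof=1) by default:

import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.var())        # sample variance (ddof=1)
print(s.std())        # sample standard deviation, the square root of the variance
print(s.std(ddof=0))  # population standard deviation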

Percentiles and Quartiles

A percentile is a measure used in statistics to indicate the value below which a given percentage of observations
in a group of observations falls.

Quartiles are statistical measures that divide a dataset into four equal parts, each containing 25% of the data
points.

First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of the data points lie.

Second Quartile (Q2): Also known as the median or the 50th percentile, it is the value below which 50% of the
data points lie.

Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of the data points lie.
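A sketch computing quartiles with pandas quantile() (NumPy's percentile() is equivalent); the data is hypothetical:

import pandas as pd

s = pd.Series(range(1, 101))  # the values 1..100

q1 = s.quantile(0.25)  # first quartile (25th percentile)
q2 = s.quantile(0.50)  # second quartile / median (50th percentile)
q3 = s.quantile(0.75)  # third quartile (75th percentile)
print(q1, q2, q3)      # 25.75 50.5 75.25 with the default linear interpolation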

Example and Test

Data Distribution

Normal Distribution (Gaussian Distribution):

A symmetric bell-shaped distribution where the mean, median, and mode are equal and located at the center of the
distribution.

Uniform Distribution:

All values in the dataset occur with equal probability, resulting in a flat and
constant distribution.

Exponential Distribution:

A right-skewed distribution where the probability of an event decreases exponentially as the value increases.
Binomial Distribution:

A discrete probability distribution of the number of successes in a fixed number of independent Bernoulli trials.

Poisson Distribution:

A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.
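As a quick illustration, the sketch below draws samples from each of these distributions with NumPy and plots their histograms; all parameters are arbitrary:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
samples = {
    "Normal": rng.normal(loc=0, scale=1, size=10_000),
    "Uniform": rng.uniform(low=0, high=1, size=10_000),
    "Exponential": rng.exponential(scale=1, size=10_000),
    "Binomial": rng.binomial(n=20, p=0.5, size=10_000),
    "Poisson": rng.poisson(lam=4, size=10_000),
}

fig, axes = plt.subplots(1, 5, figsize=(18, 3))
for ax, (name, data) in zip(axes, samples.items()):
    ax.hist(data, bins=30)
    ax.set_title(name)
plt.tight_layout()
plt.show()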

Data Summarization

Summary Statistics with Pandas
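describe() computes count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column in one call. A minimal sketch on a hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({"age": [23, 31, 27, 45, 39], "salary": [40, 55, 48, 90, 75]})

print(df.describe())                 # count, mean, std, min, 25%, 50%, 75%, max
# print(df.describe(include="all"))  # also summarize non-numeric columns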

Grouping and Aggregating Data
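groupby() splits the rows by the values of one or more key columns and applies aggregations per group. A sketch with hypothetical data:

import pandas as pd

df = pd.DataFrame({
    "dept": ["A", "A", "B", "B", "B"],
    "salary": [40, 55, 48, 90, 75],
})

print(df.groupby("dept")["salary"].mean())                            # one aggregate per group
print(df.groupby("dept")["salary"].agg(["mean", "median", "count"]))  # several at once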

Data Cleaning and Preparation

Handling Missing Values
■ Identifying missing data
● How to identify missing data?
● Is Null really missing data?

■ Imputation methods
● mean, median, mode
● Forward Fill/ Backward Fill
● Interpolation
○ Linear / Polynomial / Spline / Gaussian / Nearest Neighbor, etc.
■ Dropping missing values
● When to drop/not drop?
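A sketch of these options in pandas, on a hypothetical column x containing NaNs:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

print(df.isnull().sum())                             # count missing values per column

df["mean_fill"] = df["x"].fillna(df["x"].mean())     # mean imputation
df["ffill"] = df["x"].ffill()                        # forward fill (bfill() for backward)
df["interp"] = df["x"].interpolate(method="linear")  # linear interpolation

df_dropped = df.dropna(subset=["x"])                 # drop rows where x is missing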
Dealing with Outliers

What are Outliers?

An outlier is a data point that differs significantly from the rest of the observations in a dataset. Outliers can arise from natural variability in the data or from measurement and data-entry errors, and they can heavily distort non-robust measures such as the mean and standard deviation.
InterQuartile Range (IQR)

Interquartile Range (IQR) is a measure of statistical dispersion, or how spread out the values in a data set are.

It is defined as the range between the first quartile (Q1) and the third quartile (Q3), and it captures the middle 50% of
the data.

IQR = Q3 − Q1

Lower Bound = Q1 − 1.5 × IQR

Upper Bound = Q3 + 1.5 × IQR

The IQR is a robust measure of variability that is less affected by outliers and skewed data than the range or standard
deviation.
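A sketch of IQR-based outlier detection in pandas; the series is hypothetical, with 102 planted as an outlier:

import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # flags 102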

Z-Score
The Z-score, also known as the standard score, is a statistical measure that describes the position of a raw score in terms of its distance from the mean of the dataset, measured in standard deviations: Z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.

● Z = 0: The data point is exactly at the mean.
● Z > 0: The data point is above the mean.
● Z < 0: The data point is below the mean.
● |Z| > 2 or 3: The data point is considered an outlier if its Z-score is more than 2 or 3 standard deviations from the mean (common thresholds).
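The same hypothetical series, screened with Z-scores instead:

import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

z = (s - s.mean()) / s.std()  # Z = (x - mean) / std
print(s[z.abs() > 2])         # with a threshold of 2, flags 102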
Data Transformation

Consistency: Different features in your dataset might have different units or scales. For instance, one feature could be in
dollars while another could be in inches. Normalizing or scaling ensures that all features contribute equally to the model.

Improved Convergence: Many machine learning algorithms, particularly those that involve gradient descent (e.g., linear
regression, neural networks), converge faster when the features are scaled.

Distance-Based Algorithms: Algorithms such as K-Nearest Neighbors (KNN) and clustering algorithms (e.g., K-Means)
rely on distance metrics. Features on different scales can distort these distances.

Normalization and Standardization
Normalization is a data preprocessing technique used to adjust the values of numeric columns in a
dataset to a common scale, typically between 0 and 1.

1. Min-Max Normalization (Rescaling)

Min-Max normalization scales the data to a fixed range, usually [0, 1].

2. Z-Score Normalization (Standardization)

Standardization scales the data to have a mean of 0 and a standard deviation of 1.
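Both can be written directly in pandas; the column is hypothetical (scikit-learn's MinMaxScaler and StandardScaler implement the same ideas):

import pandas as pd

df = pd.DataFrame({"height": [150, 160, 170, 180, 190]})

# Min-max normalization: rescale to [0, 1]
df["minmax"] = (df["height"] - df["height"].min()) / (df["height"].max() - df["height"].min())

# Z-score standardization: mean 0, standard deviation 1
df["zscore"] = (df["height"] - df["height"].mean()) / df["height"].std()

print(df)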

Log Transformation
Log transformation is a data transformation technique that can be applied to make highly skewed data
more normally distributed.

It is particularly useful for handling data with a long tail and for stabilizing the variance of a dataset.
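A minimal sketch with NumPy on a hypothetical right-skewed series:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 5, 10, 100, 1000])  # long right tail

log_s = np.log(s)      # natural log; requires strictly positive values
log1p_s = np.log1p(s)  # log(1 + x); safe when zeros are present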

Encoding Categorical Variables

Encoding categorical variables is an essential step in data preprocessing for machine learning models, as
most models require numerical input.

Techniques for Encoding Categorical Variables

1. Label Encoding
2. One-Hot Encoding
3. Ordinal Encoding
4. Target Encoding
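A sketch of these techniques in pandas; the columns are hypothetical, and target encoding is shown only as a comment because it needs a target column:

import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"], "city": ["NY", "SF", "NY", "LA"]})

# Label encoding: an arbitrary integer code per category
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(df, columns=["city"])

# Ordinal encoding: map categories that have a natural order
df["size_ordinal"] = df["size"].map({"S": 0, "M": 1, "L": 2})

# Target encoding (sketch): replace each category with the mean of the target
# df["city_target"] = df.groupby("city")["target"].transform("mean")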

Data Visualization

Univariate Analysis

Histogram
A histogram is a graphical representation of the distribution of a dataset. It is used to
visualize the frequency of data points within specified ranges (bins).
Histograms are typically used for continuous data.
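A minimal matplotlib sketch; the data is randomly generated for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=1000)  # hypothetical continuous data

plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Histogram")
plt.show()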

Bar Plot:

A bar plot (or bar chart) is a graphical representation of categorical data with rectangular bars.
The length of each bar is proportional to the value or frequency of the category it represents.

Box Plot:

A box plot, also known as a box-and-whisker plot, is a standardized way of displaying the distribution of
data based on a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3),
and maximum.

It helps in visualizing the spread and skewness of the data, as well as identifying outliers.
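A sketch of both plot types with matplotlib, using made-up data; note how the box plot flags 102 as an outlier point beyond the whiskers:

import matplotlib.pyplot as plt

# Bar plot: one bar per category, height proportional to the count
plt.bar(["A", "B", "C"], [23, 45, 12])
plt.title("Bar plot")
plt.show()

# Box plot: box from Q1 to Q3, line at the median, whiskers, outliers as points
plt.boxplot([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
plt.title("Box plot")
plt.show()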

Bivariate Analysis
Scatter Plot

● A scatter plot is a type of plot used to visualize the relationship between two continuous variables.

● Each point on the plot represents a single observation in the dataset, with one variable plotted on
the x-axis and the other variable plotted on the y-axis.

● Scatter plots are useful for identifying patterns, trends, and relationships between variables, such
as correlation or clustering.
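A matplotlib sketch with synthetic data showing a roughly linear relationship:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(0, 2, size=100)  # linear trend plus noise

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter plot")
plt.show()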

Line Plot

● A line plot is a type of plot that displays data points connected by straight lines.
● It is commonly used to visualize trends and patterns over time or any other ordered dimension.

Heat Maps

A heatmap is a graphical representation of data where values in a matrix are represented as colors.
Heatmaps are particularly useful for visualizing the magnitude of values across two dimensions, for
example the pairwise correlations between the variables in a dataset.
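A sketch of both with pandas and seaborn; the series and DataFrame are synthetic:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Line plot: values ordered over time
ts = pd.Series([3, 4, 4, 6, 7, 9, 8, 11],
               index=pd.date_range("2024-01-01", periods=8))
ts.plot(title="Line plot")
plt.show()

# Heatmap: a correlation matrix rendered as colors
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()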

Multivariate Analysis

Pair Plot

A pair plot, also known as a scatterplot matrix, is a grid of scatterplots that allows you to visualize pairwise relationships between multiple variables in
a dataset.

It provides a quick way to explore the correlation or association between variables by plotting each variable against every other variable.

Pair plots are particularly useful for understanding the relationships between multiple variables and identifying patterns or trends in the data.

Key Components of a Pair Plot

● Scatterplots: Each cell in the grid contains a scatterplot of two variables.
● Diagonal Plots: Along the diagonal of the grid, histograms or density plots of individual variables are displayed.
● Axes Labels: Descriptive labels for the axes and a title for the pair plot.
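With seaborn this is a single call; the sketch uses seaborn's bundled iris dataset (downloaded on first use):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")    # four numeric columns plus a species label
sns.pairplot(iris, hue="species")  # scatterplots off-diagonal, distributions on the diagonal
plt.show()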

Correlation Matrices:

What is Correlation?

Correlation is a statistical measure that describes the strength and direction of a relationship between
two variables.

Types of Correlation:

❖ Pearson Correlation - measures the strength of a linear relationship
❖ Spearman Correlation - rank-based; measures the strength of a monotonic relationship

● A correlation matrix is a tabular representation of the correlation coefficients between variables in a dataset.

● Each cell in the matrix represents the correlation coefficient between two variables, ranging from -1 to 1.
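A sketch with synthetic data: y is constructed to correlate with x, while z is independent:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=100)  # strongly correlated with x
df["z"] = rng.normal(size=100)                           # roughly uncorrelated

print(df.corr(method="pearson"))   # linear correlation
print(df.corr(method="spearman"))  # rank-based (monotonic) correlation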
