Unit II Descriptive-Statistics-And-Correlation

Notes for data science

Uploaded by

Pratik Bante

UNRAVELING DATA RELATIONSHIPS: A DEEP DIVE INTO DESCRIPTIVE STATISTICS AND CORRELATION
Unit 2: Statistics
• Descriptive statistics
• Correlation
• Distributions and probability
• Statistical inference: populations and samples
• Statistical modelling
• Probability distributions
• Fitting a model
• Hypothesis testing
INTRODUCTION TO DATA RELATIONSHIPS

In this presentation, we will explore
data relationships, focusing on
descriptive statistics and
correlation.
Understanding these concepts is
crucial for analyzing data and
making informed decisions.
Let's set out on this journey
to uncover the insights hidden
within our data.
WHAT ARE DESCRIPTIVE STATISTICS?
Descriptive statistics summarize
and describe the main features of
a dataset.
They provide insights into the
central tendency, variability, and
overall distribution of the data.
Common measures include mean,
median, mode,
and standard deviation.
Understanding these statistics is
foundational for any data
analysis.
CENTRAL TENDENCY

Central tendency measures, such
as mean, median, and mode,
help us understand the typical
value in a dataset. The mean is
the average, the median is the
middle value, and the mode is
the most frequently occurring
value. Each measure provides
unique insights into the data's
distribution.
A measure of central tendency describes where most of the values in the
dataset occur. It’s the center of the distribution of values. Excel presents three
measures of central tendency. Which one is best for your data?
Mean: This measure is the one with which you’re most familiar. It’s the sum of all
observations divided by the number of observations. It’s best for data that follow
symmetric distributions.
Median: This value splits your data in half. Half the values fall above the median
while half are below it. It’s best for skewed distributions.
Mode: This measure represents the value that occurs most frequently in your data.
It’s best for categorical and ordinal data.
The example data are continuous variables. Excel frequently displays “N/A”
for the mode when you have continuous data. That happens because continuous
data are unlikely to have exactly duplicated values, a requirement for the mode.
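The guidance above can be sketched in Python using the standard library's statistics module. The data sets below are hypothetical and chosen only to show how each measure behaves; Excel itself is not involved.

```python
from statistics import mean, median, multimode

# Hypothetical values, invented only to illustrate the three measures.
symmetric = [4, 5, 6, 7, 8]
skewed = [4, 5, 6, 7, 48]           # one extreme value
shoe_sizes = [7, 8, 8, 9, 8, 10]    # discrete data with a repeated value
continuous = [1.23, 4.56, 7.89]     # no value repeats

# Symmetric data: mean and median agree.
print(mean(symmetric), median(symmetric))

# Skewed data: the extreme value pulls the mean up, the median stays put.
print(mean(skewed), median(skewed))

# multimode returns every value that ties for the highest count.
print(multimode(shoe_sizes))

# With no repeats, every value ties, so the mode is not informative.
print(multimode(continuous))
```

Here multimode is used instead of mode so that ties are visible; for continuous data it simply returns every value, which is the Python counterpart of a spreadsheet reporting no mode.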
UNDERSTANDING VARIABILITY

Variability refers to how much the
data points differ from each
other. Key measures include
range, variance, and standard
deviation. High variability
indicates that data points are
spread out, while low variability
suggests they are clustered
closely. Understanding variability
is essential for interpreting data
accurately.
1. Range

Definition: The range is the difference between the largest and smallest values in
a dataset.
Formula: Range = Maximum Value − Minimum Value
Example: If a data set contains values 2, 5, 8, 10, and 12,
the range is: 12−2=10
Explanation: The range gives a quick sense of the spread of the data, but it is
affected by extreme values (outliers).
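2. Variance

Definition: The variance is the average of the squared differences between each
value and the mean.
Formula: Variance = Σ(Value − Mean)² / N, where N is the number of values
(for a sample rather than a whole population, divide by N − 1 instead)
3. Standard Deviation

Definition: The standard deviation is the square root of the variance, so it is
expressed in the same units as the data.
Formula: Standard Deviation = √Variance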
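As a rough sketch, the three measures of spread can be computed for the data set from the range example (2, 5, 8, 10, 12). The population form of the variance (dividing by the number of values) is an assumption here, and the variable names are illustrative.

```python
from math import sqrt

values = [2, 5, 8, 10, 12]  # the data set from the range example

# Range: largest value minus smallest value.
data_range = max(values) - min(values)

n = len(values)
mean = sum(values) / n

# Population variance: average squared deviation from the mean (divide by n).
variance = sum((x - mean) ** 2 for x in values) / n

# Standard deviation: square root of the variance.
std_dev = sqrt(variance)

print(f"range = {data_range}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```

Dividing by n - 1 instead of n would give the sample variance, which is slightly larger.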
Example: Exam Scores

Suppose you have the following scores of 20 students on an exam:
85, 90, 75, 92, 88, 79, 83, 95, 87, 91, 78, 86, 89, 94, 82, 80, 84, 93, 88, 81
To calculate descriptive statistics:
• Mean: Add up all the scores and divide by the number of scores. Mean = (85 + 90 + 75 + 92 + 88 + 79 + 83
+ 95 + 87 + 91 + 78 + 86 + 89 + 94 + 82 + 80 + 84 + 93 + 88 + 81) / 20 = 1720 / 20 = 86
• Median: Arrange the scores in ascending order and find the middle value. With 20 scores there are two
middle values, 86 and 87, so Median = (86 + 87) / 2 = 86.5
• Mode: Identify the score(s) that appear(s) most frequently. Mode = 88 (the only score that appears twice)
• Range: Calculate the difference between the highest and lowest scores. Range = 95 - 75 = 20
• Variance: Calculate the average of the squared differences from the mean. Variance = [(85-86)^2 + (90-
86)^2 + ... + (81-86)^2] / 20 = 614 / 20 = 30.7
• Standard Deviation: Take the square root of the variance. Standard Deviation = √30.7 ≈ 5.54
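The exam-score calculation can be checked with a short Python sketch. It assumes the population form of variance and standard deviation (dividing by the number of scores), which matches the definition used in the bullet list; the summary dictionary is just an illustrative way to collect the results.

```python
from statistics import mean, median, mode, pvariance, pstdev

scores = [85, 90, 75, 92, 88, 79, 83, 95, 87, 91,
          78, 86, 89, 94, 82, 80, 84, 93, 88, 81]

summary = {
    "mean": mean(scores),
    "median": median(scores),            # averages the two middle values
    "mode": mode(scores),
    "range": max(scores) - min(scores),
    "variance": pvariance(scores),       # population form: divide by n
    "std_dev": pstdev(scores),           # square root of pvariance
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Swapping pvariance and pstdev for variance and stdev from the same module would give the sample versions, which divide by n - 1.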
VISUALIZING DESCRIPTIVE STATISTICS

Data visualization tools like
histograms, box plots, and
scatter plots help illustrate
descriptive statistics
effectively. These visuals
provide a clearer
understanding of data
distribution, central tendency,
and variability, making it easier to
communicate findings to
stakeholders and decision-makers.
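As a minimal illustration of what a histogram shows, the exam scores from the earlier example can be grouped into bins and printed as a text chart. This sketch uses only the standard library; the bin width of 5 is an assumption, and a plotting library would normally be used for real charts.

```python
from collections import Counter

scores = [85, 90, 75, 92, 88, 79, 83, 95, 87, 91,
          78, 86, 89, 94, 82, 80, 84, 93, 88, 81]

bin_width = 5  # assumed bin width; other widths give a coarser or finer picture

# Map each score to the lower edge of its bin, e.g. 88 -> 85.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin, with one '#' per score in that bin.
for lower in range(min(counts), max(counts) + bin_width, bin_width):
    bar = "#" * counts[lower]
    print(f"{lower}-{lower + bin_width - 1}: {bar}")
```

The longest row marks where scores cluster (central tendency), and the number of rows with bars gives a sense of the spread (variability).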
INTRODUCTION TO CORRELATION

Correlation measures the strength
and direction of the relationship
between two variables. It ranges
from -1 to 1, where -1 indicates a
perfect negative correlation, 1
indicates a perfect positive
correlation, and 0 indicates no
linear correlation. Understanding
correlation is vital for identifying
relationships in data.
TYPES OF CORRELATION
There are three main types of
correlation: positive, negative, and
no correlation. Positive correlation
means that as one variable
increases, the other also
increases. Negative correlation
indicates that as one variable
increases, the other decreases. No
correlation means there is no
discernible relationship between
the variables.
CALCULATING CORRELATION COEFFICIENT

The correlation coefficient, often
represented as r, quantifies the
degree of correlation between
two variables. Two common
methods are Pearson's correlation,
which measures linear association,
and Spearman's correlation, which
measures monotonic association
using ranks. Understanding how to
calculate and interpret this
coefficient is essential for data
analysis and research.
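A minimal sketch of both calculations follows. The function names and the hours/score data are hypothetical, and the ranking helper assumes there are no tied values; a statistics library would handle ties and edge cases more carefully.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: co-variation of x and y scaled by their individual spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # The 1/n factors cancel between numerator and denominator,
    # so plain sums of deviations are used.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: hours studied and exam score for six students.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 70, 84, 99]

print(f"Pearson r    = {pearson_r(hours, score):.3f}")
print(f"Spearman rho = {spearman_rho(hours, score):.3f}")
```

Because the scores always rise with hours but not along a straight line, Spearman's coefficient reaches 1 while Pearson's r stays slightly below it.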
LIMITATIONS OF CORRELATION

While correlation can indicate a
relationship between variables, it
does not imply causation. Other
factors may influence the
relationship, leading to
misleading interpretations.
Therefore, it is crucial to
complement correlation analysis
with further investigation to
understand the underlying causes.
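One way to see this is a sketch with a third variable that drives two others. All figures below are invented for illustration: ice cream sales and sunburn cases both rise with temperature, so they correlate strongly even though neither causes the other.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from sums of deviations (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

# Hypothetical daily figures, invented for illustration only.
temperature = [18, 21, 24, 27, 30, 33, 36]         # degrees Celsius
ice_cream_sales = [40, 52, 59, 71, 83, 90, 104]    # rises with temperature
sunburn_cases = [3, 5, 8, 9, 13, 14, 18]           # also rises with temperature

r_direct = pearson_r(ice_cream_sales, sunburn_cases)
r_ice_temp = pearson_r(temperature, ice_cream_sales)
r_sun_temp = pearson_r(temperature, sunburn_cases)

print(f"r(ice cream, sunburn)     = {r_direct:.2f}")
print(f"r(temperature, ice cream) = {r_ice_temp:.2f}")
print(f"r(temperature, sunburn)   = {r_sun_temp:.2f}")
```

The high coefficient between sales and sunburn reflects their shared dependence on temperature, which is the kind of hidden factor that further investigation is meant to uncover.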
REAL-WORLD APPLICATIONS
Descriptive statistics and
correlation are widely used in
various fields, including business,
healthcare, and social sciences.
They help professionals make
data-driven decisions, identify
trends, and improve outcomes.
Understanding these concepts is
essential for anyone working with
data.
KEY TAKEAWAYS
In summary, understanding
descriptive statistics and
correlation is crucial for analyzing
data relationships. These concepts
provide insights into data
distribution, variability, and
relationships between variables.
Mastering these tools enhances
data analysis skills and informs
better decision-making.
CONCLUSION

In conclusion, unraveling data relationships through
descriptive statistics and correlation is essential for
effective data analysis. By understanding these concepts,
we can uncover valuable insights that drive informed
decisions and strategies. Thank you for your attention!
