20/12/2024, 09:57 Statistics Handbook for Data Analysts | by Anita Gupta | Medium

Statistics Handbook for Data Analysts

Anita Gupta · Follow
17 min read · Sep 14, 2024

Why This Handbook?

Hi, I am Anita, and welcome to my statistics handbook.

I designed this handbook with the goal that everybody should have a basic
understanding of statistics. People are usually afraid of math and calculations, and
hence think that it is a very difficult subject. But times have changed. We are living
in the age of the AI revolution, where everyone has access to powerful AI tools like
ChatGPT. I want to tell you that all the math and all the coding can be obtained from
ChatGPT, even the free version!

https://medium.com/@agirmaus/statistics-handbook-for-data-analysts-fe15b8a07667 1/24

But you need to know what you want to do! Only then can you prompt ChatGPT to
search for the things you want.

So, this course is all about concepts. Understanding the concept behind statistics is
important. If you know the concept, you can implement it using ChatGPT. Hence,
this course is designed to give you concepts and free you from the burden of
knowing any math behind it. You don’t need to know math in this new era. So why
not learn the concepts and start using them to your advantage?

Now that I have convinced you that it is easy to learn statistics in this new age, let
me explain why you should learn statistics: we live in a data-driven world.

Every day, we are inundated with data, facts, figures, and statistics terms like:

“3 out of every 5 people who took this energy drink felt more productive.”

Should you trust this data? That’s a very common example from daily life.

In business, no company makes any decisions without data. We usually see
aggregate percentages and numbers, and I observe that advanced analyses like
hypothesis testing and machine learning are limited to some elite teams. I feel
things need to change.

Now, I am not telling you to become a data scientist or a statistician, but what I am
telling you is that you need to know the concepts. This way, when someone shares
advanced analysis or research findings with you, you’ll understand what it means
and know the right way to analyze the problem.

I hope I was able to encourage you to start learning statistics, and I will see you in
my lecture series. I promise it will be fun.

1. Data Types:

Why learn data types?

In statistics, when we analyze data, we first need to check what kind of data we are
dealing with. The data type determines which chart or what kind of analysis can be
performed.

It’s very important to understand the data type you’re dealing with because numeric
data may sometimes be stored as text. A date column may also be stored as text.
When we have to do data transformations (for example, extracting the year from a
date column), we need to bring the data to the correct data type first; otherwise, we
will see errors.
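As a small illustration of the point above, here is a sketch in Python. The values are made up for the example (they are not from any real dataset); the idea is only to show that text has to be converted to a number or a date before calculations will work.

```python
from datetime import datetime

# Hypothetical raw values, as they might arrive from a spreadsheet export:
# both the salary and the date are stored as text.
raw_salary = "60000"
raw_join_date = "2024-09-14"

# Convert the text to the correct data types first.
salary = int(raw_salary)
join_date = datetime.strptime(raw_join_date, "%Y-%m-%d")

# Now numeric and date operations behave as expected.
monthly_salary = salary / 12
join_year = join_date.year

print(monthly_salary, join_year)
```

Trying `raw_salary / 12` directly would raise an error, because Python cannot divide text by a number.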

In daily life, we typically deal with two kinds of data: numeric and non-numeric
(text). But in statistics, we have several data types. Let’s understand each of them
one by one. Note that there are more data types, but for now, we’ll stick to the
basics:

Nominal: Represents categories or groups without any inherent order or
ranking. Examples include colors, types of cars, or genders.

Discrete: Refers to data that can only take specific, distinct values, usually whole
numbers. Examples include the number of students in a class or the outcomes
of rolling a die.

Continuous: Describes data that can take any value within a certain range.
Examples include height, weight, or temperature.

Ordinal: Data with categories that have a specific order or ranking. However, the
differences between the categories may not be consistent. Examples include
rating scales like “poor,” “fair,” “good,” and “excellent.”

Categorical: Anything not numerical can be referred to as categorical. These are
distinct categories or groups, regardless of whether they have a specific order.

Okay, that was all for this presentation on data types. I hope it was clear to you, and
I’ll see you in the next video.

2. Descriptive statistics:

Descriptive, as the name suggests, means we are simply describing the data. How do
we do it? We start by calculating the mean, which in everyday language is usually
called the average. For example, the average salary of people in the US might be
$60,000. But is this metric enough? Simply knowing the mean is not enough; we need
other metrics. In a business setting, the average is a favorite metric to track.
Business leaders don’t mind some people getting paid higher or lower; it tends to
even out.

Now let’s understand the median. What if we want to know the median salary?

To calculate the median, we first rank all employees in increasing order of their
salary. Then, we take the midpoint. Say the median is $50,000. Remember, this
course is all about getting the intuition behind the concepts and being confident
that you understand the topic being discussed. So, coming back to the topic, the
median divides the data points: 50% are above it, and 50% are below it. So next time
someone says the median salary is $50,000, it means 50% of people earn below that
and 50% earn above it.

The mode is simple: it’s the value that occurs most frequently. For example, if you
do a poll and ask what food students want to order for lunch, and pizza gets more
votes than any other option, then pizza is the mode, the highest occurring value.

These examples show you that all three measures of central tendency serve different
purposes, and depending on your use case, you’ll use one or more of them.
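To see the three measures side by side, here is a minimal sketch using Python's built-in `statistics` module. The salary list is invented for illustration; it is chosen so that one very high salary pulls the mean up to $60,000 while the median stays at $50,000.

```python
import statistics

# Hypothetical salaries (USD); the last value is much higher than the rest.
salaries = [40_000, 45_000, 50_000, 50_000, 55_000, 60_000, 120_000]

mean_salary = statistics.mean(salaries)      # sum divided by count
median_salary = statistics.median(salaries)  # middle value after sorting
mode_salary = statistics.mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)
```

Notice how a single large salary moves the mean but not the median, which is one reason to look at more than one measure.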

3. Measures of Dispersion: Range, Variance, and Standard Deviation

These are measures of dispersion, or how your data is spread around the mean
value. Just knowing the average is not sufficient.

Range is simple. For example, if you’re looking to buy a house, you might be
considering houses in the price range of $500,000 to $800,000. We use ranges all
the time in day-to-day conversations.

Variance goes one step further. The range gives two numbers, the lowest and the
highest. But what if we want one number that captures the essence of the spread
of data? In that case, we calculate how far each value in the dataset is from the
mean, square those distances, add them up, and then divide by the number of
observations (for a sample, the usual convention is to divide by n − 1 instead).
Since we square the distances, variance is in squared units, so interpreting it
directly can be difficult. However, variance is very useful when working with
machine learning algorithms, where we try to reduce the variance of a model’s
predictions through hyperparameter tuning.

Standard deviation is simply the square root of the variance. This allows the
dispersion metric to be in the same unit as the observed data. For example, we
now have dollars instead of dollars squared, making interpretation more useful.

If I say the mean salary is $60,000 with a standard deviation of $10,000, this means
a typical salary is roughly $10,000 away from the mean. If salaries are approximately
normally distributed, about 68% of them fall between $50,000 and $70,000.

Quick question: Is a high standard deviation better?

The answer is usually no. When we want consistent, predictable values, we prefer a
tight distribution, with most data centered around the mean.
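The three dispersion measures can be sketched in a few lines of Python. The salaries below are made up for the example, and the code uses the population versions of variance and standard deviation (divide by the number of observations), matching the description above.

```python
import statistics

# Hypothetical salaries (USD) with a mean of 60,000.
salaries = [50_000, 55_000, 60_000, 65_000, 70_000]

# Range: distance between the highest and lowest value.
salary_range = max(salaries) - min(salaries)

# Population variance: average of squared distances from the mean.
variance = statistics.pvariance(salaries)

# Standard deviation: square root of the variance, back in dollars.
std_dev = statistics.pstdev(salaries)

print(salary_range, variance, round(std_dev, 2))
```

The variance comes out in "dollars squared" (a very large number), while the standard deviation is back in plain dollars, which is why it is easier to talk about.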

4. Central Limit Theorem (CLT)

Let’s cover two very important concepts: the Central Limit Theorem (CLT) and the
z-score. These concepts are good to know and may be used when you attempt some
advanced predictive analytics based on a sample population.

To understand the Central Limit Theorem, we need to know what “population” and
“sample” mean. Intuitively, population refers to the entire universe, while a sample
refers to a small group of data points from that universe. Can we measure every data
point in the universe? No, we cannot measure, for example, the weight of every
individual on this planet. What we do instead is take samples and draw inferences.
Based on the sample, the population is expected to have similar characteristics or
metrics.

The way we derive inferences is scientific and mathematical. But we won’t go into
the math, just the intuitive part of it.

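So, imagine this is my universe, and I draw many samples from it, each with a
reasonably large number of observations. I take the mean of each sample and then
plot the distribution of those means. The CLT states that this curve will
approximately follow a bell curve, that is, a normal distribution, regardless of the
shape of the distribution of the universe (as long as the population has a finite
variance).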

Second, it is mathematically proven that the mean of sample means equals the
population mean.

Third, given the standard deviation of the population (denoted as sigma, σ), the
standard deviation of the sample means (also known as the standard error) is given
by sigma divided by the square root of the number of observations in each sample
(n). Therefore, the larger the sample size (n), the closer the sample mean is likely
to be to the population mean.

Explanation:

The standard deviation of the sample means is referred to as the standard error
(SE), which is calculated as:

$$SE = \frac{\sigma}{\sqrt{n}}$$

As the sample size (n) increases, the standard error decreases, meaning the sample
mean becomes a more accurate estimate of the population mean.
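If you want to see the CLT in action, here is a small simulation sketch. Everything in it is an assumption made for illustration: the population is 20,000 random values from a skewed (exponential) distribution, each sample has 50 observations, and we draw 2,000 samples. The point is to compare the mean of the sample means with the population mean, and the spread of the sample means with σ/√n.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Assumed population: skewed, definitely not bell-shaped.
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

n = 50               # observations in each sample
num_samples = 2_000  # how many samples we draw

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
observed_se = statistics.stdev(sample_means)  # spread of the sample means
expected_se = pop_sd / n ** 0.5               # sigma / sqrt(n)

print(round(pop_mean, 3), round(mean_of_means, 3))
print(round(observed_se, 3), round(expected_se, 3))
```

The two numbers on each printed line should be close to each other. Plotting a histogram of `sample_means` would show the bell shape, even though the population itself is skewed.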

This is why, in statistical studies, we always check the sample size before trusting
the results. If the sample is very small, the estimate is too noisy to rely on.

I know this was complex, but just have a rough idea of how things work. Next time
someone tells you, “Eat this and reduce your belly — 5 out of 8 people saw results,”
check the sample size. Try to find out more about the metrics they’re sharing. All
I’m telling you is that statistics can be misused if not understood by users.

Alright, let’s move to the next topic: the z-score. We’ll see what a z-score is and why
we use it.

A z-score indicates how many standard deviations away a data point is from the
population mean.

This score is used for two primary purposes:

1. Standardization

2. Comparing which group is better

Standardization:

When we have a dataset, there may be many different variables, like age and
income. If we’re using distance-based machine learning algorithms, a one-unit
change in age is not the same as a one-unit change in income. Income may be in the
range of 100k USD, while age takes values like 10, 20, etc. So, the data needs to be
standardized. You don’t need to do anything manually; everything is taken care of
by a single line of code. But you should know what it means when someone says that
they need to standardize the data for better accuracy. Here, standardizing means
replacing each value with its z-score.

There are other ways to rescale data, such as the min-max scaler, where all
variables are adjusted to have values between 0 and 1.
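Here is a minimal sketch of both rescaling methods, written by hand so you can see what happens inside. The age and income values and the function names `z_scores` and `min_max` are made up for this example; in practice a library would do this for you.

```python
import statistics

# Hypothetical data on very different scales.
ages = [25, 35, 45, 55]
incomes = [40_000, 60_000, 80_000, 100_000]

def z_scores(values):
    """Standardize: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale so the smallest value becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print([round(z, 3) for z in z_scores(ages)])
print([round(z, 3) for z in z_scores(incomes)])
print([round(m, 3) for m in min_max(incomes)])
```

After standardization, age and income end up on the same scale, so a distance-based algorithm no longer treats income as thousands of times more important than age.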

Let’s move on to the second use of the z-score.

Using z-scores to compare the means of two groups is a common approach in
hypothesis testing. Here’s an example from an HR context:

Scenario:

An HR department wants to compare the average job satisfaction scores between
two departments, A and B, within a company. The company uses a standardized job
satisfaction survey with a known population standard deviation (σ) of 10.

Data:

Department A:
Sample mean (X̄A) = 75
Sample size (nA) = 50

Department B:
Sample mean (X̄B) = 70
Sample size (nB) = 60

Objective:

To determine if there is a statistically significant difference in job satisfaction scores
between the two departments.

Steps:

1. State the Hypotheses:

Null hypothesis (H₀): There is no difference in the job satisfaction means of the
two departments (μA = μB).

Alternative hypothesis (H₁): There is a difference in the job satisfaction means
(μA ≠ μB).

2. Calculate the Standard Error of the Difference in Means:

$$SE = \sqrt{\frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B}} = \sqrt{\frac{100}{50} + \frac{100}{60}} \approx 1.915$$

3. Calculate the Z-Score:

$$Z = \frac{\bar{X}_A - \bar{X}_B}{SE} = \frac{75 - 70}{1.915} \approx 2.61$$

4. Determine the P-Value:

Using a standard normal distribution table, find the p-value corresponding to Z =
2.61. Because the alternative hypothesis is two-sided (μA ≠ μB), we use both tails:
the one-tailed area is approximately 0.0045, so the two-tailed p-value is
approximately 0.009.

5. Make a Decision:
If the p-value is less than the significance level (typically α = 0.05), reject the null
hypothesis.
In this case, p ≈ 0.009 < 0.05, so we reject the null hypothesis and conclude that
there is a statistically significant difference in job satisfaction scores between
departments A and B.
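The steps of this two-sample z-test can be checked with a short Python sketch. It uses only the numbers from the scenario (σ = 10, means of 75 and 70, sample sizes of 50 and 60) and the standard library's `NormalDist` in place of a printed z-table. The variable names are my own. Because the code does not round intermediate results, the last digit may differ slightly from a hand calculation.

```python
from math import sqrt
from statistics import NormalDist

sigma = 10              # known population standard deviation
mean_a, n_a = 75, 50    # Department A
mean_b, n_b = 70, 60    # Department B

# Standard error of the difference in means.
se = sqrt(sigma**2 / n_a + sigma**2 / n_b)

# Z-score: observed difference divided by its standard error.
z = (mean_a - mean_b) / se

# Two-tailed p-value, because H1 says the means are simply "different".
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
reject_null = p_two_tailed < alpha

print(round(se, 3), round(z, 2), round(p_two_tailed, 4), reject_null)
```

If you only cared about one direction (for example, "is A higher than B?"), you would use a one-tailed p-value, which is half of the two-tailed one.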

Coming back to the definition of a z-score: in this test, it measures how far the
observed difference between the sample means is from zero (the value expected
under the null hypothesis), in units of standard error.

Again, I shared this concept with you so that if you get a problem comparing two
different groups, you know that simply comparing their raw scores is not enough. We
need to use a z-score, a standard measure for comparing two very different groups.

Lastly, the coefficient of variation (standard deviation as a percentage of the mean,
σ/μ × 100) is useful when we want to see if one group has a higher variation
compared to another group. For example, salaries in India and salaries in the US. If
the coefficient of variation is higher in India, then I can conclude that there is more
variation in salaries, relative to the average salary, in India compared to the US.
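A quick sketch of the coefficient of variation, using two invented groups (they are placeholders, not real salary data for any country). The helper name `coefficient_of_variation` is also just for this example.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

# Hypothetical salaries for two groups with very different averages.
group_a = [50_000, 60_000, 70_000]
group_b = [5_000, 10_000, 15_000]

cv_a = coefficient_of_variation(group_a)
cv_b = coefficient_of_variation(group_b)

print(round(cv_a, 1), round(cv_b, 1))
```

Group A has the larger standard deviation in dollars, but group B has the larger coefficient of variation, because its spread is big compared to its own average.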

5. Inferential statistics

Welcome to the world of inferential statistics. This is where things get interesting.
Let me take you through it step by step.

A hypothesis is nothing more than an assumption. Then, we have the Null
Hypothesis and the Alternative Hypothesis. The null hypothesis assumes that
everything is fine, peaceful, and as expected. The alternative hypothesis is just the
opposite: it assumes that things are not as they seem.

Here are some examples:

H₀: There is no difference between the incomes of male and female employees.
H₁: Male and female salaries are different.

H₀: The coin is fair.

H₁: The coin is not fair.

For example, consider the famous coin-tossing experiment.

Everyone has tossed a coin and tried to see if it lands heads or tails.

If I ask you to toss the coin 100 times and record how many times you get heads:

If you get heads 50% of the time, you’re okay with the results.

If you get heads 40% of the time, you’re still okay, right?

If you get heads 10% of the time, you might say, “Umm… okay,” but start to have
doubts.

But if you get heads only 5% of the time, this is when you get frustrated and say,
“The coin is not fair!” — and you stop!

In the example above, as the proportion of heads moves further from 50%, your
confidence in the fairness of the coin also decreases. The question is: at what stage
do you conclude that the coin is not fair? The industry standard threshold is usually
5%. This 5% is called the level of significance, or 0.05 in decimal form. Note that it is
a threshold on a probability, not on the share of heads. The p-value is the probability
of seeing a result at least as extreme as the one we observed, assuming the null
hypothesis is true.

So, 100% − 5% = 95% is the confidence level we have left. If the p-value falls below
0.05, we say the coin is not fair, accepting a 5% risk of wrongly rejecting a coin that
is actually fair.

A confidence level of 95% means that if the study were repeated many times,
about 95% of the confidence intervals we compute would contain the actual average.

Let’s look at this on a graph. Under the null hypothesis, you expect your result to be
near the mean. If the testing result is far from the mean (what we generally
observe), then we will reject the null hypothesis.

A 95% confidence level means that 95% of the results expected under the null
hypothesis lie between two cutoff values. If our result falls outside this range, in the
remaining 5%, we reject the null hypothesis. The curve is symmetric, so 2.5% lies in
each tail.

Interpretation of Results:

If the p-value is less than 0.05, we can reject the null hypothesis.

But if the p-value is greater than 0.05, then we say that we do not have sufficient
evidence to conclude that the coin is biased. We would need more data to confirm that.

Note: With these tests, we can only reject the null hypothesis or fail to reject it; we
never prove it true. That was the basic idea behind hypothesis testing and p-values.
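To connect the coin example with p-values, here is a sketch that computes an exact two-sided p-value for 100 tosses of a fair coin. The function name `two_sided_p_value` is made up for this example, and the method (adding up the binomial probabilities of every outcome at least as far from 50 heads as the one observed) is one common way to do it, not the only one.

```python
from math import comb

def two_sided_p_value(heads, tosses=100):
    """Exact two-sided p-value under H0: the coin is fair."""
    expected = tosses / 2
    distance = abs(heads - expected)
    # Count all outcomes at least as far from the expected count
    # as the one we observed, in either direction.
    extreme_ways = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - expected) >= distance
    )
    # Each exact sequence of tosses has probability 1 / 2**tosses.
    return extreme_ways / 2 ** tosses

for heads in (50, 40, 10, 5):
    print(heads, two_sided_p_value(heads))
```

With 40 heads the p-value lands just above 0.05, so we would not reject the null hypothesis at the 5% level, while 10 or 5 heads give p-values that are practically zero.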

6. Which Test to Use for Hypothesis Testing

It depends on the data type and the type of question you are solving, in particular
whether the data is numeric or categorical.

If the data is numerical and we are comparing means, we use a t-test.

If the data is categorical, we use a chi-square test.

Let’s look at a scenario for the t-test:

Scenario for T-Test:

An HR manager wants to determine if there is a difference in average salaries
between male and female employees. They collect salary data for a sample of male
and female employees.

Data:

Let’s assume we have the following salary data (in USD) for a sample of male and
female employees:

Male Employees:

Salaries: $50,000, $55,000, $60,000, $45,000, $52,000

Female Employees:

Salaries: $48,000, $52,000, $58,000, $42,000, $50,000

Hypotheses:

Null Hypothesis (H₀): There is no difference in average salaries between male
and female employees.

Alternative Hypothesis (H₁): There is a difference in average salaries between
male and female employees.

Interpretation:

If the t-test indicates a significant difference in average salaries between male and
female employees, it suggests that there is indeed a difference in salaries between
these two groups. This insight can be valuable for HR managers in addressing
potential gender pay gaps and ensuring fair compensation practices within the
organization.

In instances where you want to test specifically whether males earn more (or whether
females earn more), you can do a one-tailed test.
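Here is a sketch of the t-test on the salary data above, done by hand with the standard library. It assumes the classic two-sample t-test with equal variances (pooled variance), which is one common choice and not necessarily what a given tool uses by default. Instead of computing a p-value, it compares the t-statistic with 2.306, the two-tailed critical value from a t-table for α = 0.05 and 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

male = [50_000, 55_000, 60_000, 45_000, 52_000]
female = [48_000, 52_000, 58_000, 42_000, 50_000]

n1, n2 = len(male), len(female)

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * variance(male) + (n2 - 1) * variance(female)) / (n1 + n2 - 2)

# Standard error of the difference between the two means.
se = sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean(male) - mean(female)) / se

T_CRITICAL = 2.306  # two-tailed, alpha = 0.05, df = n1 + n2 - 2 = 8
significant = abs(t_stat) > T_CRITICAL

print(round(t_stat, 3), significant)
```

With only five people per group, the difference of $2,400 between the averages is small compared to the spread within each group, so this sample does not give enough evidence to reject the null hypothesis.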

Paired T-Test

A paired t-test is used to determine whether there is a significant difference
between the means of two related groups. It is typically used when you have two sets
of data that are paired in some way, such as before-and-after measurements on the
same group of subjects, or measurements taken from two different conditions on
the same subjects.

Consider a scenario where we want to compare employee engagement levels before
and after COVID-19:
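As a rough sketch of how a paired t-test works, the Python below uses invented engagement scores for six employees (these numbers are placeholders, not data from the article). The key idea is that we test the differences for each person, not the two columns separately. The critical value 2.571 is the two-tailed t-table value for α = 0.05 and 5 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical engagement scores (0-10) for the same six employees.
before = [7.0, 6.5, 8.0, 7.5, 6.0, 7.0]
after = [6.5, 6.0, 7.0, 7.5, 5.5, 6.0]

# One difference per employee: this is what makes the test "paired".
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
se = stdev(diffs) / sqrt(n)   # standard error of the mean difference
t_stat = mean(diffs) / se

T_CRITICAL = 2.571  # two-tailed, alpha = 0.05, df = n - 1 = 5
significant = abs(t_stat) > T_CRITICAL

print(round(mean(diffs), 3), round(t_stat, 3), significant)
```

In this made-up data the scores drop by a little over half a point on average, and the drop is consistent across employees, so the test flags it as significant.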

ANOVA

While t-tests can be used to compare the means of two groups, ANOVA is used when
you have three or more groups to compare.

Scenario:

Let’s consider salary comparisons by country. Suppose you want to compare the
salaries of employees across different countries within your organization. You could
use ANOVA to determine if there are significant differences in salary levels between
countries. Each country would represent a group, and salary would be the
dependent variable.
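Below is a sketch of a one-way ANOVA calculated by hand. The three groups and their salaries (in thousands of USD) are invented placeholders. The code computes the F-statistic, which compares the variation between group averages with the variation inside the groups, and checks it against 3.885, the F-table critical value for α = 0.05 with 2 and 12 degrees of freedom.

```python
from statistics import mean

# Hypothetical salaries in thousands of USD, five employees per country.
groups = {
    "country_a": [30, 32, 35, 31, 32],
    "country_b": [60, 62, 65, 61, 62],
    "country_c": [50, 52, 55, 51, 52],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Variation of the group means around the overall mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Variation of individual values around their own group mean.
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))

F_CRITICAL = 3.885  # alpha = 0.05, df = (2, 12)
significant = f_stat > F_CRITICAL

print(round(f_stat, 2), significant)
```

A large F-statistic tells you that at least one group differs from the others; it does not tell you which one, so a follow-up comparison is usually needed.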

Chi-Square Test

Chi-square is used when we have categorical variables.

Example:

Consider the question, “Do women drink more cola than men?” Here, we can
compute frequencies: how many men drink cola and how many do not, and
similarly for women, and then perform a chi-square test.

Another example:

Suppose I claim that female students take arts subjects more often than male
students. Here we have subject categories and male-female categories, so this is a
categorical data scenario.

Consider a dataset with the columns Gender, Marks, Subjects, Pre-coaching Marks,
Post-coaching Marks, and Country.

If we want to test whether the mean scores of males and females are different, then
we would use a t-test.

Scenario for Chi-Square:

An HR manager wants to determine if there is a difference in job performance
between college hires and non-college hires. Job performance is evaluated using a
performance rating scale ranging from 1 to 5, where 1 indicates “Low” performance
and 5 indicates “High” performance.

Data:

Here’s a hypothetical dataset representing the performance ratings of employees
categorized into two groups: College Hires and Non-College Hires.

Hypotheses:

Null Hypothesis (H₀): There is no difference in job performance between
college hires and non-college hires.

Alternative Hypothesis (H₁): There is a difference in job performance between
college hires and non-college hires.

Interpretation:

If the chi-square test indicates a significant difference in job performance between
college hires and non-college hires, it suggests that there is indeed a difference in
the performance of these two groups. This insight can be valuable for HR managers
in making decisions related to recruitment strategies and talent management
practices within the organization.
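Here is a sketch of how the chi-square statistic is computed for a scenario like this one. The counts are invented for illustration, and I have assumed the 1 to 5 ratings are grouped into three buckets (Low, Medium, High) to keep the table small. The critical value 5.991 is the chi-square table value for α = 0.05 with 2 degrees of freedom.

```python
# Hypothetical counts of employees per rating bucket: [Low, Medium, High].
observed = {
    "College Hires": [10, 30, 20],
    "Non-College Hires": [20, 30, 10],
}

row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

chi2 = 0.0
for group, counts in observed.items():
    for j, obs in enumerate(counts):
        # Count we would expect if hire type and rating were unrelated.
        expected = row_totals[group] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)

CHI2_CRITICAL = 5.991  # alpha = 0.05, df = 2
significant = chi2 > CHI2_CRITICAL

print(round(chi2, 3), df, significant)
```

The test works on counts in each category, not on averages, which is why it fits categorical data.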

Z-Test (Are we paying more than the industry average?)

When we use the z-test, we compare a sample mean against a known population
mean. We need to know the population mean and the population standard deviation,
and we need a large sample size (>30).

Example:

Suppose the industry average salary is $60,000 with a known population standard
deviation of $5,000. We collect a sample of 100 employees from the company and
find that the average salary in the sample is $62,000. Given this scenario, we can use
a z-test to determine if the average salary of employees in the company is
significantly different from the industry average.
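The example above can be worked through in a few lines. This sketch uses only the numbers given in the example and the standard library's `NormalDist`; the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

industry_mean = 60_000  # known population mean
industry_sd = 5_000     # known population standard deviation
sample_mean = 62_000    # average salary in our sample
n = 100                 # sample size

# Standard error of the sample mean: sigma / sqrt(n).
se = industry_sd / sqrt(n)

# How many standard errors is our sample mean away from the industry mean?
z = (sample_mean - industry_mean) / se

# Two-tailed p-value: "different from", not specifically "higher than".
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(se, z, p_value)
```

A z-score of 4 is far out in the tail of the bell curve, so the p-value is much smaller than 0.05 and we would conclude that the company's average salary differs from the industry average.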

7. Box Plot:

Let’s see what a box plot is. It’s a visual way of understanding how spread out your
data is.

The middle line represents the median, which means that 50% of the data
points lie below this line, and 50% lie above.

Then we have the first quartile (Q1) and the third quartile (Q3). The median is
the second quartile.

The range between Q1 and Q3 is called the interquartile range (IQR), which
contains the middle 50% of the data.

The whiskers extend from Q1 and Q3 to the minimum and maximum values
within a defined range (a common convention is 1.5 × IQR beyond the box).

Any value lying beyond the whiskers is called an outlier. In the plot, these outliers
are typically shown as individual points beyond the whiskers.
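The numbers behind a box plot can be sketched like this. The data is invented, and the 1.5 × IQR rule for the whisker limits is a widely used convention that I am assuming here; different tools may calculate quartiles in slightly different ways.

```python
import statistics

# Hypothetical salaries in thousands of USD, with one extreme value.
data = [42, 45, 48, 50, 52, 55, 58, 60, 150]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: whiskers reach at most 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(outliers)
```

Only the value 150 falls outside the fences, so it is the one point a box plot would draw separately beyond the whisker.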

Below is a visual representation of a box plot:

So now, I’ve introduced another common term: Outlier. An outlier is a data point
that lies very far from the other data points (or the overall distribution). When you
have outliers, the mean is not always a good statistic, because a few extreme values
can pull it up or down. In that case, consider using the median, and investigate the
outliers before deciding whether to remove them for clearer analysis.

Another term often used in statistics is Percentile. I hope you have heard of it. Let
me explain it with an example.

There is a famous MBA entrance exam in India called CAT (Common Admission
Test), where the score comes in percentiles. Colleges select students based on top
percentiles.

If I got 80th percentile in English, it means that 80% of students scored less than me,
or conversely, I am in the top 20%. Similarly, if I scored 90th percentile in Math, that
means I was in the top 10% for that subject. I hope that clarifies percentiles.
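A percentile in this sense can be sketched with a tiny function. The scores and the function name `percentile_rank` are made up, and the definition used here (share of test takers who scored strictly less than you) is the simple one from the explanation above; real exams may use a slightly different formula.

```python
def percentile_rank(scores, my_score):
    """Percentage of scores that are strictly below my_score."""
    below = sum(1 for s in scores if s < my_score)
    return 100 * below / len(scores)

# Hypothetical scores of ten students.
scores = [35, 42, 50, 55, 61, 68, 72, 80, 88, 95]

print(percentile_rank(scores, 88))
```

The student who scored 88 did better than 8 out of 10 students, so they are at the 80th percentile, or in the top 20%.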

Log Scale

I also want to touch upon when we use a log scale instead of a normal (linear) scale.
When data has a very wide range, using a normal scale might not make sense.

For example, let’s say you’re plotting the incomes of people in a country. One person
might be earning $10 a day, while another might be earning millions. In this case, if
you plot the graph using a normal scale, the small values get squashed together and
the graph is hard to read. However, if you convert the salaries to a log scale, the data
becomes more manageable. In this case, the axis moves in powers of 10 (like 1, 10,
100, 1,000), making the graph more meaningful.
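To see what a log scale does to the numbers, here is a short sketch with invented daily incomes. It uses the base-10 logarithm, so each step of 1 on the log scale means "ten times more".

```python
import math

# Hypothetical daily incomes in USD, from very low to very high.
daily_incomes = [10, 100, 1_000, 10_000, 1_000_000]

# Base-10 log: counts how many times we multiply by 10 to reach the value.
log_incomes = [math.log10(x) for x in daily_incomes]

for income, log_value in zip(daily_incomes, log_incomes):
    print(income, round(log_value, 2))
```

On the original scale the values run from 10 to 1,000,000, but on the log scale they run only from 1 to 6, so all of them fit comfortably on the same chart.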

8. Proportion, Probability, and Odds

Now, let’s look at proportion, probability, and odds.

Proportion is simply a fraction of the whole. For example, male students might
make up 40% of the class.

Probability is used when an event is uncertain, and we are trying to predict an
outcome. For instance, “What is the probability that it will rain today?” or “What
is the probability that I will win a game?” Even if the probability of winning is
95%, you still might not win because it’s a probability.

Odds is the ratio p / (1 − p), where p is the probability of an
event occurring. For example, if the odds of winning a game are 3:1, it means
that the probability of winning is three times higher than the probability of not
winning.
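The link between probability and odds is easy to check with two small helper functions. The function names are my own for this sketch; the formulas are just the definition of odds and its inverse.

```python
def probability_to_odds(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

def odds_to_probability(odds):
    """Inverse of the formula above: p = odds / (1 + odds)."""
    return odds / (1 + odds)

# Odds of 3:1 correspond to a 75% chance of winning.
print(probability_to_odds(0.75))
print(odds_to_probability(3))
```

So odds of 3:1 mean winning 75% of the time and not winning 25% of the time, which is the "three times higher" in the example above.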

9. Regression vs Correlation vs Covariance

Let’s look at Covariance. As the name suggests, it tells us how two variables vary
together. Covariance shows whether two variables move in the same direction
(positive covariance) or in opposite directions (negative covariance). For example, if
age increases and income also increases, we have a positive covariance.

However, covariance doesn’t tell us the strength of the relationship, because its size
depends on the units of the variables. For that, we use correlation, which always lies
between −1 and +1. Correlation provides both the strength and the direction of the
relationship between two variables.

But remember, correlation does not imply causation.
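Here is a sketch that computes covariance and correlation by hand, using invented age and income values (income in thousands of USD). The function names are my own. It also shows why covariance alone is hard to interpret: changing the units of income changes the covariance but leaves the correlation untouched.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: income (in thousands of USD) rises with age.
ages = [25, 30, 35, 40, 45]
incomes = [30, 42, 50, 61, 70]

def covariance(xs, ys):
    """Sample covariance: average product of deviations (divide by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it is unit-free."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

cov = covariance(ages, incomes)
corr = correlation(ages, incomes)

# Same incomes expressed in dollars instead of thousands of dollars.
incomes_usd = [v * 1_000 for v in incomes]
cov_usd = covariance(ages, incomes_usd)
corr_usd = correlation(ages, incomes_usd)

print(cov, round(corr, 4))
print(cov_usd, round(corr_usd, 4))
```

The covariance becomes a thousand times larger just because of the unit change, while the correlation stays the same, close to +1.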

Regression is used when we want to find the impact of an independent variable on
a dependent variable. For example, if we want to study how salary is influenced by
various factors like education, we would use regression analysis. You might have
heard of regression as a slope or a straight line. This line is called the regression
line and is an approximation of real-life relationships between variables.
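As a sketch of what fitting a regression line means, the code below finds the slope and intercept of the best-fitting straight line using the least squares formulas. The data (years of education and salary in thousands of USD) is invented, and this is simple linear regression with a single factor; real analyses usually include several.

```python
from statistics import mean

# Hypothetical data: years of education and salary in thousands of USD.
years_education = [10, 12, 14, 16, 18]
salary = [30, 38, 45, 55, 62]

mx, my = mean(years_education), mean(salary)

# Least squares slope: how much salary changes per extra year of education.
slope = sum((x - mx) * (y - my) for x, y in zip(years_education, salary)) / sum(
    (x - mx) ** 2 for x in years_education
)

# Intercept: where the line crosses the vertical axis.
intercept = my - slope * mx

def predict(years):
    """Salary (in thousands) predicted by the regression line."""
    return intercept + slope * years

print(round(slope, 2), round(intercept, 2), round(predict(15), 2))
```

In this made-up data, each extra year of education goes with roughly 4 thousand dollars more salary. Remember that this describes a relationship in the data and does not by itself prove that education causes the higher salary.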

This handbook was designed to make statistics not only accessible but also
enjoyable. As you proceed, remember, the goal is to understand the concepts. Once
you do, you can leverage tools like ChatGPT to handle the heavy lifting for
calculations. Let’s all demystify statistics and show how valuable it can be in
navigating today’s data-driven world.

I’m excited to see you in my lecture series and promise to make it a learning
experience you’ll enjoy!
