Statistics For Data Science
Contents
Statistics
Descriptive statistics vs Inferential Statistics
Central Tendency
Mean
Median
Mode
Inter-quartile Range
Calculating Interquartile Range
Variance
Co-Variance
Standard deviation
Correlation
Skewness and Kurtosis
Types of Distributions
Hypothesis Testing
Types in Hypothesis Testing
Error in Hypothesis Testing
T-test
One sample t-test
Statistics
For example, if we consider one live class to be a sample of the population of all live
classes, then the average number of points earned by students in that one live class at the
end of the term is an example of a statistic.
The two main branches of statistics are:
Descriptive Statistics
uses the data to describe the population or sample being studied.
- It summarises and analyses the data in a meaningful way
- Descriptive statistics is very important for presenting raw data in an
effective, meaningful way
Inferential Statistics
makes predictions and inferences about a population.
- Predictions are made about the population we are interested in
- They are based on a random sample of data taken from that population
Central Tendency
Central tendency is a descriptive summary of a dataset through a single value that reflects
the centre of the data distribution. Along with the variability (spread) of the dataset,
central tendency is part of descriptive statistics.
The most common measures of central tendency are the mean, median, and mode. A
central tendency can be calculated for either a finite set of values or for a theoretical
distribution, such as the normal distribution.
Mean
The mean represents the average value of the dataset. It can be calculated as the sum of all
the values in the dataset divided by the number of values. In general, it is considered as the
arithmetic mean.
Formula
x̄ = Σfx / N
Where,
• x̄ = the mean value of the set of given data
• x = an individual data value
• f = the frequency of that data value
• N = the sum of the frequencies (Σf)
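The formula above can be sketched in a few lines of Python. The function name `frequency_mean` and the sample data are illustrative assumptions, not part of the original notes.

```python
def frequency_mean(values, frequencies):
    """Return the mean of frequency data: sum(f * x) / sum(f)."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total_frequency = sum(frequencies)  # N in the formula
    if total_frequency <= 0:
        raise ValueError("total frequency must be positive")
    weighted_sum = sum(f * x for x, f in zip(values, frequencies))  # sum of f*x
    return weighted_sum / total_frequency


# Hypothetical data: 10 occurs 2 times, 20 occurs 3 times, 30 occurs 5 times.
example_mean = frequency_mean([10, 20, 30], [2, 3, 5])
print(example_mean)  # (20 + 60 + 150) / 10 = 23.0
```

When every frequency is 1, this reduces to the ordinary arithmetic mean (sum of values divided by the number of values).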
Median
For grouped data, the median is given by:
Median = Lm + [(n/2 − F) / fm] × i
Where,
• n = the total frequency
• F = the cumulative frequency of the classes before the median class
• fm = the frequency of the median class
• i = the class width
• Lm = the lower boundary of the median class
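As a rough illustration of the grouped-data formula, the sketch below locates the median class (the first class whose cumulative frequency reaches n/2) and then applies the formula. The function name, the equal class width and the example classes are assumptions made for this example.

```python
def grouped_median(lower_bounds, class_width, frequencies):
    """Median of grouped data: Lm + ((n/2 - F) / fm) * i."""
    n = sum(frequencies)  # total frequency
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, frequencies):
        if cumulative + f >= n / 2:
            # This is the median class: Lm = lower, fm = f, i = class_width
            return lower + ((n / 2 - cumulative) / f) * class_width
        cumulative += f
    raise ValueError("could not locate the median class")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with these frequencies:
example_median = grouped_median([0, 10, 20, 30], 10, [5, 8, 12, 5])
print(round(example_median, 4))  # 20 + ((15 - 13) / 12) * 10
```

Here n = 30, so n/2 = 15 falls in the class 20-30, with F = 13 and fm = 12.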
Mode
The mode is the value that appears most often in a set of data values.
Inter-quartile Range
The interquartile range is the difference between the third and the first quartile.
Quartiles are the values that divide the whole ordered series into four equal parts, so there
are three quartiles. The first quartile, denoted by Q1, is known as the lower quartile; the second
quartile is denoted by Q2 (the median); and the third quartile, denoted by Q3, is known as the upper
quartile. Therefore, the interquartile range is equal to the upper quartile minus the lower
quartile.
Formula for the interquartile range
Interquartile range = Upper Quartile – Lower Quartile = Q3 – Q1
where Q1 is the first quartile and Q3 is the third quartile of the series
To find the quartiles, first arrange the values in ascending order, then count them. If the
count is odd, the centre value is the median; if it is even, the median is the mean of the two
centre values. This is the Q2 value.
The median cuts the given values into two equal parts, a lower half and an upper half.
The median of the data values below Q2 is Q1, and the median of the data values above Q2 is Q3.
[Figure: box plot showing, from top to bottom, the maximum (high value), the upper quartile Q3 (75th percentile), the median Q2, the lower quartile Q1 (25th percentile) and the minimum (low value)]
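The quartile procedure described in this section can be sketched as follows. This uses the "median of each half" convention, excluding the middle value when the count is odd; other conventions exist and can give slightly different quartiles. The function name and data are illustrative assumptions.

```python
from statistics import median


def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower_half = ordered[:half]                 # values below the median
    upper_half = ordered[len(ordered) - half:]  # values above the median
    q1 = median(lower_half)
    q2 = median(ordered)
    q3 = median(upper_half)
    return q1, q2, q3, q3 - q1


# Hypothetical data set with an odd number of values.
q1, q2, q3, iqr = quartiles_and_iqr([7, 1, 9, 4, 3, 8, 6])
print(q1, q2, q3, iqr)  # sorted data: 1 3 4 6 7 8 9
```

For this data the lower half is 1, 3, 4 and the upper half is 7, 8, 9, so Q1 = 3, Q2 = 6, Q3 = 8 and the interquartile range is 5.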
Variance:
Co-Variance:
Co-Variance formula
Cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (N − 1)
where x̄ and ȳ are the means of X and Y, and N is the number of data pairs
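A minimal sketch of the sample covariance formula, dividing by N − 1 as in the formula above. The function name and the data are assumptions for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x - mean_x) * (y - mean_y)) / (N - 1)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two data pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (n - 1)


# Hypothetical data in which y is always twice x.
example_cov = sample_covariance([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(example_cov)  # (8 + 2 + 0 + 2 + 8) / 4 = 5.0
```

A positive result indicates that the two variables tend to increase together; a negative result indicates that one tends to decrease as the other increases.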
Standard deviation
A standard deviation is a measure of how dispersed the data is in relation to the mean.
Low standard deviation means data are clustered around the mean, and high standard
deviation indicates data are more spread out.
Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5
= 1970 / 5
= 394
To calculate the Variance, take each difference from the mean, square it, and then average the result:
Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
σ² = (42436 + 5776 + 50176 + 1296 + 8836) / 5
σ² = 108520 / 5
σ² = 21,704
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation
σ = √21704
= 147.32...
= 147 (mm)
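The worked answer can be checked with Python's standard `statistics` module. The variable names are assumptions. Note that the worked example divides by N (population variance), so the sketch uses `pvariance` and `pstdev`; the sample versions `variance` and `stdev` divide by N − 1 instead.

```python
import statistics

# The five values (in mm) from the worked example.
values_mm = [600, 470, 170, 430, 300]

mean_value = statistics.mean(values_mm)               # 1970 / 5
population_variance = statistics.pvariance(values_mm)  # divides by N
population_std_dev = statistics.pstdev(values_mm)      # square root of the variance

print(mean_value)
print(population_variance)
print(round(population_std_dev, 2))
```

The three printed values should agree with the hand calculation: 394, 21704 and approximately 147.32.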
Correlation
Correlation is a statistic that measures the degree to which two variables move in relation
to each other.
Correlation shows the strength of a relationship between two variables and is expressed
numerically by the correlation coefficient. The correlation coefficient's values range
between -1.0 and 1.0.
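As an illustration, the sketch below computes the Pearson correlation coefficient, the most common correlation coefficient, directly from its definition. The notes do not name a specific coefficient, so the choice of Pearson's r, the function name and the data are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y rises with x in one case and falls with x in the other.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Because the first pair of variables moves exactly together and the second moves exactly in opposite directions, the two results sit at the two ends of the range, 1 and -1.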
A perfect positive correlation means that the correlation coefficient is exactly 1. This
implies that as one variable moves, either up or down, the other variable moves in
lockstep, in the same direction. A perfect negative correlation means that the two variables
move in opposite directions, with a correlation coefficient of exactly -1.
Skewness
[Figure: positively (+ve) skewed distribution]
In a negatively skewed distribution, the mean of the data is less than the median: the long
tail extends to the left, while most of the data are concentrated on the right-hand side.
Typically the mean is less than the median, which in turn is less than the mode. The name
refers to the direction of the tail, not to the sign of the data values.
[Figure: negatively (-ve) skewed distribution]
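The relationship between skew direction, mean and median can be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is one common definition and an assumption here; the data set is hypothetical.

```python
from statistics import mean, median


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - centre) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical left-skewed data: most values are high, with a tail towards 1.
left_skewed = [1, 7, 8, 8, 9, 9, 9, 10, 10]
skew_value = moment_skewness(left_skewed)
print(skew_value < 0)                            # negative skew
print(mean(left_skewed) < median(left_skewed))   # mean is pulled below the median
```

Reversing the shape of the data (a long tail towards large values) would give a positive coefficient and a mean above the median.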
Kurtosis
Positive Kurtosis
Neutral Kurtosis
Negative Kurtosis
Types of Distributions
Bernoulli Distribution
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0
(failure), and a single trial. So the random variable X which has a Bernoulli distribution
can take value 1 with the probability of success, say p, and the value 0 with the
probability of failure, say q or 1-p.
The probability mass function is given by: p^x (1 − p)^(1 − x), where x ∈ {0, 1}.
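The probability mass function can be written directly in code. The function name and the value p = 0.3 are assumptions chosen for illustration.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli variable: p**x * (1 - p)**(1 - x)."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    return p ** x * (1 - p) ** (1 - x)


# Hypothetical success probability.
p_success = bernoulli_pmf(1, 0.3)  # probability of success, p
p_failure = bernoulli_pmf(0, 0.3)  # probability of failure, q = 1 - p
print(p_success, p_failure)
```

The two probabilities always add up to 1, since success and failure are the only outcomes.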
Uniform Distribution
When you roll a die, the outcomes are 1, 2, 3, 4, 5 and 6. These outcomes are all
equally likely, and that is the basis of a uniform distribution.
Binomial Distribution
Let's say we toss a coin. What are the possible outcomes?
There are only two possible outcomes: heads, denoting success, and tails, denoting failure.
Therefore, the probability of getting a head is p = 0.5, and the probability of failure can be
easily computed as: q = 1 − p = 0.5
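The coin example extends to the binomial distribution when the toss is repeated: the number of heads in n independent tosses follows a binomial distribution. The sketch below is an illustration under that assumption; the function name and the choice of n = 4 tosses are not from the notes.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Hypothetical experiment: a fair coin (p = 0.5) tossed 4 times.
p_two_heads = binomial_pmf(2, 4, 0.5)
total_probability = sum(binomial_pmf(k, 4, 0.5) for k in range(5))
print(p_two_heads)        # 6 * 0.5**4 = 0.375
print(total_probability)  # probabilities over k = 0..4 sum to 1
```

With n = 1 the formula reduces to the Bernoulli distribution described earlier.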
Normal Distribution
The normal distribution describes the behaviour of a great many quantities found in nature
and in measurement, which is one reason it is called the “normal” distribution. The sum of a
large number of (small) independent random variables often turns out to be approximately
normally distributed, contributing to its widespread application.
The curve of the distribution is bell-shaped and symmetrical about the line x = μ (the mean).
Exactly half of the values are to the left of the centre and the other half to the right.
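These properties can be checked with `statistics.NormalDist` from the Python standard library. The standard normal distribution (mean 0, standard deviation 1) is used here as an assumed example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Symmetry: the curve has the same height at equal distances from the mean.
height_left = standard_normal.pdf(-1)
height_right = standard_normal.pdf(1)

# Half of the probability lies to the left of the centre.
left_half = standard_normal.cdf(0)

# Share of values within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(height_left, height_right)
print(left_half)
print(round(within_one_sd, 4))
```

The last value is the familiar "about 68% within one standard deviation" figure for a bell-shaped distribution.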
Poisson Distribution
Suppose you work at a call centre. How many calls do you get in a day? It can be any
non-negative whole number. The total number of calls received at a call centre in a day can
be modelled by a Poisson distribution.
Exponential Distribution
Considering the above example, the time gap between successive calls can be modelled by an
exponential distribution.
The exponential distribution is widely used for survival analysis, from the expected life of
a machine to the expected life of a human.
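The call-centre example can be sketched numerically. The rate of 4 calls per hour is a hypothetical figure (any time unit works the same way), and the function names are assumptions. Both functions use the standard formulas: P(K = k) = λ^k e^(−λ) / k! for the Poisson count, and P(T ≤ t) = 1 − e^(−λt) for the exponential waiting time.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count per interval is `rate`."""
    if k < 0 or rate <= 0:
        raise ValueError("k must be non-negative and rate must be positive")
    return rate ** k * math.exp(-rate) / math.factorial(k)


def exponential_cdf(t, rate):
    """Probability that the waiting time until the next event is at most t."""
    if t < 0 or rate <= 0:
        raise ValueError("t must be non-negative and rate must be positive")
    return 1 - math.exp(-rate * t)


calls_per_hour = 4  # hypothetical average rate

p_two_calls = poisson_pmf(2, calls_per_hour)            # exactly 2 calls in an hour
p_call_in_15_min = exponential_cdf(0.25, calls_per_hour)  # next call within 0.25 hours

print(round(p_two_calls, 4))
print(round(p_call_in_15_min, 4))
```

The same rate parameter drives both distributions: the Poisson counts events in a fixed interval, while the exponential measures the gap between consecutive events.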
Hypothesis Testing
1. Null Hypothesis
2. Alternative Hypothesis
Null hypothesis(H0)
In statistics, the null hypothesis is a general given statement or default position that
there is no relationship between two measured cases or no relationship among groups.
Alternative hypothesis(H1)
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary
to the null hypothesis
Calculating the Test Statistic
t = (x̄ − μ) / (s / √n)
t = test statistic
x̄ = mean of the sample
μ = population mean assumed under the null hypothesis
s = sample standard deviation
n = number of observations in the sample
Level of significance
P-value
• Type I error
A Type I error occurs when we reject the null hypothesis although it is true. The
probability of a Type I error is denoted by alpha (α).
• Type II error
A Type II error occurs when we fail to reject the null hypothesis although it is false. The
probability of a Type II error is denoted by beta (β).
1. Is it true that vitamin C has the ability to cure or prevent the common cold, or is it just
a myth?
2. Null hypothesis: Children who take vitamin C are no less likely to become ill during flu
5. Conclusion: The observed p-value is 0.2, which is greater than the conventional
significance level of 0.05, so we fail to reject the null hypothesis; the data do not
support the alternative hypothesis.
T-test
A t-test allows us to compare the average values of two data sets and determine whether they
came from the same population. For example, if we were to take a sample of students from
class A and another sample of students from class B, we would not expect them to have exactly
the same mean and standard deviation; the t-test tells us whether the observed difference is
larger than chance alone would explain.
Calculating a t-test requires three key data values. They include the difference between the
mean values from each data set (called the mean difference), the standard deviation of each
group, and the number of data values of each group.
Types of T-Tests
one-sample t-test
In a one-sample t-test, we compare the average (or mean) of one group against the set
average (or mean). This set average can be any theoretical value (or it can be the population
mean).
t = (x̄ − μ0) / sx̄, where sx̄ = s / √n is the standard error of the mean
μ0 = The test value -- the proposed constant for the population mean
x̄ = Sample mean
n = Sample size (i.e., number of observations)
s = Sample standard deviation
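A minimal sketch of the one-sample t statistic, using only the standard library. The function name, the sample and the test value μ0 = 4 are assumptions for illustration. Turning the statistic into a p-value needs the t-distribution with n − 1 degrees of freedom, which the standard library does not provide, so the sketch stops at the statistic.

```python
import math
import statistics


def one_sample_t_statistic(sample, mu0):
    """Return (t, degrees_of_freedom) for a one-sample t-test against mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)       # sample standard deviation (divides by n - 1)
    if s == 0:
        raise ValueError("t is undefined when all observations are equal")
    standard_error = s / math.sqrt(n)  # s / sqrt(n)
    t = (sample_mean - mu0) / standard_error
    return t, n - 1


# Hypothetical sample tested against a proposed population mean of 4.
t_value, dof = one_sample_t_statistic([2, 4, 4, 4, 5, 5, 7, 9], mu0=4)
print(round(t_value, 4), dof)
```

For this sample the mean is 5 and the squared standard error is 4/7, so t = 1 / √(4/7) ≈ 1.3229 with 7 degrees of freedom. The value would then be compared with the critical value from a t-table at the chosen significance level.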
Two-Sample t-test
Z-test
A z-test is a statistical test based on the normal distribution. It is mainly used for
problems involving large samples (n ≥ 30).
A z-test for a single proportion is used to test a hypothesis on a specific value of the
population proportion.
A z-test for the difference of proportions is used to test the hypothesis that two
populations have the same proportion.
A z-test for a single variance is used to test a hypothesis on a specific value of the
population variance.
A z-test for equality of variances is used to test the hypothesis of equality of
two population variances when the size of each sample is 30 or larger.
z-scores
A z-score, or z-statistic, is a number representing how many standard deviations a value
lies above or below the population mean. Essentially, it is a numerical
measurement that describes a value's relationship to the mean of a group of values. If a
z-score is 0, it indicates that the data point's score is identical to the mean score. A
z-score of 1.0 would indicate a value that is one standard deviation above the mean. Z-scores
may be positive or negative, with a positive value indicating the score is above the mean and
a negative score indicating it is below the mean.
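The z-score of each value in a data set can be computed as (value − mean) / standard deviation. The sketch below treats the data set as the whole population (so it uses `pstdev`); the function name and the data are assumptions.

```python
import statistics


def z_scores(values):
    """Return the z-score of every value: (x - mean) / population standard deviation."""
    centre = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - centre) / spread for x in values]


# Hypothetical data with mean 5 and population standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(scores)
```

The value 5 equals the mean, so its z-score is 0; the value 9 lies two standard deviations above the mean (z = 2), and the value 2 lies one and a half standard deviations below it (z = −1.5).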
For a normal population with one sample:
z = (x̄ − µ) / (σ / √n)
where x̄ is the mean of the sample, µ is the assumed mean, σ is the standard deviation,
and n is the number of observations.
• A one-sample z-test is used to determine whether a particular population parameter,
usually the mean, is significantly different from an assumed value
• It helps to assess the relationship between the mean of the sample and the assumed
mean
• In this case, the standard normal distribution is used to calculate the critical value of the
test
• If the z-value of the sample being tested falls into the rejection region of the one-sided
test, the null hypothesis is rejected in favour of the alternative hypothesis
• A one-tailed test would be used when the study has to test whether the population
parameter being tested is either lower than or higher than some hypothesized value
• A one-sample z-test assumes that data are a random sample collected from a normally
distributed population that all have the same mean and same variance
• This hypothesis implies that the data is continuous, and the distribution is symmetric
• Based on the alternative hypothesis set for a study, a one-sided z-test can be either a
left-sided z-test or a right-sided z-test
• For instance, if H0: µ = µ0 and Ha: µ < µ0, such a test would be a one-sided test or,
more precisely, a left-tailed test, and there is one rejection area only on the left tail of
the distribution
• However, if H0: µ = µ0 and Ha: µ > µ0, this is also a one-tailed test (right tail), and the
rejection region is present on the right tail of the curve
• In the case of two sample z-test, two normally distributed independent samples are
required
• In the case of the two-tailed z-test, the alternative hypothesis is accepted as long as the
population parameter is not equal to the assumed value
• The two-tailed test is appropriate when we have H0: µ = µ0 and Ha: µ ≠ µ0 which may
mean µ > µ0 or µ < µ0
• Thus, in a two-tailed test, there are two rejection regions, one on each tail of the curve
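The left-tailed, right-tailed and two-tailed cases listed above can be sketched with the standard normal distribution from Python's `statistics` module. The function name, the `alternative` labels and the numbers in the example are assumptions for illustration.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for H0: mu = mu0 with known population sigma."""
    if sigma <= 0 or n <= 0:
        raise ValueError("sigma and n must be positive")
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    standard_normal = NormalDist()
    if alternative == "less":         # Ha: mu < mu0, rejection region on the left tail
        p_value = standard_normal.cdf(z)
    elif alternative == "greater":    # Ha: mu > mu0, rejection region on the right tail
        p_value = 1 - standard_normal.cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0, rejection regions on both tails
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value


# Hypothetical study: sample mean 52 from n = 100, assumed mean 50, sigma 10.
z_value, p_two_sided = one_sample_z_test(52, 50, 10, 100)
_, p_right = one_sample_z_test(52, 50, 10, 100, alternative="greater")
_, p_left = one_sample_z_test(52, 50, 10, 100, alternative="less")
print(z_value, round(p_two_sided, 4), round(p_right, 4), round(p_left, 4))
```

Here z = 2, so the right-tailed p-value is half the two-tailed one, while the left-tailed p-value is large because the sample mean lies above the assumed mean.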
ANOVA
Analysis of variance (ANOVA) is a statistical technique that is used to check if the means
of two or more groups are significantly different from each other. ANOVA checks the
impact of one or more factors by comparing the means of different samples.
The ANOVA test is the initial step in analyzing factors that affect a given data set. Once
the test is finished, an analyst performs additional testing on the systematic factors that
measurably contribute to the data set's variability. The analyst uses the ANOVA test
results in an F-test to generate additional data that aligns with the
proposed regression models.
If no real difference exists between the tested groups, which is called the null hypothesis,
the result of the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible
values of the F statistic is the F-distribution. This is actually a group of distribution
functions, with two characteristic numbers, called the numerator degrees of freedom and
the denominator degrees of freedom.
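As a rough sketch, the F-ratio for a one-way ANOVA is the variance between group means divided by the variance within groups, each scaled by its degrees of freedom. The function name and the three groups are assumptions for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, numerator_df, denominator_df) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    df_between = k - 1        # numerator degrees of freedom
    df_within = n_total - k   # denominator degrees of freedom
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical measurements for three groups.
f_ratio, df_num, df_den = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_ratio, 4), df_num, df_den)
```

For these groups the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = (26/2) / (6/6) = 13 with 2 and 6 degrees of freedom, well above the value of about 1 expected when the groups do not differ.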
Biologists want to know how different levels of sunlight exposure (no sunlight, low sunlight,
medium sunlight, high sunlight) and watering frequency (daily, weekly) impact the growth
of a certain plant. In this case, two factors are involved (level of sunlight exposure and
watering frequency), so they will conduct a two-way ANOVA to see if either factor
significantly impacts plant growth and whether the two factors interact with each other.
The results of the ANOVA will tell us whether each individual factor has a significant effect
on plant growth. Using this information, the biologists can better understand which level of
sunlight exposure and/or watering frequency leads to optimal growth.
Understanding P-Values
- The p-value, or probability value, tells you how likely it is that your data could have
occurred under the null hypothesis
- It does this by calculating the likelihood of your test statistic, which is the number
calculated by a statistical test using your data
- In statistics, the p-value is the probability of obtaining results at least as extreme as the
observed results of a statistical hypothesis test, assuming that the null hypothesis is
correct
- The p-value is used all over statistics, from t-tests to regression analysis
- The p-value is used as an alternative to rejection points to provide the smallest level of
significance at which the null hypothesis would be rejected. A smaller p-value means
that there is stronger evidence in favour of the alternative hypothesis
- The p-value is a proportion: if your p-value is 0.05, that means that 5% of the time you
would see a test statistic at least as extreme as the one you found if the null hypothesis
was true.
- P-values are usually automatically calculated by our statistical program.
- Parametric Test
Parametric tests assume that the data follow a probability distribution described by a
fixed set of parameters (for example, the mean and standard deviation of a normal
distribution). The same kind of probabilistic model may be used in Machine Learning
as well.
- Non-Parametric Test
In non-parametric tests we do not assume any parameters for the population, and the
tests do not depend on the population following a fixed distribution such as the
normal distribution.
Chi-Square Test
Chi-square is a non-parametric test.
The chi-square test helps in assessing the goodness of fit between a set of observed values
and those expected theoretically.
Chi-Square is calculated as
χ²c = Σ (Oi − Ei)² / Ei
Oi = Observed value
Ei = Expected value
χ²c = Chi-squared
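A minimal sketch of the chi-square statistic, summing (observed − expected)² / expected over all categories. The function name and the die-rolling data are assumptions for illustration.

```python
def chi_square_statistic(observed, expected):
    """Chi-square statistic: sum((O - E) ** 2 / E) over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected values must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical experiment: a die rolled 60 times, so each face is expected 10 times.
observed_counts = [8, 12, 9, 11, 10, 10]
expected_counts = [10] * 6
chi_square = chi_square_statistic(observed_counts, expected_counts)
print(round(chi_square, 4))  # (4 + 4 + 1 + 1 + 0 + 0) / 10
```

A statistic this small (about 1.0 with 6 − 1 = 5 degrees of freedom) indicates that the observed counts are close to what a fair die would produce; the value is compared with a chi-square table to decide formally.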
Mann-Whitney U-Test
The Mann-Whitney U-test is a non-parametric test.
This is used to check whether two independent samples were selected from populations
having the same distribution.
It is calculated by comparing every observation in the first sample with every
observation in the second sample.
U1 = R1 – n1(n1+1)/2
where n1 is the sample size for sample 1, and R1 is the sum of ranks in Sample 1.
U2 = R2 – n2(n2+1)/2
When consulting the significance tables, the smaller of the two values U1 and U2 is used.
The sum of the two values is given by,
U1 + U2 = { R1 – n1(n1+1)/2 } + { R2 – n2(n2+1)/2 }
Knowing that R1+R2 = N(N+1)/2 and N=n1+n2, and doing some algebra, we find that the
sum is:
U1 + U2 = n1*n2 .
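The rank-sum formulas above can be sketched as follows. Both samples are pooled and ranked together, with tied values given the average of their ranks (a common convention, assumed here). The function names and the two samples are illustrative assumptions.

```python
def average_ranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) computed from the rank sums R1 and R2."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])  # sum of ranks in sample 1
    r2 = sum(ranks[n1:])  # sum of ranks in sample 2
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return u1, u2


# Hypothetical samples without ties.
u1, u2 = mann_whitney_u([3, 5, 8], [1, 2, 4, 9])
print(u1, u2, min(u1, u2))
```

In this example R1 = 3 + 5 + 6 = 14 and R2 = 1 + 2 + 4 + 7 = 14, so U1 = 8 and U2 = 4; their sum is n1 × n2 = 12, and the smaller value, 4, is the one looked up in the significance table.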
A normal distribution can be called a bell-curve distribution. It gets its name from the
bell curve shape that we get when we visualize the distribution.
Q24. What is skewness?
Skewness measures the lack of symmetry in a data distribution. It indicates that there are
significant differences between the mean, the mode, and the median of data. Skewed
data cannot be used to create a normal distribution.
Work it out
>140 5 4
>145 6 11
>150 18 29
>155 11 40
>160 6 46
>165 7 52
*Find the median height, mean height and mode height of the same
No. of families 7 8 2 2 1
* Find the mean, median and mode of this data.
* Which among the following cannot be represented graphically?
(a) Median
(b) Mean
(c) Mode
(d) None of the above option
* A bell- or mound-shaped distribution will have approximately 68% of the data within
what number of standard deviations of the mean?
a. one standard deviation
b. two standard deviations
c. three standard deviations
d. four standard deviations
e. none of the above