Statistics
Introduction to Statistics: -
Stats Definition: - Stats is the science of collecting, organizing and
analyzing data.
3. IQ of students in classroom
Types of Statistics: -
1. Descriptive Statistics
2. Inferential Statistics
@Gangadhar Tiwari
III. Different types of distribution of data: -
i. Bernoulli Distribution
ii. Uniform Distribution
iii. Binomial Distribution
iv. Normal or Gaussian Distribution
v. Exponential Distribution
vi. Poisson Distribution
E.g.: - Let's say there are 10 cricket camps in Bangalore and you have collected the
heights of cricketers from one of the camps.
(Sample data)
a. Descriptive Question: -
IV. What is the average height of the players in the entire camp?
V. What is the distribution of the data?
VI. How many standard deviations (STD) is 140 cm away from the mean?
b. Inferential Question: -
• Is the average height of the players in camp 1 similar to that of
camp 2?
Types Of Data: -
No ranks: e.g., gender, blood group, colors, location, cities, days
Ranks: e.g., customer feedback {1, 2, 3, 4, 5}
Whole numbers: e.g., no. of children in a family, no. of bikes, no. of people working
Any value: e.g., house price in Bengaluru, length of a river
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale
I. Nominal Scale data: - A nominal scale is the 1st level of measurement scale, in
which the numbers serve as “tags” or “labels” to classify or identify objects. A
nominal scale usually deals with non-numeric variables or with numbers that do not
carry any quantitative value.
II. Ordinal Scale Data: - The ordinal scale is the 2nd level of measurement, which reports
the ordering and ranking of data without establishing the degree of variation between
them. Ordinal represents the “order.”
Ordinal data is qualitative (categorical) data. It can be grouped, named
and also ranked.
• Rank is important
• Order matters
• Difference cannot be measured
• Example:
▪ Totally disagree
III. Interval Scale Data: - The interval scale is the 3rd level of measurement scale. It is
defined as a quantitative measurement scale in which the difference between two
values is meaningful. In other words, the variables are measured in an exact manner
rather than a relative one, and the zero point is arbitrary.
• The order matters
• Difference can be measured
• The ratio cannot be measured
• No ‘0’ starting point
• Example:
• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table
• IQ
IV. Ratio Scale Data: - The ratio scale is the 4th level of measurement
scale, and it is quantitative. It allows researchers to compare differences
or intervals, and its unique feature is that it possesses a true origin,
or zero point, so ratios between values are meaningful.
Descriptive Statistics
1. Measure of Central Tendency: -
o Mean
o Median
o Mode
Mean: - The mean represents the average value of the dataset. It can be calculated as the sum
of all the values in the dataset divided by the number of values.
Median: - The median is the middle value of the dataset when the dataset is arranged in
ascending or descending order. When the dataset contains an even number
of values, the median is found by taking the mean of the middle two values.
Consider the given dataset with an odd number of observations arranged in
descending order: 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2.
Here 12 is the middle or median number that has 6 values above it and 6 values below it.
Now, consider another example with an even number of observations that are arranged in
descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and 17
When you look at the given dataset, the two middle values obtained are 27 and 29. Now,
find out the mean value for these two numbers. i.e., (27+29)/2 =28
Therefore, the median for the given data distribution is 28.
Mode: - The mode represents the most frequently occurring value in the dataset. Sometimes
the dataset may contain multiple modes and, in some cases, it does not contain any
mode at all.
Since the mode represents the most common value, the most frequently
repeated value in the given dataset is 5.
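The three measures above can be checked with a short Python sketch. The two median datasets are the ones used in the text; `mode_data` is a hypothetical dataset invented only for illustration, since the dataset for the mode example is not shown.

```python
import statistics

# Datasets from the median examples above.
odd_data = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
even_data = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

# Hypothetical dataset (an assumption, not from the notes) whose mode is 5.
mode_data = [1, 5, 5, 3, 5, 2]

mean_odd = statistics.mean(odd_data)        # sum of values / number of values
median_odd = statistics.median(odd_data)    # middle value of 13 observations
median_even = statistics.median(even_data)  # mean of the two middle values (27 and 29)
modes = statistics.multimode(mode_data)     # list of all most-frequent values

print(mean_odd, median_odd, median_even, modes)
```

`statistics.multimode` returns a list, so it also covers the case of several modes mentioned above.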
• The sample variance is divided by n-1 so that it is an
unbiased estimator of the population variance.
II. Standard Deviation: - The square root of the variance is known as the standard
deviation, i.e. S.D. = σ = √(σ²).
• A standard deviation is used to determine how the observations in a group
(i.e., data set) are spread out from the mean (average or expected
value).
• It also tells us how many standard deviations a value Xi is away from the mean.
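A minimal sketch of these ideas in Python follows. The `heights` list is a hypothetical sample made up for illustration; it is not data from the notes.

```python
import math
import statistics

# Hypothetical sample of player heights in cm (assumption for illustration).
heights = [150, 160, 165, 170, 175]

mean_height = statistics.mean(heights)
pop_var = statistics.pvariance(heights)     # population variance: divides by n
sample_var = statistics.variance(heights)   # sample variance: divides by n - 1
sample_sd = statistics.stdev(heights)       # square root of the sample variance

# How many standard deviations is 140 cm away from the mean?
distance_in_sd = (140 - mean_height) / sample_sd

print(pop_var, sample_var, sample_sd, distance_in_sd)
```

The sample variance is always a little larger than the population variance computed on the same numbers, because it divides by n-1 instead of n.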
Sets: -
A= {1,2,3,4,5,6,7,8}
B= {3,4,5,6,7}
I. Intersection: -
A ∩ B = {3,4,5,6,7}
II. Union: -
A ∪ B = {1,2,3,4,5,6,7,8}
III. Difference: -
A-B= {1,2,8}
IV. Subset: -
A ⊆ B = False
B ⊆ A = True
V. Superset: -
A ⊇ B = True
B ⊇ A = False
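The same set operations can be reproduced with Python's built-in `set` type, using the sets A and B defined above. This is only a quick check of the results listed in this section.

```python
A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {3, 4, 5, 6, 7}

intersection = A & B        # elements in both A and B
union = A | B               # elements in A or B
difference = A - B          # elements in A but not in B
a_subset_of_b = A <= B      # is every element of A also in B?
b_subset_of_a = B <= A      # is every element of B also in A?
a_superset_of_b = A >= B    # does A contain every element of B?
b_superset_of_a = B >= A    # does B contain every element of A?

print(intersection, union, difference)
print(a_subset_of_b, b_subset_of_a, a_superset_of_b, b_superset_of_a)
```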
No. of Bins=50/5=10
Bin size=5
A. No Skew (Symmetric): -
B. Right Skewed: -
C. Left Skewed: -
Sampling Techniques: -
B. Stratified sampling:-
Stratified sampling involves dividing the population into subpopulations that may
differ in important ways. It allows you to draw more precise conclusions by ensuring
that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata)
based on the relevant characteristic (e.g., gender identity, age range, income bracket,
job role).
C. Systematic sampling:-
Systematic sampling is similar to simple random sampling, but it is usually slightly
easier to conduct. Every member of the population is listed with a number, but instead
of randomly generating numbers, individuals are chosen at regular intervals.
D. Convenience sampling:-
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to tell if
the sample is representative of the population, so it can’t
produce generalizable results. Convenience samples are at risk for both sampling bias
and selection bias.
E. Purposive sampling:-
This type of sampling, also known as judgement sampling, involves the researcher
using their expertise to select a sample that is most useful to the purposes of the
research.
It is often used in qualitative research, where the researcher wants to gain detailed
knowledge about a specific phenomenon rather than make statistical inferences, or
where the population is very small and specific. An effective purposive sample must
have clear criteria and rationale for inclusion. Always make sure to describe your
inclusion and exclusion criteria and beware of observer bias affecting your
arguments.
Example: Purposive sampling: - You want to know more about the opinions and
experiences of disabled students at your university, so you purposefully select a
number of students with different support needs in order to gather a varied range of
data on their experiences with student services.
F. Cluster sampling:-
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of sampling
individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled
cluster. If the clusters themselves are large, you can also sample individuals from
within each cluster using one of the techniques above. This is called multistage
sampling.
This method is good for dealing with large and dispersed populations, but there is
more risk of error in the sample, as there could be substantial differences between
clusters. It’s difficult to guarantee that the sampled clusters are really representative of
the whole population.
Example: Cluster sampling: - The company has offices in 10 cities across the country
(all with roughly the same number of employees in similar roles). You don’t have the
capacity to travel to every office to collect your data, so you use random sampling to
select 3 offices – these are your clusters.
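The probability-based techniques above (systematic, stratified and cluster sampling) can be sketched in a few lines of Python. All names, group sizes and the 10% sampling fraction below are assumptions chosen for illustration; only the "10 offices, pick 3" setup comes from the cluster example.

```python
import random

def systematic_sample(population, step, start=0):
    """Pick individuals at regular intervals from a numbered list."""
    return population[start::step]

def stratified_sample(strata, fraction, rng):
    """Randomly sample the same fraction from every stratum (subgroup)."""
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every individual in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))  # hypothetical population of 100 people
strata = {"junior": population[:60], "senior": population[60:]}  # hypothetical strata
offices = {f"office_{i}": list(range(i * 10, i * 10 + 10)) for i in range(10)}

systematic = systematic_sample(population, step=10)
stratified = stratified_sample(strata, fraction=0.1, rng=rng)
clustered = cluster_sample(offices, n_clusters=3, rng=rng)

print(len(systematic), len(stratified), len(clustered))
```

In the stratified sample each subgroup contributes in proportion to its size, while in the cluster sample whole offices are either fully included or fully left out.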
Covariance and Correlation: -
• Covariance is a statistical term that refers to a systematic relationship
between two random variables in which a change in one variable is
reflected by a change in the other.
• The covariance value can range from -∞ to +∞, with a negative value
indicating a negative relationship and a positive value indicating a
positive relationship.
• The greater the magnitude of this number, the stronger the dependence
appears, although the value also depends on the units of the variables.
Positive covariance denotes a direct relationship and is represented by a
positive number.
• A negative number, on the other hand, denotes negative covariance,
which indicates an inverse relationship between the two variables.
Covariance is great for defining the type of relationship, but it's
terrible for interpreting the magnitude.
A. Pearson correlation coefficient: - The Pearson correlation coefficient (r) is the
most common way of measuring a linear correlation. It is a number between –1 and 1
that measures the strength and direction of the relationship between two variables.
Pearson correlation coefficient (r): Between 0 and 1
Correlation type: Positive correlation
Interpretation: When one variable changes, the other variable changes in the same direction.
Example: Baby length & weight: The longer the baby, the heavier their weight.
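To make the difference between covariance and correlation concrete, here is a small sketch using the standard sample formulas (dividing by n-1). The baby length and weight values are hypothetical numbers invented for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance: average product of deviations, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical measurements (assumption for illustration).
lengths_cm = [48, 50, 51, 53, 55]
weights_kg = [2.9, 3.2, 3.4, 3.8, 4.1]

cov_cm = covariance(lengths_cm, weights_kg)
r_cm = pearson_r(lengths_cm, weights_kg)

# Same data with length expressed in metres instead of centimetres.
lengths_m = [x / 100 for x in lengths_cm]
cov_m = covariance(lengths_m, weights_kg)
r_m = pearson_r(lengths_m, weights_kg)

print(cov_cm, cov_m)  # covariance changes with the unit of measurement
print(r_cm, r_m)      # correlation stays the same
```

Changing the unit rescales the covariance but leaves r unchanged, which is why covariance is hard to interpret as a magnitude while r is not.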
Probability Distribution Function: - A distribution function is a
mathematical expression that describes the probability of different possible
outcomes for an experiment.
Let us say we are running an experiment of tossing a fair coin. The possible events
are Heads, Tails. And for instance, if we use X to denote the events, the probability
distribution of X would take the value 0.5 for X=heads, and 0.5 for X=tails
Discrete data is counted and can take only distinct, countable values. It
makes no sense when written in decimal format (for example, the number of
children in a family). The random variable that holds discrete data is called
a discrete random variable.
1. Discrete Distributions
2. Continuous Distributions
• PDF for continuous case
4. Probability for values less than x, P(X<x), or probability for values
within a range from a to b, P(a<X<b), can be directly obtained from:
• CDF for both discrete / continuous case
5. The term “distribution function” refers to the CDF, or Cumulative
Distribution Function
C. Cumulative Distribution Function (CDF):- It is another method to describe
the distribution of a random variable (either continuous or discrete).
Types of Probability Distribution: -
1. Normal or Gaussian Distribution
2. Bernoulli Distribution
3. Uniform Distribution
4. Poisson Distribution
5. Binomial Distribution
6. Log-Normal Distribution
1. Bernoulli Distribution: -
E.g.: -
Pr(T) = 0.5 = 1 - p = q
PMF = p^k * (1 - p)^(1 - k), where k ∈ {0, 1}
E.g.: -
Tossing a Coin 10 times
PMF: P(x) = nCx * p^x * q^(n - x)
nCx = n! / (x! (n - x)!)
Where,
n = the number of experiments
x = 0, 1, 2, 3, 4, …
p = Probability of Success in a single experiment
q = Probability of Failure in a single experiment = 1 – p
Mean, μ = np
Variance, σ² = npq
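The Bernoulli and binomial formulas above can be checked numerically. The sketch below uses the "tossing a coin 10 times" example with a fair coin (p = 0.5, an assumption consistent with Pr(T) = 0.5 above) and confirms that the mean and variance computed from the PMF match np and npq.

```python
from math import comb, isclose

def bernoulli_pmf(k, p):
    """PMF of a single trial: p^k * (1 - p)^(1 - k) for k in {0, 1}."""
    return p**k * (1 - p) ** (1 - k)

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5  # tossing a fair coin 10 times
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the PMF.
mean = sum(x * pr for x, pr in enumerate(probabilities))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probabilities))

print(binomial_pmf(5, n, p))  # probability of exactly 5 heads
print(mean, n * p)            # both should agree with np
print(variance, n * p * (1 - p))  # both should agree with npq
```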
P(x; λ) = (e^(–λ) * λ^x) / x!
Where,
e is the base of the natural logarithm
x is the number of occurrences (x = 0, 1, 2, …)
λ is the expected number of events occurring in each time interval
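A minimal sketch of the Poisson formula above in Python. The rate λ = 3 events per interval is a hypothetical value chosen for illustration.

```python
from math import exp, factorial, isclose

def poisson_pmf(x, lam):
    """Probability of exactly x events when lam events are expected per interval."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3  # hypothetical rate (assumption), e.g. 3 events per hour

p_zero = poisson_pmf(0, lam)                              # no events in the interval
p_at_most_two = sum(poisson_pmf(x, lam) for x in range(3))  # 0, 1 or 2 events
total = sum(poisson_pmf(x, lam) for x in range(50))       # probabilities sum to ~1

print(p_zero, p_at_most_two, total)
```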
4. Normal or Gaussian Distribution: -
• It is concerned with continuous random variables {PDF}
• Normal distributions are symmetrical, but not all symmetrical
distributions are normal
Characteristics of Normal Distribution
• mean = median = mode
• Symmetrical about the center
• Unimodal
• 50% of values less than the mean and 50% greater than the mean
f(x) = (1 / (σ √(2π))) * e^(–(x – μ)² / (2σ²))
Here, x is the value of the variable; f(x) represents the probability
density function; μ (mu) is the mean; and σ (sigma) is the
standard deviation.
1. Blood pressure
4. Bias: the tendency of a statistic to overestimate or
underestimate a parameter.
• Empirical Rule of Normal Distribution: - The empirical rule
in statistics, also known as the 68-95-99.7 rule, states that for normal
distributions, about 68% of observed data points will lie within one standard
deviation of the mean, 95% will fall within two standard deviations, and
99.7% will occur within three standard deviations.
• 68.3% of values are within 1 standard deviation (1σ) of the mean
It is always good to know the standard deviation because we can say that
any value is:
• likely to be within 1 standard deviation (1σ)(68.3 out of 100 should be)
• very likely to be within 2 standard deviations (2σ) (95.5 out of 100
should be)
• almost certainly within 3 standard deviations (3σ) (997 out of 1000
should be)
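The percentages in the empirical rule can be verified with the standard library's `statistics.NormalDist`. This is only a numerical check on a standard normal distribution (mean 0, standard deviation 1).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

Because the rule depends only on distance from the mean in standard deviation units, the same shares hold for any normal distribution, not just the standard one.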
II. Discrete Uniform Distribution (PMF): -
• Discrete random variables {PMF}
• Symmetrical
• Bell-shaped
• Mean and median are equal; both located at the center of the
distribution
The mean of the normal distribution determines its location and the standard
deviation determines its spread.
• About 68% of data falls within one standard deviation of the mean
• About 95% of data falls within two standard deviations of the mean
• About 99.7% of data falls within three standard deviations of the mean
• What is a “Z-score”?
The number of standard deviations from the mean is also called the
“Standard Score”, “sigma” or “Z-score”. Simply, a Z-score describes
the position of a raw score in terms of its distance from the mean, when
measured in standard deviation units.
z = (x – μ) / σ
• z is the “z-score” (Standard Score)
• x is the value to be standardized
• μ (mu) is the mean
• σ (sigma) is the standard deviation
We can take any Normal Distribution and convert it to The Standard Normal
Distribution.
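As a small illustration of the z-score formula, the sketch below standardizes one value and looks up the probability of falling below it. The mean of 100 and standard deviation of 15 are hypothetical values (an IQ-style scale) assumed for the example.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

mu, sigma = 100, 15  # hypothetical mean and standard deviation (assumption)
x = 130

z = z_score(x, mu, sigma)

# P(X < x) read from the standard normal distribution via the z-score ...
p_below_standard = NormalDist().cdf(z)
# ... matches the probability read from the original distribution.
p_below_original = NormalDist(mu, sigma).cdf(x)

print(z, p_below_standard, p_below_original)
```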
S.NO. | Normalization | Standardization
1. | Minimum and maximum values of features are used for scaling. | Mean and standard deviation are used for scaling.
2. | It is used when features are of different scales. | It is used when we want to ensure zero mean and unit standard deviation.
3. | Scales values between [0, 1] or [-1, 1]. | It is not bounded to a certain range.
5. | Scikit-Learn provides a transformer called MinMaxScaler for Normalization. | Scikit-Learn provides a transformer called StandardScaler for standardization.
6. | This transformation squishes the n-dimensional data into an n-dimensional unit hypercube. | It translates the data so that the mean vector of the original data moves to the origin, and squishes or expands it.
7. | It is useful when we don’t know about the distribution. | It is useful when the feature distribution is Normal or Gaussian.
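The two scaling methods compared above can be sketched in plain Python, without Scikit-Learn. The `prices` list is hypothetical, and `standardize` uses the population standard deviation (dividing by n), which is an assumption made here to mirror the usual behaviour of standard scalers.

```python
import statistics

def min_max_normalize(values):
    """Rescale values to the range [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sd for v in values]

prices = [30, 45, 60, 90, 120]  # hypothetical feature values

normalized = min_max_normalize(prices)
standardized = standardize(prices)

print(normalized)    # bounded between 0 and 1
print(standardized)  # centred on 0, not bounded to a fixed range
```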
Central limit Theorem: - For large sample sizes, the sampling distribution of
the mean approximates a normal distribution even if the population distribution is
not normal.
1. The sample size is sufficiently large. This condition is usually met if the size of
the sample is n ≥ 30.
2. The samples are independent and identically distributed (i.i.d.) random
variables. The sampling should be random.
3. The population’s distribution has a finite variance. The central limit theorem
doesn’t apply to distributions with infinite variance.
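A small simulation can illustrate the theorem under the conditions listed above. The exponential population, the sample size of 30 and the 2000 repetitions are all assumptions chosen for the sketch; the population is deliberately skewed (not normal) with mean 1 and standard deviation 1.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

def sample_mean(n):
    """Mean of n independent draws from an exponential population (mean 1, sd 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

n = 30            # sample size (the usual n >= 30 rule of thumb)
repetitions = 2000
sample_means = [sample_mean(n) for _ in range(repetitions)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / math.sqrt(n)  # population sd divided by sqrt(n)

print(mean_of_means, sd_of_means, expected_sd)
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by √n, even though individual draws are far from normally distributed.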
1. What is Central Limit Theorem in Statistics?
The Central Limit Theorem in statistics states that whenever we take
sufficiently large samples from a population, the distribution of the sample mean
approximates a normal distribution.
Inferential Statistics
Statistical inference provides methods for drawing conclusions about a
population from sample data.
2. Hypothesis And Hypothesis Testing Mechanism: -
- Null Hypothesis (H0):- The Null Hypothesis (H0) aims to nullify the
alternative hypothesis by implying that there exists no relation between
two variables in statistics. It states that the effect of one variable on the
other is solely due to chance and no empirical cause lies behind it.
3. P-Value: - The p-value is a number, calculated from a statistical test, that
describes how likely you are to have found a particular set of observations if
the null hypothesis were true. P-values are used in hypothesis testing to help
decide whether to reject the null hypothesis.
4. Confidence Interval and Margin of Error: - A confidence interval is a
range of values within which we can be confident, at a stated confidence level,
that the true population parameter lies. This range is estimated based on a
sample from the population.
The margin of error is equal to half the width of the entire confidence
interval.
Hypothesis Testing and Statistical Analysis: -
1. Z-Test --------- Average
2. T-Test --------- Average
3. Chi Square --------- Categorical
4. Anova --------- Variance
1. Z-Test:-
• Population standard deviation is known
• Large sample size (n > 30)
• Z-Test = (x̅ – μ) / (σ / √n)
σ/√n ---- Standard Error
σ ---- Population standard deviation
μ ---- Population Mean
x̅ ---- Sample Mean
n ---- Sample size
• Degrees of Freedom: Not applicable
• We use the Z-Test when the population standard deviation is known
and the sample size is large
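A minimal one-sample Z-test sketch following the formula above. The numbers (sample mean 168, population mean 165, population standard deviation 10, n = 36) are hypothetical, and the two-sided p-value and the 0.05 significance level are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Return the z statistic and the two-sided p-value."""
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical inputs (assumptions for illustration).
z, p_value = z_test(sample_mean=168, pop_mean=165, pop_sd=10, n=36)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(z, p_value, reject_null)
```

Here the p-value is above the assumed significance level, so the null hypothesis is not rejected for these numbers.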
2. T-Test: - A t-test is an inferential statistic used to determine if there is a
significant difference between the means of two groups and how they are
related. T-tests are used when the data sets follow a normal distribution and
have unknown variances, like the data set recorded from flipping a coin 100
times.
• Population standard deviation is unknown
• Our sample size is small, n < 30
• T-Test = (x̅ – μ) / (s / √n)
s/√n ---- Standard Error
s ---- Sample standard deviation
μ ---- Population Mean
x̅ ---- Sample Mean
n ---- Sample size
• Degrees of Freedom is n-1
• We use the T-Test when the population standard deviation is unknown
or the sample size is small
• T-tests can be dependent or independent.
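The one-sample t statistic above can be computed with the standard library. The sample of 10 heights and the hypothesized population mean of 168 are hypothetical. The critical value 2.262 is the commonly tabulated two-tailed value for 9 degrees of freedom at a 0.05 significance level; it is hard-coded here because the standard library has no t-distribution.

```python
import statistics
from math import sqrt

def one_sample_t(sample, pop_mean):
    """Return the t statistic and its degrees of freedom (n - 1)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (statistics.fmean(sample) - pop_mean) / (s / sqrt(n))
    return t, n - 1

# Hypothetical sample of heights in cm (assumption for illustration).
sample = [172, 168, 175, 170, 165, 174, 169, 171, 173, 167]

t_stat, dof = one_sample_t(sample, pop_mean=168)

t_critical = 2.262  # tabulated two-tailed value for dof = 9, alpha = 0.05
reject_null = abs(t_stat) > t_critical

print(t_stat, dof, reject_null)
```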
Confidence interval = Point Estimate ± margin of error
Confidence interval = sample mean ± margin of error
C.I = x̅ ± t(α/2) * s/√n
s/√n ---- Standard error
s ---- Sample standard deviation
α ---- Significance level
n ---- Sample size
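A short sketch of the confidence interval formula above. The sample is hypothetical, and the critical value 2.262 (two-tailed, 9 degrees of freedom, 95% confidence) is taken from a standard t-table and passed in as a parameter rather than computed.

```python
import statistics
from math import sqrt

def t_confidence_interval(sample, t_critical):
    """Return (lower, upper, margin_of_error) for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / sqrt(n)
    margin = t_critical * standard_error
    return mean - margin, mean + margin, margin

# Hypothetical sample of heights in cm (assumption for illustration).
sample = [172, 168, 175, 170, 165, 174, 169, 171, 173, 167]

lower, upper, margin = t_confidence_interval(sample, t_critical=2.262)

print(f"95% CI: {lower:.2f} to {upper:.2f} (margin of error {margin:.2f})")
```

The margin of error is half the width of the interval, which matches the definition given earlier in these notes.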
• Z-Test & T-Tests are Parametric Tests, where the Null Hypothesis is less than,
greater than or equal to some value.
• A z-test is used if the population variance is known, or if the sample size is
larger than 30, for an unknown population variance.
• If the sample size is less than 30 and the population variance is unknown, we
must use a t-test.
A. A one-tailed z-test allows for the possibility of rejection of the Null Hypothesis in
only one direction, whereas a two-tailed z-test tests the possibility of rejection in
both directions (left and right).
3. Chi Square: -
• The Chi Square test evaluates claims about population proportions
• It is a non-parametric test that is performed on categorical (nominal or
ordinal) data
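As an illustration, the sketch below computes the usual chi-square goodness-of-fit statistic, the sum of (observed - expected)² / expected over all categories. This formula is the standard one and is not spelled out in these notes; the die-roll counts are hypothetical, and the critical value 11.070 is the commonly tabulated value for 5 degrees of freedom at a 0.05 significance level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 120 rolls of a die (assumption for illustration).
observed = [18, 22, 20, 25, 15, 20]
expected = [20] * 6  # a fair die would give 20 per face on average

chi2 = chi_square_statistic(observed, expected)
dof = len(observed) - 1

chi2_critical = 11.070  # tabulated value for dof = 5, alpha = 0.05
reject_null = chi2 > chi2_critical

print(chi2, dof, reject_null)
```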
4. Anova(F-Test): -
• ANOVA, which stands for Analysis of Variance, is a statistical test
used to analyze the difference between the means of more than two
groups.
• ANOVA compares the variation between group means to the variation
within the groups. If the variation between group means is
significantly larger than the variation within groups, it suggests a
significant difference between the means of the groups.
• ANOVA calculates an F-statistic by comparing between-group
variability to within-group variability. If the F-statistic exceeds a
critical value, it indicates significant differences between group
means.
• ANOVA is used to compare treatments, analyze the impact of factors on a
variable, or compare means across multiple groups.
• Types of ANOVA include one-way (for comparing means of groups)
and two-way (for examining effects of two independent variables on
a dependent variable).
Types of ANOVA
1. One-Way ANOVA:- One factor with at least 2 levels; the levels are
independent
2. Repeated Measures ANOVA:- One factor with at least 2 levels; the levels are
dependent
Hypothesis Testing of ANOVA:-
• Null Hypothesis H0 : μ1 = μ2 = μ3 = … = μk
• Alternate hypothesis H1 : At least one of the means is not equal
• F Test Statistics
One-Way ANOVA:- One factor with at least 2 levels; the levels are
independent
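A minimal sketch of the one-way ANOVA F-statistic, computed as between-group variability divided by within-group variability as described above. The three groups are hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical groups (assumption for illustration): group means are 5, 7 and 9.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]

f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

A large F value means the group means differ by much more than the spread inside each group would explain. It would then be compared with a critical value from an F-table for the two degrees of freedom.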