Stats Intro

The document provides a comprehensive overview of statistical concepts including sampling methods, types of variables, measures of central tendency and dispersion, and hypothesis testing. It explains the Central Limit Theorem, various statistical tests, and the importance of confidence intervals and tests for normality. Additionally, it covers the Chi-Square test for independence and goodness-of-fit, as well as the Levene's test for homogeneity of variances.


https://github.com/krishnaik06/The-Grand-Complete-Data-Science-Materials/tree/main

Krish Naik - https://www.youtube.com/watch?v=6Z8SdN52GuU&list=PLZoTAELRMXVMhVyr3Ri9IQ-t5QPBtxzJO&index=44

https://docs.google.com/document/d/1JXswnn4qSZhHQO7RaWLYWTj7kRyL9Y7PjrCNbGLIEvA/edit?tab=t.0

https://dou.ua/forums/topic/44769/

AB Testing Project:

1. https://www.youtube.com/watch?v=FTpmwX94_Yo&t=10247s&pp=ygUvY29tcGxldGUgYWIgdGVzdGluZyB0dXRvcmlhbGFiIHRlc3RpbmcgdHV0b3JpYWw%3D
2. https://www.youtube.com/watch?v=qhYsZWrTiuM&list=PLHS1p0ot3SVjQg0q1eEPrmOmPUY_AT1vB

STATISTICS

Types of Sampling

 Simple Random Sampling – every member of the population has an equal chance of being selected
 Stratified Sampling – sampling from non-overlapping groups (strata)
 Systematic Sampling – surveying every nth person, e.g., people in front of malls, offices, etc.
Type of Variable

Central Tendency

Dispersion
Variance shows the spread of the data.

Standard deviation (SD) shows how far, on average, the data points lie from the mean.


Percentile

25th percentile = Q1 = (25/100) × (n+1); this value gives the position of Q1

75th percentile = Q3 = (75/100) × (n+1); this value gives the position of Q3

Interquartile Range (IQR) = Q3 - Q1 → used to find outliers via a boxplot


Removing Outlier

STEPS:

 Declare a lower fence (below which all numbers are outliers) and a higher fence (above which all numbers are outliers)
 Find Q1, Q3 and IQR, then apply the formulas LF = Q1 - 1.5 × IQR and HF = Q3 + 1.5 × IQR

 So Lower Fence (LF) = -3
 Higher Fence (HF) = 13
 27 is an outlier, so the boxplot's maximum (upper whisker) is 9, not 27, because 27 is an outlier
 Boxplot (see the sketch below)
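A minimal sketch of the fence method in NumPy, on a hypothetical dataset. Note that NumPy's default percentile interpolation can differ slightly from the (n+1) positional formula above.

```python
import numpy as np

# Hypothetical data; 27 is the suspected outlier, as in the notes' example
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 27])

q1, q3 = np.percentile(data, [25, 75])  # NumPy interpolates, which may differ
iqr = q3 - q1                           # slightly from the (n+1) position rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
cleaned = data[(data >= lower_fence) & (data <= upper_fence)]
print(lower_fence, upper_fence, outliers)  # 27 falls above the upper fence
```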

SECOND METHOD TO REMOVE OUTLIERS: Z-SCORE


Applications of Z-score:
1. Standardisation
2. Finding outlier
3. Comparing scores between different distributions.
Eg: India's cricket scores in the 2015 and 2019 World Cups, where we know the
average, SD and final score for both tournaments. Which final score is better
relative to its own tournament? (See the sketch below.)
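A small sketch of using z-scores to compare scores across distributions. The tournament numbers here are hypothetical placeholders, not real World Cup data.

```python
def z_score(x, mean, sd):
    """Standardise a value: how many SDs it lies from its distribution's mean."""
    return (x - mean) / sd

# Hypothetical tournament stats (illustrative numbers only)
z_2015 = z_score(330, mean=290, sd=25)   # final score vs the 2015 distribution
z_2019 = z_score(340, mean=310, sd=30)   # final score vs the 2019 distribution

# The larger z-score is the stronger score relative to its own tournament
print(z_2015, z_2019)
```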

Standardisation vs Normalisation
NORMAL DISTRIBUTION
CENTRAL LIMIT THEOREM
 Suppose our population/original data follows a normal, log-normal, or any other distribution
 Now we take, say, 100 different samples, each with n >= 30, and compute the mean of each sample set
 E.g., for the 1st sample set the mean is x1
 For the 2nd sample set, the mean is x2
 Similarly, for the 100th set the mean is x100

In order to apply the central limit theorem, there are four conditions that must be met:

1. Randomization: The data must be sampled randomly such that every member in a
population has an equal probability of being selected to be in the sample.

2. Independence: The sample values must be independent of each other.

3. The 10% Condition: When the sample is drawn without replacement, the sample size should
be no larger than 10% of the population.

4. Large Sample Condition: The sample size needs to be sufficiently large (commonly n >= 30).

No matter what the shape of the original data is, if you take enough random samples and average
them, the distribution of those averages will start to look like a normal distribution.

 Now, if all the mean values above (x1, x2, …, x100) are plotted, a bell curve is going to form,
which follows a normal distribution

The central limit theorem states that if you take sufficiently large samples (n>=30) from a
population, the samples’ means will be normally distributed, even if the population isn’t normally
distributed.

Example: A population follows a Poisson distribution (left image). If we take 10,000 samples from
the population, each with a sample size of 50, the sample means follow a normal distribution, as
predicted by the central limit theorem (right image).
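A quick simulation sketch of the theorem, using a Poisson population as in the example above; the sample size and counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.poisson(lam=3, size=100_000)  # skewed, non-normal population

# Draw 10,000 samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(10_000)]

print(np.mean(sample_means))  # ~ population mean (3)
print(np.std(sample_means))   # ~ population SD / sqrt(50)
# A histogram of sample_means (e.g., matplotlib's plt.hist) looks roughly normal.
```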
Skewness

Different Probability functions


What is a Random Variable?

In probability, a real-valued function defined over the sample space of a random experiment is
called a random variable. That is, the values of the random variable correspond to the outcomes of
the random experiment.

A probability function is a rule or formula that assigns a probability (a value between 0 and 1) to
each possible outcome of a random event or variable.
🔹 Two Main Types:

1. Probability Mass Function (PMF)


→ For discrete random variables (things you can count: dice rolls, coin flips)

2. Probability Density Function (PDF)


→ For continuous random variables (things you measure: height, weight, time)

🔁 Summary:
Thing Description
Random Variable (X) Describes the possible outcomes numerically
Probability Function Assigns a probability to each possible value of X

Relation between PDF & CDF

Summary of Relationships in Action:

 Continuous Example (Normal Distribution):

o The PDF describes the probability density at any given point.

o The CDF accumulates this density to give the total probability up to a certain point.

o The CDF is the integral of the PDF.

 Discrete Example (Dice Roll):

o The PMF gives the probability for each outcome.


o The CDF gives the cumulative probability for all outcomes less than or equal to a
specific value.

o The CDF is the cumulative sum of the PMF.
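A short sketch of these relationships with scipy.stats: the CDF as the integral of the PDF in the continuous case, and as the cumulative sum of the PMF in the discrete case.

```python
import numpy as np
from scipy import stats

# Continuous case: standard normal
x = 1.0
pdf_at_x = stats.norm.pdf(x)   # density at x (not a probability by itself)
cdf_at_x = stats.norm.cdf(x)   # P(X <= x): the integral of the PDF up to x
print(pdf_at_x, cdf_at_x)      # ~0.2420, ~0.8413

# Discrete case: a fair six-sided die
pmf = np.full(6, 1 / 6)        # P(X = k) for k = 1..6
cdf = np.cumsum(pmf)           # P(X <= k) is the running sum of the PMF
print(cdf[2])                  # P(X <= 3) = 0.5
```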


Hypothesis Testing – take a sample and infer something about the population

Steps
Statistical Tests Table
Test Type | Conditions | Test Name | H0 | H1
Mean | 1 sample, σ known | Z-test | μ = μ₀ | μ ≠ μ₀ (two-tailed) or μ > μ₀ / μ < μ₀ (one-tailed)
Mean | 1 sample, σ unknown | T-test | μ = μ₀ | μ ≠ μ₀ (two-tailed) or μ > μ₀ / μ < μ₀ (one-tailed)
Mean | 2 samples, independent, σ unknown (like a drug test on 2 different sets of people) | Independent T-test (pooled variance) | μ₁ = μ₂ | μ₁ ≠ μ₂ (two-tailed) or μ₁ > μ₂ / μ₁ < μ₂ (one-tailed)
Mean | 2 samples, paired (e.g., effect of diet on weight for the same people before and after) | Paired T-test (d = paired difference) | μd = 0 | μd ≠ 0 (two-tailed) or μd > 0 / μd < 0 (one-tailed)
Mean | >2 independent samples | ANOVA (F-test) | μ₁ = μ₂ = μ₃ … = μₙ | At least one mean differs
Median | Non-parametric, 1 sample | Sign Test | Median = Median₀ | Median ≠ Median₀
Rank | Non-parametric, 2 independent samples | Mann-Whitney U Test (MWU Test) | Distributions are equal | One distribution is shifted
Rank | Non-parametric, paired (or dependent) samples | Wilcoxon Signed-Rank Test | Medians are equal | Medians are different
Rank | Non-parametric, >2 groups | Kruskal-Wallis Test (KW Test) | Distributions are equal | At least one distribution is different
Proportion | 1 categorical variable, 1 sample | Z-test | p = p₀ | p ≠ p₀ (two-tailed) or p > p₀ / p < p₀ (one-tailed)
Proportion | 1 categorical variable, 2 independent samples | Z-test | p₁ = p₂ | p₁ ≠ p₂ (two-tailed) or p₁ > p₂ / p₁ < p₂ (one-tailed)
Chi-Square | Goodness of fit, non-parametric & nominal variables | Chi-Square Test | Observed distribution = Expected distribution | Observed distribution ≠ Expected distribution
Chi-Square | Test of independence (nominal variables) | Chi-Square Test | Variables are independent | Variables are dependent

Here's a chart with examples for each statistical test, showing when and where they are
commonly applied:

Test Type | Conditions | Test Name | Example Use Case
Mean (parametric) | 1 sample, σ known | Z-test | Testing if the average height of students is 170 cm when the population standard deviation is known.
Mean | 1 sample, σ unknown | T-test | Checking if the average weight of a product is 500 g when the population standard deviation is unknown.
Mean | 2 samples, independent, σ unknown (like a drug test on 2 different sets of people) | Independent T-test (pooled variance) | Comparing the average test scores of two different classrooms.
Mean | 2 samples, paired | Paired T-test | Measuring the effect of a diet on weight before and after for the same group of people.
Mean | >=3 independent samples | ANOVA (F-test) | Comparing the average sales across 3 different regions.
Median (non-parametric) | Non-parametric, 1 sample | Sign Test | Testing if the median house price is $300,000.
Rank (non-parametric) | Non-parametric, 2 independent samples, ordinal data | Mann-Whitney U Test | Comparing customer satisfaction rankings of two different stores.
Median | Non-parametric, paired (or dependent) samples, ordinal data | Wilcoxon Signed-Rank Test | Measuring the effect of therapy on stress levels before and after treatment.
Median | Non-parametric, >2 groups, ordinal data | Kruskal-Wallis Test | Comparing the effectiveness of 3 different fertilizers on plant growth.
Proportion | 1 categorical variable (only 2 options), 1 sample | Z-test | Testing if the proportion of people who prefer tea differs from a claimed value, e.g. 50%.
Proportion | 1 categorical variable (only 2 options), 2 independent samples | Z-test | Comparing the proportion of men vs women who prefer online shopping, or tea preference between two cities (City 1 and City 2).
Chi-Square (non-parametric; checks for association among variables) | Goodness of fit, no parametric assumptions, nominal variables (no order/rank) | Chi-Square Test | Testing whether a die produces all numbers with equal frequency.
Chi-Square | Test of independence (nominal variables, no order/rank) | Chi-Square Test | Checking if gender is related to political preference, or whether gender (Male, Female) is associated with preferred beverage (Tea, Coffee, Water).
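A hedged sketch of how a few of these tests look in scipy.stats, run on made-up data; the groups and effect sizes are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
class_a = rng.normal(70, 10, 40)   # hypothetical test scores, classroom A
class_b = rng.normal(74, 10, 40)   # hypothetical test scores, classroom B

# Independent two-sample t-test (H0: μ1 = μ2)
t_stat, p_ind = stats.ttest_ind(class_a, class_b)

# Paired t-test, e.g. weight before/after a diet (H0: μd = 0)
before = rng.normal(80, 5, 30)
after = before - rng.normal(1, 2, 30)
t_pair, p_pair = stats.ttest_rel(before, after)

# One-way ANOVA across 3 groups (H0: all means equal)
f_stat, p_anova = stats.f_oneway(class_a, class_b, rng.normal(72, 10, 40))

# Non-parametric alternative for two independent samples
u_stat, p_mwu = stats.mannwhitneyu(class_a, class_b)

print(p_ind, p_pair, p_anova, p_mwu)  # compare each against α = 0.05
```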

Chi-square test - a statistical method used to test for a difference or association between the observed
and expected frequencies of categorical variables in a dataset.
Example: A food delivery company wants to find the relationship between gender, location, and the food
choices of people.

In Chi-Square, this difference is tested like this:


 Null Hypothesis (H₀): The observed distribution is the same as expected (no relationship, fair dice,
etc.).
 Alternative Hypothesis (H₁): The distribution is different (not fair, variables are related, etc.).
If your test shows a significant difference, you reject H₀ and say:
“The distribution is different” = “What we see doesn’t match what we expected under no effect.”

The Chi-Square test is actually used for two main purposes, and they both revolve around comparing observed vs.
expected distributions — but in slightly different contexts:
🔹 1. Chi-Square Test of Independence
(Also called association test)

👉 Purpose:

To test if two categorical variables are associated or independent.

🧠 Example:

 Is ice cream preference related to age group?


 Is gender associated with product purchase?

🧪 You’re testing:

"Is the distribution of preferences different across groups?"

If the distributions across age groups differ significantly, you conclude the variables are associated.

🔹 2. Chi-Square Goodness-of-Fit Test


👉 Purpose:

To test if a single categorical variable follows a specific distribution (often a uniform or expected one).

🧠 Example:

 Do people prefer all brands equally?


 Does a die produce all numbers with equal frequency?

🧪 You’re testing:

"Is the observed distribution of a variable different from what we expected?"

Test | Checks What? | Obs vs Exp Based On…
Goodness-of-Fit | Is the data evenly/randomly distributed? | Assumed probabilities
Test of Independence | Are 2 categorical variables related? | Frequencies from a 2-way table

🔁 Common Ground: Observed vs. Expected


In both tests, you're comparing:

 Observed frequencies → from actual data


 Expected frequencies → under the null hypothesis

If the difference is too large, the test statistic (Chi-Square) becomes large, and you reject the null hypothesis.
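A minimal sketch of both Chi-Square variants with scipy.stats; the counts below are hypothetical.

```python
import numpy as np
from scipy import stats

# Goodness-of-fit: is a die fair? Observed counts from 60 hypothetical rolls.
observed = np.array([8, 12, 9, 11, 10, 10])
chi2_gof, p_gof = stats.chisquare(observed)   # expected defaults to uniform
print(chi2_gof, p_gof)

# Test of independence: gender vs preferred beverage (hypothetical 2-way counts)
table = np.array([[30, 20, 10],               # rows: Male, Female
                  [25, 25, 15]])              # cols: Tea, Coffee, Water
chi2_ind, p_ind, dof, expected = stats.chi2_contingency(table)
print(chi2_ind, p_ind, dof)                   # small p → variables are associated
```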
✅ Quick Summary:

 Parametric Tests: Assume normal distribution; used for means.


 Non-Parametric Tests: No assumption of normality; used for medians/ranks.
 Chi-Square Tests: Used for categorical data (counts/frequencies).

For categorical variables, it is not meaningful to talk about whether the distribution is
"normal" or not because:

 Normal distribution applies to continuous variables (e.g., height, weight, income)


where values are measured on a continuous scale.
 Categorical variables (e.g., gender, device type, yes/no responses) represent
categories or labels, not numerical values.
 Since categorical data is based on counts or proportions, concepts like mean and
standard deviation don’t apply in the same way.

Note: The Shapiro-Wilk test is used to determine whether a distribution is normal or not.
Levene's test is used to check whether the variances of two or more groups are equal, also known as
homogeneity of variances. This is a crucial assumption for many parametric tests, such as t-tests and ANOVA.

✅ When to Use Levene's Test:

1. Before Running Parametric Tests:

o Many tests (like t-tests and ANOVA) assume that the variances across groups are
approximately equal.

o Levene's test helps validate this assumption.

2. For Comparing Two or More Groups:

o Works for 2 or more independent groups.

o Useful when comparing categories like gender, treatment types, or experimental groups.

3. Robust to Non-Normality:

o Levene's test is less sensitive to departures from normality compared to other tests (like
Bartlett’s test).

o Ideal when data is non-normal or has outliers.
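A small sketch of both checks with scipy.stats, on simulated groups; one group is given a deliberately larger spread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(50, 5, 40)    # hypothetical measurements
group2 = rng.normal(52, 5, 40)
group3 = rng.normal(51, 9, 40)    # deliberately larger variance

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
w_stat, p_normal = stats.shapiro(group1)

# Levene: H0 = all groups have equal variances (homogeneity)
lev_stat, p_levene = stats.levene(group1, group2, group3)

print(p_normal, p_levene)   # a small p_levene suggests unequal variances
```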

A Confidence Interval gives a range of values that is likely to contain the true population parameter (like
mean or proportion), based on your sample data.

Confidence Levels - It’s the probability (or certainty level) that your confidence interval actually contains the
true population parameter.

 Common confidence levels: 90%, 95%, 99%

 95% confidence level = if you repeated the experiment 100 times, about 95 of the resulting
intervals would contain the true parameter

Relation to CI:

The higher the confidence level, the wider the confidence interval.
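A sketch of computing a 95% confidence interval for a mean with scipy; the data are hypothetical product weights in grams.

```python
import numpy as np
from scipy import stats

data = np.array([498, 502, 499, 501, 500, 497, 503, 499, 500, 501])  # hypothetical weights (g)
mean = data.mean()
sem = stats.sem(data)   # standard error of the mean

# 95% CI using the t-distribution (σ unknown, small sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(mean, (ci_low, ci_high))
# A 99% interval (0.95 → 0.99) would be wider, as noted above.
```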

P-Value :- A p-value, or probability value, is the probability of obtaining results at least as extreme as
the observed data, assuming the null hypothesis of the statistical test is true.

 A small p-value (like less than 0.05) means the results are unlikely due to chance, so you might reject
the null hypothesis.

 A large p-value means the results could easily happen by chance, so you don't reject the null
hypothesis.

✅ Practical Interpretation in Interviews:


Concept | What It Answers | Example
Confidence Interval | "Where could the true value lie?" | "The true mean is likely between 180 and 220."
Confidence Level | "How confident are you in that range?" | "We are 95% confident that range is correct," i.e., if I repeat the experiment 100 times, about 95 of the intervals will contain the true value.
p-value | "How likely is my result if there's no effect?" | "Only a 2% chance it happened randomly."
Significance level (α) | "What's my cut-off for surprise?" | "If p < 0.05, I'll consider the result significant."
When to use Z-test and t-test

Z-test Example
Similarly, from below we can calculate the actual values of the lower and upper bounds of the
confidence interval.

Example of t-test

 Analyses the difference of means


Degrees of Freedom = n - 1 = 30 - 1 = 29

The values +2.045 and -2.045 are found in the t-table, specifically in the two-tailed
table (df = 29, α = 0.05). The value 2.045 is called the critical value. (See the lookup sketch below.)
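Rather than a printed t-table, the same critical value can be looked up with scipy (df = 29, two-tailed α = 0.05):

```python
from scipy import stats

alpha, df = 0.05, 29
t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
print(t_crit)                             # ~2.045, matching the t-table
```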
Confusion matrix - A confusion matrix is a table used to evaluate the performance of
a classification model. It compares the predicted labels with the actual labels and
provides insights into how well the model is performing. The matrix is especially
useful for binary and multi-class classification problems.

Explanation of Terms
 True Positive (TP) – The model correctly predicted the positive class.
 False Positive (FP) (Type I Error) – The model incorrectly predicted positive
when it was actually negative.
 False Negative (FN) (Type II Error) – The model incorrectly predicted negative
when it was actually positive.
 True Negative (TN) – The model correctly predicted the negative class.
Q) When should we focus on reducing FN vs FP?
 It is domain specific.
 For disease detection, reducing FN is usually more important. E.g., cancer: a FN
means the patient has cancer but is predicted as "No Cancer".
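A minimal sketch with scikit-learn's confusion_matrix; the labels below are hypothetical (1 = cancer).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels (1 = cancer)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
# Here FN=1: one cancer patient predicted as "No Cancer" - the costly error above
```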

Point Estimate & Margin of Error
Margin of Error = Z × standard error, where Z is the critical value.
Interpreting Margin of Error
 If a survey reports a 60% approval rating with a ±3% margin of error, the actual approval
could be anywhere between 57% and 63%.

 A larger sample size reduces the margin of error, making estimates more precise.

 A higher confidence level increases the margin of error, as it requires a wider range to ensure
accuracy.
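A sketch of the margin-of-error arithmetic behind the survey example above (60% approval; n = 1,000 is an assumed sample size).

```python
import math
from scipy import stats

p_hat, n = 0.60, 1000            # hypothetical: 60% approval from 1,000 respondents
z = stats.norm.ppf(0.975)        # z critical value for 95% confidence (~1.96)

moe = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error for a proportion
print(moe, (p_hat - moe, p_hat + moe))          # ~±3%, i.e. roughly 57% to 63%
```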

Example to calculate the confidence interval values


This shows that hypothesis testing (Z-test) and confidence intervals are closely related:
 In a Z-test, we check whether the Z-statistic exceeds a critical value to determine statistical
significance.
 In a confidence interval, we check whether a hypothesized mean μ lies inside or
outside the interval.
Key Insight:
 The confidence interval checks whether the population mean falls within a range built
around the provided sample mean.
 The Z-test checks whether the sample mean is significantly different from the provided
population mean; equivalently, whether the sample mean lies within the 95% acceptance
region (imagine the 95% bell curve).
ANOVA Test → F-distribution (a right-skewed distribution)

1. Concept example of Factors & Levels

2. Assumptions to be considered before ANOVA


3. Types of ANOVA
STEPS

Example: One-Way ANOVA


1) Question

2) Null & Alternate Hypothesis

3) Degrees of Freedom

4)
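A hedged one-way ANOVA sketch matching these steps, on hypothetical sales data for three regions (factor = region, levels = A/B/C):

```python
from scipy import stats

# Hypothetical sales from 3 regions (one factor, 3 levels)
region_a = [20, 22, 19, 24, 25]
region_b = [28, 30, 27, 26, 29]
region_c = [21, 23, 22, 20, 24]

# H0: μA = μB = μC    H1: at least one mean differs
f_stat, p_value = stats.f_oneway(region_a, region_b, region_c)

# Critical value from the (right-skewed) F-distribution
df_between, df_within = 3 - 1, 15 - 3
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f_stat, f_crit, p_value)   # reject H0 if f_stat > f_crit (or p < 0.05)
```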
Chi Square Test

1. The 1st table is the original population data (percentages provided at population scale) and the
2nd table is the data gathered after sampling.
2. So we are trying to find out if there is a change between these two datasets. Did the
population change in 10 years?
3.

Here we calculate the expected counts based on the population percentages provided.

4.

Note: n = number of categories (not the sample size of 500), hence n = 3 and degrees of freedom = n - 1 = 2.


5.

From Chi Square table, critical value is 5.99


6.
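A sketch of this worked example in scipy; the population percentages and sample counts below are hypothetical stand-ins for the tables above, but the setup (n = 500, 3 categories, df = 2, critical value 5.99) matches.

```python
import numpy as np
from scipy import stats

n = 500
pop_share = np.array([0.60, 0.30, 0.10])   # hypothetical 10-year-old population %
observed = np.array([280, 160, 60])        # hypothetical sampled counts (sum = 500)

expected = pop_share * n                   # expected counts under H0
chi2, p = stats.chisquare(observed, f_exp=expected)

crit = stats.chi2.ppf(0.95, df=len(observed) - 1)   # df = 3 - 1 = 2 → 5.99
print(chi2, crit, p)   # reject H0 (population changed) if chi2 > 5.99
```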

CoVariance

Covariance is a measure of how two variables change together.

🎯 Intuitive Explanation

 If both variables increase together, covariance is positive.


 If one variable increases while the other decreases, covariance is negative.
 If there's no consistent pattern, covariance is close to zero.

For a sample, use (n-1) in the denominator: cov(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)


Covariance and correlation are both measures of how two variables change together, but they
differ in terms of scale, interpretation, and standardization.

 Covariance tells you the direction of the relationship.


 Correlation tells you the direction and strength of the relationship, and it's easier to interpret! 🚀
This correlation formula is also called the Pearson correlation coefficient. Its main limitation is that it
only captures linear relationships between the variables.
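A short sketch contrasting the two measures on toy data: covariance gives the direction but depends on the variables' scales, while Pearson's r is standardised to [-1, 1].

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

cov_xy = np.cov(x, y)[0, 1]     # sample covariance (n-1 denominator)
r, p = stats.pearsonr(x, y)     # Pearson correlation: direction + strength

print(cov_xy, r)   # covariance is scale-dependent; r is always in [-1, 1]
```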

Spearman Rank Correlation


Spearman can also capture monotonic non-linear relationships (see the figure below).
The Spearman Rank Correlation (or Spearman's rho, denoted ρ) is a non-parametric measure of
the strength and direction of the monotonic relationship between two variables.

🎯 Key Points about Spearman Rank Correlation

 It does not assume linearity but instead measures if the relationship is monotonic (consistently
increasing or decreasing).

 Works with ranked data, meaning it converts data into ranks before calculating correlation.

 Less sensitive to outliers than Pearson, because it is based on ranks, not raw values.

 Ideal for ordinal data or data that doesn't meet the assumptions of Pearson correlation (like normal
distribution).
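A sketch showing where Spearman and Pearson differ, using a monotonic but non-linear toy relationship (y = x³):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6])
y = x ** 3                               # monotonic but clearly non-linear

r_pearson, _ = stats.pearsonr(x, y)      # < 1: relationship is not linear
rho, _ = stats.spearmanr(x, y)           # = 1.0: the ranks agree perfectly

print(r_pearson, rho)
```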

Probability Basics
Basic Probability Notes for Data Analysis Roles

1. Basics of Probability
Definitions:

2. Types of Probability
 Classical (Theoretical) Probability: Based on predefined rules (e.g., rolling a fair die).
 Empirical (Experimental) Probability: Based on observations/data.
 Subjective Probability: Based on intuition or expert knowledge.

3. Probability Rules
4. Conditional Probability

5. Important Probability Distributions for Data Analysis


1. Bernoulli Distribution: Binary outcomes (0 or 1), like coin flips.
2. Binomial Distribution: Number of successes in n independent Bernoulli trials.
3. Poisson Distribution: Counts of events happening in a fixed time interval.
4. Uniform Distribution: All outcomes are equally likely.
5. Normal Distribution (Gaussian): Bell-shaped curve, common in real-world data.
6. Exponential Distribution: Time until an event occurs (e.g., time between arrivals in a queue).
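A quick sketch mapping each distribution above to a NumPy sampler; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

bernoulli   = rng.binomial(1, 0.5, size=10)        # single coin flips (0 or 1)
binomial    = rng.binomial(n=10, p=0.5, size=5)    # successes in 10 trials
poisson     = rng.poisson(lam=4, size=5)           # event counts per interval
uniform     = rng.uniform(0, 1, size=5)            # all outcomes equally likely
normal      = rng.normal(loc=0, scale=1, size=5)   # bell-shaped curve
exponential = rng.exponential(scale=2, size=5)     # waiting time until an event
```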

6. Expected Value & Variance


7. Applications in Data Analysis
 A/B Testing: Probability helps determine if a change in a product leads to statistically significant improvements.
 Predictive Modeling: Many machine learning models (e.g., Naive Bayes) are based on probability.
 Anomaly Detection: Outliers can be identified using probability distributions.
 Bayesian Statistics: Used for updating beliefs with new data.
