Correlation, Probability

The document discusses correlation coefficients, including Pearson and Spearman, explaining their interpretations and applications in analyzing relationships between variables. It also covers sampling methods, probability theory, and various types of probability including marginal, joint, and conditional probabilities, along with Bayes' Theorem. Additionally, it highlights the importance of sampling techniques in research and the differences between probability and non-probability sampling.

Uploaded by

rgrewal112233
Correlation

Pearson Correlation Coefficient
Interpretation
• Range: -1 to +1
• Positive correlation: Both variables tend to move in the same direction
• Example: Correlation of 0.80 between hours spent studying and test scores
• Negative correlation: As one variable moves up, the other tends to move down
• Example: Correlation of -0.70 between hours spent watching TV and physical fitness
• Zero correlation: No linear relationship between the variables
• Example: Correlation of 0.02 between shoe size and IQ score
Positive, Negative, No Correlation
Pearson Correlation Coefficient for Earlier Example

Height  Weight
65      68
67      69
68      70
66      69
64      65
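The coefficient for the height and weight table can be worked out directly from the definition (covariation of the two variables divided by the product of their spreads). The following is a minimal sketch in plain Python; the function name `pearson_r` is chosen here for illustration and is not from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the root of the
    product of the summed squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Height and weight values from the table above
heights = [65, 67, 68, 66, 64]
weights = [68, 69, 70, 69, 65]

r = pearson_r(heights, weights)
print(f"Pearson r = {r:.4f}")  # about 0.90, a strong positive correlation
```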
Covariance Example
• C:\code\Data Analytics\correlation-covariance-crypto-gold.py

• Interpretation: Correlation

• Interpretation: Covariance
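One way to see the difference between the two interpretations is that covariance depends on the units of the variables, while correlation does not. The sketch below reuses the height and weight values from the earlier example and rescales the heights by 2.54; treating this as an inch-to-centimetre conversion is an assumption made only for illustration, since the slides do not state the units. It is not the referenced crypto/gold script.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Correlation = covariance scaled by both standard deviations."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )

heights = [65, 67, 68, 66, 64]
weights = [68, 69, 70, 69, 65]
heights_scaled = [h * 2.54 for h in heights]  # assumed unit change, for illustration

cov_original = sample_covariance(heights, weights)
cov_scaled = sample_covariance(heights_scaled, weights)
corr_original = correlation(heights, weights)
corr_scaled = correlation(heights_scaled, weights)

print(f"covariance:  {cov_original:.3f} -> {cov_scaled:.3f}")   # changes with the units
print(f"correlation: {corr_original:.4f} -> {corr_scaled:.4f}")  # unchanged
```

The sign of the covariance shows the direction of the relationship, but only the correlation gives a unit-free measure of its strength.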
Why Spearman Rank Correlation?
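• Pearson correlation coefficient: Measures the strength of a linear relationship, so it works well when the data is linear, but not when the relationship is non-linear
• Spearman rank correlation: Works on the ranks of the values, so it also captures monotonic relationships that are not linear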
Spearman Rank Correlation

Spearman Rank Correlation Coefficient for Earlier Example

Height  Weight
65      68
67      69
68      70
66      69
64      65

xi   Rx   yi   Ry    di = Rx - Ry   di²
65   2    68   2      0             0
67   4    69   3.5    0.5           0.25
68   5    70   5      0             0
66   3    69   3.5   -0.5           0.25
64   1    65   1      0             0
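The rank table can be reproduced in code. The sketch below assigns average ranks to tied values (the two weights of 69 share rank 3.5) and then applies the common shortcut formula rho = 1 - 6 Σd² / (n(n² - 1)). The helper name `average_ranks` is chosen here for illustration. The shortcut is exact only when there are no ties, so with the one tie in this data it should be read as a close approximation.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

heights = [65, 67, 68, 66, 64]
weights = [68, 69, 70, 69, 65]

rank_x = average_ranks(heights)   # Rx column
rank_y = average_ranks(weights)   # Ry column
d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]

n = len(heights)
rho = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
print(rank_x, rank_y, sum(d_squared), rho)
```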
Population and Sample,
Probability Theory
Sampling and Population

[Figure: Population → Sampling → Sample]

Population and Sample


• 10000 students in a university
• Question: What is the average student height?
• Difficult: Measure the actual height of the whole population, i.e. all 10000 students, to calculate the average (parameter, μ)
• Easier: Take a sample of 500 students, calculate their average (statistic, x̄) and use it to estimate the population average
• The sample statistic (x̄) is an estimate of the population parameter (μ)

Population Parameter   Symbol       Sample Statistic            Symbol
Mean                   μ (mu)       Sample Mean                 x̄ (x-bar)
Standard Deviation     σ (sigma)    Sample Standard Deviation   s
Proportion             π (pi)       Sample Proportion           p̂ (p-hat)
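The university example can be imitated with simulated data to show how close a sample statistic usually lands to the population parameter. In the sketch below the heights are randomly generated, not real measurements; the mean of 170 and spread of 7 are arbitrary assumptions for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated population of 10000 student heights (assumed values, not real data)
population = [rng.gauss(170, 7) for _ in range(10_000)]
mu = statistics.mean(population)        # population parameter: mean
sigma = statistics.pstdev(population)   # population parameter: standard deviation

# Sample of 500 students drawn without replacement
sample = rng.sample(population, 500)
x_bar = statistics.mean(sample)         # sample statistic: mean
s = statistics.stdev(sample)            # sample statistic: standard deviation (n - 1)

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}, s = {s:.2f}")
```

Note that `pstdev` divides by N (population) while `stdev` divides by n - 1 (sample), matching the σ and s rows of the table.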
Sampling Types
Univariate Sampling
• Analysis of a single variable
• Examples: Analyzing the ages of individuals in a population; examining the heights of a sample of students; studying the test scores of students in a class
• Measures: Descriptive statistics such as mean, mode, median, variance, standard deviation
• Plots: Histogram, Box plot
Bivariate Sampling
• Analysis of two variables to find the relationship between them
• Examples: Amount of time spent studying (variable 1) and the test scores achieved (variable 2); education (variable 1) and the income level (variable 2)
• Measures: Correlation, Regression
• Plots: Scatter plot, Heatmap
Resampling
• Resampling: Modifying a dataset by changing its size or composition
• How? Bootstrapping or Cross-validation
• Bootstrapping: Draw random samples from the original data with replacement
• Cross-validation: Split the data into a training set and a testing set
• Example: K-fold cross-validation
• Divide data into k subsets (folds), e.g. folds 1, 2, and 3
• Train the model on k-1 folds and test it on the remaining fold (e.g. Training: 1, 2 and Test: 3)
• Repeat the process k times, so that each fold is used once for testing
• Oversampling: Increase the size of the minority class by duplicating samples or generating synthetic data
• Example: SMOTE (Synthetic Minority Oversampling Technique): Handles imbalanced datasets, where the minority class has far fewer samples than the others (Code: gender-smote.py and smote.py)
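Bootstrapping and k-fold splitting can both be sketched with the standard library alone. The function names and the toy data (the numbers 1 to 9) below are assumptions for illustration; this is not the code from the referenced scripts, and real projects would more likely use a library such as scikit-learn.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items at random WITH replacement, so items can repeat."""
    return [rng.choice(data) for _ in data]

def k_fold_splits(data, k):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [data[i::k] for i in range(k)]  # deal items into k folds
    for test_index in range(k):
        test = folds[test_index]
        train = [item
                 for i, fold in enumerate(folds) if i != test_index
                 for item in fold]
        yield train, test

rng = random.Random(0)
data = list(range(1, 10))  # toy dataset, for illustration only

resample = bootstrap_sample(data, rng)
print("bootstrap:", resample)

splits = list(k_fold_splits(data, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

In practice the data is usually shuffled before being divided into folds; that step is left out here to keep the output easy to follow.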
Sampling Techniques
• Probability sampling: Random selection
• Every member of the population has a known, non-zero chance of being selected
• Most appropriate when the results must generalise to the whole population
• Mainly used in quantitative research
• Non-probability sampling: Non-random selection
• Not every individual has a chance of being included
• Easier and cheaper to access, but has a higher risk of sampling bias
• Used in qualitative research
Probability Sampling
Simple
• Brief description: Every member of the population has an equal chance
• Method: Random number generator
• Example: Just pick n random samples from 1 million credit card transactions
Systematic
• Brief description: Similar but easier
• Method: Choose at regular intervals
• Example: Pick order number # 7, 17, 27, 37, 47, …
Stratified
• Brief description: More representative, if we have sub-categories within data
• Method: Divide population into strata (sub-groups) and then pick proportionate samples from each
• Example: From 5 different income groups, select proportionate samples
Cluster
• Brief description: Similar, but use entire sub-groups, rather than samples within sub-groups
• Method: Divide population into clusters and pick entire clusters as samples
• Example: Select 5 different branches of a bank out of say 25 and consider all the customers from these 5 branches
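The four probability sampling methods can be compared side by side on a small toy population. Everything in this sketch is assumed for illustration: the population is just the IDs 1 to 100, the three strata and their sizes are invented, and the 25 clusters of 4 members stand in for the 25 bank branches.

```python
import random

rng = random.Random(1)
population = list(range(1, 101))  # toy population of IDs 1..100

# Simple random sampling: every member has an equal chance
simple = rng.sample(population, 10)

# Systematic sampling: start at 7, then take every 10th member
systematic = population[6::10]

# Stratified sampling: take the same proportion (10%) from each stratum
strata = {
    "low": population[:50],      # 50 members
    "middle": population[50:80], # 30 members
    "high": population[80:],     # 20 members
}
stratified = {name: rng.sample(members, len(members) // 10)
              for name, members in strata.items()}

# Cluster sampling: pick 5 whole clusters out of 25 and keep every member
clusters = [population[i:i + 4] for i in range(0, 100, 4)]
chosen_clusters = rng.sample(clusters, 5)
cluster_sample = [member for cluster in chosen_clusters for member in cluster]

print("simple:    ", sorted(simple))
print("systematic:", systematic)
print("stratified:", {name: sorted(ids) for name, ids in stratified.items()})
print("cluster:   ", sorted(cluster_sample))
```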
Non-Probability Sampling
Convenience
• Brief description: Just pick sample data as per one’s convenience
• Method: No systematic method
• Example: To research opinions about student support services in our university, we ask our fellow students to complete a survey on the topic
Voluntary-response
• Brief description: People voluntarily come forward
• Method: Start with the people who are interested (may be biased)
• Example: Ask employees about problems faced through a third-party survey; unhappy employees will come forward on their own
Purposive
• Brief description: Done for a specific purpose, also called judgment sampling
• Method: Reach the right people and then do a detailed study
• Example: We want to know the opinions of foreign students at C-DAC, so we specifically talk to them about a variety of topics
Snowball
• Brief description: Recruit more participants via existing participants
• Method: Create a snowball of participants
• Example: Homeless people survey; initially it is difficult to find such people to speak to, but they may soon help us speak with others
Sample Space and Events,
Probability
Sample Space and Events
• Sample space: All the possible outcomes of an experiment
• Event: A subset of the sample space
Sample space (S)

Event (E)
Events and Sample Space: Examples
• Consider students in our class
Presence event    Activeness event
present           active
absent            inactive

Sample space
{(present, active), (present, inactive), (absent, active), (absent, inactive)}
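The sample space in this example is the Cartesian product of the two sets of outcomes, and an event is any subset of it. A minimal sketch (the variable names are chosen here for illustration):

```python
from itertools import product

presence = ["present", "absent"]
activeness = ["active", "inactive"]

# Sample space: every combination of one presence and one activeness outcome
sample_space = list(product(presence, activeness))
print(sample_space)

# An event is a subset of the sample space, e.g. "the student is present"
event_present = [outcome for outcome in sample_space if outcome[0] == "present"]
print(event_present)
```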
Probability Definition
• Probability scale: 0 (0%) = Impossible, 0.5 (50%) = Equal chance, 1 (100%) = Certain

• Probability: How likely it is that something will occur
• Written as a number between 0 and 1, or as a percentage (e.g. 0.5 or 50%)
• Example:
• What is the probability that it will rain today?
• Sample space (S) = {R, NR}
• Theoretical probability, treating both outcomes as equally likely: P(R) = 1 / 2 = 0.5 or 50%
• P(NR) = 1 / 2 = 0.5 or 50%
• But what about rainy season?
• If it rains on every day observed, the empirical probability is P(R) = 100% and P(NR) = 0%
• Theoretical probability: What is expected to happen in theory
• Empirical/Practical probability: What actually happened, measured from observed outcomes
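The two kinds of probability can be contrasted with a simulated experiment. The sketch below uses a fair coin rather than the rain example, which is an assumption made so that the theoretical value is known exactly: the theoretical probability comes from counting equally likely outcomes, the empirical one from the observed frequency.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Theoretical probability: 1 favourable outcome out of 2 equally likely ones
theoretical = 1 / 2

# Empirical probability: frequency of heads in simulated tosses
tosses = [rng.choice("HT") for _ in range(10_000)]
empirical = tosses.count("H") / len(tosses)

print(f"theoretical P(H) = {theoretical}")
print(f"empirical   P(H) = {empirical:.4f}")
```

With many tosses the empirical value tends to settle near the theoretical one; with only a few observations the two can differ a lot.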
A team winning consecutive matches (Symbols: Win = 1, Loss = 0)

Match  Possible cases                                    Theoretical probability of winning every match
1      0, 1                                              0.50 = 50%
2      00, 01, 10, 11                                    0.25 = 25%
3      000, 001, 010, 011, 100, 101, 110, 111            0.125 = 12.5%
4      0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111,   0.0625 = 6.25%
       1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111

• Imagine the team actually winning four consecutive matches
• Theoretical probability would diminish to 6.25%
• But the team has actually won! So, the empirical probability will be 100%
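The table can be generated by listing every win/loss sequence and counting the single case in which all matches are won. This sketch assumes, as the table does, that a win and a loss are equally likely in every match and that matches do not influence each other.

```python
from itertools import product

probabilities = []
for matches in range(1, 5):
    # All win/loss sequences of this length, e.g. "00", "01", "10", "11"
    cases = ["".join(bits) for bits in product("01", repeat=matches)]
    all_wins = "1" * matches
    p_all_wins = cases.count(all_wins) / len(cases)  # 1 favourable case
    probabilities.append(p_all_wins)
    print(matches, len(cases), p_all_wins)
```

Each additional match doubles the number of possible cases, so the probability of an unbroken winning streak halves each time.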
Marginal, Joint, Conditional
Probability
Basic Terms
• Independent events: Outcome of Event A does not impact the outcome of Event B
• We take out one marble, note its colour, put the marble back (called replacement)
• We then take out a second marble and note its colour
• What is the probability that both are red?
• Dependent events: Outcome of Event A does impact the outcome of Event B
• We take out one marble, note its colour, do not put the marble back (no replacement)
• We then take out a second marble and note its colour
• What is the probability that both are red?
Basic Terms
• Marginal probability: Probability of an event irrespective of the outcome of another variable: P(A)
• Joint probability: Probability that events A and B both occur, whether they are independent or dependent: P(A,B)
• Independent events: If A and B are independent, P(A,B) = P(A) × P(B)
• Dependent events: If A and B are dependent, P(A,B) = P(A) × P(B|A), where P(B|A) is the conditional probability of B given A
Marginal Probability

Joint Probability of Independent Events

Joint Probability of Dependent Events
• Dependent event: The outcome of Event A impacts the outcome of Event B
• A jar contains 2 blue marbles and 3 red marbles
• If you take two marbles out of the jar without putting the first one back, what is the probability that they are both red?
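The question can be answered with the rule P(A,B) = P(A) × P(B|A): the first draw is red with probability 3/5, and once a red marble is gone the second is red with probability 2/4. The sketch below computes this with exact fractions and cross-checks it by listing every ordered pair of draws. The with-replacement figure is included for comparison and assumes the same jar is used for the independent case.

```python
from fractions import Fraction
from itertools import permutations

jar = ["R", "R", "R", "B", "B"]  # 3 red and 2 blue marbles
red = jar.count("R")
total = len(jar)

# Dependent events (no replacement): P(A) x P(B|A)
p_without_replacement = Fraction(red, total) * Fraction(red - 1, total - 1)

# Independent events (with replacement): P(A) x P(B), same jar assumed
p_with_replacement = Fraction(red, total) * Fraction(red, total)

# Cross-check by enumerating every ordered pair of distinct marbles
draws = list(permutations(range(total), 2))
both_red = sum(1 for i, j in draws if jar[i] == "R" and jar[j] == "R")
p_enumerated = Fraction(both_red, len(draws))

print(p_without_replacement, p_enumerated, p_with_replacement)
```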
Conditional Probability to Bayes’ Theorem
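Bayes' Theorem follows from the joint probability rule for dependent events: the joint probability P(A,B) can be written starting from either event, and equating the two forms gives the theorem. A short derivation, assuming P(B) > 0:

```latex
\begin{align}
P(A,B) &= P(A)\,P(B \mid A) = P(B)\,P(A \mid B) \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B) > 0
\end{align}
```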

Bayes’ Theorem Example
• 10% of patients entering a clinic have liver disease
• 5% of patients entering a clinic are alcoholics
• Out of the patients who have liver disease, 7% are alcoholics
• A = Liver disease; So P(A) = 0.10
• B = Alcoholic; So P(B) = 0.05
• B|A = Patient is alcoholic, given that the patient has liver disease; So P(B|A) = 0.07
• Find P(A|B), i.e. probability that the patient has liver disease, given that the patient is alcoholic
Bayes’ Theorem Example
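Plugging the three given values into P(A|B) = P(B|A) × P(A) / P(B) gives the answer. A minimal sketch follows; the function name `bayes` is chosen here for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be greater than 0")
    return p_b_given_a * p_a / p_b

p_a = 0.10          # P(A): patient has liver disease
p_b = 0.05          # P(B): patient is alcoholic
p_b_given_a = 0.07  # P(B|A): alcoholic, given liver disease

p_a_given_b = bayes(p_b_given_a, p_a, p_b)
print(f"P(A|B) = {p_a_given_b:.2f}")  # 0.07 * 0.10 / 0.05
```

So an alcoholic patient has a 14% chance of having liver disease, compared with 10% for patients overall.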

Bayes’ Theorem Example
• Rain prediction
• Overall historical probability of rain: P(R) = 0.30
• Sky condition
• P(Overcast|Rain) = 0.8
• P(Clear-sky|Rain) = 0.2
• P(Overcast) = 0.6
• Find P(Rain|Overcast), because today it is overcast
• Suppose A = Rain, B = Overcast sky condition
Bayes’ Theorem Example
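With A = Rain and B = Overcast, the values given for the rain prediction example can be substituted directly into Bayes' Theorem:

```latex
\begin{align}
P(\text{Rain} \mid \text{Overcast})
  &= \frac{P(\text{Overcast} \mid \text{Rain})\,P(\text{Rain})}{P(\text{Overcast})} \\
  &= \frac{0.8 \times 0.30}{0.6} = 0.40
\end{align}
```

An overcast sky raises the probability of rain from the historical 30% to 40%.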

Bayes’ Theorem
• Dangerous fires are rare (1%)
• Smoke is quite common due to barbecues (10%)
• 90% of dangerous fires cause smoke
• What is the probability that we have a dangerous fire when there is smoke?
Bayes’ Theorem
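Another way to reach the answer is to count cases instead of multiplying probabilities. The sketch below imagines 10,000 days, a hypothetical number chosen only so that every count is a whole number, and asks what share of the smoky days also have a dangerous fire.

```python
days = 10_000  # hypothetical number of days, chosen so all counts are whole

fire_days = days * 1 // 100                 # 1% of days have a dangerous fire
smoke_days = days * 10 // 100               # 10% of days have smoke
fire_and_smoke_days = fire_days * 90 // 100 # 90% of dangerous fires cause smoke

# P(Fire|Smoke): share of smoky days that also have a dangerous fire
p_fire_given_smoke = fire_and_smoke_days / smoke_days
print(fire_days, smoke_days, fire_and_smoke_days, p_fire_given_smoke)
```

This matches the formula P(Fire|Smoke) = P(Smoke|Fire) × P(Fire) / P(Smoke) = 0.9 × 0.01 / 0.10: even when there is smoke, a dangerous fire is still unlikely (9%).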

Conditional Probability to Bayes’ Theorem
