Data Science With Python Relationship

1) Understanding relationships between variables is a critical step in data analysis. Relationships can be measured using summary tables, calculations, and visualizations like scatterplots. 2) Scatterplots show relationships between continuous variables and can reveal if the relationship is positive, negative, linear, or nonlinear. Summary tables describe the central tendency and dispersion of data. 3) Common metrics for measuring relationships include correlation coefficients, t-tests, ANOVA, and chi-square tests. Correlation coefficients measure the strength and direction of linear relationships between variables.


DATA SCIENCE WITH PYTHON

UNDERSTANDING RELATIONSHIPS

3/22/19
OVERVIEW

• A critical step in making sense of data is an understanding of the relationships between different
variables.

• E.g., is there a relationship between interest rates and inflation, or between education level and income?

• The existence of an association between variables does not imply that one variable causes the other.
• How to measure?
• Summary tables
• Specific calculations
• Visualization tools

3/22/19 – content to be used for explanation/reference educational purposes, Slide no. 3


SCATTERPLOTS

• For continuous variables

• Reveal a positive, negative, or no relationship at all

• Reveal linear or nonlinear relationships

• Help with outlier detection
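A minimal sketch of such a scatterplot, using synthetic data (the variable names and values here are illustrative, not from the slides):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # roughly linear, positive relationship

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Positive linear relationship")
fig.savefig("scatter.png")
```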



SUMMARY TABLES – DESCRIBE METHOD

• Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

• Analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types.

• The output will vary depending on what is provided.
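A short sketch of the describe method on a small, made-up DataFrame (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],          # numeric column with a NaN
    "city": ["NY", "SF", "NY", "LA", "NY"],   # object (categorical-like) column
})

# Numeric columns only by default; NaN values are excluded from the stats.
print(df.describe())

# include="all" adds object columns (count, unique, top, freq).
print(df.describe(include="all"))
```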



METRICS ABOUT RELATIONSHIPS

• Correlation Coefficients
• Pearson’s, developed by Karl Pearson over 120 years ago
• Spearman
• Kendall Tau

• t-Tests Comparing Two Groups

• ANOVA
• ANOVA-1 way
• ANOVA-2 way
• ANOVA-N way

• Chi-Square tests



COVARIANCE - COV

Covariance provides insight into how two variables are related to one another.

More precisely, covariance refers to the measure of how two random variables in a data set will change
together.

A positive covariance means that the two variables at hand are positively related, and they move in the
same direction.

A negative covariance means that the variables are inversely related, or that they move in opposite
directions.

Covariance always has units (those of X multiplied by those of Y). In a finance context, covariance is the term used to
describe how two stocks will move together.



CALCULATE COVARIANCE - COV

Cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / N

In this formula,
X represents the independent variable,
Y represents the dependent variable,
N represents the number of data points in the sample,
x̄ represents the mean of X, and
ȳ represents the mean of the dependent variable Y.

- Covariance values are not standardized; they range from −∞ to +∞.

- With covariance, there is no minimum or maximum value, so the values are more difficult to
interpret. For example, a covariance of 50 may show a strong or weak relationship; this depends on
the units in which covariance is measured.
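The calculation above can be checked with NumPy on made-up data; note that np.cov defaults to the sample (N − 1) denominator, while the formula above divides by N:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # moves with x, so covariance is positive

# np.cov returns the covariance matrix; element [0, 1] is cov(x, y).
# The default denominator is N - 1 (sample); bias=True uses N (population).
sample_cov = np.cov(x, y)[0, 1]
population_cov = np.cov(x, y, bias=True)[0, 1]
print(sample_cov, population_cov)
```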



CORRELATION

Definition
• Correlation is used to test relationships between quantitative variables or categorical variables.

• In other words, it’s a measure of how things are related.

• The study of how variables are correlated is called correlation analysis.

Examples:
• Your caloric intake and your weight.
• Your eye color and your relatives’ eye colors.
• The amount of time you study and your GPA.

Some examples of data that have a low correlation (or none at all):
• A dog’s name and the type of dog biscuit they prefer.
• The cost of a car wash and how long it takes to buy a soda inside the station.
CORRELATION

Correlation is defined as covariance normalized by the product of standard deviations, so the correlation between X and Y is:

  r(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

Correlation coefficients are standardized: the value is always between −1 and 1.

Correlation does not have units.

For example,
• a correlation of 0.9 indicates a very strong relationship in which two variables nearly always move in
the same direction;

• a correlation of –0.1 shows a very weak relationship in which there is a slight tendency for two
variables to move in opposite directions.
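A quick sketch with scipy.stats.pearsonr on made-up, nearly linear data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # close to y = 2x

# pearsonr returns the coefficient and a two-sided p-value.
r, p = stats.pearsonr(x, y)
print(r)  # near +1: a very strong positive linear relationship
```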



FORMULA

• r = correlation coefficient

• The formula used to calculate r is:

  r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )



CORR - EX

Correlation can have a value:

• +1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• −1 is a perfect negative correlation



KEY POINTS

X1 X2 X3 X4 … … y

- The relationship between the predictors and the target variable should be strong.

- Relationships amongst the predictors indicate multicollinearity and redundancy.
  - This is a HUGE practical problem: it inflates the variance of the coefficient estimates.
  - How to handle it?
    - Feature selection to restrict the columns
    - Advanced ML techniques (like Ridge, Lasso, XGBoost, etc.)

- Pearson's correlation coefficient is good for continuous variables:
  - it assumes a linear relationship between the two variables
  - it is sensitive to outliers



RELATIONSHIP WITH R2

• Another way to interpret the Pearson correlation is to use the coefficient of determination, also known as R².

• While ρ is unitless, its square is interpreted as the proportion of the variance of Y explained by X.

• ρ = −0.65 implies that (−0.65)² × 100 ≈ 42% of the variation in Y can be explained by X.



USAGE

Types of research questions a Pearson correlation can examine:

• Is there a statistically significant relationship between age, as measured in years, and height, measured
in inches?
• Is there a relationship between temperature, measured in degrees Fahrenheit, and ice cream sales,
measured by income?
• Is there a relationship between job satisfaction, as measured by the JSS, and income, measured in
dollars?

Assumptions
• For the Pearson r correlation, both variables should be normally distributed (normally distributed
variables have a bell-shaped curve).
• Other assumptions include linearity and homoscedasticity.
• Linearity assumes a straight line relationship between each of the two variables and homoscedasticity
assumes that data is equally distributed about the regression line.
SPEARMAN'S CORRELATION

• A special case of Pearson's ρ applied to ranked (sorted) variables.

• Unlike Pearson, Spearman's correlation is not restricted to linear relationships.

• Instead, it measures monotonic association (only strictly increasing or decreasing, but not
mixed) between two variables and relies on the rank order of values.

• In other words, rather than comparing means and variances, Spearman's coefficient looks
at the relative order of values for each variable.

• This makes it appropriate to use with both continuous and discrete data.
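A sketch contrasting Spearman and Pearson on a monotonic but nonlinear (made-up) series:

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 11.0)
y = x ** 3  # nonlinear but strictly increasing (monotonic)

rho, _ = stats.spearmanr(x, y)   # works on rank order
r, _ = stats.pearsonr(x, y)      # assumes linearity
print(rho, r)  # rho is (essentially) 1.0; Pearson's r is below 1
```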



FORMULA

• The formula for Spearman's coefficient looks very similar to that of Pearson, with the distinction of being computed on ranks instead of raw scores. With no tied ranks it simplifies to:

  ρ = 1 − 6 Σ d_i² / (n (n² − 1))

  where d_i is the difference between the two ranks of observation i and n is the number of observations.



KENDALL TAU (RANK CORRELATION)

• Based on a ranking of the observations for two variables. Kendall's τ does not take the size of the difference between ranks into account, only the directional agreement. Therefore, this coefficient is more appropriate for discrete data.

• Based on counts of concordant and discordant pairs of observations.

• Example: observations A and B, variables X and Y:
  • difference of the values for Variable X: XB − XA = 2 − 1 = 1
  • difference of the values for Variable Y: YB − YA = 4 − 2 = 2
  • Since these differences are in the same direction (1 and 2 are both positive), the observations A and B are concordant.
  • A discordant pair occurs when the differences of the two variables' values move in opposite directions.



EXPLANATION

Observation  Variable X  Variable Y  Concordant  Discordant
A            1           2           8           1
B            2           4           6           2
C            3           1           7           0
D            4           3           6           0
E            5           6           4           1
F            6           5           4           0
G            7           7           2           1
H            8           8           2           0
I            9           10          0           1
J            10          9           0           0
SUM                                  39          6

• The observations A–J are ordered using Variable X, and each unique pair of observations is compared.

• A is compared with all other observations (B, C, ..., J) and the number of concordant and discordant pairs is counted.

• For observation A, there are eight concordant pairs (A–B, A–D, A–E, A–F, A–G, A–H, A–I, A–J) and one discordant pair (A–C).

• This is repeated for all other observations: B is compared to observations C through J, C to D through J, and so on.

𝜏A = (39 − 6) / 45
𝜏A ≈ 0.73

(45 is the number of unique pairs: 10 × 9 / 2.)

• 𝜏 ranges between −1 and 1:
  • 1 indicates a perfect agreement of the rankings;
  • −1 indicates a perfect disagreement of the rankings;
  • a value of zero indicates a lack of association (assigned when the ranks are tied).

• The coefficient cannot be squared to obtain a coefficient of determination.



INTERPRETATION

• The Kendall’s rank correlation coefficient can be calculated in Python using the kendalltau() SciPy
function.

• The test takes the two data samples as arguments and returns the correlation coefficient and the p-value.

• As a statistical hypothesis test, the method assumes (H0) that there is no association between the two
samples.



T-TESTS COMPARING TWO GROUPS

• The t-test (also called Student's t-test) compares two means and tells us if they are different from each other. The t-test also tells us how significant the differences are.

• This concept can be extended to compare the mean values of two subsets: we can explore whether the means of two groups are different enough to call the difference significant.

• In a binary classification problem, where each observation can be classified into either class C1 or class C2, the t-statistic helps us evaluate whether the values of a particular feature for class C1 are significantly different from the values of the same feature for class C2.

• If this holds, the feature can help us better differentiate our data.



EXPLANATION

• Does the salary of a person impact his chances to get a loan?

• Calculate the mean and variance of:
  • salaries of individuals when the loan was approved
  • salaries of individuals when the loan was not approved

• Use the t-statistic to check whether these two samples are significantly different or not.

• The two-sample t-statistic is computed using (Welch's form, which does not assume equal variances):

  t = (x̄1 − x̄2) / √( s1²/n1 + s2²/n2 )

  where x̄i, si², and ni are the mean, variance, and size of sample i.

• Calculate the t-statistic for each feature, then sort these values in descending order to select the important features.
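A sketch of the loan example with synthetic salary samples (the means, spreads, and sample sizes are assumptions for illustration); Welch's variant is used, which drops the equal-variance assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical salary samples for the loan example above.
approved = rng.normal(loc=60_000, scale=8_000, size=50)
rejected = rng.normal(loc=45_000, scale=8_000, size=50)

# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(approved, rejected, equal_var=False)
print(t_stat, p_value)  # large |t| and tiny p: the means differ significantly
```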



INTERPRETATION

• If abs(t-statistic) <= critical value: fail to reject the null hypothesis that the means are equal.

• If abs(t-statistic) > critical value: reject the null hypothesis that the means are equal; the first mean is either smaller or greater than the second mean.

• If p > alpha: fail to reject the null hypothesis that the means are equal.
• If p <= alpha: reject the null hypothesis that the means are equal.



T-TEST ASSUMPTIONS

• The first assumption: the scale of measurement applied to the data follows a continuous or ordinal scale, such as the scores for an IQ test.

• The second: the data is collected from a representative, randomly selected portion of the total population.

• The third: the data, when plotted, results in a normal, bell-shaped distribution curve.

• The fourth assumption is that a reasonably large sample size is used. A larger sample size means the distribution of results should approach a normal bell-shaped curve.

• The final assumption is homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of the samples are approximately equal.



CHI-SQUARE TEST

• For use with features measured on a categorical (nominal or ordinal) scale.

• When to Use the Chi-Square Goodness of Fit Test


• The sampling method is simple random sampling.
• The variable under study is categorical.
• The expected value of the number of sample observations in each level of the variable is at
least 5.



CHI-SQUARE TEST

• allows an analysis of whether there is a relationship between two categorical variables.

• Null Hypothesis: The two categorical variables are independent.


• Alternative Hypothesis: The two categorical variables are dependent.

• The chi-square test statistic is calculated by using the formula:

  χ² = Σ (O − E)² / E

• O represents the observed frequency.

• E is the expected frequency under the null hypothesis, computed by:

  E = (row total × column total) / grand total
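A sketch with scipy.stats.chi2_contingency on a made-up 2×2 contingency table (by default it applies Yates' continuity correction for 2×2 tables):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed counts:
# rows = two groups, columns = two categorical outcomes.
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
print(expected)  # E = row total * column total / grand total
```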



CHI-SQUARE TEST

• Interpretation:
  • If p <= alpha (or the chi-square statistic exceeds the critical value): reject the null hypothesis; the variables are dependent.
  • If p > alpha: fail to reject the null hypothesis that the variables are independent.



ANOVA (ONE-WAY)

• The analysis of variance (ANOVA) can be thought of as an extension to the t-test.


• The independent t-test is used to compare the means of a condition between 2 groups, but sometimes we want to compare more than 2 groups.
  • E.g., to test whether voter age differs based on some categorical variable like race or education level, compare the means of each level.
  • Alternatively, run a separate t-test for each pair of groups.
• The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.

• H0: no difference between the means, i.e. x̄1 = x̄2 = x̄3

• Ha: a difference between the means exists somewhere, i.e. x̄1 ≠ x̄2 ≠ x̄3, or x̄1 = x̄2 ≠ x̄3, or x̄1 ≠ x̄2 = x̄3 (at least one mean differs).
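A sketch of a one-way ANOVA with scipy.stats.f_oneway on three synthetic groups (the means, spreads, and sizes are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical ages for voters at three education levels.
group1 = rng.normal(loc=50, scale=5, size=30)
group2 = rng.normal(loc=50, scale=5, size=30)
group3 = rng.normal(loc=60, scale=5, size=30)  # this group's mean differs

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)  # small p: at least one group mean differs
```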



ASSUMPTION

• There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate
and trustworthy. It's important to note that the assumptions apply to the residuals and not the variables
themselves.

• The ANOVA assumptions are the same as for linear regression and are:

• Normality - Caveat to this is, if group sizes are equal, the F-statistic is robust to violations of normality

• Homogeneity of variance - Same caveat as above, if group sizes are equal, the F-statistic is robust to this violation

• Independent observations



1-WAY, 2-WAY, N-WAY ANOVA

• a one-way ANOVA should be used if you have 1 categorical independent variable (IV) with 2+ categories
or groups and 1 continuous dependent variable (DV);

• this is a 1 factor design.

• The two-way ANOVA is an extension of the one-way ANOVA and should be used if you have 2 categorical IVs with 2+ groups each, and 1 continuous DV;

• this is a multi-factor design, specifically a 2 factor design, because there are 2 IVs.

• In the ANOVA framework, IVs are often called factors and each category/group within an IV is called a
level. Just as with a one-way ANOVA, a two-way ANOVA tests if there is a difference between the
means, but it does not tell which groups differ.

