Data Science With Python
UNDERSTANDING RELATIONSHIPS
OVERVIEW
• A critical step in making sense of data is an understanding of the relationships between different
variables.
• E.g., is there a relationship between interest rates and inflation, or between education level and income?
• The existence of an association between variables does not imply that one variable causes the other.
• How to measure?
• Summary tables
• Specific calculations
• Visualization tools
• Positive, negative, or no relationship at all
• Outlier detection
• Correlation Coefficients
• Pearson’s r, developed by Karl Pearson over 120 years ago
• Spearman
• Kendall Tau
• ANOVA
• ANOVA-1 way
• ANOVA-2 way
• ANOVA-N way
• Chi-Square tests
COVARIANCE
Covariance provides insight into how two variables are related to one another.
More precisely, covariance refers to the measure of how two random variables in a data set will change
together.
A positive covariance means that the two variables at hand are positively related, and they move in the
same direction.
A negative covariance means that the variables are inversely related, or that they move in opposite
directions.
Covariance always has units (the product of the units of the two variables). In a finance context, covariance is the term used to describe how two stocks move together.
cov(X, Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / N

In this formula,
X represents the independent variable,
Y represents the dependent variable,
N represents the number of data points in the sample,
X̄ and Ȳ represent the means of X and Y.
- With covariance, there is no minimum or maximum value, so the values are difficult to interpret. For example, a covariance of 50 may indicate a strong or a weak relationship; this depends on the units in which covariance is measured.
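As a minimal sketch with invented data (the variable names interest_rate and inflation are assumptions, echoing the overview example), covariance can be computed with NumPy's np.cov:

```python
import numpy as np

interest_rate = np.array([2.0, 2.5, 3.0, 3.5, 4.0])  # invented sample data
inflation = np.array([1.1, 1.4, 1.8, 2.3, 2.6])

# np.cov returns the covariance matrix; the off-diagonal entry is cov(X, Y).
# The default denominator is N - 1 (the sample covariance).
cov_matrix = np.cov(interest_rate, inflation)
print(cov_matrix[0, 1])  # covariance of the two variables
```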
Definition
• Correlation is used to test relationships between quantitative variables or categorical variables.
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• Your eye color and your relatives’ eye colors.
• The amount of time you study and your GPA.
Some examples of data that have a low correlation (or none at all):
• A dog’s name and the type of dog biscuit they prefer.
• The cost of a car wash and how long it takes to buy a soda inside the station.
CORRELATION
Correlation is defined as covariance normalized by the product of standard deviations, so the correlation between X and Y is

r = cov(X, Y) / (σX · σY)

where r is the correlation coefficient and σX, σY are the standard deviations of X and Y. The coefficient always falls between −1 and +1.
For example,
• a correlation of 0.9 indicates a very strong relationship in which two variables nearly always move in
the same direction;
• a correlation of –0.1 shows a very weak relationship in which there is a slight tendency for two
variables to move in opposite directions.
• Another way to interpret Pearson correlation is to use the coefficient of determination, also known as R².
• Is there a statistically significant relationship between age, as measured in years, and height, measured
in inches?
• Is there a relationship between temperature, measured in degrees Fahrenheit, and ice cream sales, measured by revenue?
• Is there a relationship between job satisfaction, as measured by the JSS, and income, measured in
dollars?
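As a hedged sketch of the temperature and ice cream sales question above (the data below is invented for illustration), scipy.stats.pearsonr returns both r and a p-value, and squaring r gives the coefficient of determination:

```python
import numpy as np
from scipy import stats

temperature = np.array([60, 65, 70, 75, 80, 85, 90])   # degrees Fahrenheit
sales = np.array([120, 135, 160, 180, 210, 240, 265])  # invented sales revenue

r, p_value = stats.pearsonr(temperature, sales)
print(f"r = {r:.3f}, p = {p_value:.4f}")
print(f"R^2 = {r**2:.3f}")  # coefficient of determination
```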
Assumptions
• For the Pearson r correlation, both variables should be normally distributed (normally distributed
variables have a bell-shaped curve).
• Other assumptions include linearity and homoscedasticity.
• Linearity assumes a straight line relationship between each of the two variables and homoscedasticity
assumes that data is equally distributed about the regression line.
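A minimal sketch of checking the normality assumption with the Shapiro-Wilk test from SciPy; the sample below is generated data, used purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights = rng.normal(loc=68, scale=3, size=50)  # roughly bell-shaped data

stat, p = stats.shapiro(heights)
# A small p-value (e.g., p <= 0.05) suggests the data is not normally
# distributed, in which case Pearson's r may be inappropriate.
print(f"W = {stat:.3f}, p = {p:.3f}")
```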
SPEARMAN'S CORRELATION
• Instead, it measures monotonic association (only strictly increasing or decreasing, but not
mixed) between two variables and relies on the rank order of values.
• In other words, rather than comparing means and variances, Spearman's coefficient looks
at the relative order of values for each variable.
• This makes it appropriate to use with both continuous and discrete data.
• The formula for Spearman's coefficient looks very similar to that of Pearson, with the distinction of being computed on ranks instead of raw scores:

ρ = cov(R(X), R(Y)) / (σR(X) · σR(Y))

where R(X) and R(Y) are the ranks of the values of X and Y.
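A minimal sketch with scipy.stats.spearmanr on invented data (the names study_hours and exam_rank are assumptions):

```python
from scipy import stats

study_hours = [1, 3, 4, 6, 8, 9, 11, 14]  # invented sample data
exam_rank = [8, 7, 6, 4, 5, 3, 2, 1]      # 1 = best rank

rho, p = stats.spearmanr(study_hours, exam_rank)
# rho is computed on the rank order of the values, so it captures any
# monotonic (not just linear) association.
print(f"rho = {rho:.3f}, p = {p:.4f}")
```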
KENDALL'S TAU
• Kendall's τ is based on a ranking of the observations for two variables. It does not take into account the size of the difference between ranks, only directional agreement. Therefore, this coefficient is more appropriate for discrete data.
• A pair of observations is concordant when the differences in Variable X and Variable Y have the same sign, and discordant when the signs differ. For example, for the pair A–B in the table below, the difference of the values for Variable X is XB − XA = 2 − 1 = 1.
Observation   Variable X   Variable Y   Concordant   Discordant
A             1            2            8            1
B             2            4            6            2
C             3            1            7            0
D             4            3            6            0
E             5            6            4            1
F             6            5            4            0
G             7            7            2            1
H             8            8            2            0
I             9            10           0            1
J             10           9            0            0
SUM                                     39           6

• The observations A–J are ordered using Variable X, and each unique pair of observations is compared.
• A is compared with all other observations (B, C, …, J), and the numbers of concordant and discordant pairs are counted.
• For observation A, there are eight concordant pairs (A–B, A–D, A–E, A–F, A–G, A–H, A–I, A–J) and one discordant pair (A–C).
• This is repeated for all other observations: B is compared to observations C through J, C is compared to D through J, and so on.
• τ ranges between −1 and 1: 1 indicates a perfect agreement of the rankings, −1 a perfect disagreement, and a value of zero (also assigned when ranks are tied) indicates a lack of association.
• With 10 observations there are 45 unique pairs, so:

𝜏A = (39 − 6) / 45 = 0.73

• Unlike Pearson's r, this coefficient cannot be squared to obtain a coefficient of determination.
• Kendall’s rank correlation coefficient can be calculated in Python using the kendalltau() SciPy function, as sketched below.
• As a statistical hypothesis test, the method assumes (H0) that there is no association between the two
samples.
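The worked A–J example above can be reproduced directly; this sketch feeds the table's Variable X and Variable Y columns to scipy.stats.kendalltau:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # Variable X from the table
y = [2, 4, 1, 3, 6, 5, 7, 8, 10, 9]   # Variable Y from the table

tau, p = stats.kendalltau(x, y)
# With no tied ranks this matches the hand calculation:
# (39 - 6) / 45 = 0.733
print(f"tau = {tau:.3f}, p = {p:.4f}")
```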
T TEST
• The t test (also called Student's t test) compares two means and tells us if they are different from each other. The t test also tells us how significant the differences are.
• This concept can be extended to compare the mean values of two subsets: we can explore whether the means of two groups are different enough to call the difference statistically significant.
• The t-statistic helps us evaluate whether the values of a particular feature for class C1 are significantly different from the values of the same feature for class C2.
• If this holds, the feature can help us better differentiate our data.
• Use the t-statistic to check whether two samples are significantly different or not, then sort the features by the absolute value of their t-statistics in descending order to select the important features.
• If abs(t-statistic) <= critical value: fail to reject the null hypothesis that the means are equal.
• If abs(t-statistic) > critical value: reject the null hypothesis that the means are equal; the sign of the t-statistic tells whether the first mean is smaller or greater than the second mean.
• If p > alpha: fail to reject the null hypothesis that the means are equal.
• If p <= alpha: reject the null hypothesis that the means are equal.
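A hedged sketch of the two-class feature comparison described above, using scipy.stats.ttest_ind with invented class samples:

```python
import numpy as np
from scipy import stats

# Invented values of one feature for two classes
feature_class1 = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 4.8])
feature_class2 = np.array([6.3, 6.0, 6.5, 6.1, 6.4, 6.2])

t_stat, p = stats.ttest_ind(feature_class1, feature_class2)
alpha = 0.05
if p <= alpha:
    print(f"t = {t_stat:.2f}, p = {p:.4f}: reject H0; the means differ")
else:
    print(f"t = {t_stat:.2f}, p = {p:.4f}: fail to reject H0")
```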
Assumptions
• The first assumption is that the scale of measurement applied to the data collected follows a continuous or ordinal scale, such as the scores for an IQ test.
• The second assumption is that the data is collected from a representative, randomly selected portion of the total population.
• The third assumption is that the data, when plotted, results in a roughly normal, bell-shaped distribution.
• The fourth assumption is a reasonably large sample size is used. A larger sample size means the
distribution of results should approach a normal bell-shaped curve.
• The final assumption is the homogeneity of variance. Homogeneous, or equal, variance exists when
the standard deviations of samples are approximately equal.
ANOVA
• There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trustworthy. It is important to note that the assumptions apply to the residuals and not the variables themselves.
• The ANOVA assumptions are the same as for linear regression and are:
• Normality. Caveat: if group sizes are equal, the F-statistic is robust to violations of normality.
• Homogeneity of variance. Same caveat as above: if group sizes are equal, the F-statistic is robust to this violation.
• Independent observations
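A sketch of checking these assumptions on invented group data: Levene's test for homogeneity of variance and the Shapiro-Wilk test applied to the residuals (each value minus its group mean):

```python
import numpy as np
from scipy import stats

# Invented measurements for three groups
group_a = np.array([23.0, 25.1, 24.3, 26.2, 25.5])
group_b = np.array([27.8, 28.4, 26.9, 29.1, 28.0])
group_c = np.array([22.5, 23.9, 24.1, 22.8, 23.3])

# Homogeneity of variance across groups
stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene: p = {p_levene:.3f}")

# Normality of the residuals, not of the raw variables
residuals = np.concatenate([g - g.mean() for g in (group_a, group_b, group_c)])
stat, p_shapiro = stats.shapiro(residuals)
print(f"Shapiro-Wilk on residuals: p = {p_shapiro:.3f}")
```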
• a one-way ANOVA should be used if you have 1 categorical independent variable (IV) with 2+ categories
or groups and 1 continuous dependent variable (DV);
• The two-way ANOVA is an extension of the one-way ANOVA and should be used if you have 2 categorical IVs, each with 2+ groups, and 1 continuous DV; this is a multi-factor design (specifically a 2-factor design, because there are 2 IVs).
• In the ANOVA framework, IVs are often called factors and each category/group within an IV is called a
level. Just as with a one-way ANOVA, a two-way ANOVA tests if there is a difference between the
means, but it does not tell which groups differ.
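A minimal one-way ANOVA sketch with scipy.stats.f_oneway, with one categorical IV (three invented dose groups as its levels) and one continuous DV:

```python
from scipy import stats

# Invented measurements of the DV for three levels of one factor
low_dose = [12.1, 13.4, 11.8, 12.9, 13.0]
mid_dose = [14.2, 15.1, 14.8, 15.5, 14.6]
high_dose = [17.0, 16.4, 17.8, 16.9, 17.3]

f_stat, p = stats.f_oneway(low_dose, mid_dose, high_dose)
# A small p suggests at least one group mean differs, but the test does
# not say which groups differ; a post-hoc test would be needed for that.
print(f"F = {f_stat:.2f}, p = {p:.4f}")
```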