Correlation of Statistics

Correlation is a statistical technique used to determine the strength and direction of the relationship between two variables. A correlation coefficient between -1 and 1 indicates how closely the variables are related, with values closer to the extremes indicating a stronger relationship. While correlation does not prove causation, it can provide insights into relationships within a dataset and lead to a greater understanding. There are several correlation techniques appropriate for different data types, and correlations with rating scale data require careful interpretation.

Correlation

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example,
height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect.
People of the same height vary in weight, and you can easily think of two people you know where the shorter one is
heavier than the taller one. Nonetheless, the average weight of people 5'5'' is less than the average weight of people
5'6'', and their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much of the
variation in people's weights is related to their heights.
Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also suspect there
are correlations but not know which are the strongest. An intelligent correlation analysis can lead to a greater
understanding of your data.

Techniques in Determining Correlation


There are several different correlation techniques. Like all statistical techniques, correlation is only appropriate for
certain kinds of data. Correlation works for quantifiable data in which numbers are meaningful, usually quantities of
some sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.
Rating Scales
Rating scales are a controversial middle case. The numbers in rating scales have meaning, but that meaning isn't very
precise. They are not like quantities. With a quantity (such as dollars), the difference between 1 and 2 is exactly the
same as between 2 and 3. With a rating scale, that isn't really the case. You can be sure that your respondents think a
rating of 2 is between a rating of 1 and a rating of 3, but you cannot be sure they think it is exactly halfway between. This
is especially true if you labeled the mid-points of your scale (you cannot assume "good" is exactly halfway between
"excellent" and "fair").
Most statisticians say you cannot use correlations with rating scales, because the mathematics of the technique assume
the differences between numbers are exactly equal. Nevertheless, many survey researchers do use correlations with
rating scales, because the results usually reflect the real world. Our own position is that you can use correlations with
rating scales, but you should do so with care. When working with quantities, correlations provide precise
measurements. When working with rating scales, correlations provide general indications.
Correlation Coefficient
The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The closer r is to
+1 or -1, the more closely the two variables are related.
If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets
larger the other gets larger. If r is negative it means that as one gets larger, the other gets smaller (often called an
"inverse" correlation).
While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them
easier to understand. The square of the coefficient (or r squared) is equal to the percent of the variation in one variable
that is related to the variation in the other. After squaring r, read the result as a percentage. An r of .5 means 25% of the
variation is related (.5 squared = .25). An r value of .7 means 49% of the variance is related (.7 squared = .49).
A correlation report can also show a second result of each test - statistical significance. In this case, the significance level
will tell you how likely it is that the correlations reported may be due to chance in the form of random sampling error. If
you are working with small sample sizes, choose a report format that includes the significance level. This format also
reports the sample size.
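As an illustration, the short Python sketch below computes r, r squared, and the significance level for a small set of made-up height and weight values (in the spirit of the height/weight example above); the numbers are assumptions, not real data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical heights (inches) and weights (pounds), invented for illustration
height = np.array([63, 64, 66, 67, 68, 69, 70, 71, 72, 74])
weight = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 190])

r, p_value = pearsonr(height, weight)      # correlation coefficient and significance
print(f"r = {r:.2f}")                      # strength and direction of the relationship
print(f"r squared = {r * r:.2f}")          # share of the variation in weight related to height
print(f"p-value = {p_value:.4f}")          # chance the correlation is due to sampling error
```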
A key thing to remember when working with correlations is never to assume a correlation means that a change in one
variable causes a change in another. Sales of personal computers and athletic shoes have both risen strongly in the last
several years and there is a high correlation between them, but you cannot assume that buying computers causes
people to buy athletic shoes (or vice versa).
The second caveat is that the Pearson correlation technique works best with linear relationships: as one variable gets
larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships (in
which the relationship does not follow a straight line). An example of a curvilinear relationship is age and health care.
They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use
much more health care than teenagers or young adults. Multiple regression can be used to examine curvilinear
relationships, but it is beyond the scope of this article.
ANOVA (Analysis of Variance)
ANOVA is a statistical technique that assesses potential differences in a scale-level dependent variable by a nominal-
level variable having 2 or more categories. For example, an ANOVA can examine potential differences in IQ scores by
Country (US vs. Canada vs. Italy vs. Spain). The ANOVA, developed by Ronald Fisher in 1918, extends the t-test and the z-test,
which have the limitation of allowing the nominal-level variable to have only two categories. This test is also called the
Fisher analysis of variance.

General Purpose of ANOVA

Researchers and students use ANOVA in many ways. The use of ANOVA depends on the research design. Commonly,
ANOVAs are used in three ways: one-way ANOVA, two-way ANOVA, and N-way ANOVA.

One-Way ANOVA
A one-way ANOVA has just one independent variable. For example, difference in IQ can be assessed by Country, and
Country can have 2, 20, or more different categories to compare.
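A minimal sketch of such a one-way ANOVA in Python, using scipy and invented IQ scores for four countries (all values are assumptions for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical IQ scores by country, invented for illustration
us     = [98, 102, 110, 95, 105, 99, 108]
canada = [101, 97, 104, 100, 96, 103, 99]
italy  = [94, 99, 105, 98, 102, 97, 100]
spain  = [97, 103, 96, 101, 99, 104, 98]

f_stat, p_value = f_oneway(us, canada, italy, spain)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A small p-value would suggest that at least one country's mean IQ differs from the others.
```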

Two-Way ANOVA
A two-way ANOVA refers to an ANOVA using two independent variables. Expanding the example above, a 2-way ANOVA
can examine differences in IQ scores (the dependent variable) by Country (independent variable 1) and Gender
(independent variable 2). Two-way ANOVA can be used to examine the interaction between the two independent
variables. Interactions indicate that differences are not uniform across all categories of the independent variables. For
example, females may have higher IQ scores overall compared to males, but this difference could be greater (or less) in
European countries compared to North American countries. Two-way ANOVAs are also called factorial ANOVAs.
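A two-way (factorial) ANOVA with an interaction term can be sketched with statsmodels; the tiny data frame below is purely hypothetical and only meant to show the mechanics:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format data: one row per respondent
df = pd.DataFrame({
    "iq":      [102, 98, 105, 99, 101, 97, 104, 100, 96, 103, 99, 107],
    "country": ["US", "US", "US", "Canada", "Canada", "Canada",
                "Italy", "Italy", "Italy", "Spain", "Spain", "Spain"],
    "gender":  ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
})

# C(country) * C(gender) fits both main effects and the country-by-gender interaction
model = smf.ols("iq ~ C(country) * C(gender)", data=df).fit()
print(anova_lm(model, typ=2))   # ANOVA table: main effects, interaction, residual
```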

N-Way ANOVA
A researcher can also use more than two independent variables, and this is an n-way ANOVA (with n being the number
of independent variables you have). For example, potential differences in IQ scores can be examined by Country,
Gender, Age group, Ethnicity, etc., simultaneously.

Pre-Test
The term can refer to two different activities. A pre-test is where a questionnaire is tested on a (statistically) small
sample of respondents before a full-scale study, in order to identify any problems such as unclear wording or the
questionnaire taking too long to administer. A pre-test can also be used to refer to an initial measurement (such as
brand or advertising awareness) before an experimental treatment is administered and subsequent measurements are
taken.

Hypothesis Testing
Hypothesis testing is a scientific process of testing whether or not the hypothesis is plausible. The
following steps are involved in hypothesis testing:
The first step is to state the null and alternative hypotheses clearly. The test of the null against the
alternative hypothesis can be one-tailed or two-tailed.
The second step is to determine the test size (significance level). This means that the researcher decides
whether the test should be one-tailed or two-tailed in order to get the right critical value and rejection region.
The third step is to compute the test statistic and the probability value. Depending on the testing
approach, this step may also involve constructing a confidence interval.
The fourth step is the decision-making step. Here the researcher rejects or fails to reject the null
hypothesis by comparing the subjective criterion from the second step with the objective test statistic or
probability value from the third step.
The fifth step is to draw a conclusion about the data and interpret the results.
There are basically three approaches to hypothesis testing. The researcher should note that all three
approaches require different subjective criteria and objective statistics, but all three approaches give the
same conclusion.
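As a sketch only, the snippet below walks through the five steps for a hypothetical one-sample, two-tailed t-test; the sample values, the hypothesized mean of 100, and the 5% significance level are all assumptions made for illustration:

```python
from scipy import stats

# Step 1: H0: population mean = 100, H1: population mean != 100 (two-tailed)
sample = [112, 104, 97, 108, 115, 101, 99, 110, 106, 103]   # hypothetical scores

alpha = 0.05                                               # Step 2: test size / rejection criterion
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)   # Step 3: test statistic and p-value

if p_value < alpha:                                        # Step 4: compare p-value with the criterion
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"t = {t_stat:.2f}, p = {p_value:.3f}: {decision}")  # Step 5: conclusion
```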

Chi Square Test


The chi-square test is used to assess the statistical significance of the observations under study.
Researchers use several varieties of chi-square tests, including the chi-square test of cross tabulation
(independence), the chi-square test for goodness of fit, and the likelihood ratio test.
The task of the chi square test is to test the statistical significance of the observed relationship with
respect to the expected relationship. The chi square statistic is used by the researcher for determining
whether or not a relationship exists.
In the chi square test, the null hypothesis is that there is no association between the two variables
observed in the study. The test is calculated by evaluating the cell frequencies: the expected frequencies
are those that would occur if there were no association between the variables. The expected frequency
for each cell is computed as the product of its row total and column total divided by the total sample
size. The test then compares these expected frequencies with the actual observed frequencies.
The chi square statistic is calculated as the sum, over all cells, of the squared deviation between the
observed and the expected frequency divided by the expected frequency.
The researcher should know that the greater the difference between the observed and expected cell
frequencies, the larger the value of the chi square statistic.
To conclude that an association between the two variables exists, the computed chi square value
should exceed the critical value at the chosen significance level; equivalently, the probability of
obtaining such a value by chance alone should be small.
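A minimal sketch of the cross-tabulation (independence) version of the test, using scipy and an invented gender-by-brand table; the expected frequencies follow the row total times column total divided by sample size rule described above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross tabulation: rows = gender, columns = brand purchased
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
print(expected.round(1))   # expected cell frequencies under no association
# A p-value below the chosen significance level indicates an association between the variables.
```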
There is one more popular test called the chi square test for goodness of fit.
The chi square test for goodness of fit helps the researcher understand whether or not a sample drawn
from a certain population follows a specified distribution. This type of test is applicable only to discrete
types of distribution, such as the Poisson, binomial, etc. It is an alternative to the non-parametric
Kolmogorov-Smirnov goodness of fit test.
The null hypothesis assumed by the researcher in this type of chi square test is that the data drawn
from the population follow the specified distribution. The chi square statistic in this test is defined in a
similar manner to the definition in the test above. One important point to be noted by the researcher is
that the expected frequency in each cell of this type of chi square test should be at least five. This means
that the chi square test will not be valid when any expected cell frequency is less than five.
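A minimal goodness-of-fit sketch, assuming a simple discrete uniform distribution (120 hypothetical die rolls) so that every expected cell frequency is at least five:

```python
from scipy.stats import chisquare

# Hypothetical counts for faces 1..6 over 120 die rolls
observed = [18, 16, 22, 26, 23, 15]
expected = [120 / 6] * 6          # 20 per face under a discrete uniform distribution

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
# A large p-value means we cannot reject the null that the data follow the specified distribution.
```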
There are certain assumptions in the chi square test.
The random sampling of data is assumed in the chi square test.
In the chi square test, a sample with a sufficiently large size is assumed. If the chi square test is
conducted on a small sample, it will yield inaccurate inferences; by using the chi square test on small
samples, the researcher might end up committing a Type II error.
In the chi square test, the observations are always assumed to be independent of each other.
In the chi square test, the observations must have the same fundamental distribution.
How to Conduct Multiple Linear Regression
Multiple linear regression analysis consists of more than just fitting a straight line through a cloud of data points. It
consists of 3 stages – (1) analyzing the correlation and directionality of the data, (2) estimating the model, i.e., fitting the
line, and (3) evaluating the validity and usefulness of the model.
Firstly, the scatter plots should be checked for directionality and correlation of data. Typically you would look at an
individual scatter plot for every independent variable in the analysis.
In the first of the scatter plot examples, the plot indicates a positive relationship between the two variables, so the
data are suitable for running a multiple linear regression analysis.

The second scatter plot seems to have an arch shape; this indicates that a
regression line might not be the best way to explain the data, even if a correlation analysis establishes a positive link
between the two variables. However, data most often contain quite a large amount of variability (just as in the third
scatter plot example); in these cases it is a judgment call how best to proceed with the data.

The second step of multiple linear regression is to formulate the model, i.e., to specify that variables X1, X2, and X3
have a causal influence on variable Y and that their relationship is linear.
The third step of regression analysis is to fit the regression line.
Mathematically, least squares estimation is used to minimize the unexplained
residual. The basic idea behind this concept is illustrated in the following example.

In our example we want to model the relationship between age, job experience, and tenure on one hand and job
satisfaction on the other hand. The research team has gathered several observations of self-reported job satisfaction
and experience, as well as age and tenure of the participants. When we fit a line through the scatter plot (for simplicity
only one dimension is shown here), the regression line represents the estimated job satisfaction for a given combination
of the input factors. However, in most cases the real observation will not fall exactly on the regression line.
This is because we try to explain the scatter plot with a linear equation of the form

yi = b0 + b1*xi1 + b2*xi2 + ... + bp*xip + ei

for i = 1…n. The deviation between the regression line and a single data point
is variation that our model cannot explain. This unexplained variation is also called the residual ei.
The method of least squares is used to minimize the sum of the squared residuals.

The multiple linear regression's residual variance is estimated by

s² = Σ ei² / (n - p - 1)

where p is the number of independent variables and n the sample size.
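A minimal numpy sketch of least squares estimation under these definitions; the age, experience, tenure, and satisfaction values are simulated (assumptions for illustration), and the residual variance uses the n - p - 1 denominator given above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 95

# Simulated predictors and outcome, invented for illustration
age = rng.uniform(20, 60, n)
experience = rng.uniform(0, 30, n)
tenure = rng.uniform(0, 15, n)
satisfaction = 1 + 0.05 * age + 0.1 * experience + 0.2 * tenure + rng.normal(0, 1, n)

# Design matrix with an intercept column; least squares minimizes the sum of squared residuals
X = np.column_stack([np.ones(n), age, experience, tenure])
beta, _, _, _ = np.linalg.lstsq(X, satisfaction, rcond=None)

residuals = satisfaction - X @ beta            # e_i: the unexplained variation
p = X.shape[1] - 1                             # number of independent variables
sigma2 = residuals @ residuals / (n - p - 1)   # estimated residual variance
print(beta.round(2), round(sigma2, 2))
```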


The result of this equation could for instance be yi = 1 + 0.1*xi1 + 0.3*xi2 - 0.1*xi3 + 1.52*xi4. This means that for every
additional unit of x1 (ceteris paribus) we would expect an increase of 0.1 in y, and for every additional unit of x4 (c.p.) we
expect an increase of 1.52 units in y.
Now that we have our multiple linear regression equation, we evaluate the validity and usefulness of the model.
The key measure of the validity of the estimated regression line is R², the ratio of explained variance to total variance.
In our example the R² is approximately 0.6; this means that 60% of the
total variance is explained by the relationship between age and satisfaction.

As you can easily see, adding independent variables to the model increases R².
However, overfitting occurs easily with multiple linear regression; overfitting happens at the point when the multiple
linear regression model becomes inefficient. To identify whether the multiple linear regression model is fitted
efficiently, a corrected R² is calculated (it is sometimes called adjusted R²), which is defined as

R²c = R² - J(1 - R²) / (N - J - 1)

where J is the number of independent variables and N the sample size. As you can see, the larger the sample size, the
smaller the effect of an additional independent variable in the model.
In our example R²c = 0.6 - 4(1 - 0.6)/(95 - 4 - 1) = 0.6 - 1.6/90 ≈ 0.582. Thus we find the multiple linear regression model
quite well fitted, with 4 independent variables and a sample size of 95.
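The same arithmetic as a small Python check of the worked example:

```python
def adjusted_r2(r2, J, N):
    """Corrected (adjusted) R squared: R2 - J(1 - R2) / (N - J - 1)."""
    return r2 - J * (1 - r2) / (N - J - 1)

print(round(adjusted_r2(0.6, J=4, N=95), 3))   # 0.582, matching the worked example
```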
The last step of the multiple linear regression analysis is the test of significance. Multiple linear regression uses two
tests to check whether the fitted model and the estimated coefficients hold in the general population the sample
was drawn from. Firstly, the F-test tests the overall model. The null hypothesis is that the independent variables have
no influence on the dependent variable; in other words, the F-test of the multiple linear regression tests whether
R² = 0. Secondly, multiple t-tests analyze the significance of each individual coefficient and the intercept. Each t-test has
the null hypothesis that the coefficient/intercept is zero.
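A sketch of how both tests are usually obtained in practice, here with statsmodels on simulated data matching the running example (the variable names and values are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 95
df = pd.DataFrame({
    "age": rng.uniform(20, 60, n),
    "experience": rng.uniform(0, 30, n),
    "tenure": rng.uniform(0, 15, n),
})
df["satisfaction"] = (1 + 0.05 * df["age"] + 0.1 * df["experience"]
                      + 0.2 * df["tenure"] + rng.normal(0, 1, n))

model = smf.ols("satisfaction ~ age + experience + tenure", data=df).fit()
print(f"F = {model.fvalue:.2f}, p = {model.f_pvalue:.4f}")   # F-test of the overall model (H0: R squared = 0)
print(model.params.round(3))                                 # estimated intercept and coefficients
print(model.pvalues.round(4))                                # t-tests: H0 that each coefficient/intercept is zero
print(f"R2 = {model.rsquared:.3f}, adjusted R2 = {model.rsquared_adj:.3f}")
```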

Skewness
The data in a frequency distribution may fall into symmetrical or asymmetrical patterns, and the
measure of the direction and degree of asymmetry is called the descriptive measure of skewness. Skewness
refers to a lack of symmetry. The researcher studies the descriptive measure of skewness in order to gain
knowledge about the shape and size of the curve, through which the researcher can draw an inference
about the given distribution.
A distribution is said to be skewed if the mean, mode, and median fall at
different points. This also occurs when the quartiles are not equidistant from the
median, and when the curve drawn from the given data is not symmetrical.
There are three descriptive measures of skewness.
The first descriptive measure of skewness is M - Md, where M is the mean and Md is the median of the distribution.
The second descriptive measure of skewness is M - M0, where M0 is the mode of the distribution.
The third descriptive measure of skewness is (Q3 - Md) - (Md - Q1), where Q1 and Q3 are the first and third quartiles.
These are absolute measures of skewness.
The researcher also calculates relative measures, called the coefficients of
skewness, which are pure numbers independent of the units of measurement.
Karl Pearson's coefficient of skewness is the first type of
coefficient of skewness and is based on the mean, median, and mode. This coefficient
is positive if the value of the mean is greater than the value of the mode (or median),
and negative if the value of the mode (or median) is greater than the mean.
Bowley's coefficient of skewness is the second type of
coefficient of skewness and is based on the quartiles. This coefficient of skewness
is used in those cases where the mode is ill defined and extreme
values are present in the observations. It is also used in cases where the distribution has open-end
classes or unequal class intervals.
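As a rough illustration, the sketch below computes the standard forms of the two coefficients on a small made-up sample: Karl Pearson's second coefficient, 3(mean - median) / standard deviation, and Bowley's quartile coefficient, ((Q3 - Md) - (Md - Q1)) / (Q3 - Q1):

```python
import numpy as np

def pearson_skewness(data):
    """Karl Pearson's second coefficient of skewness: 3(mean - median) / standard deviation."""
    data = np.asarray(data, dtype=float)
    return 3 * (data.mean() - np.median(data)) / data.std(ddof=1)

def bowley_skewness(data):
    """Bowley's quartile coefficient of skewness: ((Q3 - Md) - (Md - Q1)) / (Q3 - Q1)."""
    q1, md, q3 = np.percentile(data, [25, 50, 75])
    return ((q3 - md) - (md - q1)) / (q3 - q1)

sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]    # hypothetical, mildly right-skewed data
print(round(pearson_skewness(sample), 3))  # positive: the mean exceeds the median
print(round(bowley_skewness(sample), 3))   # positive: Q3 is farther from the median than Q1
```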
