Correlation of Statistics
Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example,
height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect.
People of the same height vary in weight, and you can easily think of two people you know where the shorter one is
heavier than the taller one. Nonetheless, the average weight of people 5'5'' is less than the average weight of people
5'6'', and their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in people's weights is related to their heights.
Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also suspect there
are correlations, but don't know which are the strongest. An intelligent correlation analysis can lead to a greater
understanding of your data.
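For instance, a correlation analysis like the height/weight one above can be run in a few lines of Python. This is a minimal sketch using scipy; the measurements below are invented purely for illustration.

```python
# Minimal correlation sketch; heights/weights are made-up values.
from scipy.stats import pearsonr

heights = [65, 66, 67, 68, 69, 70, 71]          # inches
weights = [150, 152, 158, 160, 167, 170, 178]   # pounds

r, p_value = pearsonr(heights, weights)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
# r**2 estimates the share of variation in weight related to height.
```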
ANOVA
Researchers and students use ANOVA (analysis of variance) in many ways. The use of ANOVA depends on the research design. Commonly, ANOVAs are used in three ways: one-way ANOVA, two-way ANOVA, and N-way ANOVA.
One-Way ANOVA
A one-way ANOVA has just one independent variable. For example, differences in IQ can be assessed by Country, and Country can have 2, 20, or more different categories to compare.
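As a hedged sketch, a one-way ANOVA of this kind can be run with scipy; the three country samples below are invented numbers, not real IQ data.

```python
# One-way ANOVA: one independent variable (Country) with three categories.
from scipy.stats import f_oneway

country_a = [98, 102, 95, 110, 104]
country_b = [101, 99, 105, 97, 103]
country_c = [93, 96, 100, 94, 98]

f_stat, p_value = f_oneway(country_a, country_b, country_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one country mean differs from the others.
```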
Two-Way ANOVA
A two-way ANOVA refers to an ANOVA using two independent variables. Expanding the example above, a 2-way ANOVA
can examine differences in IQ scores (the dependent variable) by Country (independent variable 1) and Gender
(independent variable 2). Two-way ANOVA can be used to examine the interaction between the two independent
variables. Interactions indicate that differences are not uniform across all categories of the independent variables. For
example, females may have higher IQ scores overall compared to males, but this difference could be greater (or less) in
European countries compared to North American countries. Two-way ANOVAs are also called factorial ANOVAs.
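A sketch of such a two-way ANOVA with statsmodels is shown below; the DataFrame, its column names, and all scores are hypothetical.

```python
# Two-way (factorial) ANOVA with an interaction term; data are invented.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "IQ":      [101, 98, 105, 96, 103, 99, 107, 95],
    "Country": ["EU", "EU", "EU", "EU", "NA", "NA", "NA", "NA"],
    "Gender":  ["F", "M", "F", "M", "F", "M", "F", "M"],
})

# C() marks categorical factors; '*' expands to main effects + interaction.
model = ols("IQ ~ C(Country) * C(Gender)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```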
N-Way ANOVA
A researcher can also use more than two independent variables, and this is an n-way ANOVA (with n being the number
of independent variables you have). For example, potential differences in IQ scores can be examined by Country,
Gender, Age group, Ethnicity, etc., simultaneously.
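A sketch of a three-way (n = 3) case, again with statsmodels; the 2 x 2 x 2 design, the AgeGroup factor, and the simulated scores are all assumptions made for illustration.

```python
# N-way ANOVA sketch: three crossed factors, two observations per cell.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

grid = [(c, g, a) for c in ("EU", "NA") for g in ("F", "M")
        for a in ("young", "old") for _ in range(2)]
data = pd.DataFrame(grid, columns=["Country", "Gender", "AgeGroup"])
data["IQ"] = np.random.default_rng(3).normal(100, 5, len(data))

# '*' expands to all main effects and all interactions among the factors.
model = ols("IQ ~ C(Country) * C(Gender) * C(AgeGroup)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```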
Pre-Test
The term can refer to two different activities. A pre-test is where a questionnaire is tested on a (statistically) small
sample of respondents before a full-scale study, in order to identify any problems such as unclear wording or the
questionnaire taking too long to administer. A pre-test can also be used to refer to an initial measurement (such as
brand or advertising awareness) before an experimental treatment is administered and subsequent measurements are
taken.
Hypothesis Testing
Hypothesis testing is a scientific process for assessing whether or not a hypothesis is plausible. The following steps are involved in hypothesis testing:
The first step is to state the null and alternative hypotheses clearly. The test of the null against the alternative hypothesis can be one-tailed or two-tailed.
The second step is to determine the test size, i.e. the significance level. Here the researcher also decides whether the test should be one-tailed or two-tailed in order to get the right critical value and rejection region.
The third step is to compute the test statistic and the probability value. This step of the hypothesis
testing also involves the construction of the confidence interval depending upon the testing approach.
The fourth step is the decision. Here the researcher rejects or fails to reject the null hypothesis by comparing the subjective criterion from the second step with the objective test statistic or probability value from the third step.
The fifth step is to draw a conclusion and interpret the results obtained from the data.
There are basically three approaches to hypothesis testing. The researcher should note that the three approaches use different subjective criteria and objective statistics, but all three lead to the same conclusion.
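The five steps can be traced in a small worked example. The sketch below uses a one-sample t-test from scipy; the sample values and the hypothesized mean of 100 are assumptions made purely for illustration.

```python
# Hypothesis-testing steps traced with a one-sample t-test; data invented.
from scipy.stats import ttest_1samp

sample = [102, 98, 105, 110, 97, 103, 108, 99, 101, 106]

# Step 1: H0: population mean = 100 vs. H1: population mean != 100.
# Step 2: test size; alpha = 0.05, two-tailed.
alpha = 0.05

# Step 3: compute the test statistic and the probability value.
t_stat, p_value = ttest_1samp(sample, popmean=100)

# Step 4: decide by comparing the p-value with alpha.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Step 5: draw a conclusion from the result.
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```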
Multiple Linear Regression
The first step of multiple linear regression is to inspect scatter plots of the variables. The second scatter plot seems to have an arch shape; this indicates that a regression line might not be the best way to explain the data, even if a correlation analysis establishes a positive link between the two variables. However, data most often contain quite a large amount of variability (just as in the third scatter plot example); in these cases it is a judgment call how best to proceed with the data.
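An arch-shaped pattern of the kind described above can be reproduced and inspected with a quick scatter plot; the sketch below generates artificial data with numpy and plots it with matplotlib.

```python
# Visual screening before regression: simulate and plot an arch shape.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
y = -(x - 5) ** 2 + 25 + np.random.default_rng(0).normal(0, 2, 50)

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Arch shape: a straight regression line would fit poorly")
plt.show()
```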
The second step of multiple linear regression is to formulate the model, i.e. to posit that variables X1, X2, and X3 have a causal influence on variable Y and that their relationship is linear.
The third step of regression analysis is to fit the regression line.
Mathematically, least squares estimation is used to minimize the unexplained residual. The fitted model is

Yᵢ = b₀ + b₁·X1ᵢ + b₂·X2ᵢ + b₃·X3ᵢ + eᵢ, for i = 1…n.

The deviation between the regression line and a single data point is variation that our model cannot explain. This unexplained variation is also called the residual eᵢ, i.e. eᵢ = Yᵢ − Ŷᵢ, where Ŷᵢ is the fitted value. The method of least squares minimizes the sum of the squared residuals, Σ eᵢ².
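A minimal sketch of this fitting step with statsmodels; X1 to X3 and Y are simulated stand-ins for the variables formulated above, and the coefficient values are arbitrary.

```python
# Least squares fit of Y on X1, X2, X3 with simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(95, 3))                    # columns: X1, X2, X3
Y = 2 + X @ np.array([1.5, -0.8, 0.5]) + rng.normal(size=95)

model = sm.OLS(Y, sm.add_constant(X)).fit()     # minimizes sum of e_i**2
print(model.params)      # fitted intercept b0 and coefficients b1..b3
print(model.resid[:5])   # first few residuals e_i
```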
As you can easily see, adding independent variables to the model increases the R². However, overfitting occurs easily with multiple linear regression; overfitting happens at the point where the multiple linear regression model becomes inefficient. To identify whether the multiple linear regression model is fitted efficiently, a corrected R² is calculated (it is sometimes called adjusted R²), which is defined as

R²c = R² − J(1 − R²) / (N − J − 1),

where J is the number of independent variables and N the sample size. As you can see, the larger the sample size, the smaller the effect of an additional independent variable in the model.
In our example R²c = 0.6 − 4(1 − 0.6)/(95 − 4 − 1) = 0.6 − 1.6/90 ≈ 0.582. Thus we find the multiple linear regression model quite well fitted, with 4 independent variables and a sample size of 95.
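The arithmetic can be double-checked in a couple of lines; the function below simply encodes the corrected-R² formula given above.

```python
# Corrected (adjusted) R² as defined in the text.
def corrected_r2(r2: float, j: int, n: int) -> float:
    return r2 - j * (1 - r2) / (n - j - 1)

print(round(corrected_r2(0.6, 4, 95), 3))  # 0.582
```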
The last step of the multiple linear regression analysis is the test of significance. Multiple linear regression uses two tests to check whether the fitted model and the estimated coefficients generalize to the population the sample was drawn from. First, the F-test tests the overall model; its null hypothesis is that the independent variables have no influence on the dependent variable. In other words, the F-test of the multiple linear regression tests whether R² = 0. Second, multiple t-tests analyze the significance of each individual coefficient and of the intercept. Each t-test has the null hypothesis that the coefficient/intercept is zero.
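Both tests are reported by standard regression software. The sketch below shows where they appear on a fitted statsmodels model; the data are simulated, so the exact statistics are illustrative only.

```python
# Reading the F-test and the t-tests off a fitted OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(95, 4)))   # intercept + 4 predictors
Y = X @ np.array([1.0, 0.5, -0.3, 0.8, 0.0]) + rng.normal(size=95)

fit = sm.OLS(Y, X).fit()
print(fit.fvalue, fit.f_pvalue)  # F-test of the overall model (H0: R² = 0)
print(fit.tvalues)               # t-statistics per coefficient/intercept
print(fit.pvalues)               # H0 for each: coefficient equals zero
```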
Skewness
The data in a frequency distribution may fall into symmetrical or asymmetrical patterns, and the measure of the direction and degree of this asymmetry is called the descriptive measure of skewness. Skewness refers to a lack of symmetry. The researcher studies the descriptive measure of skewness in order to learn about the shape of the curve, from which an inference about the given distribution can be drawn.
A distribution is said to be skewed if the mean, mode, and median fall at different points. This is also the case when the quartiles are not equidistant from the median, and when the curve drawn from the given data is not symmetrical.
There are three descriptive measures of skewness.
The first measure is M − Md, where M is the mean and Md the median of the distribution.
The second measure is M − M0, where M0 is the mode of the distribution.
The third measure is (Q3 − Md) − (Md − Q1), where Q1 and Q3 are the first and third quartiles.
These are the absolute descriptive measures of skewness.
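A short sketch computing all three absolute measures on a small made-up sample:

```python
# Absolute measures of skewness: M - Md, M - M0, (Q3 - Md) - (Md - Q1).
import statistics
import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 5, 6, 8, 12])  # illustrative sample

mean = data.mean()                     # M
median = np.median(data)               # Md
mode = statistics.mode(data.tolist())  # M0
q1, q3 = np.percentile(data, [25, 75])

print(mean - median)                   # first measure
print(mean - mode)                     # second measure
print((q3 - median) - (median - q1))   # third measure
```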
From these, the researcher calculates relative measures, called the coefficients of skewness, which are pure numbers independent of the units of measurement.
Karl Pearson's coefficient of skewness is the first type of coefficient of skewness; it is based on the mean, median, and mode. This coefficient is positive if the value of the mean is greater than the value of the mode or median, and negative if the value of the mode or median is greater than the mean.
Bowley's coefficient of skewness is the second type of coefficient of skewness; it is based on the quartiles. This coefficient is used in cases where the mode is ill-defined and extreme values are present in the observations. It is also used where the distribution has open-ended classes or unequal intervals.
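As a closing sketch, both coefficients can be computed in a few lines. The exact divisors are assumptions based on the standard definitions (Pearson's coefficient divides by the standard deviation; Bowley's divides the quartile measure by the interquartile range); the sample is the same made-up one as above.

```python
# Pearson's and Bowley's coefficients of skewness (standard definitions).
import statistics
import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 5, 6, 8, 12])

mean, std = data.mean(), data.std()    # population standard deviation
median = np.median(data)
mode = statistics.mode(data.tolist())
q1, q3 = np.percentile(data, [25, 75])

pearson_skew = (mean - mode) / std
bowley_skew = ((q3 - median) - (median - q1)) / (q3 - q1)
print(f"Pearson: {pearson_skew:.3f}, Bowley: {bowley_skew:.3f}")
```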