Statistics - Sods
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
http://www.cpalms.org/Public/PreviewResourceLesson/Preview/71148
http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_standard_deviation_and_standar.htm
Data Scale
Nominal
Ordinal
Interval
Ratio
Outlier
How to Detect Outliers
Stem and Leaf
Splitting the stems
Stem & Leaf vs Histogram
Measures of Dispersion
Standard Error
Decile, Percentile, Quartile
Coefficient of Variation
Combined Mean (Pooled) and Weighted Mean
http://www.slideshare.net/infinityrulz/combined-mean-weighted-mean
Weight Assigning Rules
Q1: Why do we require Weighted Mean
Harmonic Mean
Geometric Mean
Skewness
Kurtosis
Estimate / Estimator
Moments
Measures of Association
Correlation Analysis
Correlation
Pearson r correlation:
Questions a Pearson correlation answers
Spearman rank correlation:
(1) your data does not have tied ranks or
(2) your data has tied ranks.
Assumptions in Regression
The ANOVA Table
The Sums of Squares
Autocorrelation
Calculate P Value Manually
P Values
Symbols
σ "sigma", R², H0, H1 or Ha, z, t, χ², p̂ "p-hat", ŷ "y-hat"
Data Scale
There are four measurement scales (or types of data): nominal, ordinal, interval and ratio. These are simply
ways to categorize different types of variables.
Nominal
Let’s start with the easiest one to understand. Nominal scales are used for labeling variables, without any
quantitative value. “Nominal” scales could simply be called “labels.” Here are some examples, below.
Notice that all of these scales are mutually exclusive (no overlap) and none of them have any numerical
significance. A good way to remember all of this is that “nominal” sounds a lot like “name” and nominal
scales are kind of like “names” or labels.
Ordinal
With ordinal scales, it is the order of the values that is important and significant, but the differences between them are not really known. Take a look at the example below. In each case, we know that a #4 is better than a #3 or #2, but we don't know, and cannot quantify, how much better it is. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say.
Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort,
etc.
"Ordinal" is easy to remember because it sounds like "order", and that's the key to remember with "ordinal scales": it is the order that matters, but that's all you really get from these.
Advanced note: The best way to determine central tendency on a set of ordinal data is to use the mode or
median; the mean cannot be defined from an ordinal set.
Interval
Interval scales are numeric scales in which we know not only the order, but also the exact differences between
the values. The classic example of an interval scale is Celsius temperature because the difference between
each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees,
as is the difference between 80 and 70 degrees. Time is another good
example of an interval scale in which the increments are known,
consistent, and measurable.
Interval scales are nice because the realm of statistical analysis on these data sets opens up. For example, central tendency can be measured by mode, median, or mean; standard deviation can also be calculated.
Like the others, you can remember the key points of an “interval scale”
pretty easily. “Interval” itself means “space in between,” which is the
important thing to remember–interval scales not only tell us about
order, but also about the value between each item.
Here’s the problem with interval scales: they don’t have a “true zero.” For
example, there is no such thing as “no temperature.” Without a true zero,
it is impossible to compute ratios. With interval data, we can add and
subtract, but cannot multiply or divide. Confused? Ok, consider this:
10 degrees + 10 degrees = 20 degrees. No problem there. 20 degrees is not twice as hot as 10 degrees,
however, because there is no such thing as “no temperature” when it comes to the Celsius scale. I hope that
makes sense. Bottom line, interval scales are great, but we cannot calculate ratios, which brings us to our last
measurement scale…
Ratio
Ratio scales are the ultimate nirvana when it comes to measurement scales because
they tell us about the order, they tell us the exact value between units, AND
they also have an absolute zero–which allows for a wide range of both
descriptive and inferential statistics to be applied. At the risk of repeating myself,
everything above about interval data applies to ratio scales + ratio scales have a
clear definition of zero. Good examples of ratio variables include height and
weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be
meaningfully added, subtracted, multiplied, divided (ratios). Central tendency can be measured by
mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of
variation can also be calculated from ratio scales.
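As an illustration (not taken from the sources above), here is a minimal R sketch of how the four scales map onto R data types; all example values are made up:
nominal  <- factor(c("red", "blue", "green", "blue"))   # labels only, no order
ordinal  <- factor(c("unhappy", "ok", "happy", "ok"),
                   levels = c("unhappy", "ok", "happy"),
                   ordered = TRUE)                      # order matters, distances don't
interval <- c(10, 20, 30)     # Celsius temperatures: differences meaningful, no true zero
ratio    <- c(1.6, 1.7, 1.8)  # heights in metres: true zero, ratios meaningful
table(nominal)                # a frequency table (the mode) is all nominal data supports
median(as.integer(ordinal))   # the median of the ranks is fine for ordinal data
mean(interval); sd(interval)  # mean and sd become meaningful for interval data
ratio[3] / ratio[1]           # ratios are only meaningful on a ratio scale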
Outlier
https://en.wikipedia.org/wiki/Outlier
In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
Stem and Leaf
A demerit of the histogram is that it does not give specific information about each class. The main advantage of a stem and leaf plot is that the data are grouped and all the original data are shown.
Stem and leaf diagrams record data values in rows, and can easily be
made into a histogram. Large data sets can be accommodated by
splitting stems.
Advantages:
- Concise representation of data
- Shows range, minimum & maximum, gaps & clusters, and outliers
easily
- Can handle extremely large data sets
Disadvantages:
- Not visually appealing
- Does not easily indicate measures of centrality for large data sets
To make a stem and leaf plot, each observed value must first be
separated into its two parts:
● The stem is the first digit or digits;
● The leaf is the final digit of a value;
● Each stem can consist of any number of digits; but
● Each leaf can have only a single digit
Stem  Leaf
0(0)  0012334
0(5)  55778999
Note: The stem 0(0) means all the data within the interval 0–4; the stem 0(5) means all the data within the interval 5–9.
● Complete a stem-and-leaf plot for the following list of values:
○ 100, 110, 120, 130, 130, 150, 160, 170, 170, 190,
○ 210, 230, 240, 260, 270, 270, 280, 290, 290
● If I try to use the last digit, the hundredths digit, for these numbers, the
stem-and-leaf plot will be enormously long, because these values are so
spread out. (With the numbers' first three digits ranging from 232 to 270, I'd have thirty-nine stems, most of which would be empty.) So instead of
working with the given numbers, I'll round each of the numbers to the
nearest tenth, and then use those new values for my plot. Rounding gives
me the following list:
23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1
Q1: Take a random sample of 20 values and develop Stem & Leaf. Leaf unit - 1/10
Q2 : Take a random sample. Compare them. Discuss the contrast
Q3: Take 3 digit (10) and 4 digit (5) numbers. Develop stem and leaf.
http://www.ck12.org/section/Stem-and-Leaf-Plots-and-Histograms-::of::-Radicals-and-Geometry-Connections-Data-Anaylsis-::of::-CK-12-Algebra-Basic/
It is important to note that when there is a repeated number in the data (such as two 72s) then the plot must reflect
such (so the plot would look like 7 | 2 2 5 6 when it has the numbers 72 72 75 76)
4|4679
5|
6|34688
7|2256
8|148
9|
10 | 6
key: 6|3=63
leaf unit: 1.0
stem unit: 10.0
----------------
Rounding may be needed to create a stem-and-leaf display. For negative numbers, a negative sign is placed in front of the stem unit, which is still the value X / 10. Non-integers are rounded. This allows the stem-and-leaf plot to retain its shape, even for more complicated data sets, as in the example below:
-2 | 4
-1 | 2
-0 | 3
0|466
1|7
2|5
3|
4|
5|7
key: -2|4=-24
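Base R can produce these displays directly with stem(); a minimal sketch using the seventeen durations from the 4-to-10 stem plot above:
duration <- c(44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106)
stem(duration)             # stem unit 10, leaf unit 1, as in the plot above
stem(duration, scale = 2)  # a larger scale splits the stems, as described earlier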
Measures of Central Tendency
http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+measures+of+central+tendency
A measure of central tendency (also referred to as measures of centre or central location) is a summary measure
that attempts to describe a whole set of data with a single value that represents the middle or centre of its
distribution.
There are three main measures of central tendency: the mode, the median and the mean. Each of these measures
describes a different indication of the typical or central value in the distribution.
https://en.wikipedia.org/wiki/Central_tendency
● Arithmetic mean (or simply, mean) – the sum of all measurements divided by the number of
observations in the data set.
● Median – the middle value that separates the higher half from the lower half of the data set. The median
and the mode are the only measures of central tendency that can be used for ordinal data, in which
values are ranked relative to each other but are not measured absolutely.
● Mode – the most frequent value in the data set. This is the only central tendency measure that can be
used with nominal data, which have purely qualitative category assignments.
● Geometric mean – the nth root of the product of the data values, where there are n of these. This
measure is valid only for data that are measured absolutely on a strictly positive scale.
● Harmonic mean – the reciprocal of the arithmetic mean of the reciprocals of the data values. This
measure too is valid only for data that are measured absolutely on a strictly positive scale.
● Weighted mean – an arithmetic mean that incorporates weighting to certain data elements.
● Truncated mean (or trimmed mean) – the arithmetic mean of data values after a certain number or
proportion of the highest and lowest data values have been discarded.
● Interquartile mean – a truncated mean based on data within the interquartile range.
● Midrange – the arithmetic mean of the maximum and minimum values of a data set.
● Midhinge – the arithmetic mean of the two quartiles.
● Trimean – the weighted arithmetic mean of the median and two quartiles.
● Winsorized mean – an arithmetic mean in which extreme values are replaced by values closer to the
median.
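A minimal R sketch of the more common of these measures; R has no built-in mode function, so one is improvised here, and the data are made up:
x <- c(2, 3, 3, 5, 7, 10, 99)   # made-up data with one extreme value
mean(x)                         # arithmetic mean, pulled upward by the 99
median(x)                       # middle value, unaffected by the 99
names(which.max(table(x)))      # mode: the most frequent value ("3")
mean(x, trim = 0.2)             # trimmed mean: top and bottom 20% discarded
(min(x) + max(x)) / 2           # midrange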
● Mean: all the data is used to find the answer, but very large or very small numbers can distort the answer.
● Median: very big and very small values don't affect it, but it takes a long time to calculate for a very large set of data.
● Mode or modal class: the only average we can use when the data is not numerical, but there may be more than one mode, there may be no mode at all if none of the data is the same, and it may not accurately represent the data.
Measures of Dispersion
In statistics, dispersion (also called variability, scatter, or spread) denotes how stretched or squeezed a distribution (theoretical or that underlying a statistical sample) is. Common examples of measures of statistical dispersion are the variance, standard deviation and interquartile range.
Dispersion is contrasted with location or central tendency, and together they are the most used properties of
distributions.
1. Range: The simplest of our methods for measuring dispersion is range. Range is the difference
between the largest value and the smallest value in the data set. While being simple to compute, the
range is often unreliable as a measure of dispersion since it is based on only two values in the set.
A range of 50 tells us very little about how the values are dispersed.
Are the values all clustered to one end with the low value (12) or the high value (62) being an outlier?
Or are the values more evenly dispersed among the range?
2. Variance: To find the variance:
• subtract the mean (x̄) from each of the values (x) in the data set,
• square the result,
• add all of these squares,
• and divide by the number of values in the data set.
3. Standard Deviation: Standard deviation is the square root of the variance. The population form should be used unless you know a random sample is being analyzed. The formulas are:
σ = √( Σ(x − μ)² / n ) for a population, and s = √( Σ(x − x̄)² / (n − 1) ) for a sample.
http://classroom.synonym.com/conceptual-difference-between-standard-deviation-variance-2870.html
http://www.diffen.com/difference/Standard_Deviation_vs_Variance
Values in relation to the given data set:
● Standard deviation: same scale as the values in the given data set; therefore, expressed in the same units.
● Variance: scale larger than the values in the given data set; not expressed in the same unit as the values themselves.
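In R, var() and sd() compute the sample forms (dividing by n − 1); the population forms can be recovered by rescaling. A minimal sketch, with made-up values whose extremes are the 12 and 62 of the range example above:
x <- c(12, 15, 17, 20, 30, 31, 43, 44, 54, 62)  # made-up data; range = 62 - 12 = 50
n <- length(x)
max(x) - min(x)        # range: simple, but based on only two values
var(x)                 # sample variance (divides by n - 1)
sd(x)                  # sample standard deviation: sqrt(var(x))
var(x) * (n - 1) / n   # population variance (divides by n)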
Decile, Percentile, Quartile
http://mba-lectures.com/statistics/descriptive-statistics/603/relationship-between-quartiles-deciles-and-percentiles-grouped-data.html
http://www.slideshare.net/raiuniversity/mba-i-qt-unit21measures-of-variations
Coefficient of Variation
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/coefficient_of_variation.htm
The coefficient of variation is a measure of spread that describes the amount of variability relative
to the mean. Because the coefficient of variation is unitless, you can use it instead of the standard
deviation to compare the spread of data sets that have different units or different means.
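Because the coefficient of variation is just the standard deviation divided by the mean, it is a one-liner in R. A minimal sketch comparing two made-up data sets measured in different units:
heights <- c(150, 160, 170, 180, 190)  # centimetres
weights <- c(50, 60, 70, 80, 90)       # kilograms
sd(heights) / mean(heights)            # CV of heights: unitless
sd(weights) / mean(weights)            # CV of weights: unitless, so directly comparable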
Combined Mean (Pooled) and Weighted Mean
http://www.statisticshowto.com/weighted-mean/
http://www.slideshare.net/infinityrulz/combined-mean-weighted-mean
In real-life problems we often need to assign different weights to different factors so that each is properly taken into account in the overall measurement and representation. In such conditions the ordinary combined mean, or other categories of mean (GM/HM), fail to serve our purpose, so we need the weighted mean.
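R's built-in weighted.mean() implements this directly; a minimal sketch reusing the four values left over from the source (100, 50, 60, 40) as hypothetical component scores, with made-up weights:
scores  <- c(100, 50, 60, 40)     # hypothetical component scores
weights <- c(0.4, 0.3, 0.2, 0.1)  # made-up weights reflecting each component's importance
weighted.mean(scores, weights)    # sum(scores * weights) / sum(weights) = 71
mean(scores)                      # unweighted mean (62.5), for contrast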
Harmonic Mean
The harmonic mean (sometimes called the subcontrary mean) is one of several kinds of average, and in particular one of the Pythagorean means. Typically, it is appropriate for situations when the average of rates is desired. For a set of positive numbers x1, x2, ..., xn it is defined as H = n / (1/x1 + 1/x2 + ... + 1/xn).
Geometric Mean
A type of mean or average which indicates the central tendency or typical value of a set of numbers by using the product of their values (as opposed to the arithmetic mean, which uses their sum). The geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers x1, x2, ..., xn, the geometric mean is defined as (x1 · x2 · ... · xn)^(1/n).
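Neither mean has a dedicated base-R function, but both are one-liners; a minimal sketch with made-up growth factors and speeds:
growth <- c(1.10, 1.20, 0.95)      # made-up yearly growth factors
prod(growth)^(1 / length(growth))  # geometric mean: nth root of the product
exp(mean(log(growth)))             # the same value in a numerically safer form
speeds <- c(60, 40)                # same distance travelled at 60 and 40 km/h
1 / mean(1 / speeds)               # harmonic mean: the true average speed, 48 km/h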
Skewness
http://www.investopedia.com/terms/s/skewness.asp
In probability theory and statistics, skewness is a measure of the asymmetry of the probability
distribution of a real-valued random variable about its mean. The skewness value can be positive or
negative, or even undefined.
Kurtosis
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
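A sketch of both statistics using the third-party moments package; the choice of package is an assumption on my part, as the cited page does not prescribe one (e1071 offers similar functions):
# install.packages("moments")   # assumed to be available
library(moments)
set.seed(1)
x <- rexp(1000)   # a right-skewed sample
skewness(x)       # positive for a right-skewed distribution
kurtosis(x)       # Pearson kurtosis; a normal distribution gives roughly 3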
Estimate / Estimator
https://www.quora.com/What-is-an-estimator-and-an-estimands-in-statistical-models-Why-this-is-important
The estimand is the quantity of interest whose true value you want to know.
An estimate is a numerical estimate of the estimand that results from the use of a particular
estimator.
For example, suppose we are interested in the mean height of all male adults in the United States. Our
estimand is "the mean height of all male adults in the United States". A foolproof way to find this mean
exactly would be to measure the height of each and every male adult in the United States and compute the
mean. But that sounds too hard, so instead we decide to estimate the mean height by taking a random
sample of male adults in the United States and measuring the height of each individual. Suppose we take a
random sample of 100 adult men in the United States and measure their heights. Using this data, we now
have to choose an estimator that will provide us with an estimate of our estimand.
The most obvious thing to do would be to compute the sample average of the heights. That is, "the sample
average" is an estimator that provides an estimate of our estimand. Suppose the sample average is 70
inches. Then 70 inches is the estimate of our estimand provided by the "sample average" estimator.
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed
data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate)
are distinguished. There are point and interval estimators.
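The height example is easy to mimic by simulation; a minimal sketch in which the estimand is the known mean of a simulated population and the sample average is the estimator:
set.seed(42)
population <- rnorm(1e6, mean = 70, sd = 3)  # simulated heights; the estimand (true mean) is 70
s <- sample(population, 100)                 # a random sample of 100 individuals
mean(s)                                      # the sample-average estimator produces the estimate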
Moments
In statistics, moments are quantitative measures of the shape of a distribution: the first moment is the mean, the second central moment is the variance, and the third and fourth standardized moments are the basis of skewness and kurtosis.
Measures of Association
http://uregina.ca/~gingrich/ch11a.pdf
http://www.statisticssolutions.com/directory-of-statistical-analyses-correlation-measures-of-association/
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/EP/EP713_Association/EP713_Association_print.html
http://www.slideshare.net/gane_spm/measures-of-association
The measures of association refer to a wide variety of coefficients that measure the statistical strength of the relationship between the variables of interest; these measures of strength, or association, can be described in several ways, depending on the analysis.
http://orb.essex.ac.uk/hs/hs908/general%20pages/measures_of_association.htm
http://www.neha.org/sites/default/files/pd/edu-train/Calculating-Measures-Association.pdf
https://www.r-bloggers.com/measuring-associations/
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.test.html
https://gist.github.com/marcschwartz/3665743
Correlation Analysis
http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
http://www.dummies.com/how-to/content/how-to-calculate-a-correlation.html
Statisticians use the correlation coefficient to measure the strength and direction of the linear relationship
between two numerical variables X and Y. The correlation coefficient for a sample of data is denoted by r.
Although the street definition of correlation applies to any two items that are related (such as gender and
political affiliation), statisticians use this term only in the context of two numerical variables. The formal
term for correlation is the correlation coefficient. Many different correlation measures have been created;
the one used in this case is called the Pearson correlation coefficient.
To compute r, first find the standard deviation of all the x-values (call it sx) and the standard deviation of all the y-values (call it sy). For example, to find sx you would use the following equation: sx = √( Σ(x − x̄)² / (n − 1) ). The correlation is then the average product of the standardized deviations: r = Σ((x − x̄)(y − ȳ)) / ((n − 1) sx sy).
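In R, the same r can be assembled from the covariance and the two standard deviations, or obtained directly from cor(); a sketch with made-up paired scores:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)      # made-up paired scores
sx <- sd(x); sy <- sd(y)   # the standard deviations described above
cov(x, y) / (sx * sy)      # r built from the covariance and sx, sy
cor(x, y)                  # the same value from the built-in function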
Correlation
is a bivariate analysis that measures the strength of association between two variables. In statistics, the
value of the correlation coefficient varies between +1 and -1. When the value of the correlation coefficient lies
around ± 1, then it is said to be a perfect degree of association between the two variables. As the correlation
coefficient value goes towards 0, the relationship between the two variables will be weaker. Usually, in
statistics, we measure three types of correlations:
● Pearson correlation,
● Kendall rank correlation and
● Spearman correlation
Pearson r correlation:
http://learntech.uwe.ac.uk/da/Default.aspx?pageid=1442
Pearson r correlation is widely used in statistics to measure the degree of the relationship between linear
related variables. For example, in the stock market, if we want to measure how two commodities are related to
each other, Pearson r correlation is used to measure the degree of relationship between the two commodities.
The following formula is used to calculate the Pearson r correlation:
r = ( n∑xy − (∑x)(∑y) ) / √( (n∑x² − (∑x)²)(n∑y² − (∑y)²) )
where n = number of paired scores, ∑xy = sum of the products of paired scores, ∑x² = sum of squared x scores, and ∑y² = sum of squared y scores.
Significance
The t-test is used to establish if the correlation coefficient is significantly different from zero, and, hence that
there is evidence of an association between the two variables. There is then the underlying assumption that
the data is from a normal distribution sampled randomly. If this is not true, the conclusions may well be
invalidated. If this is the case, then it is better to use Spearman's coefficient of rank correlation (for non-
parametric variables). See Campbell & Machin (1999) appendix A12 for calculations and more discussion of
this.
It is interesting to note that with larger samples, a low strength of correlation, for example r = 0.3, can be
highly statistically significant (i.e. p < 0.01). However, is this an indication of a meaningful strength of
association?
Spearman rank correlation:
The Spearman rank correlation coefficient is calculated as
ρ = 1 − ( 6∑di² ) / ( n(n² − 1) )
where ρ = Spearman rank correlation, di = the difference between the ranks of corresponding values Xi and Yi, and n = number of values in each data set.
There are two methods to calculate Spearman's rank-order correlation depending on whether:
(1) your data does not have tied ranks or
(2) your data has tied ranks.
http://study.com/academy/lesson/pearson-correlation-coefficient-formula-example-significance.html
https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide-2.php
You need to rank the scores for maths and English separately. The score with the highest value should be
labelled "1" and the lowest score should be labelled "10" (if your data set has more than 10 cases then the
lowest score will be how many cases you have).
Look carefully at the two individuals that scored 61 in the English exam (highlighted in bold). Notice their joint
rank of 6.5. This is because when you have two identical values in the data (called a "tie"), you need to take
the average of the ranks that they would have otherwise occupied.
We do this as, in this example, we have no way of knowing which score should be put in rank 6 and which
score should be ranked 7. Therefore, you will notice that the ranks of 6 and 7 do not exist for English. These
two ranks have been averaged ((6 + 7)/2 = 6.5) and assigned to each of these "tied" scores
https://www.rgs.org/NR/rdonlyres/4844E3AB-B36D-4B14-8A20-3A3C28FAC087/0/OASpearmansRankExcelGuidePDF.pdf
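R handles all of this automatically: rank() averages tied ranks exactly as described, and cor() accepts method = "spearman". A sketch with made-up scores containing one tie:
maths   <- c(56, 75, 45, 71, 61, 64, 58, 80, 76, 61)  # made-up scores; note the tie at 61
english <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 63)
rank(-maths)                              # ranks with the highest score labelled 1; ties averaged
cor(maths, english, method = "spearman")  # Spearman's rho, ties handled automatically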
Linear Correlation
Karl Pearson
Without Tie
With Tie
Interpretations
Regression Analysis
● Linear
○ Simple SLR
○ Multiple MLR
● Non-Linear
Models
● Analytic
● Stochastic
● Simulation
http://www.slideshare.net/linashuja/regression-analysis-29424735
https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/
Regression
http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_6.html
http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples
When you think of regression, think prediction. A regression uses the historical relationship between an independent
and a dependent variable to predict the future values of the dependent variable. Businesses use regression to predict
such things as future sales, stock prices, currency exchange rates, and productivity gains resulting from a training
program.
Types of Regression
A regression models the past relationship between variables to predict their future behavior. As an example, imagine
that your company wants to understand how past advertising expenditures have related to sales in order to make
future decisions about advertising. The dependent variable in this instance is sales and the independent variable is
advertising expenditures.
Usually, more than one independent variable influences the dependent variable. You can imagine in the above
example that sales are influenced by advertising as well as other factors, such as the number of sales representatives
and the commission percentage paid to sales representatives. When one independent variable is used in a regression,
it is called a simple regression; when two or more independent variables are used, it is called a multiple
regression.
Regression models can be either linear or nonlinear. A linear model assumes the relationships between variables are
straight-line relationships, while a nonlinear model assumes the relationships between variables are represented by
curved lines. In business you will often see the relationship between the return of an individual stock and the returns
of the market modeled as a linear relationship, while the relationship between the price of an item and the demand
for it is often modeled as a nonlinear relationship.
As you can see, there are several different classes of regression procedures, with each having varying degrees of
complexity and explanatory power. The most basic type of regression is that of simple linear regression. A simple
linear regression uses only one independent variable, and it describes the relationship between the independent
variable and dependent variable as a straight line. This review will focus on the basic case of a simple linear
regression.
http://ci.columbia.edu/ci/premba_test/c0331/s8/answers.html
http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis
http://www.r-tutor.com/elementary-statistics/simple-linear-regression
http://www.gardenersown.co.uk/education/lectures/r/regression.htm#multiple_regression
https://onlinecourses.science.psu.edu/stat501/node/250
Simple linear regression is a statistical method that allows us to summarize and study relationships between two
continuous (quantitative) variables. This lesson introduces the concept and basic procedures of simple linear
regression. We will also learn two measures that describe the strength of the linear association that we find in data.
https://rstudio-pubs-static.s3.amazonaws.com/119859_a290e183ff2f46b2858db66c3bc9ed3a.html
Assumptions in Regression
http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/
http://www.statisticssolutions.com/assumptions-of-linear-regression/
http://people.duke.edu/~rnau/testing.htm
In brief, classical linear regression assumes a linear relationship between the variables, independence of the errors (no autocorrelation), constant error variance (homoscedasticity), normally distributed errors, and, for multiple regression, little or no multicollinearity among the predictors.
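A sketch of how these assumptions are commonly eyeballed in R via the diagnostic plots of a fitted lm object; the data are simulated, since the sources give none:
set.seed(7)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)  # simulated data that satisfies the assumptions
fit <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(fit)                     # residuals vs fitted (linearity, homoscedasticity),
                              # normal Q-Q (normality), scale-location, leverage
shapiro.test(residuals(fit))  # a formal normality check on the residuals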
ANOVA Table
http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm
https://onlinecourses.science.psu.edu/stat414/node/218
(1) Source means "the source of the variation in the data." As we'll soon see, the possible choices for a one-factor
study, such as the learning study, are Factor, Error, and Total. The factor is the characteristic that defines the
populations being compared. In the tire study, the factor is the brand of tire. In the learning study, the factor is the
learning method.
(1) Factor means "the variability due to the factor of interest." In the tire example on the previous page, the factor
was the brand of the tire. In the learning example on the previous page, the factor was the method of learning.
Sometimes, the factor is a treatment, and therefore the row heading is instead labeled as Treatment. And,
sometimes the row heading is labeled as Between to make it clear that the row concerns the variation between the
groups.
(2) Error means "the variability within the groups" or "unexplained random error." Sometimes, the row heading is labeled as Within to make it clear that the row concerns the variation within the groups.
(3) Total means "the total variation in the data from the grand mean" (that is, ignoring the factor of interest).
With the column headings and row headings now defined, let's take a look at the individual entries inside a general
one-factor ANOVA table:
Yikes, that looks overwhelming! Let's work our way through it entry by entry to see if we can make it all clear. Let's
start with the degrees of freedom (DF) column:
(1) If there are n total data points collected, then there are n−1 total degrees of freedom.
(2) If there are m groups being compared, then there are m−1 degrees of freedom associated with the factor of
interest.
(3) If there are n total data points collected and m groups being compared, then there are n−m error degrees of
freedom.
(1) As we'll soon formalize below, SS(Between) is the sum of squares between the group means and the grand
mean. As the name suggests, it quantifies the variability between the groups of interest.
(2) Again, as we'll formalize below, SS(Error) is the sum of squares between the data and the group means. It
quantifies the variability within the groups of interest.
(3) SS(Total) is the sum of squares between the n data points and the grand mean. As the name suggests, it quantifies the total variability in the observed data. We'll soon see that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is: SS(Total) = SS(Between) + SS(Error).
The mean squares (MS) column, as the name suggests, contains the "average" sum of squares for the Factor and the
Error:
(1) The Mean Sum of Squares between the groups, denoted MSB, is calculated by dividing the Sum of Squares
between the groups by the between group degrees of freedom. That is, MSB = SS(Between)/(m−1).
(2) The Error Mean Sum of Squares, denoted MSE, is calculated by dividing the Sum of Squares within the groups
by the error degrees of freedom. That is, MSE = SS(Error)/(n−m).
The F column, not surprisingly, contains the F-statistic. Because we want to compare the "average" variability
between the groups to the "average" variability within the groups, we take the ratio of the Between Mean Sum of
Squares to the Error Mean Sum of Squares. That is, the F-statistic is calculated as F = MSB/MSE.
When, on the next page, we delve into the theory behind the analysis of variance method, we'll see that the F-
statistic follows an F-distribution with m−1 numerator degrees of freedom and n−m denominator
degrees of freedom. Therefore, we'll calculate the P-value, as it appears in the column labeled P, by comparing the
F-statistic to an F-distribution with m−1 numerator degrees of freedom and n−m denominator degrees
of freedom.
Now, having defined the individual entries of a general ANOVA table, let's revisit and, in the process, dissect the
ANOVA table for the first learning study on the previous page, in which n = 15 students were subjected to one of m
= 3 methods of learning:
(1) Because n = 15, there are n−1 = 15−1 = 14 total degrees of freedom.
(2) Because m = 3, there are m−1 = 3−1 = 2 degrees of freedom associated with the factor.
(3) The degrees of freedom add up, so we can get the error degrees of freedom by
subtracting the degrees of freedom associated with the factor from the total
degrees of freedom. That is, the error degrees of freedom is 14−2 = 12.
Alternatively, we can calculate the error degrees of freedom directly from n−m =
15−3=12.
(4) We'll learn how to calculate the sum of squares in a minute. For now, take note that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is: SS(Total) = 2510.5 + 161.2 = 2671.7.
(5) MSB is SS(Between) divided by the between group degrees of freedom. That is, 1255.3 = 2510.5 ÷ 2.
(6) MSE is SS(Error) divided by the error degrees of freedom. That is, 13.4 = 161.2 ÷ 12.
(7) The F-statistic is the ratio of MSB to MSE. That is, F = 1255.3 ÷ 13.4 = 93.44.
(8) The P-value is P(F(2,12) ≥ 93.44) < 0.001.
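A minimal sketch of how such a table is produced in R with aov(); the scores below are simulated stand-ins chosen only to mimic the structure (n = 15, m = 3), not the actual learning-study data:
set.seed(3)
score  <- c(rnorm(5, 85, 4), rnorm(5, 70, 4), rnorm(5, 55, 4))  # 15 simulated scores
method <- factor(rep(c("A", "B", "C"), each = 5))               # 3 learning methods
summary(aov(score ~ method))  # Df (2 and 12), Sum Sq, Mean Sq, F value, Pr(>F)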
Okay, we slowly, but surely, keep on adding bit by bit to our knowledge of an analysis of variance table. Let's now
work a bit on the sums of squares.
Let's see what kind of formulas we can come up with for quantifying these components. But first, as always, we
need to define some notation. Let's represent our data, the group means, and the grand mean as follows:
(2) Xij denotes the jth observation in the ith group, where i = 1, 2, ..., m and j = 1, 2, ..., ni. An important thing to note here: j goes from 1 to ni, not to n. That is, the number of data points in a group depends on the group i. That means that the number of data points in each group need not be the same. We could have 5 measurements in one group, and 6 measurements in another.
(3) Let X̄i. = (1/ni) ∑j=1..ni Xij denote the sample mean of the observed data for group i, where i = 1, 2, ..., m.
(4) Let X̄.. = (1/n) ∑i=1..m ∑j=1..ni Xij denote the grand mean of all n observed data points.
Okay, with the notation now defined, let's first consider the total sum of squares, which we'll denote here as SS(TO). Because we want the total sum of squares to quantify the variation in the data regardless of its source, it makes sense that SS(TO) would be the sum of the squared distances of the observations Xij to the grand mean X̄.., that is:
SS(TO) = ∑i=1..m ∑j=1..ni (Xij − X̄..)²
With just a little bit of algebraic work, the total sum of squares can be alternatively calculated as:
SS(TO) = ∑i=1..m ∑j=1..ni Xij² − n X̄..²
Now, let's consider the treatment sum of squares, which we'll denote SS(T). Because we want the treatment sum of squares to quantify the variation between the treatment groups, it makes sense that SS(T) would be the sum of the squared distances of the treatment means X̄i. to the grand mean X̄.., that is:
SS(T) = ∑i=1..m ∑j=1..ni (X̄i. − X̄..)²
Again, with just a little bit of algebraic work, the treatment sum of squares can be alternatively calculated as:
SS(T) = ∑i=1..m ni X̄i.² − n X̄..²
Finally, let's consider the error sum of squares, which we'll denote SS(E). Because we want the error sum of squares to quantify the variation in the data, not otherwise explained by the treatment, it makes sense that SS(E) would be the sum of the squared distances of the observations Xij to the treatment means X̄i., that is:
SS(E) = ∑i=1..m ∑j=1..ni (Xij − X̄i.)²
As we'll see in just one short minute, the easiest way to calculate the error sum of squares is by subtracting the treatment sum of squares from the total sum of squares. That is:
SS(E) = SS(TO) − SS(T)
Okay, so now do you remember that part about wanting to break down the total variation SS(TO) into a component due to the treatment SS(T) and a component due to random error SS(E)? Well, some simple algebra leads us to this:
SS(TO) = SS(T) + SS(E)
and hence the simple way of calculating the error sum of squares. At any rate, here's the simple algebra:
Proof. Well, okay, so the proof does involve a little trick of adding 0 in a special way to the total sum of squares:
SS(TO) = ∑i=1..m ∑j=1..ni ((Xij − X̄i.) + (X̄i. − X̄..))²
Then, squaring the term in parentheses, as well as distributing the summation signs, we get:
SS(TO) = ∑i=1..m ∑j=1..ni (Xij − X̄i.)² + 2 ∑i=1..m ∑j=1..ni (Xij − X̄i.)(X̄i. − X̄..) + ∑i=1..m ∑j=1..ni (X̄i. − X̄..)²
The middle cross-product term vanishes, because within each group the deviations Xij − X̄i. sum to zero. What remains is exactly
SS(TO) = SS(T) + SS(E)
as was to be proved.
An example R session fitting a simple linear regression (from the r-tutor link below):
> student = c(2, 6, 8, 8, 12, 16, 20, 20, 22, 26)            # predictor: number of students
> sales = c(58, 105, 88, 118, 117, 137, 157, 169, 149, 202)  # response: sales
> plot(student, sales)       # scatterplot of the data
> fit = lm(sales ~ student)  # fit the model sales = b0 + b1 * student
> summary(fit)               # coefficients, standard errors, t-values, p-values
Call:
lm(formula = sales ~ student)
Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
student 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
http://www.unesco.org/webworld/idams/advguide/Chapt5_2.htm
http://www.psychstat.missouristate.edu/introbook3/sbk21.htm
In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the squares of residuals (deviations of predicted from actual empirical values of data).
SSE = Σi wi (yi − fi)² and SST = Σi wi (yi − yav)², where fi is the predicted value from the fit, yav is the mean of the observed data, yi is the observed data value, and wi is the weighting applied to each data point (usually wi = 1). SSE is the sum of squares due to error and SST is the total sum of squares.
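For a fitted lm model these quantities are easy to compute directly; a sketch reusing the student/sales data from the R session above:
student <- c(2, 6, 8, 8, 12, 16, 20, 20, 22, 26)
sales   <- c(58, 105, 88, 118, 117, 137, 157, 169, 149, 202)
fit <- lm(sales ~ student)
SSE <- sum(residuals(fit)^2)         # residual sum of squares; equals deviance(fit)
SST <- sum((sales - mean(sales))^2)  # total sum of squares about the mean
1 - SSE / SST                        # R-squared, matching summary(fit)$r.squared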
http://www.spiderfinancial.com/support/documentation/numxl/reference-manual/descriptive-stats/sse
http://web.maths.unsw.edu.au/~adelle/Garvan/Assays/GoodnessOfFit.html
http://www.iuj.ac.jp/faculty/kucc625/method/anova.html
Autocorrelation
Autocorrelation is the correlation between the elements of a series and others from the same series separated from them by a given interval.
In statistics, the Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis. It is named after James Durbin and Geoffrey Watson.
http://slideplayer.com/slide/4935003/
In R, the function durbinWatsonTest() from the car package verifies whether the residuals from a linear model are correlated or not:
● The null hypothesis (H0) is that there is no correlation among residuals, i.e., they are independent.
● The alternative hypothesis (Ha) is that the residuals are autocorrelated.
If the p-value is near zero, one can reject the null.
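A minimal sketch, assuming the car package is installed; the AR(1) errors are simulated so the test has something to detect:
library(car)  # assumed installed; provides durbinWatsonTest()
set.seed(11)
t <- 1:40
e <- as.numeric(arima.sim(list(ar = 0.7), n = 40))  # autocorrelated errors
y <- 5 + 0.5 * t + e
fit <- lm(y ~ t)
durbinWatsonTest(fit)  # H0: no autocorrelation; a small p-value rejects H0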
http://artax.karlin.mff.cuni.cz/r-help/library/bstats/html/dwtest.html
http://www.stats.uwo.ca/faculty/aim/tsar/tsar.pdf
http://web.cs.ucla.edu/~costas/r_tutorial/
Calculate P Value Manually
https://laulima.hawaii.edu/access/content/user/hallston/341website/17a_p-value.pdf
The p-value is the probability of Type I error. Type I error is the probability of rejecting a correct null hypothesis. However, I prefer plain English. The p-value is the probability of incorrectly rejecting the null hypothesis. Or the p-value is the probability of rejecting a null hypothesis when in fact it is 'true.' Or the p-value
is the chance of error you will have to accept if you want to reject the null hypothesis. All of these are
different ways of explaining p-value in plain English.
Examples:
● A p-value of .01 means there is a 1% chance that we will incorrectly reject the null hypothesis. Or that
we could reject the null hypothesis with a 1% chance of error.
● A p-value of .04 means there is a 4% chance that we are incorrectly rejecting the null hypothesis. Or
that we could reject the null hypothesis with a 4% chance of error.
● A p-value of .10 means there is a 10% chance that our decision to reject the null hypothesis was in
error. Or that we could reject the null hypothesis with a 10% chance of error
Using the p-value to make a decision (in place of step 7)
In step 7 you make the decision of whether or not to reject the null hypothesis. Recall that in step 2 of the 7 steps you set alpha, or the amount of error you are willing to accept if you reject the null hypothesis. Using a p-value, one can make the decision to reject or fail to reject the null hypothesis: if p > α, then FAIL TO REJECT the null hypothesis; if p < α, then REJECT the null hypothesis.
Computing the p-value by hand
NOTE: We will not compute the p-value by hand when n < 30 (and we use the t table) in this class. This is because of the way the t-table in the book is structured; a better t table would allow for hand computations. But in this class, when we use the t table we will rely on SPSS to compute the p-value.
p-value for a two-tailed test
To compute a p-value by hand, all you do is find the area "outside" of the test ratio value from step 6 in the 'normal curve'; that is your p-value. There are two areas "outside" of your test ratio from step 6, one on each side of the normal curve.
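The area "outside" the test ratio is exactly what R's distribution functions return; a minimal sketch for a two-tailed test with a made-up test statistic:
z <- 2.17                 # hypothetical test ratio from step 6
2 * pnorm(-abs(z))        # two-tailed p-value: the area in both tails of the normal curve
2 * pt(-abs(z), df = 29)  # the same idea with a t distribution (df = n - 1, n = 30)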
http://www2.fiu.edu/~howellip/P-VALUEF00.htm
P Values
http://trendingsideways.com/index.php/the-p-value-formula-testing-your-hypothesis/