Statistics - Sods
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
http://www.cpalms.org/Public/PreviewResourceLesson/Preview/71148
http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_standard_deviation_and_standar.htm
Data Scale
Nominal
Ordinal
Interval
Ratio
Outlier
How to Detect Outliers
Stem and Leaf
Splitting the stems
Stem & Leaf vs Histogram
Measures of Dispersion
Standard Error
Decile, Percentile, Quartile
Coefficient of Variation
Combined Mean (Pooled) and Weighted Mean
http://www.slideshare.net/infinityrulz/combined-mean-weighted-mean
Weight Assigning Rules
Q1: Why do we require Weighted Mean
Harmonic Mean
Geometric Mean
Skewness
Kurtosis
Estimate / Estimator
Moments
Measures of Association
Correlation Analysis
Correlation
Pearson r correlation:
Questions a Pearson correlation answers
Spearman rank correlation:
(1) your data does not have tied ranks or
(2) your data has tied ranks.
Assumptions in Regression
The ANOVA Table
The Sums of Squares
Autocorrelation
Calculate P Value Manually
P Values
Symbols
σ "sigma", R², H0, H1 or Ha, z, t, χ², p̂ "p-hat", ŷ "y-hat"
Data Scale
There are four measurement scales (or types of data): nominal, ordinal, interval and ratio. These are simply
ways to categorize different types of variables.
Nominal
Let’s start with the easiest one to understand. Nominal scales are used for labeling variables, without any
quantitative value. “Nominal” scales could simply be called “labels.” Here are some examples, below.
Notice that all of these scales are mutually exclusive (no overlap) and none of them have any numerical
significance. A good way to remember all of this is that “nominal” sounds a lot like “name” and nominal
scales are kind of like “names” or labels.
Ordinal
With ordinal scales, it is the order of the values that is important and significant, but the differences between them are not really known. Take a look at the example below. In each case, we know that a #4 is better than a #3 or #2, but we don't know, and cannot quantify, how much better it is. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say.
Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort,
etc.
"Ordinal" is easy to remember because it sounds like "order", and that's the key to remember with "ordinal scales": it is the order that matters, but that's all you really get from these.
Advanced note: The best way to determine central tendency on a set of ordinal data is to use the mode or
median; the mean cannot be defined from an ordinal set.
Interval
Interval scales are numeric scales in which we know not only the order, but also the exact differences between
the values. The classic example of an interval scale is Celsius temperature because the difference between
each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees,
as is the difference between 80 and 70 degrees. Time is another good
example of an interval scale in which the increments are known,
consistent, and measurable.
Interval scales are nice because the realm of statistical analysis on these data sets opens up. For example, central tendency can be measured by mode, median, or mean; standard deviation can also be calculated.
Like the others, you can remember the key points of an “interval scale”
pretty easily. “Interval” itself means “space in between,” which is the
important thing to remember–interval scales not only tell us about
order, but also about the value between each item.
Here’s the problem with interval scales: they don’t have a “true zero.” For
example, there is no such thing as “no temperature.” Without a true zero,
it is impossible to compute ratios. With interval data, we can add and
subtract, but cannot multiply or divide. Confused? Ok, consider this:
10 degrees + 10 degrees = 20 degrees. No problem there. 20 degrees is not twice as hot as 10 degrees,
however, because there is no such thing as “no temperature” when it comes to the Celsius scale. I hope that
makes sense. Bottom line, interval scales are great, but we cannot calculate ratios, which brings us to our last
measurement scale…
Ratio
Ratio scales are the ultimate nirvana when it comes to measurement scales because
they tell us about the order, they tell us the exact value between units, AND
they also have an absolute zero–which allows for a wide range of both
descriptive and inferential statistics to be applied. At the risk of repeating myself,
everything above about interval data applies to ratio scales + ratio scales have a
clear definition of zero. Good examples of ratio variables include height and
weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be
meaningfully added, subtracted, multiplied, divided (ratios). Central tendency can be measured by
mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of
variation can also be calculated from ratio scales.
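As an illustration (not taken from the sources above), here is a minimal R sketch of how the four scales map onto R data types; all example values are made up:
nominal  <- factor(c("red", "blue", "green", "blue"))   # labels only, no order
ordinal  <- factor(c("unhappy", "ok", "happy", "ok"),
                   levels = c("unhappy", "ok", "happy"),
                   ordered = TRUE)                      # order matters, distances don't
interval <- c(10, 20, 30)     # Celsius temperatures: differences meaningful, no true zero
ratio    <- c(1.6, 1.7, 1.8)  # heights in metres: true zero, ratios meaningful
table(nominal)                # a frequency table (the mode) is all nominal data supports
median(as.integer(ordinal))   # the median of the ranks is fine for ordinal data
mean(interval); sd(interval)  # mean and sd become meaningful for interval data
ratio[3] / ratio[1]           # ratios are only meaningful on a ratio scale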
Outlier
https://en.wikipedia.org/wiki/Outlier
In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
Stem and Leaf
A demerit of the histogram is that it does not give specific information about each class. The main advantage of a stem and leaf plot is that the data are grouped and all the original data are shown.
Stem and leaf diagrams record data values in rows, and can easily be
made into a histogram. Large data sets can be accommodated by
splitting stems.
Advantages:
- Concise representation of data
- Shows range, minimum & maximum, gaps & clusters, and outliers
easily
- Can handle extremely large data sets
Disadvantages:
- Not visually appealing
- Does not easily indicate measures of centrality for large data sets
To make a stem and leaf plot, each observed value must first be
separated into its two parts:
● The stem is the first digit or digits;
● The leaf is the final digit of a value;
● Each stem can consist of any number of digits; but
● Each leaf can have only a single digit
Stem  Leaf
0(0)  0012334
0(5)  55778999
Note: The stem 0(0) means all the data within the interval 0–4; the stem 0(5) means all the data within the interval 5–9.
● Complete a stem-and-leaf plot for the following list of values:
○ 100, 110, 120, 130, 130, 150, 160, 170, 170, 190,
○ 210, 230, 240, 260, 270, 270, 280, 290, 290
● If I try to use the last digit, the hundredths digit, for these numbers, the
stem-and-leaf plot will be enormously long, because these values are so
spread out. (With the numbers' first three digits ranging from 232 to 270, I'd have thirty-nine stems, most of which would be empty.) So instead of
working with the given numbers, I'll round each of the numbers to the
nearest tenth, and then use those new values for my plot. Rounding gives
me the following list:
23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1
Q1: Take a random sample of 20 values and develop Stem & Leaf. Leaf unit - 1/10
Q2 : Take a random sample. Compare them. Discuss the contrast
Q3: Take 3 digit (10) and 4 digit (5) numbers. Develop stem and leaf.
http://www.ck12.org/section/Stem-and-Leaf-Plots-and-Histograms-::of::-Radicals-and-Geometry-Connections-Data-Anaylsis-::of::-CK-12-Algebra-Basic/
It is important to note that when there is a repeated number in the data (such as two 72s) then the plot must reflect
such (so the plot would look like 7 | 2 2 5 6 when it has the numbers 72 72 75 76)
4|4679
5|
6|34688
7|2256
8|148
9|
10 | 6
key: 6|3=63
leaf unit: 1.0
stem unit: 10.0
----------------
Rounding may be needed to create a stem-and-leaf display. For negative numbers, a negative sign is placed in front of the stem unit, which is still the value X / 10. Non-integers are rounded. This allows the stem-and-leaf plot to retain its shape, even for more complicated data sets, as in the example below:
-2 | 4
-1 | 2
-0 | 3
0|466
1|7
2|5
3|
4|
5|7
key: -2|4=-24
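Base R can produce these displays directly with stem(); a minimal sketch using the seventeen durations from the 4-to-10 stem plot above:
duration <- c(44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106)
stem(duration)             # stem unit 10, leaf unit 1, as in the plot above
stem(duration, scale = 2)  # a larger scale splits the stems, as described earlier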
Measures of Central Tendency
http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+measures+of+central+tendency
A measure of central tendency (also referred to as measures of centre or central location) is a summary measure
that attempts to describe a whole set of data with a single value that represents the middle or centre of its
distribution.
There are three main measures of central tendency: the mode, the median and the mean. Each of these measures
describes a different indication of the typical or central value in the distribution.
https://en.wikipedia.org/wiki/Central_tendency
● Arithmetic mean (or simply, mean) – the sum of all measurements divided by the number of
observations in the data set.
● Median – the middle value that separates the higher half from the lower half of the data set. The median
and the mode are the only measures of central tendency that can be used for ordinal data, in which
values are ranked relative to each other but are not measured absolutely.
● Mode – the most frequent value in the data set. This is the only central tendency measure that can be
used with nominal data, which have purely qualitative category assignments.
● Geometric mean – the nth root of the product of the data values, where there are n of these. This
measure is valid only for data that are measured absolutely on a strictly positive scale.
● Harmonic mean – the reciprocal of the arithmetic mean of the reciprocals of the data values. This
measure too is valid only for data that are measured absolutely on a strictly positive scale.
● Weighted mean – an arithmetic mean that incorporates weighting to certain data elements.
● Truncated mean (or trimmed mean) – the arithmetic mean of data values after a certain number or
proportion of the highest and lowest data values have been discarded.
● Interquartile mean – a truncated mean based on data within the interquartile range.
● Midrange – the arithmetic mean of the maximum and minimum values of a data set.
● Midhinge – the arithmetic mean of the two quartiles.
● Trimean – the weighted arithmetic mean of the median and two quartiles.
● Winsorized mean – an arithmetic mean in which extreme values are replaced by values closer to the
median.
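A minimal R sketch of the more common of these measures; R has no built-in mode function, so one is improvised here, and the data are made up:
x <- c(2, 3, 3, 5, 7, 10, 99)   # made-up data with one extreme value
mean(x)                         # arithmetic mean, pulled upward by the 99
median(x)                       # middle value, unaffected by the 99
names(which.max(table(x)))      # mode: the most frequent value ("3")
mean(x, trim = 0.2)             # trimmed mean: top and bottom 20% discarded
(min(x) + max(x)) / 2           # midrange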
● Mean: all the data is used to find the answer, but very large or very small numbers can distort the answer.
● Median: very big and very small values don't affect it, but it takes a long time to calculate for a very large set of data.
● Mode or modal class: the only average we can use when the data is not numerical, but there may be more than one mode, there may be no mode at all if none of the data is the same, and it may not accurately represent the data.
Measures of Dispersion
In statistics, dispersion (also called variability, scatter, or spread) denotes how stretched or squeezed a distribution (theoretical or that underlying a statistical sample) is. Common examples of measures of statistical dispersion are the variance, standard deviation and interquartile range.
Dispersion is contrasted with location or central tendency, and together they are the most used properties of
distributions.
1. Range: The simplest of our methods for measuring dispersion is range. Range is the difference
between the largest value and the smallest value in the data set. While being simple to compute, the
range is often unreliable as a measure of dispersion since it is based on only two values in the set.
A range of 50 tells us very little about how the values are dispersed.
Are the values all clustered to one end with the low value (12) or the high value (62) being an outlier?
Or are the values more evenly dispersed among the range?
2. Variance: To find the variance:
• subtract the mean (x̄) from each of the values (x) in the data set,
• square the result,
• add all of these squares,
• and divide by the number of values in the data set.
3. Standard Deviation: Standard deviation is the square root of the variance. The population form should be used unless you know a random sample is being analyzed. The formulas are:
σ = √( Σ(x − μ)² / n ) for a population, and s = √( Σ(x − x̄)² / (n − 1) ) for a sample.
http://classroom.synonym.com/conceptual-difference-between-standard-deviation-variance-2870.html
http://www.diffen.com/difference/Standard_Deviation_vs_Variance
Values in relation to the given data set:
● Standard deviation: same scale as the values in the given data set; therefore, expressed in the same units.
● Variance: scale larger than the values in the given data set; not expressed in the same unit as the values themselves.
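In R, var() and sd() compute the sample forms (dividing by n − 1); the population forms can be recovered by rescaling. A minimal sketch, with made-up values whose extremes are the 12 and 62 of the range example above:
x <- c(12, 15, 17, 20, 30, 31, 43, 44, 54, 62)  # made-up data; range = 62 - 12 = 50
n <- length(x)
max(x) - min(x)        # range: simple, but based on only two values
var(x)                 # sample variance (divides by n - 1)
sd(x)                  # sample standard deviation: sqrt(var(x))
var(x) * (n - 1) / n   # population variance (divides by n)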
Decile, Percentile, Quartile
http://mba-lectures.com/statistics/descriptive-statistics/603/relationship-between-quartiles-deciles-and-percentiles-grouped-data.html
http://www.slideshare.net/raiuniversity/mba-i-qt-unit21measures-of-variations
Coefficient of Variation
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/coefficient_of_variation.htm
The coefficient of variation is a measure of spread that describes the amount of variability relative
to the mean. Because the coefficient of variation is unitless, you can use it instead of the standard
deviation to compare the spread of data sets that have different units or different means.
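Because the coefficient of variation is just the standard deviation divided by the mean, it is a one-liner in R. A minimal sketch comparing two made-up data sets measured in different units:
heights <- c(150, 160, 170, 180, 190)  # centimetres
weights <- c(50, 60, 70, 80, 90)       # kilograms
sd(heights) / mean(heights)            # CV of heights: unitless
sd(weights) / mean(weights)            # CV of weights: unitless, so directly comparable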
Combined Mean (Pooled) and Weighted Mean
http://www.statisticshowto.com/weighted-mean/
http://www.slideshare.net/infinityrulz/combined-mean-weighted-mean
In real-life problems we often need to assign different weights to different factors so that each is properly taken into account in the overall measurement and representation. In such conditions the ordinary combined mean, or other categories of mean (GM/HM), fail to serve our purpose, so we need the weighted mean.
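R's built-in weighted.mean() implements this directly; a minimal sketch reusing the four values left over from the source (100, 50, 60, 40) as hypothetical component scores, with made-up weights:
scores  <- c(100, 50, 60, 40)     # hypothetical component scores
weights <- c(0.4, 0.3, 0.2, 0.1)  # made-up weights reflecting each component's importance
weighted.mean(scores, weights)    # sum(scores * weights) / sum(weights) = 71
mean(scores)                      # unweighted mean (62.5), for contrast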
Harmonic Mean
The harmonic mean (sometimes called the subcontrary mean) is one of several kinds of average, and in particular one of the Pythagorean means. Typically, it is appropriate for situations when the average of rates is desired. For a set of positive numbers x1, x2, ..., xn it is defined as H = n / (1/x1 + 1/x2 + ... + 1/xn).
Geometric Mean
A type of mean or average which indicates the central tendency or typical value of a set of numbers by using the product of their values (as opposed to the arithmetic mean, which uses their sum). The geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers x1, x2, ..., xn, the geometric mean is defined as (x1 · x2 · ... · xn)^(1/n).
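Neither mean has a dedicated base-R function, but both are one-liners; a minimal sketch with made-up growth factors and speeds:
growth <- c(1.10, 1.20, 0.95)      # made-up yearly growth factors
prod(growth)^(1 / length(growth))  # geometric mean: nth root of the product
exp(mean(log(growth)))             # the same value in a numerically safer form
speeds <- c(60, 40)                # same distance travelled at 60 and 40 km/h
1 / mean(1 / speeds)               # harmonic mean: the true average speed, 48 km/h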
Skewness
http://www.investopedia.com/terms/s/skewness.asp
In probability theory and statistics, skewness is a measure of the asymmetry of the probability
distribution of a real-valued random variable about its mean. The skewness value can be positive or
negative, or even undefined.
Kurtosis
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
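A sketch of both statistics using the third-party moments package; the choice of package is an assumption on my part, as the cited page does not prescribe one (e1071 offers similar functions):
# install.packages("moments")   # assumed to be available
library(moments)
set.seed(1)
x <- rexp(1000)   # a right-skewed sample
skewness(x)       # positive for a right-skewed distribution
kurtosis(x)       # Pearson kurtosis; a normal distribution gives roughly 3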
Estimate / Estimator
https://www.quora.com/What-is-an-estimator-and-an-estimands-in-statistical-models-Why-this-is-important
The estimand is the quantity of interest whose true value you want to know.
An estimate is a numerical estimate of the estimand that results from the use of a particular
estimator.
For example, suppose we are interested in the mean height of all male adults in the United States. Our
estimand is "the mean height of all male adults in the United States". A foolproof way to find this mean
exactly would be to measure the height of each and every male adult in the United States and compute the
mean. But that sounds too hard, so instead we decide to estimate the mean height by taking a random
sample of male adults in the United States and measuring the height of each individual. Suppose we take a
random sample of 100 adult men in the United States and measure their heights. Using this data, we now
have to choose an estimator that will provide us with an estimate of our estimand.
The most obvious thing to do would be to compute the sample average of the heights. That is, "the sample
average" is an estimator that provides an estimate of our estimand. Suppose the sample average is 70
inches. Then 70 inches is the estimate of our estimand provided by the "sample average" estimator.
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed
data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate)
are distinguished. There are point and interval estimators.
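The height example is easy to mimic by simulation; a minimal sketch in which the estimand is the known mean of a simulated population and the sample average is the estimator:
set.seed(42)
population <- rnorm(1e6, mean = 70, sd = 3)  # simulated heights; the estimand (true mean) is 70
s <- sample(population, 100)                 # a random sample of 100 individuals
mean(s)                                      # the sample-average estimator produces the estimate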
Moments
In statistics, moments are quantitative measures of the shape of a distribution: the first moment is the mean, the second central moment is the variance, and the third and fourth standardized moments are the basis of skewness and kurtosis.
Measures of Association
http://uregina.ca/~gingrich/ch11a.pdf
http://www.statisticssolutions.com/directory-of-statistical-analyses-correlation-measures-of-association/
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/EP/EP713_Association/EP713_Association_print.html
http://www.slideshare.net/gane_spm/measures-of-association
The measures of association refer to a wide variety of coefficients that measure the statistical strength of the relationship between the variables of interest; these measures of strength, or association, can be described in several ways, depending on the analysis.
http://orb.essex.ac.uk/hs/hs908/general%20pages/measures_of_association.htm
http://www.neha.org/sites/default/files/pd/edu-train/Calculating-Measures-Association.pdf
https://www.r-bloggers.com/measuring-associations/
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.test.html
https://gist.github.com/marcschwartz/3665743
Correlation Analysis
http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
http://www.dummies.com/how-to/content/how-to-calculate-a-correlation.html
Statisticians use the correlation coefficient to measure the strength and direction of the linear relationship
between two numerical variables X and Y. The correlation coefficient for a sample of data is denoted by r.
Although the street definition of correlation applies to any two items that are related (such as gender and
political affiliation), statisticians use this term only in the context of two numerical variables. The formal
term for correlation is the correlation coefficient. Many different correlation measures have been created;
the one used in this case is called the Pearson correlation coefficient.
To compute r, first find the standard deviation of all the x-values (call it sx) and the standard deviation of all the y-values (call it sy). For example, to find sx you would use the following equation: sx = √( Σ(x − x̄)² / (n − 1) ). The correlation is then the average product of the standardized deviations: r = Σ((x − x̄)(y − ȳ)) / ((n − 1) sx sy).
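In R, the same r can be assembled from the covariance and the two standard deviations, or obtained directly from cor(); a sketch with made-up paired scores:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)      # made-up paired scores
sx <- sd(x); sy <- sd(y)   # the standard deviations described above
cov(x, y) / (sx * sy)      # r built from the covariance and sx, sy
cor(x, y)                  # the same value from the built-in function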
Correlation
is a bivariate analysis that measures the strength of association between two variables. In statistics, the
value of the correlation coefficient varies between +1 and -1. When the value of the correlation coefficient lies
around ± 1, then it is said to be a perfect degree of association between the two variables. As the correlation
coefficient value goes towards 0, the relationship between the two variables will be weaker. Usually, in
statistics, we measure three types of correlations:
● Pearson correlation,
● Kendall rank correlation and
● Spearman correlation
Pearson r correlation:
http://learntech.uwe.ac.uk/da/Default.aspx?pageid=1442
Pearson r correlation is widely used in statistics to measure the degree of the relationship between linear
related variables. For example, in the stock market, if we want to measure how two commodities are related to
each other, Pearson r correlation is used to measure the degree of relationship between the two commodities.
The following formula is used to calculate the Pearson r correlation:
r = ( n∑xy − (∑x)(∑y) ) / √( (n∑x² − (∑x)²)(n∑y² − (∑y)²) )
where n = number of paired scores, ∑xy = sum of the products of paired scores, ∑x² = sum of squared x scores, and ∑y² = sum of squared y scores.
Significance
The t-test is used to establish if the correlation coefficient is significantly different from zero, and, hence that
there is evidence of an association between the two variables. There is then the underlying assumption that
the data is from a normal distribution sampled randomly. If this is not true, the conclusions may well be
invalidated. If this is the case, then it is better to use Spearman's coefficient of rank correlation (for non-
parametric variables). See Campbell & Machin (1999) appendix A12 for calculations and more discussion of
this.
It is interesting to note that with larger samples, a low strength of correlation, for example r = 0.3, can be
highly statistically significant (i.e. p < 0.01). However, is this an indication of a meaningful strength of
association?
Spearman rank correlation:
The Spearman rank correlation coefficient is calculated as
ρ = 1 − ( 6∑di² ) / ( n(n² − 1) )
where ρ = Spearman rank correlation, di = the difference between the ranks of corresponding values Xi and Yi, and n = number of values in each data set.
There are two methods to calculate Spearman's rank-order correlation depending on whether:
(1) your data does not have tied ranks or
(2) your data has tied ranks.
http://study.com/academy/lesson/pearson-correlation-coefficient-formula-example-significance.html
https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide-2.php
You need to rank the scores for maths and English separately. The score with the highest value should be
labelled "1" and the lowest score should be labelled "10" (if your data set has more than 10 cases then the
lowest score will be how many cases you have).
Look carefully at the two individuals that scored 61 in the English exam (highlighted in bold). Notice their joint
rank of 6.5. This is because when you have two identical values in the data (called a "tie"), you need to take
the average of the ranks that they would have otherwise occupied.
We do this as, in this example, we have no way of knowing which score should be put in rank 6 and which
score should be ranked 7. Therefore, you will notice that the ranks of 6 and 7 do not exist for English. These
two ranks have been averaged ((6 + 7)/2 = 6.5) and assigned to each of these "tied" scores
https://www.rgs.org/NR/rdonlyres/4844E3AB-B36D-4B14-8A20-3A3C28FAC087/0/OASpearmansRankExcelGuidePDF.pdf
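R handles all of this automatically: rank() averages tied ranks exactly as described, and cor() accepts method = "spearman". A sketch with made-up scores containing one tie:
maths   <- c(56, 75, 45, 71, 61, 64, 58, 80, 76, 61)  # made-up scores; note the tie at 61
english <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 63)
rank(-maths)                              # ranks with the highest score labelled 1; ties averaged
cor(maths, english, method = "spearman")  # Spearman's rho, ties handled automatically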
Linear Correlation
Karl Pearson
Without Tie
With Tie
Interpretations
Regression Analysis
● Linear
○ Simple SLR
○ Multiple MLR
● Non-Linear
Models
● Analytic
● Stochastic
● Simulation
http://www.slideshare.net/linashuja/regression-analysis-29424735
https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/
Regression
http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_6.html
http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples
When you think of regression, think prediction. A regression uses the historical relationship between an independent
and a dependent variable to predict the future values of the dependent variable. Businesses use regression to predict
such things as future sales, stock prices, currency exchange rates, and productivity gains resulting from a training
program.
Types of Regression
A regression models the past relationship between variables to predict their future behavior. As an example, imagine
that your company wants to understand how past advertising expenditures have related to sales in order to make
future decisions about advertising. The dependent variable in this instance is sales and the independent variable is
advertising expenditures.
Usually, more than one independent variable influences the dependent variable. You can imagine in the above
example that sales are influenced by advertising as well as other factors, such as the number of sales representatives
and the commission percentage paid to sales representatives. When one independent variable is used in a regression,
it is called a simple regression; when two or more independent variables are used, it is called a multiple
regression.
Regression models can be either linear or nonlinear. A linear model assumes the relationships between variables are
straight-line relationships, while a nonlinear model assumes the relationships between variables are represented by
curved lines. In business you will often see the relationship between the return of an individual stock and the returns
of the market modeled as a linear relationship, while the relationship between the price of an item and the demand
for it is often modeled as a nonlinear relationship.
As you can see, there are several different classes of regression procedures, with each having varying degrees of
complexity and explanatory power. The most basic type of regression is that of simple linear regression. A simple
linear regression uses only one independent variable, and it describes the relationship between the independent
variable and dependent variable as a straight line. This review will focus on the basic case of a simple linear
regression.
http://ci.columbia.edu/ci/premba_test/c0331/s8/answers.html
http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis
http://www.r-tutor.com/elementary-statistics/simple-linear-regression
http://www.gardenersown.co.uk/education/lectures/r/regression.htm#multiple_regression
https://onlinecourses.science.psu.edu/stat501/node/250
Simple linear regression is a statistical method that allows us to summarize and study relationships between two
continuous (quantitative) variables. This lesson introduces the concept and basic procedures of simple linear
regression. We will also learn two measures that describe the strength of the linear association that we find in data.
https://rstudio-pubs-static.s3.amazonaws.com/119859_a290e183ff2f46b2858db66c3bc9ed3a.html
Assumptions in Regression
http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/
http://www.statisticssolutions.com/assumptions-of-linear-regression/
http://people.duke.edu/~rnau/testing.htm
In brief, classical linear regression assumes a linear relationship between the variables, independence of the errors (no autocorrelation), constant error variance (homoscedasticity), normally distributed errors, and, for multiple regression, little or no multicollinearity among the predictors.
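A sketch of how these assumptions are commonly eyeballed in R via the diagnostic plots of a fitted lm object; the data are simulated, since the sources give none:
set.seed(7)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)  # simulated data that satisfies the assumptions
fit <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(fit)                     # residuals vs fitted (linearity, homoscedasticity),
                              # normal Q-Q (normality), scale-location, leverage
shapiro.test(residuals(fit))  # a formal normality check on the residuals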
ANOVA Table
http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm
https://onlinecourses.science.psu.edu/stat414/node/218
(1) Source means "the source of the variation in the data." As we'll soon see, the possible choices for a one-factor
study, such as the learning study, are Factor, Error, and Total. The factor is the characteristic that defines the
populations being compared. In the tire study, the factor is the brand of tire. In the learning study, the factor is the
learning method.
(1) Factor means "the variability due to the factor of interest." In the tire example on the previous page, the factor
was the brand of the tire. In the learning example on the previous page, the factor was the method of learning.
Sometimes, the factor is a treatment, and therefore the row heading is instead labeled as Treatment. And,
sometimes the row heading is labeled as Between to make it clear that the row concerns the variation between the
groups.
(2) Error means "the variability within the groups" or "unexplained random error." Sometimes, the row heading is labeled as Within to make it clear that the row concerns the variation within the groups.
(3) Total means "the total variation in the data from the grand mean" (that is, ignoring the factor of interest).
With the column headings and row headings now defined, let's take a look at the individual entries inside a general
one-factor ANOVA table:
Yikes, that looks overwhelming! Let's work our way through it entry by entry to see if we can make it all clear. Let's
start with the degrees of freedom (DF) column:
(1) If there are n total data points collected, then there are n−1 total degrees of freedom.
(2) If there are m groups being compared, then there are m−1 degrees of freedom associated with the factor of
interest.
(3) If there are n total data points collected and m groups being compared, then there are n−m error degrees of
freedom.
(1) As we'll soon formalize below, SS(Between) is the sum of squares between the group means and the grand
mean. As the name suggests, it quantifies the variability between the groups of interest.
(2) Again, as we'll formalize below, SS(Error) is the sum of squares between the data and the group means. It
quantifies the variability within the groups of interest.
(3) SS(Total) is the sum of squares between the n data points and the grand mean. As the name suggests, it quantifies the total variability in the observed data. We'll soon see that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is: SS(Total) = SS(Between) + SS(Error).
The mean squares (MS) column, as the name suggests, contains the "average" sum of squares for the Factor and the
Error:
(1) The Mean Sum of Squares between the groups, denoted MSB, is calculated by dividing the Sum of Squares
between the groups by the between group degrees of freedom. That is, MSB = SS(Between)/(m−1).
(2) The Error Mean Sum of Squares, denoted MSE, is calculated by dividing the Sum of Squares within the groups
by the error degrees of freedom. That is, MSE = SS(Error)/(n−m).
The F column, not surprisingly, contains the F-statistic. Because we want to compare the "average" variability
between the groups to the "average" variability within the groups, we take the ratio of the Between Mean Sum of
Squares to the Error Mean Sum of Squares. That is, the F-statistic is calculated as F = MSB/MSE.
When, on the next page, we delve into the theory behind the analysis of variance method, we'll see that the F-
statistic follows an F-distribution with m−1 numerator degrees of freedom and n−m denominator
degrees of freedom. Therefore, we'll calculate the P-value, as it appears in the column labeled P, by comparing the
F-statistic to an F-distribution with m−1 numerator degrees of freedom and n−m denominator degrees
of freedom.
Now, having defined the individual entries of a general ANOVA table, let's revisit and, in the process, dissect the
ANOVA table for the first learning study on the previous page, in which n = 15 students were subjected to one of m
= 3 methods of learning:
(1) Because n = 15, there are n−1 = 15−1 = 14 total degrees of freedom.
(2) Because m = 3, there are m−1 = 3−1 = 2 degrees of freedom associated with the factor.
(3) The degrees of freedom add up, so we can get the error degrees of freedom by
subtracting the degrees of freedom associated with the factor from the total
degrees of freedom. That is, the error degrees of freedom is 14−2 = 12.
Alternatively, we can calculate the error degrees of freedom directly from n−m =
15−3=12.
(4) We'll learn how to calculate the sum of squares in a minute. For now, take note that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is: SS(Total) = 2510.5 + 161.2 = 2671.7.
(5) MSB is SS(Between) divided by the between group degrees of freedom. That is, 1255.3 = 2510.5 ÷ 2.
(6) MSE is SS(Error) divided by the error degrees of freedom. That is, 13.4 = 161.2 ÷ 12.
(7) The F-statistic is the ratio of MSB to MSE. That is, F = 1255.3 ÷ 13.4 = 93.44.
(8) The P-value is P(F(2,12) ≥ 93.44) < 0.001.
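A minimal sketch of how such a table is produced in R with aov(); the scores below are simulated stand-ins chosen only to mimic the structure (n = 15, m = 3), not the actual learning-study data:
set.seed(3)
score  <- c(rnorm(5, 85, 4), rnorm(5, 70, 4), rnorm(5, 55, 4))  # 15 simulated scores
method <- factor(rep(c("A", "B", "C"), each = 5))               # 3 learning methods
summary(aov(score ~ method))  # Df (2 and 12), Sum Sq, Mean Sq, F value, Pr(>F)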
Okay, we slowly, but surely, keep on adding bit by bit to our knowledge of an analysis of variance table. Let's now
work a bit on the sums of squares.
Let's see what kind of formulas we can come up with for quantifying these components. But first, as always, we
need to define some notation. Let's represent our data, the group means, and the grand mean as follows:
(2) Xij denotes the jth observation in the ith group, where i = 1, 2, ..., m and j = 1, 2, ..., ni. An important thing to note here: j goes from 1 to ni, not to n. That is, the number of data points in a group depends on the group i. That means that the number of data points in each group need not be the same. We could have 5 measurements in one group, and 6 measurements in another.
(3) Let X̄i. = (1/ni) ∑j=1..ni Xij denote the sample mean of the observed data for group i, where i = 1, 2, ..., m.
(4) Let X̄.. = (1/n) ∑i=1..m ∑j=1..ni Xij denote the grand mean of all n observed data points.
Okay, with the notation now defined, let's first consider the total sum of squares, which we'll denote here as SS(TO). Because we want the total sum of squares to quantify the variation in the data regardless of its source, it makes sense that SS(TO) would be the sum of the squared distances of the observations Xij to the grand mean X̄.., that is:
SS(TO) = ∑i=1..m ∑j=1..ni (Xij − X̄..)²
With just a little bit of algebraic work, the total sum of squares can be alternatively calculated as:
SS(TO) = ∑i=1..m ∑j=1..ni Xij² − n X̄..²
Now, let's consider the treatment sum of squares, which we'll denote SS(T). Because we want the treatment sum of squares to quantify the variation between the treatment groups, it makes sense that SS(T) would be the sum of the squared distances of the treatment means X̄i. to the grand mean X̄.., that is:
SS(T) = ∑i=1..m ∑j=1..ni (X̄i. − X̄..)²
Again, with just a little bit of algebraic work, the treatment sum of squares can be alternatively calculated as:
SS(T) = ∑i=1..m ni X̄i.² − n X̄..²
Finally, let's consider the error sum of squares, which we'll denote SS(E). Because we want the error sum of squares to quantify the variation in the data, not otherwise explained by the treatment, it makes sense that SS(E) would be the sum of the squared distances of the observations Xij to the treatment means X̄i., that is:
SS(E) = ∑i=1..m ∑j=1..ni (Xij − X̄i.)²
As we'll see in just one short minute, the easiest way to calculate the error sum of squares is by subtracting the treatment sum of squares from the total sum of squares. That is:
SS(E) = SS(TO) − SS(T)
Okay, so now do you remember that part about wanting to break down the total variation SS(TO) into a component due to the treatment SS(T) and a component due to random error SS(E)? Well, some simple algebra leads us to this:
SS(TO) = SS(T) + SS(E)
and hence the simple way of calculating the error sum of squares. At any rate, here's the simple algebra:
Proof. Well, okay, so the proof does involve a little trick of adding 0 in a special way to the total sum of squares:
SS(TO) = ∑i=1..m ∑j=1..ni ((Xij − X̄i.) + (X̄i. − X̄..))²
Then, squaring the term in parentheses, as well as distributing the summation signs, we get:
SS(TO) = ∑i=1..m ∑j=1..ni (Xij − X̄i.)² + 2 ∑i=1..m ∑j=1..ni (Xij − X̄i.)(X̄i. − X̄..) + ∑i=1..m ∑j=1..ni (X̄i. − X̄..)²
The middle cross-product term vanishes, because within each group the deviations Xij − X̄i. sum to zero. What remains is exactly
SS(TO) = SS(T) + SS(E)
as was to be proved.
An example R session fitting a simple linear regression (from the r-tutor link below):
> student = c(2, 6, 8, 8, 12, 16, 20, 20, 22, 26)            # predictor: number of students
> sales = c(58, 105, 88, 118, 117, 137, 157, 169, 149, 202)  # response: sales
> plot(student, sales)       # scatterplot of the data
> fit = lm(sales ~ student)  # fit the model sales = b0 + b1 * student
> summary(fit)               # coefficients, standard errors, t-values, p-values
Call:
lm(formula = sales ~ student)
Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
student 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
http://www.unesco.org/webworld/idams/advguide/Chapt5_2.htm
http://www.psychstat.missouristate.edu/introbook3/sbk21.htm
In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the squares of residuals (deviations of predicted from actual empirical values of data).
SSE = Σi wi (yi − fi)² and SST = Σi wi (yi − yav)², where fi is the predicted value from the fit, yav is the mean of the observed data, yi is the observed data value, and wi is the weighting applied to each data point (usually wi = 1). SSE is the sum of squares due to error and SST is the total sum of squares.
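For a fitted lm model these quantities are easy to compute directly; a sketch reusing the student/sales data from the R session above:
student <- c(2, 6, 8, 8, 12, 16, 20, 20, 22, 26)
sales   <- c(58, 105, 88, 118, 117, 137, 157, 169, 149, 202)
fit <- lm(sales ~ student)
SSE <- sum(residuals(fit)^2)         # residual sum of squares; equals deviance(fit)
SST <- sum((sales - mean(sales))^2)  # total sum of squares about the mean
1 - SSE / SST                        # R-squared, matching summary(fit)$r.squared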
http://www.spiderfinancial.com/support/documentation/numxl/reference-manual/descriptive-stats/sse
http://web.maths.unsw.edu.au/~adelle/Garvan/Assays/GoodnessOfFit.html
http://www.iuj.ac.jp/faculty/kucc625/method/anova.html
Autocorrelation
Autocorrelation is the correlation between the elements of a series and others from the same series separated from them by a given interval.
In statistics, the Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis. It is named after James Durbin and Geoffrey Watson.
http://slideplayer.com/slide/4935003/
In R, the function durbinWatsonTest() from the car package verifies whether the residuals from a linear model are correlated or not:
● The null hypothesis (H0) is that there is no correlation among residuals, i.e., they are independent.
● The alternative hypothesis (Ha) is that the residuals are autocorrelated.
If the p-value is near zero, one can reject the null.
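A minimal sketch, assuming the car package is installed; the AR(1) errors are simulated so the test has something to detect:
library(car)  # assumed installed; provides durbinWatsonTest()
set.seed(11)
t <- 1:40
e <- as.numeric(arima.sim(list(ar = 0.7), n = 40))  # autocorrelated errors
y <- 5 + 0.5 * t + e
fit <- lm(y ~ t)
durbinWatsonTest(fit)  # H0: no autocorrelation; a small p-value rejects H0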
http://artax.karlin.mff.cuni.cz/r-help/library/bstats/html/dwtest.html
http://www.stats.uwo.ca/faculty/aim/tsar/tsar.pdf
http://web.cs.ucla.edu/~costas/r_tutorial/
Calculate P Value Manually
https://laulima.hawaii.edu/access/content/user/hallston/341website/17a_p-value.pdf
The p-value is the probability of Type I error. Type I error is the probability of rejecting a correct null hypothesis. However, I prefer plain English. The p-value is the probability of incorrectly rejecting the null hypothesis. Or the p-value is the probability of rejecting a null hypothesis when in fact it is 'true.' Or the p-value
is the chance of error you will have to accept if you want to reject the null hypothesis. All of these are
different ways of explaining p-value in plain English.
Examples:
● A p-value of .01 means there is a 1% chance that we will incorrectly reject the null hypothesis. Or that
we could reject the null hypothesis with a 1% chance of error.
● A p-value of .04 means there is a 4% chance that we are incorrectly rejecting the null hypothesis. Or
that we could reject the null hypothesis with a 4% chance of error.
● A p-value of .10 means there is a 10% chance that our decision to reject the null hypothesis was in
error. Or that we could reject the null hypothesis with a 10% chance of error
Using the p-value to make a decision (in place of step 7)
In step 7 you make the decision of whether or not to reject the null hypothesis. Recall that in step 2 of the 7 steps you set alpha, or the amount of error you are willing to accept if you reject the null hypothesis. Using a p-value, one can make the decision to reject or fail to reject the null hypothesis: if p > α, then FAIL TO REJECT the null hypothesis; if p < α, then REJECT the null hypothesis.
Computing the p-value by hand
NOTE: We will not compute the p-value by hand when n < 30 (and we use the t table) in this class. This is because of the way the t-table in the book is structured; a better t table would allow for hand computations. But in this class, when we use the t table we will rely on SPSS to compute the p-value.
p-value for a two-tailed test
To compute a p-value by hand, all you do is find the area "outside" of the test ratio value from step 6 in the 'normal curve'; that is your p-value. There are two areas "outside" of your test ratio from step 6, one on each side of the normal curve.
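The area "outside" the test ratio is exactly what R's distribution functions return; a minimal sketch for a two-tailed test with a made-up test statistic:
z <- 2.17                 # hypothetical test ratio from step 6
2 * pnorm(-abs(z))        # two-tailed p-value: the area in both tails of the normal curve
2 * pt(-abs(z), df = 29)  # the same idea with a t distribution (df = n - 1, n = 30)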
http://www2.fiu.edu/~howellip/P-VALUEF00.htm
P Values
http://trendingsideways.com/index.php/the-p-value-formula-testing-your-hypothesis/