Introduction To Statistics
Statistics:
The word "Statistics" refers either to quantitative information or to the methods of dealing with quantitative information. Statistics deals with the collection, organization, analysis and interpretation of numerical data.
Statements that contain figures are called "numerical statements of facts".
In every field of science, one comes across different types of information, which is termed data.
Once we have information on certain characteristics, we can analyze it and draw conclusions about the population.
Data:
c) Data collected by the investigator for the purpose of the investigation at hand are called
primary data.
d) Data collected by others for some other purpose and used by the investigator are called
secondary data.
e) Collecting primary data involves much time, money and labour, so secondary data, if
available, are preferred.
'Population' or 'Universe' is a term usually used in statistics for the aggregate or totality
of units, such as people, trees, animals, firms, plots of land, houses, manufactured articles,
etc., about which information is required in a study.
Descriptive Statistics:
Classification
Tabulation
Diagrammatic Presentation
Graphical Presentation
Classification:
3. Qualitative classification
4. Quantitative classification
Types of variables:
Discrete variable:- A quantitative variable which can assume finite or countable number
of isolated values is called a discrete variable.
Ex:- Family size, no. of kittens in a litter, number of defectives in a lot, number of
petals in a flower etc.
Continuous variable:- A quantitative variable which can assume any numerical value
within a certain interval of the real line is called a continuous variable.
Frequency distribution:
A frequency distribution is a statement of the possible values of the variable together with
their respective frequencies.
The frequency of a value is the number of occurrences of that particular value of the
variable in the data set.
Tabulation:
Tabulation is the process of arranging the classified data in the form of a table consisting
of rows and columns. It is the continuation of classification.
The purpose of tabulation is to present the data in a form that is easily understood and in
which an investigator can quickly locate the desired information.
If classification is done with respect to one characteristic only, the corresponding table
will be a one way table.
TABULATION OF DATA:
Example: the smoking status of 20 individuals was recorded as yes (y) or no (n). The raw
data are given as
n, y, n, n, n, y, n, y, n, n, n, y, n, n, y, n, y, n, n, y
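The tallying of these responses into category frequencies can be sketched with a simple frequency count, assuming the 20 responses exactly as listed above:

```python
from collections import Counter

# The 20 smoking-status responses listed above (y = yes, n = no)
responses = ["n", "y", "n", "n", "n", "y", "n", "y", "n", "n",
             "n", "y", "n", "n", "y", "n", "y", "n", "n", "y"]

# Tally how many responses fall into each category
counts = Counter(responses)
print(counts["n"], counts["y"])  # 13 7
```

The resulting one-way table has two cells: 13 non-smokers and 7 smokers.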
If classification is done with respect to two characteristics, the result is a two-way table,
for example:

Cellularity      R    HG
No Cells         4     1
Hypocellular    14    15
Cellular        12     4
Example raw data (25 observations):
99 110 96 160 106 144 109 156 110 128 118 168 132 140 160 102 159 148 149 156
154 143 108 146 145
Graphical Representation:
Pie Chart:
A pie diagram is a circle divided into sectors whose areas are proportional to the
magnitudes of the components, and it serves the same purpose as a component bar
diagram.
Angle of sector = (component value / total value) × 360°
Surgery        Frequency
Operable           60
In-operable        14

[Pie chart: Stomach Cancer, Operable 81%, In-operable 19%]
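The sector angles and percentage shares for this pie chart follow directly from the angle formula above; a minimal sketch:

```python
# Sector angle = (component value / total value) * 360 degrees
data = {"Operable": 60, "In-operable": 14}
total = sum(data.values())  # 74 operations in all

angles = {name: value / total * 360 for name, value in data.items()}
shares = {name: round(value / total * 100) for name, value in data.items()}
print(shares)  # Operable ~81%, In-operable ~19%, matching the chart
```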
Bar diagrams:
A method of presenting data in which frequencies are displayed along one axis and
categories of the variable along the other, the frequencies being represented by the bar
lengths.
To avoid any misunderstanding, the bars are drawn with the same width at equal
distances.
→ Deviation bars
A simple bar diagram is used to represent only one variable. The figures of sales,
production, population etc., for various years may be shown by means of a simple bar
diagram.
Bar diagrams are the easiest and most commonly used devices.
The distance between every two consecutive bars should be uniform. The width of each
bar should be the same.
The height of the bars should be proportional to the magnitude of the variable.
All the bars should stand on the same base line (X-axis).
Bar Diagram:
[Bar chart: frequency on the Y-axis (0-25) against number of cycles completed (2-6) on the X-axis]
Multiple bar diagrams are used to represent two or more sets of interrelated data.
An index (legend) is also prepared to identify the bars.
Cellularity      R    HG
No Cells         4     1
Hypocellular    14    15
Cellular        12     4

[Multiple bar diagram: frequency of each cellularity category (No Cells, Hypocellular, Cellular) for groups R and HG, with a legend identifying the bars]
Graphing Data: Types:
Creating Frequencies:
We create frequencies by sorting data by value or category and then summing the cases
that fall into those values.
How often do certain scores occur? This is a basic descriptive data question.
Histogram:
A histogram is a two-dimensional bar diagram suitable for a frequency distribution
with continuous classes.
Class intervals are shown on the ‘X-axis’ and the frequencies on the ‘Y-axis’. The height
of each rectangle represents the frequency of the class interval.
Eg: Age distribution of liver cancer patients

Age      Frequency
25-29        3
30-34       10
35-39        5
40-44        3
45-49        5
50-54        2
55-59        1
60-64        1
65-69        3
70+          7
Total       40
[Histogram of age: Mean = 47.42, Std. Dev. = 17.11, N = 40]
Line Graphs or line diagram:

Age      Frequency
30-39        7
40-49        9
50-59        4
60-69        4
70+          6
Total       30
[Line graph: economic approval (%) plotted by year, 1981-2001]
[Line graph: approval (%) plotted against the unemployment rate]
(iii) Skewness
(iv) Kurtosis
The main purpose of the statistical treatment of any data is to summarize and describe the
data meaningfully and adequately.
Certain summary indices tell us about the centre or middle value of a set of data, around
which the other observations lie. Such an index is called a measure of central tendency.
These measures are computed to give a "center" around which the measurements in the
data are distributed.
Arithmetic mean:
There are two types of arithmetic means, depending upon whether all the observations are
given equal importance or unequal importance.
They are,
Simple arithmetic mean:
X̄ = (x1 + x2 + … + xn)/n = (1/n) Σ xi,  summing over i = 1, …, n
Example: The ages of the members of a family are 53, 48, 25, 32 and 8. Find the simple
arithmetic mean of the ages.
X̄ = (53 + 48 + 25 + 32 + 8)/5 = 166/5 = 33.2
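The same computation, as a sketch in Python:

```python
ages = [53, 48, 25, 32, 8]

# Simple arithmetic mean: the sum of the observations divided by their number
mean_age = sum(ages) / len(ages)
print(mean_age)  # 33.2
```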
Arithmetic mean for a frequency distribution
Let the values of the variates be x1, x2, …, xk and let f1, f2, …, fk be the corresponding
frequencies, then arithmetic mean is
X̄ = (f1x1 + f2x2 + … + fkxk)/(f1 + f2 + … + fk) = Σ fixi / Σ fi,  summing over i = 1, …, k
In general, if x1, x2, …, xn are the observations and w1, w2, …, wn are the weights
assigned, the weighted arithmetic mean is given by,
X̄ = Σ wixi / Σ wi = (w1x1 + w2x2 + … + wnxn)/(w1 + w2 + … + wn)
(In the example whose data are omitted here, Σ wixi = 1110 and Σ wi = 50, so X̄ = 1110/50 = 22.2.)
It may be noted that the A.M. of a frequency distribution is in fact a weighted arithmetic mean,
the weights being the frequencies.
Example: A student’s marks in the laboratory, lecture and recitation parts of a Statistics course
were 71, 78 and 89 respectively. If the weights accorded to these marks are 2, 4 and 5
respectively what is an appropriate average mark?
X̄ = (2×71 + 4×78 + 5×89)/(2 + 4 + 5) = 899/11 = 81.73
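The weighted mean above can be sketched as:

```python
marks = [71, 78, 89]   # laboratory, lecture, recitation
weights = [2, 4, 5]

# Weighted arithmetic mean: sum of w_i * x_i divided by the sum of the weights
weighted_mean = sum(w * x for w, x in zip(weights, marks)) / sum(weights)
print(round(weighted_mean, 2))  # 81.73
```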
Median:
The median represents the middle value of the data set when it has been arranged in
ascending order.
Half the value will lie below the median and half will lie above it.
When the sample size is odd, the median is the middle value.
When the sample size is even, the median is the mean of the two middle values.
• The median is not sensitive to extreme values
Example of Median:

Measurements    Measurements (ranked)
3               0
5               1
5               2
1               3
7               4
2               5
6               5
7               6
0               7
4               7
(sum 40)        (sum 40)
Median: (4 + 5)/2 = 4.5. As the number of observations is even, the median is calculated
as the average of the (N/2)th and (N/2 + 1)th observations after arranging in ascending
order. Here the (N/2)th, i.e. 5th, observation is 4 and the (N/2 + 1)th, i.e. 6th,
observation is 5.
Notice that only the two central values are used in the computation.
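The median rule described above can be sketched directly on the example data:

```python
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

ranked = sorted(measurements)          # arrange in ascending order
n = len(ranked)
if n % 2 == 1:
    median = ranked[n // 2]            # odd n: the single middle value
else:
    # even n: mean of the (n/2)th and (n/2 + 1)th values
    median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2
print(median)  # 4.5
```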
Mode:
The mode is the most frequently occurring value in the data set. It is the least useful (and
least used) of the three measures of central tendency.
Eg: The mode for the gestation period of newborns
Gestation period (weeks): 36, 38, 40, 37, 42, 35, 39, 32, 40, 41
• Mode is 40 weeks.
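A sketch of the same determination, using the standard library's mode routine:

```python
from statistics import mode

weeks = [36, 38, 40, 37, 42, 35, 39, 32, 40, 41]
print(mode(weeks))  # 40 -- the only value that occurs twice
```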
The median requires ordering of the values and can be used with both interval and ordinal
data.
The mode only involves determining the most common value and can be used with interval,
ordinal, and nominal data.
Harmonic Mean
The harmonic mean is a better "average" when the numbers are defined in relation to some
unit; the common example is averaging speed.
For example, suppose that your automobile trip has four 10 km segments driven at
different speeds. The total distance is 40 km; if the total driving time is 0.385 h, the
average speed is 40/0.385 ≈ 103.8 km/h, which is the harmonic mean of the four segment
speeds.
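The individual segment speeds are not shown in the text, so the values below are hypothetical, chosen only to illustrate that the average speed equals the harmonic mean of the segment speeds:

```python
from statistics import harmonic_mean

# Hypothetical speeds (km/h) for the four 10 km segments; the actual
# values are not given in the text -- these merely illustrate the idea.
speeds = [100, 110, 90, 120]
segment_km = 10

total_time = sum(segment_km / v for v in speeds)       # hours
avg_speed = 4 * segment_km / total_time                # total distance / total time
assert abs(avg_speed - harmonic_mean(speeds)) < 1e-9   # the same quantity
print(round(avg_speed, 1))
```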
Geometric Mean
The GM is used when we want an average in a multiplicative situation, such as the
"average" dimension of a box that would have the same volume as length × width × height.
For example, suppose a box has dimensions 50 × 80 × 100 cm; the geometric mean is
(50 × 80 × 100)^(1/3) ≈ 73.7 cm, the side of a cube with the same volume.
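A sketch of the box example with the standard library's geometric-mean routine:

```python
from statistics import geometric_mean

dims = [50, 80, 100]            # box dimensions in cm
side = geometric_mean(dims)     # cube root of 50 * 80 * 100 = 400000
print(round(side, 1))           # ~73.7 cm; a cube with this side has the same volume
```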
Measures of variation:
Measures of variability can help us to create a mental picture of the spread of the data.
They describe “data spread” or how far the measurements are from the center.
Range:
The range R of a set of n measurements is defined as the difference between the largest
and the smallest measurements.
Gestation period (weeks): 36, 38, 40, 37, 42, 35, 39, 32, 40, 41
• Range=Max. value – Min. value
= 42 – 32 = 10
Variance (for a sample):
Steps: compute the mean; find each observation's deviation from the mean; square and sum
these deviations; divide the sum by n − 1.
Example of Variance:
For a sample of 10 observations whose squared deviations from the mean sum to 54, the
variance = 54/9 = 6.
Variance is a measure of "spread": the larger the deviations (positive or negative), the
larger the variance.
The variance s2 is the sum of the squared deviations from the mean divided by the
number of cases minus 1
s² = Σ (yi − ȳ)² / (n − 1)
s = √[ Σ (yi − ȳ)² / (n − 1) ]
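The steps above can be sketched as follows. The data behind the "54/9 = 6" example are not shown in the text, so the sample here is hypothetical, chosen so that the squared deviations sum to 54:

```python
from statistics import mean, variance

# Hypothetical sample of n = 10; values chosen so the squared
# deviations from the mean sum to 54, reproducing 54/9 = 6.
y = [9, 3, 9, 3, 9, 3, 6, 6, 6, 6]

ybar = mean(y)                              # step 1: the mean (6)
ss = sum((yi - ybar) ** 2 for yi in y)      # steps 2-3: sum of squared deviations (54)
s2 = ss / (len(y) - 1)                      # step 4: divide by n - 1
assert s2 == variance(y)                    # agrees with the library routine
print(s2)  # 6.0
```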
Coefficient of variation:
The coefficient of variation (CV) expresses the standard deviation as a percentage of the
mean. It is used to compare variability across variables measured in different units or
across heterogeneous populations.
Eg: We are given that the average birth weight as 2592 g. and the corresponding S.D. as 354.90g.
Therefore the CV = (354.90/2592)*100 = 13.69%
Example: - For a class of students, the height (X) has a distribution with mean 162 cm and sd
10cm and weight (Y) has mean 57 kg and sd 8 kg. Compare the variability aspect of X and Y.
C.V.(X) = (10/162) × 100 = 6.17%
C.V.(Y) = (8/57) × 100 = 14.04%
which shows that variability is more for the weight distribution.
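The comparison can be sketched as:

```python
def coeff_variation(sd, mean):
    # CV expresses the SD as a percentage of the mean (unit-free)
    return sd / mean * 100

cv_height = coeff_variation(10, 162)   # height: sd 10 cm, mean 162 cm
cv_weight = coeff_variation(8, 57)     # weight: sd 8 kg, mean 57 kg
print(round(cv_height, 2), round(cv_weight, 2))  # 6.17 14.04
```

Because the CV is unit-free, the two percentages are directly comparable even though one variable is in centimetres and the other in kilograms.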
Skewness:
When the mean is greater than the median the data distribution is skewed to the right
(positively skewed).
When the median is greater than the mean the data distribution is skewed to the left
(negatively skewed).
When the mean and median are very close to each other, the data distribution is
approximately symmetric.
Distribution shapes:
[Sketches: a symmetric distribution, a positively skewed distribution, and a negatively skewed distribution]
Measure of Skewness
A measure of skewness should indicate by its sign whether the skewness is positive or
negative and by its magnitude the degree of skewness.
Kurtosis:
Leptokurtic
Mesokurtic
Platykurtic
A common moment-based measure of kurtosis is β2 = μ4 / μ2², the ratio of the fourth
central moment to the square of the second central moment.
Normal curve:
Skewness = 0
Kurtosis = 3
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞
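The density above can be sketched directly; this evaluates the standard normal (μ = 0, σ = 1) at its peak:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of the normal curve with mean mu and standard deviation sigma
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(normal_pdf(0), 4))  # 0.3989, the peak height of the standard normal
```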
Lecturer Notes on Bivariate Statistical Data
Bivariate Analysis: The analysis of statistical data involving two variables is termed bivariate analysis.
SCATTER DIAGRAM:
A plot of the values of the variables X and Y, (xi, yi), i = 1, 2, …, n, gives a diagram
of dots known as the scatter diagram.
It gives a good idea of the relationship between the two variables X and Y.
Correlation coefficient:
• Correlation does not necessarily imply causation or functional relationship though the
existence of causation always implies correlation.
Types of Correlation:
On the basis of the nature of relationship between the variables, correlation may be:
When we study only two variables, the relationship is described as simple correlation.
The study of two variables after eliminating the effect of other variables is called partial correlation.
We may get a high degree of correlation between two variables, but there may not be any
relationship between the variables at all.
The above data show a perfect positive relationship between income and weight, i.e., as the
income is increasing the weight is increasing and the rate of change between the two variables is
the same.
Pearson's Correlation Coefficient: r = covariance(x, y) / √(var(x) · var(y))
INTERPRETING COVARIANCE:
The correlation coefficient measures the relative strength of the linear relationship
between two variables and is unitless.
LINEAR CORRELATION
[Four scatter plots of Y against X showing different patterns of linear correlation]
What’s a good guess for the Pearson’s correlation coefficient (r) for this scatter plot?
A Numerical example:
Worktable columns: (X − X̄)², (Y − Ȳ)², (X − X̄)(Y − Ȳ)

Calculation of Pearson r:
σx² = (1/n) Σ (xi − x̄)² = 53.71
σy² = (1/n) Σ (yi − ȳ)² = 1310.79
Cov(X, Y) = (1/n) Σ (xi − x̄)(yi − ȳ) = 261.24
r = 261.24 / √(53.71 × 1310.79) = 261.24 / √70402.53 ≈ 0.985
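The final step can be reproduced from the summary quantities above (the raw data are not shown in the text):

```python
import math

# Summary quantities from the worked example
var_x = 53.71
var_y = 1310.79
cov_xy = 261.24

r = cov_xy / math.sqrt(var_x * var_y)
print(round(r, 3))  # 0.985
```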
Interpretation
The direction of the relationship between X and Y is positive. As X increases Y also increases.
There is a sufficiently high degree of positive correlation between X and Y.
This interpretation assumes:
• that the relationship between X and Y can be represented by a straight line, i.e. it is
linear;
• that the sample was randomly drawn from the population; and
• that X and Y are normally distributed in the population (an assumption that becomes
less important as the sample size increases).
Degree of Correlation:
The following chart will show the approximate degree of correlation according to Karl
Pearson’s formula:
Suppose we want to measure the degree of association between two variables in the situation
where it is difficult to assign some definite value with respect to some character (like
intelligence, efficiency etc.), but ordering the individuals is possible.
Let (xi, yi), i =1, 2, ..., n be ‘n’ pairs of observation for ‘n’ individuals.
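The rank-correlation formula itself is not reproduced in the text, so this sketch uses the standard Spearman formula ρ = 1 − 6Σd²/(n(n² − 1)), assuming no tied values:

```python
def spearman_rho(x, y):
    # Standard Spearman rank-correlation formula; assumes no ties in x or y
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    n = len(x)
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly concordant rankings give rho = 1
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```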
REGRESSION:
Galton found that the offspring of abnormally tall or short parents tend to “regress” or
“step back” to the average population height.
Applications of regression are numerous and occur in almost every field, including
Uses of Regression:
For example, for a problem of delivering soft drink bottles with regard to the delivery
time and the delivery volume, a regression model would probably be a much more
convenient and useful summary of those data than a table or even a graph.
For example, we may wish to predict delivery time for a specified number of cases of soft
drinks to be delivered.
These predictions may be helpful in planning delivery activities such as routing and
scheduling, or in evaluating the productivity of delivery operations.
However, even when the model form is correct, poor estimates of the model parameters
may still cause poor prediction performance.
For example, a chemical engineer could use regression analysis to develop a model
relating the tensile strength of paper to the hardwood concentration in the pulp.
This equation could then be used to control the strength to suitable values by varying the
level of hardwood concentration.
When a regression equation is used for control purposes, it is important that the variables
be related in a causal manner.
Simple linear regression describes the linear relationship between a predictor (regressor)
variable, plotted on the x-axis, and a response variable, plotted on the y-axis
y = β0 + β1x
Your “Best guess” at a random baby’s weight, given no information about the baby, is
what? 3400 grams
X = gestation time
Assume that babies that gestate for longer are born heavier, all other things being equal.
Pretend (at least for the purposes of this example) that this relationship is linear.
Y depends on X:
At 30 weeks…
The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.
Note that not every Y-value (Yi) sits on the line. There’s variability.
In fact, babies that gestate for 30 weeks have birth weights that center at 3000 grams
but vary around 3000 with some variance σ².
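Fitting the line y = β0 + β1x by least squares can be sketched as follows; the gestation/weight pairs below are hypothetical illustration data, not from the text:

```python
def fit_line(xs, ys):
    # Least-squares estimates of the intercept b0 and slope b1
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical (gestation weeks, birth weight g) pairs -- illustration only
data = [(28, 2600), (30, 3000), (32, 3400), (34, 3800), (36, 4200)]
b0, b1 = fit_line([x for x, _ in data], [y for _, y in data])
print(b1)  # the estimated extra grams per additional week of gestation
```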
Example 2: The article "Chemithermomechanical Pulp from Mixed Density Hardwoods" reports on a
study in which the accompanying data were obtained to relate y = specific surface area (sq. cm/g)
to x1 = % NaOH used as a pretreatment chemical and x2 = treatment time (min) for a batch of
pulp.
x1 (% NaOH):      3    3    9    9    9   15   15   15
x2 (time, min):  60   90   30   60   90   30   60   90

Regression equation: y = β0 + β1x1 + β2x2
The regression line predicts the average y value associated with a given x value. Note that it is
also necessary to get a measure of the spread of the y values around that average. To do this, we
use the root-mean-square error (r.m.s. error).
To construct the r.m.s. error, you first need to determine the residuals. Residuals are the
differences between the actual values and the predicted values, denoted (yi − ŷi),
where yi is the observed value for the ith observation and ŷi is the predicted value.
They can be positive or negative as the predicted value under or over estimates the actual value.
Squaring the residuals, averaging the squares, and taking the square root gives us the r.m.s error.
You then use the r.m.s. error as a measure of the spread of the y values about the predicted y
value.
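The recipe just described (square the residuals, average the squares, take the square root) can be sketched as:

```python
import math

def rmse(actual, predicted):
    # Square the residuals, average the squares, take the square root
    sq = [(y - yhat) ** 2 for y, yhat in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

print(rmse([1, 2, 3], [1, 2, 3]))  # 0.0 -- a perfect fit has no spread
```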
As before, you can usually expect 68% of the y values to be within one r.m.s. error, and 95% to
be within two r.m.s. errors of the predicted values.
Squaring the residuals, taking the average and then the root to compute the r.m.s. error is a lot of
work. Fortunately, algebra provides us with a shortcut (whose mechanics we will omit).
Thus the RMS error is measured on the same scale, with the same units as y.
The factor √(1 − r²) is always between 0 and 1, since r is between −1 and 1. It tells us how much
smaller the r.m.s. error will be than the SD.
For example, if all the points lie exactly on a line with positive slope, then r will be 1, and the
r.m.s. error will be 0. This means there is no spread in the values of y around the regression line
(which you already knew since they all lie on a line).
The residuals can also be used to provide graphical information. If you plot the residuals against
the x variable, you expect to see no pattern. If you do see a pattern, it is an indication that there is
a problem with using a line to approximate this data set.
In analogy to the standard deviation, taking the square root of the MSE yields the root-mean-square
error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity
being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known
as the standard deviation.