The document provides an introduction to statistics, focusing on descriptive statistics, data collection methods, and types of data. It explains the importance of primary and secondary data, classification, tabulation, and various graphical representations of data. Additionally, it covers measures of central tendency, including arithmetic mean and median, along with examples to illustrate these concepts.

Introduction to Statistics

Unit 1: Descriptive Statistics

Statistics:

The word Statistics refers either to quantitative information or to methods of dealing with
quantitative information. Statistics deals with the collection, organization, analysis and
interpretation of numerical data.

 To some, statistics is an imposing form of mathematics, whereas to others it may be
tables, charts and figures.

Eg: The statements like,

“There are 1058 females for 1000 males in Kerala”

“Delhi accounts the highest rate of road accidents in India”

These statements contain figures and as such they are called “numerical statements of facts”.

Eg: Biostatistics, Economics, Industrial Statistics, Actuarial Science etc.

 In every field of science, one comes across different types of information, which is
termed data.

 Once we have the information on certain characterizations, we can analyze it and draw
conclusions about the population.

Data:

 It is the foundation of Statistics.

 The collected numerical information is called the data.

Primary and Secondary Data:

Collection of data can be done in two different ways:

a) The investigator may collect the required information directly, or

b) He/She may make use of data collected by others.

 Data collected by the investigator for the purpose of the investigation at hand is called
Primary data.

 Data collected by others for some other purpose and used by the investigator is called
Secondary data.

 Collection of primary data involves much time, money and labour. So, if available,
secondary data is preferred.

Collection of Primary Data:

Census and Sampling

 ‘Population’ or ‘Universe’ is a term usually used in statistics for the aggregate or totality
of units such as people, trees, animals, firms, plots of land, houses, manufactured articles
etc. about which information is required in a study.

 If information is collected from each and every unit of the population, the method of
collection is called a census; if it is collected from a representative part of the population,
the method is called sampling.

There are two branches of Statistics:

 Descriptive Statistics - Statistical methods used to summarize or describe a
collection of data.

 Inferential Statistics - Statistical methods used to draw conclusions or make
decisions about population characteristics based on the information collected from a
sample drawn from the population.

Descriptive Statistics:

 Gives numerical and graphical procedures to summarize a collection of data in a clear
and understandable way.

We can present the collected data in following ways:

Classification

Tabulation

Diagrammatic Presentation

Graphical Presentation

Classification:

 Classification is the process of arranging things into groups according to their
resemblance, similarity, or identity.

 It may be classified as:

1. Chronological or Temporal classification

2. Geographical or Spatial Classification

3. Qualitative classification

4. Quantitative classification

Types of variables:

 Qualitative variable:- A variable which cannot be numerically measured but can be
measured by its quality is called a qualitative variable.

Ex:- sex, intelligence, beauty

 Quantitative variable:- A variable which can be measured numerically is called a
quantitative variable.

Ex:- Height, Weight, temperature, Pressure

Quantitative variables are of two types:

 Discrete variable:- A quantitative variable which can assume finite or countable number
of isolated values is called a discrete variable.

Ex:- Family size, no. of kittens in a litter, number of defectives in a lot, number of
petals in a flower etc.

 Continuous variable:- A quantitative variable which can assume any numerical value
within a certain interval of real line is called a continuous variable.

Ex:- Temperature, age, pressure, height, weight etc.

Frequency distribution:

 A frequency distribution is a statement of the possible values of the variable together with
their respective frequencies.

 Frequency of a variant value is the number of occurrences of that particular value of the
variable in the data set.

Tabulation:

 Tabulation is the process of arranging the classified data in the form of a table consisting
of rows and columns. It is the continuation of classification.

 The purpose of tabulation is that it is easily understood and an investigator is quickly able
to locate the desired information.

 If classification is done with respect to one characteristic only, the corresponding table
will be a one way table.

TABULATION OF DATA:

Tabular presentation of qualitative data

 The data set of the smoking status of 20 individuals, collected as yes (y) or no (n). The
raw data are given as

n, y, n, n, n, y, n, y, n, n, n, y, n, n, y, n, y, n, n, y

Smoking Status Number of Patients


Yes 7
No 13
Total 20
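As a minimal sketch, the one-way table above can be reproduced from the raw responses with a few lines of Python (the list below is simply the 20 values transcribed from the raw data):

```python
from collections import Counter

# The 20 raw smoking-status responses transcribed from the list above.
raw = ["n", "y", "n", "n", "n", "y", "n", "y", "n", "n",
       "n", "y", "n", "n", "y", "n", "y", "n", "n", "y"]

counts = Counter(raw)  # tally each category
table = {"Yes": counts["y"], "No": counts["n"], "Total": len(raw)}
print(table)  # {'Yes': 7, 'No': 13, 'Total': 20}
```

`Counter` does the sorting-and-summing step of tabulation in one pass; the same pattern works for any qualitative variable.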

Two way table

Cellularity R HG

No Cells 4 1

Hypocellular 14 15

Cellular 12 4

Tabular presentation of quantitative data:

 Systolic blood pressure (mmHg) values of 25 patients attending a clinic

99 110 96 160 106 144 109 156 110 128 118 168 132 140 160 102 159 148 149 156
154 143 108 146 145

The frequency distribution of systolic blood pressure in patients attending a clinic

Graphical Representation:

Pie Chart:

 Circles drawn with areas proportional to the magnitudes of the observations constitute a
pie diagram.

 The circle is divided into sectors with areas proportional to the magnitudes of the
components, as in a component bar diagram.

 Pie diagrams are also called circular diagrams.

 Degree (angle) of any component:

Angle of Sector = (Magnitude of component / Total magnitude of phenomena) x 360

Frequency distribution of stomach cancer with respect to operable condition

Surgery Frequency
Operable 60
In-operable 14

[Pie chart: Stomach Cancer; Operable 81%, In-operable 19%]
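The sector-angle formula applied to the stomach-cancer table above can be sketched as follows (percentages rounded as in the chart):

```python
# Angle of sector = (magnitude of component / total magnitude) * 360
data = {"Operable": 60, "In-operable": 14}
total = sum(data.values())  # 74

angles = {k: round(v / total * 360, 1) for k, v in data.items()}
shares = {k: round(v / total * 100) for k, v in data.items()}

print(angles)  # {'Operable': 291.9, 'In-operable': 68.1}
print(shares)  # {'Operable': 81, 'In-operable': 19}
```

The angles sum to 360 degrees (up to rounding), which is a quick sanity check when constructing a pie diagram by hand.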

Bar diagrams:

 A method of presenting data in which frequencies are displayed along one axis and
categories of the variable along the other, the frequencies being represented by the bar
lengths.

 To avoid any misunderstanding, the bars are drawn with the same width at equal
distances.

Bar diagrams are of various types:

→ Simple bar diagrams

→ Sub-divided bar diagrams / Component bar diagrams

→ Multiple bar diagrams

→ Percentage bar diagrams

→ Deviation bars

Simple bar diagram:

 A simple bar diagram is used to represent only one variable. The figures of sales,
production, population etc., for various years may be shown by means of a simple bar
diagram.

 Bar diagrams are the easiest and most commonly used devices.

 The distance between every two consecutive bars should be uniform. The width of each
bar should be the same.

 The height of the bars should be proportional to the magnitude of the variable.

 All the bars should stand on the same base line (X-axis).

Bar Diagram:

Number of chemotherapy cycles completed

No. of Chemo Cycles Frequency


2 2
3 27
4 12
5 7
6 23

[Bar chart: Chemotherapy for Breast Cancer Patients; frequency (Y-axis) vs. no. of cycles completed (X-axis)]

Multiple (Adjacent) bar diagram:

 Multiple bar diagrams are used to represent two or more sets of interrelated data.

 Whenever a comparison between two or more related variables is to be made, multiple
bars should be preferred.

 An index is also prepared to identify bars.

Cellularity R HG

No Cells 4 1

Hypocellular 14 15

Cellular 12 4

[Multiple bar chart: Cellularity; frequency (Y-axis) vs. cellularity category (X-axis), with adjacent bars for R and HG]

What is the Distribution?

 Gives us a picture of the variability and central tendency.

 Can also show the amount of skewness and kurtosis.

Graphing Data: Types:

Creating Frequencies:

 We create frequencies by sorting data by value or category and then summing the cases
that fall into those values.

 How often do certain scores occur? This is a basic descriptive data question.

Histogram:

 Histogram is a bar diagram (two dimensional) which is suitable for frequency distribution
with continuous classes.

 Class intervals are shown on the ‘X-axis’ and the frequencies on the ‘Y-axis’. The height
of each rectangle represents the frequency of the class interval.

Eg: Age distribution of liver cancer patients

Age Frequency

25 – 29 3

30 – 34 10

35 - 39 5

40 - 44 3

45 - 49 5

50 - 54 2

55 - 59 1

60 - 64 1

65 - 69 3

70+ 7

Total 40

[Histogram of age (X-axis) vs. frequency (Y-axis); Mean = 47.42, Std. Dev. = 17.11, N = 40]
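Building the class intervals behind such a histogram is mechanical; here is a small sketch with hypothetical raw ages (the lecture's patient-level data is not reproduced in the notes), grouped into the same 5-year classes used in the table above:

```python
from collections import Counter

# Hypothetical raw ages, for illustration only.
ages = [26, 31, 33, 36, 41, 44, 47, 52, 58, 63, 68, 72, 74]

def age_class(age, start=25, width=5):
    """Label the 5-year class interval that contains `age`."""
    lo = start + ((age - start) // width) * width
    return f"{lo} - {lo + width - 1}"

freq = Counter(age_class(a) for a in ages)
print(freq["30 - 34"])  # 2  (ages 31 and 33)
```

Each age falls in exactly one interval, so the class frequencies always add back up to the total number of observations.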

Line Graphs or line diagram:

Age distribution of Cancer cases in a Panchayat

Age Frequency

30 - 39 7

40 - 49 9

50 - 59 4

60 - 69 4

70+ 6

Total 30

Line Graphs: A Time Series

[Line graph: presidential economic approval (Y-axis) plotted by month from 1981 to 2001 (X-axis)]

Scatter Plot (Two variables):

Presidential Approval and Unemployment

[Scatter plot: presidential approval (Y-axis) vs. unemployment rate (X-axis)]

Descriptive Statistics Cont…

 There are four characteristics for assessing the data.

(i) Measures of central tendency

(ii) Measures of dispersion

(iii) Skewness

(iv) Kurtosis

Measures of central tendency:

 The main purpose of the statistical treatment of any data is to summarize and describe the
data meaningfully and adequately.

 Certain summary indices tell us about the center or middle value of the data set, around
which the other observations lie. Such an index is called a measure of central tendency.

 They are computed to give a “center” around which the measurements in the data are
distributed.

Arithmetic mean:

 It is sometimes referred to as ‘the mean’ or ‘the average’.

 There are two types of arithmetic means, depending upon whether all the observations are
given equal importance or unequal importance.

 They are,

 Simple arithmetic mean

 Weighted arithmetic mean

Let x1, x2, …, xn be the n observations of the sample.

 Simple arithmetic mean is

X̄ = Sum of the observations / Total number of observations
  = (x1 + x2 + … + xn) / n
  = (1/n) Σ xi (sum over i = 1 to n)
Example: The ages of the members of a family are 53, 48, 25, 32 and 8. Find the simple
arithmetic mean of the ages.

X̄ = (53 + 48 + 25 + 32 + 8) / 5 = 166 / 5 = 33.2
 Arithmetic mean for ungrouped (frequency) data

Let the values of the variates be x1, x2, …, xk and let f1, f2, …, fk be the corresponding
frequencies; then the arithmetic mean is

X̄ = (f1x1 + f2x2 + … + fkxk) / (f1 + f2 + … + fk) = Σ fi xi / Σ fi (sums over i = 1 to k)

Example:

Mean = 6458/56 = 117.42

Arithmetic mean for grouped data:-


Formula is same as the ungrouped data where case xi’s are mid values of the classes.

Weighted arithmetic mean:

 In general, if x1, x2, …, xn are the observations and w1, w2, …, wn are the weights
assigned, the weighted arithmetic mean is given by,

X̄w = (w1x1 + w2x2 + … + wnxn) / (w1 + w2 + … + wn) = Σ wi xi / Σ wi

It may be noted that the A.M. of a frequency distribution is in fact a weighted arithmetic mean,
the weights being the frequencies.

Example: A student’s marks in the laboratory, lecture and recitation parts of a Statistics course
were 71, 78 and 89 respectively. If the weights accorded to these marks are 2, 4 and 5
respectively what is an appropriate average mark?

X̄ = (71 x 2 + 78 x 4 + 89 x 5) / (2 + 4 + 5) = 899 / 11 = 81.73
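The weighted-mean formula above translates directly into a short Python sketch, reproducing the marks example:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

marks   = [71, 78, 89]  # laboratory, lecture, recitation
weights = [2, 4, 5]
print(round(weighted_mean(marks, weights), 2))  # 81.73
```

With all weights equal, the function reduces to the simple arithmetic mean, which matches the remark that the A.M. of a frequency distribution is a weighted mean with the frequencies as weights.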

Median:

 The median represents the middle value of the data set when it has been arranged in
ascending order.

 Half the value will lie below the median and half will lie above it.

 When the sample size is odd, the median is the middle value

 When the sample size is even, the median is the midpoint/mean of the two middle values

Calculating the median for the gestation period of newborns:

Gestation Period (Weeks): 36, 38, 40, 37, 42, 35, 39, 32, 40, 41

Data in ascending order: 32, 35, 36, 37, 38, 39, 40, 40, 41, 42

median = (38 + 39) / 2 = 38.5

• Notice that only the two central values are used in the computation.

• The median is not sensitive to extreme values.

Example of Median:

Measurements: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4 (total 40)

Measurements ranked: 0, 1, 2, 3, 4, 5, 5, 6, 7, 7 (total 40)

 Median: (4 + 5) / 2 = 4.5. As the number of observations N = 10 is even, the median is
calculated as the average of the (N/2)th and (N/2 + 1)th observations after arranging in
ascending order. Here the (N/2)th, i.e. 5th, observation is 4 and the (N/2 + 1)th, i.e. 6th,
observation is 5.

 Notice that only the two central values are used in the computation.

 The median is not sensitive to extreme values.
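The even-n rule above can be checked in a couple of lines, using the same ten measurements:

```python
import statistics

measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# N = 10 is even, so the median averages the 5th and 6th ranked values.
ranked = sorted(measurements)
median = (ranked[4] + ranked[5]) / 2  # 0-based indices 4 and 5
print(median)  # 4.5

# statistics.median applies the same even/odd rule automatically.
assert median == statistics.median(measurements)
```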

Mode:

 The mode is the value that occurs most frequently

 It is the least useful (and least used) of the three measures of central tendency

Eg: The mode for the gestation period of newborns

Gestation Period
(Weeks)

36
38
40
37
42
35
39
32
40
41
• Mode is 40 weeks.

Eg 2:

Tumor size (in cm)


3
5
5
1
7
2
6
7
0
4

 In this case the data have two modes: 5 and 7

 Both values occur twice.
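Python's `statistics.multimode` (available since Python 3.8) returns every value tied for the highest frequency, so it handles the bimodal tumor-size data directly:

```python
import statistics

# Tumor sizes (in cm) from the table above.
tumor_sizes = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# multimode lists all values sharing the highest frequency.
modes = statistics.multimode(tumor_sizes)
print(sorted(modes))  # [5, 7]
```

Note that `statistics.mode` would return only a single value, so `multimode` is the safer choice when the data may have more than one mode.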

Measures of central tendency and levels of measurement:

 Mean assumes numerical values and requires interval data

 Median requires ordering of values and can be used with both interval and ordinal data

 Mode only involves determination of most common value and can be used with interval,
ordinal, and nominal data

Harmonic Mean

The harmonic mean is a better "average" when the numbers are defined in relation to some
unit. The common example is averaging speed.

For example, suppose that you have four 10 km segments to your automobile trip. You drive
your car:

 100 km/hr for the first 10 km


 110 km/hr for the second 10 km
 90 km/hr for the third 10 km
 120 km/hr for the fourth 10 km.

What is your average speed? Here is a spreadsheet solution:

Distance Velocity Time


km km/hr hr
10 100 0.100
10 110 0.091
10 90 0.111
10 120 0.083

Total 40 0.385
Average V = total distance / total time = 103.80 km/hr

The harmonic mean formula is:

HM = n / (1/x1 + 1/x2 + … + 1/xn) = n / Σ(1/xi)
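The spreadsheet result can be reproduced from the formula above; for equal-distance segments the harmonic mean of the speeds equals total distance over total time:

```python
import statistics

speeds = [100, 110, 90, 120]  # km/hr over four equal 10 km segments

# HM = n / sum(1/x_i)
n = len(speeds)
hm = n / sum(1 / v for v in speeds)
print(round(hm, 2))  # 103.8

# The stdlib computes the same quantity.
assert abs(hm - statistics.harmonic_mean(speeds)) < 1e-9
```

Note that the ordinary arithmetic mean of the four speeds is 105 km/hr, which overstates the true average speed; that gap is exactly why the harmonic mean is the right "average" here.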

Geometric Mean

In economic evaluation work, the geometric mean is often useful:

GM = (x1 · x2 · … · xn)^(1/n) = (Π xi)^(1/n)

The large Π represents multiplication, analogous to the Σ representing summation.

The GM is for situations where we want the average used in a multiplicative situation, such as the
"average" dimension of a box that would have the same volume as length x width x height. For
example, a box with dimensions 50 x 80 x 100 cm has the same volume as a cube of side
(50 x 80 x 100)^(1/3) ≈ 73.7 cm.

The customary economic evaluation application is in determining "average" inflation or rate of


return across several time periods. In calculating the GM, the numbers must all be positive.
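The box example can be verified directly from the GM formula:

```python
import statistics

dims = [50, 80, 100]  # box dimensions in cm

# GM = (x1 * x2 * ... * xn) ** (1/n): the side of the cube with the same volume.
gm = (50 * 80 * 100) ** (1 / 3)
print(round(gm, 2))  # 73.68

# statistics.geometric_mean (Python 3.8+) agrees, and requires positive inputs,
# matching the note above that all the numbers must be positive.
assert abs(gm - statistics.geometric_mean(dims)) < 1e-6
```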

Measures of variation:

 Measures of variability can help us to create a mental picture of the spread of the data.

 They describe “data spread” or how far the measurements are from the center.

Range

Variance and standard deviation

Range:

 The range R of a set of n measurements is defined as the difference between the largest
and the smallest measurements.

Eg.: Heights of students in a class.

Eg: The range for the gestation period of new borns

Gestation Period
(Weeks)

36
38
40
37
42
35
39
32
40
41
• Range=Max. value – Min. value

= 42 – 32 = 10

Variance (for a sample):

 Steps:

 Compute each deviation

 Square each deviation

 Sum all the squares

 Divide by the data size (sample size) minus one: n-1

Example of Variance:

Measurements (x) Deviations (x - mean) Squares of deviations
3 -1 1
5 1 1
5 1 1
1 -3 9
7 3 9
2 -2 4
6 2 4
7 3 9
0 -4 16
4 0 0
40 0 54

 Variance = 54/9 = 6

 It is a measure of “spread”.

 Notice that the larger the deviations (positive or negative) the larger the variance

The standard deviation:

 It is defined as the square root of the variance

 In the previous example

 Variance = 6

 Standard deviation = Square root of the variance = Square root of 6 = 2.45
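The four steps listed above (deviations, squares, sum, divide by n - 1) can be sketched directly, using the same ten measurements:

```python
import statistics

measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

mean = sum(measurements) / len(measurements)      # 4.0
ss = sum((x - mean) ** 2 for x in measurements)   # sum of squared deviations = 54.0
variance = ss / (len(measurements) - 1)           # 54 / 9 = 6.0
sd = variance ** 0.5                              # sqrt(6) ≈ 2.449

# statistics.variance uses the same n - 1 (sample) divisor.
assert variance == statistics.variance(measurements)
print(variance, round(sd, 2))  # 6.0 2.45
```

Note the 2.45 here vs. the 2.449 rounded to 2.45 in the text: both are the square root of 6, just reported to different precision.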

Variance and standard deviation:

 The variance s² is the sum of the squared deviations from the mean divided by the
number of cases minus 1:

s² = Σ (yi − ȳ)² / (n − 1)

 The standard deviation s is the square root of the variance:

s = √[ Σ (yi − ȳ)² / (n − 1) ]

Coefficient of variation:

Coefficient of variation is defined as a measure of relative variation (also sometimes called the
coefficient of dispersion):

CV = (std dev / mean) x 100 = (s / ȳ) x 100
 Use to compare variation of distributions with different units relative to their means

 For the comparison of variability in the same variable measured in two different
heterogeneous populations.

Eg: We are given that the average birth weight as 2592 g. and the corresponding S.D. as 354.90g.
Therefore the CV = (354.90/2592)*100 = 13.69%

Example: For a class of students, the height (X) has a distribution with mean 162 cm and sd
10 cm and weight (Y) has mean 57 kg and sd 8 kg. Compare the variability of X and Y.

C.V.(X) = (10 / 162) x 100 = 6.17 %
C.V.(Y) = (8 / 57) x 100 = 14.04 %

which shows that variability is more for the weight distribution.

Skewness:

 Defines the departure from symmetry of the distribution.

 When the mean is greater than the median the data distribution is skewed to the right
(positively skewed).

 When the median is greater than the mean the data distribution is skewed to the left
(negatively skewed).

 When the mean and median are very close to each other, the data distribution is
approximately symmetric.

Distribution shapes:

[Sketches of distribution shapes: symmetric, positively skewed (longer right tail), and negatively skewed (longer left tail)]

Measure of Skewness

 A measure of skewness should indicate by its sign whether the skewness is positive or
negative and by its magnitude the degree of skewness.

 A relative or coefficient measure of skewness suggested by Karl Pearson, usually
denoted by J, is

J = (Mean − Mode) / S.D. = 3 (Mean − Median) / S.D.

J will lie between −3 and +3.

Kurtosis:

 It also measures the deviation from normality.

 It gives the extent of peakedness of the curve:

Leptokurtic

Mesokurtic

Platykurtic

β2 = μ4 / (μ2)²

The z-score or the “standardized score”:


z x x
x

Normal curve:

 Skewness = 0

 Kurtosis = 3
f(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²)), −∞ < x < ∞

Lecturer Notes on Bivariate Statistical Data

Bivariate Analysis: The analysis of statistical data involving two variables is termed bivariate analysis.

SCATTER DIAGRAM:

 The simplest way of diagrammatic representation of bivariate data.

 Plot of values of the variables X and Y, (xi, yi), i = 1, 2, …, n. The diagram of dots so
obtained is known as the scatter diagram.

 Will give a good idea about the relationship between the two variables X and Y.

Correlation coefficient:

• A Measure of intensity or degree or linear relationship between two variables

• Correlation does not necessarily imply causation or functional relationship though the
existence of causation always implies correlation.

Types of Correlation:

On the basis of the nature of relationship between the variables, correlation may be:

Positive and Negative correlation

Simple, Partial or Multiple correlation

Linear or Non-linear correlation

Simple, Partial or Multiple correlation:

 When we study only two variables, the relationship is described as simple correlation.

 In multiple correlation, three or more variables are studied simultaneously.

 The study of two variables excluding the effect of some other variables is called partial correlation.

Correlation due to pure chance…

We may get a high degree of correlation between two variables, but there may not be any
relationship between the variables at all.

- Due to pure random sampling variation or because of the bias of the investigator in
selecting the sample

Income 5000 6000 7000 8000 9000

Weight 120 140 160 180 200

The above data show a perfect positive relationship between income and weight, i.e., as the
income is increasing the weight is increasing and the rate of change between the two variables is
the same.
Pearson’s Correlation Coefficient:

r = covariance(x, y) / √(var x · var y)

INTERPRETING COVARIANCE:

Covariance between two random variables:

Cov (X,Y) > 0 X and Y tend to move in the same direction

Cov (X,Y) < 0 X and Y tend to move in opposite directions

Cov (X,Y) = 0 X and Y are uncorrelated (no linear relationship); independence implies zero covariance, but zero covariance does not imply independence

Types of Correlation

[Scatter plots illustrating positive correlation, negative correlation, and no correlation]

The correlation coefficient measures the relative strength of the linear relationship between two variables:

 Unitless

 Ranges between –1 and 1

 The closer to –1, the stronger the negative linear relationship

 The closer to 1, the stronger the positive linear relationship

 The closer to 0, the weaker the linear relationship

LINEAR CORRELATION

[Scatter plots contrasting strong linear relationships (points tightly clustered about a line) with weak linear relationships (points widely scattered)]
What’s a good guess for the Pearson’s correlation coefficient (r) for this scatter plot?

a) –1.0 b) +1.0 c) 0 d) –0.5 e) –0.1

A Numerical example:

Calculation of Pearson r (from a working table of (X − X̄)², (Y − Ȳ)² and (X − X̄)(Y − Ȳ)):

σx² = (1/n) Σ (xi − x̄)² = 53.71

σy² = (1/n) Σ (yi − ȳ)² = 1310.79

Cov(X, Y) = (1/n) Σ (xi − x̄)(yi − ȳ) = 261.24

r = 261.24 / √((53.71)(1310.79)) = 261.24 / √70402.53 ≈ 0.985

Interpretation

The direction of the relationship between X and Y is positive: as X increases, Y also increases.
There is a sufficiently high degree of positive correlation between X and Y.
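A minimal sketch of the computation, using the population (1/n) form of the variances and covariance as in the worked example; the raw (xi, yi) pairs are not reproduced in the notes, so the final line plugs in the summary values directly:

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = cov(x, y) / sqrt(var(x) * var(y)), using 1/n divisors throughout."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(vx * vy)

# Plugging in the summary values from the example above:
r = 261.24 / sqrt(53.71 * 1310.79)
print(round(r, 3))  # 0.985
```

Because the same divisor appears in the numerator and denominator, the 1/n factors cancel; using (n - 1) divisors everywhere gives an identical r.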

Pearson Correlation Assumptions:

 That the relationship between X and Y can be represented by a straight line, i.e. it is
linear.

 That X and Y are metric variables, measured on an interval or ratio scale of
measurement.

 In using a t distribution to test the significance of the correlation coefficient …

 That the sample was randomly drawn from the population, and

 That X and Y are normally distributed in the population. This assumption is less
important as the sample size increases

Degree of Correlation:

 The following chart will show the approximate degree of correlation according to Karl
Pearson’s formula:

Spearman’s Rank-Order Correlation:

Suppose we want to measure the degree of association between two variables in the situation
where it is difficult to assign some definite value with respect to some character (like
intelligence, efficiency etc.), but ordering the individuals is possible.

 Let (xi, yi), i =1, 2, ..., n be ‘n’ pairs of observation for ‘n’ individuals.

 Let (ui, vi), i = 1, 2, …, n be the ranks of x values and y values respectively.

 Let Di = ui - vi be the difference in ranks for ith individual.

Spearman’s rank correlation coefficient is given by,

r = 1 − 6 Σ Di² / [ n (n² − 1) ] (sum over i = 1 to n)
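The rank-correlation formula above is easy to sketch; the ranks in the example below are hypothetical, standing in for two judges ordering five individuals:

```python
def spearman_rho(u_ranks, v_ranks):
    """Spearman's coefficient: 1 - 6 * sum(D_i^2) / (n * (n^2 - 1))."""
    n = len(u_ranks)
    d2 = sum((u - v) ** 2 for u, v in zip(u_ranks, v_ranks))  # sum of D_i^2
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ranks assigned to 5 individuals by two judges:
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

Identical rankings give r = 1 and completely reversed rankings give r = −1, mirroring the range of Pearson's coefficient.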
REGRESSION:

 “Regression” means “stepping back towards the average”.

 Galton found that the offspring of abnormally tall or short parents tend to “regress” or
“step back” to the average population height.

 A mathematical measure of average relationship between two or more variables in terms
of the original units of the data.

 Regression analysis is a statistical technique for investigating and modeling the
relationship between variables.

 It is among the most widely used statistical techniques.

 Applications of regression are numerous and occur in almost every field, including
Engineering, Physical Sciences, Economics, Management, Life and Biological Sciences
and Social Sciences.

Uses of Regression:

 Data description: Engineers and scientists frequently use equations to summarize or
describe a set of data.

 Regression analysis is helpful in developing such equations.

 For example, for a problem of delivering soft drink bottles with regard to the delivery
time and the delivery volume, a regression model would probably be a much more
convenient and useful summary of those data than a table or even a graph.


 Prediction and estimation: Many applications of regression involve prediction of the
response variable.

 For example, we may wish to predict delivery time for a specified number of cases of soft
drinks to be delivered.

 These predictions may be helpful in planning delivery activities such as routing and
scheduling, or in evaluating the productivity or delivery operations.

 However, even when the model form is correct, poor estimates of the model parameters
may still cause poor prediction performance.

 Control: Regression models may be used for control purposes.

 For example, a chemical engineer could use regression analysis to develop a model
relating the tensile strength of paper to the hardwood concentration in the pulp.

 This equation could then be used to control the strength to suitable values by varying the
level of hardwood concentration.

 When a regression equation is used for control purposes, it is important that the variables
be related in a causal manner.

Types of Regression Models

Simple linear regression describes the linear relationship between a predictor (regressor)
variable, plotted on the x-axis, and a response variable, plotted on the y-axis

y = β0 + β1x

SIMPLE LINEAR REGRESSION:

The linear regression model:

Hours of homework/week = 12.5 + 0.6  hours of exercise/week
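Coefficients like the 12.5 and 0.6 above come from least squares; a minimal sketch of the standard estimates (the exercise/homework data below is hypothetical, chosen to lie exactly on that line):

```python
def fit_line(xs, ys):
    """Least-squares estimates b0, b1 for the model y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx        # slope: Sxy / Sxx
    b0 = my - b1 * mx     # intercept: line passes through (x-bar, y-bar)
    return b0, b1

# Hypothetical data consistent with homework = 12.5 + 0.6 * exercise:
exercise = [0, 1, 2, 3]
homework = [12.5, 13.1, 13.7, 14.3]
b0, b1 = fit_line(exercise, homework)
print(round(b0, 1), round(b1, 1))  # 12.5 0.6
```

The fitted line always passes through the point of means (x̄, ȳ), which is how the intercept formula is derived.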

 Example: The distribution of baby weights at Cochin ~ N(3400, 360)

Your “Best guess” at a random baby’s weight, given no information about the baby, is
what? 3400 grams

PREDICTOR (REGRESSOR) VARIABLE

 X = gestation time

 Assume that babies that gestate for longer are born heavier, all other things being equal.

 Pretend (at least for the purposes of this example) that this relationship is linear.

 Example: suppose a one-week increase in gestation, on average, leads to a 100-gram
increase in birth-weight.

Y depends on X:

At 30 weeks…

 The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.

 E(Y | X = 30 weeks) = 3000 grams

Note that not every Y-value (Yi) sits on the line. There’s variability.

yi = 3000 + random errori

 In fact, babies that gestate for 30 weeks have birth-weights that center at 3000 grams, but
vary around 3000 with some variance σ².

 Approximately what distribution do birth-weights follow? Normal: Y | X = 30 weeks ~ N(3000, σ²)

Linear Regression Assumption:

 Error values (ε) are statistically independent


 Error values are normally distributed for any given value of x
 The probability distribution of the errors is normal
 The probability distribution of the errors has constant variance
 The underlying relationship between the x variable and the y variable is linear

Example 2: The article “Chemithermomechanical pulp from mixed density hardwoods” reports on a
study in which the accompanying data was obtained to relate y = specific surface area (sq. cm/g)
to x1 = % NaOH used as a pretreatment chemical and x2 = treatment time (min) for a batch of
pulp.

X1 3 3 9 9 9 15 15 15

X2 60 90 30 60 90 30 60 90

y 5.60 5.44 6.22 5.85 5.61 8.36 7.30 6.43

 Regression equation: y = β0 + β1x1 + β2x2 + ε

Root Mean Square Error:

The regression line predicts the average y value associated with a given x value. Note that it is
also necessary to get a measure of the spread of the y values around that average. To do this, we
use the root-mean-square error (r.m.s. error).

To construct the r.m.s. error, you first need to determine the residuals. Residuals are the
differences between the actual values and the predicted values, denoted (yi − ŷi),
where yi is the observed value for the ith observation and ŷi is the predicted value.

They can be positive or negative as the predicted value under or over estimates the actual value.
Squaring the residuals, averaging the squares, and taking the square root gives us the r.m.s error.
You then use the r.m.s. error as a measure of the spread of the y values about the predicted y
value.

As before, you can usually expect 68% of the y values to be within one r.m.s. error, and 95% to
be within two r.m.s. errors of the predicted values.

Squaring the residuals, taking the average and then the root to compute the r.m.s. error is a lot of
work. Fortunately, algebra provides us with a shortcut (whose mechanics we will omit): for the
least-squares line,

r.m.s. error = √(1 − r²) x SDy

Thus the RMS error is measured on the same scale, with the same units, as y.

The factor √(1 − r²) is always between 0 and 1, since r is between −1 and 1. It tells us how much
smaller the r.m.s. error will be than the SD.
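The square-average-root recipe above can be sketched directly; the two tiny data sets in the example are illustrative only:

```python
from math import sqrt

def rmse(observed, predicted):
    """Root-mean-square error: square root of the mean squared residual."""
    n = len(observed)
    return sqrt(sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted)) / n)

# Points that lie exactly on the regression line give an r.m.s. error of 0:
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0

# Nonzero residuals of 3 and 4 give sqrt((9 + 16) / 2):
print(round(rmse([0.0, 0.0], [3.0, 4.0]), 3))  # 3.536
```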

For example, if all the points lie exactly on a line with positive slope, then r will be 1, and the
r.m.s. error will be 0. This means there is no spread in the values of y around the regression line
(which you already knew since they all lie on a line).

The residuals can also be used to provide graphical information. If you plot the residuals against
the x variable, you expect to see no pattern. If you do see a pattern, it is an indication that there is
a problem with using a line to approximate this data set.

In analogy to the standard deviation, taking the square root of the MSE yields the root-mean-square
error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity
being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known
as the standard deviation.
