Introduction To Statistics
Statistics:
The word "Statistics" refers either to quantitative information or to the methods of dealing with quantitative information. Statistics deals with the collection, organization, analysis and interpretation of numerical data.
Statements that contain figures are called "numerical statements of facts".
In every field of science, one comes across different types of information, which is termed data.
Once we have information on certain characteristics, we can analyze it and draw conclusions about the population.
Data:
c) Data collected by the investigator for the purpose of the investigation at hand are called
primary data.
d) Data collected by others for some other purpose and used by the investigator are called
secondary data.
e) Collecting primary data involves much time, money and labour, so secondary data, if
available, are preferred.
'Population' or 'Universe' is a term usually used in statistics for the aggregate or totality
of units, such as people, trees, animals, firms, plots of land, houses, manufactured articles,
etc., about which information is required in a study.
Descriptive Statistics:
Classification
Tabulation
Diagrammatic Presentation
Graphical Presentation
Classification:
3. Qualitative classification
4. Quantitative classification
Types of variables:
Discrete variable:- A quantitative variable which can assume finite or countable number
of isolated values is called a discrete variable.
Ex:- Family size, no. of kittens in a litter, number of defectives in a lot, number of
petals in a flower etc.
Continuous variable:- A quantitative variable which can assume any numerical value
within a certain interval of the real line is called a continuous variable.
Frequency distribution:
A frequency distribution is a statement of the possible values of the variable together with
their respective frequencies.
The frequency of a value is the number of occurrences of that particular value of the
variable in the data set.
Tabulation:
Tabulation is the process of arranging the classified data in the form of a table consisting
of rows and columns. It is the continuation of classification.
The purpose of tabulation is to present the data in a form that is easily understood and in
which an investigator can quickly locate the desired information.
If classification is done with respect to one characteristic only, the corresponding table
will be a one way table.
TABULATION OF DATA:
Example: the smoking status of 20 individuals was recorded as yes (y) or no (n). The raw
data are given as
n, y, n, n, n, y, n, y, n, n, n, y, n, n, y, n, y, n, n, y
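The tallying of these responses into category frequencies can be sketched with a simple frequency count, assuming the 20 responses exactly as listed above:

```python
from collections import Counter

# The 20 smoking-status responses listed above (y = yes, n = no)
responses = ["n", "y", "n", "n", "n", "y", "n", "y", "n", "n",
             "n", "y", "n", "n", "y", "n", "y", "n", "n", "y"]

# Tally how many responses fall into each category
counts = Counter(responses)
print(counts["n"], counts["y"])  # 13 7
```

The resulting one-way table has two cells: 13 non-smokers and 7 smokers.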
If classification is done with respect to two characteristics, the result is a two-way table,
for example:

Cellularity      R    HG
No Cells         4     1
Hypocellular    14    15
Cellular        12     4
Example raw data (25 observations):
99 110 96 160 106 144 109 156 110 128 118 168 132 140 160 102 159 148 149 156
154 143 108 146 145
Graphical Representation:
Pie Chart:
A pie diagram is a circle divided into sectors whose areas are proportional to the
magnitudes of the components, and it serves the same purpose as a component bar
diagram.
Angle of sector = (component value / total value) × 360°
Surgery        Frequency
Operable           60
In-operable        14

[Pie chart: Stomach Cancer, Operable 81%, In-operable 19%]
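The sector angles and percentage shares for this pie chart follow directly from the angle formula above; a minimal sketch:

```python
# Sector angle = (component value / total value) * 360 degrees
data = {"Operable": 60, "In-operable": 14}
total = sum(data.values())  # 74 operations in all

angles = {name: value / total * 360 for name, value in data.items()}
shares = {name: round(value / total * 100) for name, value in data.items()}
print(shares)  # Operable ~81%, In-operable ~19%, matching the chart
```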
Bar diagrams:
A method of presenting data in which frequencies are displayed along one axis and
categories of the variable along the other, the frequencies being represented by the bar
lengths.
To avoid any misunderstanding, the bars are drawn with the same width at equal
distances.
→ Deviation bars
A simple bar diagram is used to represent only one variable. The figures of sales,
production, population etc., for various years may be shown by means of a simple bar
diagram.
Bar diagrams are the easiest and most commonly used devices.
The distance between every two consecutive bars should be uniform. The width of each
bar should be the same.
The height of the bars should be proportional to the magnitude of the variable.
All the bars should stand on the same base line (X-axis).
Bar Diagram:
[Bar chart: frequency on the Y-axis (0-25) against number of cycles completed (2-6) on the X-axis]
Multiple bar diagrams are used to represent two or more sets of interrelated data.
An index (legend) is also prepared to identify the bars.
Cellularity      R    HG
No Cells         4     1
Hypocellular    14    15
Cellular        12     4

[Multiple bar diagram: frequency of each cellularity category (No Cells, Hypocellular, Cellular) for groups R and HG, with a legend identifying the bars]
Graphing Data: Types:
Creating Frequencies:
We create frequencies by sorting data by value or category and then summing the cases
that fall into those values.
How often do certain scores occur? This is a basic descriptive data question.
Histogram:
A histogram is a two-dimensional bar diagram suitable for a frequency distribution
with continuous classes.
Class intervals are shown on the ‘X-axis’ and the frequencies on the ‘Y-axis’. The height
of each rectangle represents the frequency of the class interval.
Eg: Age distribution of liver cancer patients

Age      Frequency
25-29        3
30-34       10
35-39        5
40-44        3
45-49        5
50-54        2
55-59        1
60-64        1
65-69        3
70+          7
Total       40
[Histogram of age: Mean = 47.42, Std. Dev. = 17.11, N = 40]
Line Graphs or line diagram:

Age      Frequency
30-39        7
40-49        9
50-59        4
60-69        4
70+          6
Total       30
[Line graph: economic approval (%) plotted by year, 1981-2001]
[Line graph: approval (%) plotted against the unemployment rate]
(iii) Skewness
(iv) Kurtosis
The main purpose of the statistical treatment of any data is to summarize and describe the
data meaningfully and adequately.
Certain summary indices tell us about the centre or middle value of a set of data, around
which the other observations lie. Such an index is called a measure of central tendency.
These measures are computed to give a "center" around which the measurements in the
data are distributed.
Arithmetic mean:
There are two types of arithmetic means, depending upon whether all the observations are
given equal importance or unequal importance.
They are,
Simple arithmetic mean:
X̄ = (x1 + x2 + … + xn)/n = (1/n) Σ xi,  summing over i = 1, …, n
Example: The ages of the members of a family are 53, 48, 25, 32 and 8. Find the simple
arithmetic mean of the ages.
X̄ = (53 + 48 + 25 + 32 + 8)/5 = 166/5 = 33.2
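The same computation, as a sketch in Python:

```python
ages = [53, 48, 25, 32, 8]

# Simple arithmetic mean: the sum of the observations divided by their number
mean_age = sum(ages) / len(ages)
print(mean_age)  # 33.2
```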
Arithmetic mean for a frequency distribution
Let the values of the variates be x1, x2, …, xk and let f1, f2, …, fk be the corresponding
frequencies, then arithmetic mean is
X̄ = (f1x1 + f2x2 + … + fkxk)/(f1 + f2 + … + fk) = Σ fixi / Σ fi,  summing over i = 1, …, k
In general, if x1, x2, …, xn are the observations and w1, w2, …, wn are the weights
assigned, the weighted arithmetic mean is given by,
X̄ = Σ wixi / Σ wi = (w1x1 + w2x2 + … + wnxn)/(w1 + w2 + … + wn)
(In the example whose data are omitted here, Σ wixi = 1110 and Σ wi = 50, so X̄ = 1110/50 = 22.2.)
It may be noted that the A.M. of a frequency distribution is in fact a weighted arithmetic mean,
the weights being the frequencies.
Example: A student’s marks in the laboratory, lecture and recitation parts of a Statistics course
were 71, 78 and 89 respectively. If the weights accorded to these marks are 2, 4 and 5
respectively what is an appropriate average mark?
X̄ = (2×71 + 4×78 + 5×89)/(2 + 4 + 5) = 899/11 = 81.73
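The weighted mean above can be sketched as:

```python
marks = [71, 78, 89]   # laboratory, lecture, recitation
weights = [2, 4, 5]

# Weighted arithmetic mean: sum of w_i * x_i divided by the sum of the weights
weighted_mean = sum(w * x for w, x in zip(weights, marks)) / sum(weights)
print(round(weighted_mean, 2))  # 81.73
```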
Median:
The median represents the middle value of the data set when it has been arranged in
ascending order.
Half the value will lie below the median and half will lie above it.
When the sample size is odd, the median is the middle value.
When the sample size is even, the median is the mean of the two middle values.
• The median is not sensitive to extreme values
Example of Median:

Measurements    Measurements (ranked)
3               0
5               1
5               2
1               3
7               4
2               5
6               5
7               6
0               7
4               7
(sum 40)        (sum 40)
Median: (4 + 5)/2 = 4.5. As the number of observations is even, the median is calculated
as the average of the (N/2)th and (N/2 + 1)th observations after arranging in ascending
order. Here the (N/2)th, i.e. 5th, observation is 4 and the (N/2 + 1)th, i.e. 6th,
observation is 5.
Notice that only the two central values are used in the computation.
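The median rule described above can be sketched directly on the example data:

```python
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

ranked = sorted(measurements)          # arrange in ascending order
n = len(ranked)
if n % 2 == 1:
    median = ranked[n // 2]            # odd n: the single middle value
else:
    # even n: mean of the (n/2)th and (n/2 + 1)th values
    median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2
print(median)  # 4.5
```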
Mode:
The mode is the most frequently occurring value in the data set. It is the least useful (and
least used) of the three measures of central tendency.
Eg: The mode for the gestation period of newborns
Gestation period (weeks): 36, 38, 40, 37, 42, 35, 39, 32, 40, 41
• Mode is 40 weeks.
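A sketch of the same determination, using the standard library's mode routine:

```python
from statistics import mode

weeks = [36, 38, 40, 37, 42, 35, 39, 32, 40, 41]
print(mode(weeks))  # 40 -- the only value that occurs twice
```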
The median requires ordering of the values and can be used with both interval and ordinal
data.
The mode only involves determining the most common value and can be used with interval,
ordinal, and nominal data.
Harmonic Mean
The harmonic mean is a better "average" when the numbers are defined in relation to some
unit; the common example is averaging speed.
For example, suppose that your automobile trip has four 10 km segments driven at
different speeds. The total distance is 40 km; if the total driving time is 0.385 h, the
average speed is 40/0.385 ≈ 103.8 km/h, which is the harmonic mean of the four segment
speeds.
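The individual segment speeds are not shown in the text, so the values below are hypothetical, chosen only to illustrate that the average speed equals the harmonic mean of the segment speeds:

```python
from statistics import harmonic_mean

# Hypothetical speeds (km/h) for the four 10 km segments; the actual
# values are not given in the text -- these merely illustrate the idea.
speeds = [100, 110, 90, 120]
segment_km = 10

total_time = sum(segment_km / v for v in speeds)       # hours
avg_speed = 4 * segment_km / total_time                # total distance / total time
assert abs(avg_speed - harmonic_mean(speeds)) < 1e-9   # the same quantity
print(round(avg_speed, 1))
```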
Geometric Mean
The GM is used when we want an average in a multiplicative situation, such as the
"average" dimension of a box that would have the same volume as length × width × height.
For example, suppose a box has dimensions 50 × 80 × 100 cm; the geometric mean is
(50 × 80 × 100)^(1/3) ≈ 73.7 cm, the side of a cube with the same volume.
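A sketch of the box example with the standard library's geometric-mean routine:

```python
from statistics import geometric_mean

dims = [50, 80, 100]            # box dimensions in cm
side = geometric_mean(dims)     # cube root of 50 * 80 * 100 = 400000
print(round(side, 1))           # ~73.7 cm; a cube with this side has the same volume
```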
Measures of variation:
Measures of variability can help us to create a mental picture of the spread of the data.
They describe “data spread” or how far the measurements are from the center.
Range:
The range R of a set of n measurements is defined as the difference between the largest
and the smallest measurements.
Gestation period (weeks): 36, 38, 40, 37, 42, 35, 39, 32, 40, 41
• Range=Max. value – Min. value
= 42 – 32 = 10
Variance (for a sample):
Steps: compute the mean; find each observation's deviation from the mean; square and sum
these deviations; divide the sum by n − 1.
Example of Variance:
For a sample of 10 observations whose squared deviations from the mean sum to 54, the
variance = 54/9 = 6.
Variance is a measure of "spread": the larger the deviations (positive or negative), the
larger the variance.
The variance s2 is the sum of the squared deviations from the mean divided by the
number of cases minus 1
s² = Σ (yi − ȳ)² / (n − 1)
s = √[ Σ (yi − ȳ)² / (n − 1) ]
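The steps above can be sketched as follows. The data behind the "54/9 = 6" example are not shown in the text, so the sample here is hypothetical, chosen so that the squared deviations sum to 54:

```python
from statistics import mean, variance

# Hypothetical sample of n = 10; values chosen so the squared
# deviations from the mean sum to 54, reproducing 54/9 = 6.
y = [9, 3, 9, 3, 9, 3, 6, 6, 6, 6]

ybar = mean(y)                              # step 1: the mean (6)
ss = sum((yi - ybar) ** 2 for yi in y)      # steps 2-3: sum of squared deviations (54)
s2 = ss / (len(y) - 1)                      # step 4: divide by n - 1
assert s2 == variance(y)                    # agrees with the library routine
print(s2)  # 6.0
```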
Coefficient of variation:
The coefficient of variation (CV) expresses the standard deviation as a percentage of the
mean. It is used to compare variability across variables measured in different units or
across heterogeneous populations.
Eg: We are given that the average birth weight as 2592 g. and the corresponding S.D. as 354.90g.
Therefore the CV = (354.90/2592)*100 = 13.69%
Example: - For a class of students, the height (X) has a distribution with mean 162 cm and sd
10cm and weight (Y) has mean 57 kg and sd 8 kg. Compare the variability aspect of X and Y.
C.V.(X) = (10/162) × 100 = 6.17%
C.V.(Y) = (8/57) × 100 = 14.04%
which shows that variability is more for the weight distribution.
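The comparison can be sketched as:

```python
def coeff_variation(sd, mean):
    # CV expresses the SD as a percentage of the mean (unit-free)
    return sd / mean * 100

cv_height = coeff_variation(10, 162)   # height: sd 10 cm, mean 162 cm
cv_weight = coeff_variation(8, 57)     # weight: sd 8 kg, mean 57 kg
print(round(cv_height, 2), round(cv_weight, 2))  # 6.17 14.04
```

Because the CV is unit-free, the two percentages are directly comparable even though one variable is in centimetres and the other in kilograms.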
Skewness:
When the mean is greater than the median the data distribution is skewed to the right
(positively skewed).
When the median is greater than the mean the data distribution is skewed to the left
(negatively skewed).
When the mean and median are very close to each other, the data distribution is
approximately symmetric.
Distribution shapes:
[Sketches: a symmetric distribution, a positively skewed distribution, and a negatively skewed distribution]
Measure of Skewness
A measure of skewness should indicate by its sign whether the skewness is positive or
negative and by its magnitude the degree of skewness.
Kurtosis:
Leptokurtic
Mesokurtic
Platykurtic
A common moment-based measure of kurtosis is β2 = μ4 / μ2², the ratio of the fourth
central moment to the square of the second central moment.
Normal curve:
Skewness = 0
Kurtosis = 3
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞
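The density above can be sketched directly; this evaluates the standard normal (μ = 0, σ = 1) at its peak:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of the normal curve with mean mu and standard deviation sigma
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(normal_pdf(0), 4))  # 0.3989, the peak height of the standard normal
```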
Lecturer Notes on Bivariate Statistical Data
Bivariate Analysis: The analysis of statistical data involving two variables is termed bivariate analysis.
SCATTER DIAGRAM:
A plot of the values of the variables X and Y, (xi, yi), i = 1, 2, …, n, gives a diagram
of dots known as the scatter diagram.
It gives a good idea of the relationship between the two variables X and Y.
Correlation coefficient:
• Correlation does not necessarily imply causation or functional relationship though the
existence of causation always implies correlation.
Types of Correlation:
On the basis of the nature of relationship between the variables, correlation may be:
When we study only two variables, the relationship is described as simple correlation.
The study of two variables after eliminating the effect of other variables is called partial correlation.
We may get a high degree of correlation between two variables, but there may not be any
relationship between the variables at all.
The above data show a perfect positive relationship between income and weight, i.e., as the
income is increasing the weight is increasing and the rate of change between the two variables is
the same.
Pearson's Correlation Coefficient: r = covariance(x, y) / √(var(x) · var(y))
INTERPRETING COVARIANCE:
The correlation coefficient measures the relative strength of the linear relationship
between two variables and is unitless.
LINEAR CORRELATION
[Four scatter plots of Y against X showing different patterns of linear correlation]
What’s a good guess for the Pearson’s correlation coefficient (r) for this scatter plot?
A Numerical example:
Worktable columns: (X − X̄)², (Y − Ȳ)², (X − X̄)(Y − Ȳ)

Calculation of Pearson r:
σx² = (1/n) Σ (xi − x̄)² = 53.71
σy² = (1/n) Σ (yi − ȳ)² = 1310.79
Cov(X, Y) = (1/n) Σ (xi − x̄)(yi − ȳ) = 261.24
r = 261.24 / √(53.71 × 1310.79) = 261.24 / √70402.53 ≈ 0.985
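The final step can be reproduced from the summary quantities above (the raw data are not shown in the text):

```python
import math

# Summary quantities from the worked example
var_x = 53.71
var_y = 1310.79
cov_xy = 261.24

r = cov_xy / math.sqrt(var_x * var_y)
print(round(r, 3))  # 0.985
```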
Interpretation
The direction of the relationship between X and Y is positive. As X increases Y also increases.
There is a sufficiently high degree of positive correlation between X and Y.
This interpretation assumes:
• that the relationship between X and Y can be represented by a straight line, i.e. it is
linear;
• that the sample was randomly drawn from the population; and
• that X and Y are normally distributed in the population (an assumption that becomes
less important as the sample size increases).
Degree of Correlation:
The following chart will show the approximate degree of correlation according to Karl
Pearson’s formula:
Suppose we want to measure the degree of association between two variables in the situation
where it is difficult to assign some definite value with respect to some character (like
intelligence, efficiency etc.), but ordering the individuals is possible.
Let (xi, yi), i =1, 2, ..., n be ‘n’ pairs of observation for ‘n’ individuals.
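The rank-correlation formula itself is not reproduced in the text, so this sketch uses the standard Spearman formula ρ = 1 − 6Σd²/(n(n² − 1)), assuming no tied values:

```python
def spearman_rho(x, y):
    # Standard Spearman rank-correlation formula; assumes no ties in x or y
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    n = len(x)
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly concordant rankings give rho = 1
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```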
REGRESSION:
Galton found that the offspring of abnormally tall or short parents tend to “regress” or
“step back” to the average population height.
Applications of regression are numerous and occur in almost every field, including
Uses of Regression:
For example, for a problem of delivering soft drink bottles with regard to the delivery
time and the delivery volume, a regression model would probably be a much more
convenient and useful summary of those data than a table or even a graph.
For example, we may wish to predict delivery time for a specified number of cases of soft
drinks to be delivered.
These predictions may be helpful in planning delivery activities such as routing and
scheduling, or in evaluating the productivity of delivery operations.
However, even when the model form is correct, poor estimates of the model parameters
may still cause poor prediction performance.
For example, a chemical engineer could use regression analysis to develop a model
relating the tensile strength of paper to the hardwood concentration in the pulp.
This equation could then be used to control the strength to suitable values by varying the
level of hardwood concentration.
When a regression equation is used for control purposes, it is important that the variables
be related in a causal manner.
Simple linear regression describes the linear relationship between a predictor (regressor)
variable, plotted on the x-axis, and a response variable, plotted on the y-axis
y = β0 + β1x
Your “Best guess” at a random baby’s weight, given no information about the baby, is
what? 3400 grams
X = gestation time
Assume that babies that gestate for longer are born heavier, all other things being equal.
Pretend (at least for the purposes of this example) that this relationship is linear.
Y depends on X:
At 30 weeks…
The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.
Note that not every Y-value (Yi) sits on the line. There’s variability.
In fact, babies that gestate for 30 weeks have birth weights that center at 3000 grams
but vary around 3000 with some variance σ².
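Fitting the line y = β0 + β1x by least squares can be sketched as follows; the gestation/weight pairs below are hypothetical illustration data, not from the text:

```python
def fit_line(xs, ys):
    # Least-squares estimates of the intercept b0 and slope b1
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical (gestation weeks, birth weight g) pairs -- illustration only
data = [(28, 2600), (30, 3000), (32, 3400), (34, 3800), (36, 4200)]
b0, b1 = fit_line([x for x, _ in data], [y for _, y in data])
print(b1)  # the estimated extra grams per additional week of gestation
```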
Example 2: The article "Chemithermomechanical Pulp from Mixed Density Hardwoods" reports on a
study in which the accompanying data were obtained to relate y = specific surface area (sq. cm/g)
to x1 = % NaOH used as a pretreatment chemical and x2 = treatment time (min) for a batch of
pulp.
x1 (% NaOH):      3    3    9    9    9   15   15   15
x2 (time, min):  60   90   30   60   90   30   60   90

Regression equation: y = β0 + β1x1 + β2x2
The regression line predicts the average y value associated with a given x value. Note that it is
also necessary to get a measure of the spread of the y values around that average. To do this, we
use the root-mean-square error (r.m.s. error).
To construct the r.m.s. error, you first need to determine the residuals. Residuals are the
differences between the actual values and the predicted values, denoted (yi − ŷi),
where yi is the observed value for the ith observation and ŷi is the predicted value.
They can be positive or negative as the predicted value under or over estimates the actual value.
Squaring the residuals, averaging the squares, and taking the square root gives us the r.m.s error.
You then use the r.m.s. error as a measure of the spread of the y values about the predicted y
value.
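The recipe just described (square the residuals, average the squares, take the square root) can be sketched as:

```python
import math

def rmse(actual, predicted):
    # Square the residuals, average the squares, take the square root
    sq = [(y - yhat) ** 2 for y, yhat in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

print(rmse([1, 2, 3], [1, 2, 3]))  # 0.0 -- a perfect fit has no spread
```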
As before, you can usually expect 68% of the y values to be within one r.m.s. error, and 95% to
be within two r.m.s. errors of the predicted values.
Squaring the residuals, taking the average and then the root to compute the r.m.s. error is a lot of
work. Fortunately, algebra provides us with a shortcut (whose mechanics we will omit).
Thus the RMS error is measured on the same scale, with the same units as y.
The factor √(1 − r²) is always between 0 and 1, since r is between −1 and 1. It tells us how much
smaller the r.m.s. error will be than the SD.
For example, if all the points lie exactly on a line with positive slope, then r will be 1, and the
r.m.s. error will be 0. This means there is no spread in the values of y around the regression line
(which you already knew since they all lie on a line).
The residuals can also be used to provide graphical information. If you plot the residuals against
the x variable, you expect to see no pattern. If you do see a pattern, it is an indication that there is
a problem with using a line to approximate this data set.
In analogy to the standard deviation, taking the square root of the MSE yields the root-mean-square
error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity
being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known
as the standard deviation.