Statistics Intro 1
• Simple data:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i; \qquad \bar{X}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}; \qquad \bar{X} = A + \frac{1}{n}\sum_{i=1}^{n} (X_i - A)$$

• Data with frequencies:

$$\bar{X} = \frac{\sum_i f_i X_i}{\sum_i f_i}; \qquad \bar{X} = A + \frac{\sum_i f_i (X_i - A)}{\sum_i f_i}$$

$w_i$ : weight of $X_i$; $f_i$ : frequency of $X_i$; $A$ : assumed mean
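As a minimal sketch of the simple and weighted means above (the marks and credit weights below are hypothetical):

```python
def simple_mean(xs):
    """Arithmetic mean: (1/n) * sum(x_i)."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

marks   = [80, 90, 70]   # hypothetical course marks
credits = [4, 3, 2]      # hypothetical weights (course credits)
print(simple_mean(marks))             # 80.0
print(weighted_mean(marks, credits))  # pulled toward the heavily weighted mark
```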
Grouped data: $M = L + \dfrac{N/2 - F}{f}\, C$

L : lower limit of the median class;
N : total of all frequencies;
F : cumulative frequency of the class preceding the median class;
C : width of the class interval;
f : frequency of the median class
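The grouped-data median formula can be sketched in Python as follows (the frequency table is hypothetical; classes are assumed equal-width and contiguous):

```python
def grouped_median(lower_limits, frequencies, class_width):
    """M = L + ((N/2 - F) / f) * C for equal-width, contiguous classes."""
    n_total = sum(frequencies)                 # N
    cum = 0                                    # F: cumulative freq. so far
    for lower, f in zip(lower_limits, frequencies):
        if cum + f >= n_total / 2:             # found the median class
            return lower + ((n_total / 2 - cum) / f) * class_width
        cum += f

lowers = [20, 30, 40, 50, 60]   # lower class limits
freqs  = [5, 8, 12, 9, 6]       # N = 40, so N/2 = 20 falls in class [40, 50)
print(grouped_median(lowers, freqs, 10))
```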
Statistical Data Analysis
• Median: Median for ordinal data
– identify the category associated with the
observation located at the middle of the
distribution
– Median class: Moderate (closest to 718)
• Mode:
– It is the most frequently or commonly occurring value in a
distribution
– Applicable to both quantitative (numerical) and qualitative
(categorical) data
– Not affected by outliers
– Only measure for nominal data
– Sometimes there is more than one mode for the same distribution of data (bi-modal or multi-modal). The presence of more than one mode limits the ability of the mode to describe the centre or typical value of the distribution, because a single value describing the centre cannot be identified.
– Commonly used index in the analysis of populations (e.g. interest in a particular product)
• Mode:
Ungrouped data: Mode = most frequently occurring value

Grouped data: $Z = L + \dfrac{f_1}{f_1 + f_2}\, C$

L : lower limit of the modal class;
$f_1$ : absolute difference between the frequencies of the modal class and the preceding class;
$f_2$ : absolute difference between the frequencies of the modal class and the succeeding class;
C : width of the class interval
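A short sketch of the grouped-mode formula, reusing the same hypothetical frequency table as before (equal-width, contiguous classes assumed):

```python
def grouped_mode(lower_limits, frequencies, class_width):
    """Z = L + f1 / (f1 + f2) * C for equal-width, contiguous classes."""
    i = max(range(len(frequencies)), key=frequencies.__getitem__)  # modal class
    f_prev = frequencies[i - 1] if i > 0 else 0
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    f1 = abs(frequencies[i] - f_prev)   # diff. with preceding class
    f2 = abs(frequencies[i] - f_next)   # diff. with succeeding class
    return lower_limits[i] + (f1 / (f1 + f2)) * class_width

print(grouped_mode([20, 30, 40, 50, 60], [5, 8, 12, 9, 6], 10))
```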
• Geometric mean:
– It is the nth root of the product of the values of the n items in a distribution
– Central tendency measure where the distribution is exponential; used to measure the average when the quantity of interest is expressed as a relative change with respect to some reference or time
– Applications: index preparation; growth of population

$$GM = \left(x_1 x_2 \cdots x_n\right)^{1/n}$$
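A small sketch of the geometric mean applied to relative changes (the yearly growth factors below are hypothetical):

```python
import math

def geometric_mean(values):
    """GM = (x1 * x2 * ... * xn)^(1/n); values must be positive."""
    # Summing logs avoids overflow in the product for long inputs.
    return math.exp(sum(math.log(x) for x in values) / len(values))

# Hypothetical growth factors for three years: +10%, +50%, -20%
print(geometric_mean([1.10, 1.50, 0.80]))   # average growth factor per year
```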
• Harmonic Mean:
– Used as a central tendency measure when time, ratio and rate are involved (production per day; distance travelled per hour)
– Because the harmonic mean of a distribution tends strongly toward the least elements, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones

$$HM = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dfrac{1}{x_3} + \cdots + \dfrac{1}{x_n}}$$
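The classic rate example as a sketch: the harmonic mean gives the correct average speed over equal distances (the speeds are hypothetical):

```python
def harmonic_mean(values):
    """HM = n / (1/x1 + 1/x2 + ... + 1/xn); values must be nonzero."""
    return len(values) / sum(1 / x for x in values)

# Average speed over two equal distances driven at 40 and 60 km/h:
print(harmonic_mean([40, 60]))   # close to 48, not the arithmetic mean 50
```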
• Measures of Dispersion (variation):
– Scatter of item values around the average
– Measures the variability of the data: the extent to which data points in a statistical distribution or data set diverge from the average (mean) value, as well as the extent to which these data points differ from each other
– Risk analysis (volatility) in investment
– Range, variance, standard deviation
• Range:
– The range refers to the difference between the largest and smallest data values in a distribution
– Gives an idea of variability; affected by fluctuations in sampling; a rough measure of variability

$$R = H - L; \qquad \text{coeff. of range} = \frac{H - L}{H + L}$$

H : highest value; L : lowest value
• Mean deviation: average of the absolute differences of the data points from some average (mean, median or mode) of the series

$$\delta_{\bar{x}} = \frac{\sum |x_i - \bar{x}|}{n}; \qquad \delta_M = \frac{\sum |x_i - M|}{n}; \qquad \delta_Z = \frac{\sum |x_i - Z|}{n}$$

With frequencies (where $n = \sum f_i$):

$$\delta_{\bar{x}} = \frac{\sum f_i |x_i - \bar{x}|}{n}; \qquad \delta_M = \frac{\sum f_i |x_i - M|}{n}; \qquad \delta_Z = \frac{\sum f_i |x_i - Z|}{n}$$

$$\text{coeff}_{\bar{x}} = \frac{\delta_{\bar{x}}}{\bar{x}}; \qquad \text{coeff}_M = \frac{\delta_M}{M}; \qquad \text{coeff}_Z = \frac{\delta_Z}{Z}$$

$\delta_{\bar{x}}, \delta_M, \delta_Z$ : mean deviation from the mean, median and mode
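A minimal sketch of the mean deviation about a chosen centre (the data set is hypothetical):

```python
def mean_deviation(xs, centre):
    """Mean absolute deviation of xs about a chosen average
    (arithmetic mean, median or mode)."""
    return sum(abs(x - centre) for x in xs) / len(xs)

xs = [2, 4, 6, 8, 10]
mean = sum(xs) / len(xs)                 # 6.0
print(mean_deviation(xs, mean))          # 2.4
print(mean_deviation(xs, mean) / mean)   # coefficient of mean deviation
```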
• Standard Deviation:
– Standard deviation is a measure of the dispersion of a set of data from
its mean
– If the data points are farther from the mean, there is higher deviation
within the data set
– Standard deviation is calculated as the square root of variance by
determining the variation between each data point relative to the mean
– E.g. in financial analysis, standard deviation is applied to the annual rate of return of an investment to measure the investment's volatility. A volatile return has a high standard deviation whereas the deviation of a stable return is lower. A large dispersion indicates how much the return on the fund deviates from the expected normal returns.
– In physical experiments, standard deviation provides a way to check the results. A large standard deviation means the experiment is faulty: either there is too much noise from outside or there could be a fault in the measuring instrument.
– Amenable to mathematical manipulation, as algebraic signs are not ignored; less affected by fluctuations in sampling
• Standard Deviation:
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \qquad \text{ungrouped data}$$

$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}} \qquad \text{ungrouped data with frequency; grouped data}$$

$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2} = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2}$$

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \qquad \text{sample standard deviation}$$
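The population (divisor n) and sample (divisor n−1) formulas above can be sketched as follows (the data set is hypothetical):

```python
import math

def population_sd(xs):
    """sigma = sqrt(sum((x - mean)^2) / n) — divisor n (whole population)."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    """s = sqrt(sum((x - mean)^2) / (n - 1)) — divisor n-1 (sample)."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

xs = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(xs))   # 2.0
print(sample_sd(xs))       # slightly larger than 2.0
```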
• Standard Deviation:
$$\sigma = \sqrt{\frac{\sum (x_i - A)^2}{n} - \left(\frac{\sum (x_i - A)}{n}\right)^2} \qquad \text{ungrouped data}$$

$$\sigma = \sqrt{\frac{\sum f_i (x_i - A)^2}{\sum f_i} - \left(\frac{\sum f_i (x_i - A)}{\sum f_i}\right)^2} \qquad \text{grouped data}$$

$$\text{variance} = \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$

$$\text{coeff. of SD} = \frac{\sigma}{\bar{x}}; \qquad \text{coeff. of variation} = \frac{\sigma}{\bar{x}} \times 100$$
• Skewness:
– Asymmetry in the data distribution; the manner in which data is clustered around an average
– Symmetric data: zero skewness; e.g. the normal distribution (equal distribution on both sides)
– Asymmetric data: skewed data; right (positively) or left (negatively) skewed
• Skewness:
– Coefficient of skewness: the amount by which the balance exceeds on one side (−1 to 1)
– Karl Pearson’s coeff. of skewness:

$$CS = \frac{\bar{x} - Z}{\sigma}$$

– Also, using mode = 3(median) − 2(mean):

$$CS = \frac{3(\bar{x} - M)}{\sigma}$$
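The second form (using the median) can be sketched with the standard-library `statistics` module; the data sets below are hypothetical:

```python
import statistics

def pearson_skewness(xs):
    """CS = 3 * (mean - median) / sigma (second Pearson coefficient)."""
    sigma = statistics.pstdev(xs)   # population standard deviation
    return 3 * (statistics.mean(xs) - statistics.median(xs)) / sigma

# A few large values pull the mean above the median -> positive (right) skew:
print(pearson_skewness([1, 2, 2, 3, 3, 4, 10]))   # > 0
print(pearson_skewness([1, 2, 3, 4, 5]))          # 0.0 (symmetric)
```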
• Measures of relationship (correlation and regression):
– For bivariate or multivariate analysis
– Does there exist any association or correlation between two (or more) variables? If yes, of what degree? : correlation
– Is there any cause-effect relationship (independent-dependent) between the variables? If yes, of what degree and in which direction? : regression
– Bivariate: Cross-tabulation; Charles Spearman's coeff. of
correlation, Karl Pearson’s coeff. of correlation, simple
regression
– Multivariate: coeff. of multiple correlation, coeff. of partial
correlation, multiple regression
• TYPES OF CORRELATION
– Positive correlation: the two variables move in the same direction, i.e. when one variable increases the other also increases, and when one decreases the other also decreases. E.g. the age and height of a child.
– Negative correlation: the two variables move in opposite directions, i.e. when one increases the other decreases, and vice versa. E.g. the price and demand of a normal good.
– Zero correlation: there is no relationship between the two variables; the variables are independent.
• Correlation coefficient:
– Statistical correlation is measured by coefficient of
correlation (r). Its numerical value ranges from +1.0 to -1.0.
– It gives an indication of the strength of relationship. In
general, r > 0 indicates positive relationship, r < 0 indicates
negative relationship while r = 0 indicates no relationship
(or that the variables are independent and not related).
– r = +1.0 : perfect positive correlation and r = -1.0 : perfect
negative correlation. Closer the coefficients are to +1.0 and
-1.0, greater is the strength of the relationship between the
variables.
– ±0.5 to ±1.0: strong; ±0.3 to ±0.5: moderate; ±0.1 to ±0.3: weak; −0.1 to 0.1: none or very weak
• Correlation coefficient:
– Correlation is only appropriate for examining the relationship
between meaningful quantifiable data (e.g. air pressure,
temperature) rather than categorical data such as gender and
favorite color etc.
– Normally, correlation coefficients measure a linear relationship. It is therefore possible that while there is a strong nonlinear relationship between the variables, r is close to 0 or even 0. In such a case, a scatter diagram can roughly indicate the existence or otherwise of a nonlinear relationship.
– r does not provide any information about a cause-effect relationship but simply states that the variables X and Y are related. Statistical correlation should not be the primary tool used to study causation, because of the problem of third variables.
• Correlation coefficient:
– Charles Spearman’s coeff. of correlation:
• for ordinal data with ranking (rank correlation)
• measures similarity/ dissimilarity between two sets of
ranking
• e.g. ranking of products by two persons; ranking of
employees in two skill tests
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)}$$

$x_i, y_i$ : rank of the $i$th observation/item in x and y;
n : no. of observations
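A sketch of the rank-correlation formula (scores from two hypothetical judges, converted to ranks internally; ties are not handled, for simplicity):

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores given by two judges to five products:
print(spearman([86, 97, 99, 100, 101], [2, 20, 28, 27, 29]))
```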
• Correlation coefficient:
– Karl Pearson’s coefficient: product moment correlation
• Linear relation between the variables; causal relation (dependent-independent); variables follow a normal distribution

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n\, \sigma_X \sigma_Y}; \qquad r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

$$r = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X_i^2 - n\bar{X}^2\right)\left(\sum Y_i^2 - n\bar{Y}^2\right)}}$$

$$r = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{\left(n\sum X_i^2 - \left(\sum X_i\right)^2\right)\left(n\sum Y_i^2 - \left(\sum Y_i\right)^2\right)}}$$
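The last (raw-sum) form of r can be sketched directly (the x and y values below are hypothetical):

```python
import math

def pearson_r(x, y):
    """r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))   # moderate positive r
```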
• Correlation coefficient:
– Karl Pearson’s coefficient: grouped data
$$r = \frac{N \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij} - \left(\sum_{i=1}^{m} f_i X_i\right)\left(\sum_{j=1}^{n} f_j Y_j\right)}{\sqrt{N \sum_{i=1}^{m} f_i X_i^2 - \left(\sum_{i=1}^{m} f_i X_i\right)^2}\ \sqrt{N \sum_{j=1}^{n} f_j Y_j^2 - \left(\sum_{j=1}^{n} f_j Y_j\right)^2}}$$

$X_i, Y_j$ : midpoints of the $i$th and $j$th class intervals of X and Y;
$a_{ij}$ : frequency of observations in the $i$th and $j$th class intervals of X and Y;

$$f_{ij} = a_{ij} X_i Y_j; \qquad N = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}; \qquad f_i = \sum_{j=1}^{n} a_{ij}; \qquad f_j = \sum_{i=1}^{m} a_{ij}$$

$$x_i = X_i - A_x; \qquad y_j = Y_j - A_y$$
• Regression:
– Determination of a statistical relationship (regression function) between two or more variables
– Relation between dependent and independent variables: cause-effect
– Best fit: determining the mathematical model (regression equation) that best fits the data collected
– Bivariate (two variables) and multivariate (more than two variables)
– Linear (linear function of the parameters) and nonlinear regression (nonlinear function of the parameters, i.e. at least one of the derivatives of the regression function with respect to the parameters depends on at least one of the parameters)
• Regression:
– Regression equation:

$$Y = f(b_1, b_2, X_1, X_2)$$

Y : dependent variable; $X_1, X_2$ : independent variables / regressors;
$b_1, b_2$ : regression coefficients/parameters

Linear regression equations:

$$Y = b_0; \quad Y = b_0 + b_1 X; \quad Y = b_0 + b_1 X_1 + b_2 X_2; \quad Y = b_0 + b_1 X + b_2 X^2; \quad Y = b_0 + b_1 X_1 + b_2 X_1^2 + b_3 X_2 + b_4 X_2^2$$

Nonlinear regression equations:

$$Y = b_0 e^{b_1 X}; \quad Y = b_0 e^{b_1 X_1} e^{b_2 X_2}; \quad Y = b_0 X^{b_1}; \quad Y = \frac{1}{b_0 + b_1 X}$$
• Linear bivariate regression: linear in the parameters
• Normally determines the straight-line relation between the two variables that best fits the data collected
• Linear bivariate regression(simple regression):
$$Y = a + bX$$

Y : dependent variable;
X : independent/explanatory variable;
a : intercept; b : slope; a, b : regression coefficients;
{1, X} : regressors
– Estimate the slope and intercept so that this equation fits the data set
– Estimation technique: least squares method (minimize the sum of the squares of the differences between the observed values of the dependent variable and those predicted by the straight-line equation for the same values of the independent variable)
• Linear bivariate regression: least squares method

$$Y = a + bX$$

Define

$$\sum x_i^2 = \sum X_i^2 - n\bar{X}^2; \qquad \sum y_i^2 = \sum Y_i^2 - n\bar{Y}^2$$

and

$$b = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2} = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum x_i^2}; \qquad a = \bar{Y} - b\bar{X}; \qquad r = b\sqrt{\frac{\sum x_i^2}{\sum y_i^2}}$$
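A minimal sketch of the least squares estimates above (the points below are hypothetical and lie exactly on a line, so the fit recovers it):

```python
def fit_line(x, y):
    """Least squares: b = (sum(XY) - n*Xbar*Ybar) / (sum(X^2) - n*Xbar^2),
    a = Ybar - b*Xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum(a * c for a, c in zip(x, y)) - n * xbar * ybar) / \
        (sum(v * v for v in x) - n * xbar ** 2)
    return ybar - b * xbar, b   # (intercept a, slope b)

# Points on the exact line Y = 1 + 2X recover a = 1, b = 2:
print(fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9]))   # (1.0, 2.0)
```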
• Linear Multivariate Regression: straight line
equation with one dependent and n
independent variables
$$Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$$

Y : dependent variable; $X_i$ : independent variables;
a : intercept; $b_i$ : slope of $X_i$

Normal equations (least squares, k observations):

$$\sum_{i=1}^{k} Y_i = k a + b_1 \sum_{i=1}^{k} X_{1i} + b_2 \sum_{i=1}^{k} X_{2i} + \cdots + b_n \sum_{i=1}^{k} X_{ni}$$

$$\sum_{i=1}^{k} X_{1i} Y_i = a \sum_{i=1}^{k} X_{1i} + b_1 \sum_{i=1}^{k} X_{1i}^2 + b_2 \sum_{i=1}^{k} X_{1i} X_{2i} + \cdots + b_n \sum_{i=1}^{k} X_{1i} X_{ni}$$

$$\vdots$$

$$\sum_{i=1}^{k} X_{ni} Y_i = a \sum_{i=1}^{k} X_{ni} + b_1 \sum_{i=1}^{k} X_{1i} X_{ni} + b_2 \sum_{i=1}^{k} X_{2i} X_{ni} + \cdots + b_n \sum_{i=1}^{k} X_{ni}^2$$
• Linear Multivariate Regression: Matrix method
– Regression equation :
Y a b1 X 1 b2 X 2 bn X n
– Data set: data collected by performing the
experiment k times
Y     X_1    X_2    …   X_n
Y_1   X_11   X_21   …   X_n1
Y_2   X_12   X_22   …   X_n2
⋮     ⋮      ⋮          ⋮
Y_k   X_1k   X_2k   …   X_nk
• Linear Multivariate Regression: Matrix method
– Represent the data in the form of matrices and
determine coefficients of regression using least
square algorithm
$$Y = XB$$

$$X = \begin{bmatrix} 1 & X_{11} & X_{21} & \cdots & X_{n1} \\ 1 & X_{12} & X_{22} & \cdots & X_{n2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{1k} & X_{2k} & \cdots & X_{nk} \end{bmatrix}; \qquad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix}; \qquad B = \begin{bmatrix} a \\ b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

$$B = \left(X^T X\right)^{-1} X^T Y$$
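A pure-Python sketch of $B = (X^TX)^{-1}X^TY$: rather than forming an explicit inverse, it solves the normal equations $X^TX\,B = X^TY$ by Gaussian elimination (the data set is hypothetical and noise-free, so the true coefficients are recovered):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]                # pivot row swap
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back-substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def regress(X, Y):
    """B = (X^T X)^{-1} X^T Y; each row of X is [1, X1, ..., Xn]."""
    cols = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
           for i in range(cols)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(cols)]
    return solve(XtX, XtY)           # solves the normal equations

# Data generated exactly from Y = 1 + 2*X1 + 3*X2, so B should be [1, 2, 3]:
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
Y = [1, 3, 4, 6, 8]
print(regress(X, Y))
```

In practice a QR-based solver (e.g. `numpy.linalg.lstsq`) is numerically preferable to forming $X^TX$; the sketch above mirrors the matrix method as stated on the slide.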
• Nonlinear regression:
– Nonlinear in the parameters
– Transformation to linear form, e.g.

$$Y = \frac{b_0 X}{b_1 + X}$$

$$\frac{1}{Y} = \frac{b_1 + X}{b_0 X} = \frac{1}{b_0} + \frac{b_1}{b_0}\,\frac{1}{X}$$

i.e. $Y' = k_1 + k_2 X'$ with $Y' = 1/Y$, $X' = 1/X$, $k_1 = 1/b_0$, $k_2 = b_1/b_0$
• Multiple correlation:

$$R_{y.x_1x_2}^2 = r_{yx_1}^2 + \left(1 - r_{yx_1}^2\right) r_{yx_2.x_1}^2$$
• Partial correlation:
$$r_{yx_1.x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1x_2}}{\sqrt{1 - r_{yx_2}^2}\ \sqrt{1 - r_{x_1x_2}^2}}$$

$$r_{yx_2.x_1} = \frac{r_{yx_2} - r_{yx_1}\, r_{x_1x_2}}{\sqrt{1 - r_{yx_1}^2}\ \sqrt{1 - r_{x_1x_2}^2}}$$