
Statistical Analysis of Data

Statistical Data Analysis


• Data for research: Steps
– Data collection: raw data (surveys, interviews,
observations)
– Data processing: processed data (editing, coding,
classification, tabulation)
– Data analysis: computation of certain indices or
measures to analyze relationships among the data
groups, estimate unknown parameters, and test
hypotheses
Statistical Data Analysis
• Statistical data analysis:
– Application of statistical measures (methods) for
data analysis
– Descriptive statistics (development of certain
indices from the data); inferential statistics (parameter
estimation, hypothesis testing)
– Statistical Measures :
• Measures of central tendency (statistical averages)
• Measures of dispersion
• Measures of asymmetry (skewness, kurtosis)
• Measures of relationship
Statistical Data Analysis
• Statistical Measures:
– Measures of central tendency: arithmetic mean,
median, mode, geometric mean, harmonic mean
– Measures of dispersion: variance, standard deviation,
mean deviation, range, coefficient of standard deviation,
coefficient of variation
– Measures of asymmetry: skewness based on the mean, mode,
or median; method of moments
– Measures of relationship : Karl Pearson’s coefficient of
correlation, multiple correlation coefficient, partial
correlation coefficient, regression analysis
Statistical Data Analysis
• Central Tendency:
– Defines the point about which the data are
distributed
– Balance point (center point) of a distribution
– The median is the value for which the sum of
absolute deviations is smallest
– The mean is the value for which the sum of
squared deviations is smallest
– Arithmetic Mean:
• sum of values of all the items in a dataset
divided by the number of items. This is also
known as the arithmetic average.
Statistical Data Analysis
• Arithmetic Mean:
– For both continuous and discrete numeric data
– Cannot be calculated for categorical data, as the values cannot
be summed
– As the mean includes every value in the distribution the mean
is influenced by outliers (extreme or atypical) and skewed
distributions
– Used when direct quantitative measurement is possible
(e.g. economic data)
– Best measure for symmetrical data (includes all the values in
the data set for its calculation, and any change in any of the
scores will affect the value of the mean; not the case with the
median or mode.)
Statistical Data Analysis
• Arithmetic mean:
– Simple data:
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
– Weighted data:
\bar{X}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}
– Assumed-mean (shortcut) form:
\bar{X} = A + \frac{1}{n}\sum_{i=1}^{n} (X_i - A)

X_i: value of the i-th item; n: total number of items; A: assumed mean; w_i: weight of the i-th item

• Grouped data:
\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}; \quad \bar{X} = A + \frac{\sum_{i=1}^{n} f_i (x_i - A)}{\sum_{i=1}^{n} f_i}

f_i: frequency of the i-th group; x_i: midpoint of the i-th group; A: assumed mean
Statistical Data Analysis
• Median:
– Middle item of the series when it is arranged in ascending
or descending order of the magnitude
– If the number of items is odd, the median is the middle
value; if even, the median is the arithmetic mean of the
two middle values
– Positional average; used for ordinal (qualitative) data
such as intelligence ratings or academic grades
– Less affected by outliers than the mean; the best
measure for skewed data
– The median cannot be identified for
categorical (qualitative) nominal data, as it cannot be
logically ordered like business type, eye color, religion etc.
– Applicable to open ended data
Statistical Data Analysis
• Median:
– Ungrouped data, n odd: M = value of the \left(\frac{n+1}{2}\right)-th item
– Ungrouped data, n even: M = arithmetic mean of the two middle values
– Grouped data:
M = L + \frac{N/2 - F}{f}\, C

L: lower limit of the median class;
N: total of all frequencies;
F: cumulative frequency of the class preceding the median class;
C: width of the class interval;
f: frequency of the median class
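A minimal Python sketch of both median rules, with made-up numbers (not from the slides):

```python
def median_ungrouped(xs):
    """Middle item of the sorted series; mean of the two middle items if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_grouped(L, N, F, f, C):
    """M = L + (N/2 - F) / f * C, applied to the identified median class."""
    return L + (N / 2 - F) / f * C

print(median_ungrouped([7, 1, 9, 3, 5]))      # ≈ 5 (odd n: middle value)
print(median_ungrouped([1, 2, 3, 4]))         # ≈ 2.5 (even n: mean of 2 and 3)
print(median_grouped(20, 50, 15, 20, 10))     # ≈ 25.0 (hypothetical class data)
```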
Statistical Data Analysis
• Median for ordinal data:
– Identify the category associated with the
observation located at the middle of the
distribution
– Median class: "Moderate" (the category containing the
middle observation, closest to number 718, in a worked
example not reproduced here)
Statistical Data Analysis
• Mode:
– It is the most frequently or commonly occurring value in a
distribution
– Applicable to both quantitative (numerical ) and qualitative
(categorical ) data
– Not affected by outliers
– Only measure for nominal data
– Sometimes there is more than one mode in a
distribution (bi-modal or multi-modal). The presence
of more than one mode can limit the ability of the mode
to describe the centre or typical value of the distribution,
because a single value describing the centre cannot be
identified.
– Commonly used in the analysis of population preferences
(e.g. interest in a particular product)
Statistical Data Analysis
• Mode:
Ungrouped data: Z = the most frequently occurring value
Grouped data:
Z = L + \frac{f_1}{f_1 + f_2}\, C

L: lower limit of the modal class;
f_1: absolute difference between the frequencies of the modal class and the preceding class;
f_2: absolute difference between the frequencies of the modal class and the succeeding class;
C: width of the class interval
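A sketch of both mode rules in Python (illustrative data, not from the slides); the ungrouped version returns a list because data may be multi-modal:

```python
from collections import Counter

def mode_ungrouped(xs):
    """Most frequently occurring value(s); a list, since data may be multi-modal."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def mode_grouped(L, f_modal, f_prev, f_next, C):
    """Z = L + f1 / (f1 + f2) * C, with f1, f2 the absolute frequency differences."""
    f1 = abs(f_modal - f_prev)
    f2 = abs(f_modal - f_next)
    return L + f1 / (f1 + f2) * C

print(mode_ungrouped([1, 2, 2, 3, 3, 3]))   # ≈ [3]
print(mode_ungrouped([1, 1, 2, 2]))         # ≈ [1, 2] (bi-modal)
print(mode_grouped(10, 12, 8, 6, 5))        # ≈ 12.0 (hypothetical class data)
```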
Statistical Data Analysis
• Geometric mean:
– The n-th root of the product of the values of the n
items in a distribution
– Central-tendency measure for exponential-type
distributions; used to average a quantity expressed
as relative change with respect to some reference or
over time
– Applications: index number preparation; population growth

GM = (x_1 x_2 \cdots x_n)^{1/n}
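A small Python sketch of the geometric mean (the growth-rate figures are made up for illustration):

```python
import math

def geometric_mean(xs):
    """GM = (x1 * x2 * ... * xn) ** (1/n); all values must be positive."""
    return math.prod(xs) ** (1 / len(xs))

# Averaging relative change: growth factors of 1.10 and 1.21 over two years
print(geometric_mean([1.10, 1.21]))  # ≈ 1.1537, the average annual growth factor
print(geometric_mean([2, 8]))        # ≈ 4.0
```

The arithmetic mean of the two growth factors (1.155) would overstate the compounded growth; this is why the geometric mean is preferred for relative changes.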
Statistical Data Analysis
• Harmonic Mean:
– Used as a central-tendency measure when times,
ratios, and rates are involved (production per day,
distance travelled per hour)
– The harmonic mean tends strongly toward the smallest
elements of a distribution; compared with the
arithmetic mean, it mitigates the impact of large
outliers and aggravates the impact of small ones

HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}
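A one-function Python sketch, using the classic average-speed example (numbers invented for illustration):

```python
def harmonic_mean(xs):
    """HM = n / (1/x1 + 1/x2 + ... + 1/xn)."""
    return len(xs) / sum(1 / x for x in xs)

# Average speed over two equal-distance legs driven at 40 and 60 km/h
print(harmonic_mean([40, 60]))  # ≈ 48.0, not the arithmetic mean of 50
```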
Statistical Data Analysis
• Measures of dispersion (variation):
– Scatter of item values around the average
– Measure the variability of the data: the extent to
which data points in a statistical distribution or data
set diverge from the average (mean) value, and the
extent to which they differ from each other
– Used in risk analysis (volatility) in investment
– Range, variance, standard deviation
Statistical Data Analysis
• Range:
– The range refers to the difference between the
largest and smallest data value in a distribution
– Idea about variability; affected by fluctuations in
sampling ; rough measure of variability
R H  L
H L
coeff .of range 
H L
Statistical Data Analysis
• Mean deviation: average of the absolute deviations of
the data points from an average of the series

Ungrouped data:
\delta_{\bar{x}} = \frac{\sum |x_i - \bar{x}|}{n}; \quad \delta_M = \frac{\sum |x_i - M|}{n}; \quad \delta_Z = \frac{\sum |x_i - Z|}{n}

Grouped data (n = \sum f_i):
\delta_{\bar{x}} = \frac{\sum f_i |x_i - \bar{x}|}{n}; \quad \delta_M = \frac{\sum f_i |x_i - M|}{n}; \quad \delta_Z = \frac{\sum f_i |x_i - Z|}{n}

Coefficients:
\text{coeff } \delta_{\bar{x}} = \frac{\delta_{\bar{x}}}{\bar{x}}; \quad \text{coeff } \delta_M = \frac{\delta_M}{M}; \quad \text{coeff } \delta_Z = \frac{\delta_Z}{Z}

\delta_{\bar{x}}, \delta_M, \delta_Z: mean deviation from the mean, median, and mode
Statistical Data Analysis
• Standard Deviation:
– Standard deviation is a measure of the dispersion of a set of data from
its mean
– If the data points are farther from the mean, there is higher deviation
within the data set
– Standard deviation is calculated as the square root of variance by
determining the variation between each data point relative to the mean
– E.g. in financial analysis, standard deviation is applied to the
annual rate of return of an investment to measure the investment's
volatility: a volatile return has a high standard deviation, while
a stable return has a low one. A large dispersion indicates how
much the return on the fund deviates from the expected normal
returns.
– In physical experiments, standard deviation provides a way to check
the results. A large standard deviation suggests the experiment is
faulty: either there is too much noise from outside or there is a
fault in the measuring instrument
– Amenable to mathematical manipulation, since algebraic signs are
squared rather than ignored (unlike in the mean deviation); less
affected by fluctuations in sampling
Statistical Data Analysis
• Standard deviation:
Ungrouped data:
\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}
Ungrouped data with frequencies:
\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}
Grouped data (computational form):
\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2} = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2}
Sample standard deviation:
s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}
Statistical Data Analysis
• Standard deviation (assumed-mean forms) and variance:
Ungrouped data:
\sigma = \sqrt{\frac{\sum (x_i - A)^2}{n} - \left(\frac{\sum (x_i - A)}{n}\right)^2}
Grouped data:
\sigma = \sqrt{\frac{\sum f_i (x_i - A)^2}{\sum f_i} - \left(\frac{\sum f_i (x_i - A)}{\sum f_i}\right)^2}
Variance:
\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}
\text{coeff. of SD} = \frac{\sigma}{\bar{x}}; \quad \text{coeff. of variation} = \frac{\sigma}{\bar{x}} \times 100
Statistical Data Analysis
• Skewness:
– Asymmetry in data distribution; manner in which data is
clustered around an average
– Symmetric data: zero skewness (e.g. the normal
distribution)
– Asymmetric (skewed) data: right (positively) or left
(negatively) skewed
Statistical Data Analysis
• Skewness:
– Coefficient of skewness: the amount by which the
balance exceeds on one side (typically between -1 and +1)
– Karl Pearson's coefficient of skewness:
CS = \frac{\bar{x} - Z}{\sigma}
– Using the empirical relation mode = 3(median) - 2(mean):
CS = \frac{3(\bar{x} - M)}{\sigma}
Statistical Data Analysis
• Measures of relationship (correlation and regression):
– For bivariate or multivariate analysis
– Does there exist any association or correlation between
two (or more ) variables? If yes, of what degree? :
correlation
– Is there any cause-effect relationship (independent-
dependent) between the variables? If yes, of what degree
and in which direction? : regression
– Bivariate: Cross-tabulation; Charles Spearman's coeff. of
correlation, Karl Pearson’s coeff. of correlation, simple
regression
– Multivariate: coeff. of multiple correlation, coeff. of partial
correlation, multiple regression
Statistical Data Analysis
• TYPES OF CORRELATION
– Positive correlation: the two variables move in the same
direction, i.e. when one variable increases the other also
increases, and when one decreases the other also decreases.
E.g. the age and height of a child.
– Negative correlation: the two variables move in opposite
directions, i.e. when one increases the other decreases, and
vice versa. E.g. the price and demand of a normal good.
– Zero correlation: there is no relationship between the two
variables; they are independent
Statistical Data Analysis
• Correlation coefficient:
– Statistical correlation is measured by coefficient of
correlation (r). Its numerical value ranges from +1.0 to -1.0.
– It gives an indication of the strength of relationship. In
general, r > 0 indicates positive relationship, r < 0 indicates
negative relationship while r = 0 indicates no relationship
(or that the variables are independent and not related).
– r = +1.0 : perfect positive correlation and r = -1.0 : perfect
negative correlation. Closer the coefficients are to +1.0 and
-1.0, greater is the strength of the relationship between the
variables.
– |r| from 0.5 to 1.0: strong; 0.3 to 0.5: moderate;
0.1 to 0.3: weak; below 0.1: none or very weak
Statistical Data Analysis
• Correlation coefficient:
– Correlation is only appropriate for examining the relationship
between meaningful quantifiable data (e.g. air pressure,
temperature) rather than categorical data such as gender and
favorite color etc.
– Correlation coefficients normally measure linear
relationships. It is therefore possible for r to be close
to 0, or even 0, while a strong nonlinear relationship
exists between the variables. In such a case, a scatter
diagram can roughly indicate whether a nonlinear
relationship exists.
– r does not provide any information about cause- effect
relationship but simply states that the variables X and Y are
related. Statistical correlation should not be the primary tool
used to study causation, because of the problem with third
variables.
Statistical Data Analysis
• Correlation coefficient:
– Charles Spearman’s coeff. of correlation:
• for ordinal data with ranking (rank correlation)
• measures similarity/ dissimilarity between two sets of
ranking
• e.g. ranking of products by two persons; ranking of
employees in two skill tests
r_s = 1 - \frac{6\sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)}

x_i, y_i: rank of the i-th observation/item in x and y;
n: number of observations
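The rank-correlation formula in Python, using two invented sets of rankings (e.g. two judges ranking the same four products):

```python
def spearman_rank(rx, ry):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman_rank([1, 2, 3, 4], [2, 1, 4, 3]))  # ≈ 0.6  -> broadly similar rankings
print(spearman_rank([1, 2, 3, 4], [4, 3, 2, 1]))  # ≈ -1.0 -> perfectly opposed
```

This simple form assumes no tied ranks; ties require averaged ranks and a correction term.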
Statistical Data Analysis
• Correlation coefficient:
– Karl Pearson's coefficient: product-moment
correlation
• Assumes a linear relation between the variables; a causal
(dependent-independent) relation; variables follow a
normal distribution

r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n\, \sigma_X\, \sigma_Y} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}

r = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum X_i^2 - n\bar{X}^2}\, \sqrt{\sum Y_i^2 - n\bar{Y}^2}} = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{n\sum X_i^2 - (\sum X_i)^2}\, \sqrt{n\sum Y_i^2 - (\sum Y_i)^2}}
Statistical Data Analysis
• Correlation coefficient:
– Karl Pearson's coefficient for grouped (bivariate frequency) data:

r = \frac{N \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} X_i Y_j - \left(\sum_{i=1}^{m} f_i X_i\right)\left(\sum_{j=1}^{n} f_j Y_j\right)}{\sqrt{N \sum_{i=1}^{m} f_i X_i^2 - \left(\sum_{i=1}^{m} f_i X_i\right)^2}\ \sqrt{N \sum_{j=1}^{n} f_j Y_j^2 - \left(\sum_{j=1}^{n} f_j Y_j\right)^2}}

X_i, Y_j: midpoints of the i-th and j-th class intervals of X and Y
a_{ij}: frequency of observations in the (i, j)-th cell; N = \sum_i \sum_j a_{ij}; f_i = \sum_j a_{ij}; f_j = \sum_i a_{ij}
For computation, deviations from assumed means may be used: x_i = X_i - A_x; y_j = Y_j - A_y
Statistical Data Analysis
• Regression:
– Determination of statistical relationship (regression
function) between two or more variables
– Relation between dependent and independent variable :
cause-effect
– Best fit theory: determining the mathematical model
(regression equation )that fits best to the data collected
– Bivariate (two variables)and multivariate (more than two
variables )
– Linear regression (regression function linear in the
parameters) and nonlinear regression (nonlinear in the
parameters, i.e. at least one derivative of the regression
function with respect to the parameters depends on at
least one of the parameters)
Statistical Data Analysis
• Regression:
– Regression equation:
Y = f(b_1, b_2, X_1, X_2)
Y: dependent variable; X_1, X_2: independent variables/regressors;
b_1, b_2: regression coefficients/parameters

Linear regression equations (linear in the parameters):
Y = b_0
Y = b_0 + b_1 X
Y = b_0 + b_1 X_1 + b_2 X_2
Y = b_0 + b_1 X + b_2 X^2
Y = b_0 + b_1 X_1 + b_2 X_1^2 + b_3 X_2 + b_4 X_2^2

Nonlinear regression equations:
Y = b_0 e^{b_1 X}
Y = b_0 e^{b_1 X_1} e^{b_2 X_2}
Y = b_0 X^{b_1}
Y = \frac{1}{b_0 + b_1 X}
Statistical Data Analysis
• Linear bivariate regression: linear in the parameters
• Normally determines the straight-line relation between
the two variables that best fits the data collected
Statistical Data Analysis
• Linear bivariate regression(simple regression):
Y a  bX
Y : dependent variable;
X : independent/explanatory variable
a : intercept; b : slope; a,b : regression coeff.
{1,X}: regressor
– Estimate the slope and intercept so that this equation
fits to the data set
– Estimation technique: least square method (minimize
the sum of the squares of the differences between the
observed values of dependent variable and those
predicted by the straight line equation for the same
set of independent variable )
Statistical Data Analysis
• Linear bivariate regression: Least square method

Y a  bX
define
 x  X
2
i i
2
 nX ;  y  Yi  nY
2 2
i
2 2

and

 X Y  nXY
i i
b x 2
i
b , a Y  bX , r 
 X  nXi
2 2
y 2
i
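The least-squares slope and intercept sketched in Python; the four data points are invented to lie exactly on a line so the fit is easy to check:

```python
def fit_line(xs, ys):
    """Least-squares estimates: b = S_xy / S_xx, a = Ybar - b * Xbar."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)   # ≈ 1.0, 2.0  ->  Y = 1 + 2X fits these points exactly
```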
Statistical Data Analysis
• Linear multivariate regression: straight-line
equation with one dependent and n independent
variables
Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n
Y: dependent variable; X_i: independent variables;
a: intercept; b_i: slope (coefficient) of X_i

– Using the data set, construct the normal equations and
solve for the coefficients, or use the matrix method
– Associated with the problem of multicollinearity
(distortion of regression coefficients due to
correlation between the independent variables)
Statistical Data Analysis
• Linear Multivariate Regression: set of
equations
\sum Y_i = na + b_1 \sum X_{1i} + b_2 \sum X_{2i} + \cdots + b_n \sum X_{ni}
\sum X_{1i} Y_i = a \sum X_{1i} + b_1 \sum X_{1i}^2 + b_2 \sum X_{1i} X_{2i} + \cdots + b_n \sum X_{1i} X_{ni}
\vdots
\sum X_{ni} Y_i = a \sum X_{ni} + b_1 \sum X_{1i} X_{ni} + b_2 \sum X_{2i} X_{ni} + \cdots + b_n \sum X_{ni}^2
(all sums run over the observations i = 1, \dots, n)
Statistical Data Analysis
• Linear Multivariate Regression: Matrix method
– Regression equation :
Y a  b1 X 1  b2 X 2    bn X n
– Data set: data collected by performing the
experiment k times

Y X1 X2  Xn
Y1 X 11 X 21  X n1
Y2 X 12 X 22  X n2
    
Yk X 1k X 2k  X nk
Statistical Data Analysis
• Linear Multivariate Regression: Matrix method
– Represent the data in the form of matrices and
determine coefficients of regression using least
square algorithm
Y = XB

X = \begin{bmatrix} 1 & X_{11} & X_{21} & \cdots & X_{n1} \\ 1 & X_{12} & X_{22} & \cdots & X_{n2} \\ \vdots & & & & \vdots \\ 1 & X_{1k} & X_{2k} & \cdots & X_{nk} \end{bmatrix}; \quad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix}; \quad B = \begin{bmatrix} a \\ b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}

B = (X^T X)^{-1} X^T Y
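A minimal NumPy sketch of the matrix method; the data set is hypothetical (generated from Y = 1 + 2·X1 + 3·X2), so the solution is known in advance:

```python
import numpy as np

# Hypothetical data set: k = 4 runs, n = 2 regressors
X = np.array([[1.0, 0.0, 1.0],   # each row: [1, X1, X2]; leading 1 for the intercept a
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
Y = np.array([4.0, 3.0, 11.0, 10.0])

# B = (X^T X)^-1 (X^T Y): solve the normal equations rather than inverting explicitly
B = np.linalg.solve(X.T @ X, X.T @ Y)
print(B)   # ≈ [1. 2. 3.] -> a = 1, b1 = 2, b2 = 3
```

Solving the normal equations with `np.linalg.solve` (or using `np.linalg.lstsq` directly) is numerically preferable to forming the inverse of XᵀX.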
Statistical Data Analysis
• Nonlinear regression:
– Nonlinear in parameter
– Transformation to linear form, e.g.
Y = \frac{b_0 X}{b_1 + X}
\frac{1}{Y} = \frac{b_1 + X}{b_0 X} = \frac{b_1}{b_0}\cdot\frac{1}{X} + \frac{1}{b_0}
i.e. Y' = k_1 + k_2 X' with Y' = 1/Y, X' = 1/X, k_1 = 1/b_0, k_2 = b_1/b_0
– Where no explicit linearizing transformation exists, the
least-squares method is applied in iterative form
– Example: concentration of a chemical in a solution and
intensity of transmitted light
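The linearizing transformation can be sketched end to end in Python: generate noise-free data from Y = b0·X/(b1 + X) with chosen (made-up) parameters, fit a straight line in the transformed variables, and invert the transformation:

```python
def fit_line(xs, ys):
    """Simple least squares: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Y = b0*X / (b1 + X)  ->  1/Y = 1/b0 + (b1/b0) * (1/X): linear in X' = 1/X, Y' = 1/Y
b0, b1 = 2.0, 4.0                              # assumed "true" parameters
xs = [1.0, 2.0, 4.0, 8.0]
ys = [b0 * x / (b1 + x) for x in xs]           # synthetic, noise-free data

k1, k2 = fit_line([1 / x for x in xs], [1 / y for y in ys])
b0_est, b1_est = 1 / k1, k2 / k1               # invert: b0 = 1/k1, b1 = k2/k1
print(b0_est, b1_est)   # ≈ 2.0, 4.0 (recovered up to floating-point error)
```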
Statistical Data Analysis
• Multiple correlation:
– Collective effect of X_1 and X_2 on Y:
R_{y \cdot x_1 x_2}^2 = \frac{b_1\left(\sum Y_i X_{1i} - n\bar{Y}\bar{X}_1\right) + b_2\left(\sum Y_i X_{2i} - n\bar{Y}\bar{X}_2\right)}{\sum Y_i^2 - n\bar{Y}^2}
– Partial correlation: relation between the dependent
variable and a particular independent variable,
holding all other variables constant
r_{yx_1 \cdot x_2}^2 = \frac{R_{y \cdot x_1 x_2}^2 - r_{yx_2}^2}{1 - r_{yx_2}^2} \quad \text{(contribution of } x_1 \text{ to } Y\text{)}
r_{yx_2 \cdot x_1}^2 = \frac{R_{y \cdot x_1 x_2}^2 - r_{yx_1}^2}{1 - r_{yx_1}^2} \quad \text{(contribution of } x_2 \text{ to } Y\text{)}
Statistical Data Analysis
• Partial correlation (in terms of the simple correlations):
r_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{1 - r_{yx_2}^2}\,\sqrt{1 - r_{x_1 x_2}^2}}
r_{yx_2 \cdot x_1} = \frac{r_{yx_2} - r_{yx_1}\, r_{x_1 x_2}}{\sqrt{1 - r_{yx_1}^2}\,\sqrt{1 - r_{x_1 x_2}^2}}
