Statistical Evaluation of Big Data


Statistical modelling

• Statistical modelling is the process of applying statistical analysis to a
dataset. A statistical model is a mathematical representation (or
mathematical model) of observed data.
• When data analysts apply various statistical models to the data they are
investigating, they can understand and interpret the information
more strategically.
• This practice allows them to identify relationships between variables, make
predictions about future sets of data, and visualize that data so that
non-analysts and stakeholders can consume and leverage it.
• It also leads to better preparation of data for analysis.

Most common techniques fall into the following two groups:


• Supervised learning, including regression and classification models.
• Unsupervised learning, including clustering algorithms and association
rules.
Big data moves the evaluation of results closer to the idealized
model.
• A statistical model embodies a set of statistical
assumptions concerning the generation of sample data (and similar
data from a larger population). It represents, often in considerably
idealized form, the data-generating process.
Idealized Model
New cases will be drawn randomly from the same population. Cases
need not have unique measurements, i.e. two cases can be identical.
This model is very useful for evaluating hypotheses and experimental procedures.
Random sampling: a random sample is a subset of a statistical
population in which each member of the population has an equal
probability of being chosen.

Random samples are used to avoid bias and other unwanted effects.

[Figure: a population with members numbered 1 to 9, from which the random sample 2, 5, 7 is drawn]
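A minimal sketch of simple random sampling, assuming a hypothetical population of nine members labelled 1 to 9 and a sample size of 3. The fixed seed is only there to make reruns repeatable; it is not part of the technique.

```python
import random

# Hypothetical population: nine members labelled 1..9.
population = list(range(1, 10))

# Seeded generator so the example is repeatable (an assumption for the demo).
rng = random.Random(42)

# rng.sample draws without replacement; every member of the population
# has the same probability of ending up in the sample.
sample = rng.sample(population, k=3)
print("random sample:", sample)
```

Changing the seed gives a different sample, but each member of the population is still equally likely to be chosen.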
Classical Statistical Comparison and Evaluation
• The most widely used evaluation technique is hypothesis
testing.
• A statistical hypothesis is a hypothesis that is testable
on the basis of observed data modelled as the realised
values taken by a collection of random variables.
• This method is also used to evaluate the results of polls and
medical studies.
• Measures to evaluate: the mean (the average value) and the
variance (the average squared difference from the mean).
• These measures may not be adequate for prediction, but they
can estimate the shape of the population curve from
the sampling data.
Sampling distribution
Suppose that we draw all possible samples of size n from a given
population and compute a statistic (mean, proportion,
standard deviation, etc.) for each sample. The probability
distribution of this statistic is called its sampling distribution.
• The standard deviation of this statistic is called the standard error (se).
• That is, the sampling distribution shows every possible result a statistic can
take in every possible sample from a population, and how often each
result happens.
• The variance of the sample mean is var/n, where var is the sample
variance and n is the number of cases in the sample.
• The square root of this sampling variance is the standard error: se = sqrt(var/n).
• E.g. two independent samples, A and B, are taken. The means of
the two samples are determined, and we check whether the difference in
means is significant.
• The significance test uses sample A with n1 cases and sample B with n2
cases. The difference is measured in numbers of standard errors (se):
se(A-B) = sqrt(var(A)/n1 + var(B)/n2)
The difference is significant if |mean(A) - mean(B)| > sig * se(A-B),
where the sig value is typically 2.
• The hypothesis testing model tells us whether the difference between two
hypotheses can be attributed to chance.
• Comparisons of predictive performance can be made using the hypothesis
testing model.
• If the difference in means exceeds sig standard errors, it is likely to be
statistically significant.
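The standard-error bullets above can be checked empirically. The sketch below is an illustration under stated assumptions: instead of enumerating all possible samples, it draws 2,000 random samples of size 25 (with replacement) from a hypothetical, synthetic population and compares the spread of the sample means with the theoretical value sqrt(var/n).

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed, only for repeatability

# Hypothetical population of 10,000 measurements (synthetic data).
population = [rng.gauss(50, 10) for _ in range(10_000)]

n = 25             # cases per sample
num_samples = 2_000  # approximation to "all possible samples"

# Sampling distribution of the mean, approximated by repeated sampling.
sample_means = [mean(rng.choices(population, k=n)) for _ in range(num_samples)]

# Standard error observed in the simulation vs. the value from the formula.
empirical_se = pstdev(sample_means)
theoretical_se = pstdev(population) / sqrt(n)

print(f"empirical se   = {empirical_se:.3f}")
print(f"theoretical se = {theoretical_se:.3f}")
```

The two values should agree closely. In practice the population variance is unknown, so the sample variance stands in for it in var/n.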
E.g. Significance testing

The standard deviation (S.D.) indicates how far the individual responses to a
question vary or deviate from the mean.
Example

Respondent   Sample A (Product Quality)   Sample B (Product Reliability)
T1           3                            1
T2           3                            1
T3           3                            1
T4           3                            1
T5           4                            5
T6           4                            5
T7           3                            5
T8           3                            5
T9           3                            5
T10          3                            5
Mean for sample A = 3.2
Var(A) = 0.18
S.D. for sample A = 0.4
Mean for sample B = 3.4
Var(B) = 4.27
S.D. for sample B = 2.1
se(A-B) = sqrt(Var(A)/10 + Var(B)/10) = sqrt(0.018 + 0.427) = 0.67
|Mean(A) - Mean(B)| / se(A-B) = 0.2 / 0.67 = 0.3, which is not greater than sig (sig value typically 2)

A difference in means is likely to be statistically significant only when it
exceeds sig standard errors. Here the difference in means is minimal: at about
0.3 standard errors it is well below the sig value, hence it is not
statistically significant.
• Sample A has 5 cases with values 160, 150, 165, 170, 140, and sample B
has 5 cases with values 100, 90, 110, 95, 120. Find the sample variance
and standard error, and test the significance.
Sample A Sample B
160 100
150 90
165 110
170 95
140 120

Steps:
Mean of samples A and B
Var and S.D. of each sample
Hypothesis testing with sig = 2
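The exercise can be worked through with a short script. This is a sketch, assuming the sample variance uses the n-1 divisor (which is what `statistics.variance` computes) and that the difference is judged significant when it exceeds sig standard errors. The function name `significance` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def significance(a, b, sig=2.0):
    """Return (|mean(A) - mean(B)|, se(A-B), is_significant)."""
    diff = abs(mean(a) - mean(b))
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff, se, diff > sig * se

sample_a = [160, 150, 165, 170, 140]
sample_b = [100, 90, 110, 95, 120]

diff, se, significant = significance(sample_a, sample_b)
print(f"mean(A) = {mean(sample_a)}, mean(B) = {mean(sample_b)}")
print(f"var(A) = {variance(sample_a)}, var(B) = {variance(sample_b)}")
print(f"se(A-B) = {se:.2f}")
print(f"difference = {diff}, significant at sig=2: {significant}")
```

With these values the difference in means is about 7 standard errors, far above sig = 2, so the difference is significant.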
Predicting true-or-false: classification
Classification is the most common application of
computer-based prediction.
The typical problem is to distinguish between two
classes.
Error rates
Performance is measured by keeping track of the number
of mistakes that are made on sample cases.
The sample error rate (erate) is the percentage of
classifications that are incorrect. It is given by:
erate = (number of errors / number of cases) * 100
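A minimal sketch of the error-rate calculation. The labels below are hypothetical, and `error_rate` is an illustrative helper name, not part of any library.

```python
def error_rate(actual, predicted):
    """Percentage of cases whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors * 100 / len(actual)

# Hypothetical two-class problem with five test cases.
actual = [True, False, True, True, False]
predicted = [True, True, True, False, False]

erate = error_rate(actual, predicted)
print(f"erate = {erate}%")
```

Two of the five predictions are wrong, so the error rate is 40%.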

Forecasting Numbers: Regression


Regression is also called function approximation. The
objective is to predict a number. These numbers can be
real numbers or ordered numbers, not categories or
labels.
The performance can be measured in terms of distance from the true
value.

Line of best fit: the line that minimizes the distance between each
individual point and the regression line.
Distance Measures
The objective of regression is to minimize the distance between the
true (observed) value for case i, yi, and the predicted value, yi'.
Two measures of distance are commonly used: the classical regression
measure, mean squared error (mse), and mean absolute distance
(mad):

mse = (1/n) * sum over i of (yi - yi')^2
mad = (1/n) * sum over i of |yi - yi'|

where n is the number of cases.
The mean absolute distance is the more intuitive measure and is less
sensitive to outliers.
The square root of mse (rmse) is never smaller than mad and is usually
slightly larger.
Example
Calculate the distance measures for the following data:

Month (half-yearly sales)   Observed value (in lakhs)   Predicted value (in lakhs)
January                     125                         128
February                    132                         117
March                       115                         105
April                       137                         125
May                         122                         126
June                        130                         138

mse = (1/6) {(125-128)^2 + (132-117)^2 + ………….}
mad = (1/6) {|125-128| + |132-117| + ………….}
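The two sums can be evaluated directly from the table. This is a sketch that uses only the six observed and predicted values listed above; the variable names are illustrative.

```python
from math import sqrt

observed = [125, 132, 115, 137, 122, 130]   # January .. June, in lakhs
predicted = [128, 117, 105, 125, 126, 138]

n = len(observed)

# Mean squared error: average of the squared differences.
mse = sum((y - yp) ** 2 for y, yp in zip(observed, predicted)) / n

# Mean absolute distance: average of the absolute differences.
mad = sum(abs(y - yp) for y, yp in zip(observed, predicted)) / n

# Root mean squared error, in the same units as the data.
rmse = sqrt(mse)

print(f"mse = {mse:.2f}, mad = {mad:.2f}, rmse = {rmse:.2f}")
```

Here the squared differences sum to 558 and the absolute differences to 52, so mse = 93 and mad is about 8.67; rmse is about 9.64, slightly larger than mad.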
• Computing error measures and moving average

Week   Sales   3MA Forecast   Error = Actual - Forecast   |Error|   Error^2   |%Error|
1      39
2      44
3      40
4      45      41             45-41=4                     |4|=4     4^2=16    4/45=8.89%
5      38      43             -5                          |-5|=5    25        5/38=13.16%
6      43      41             2                           2         4         2/43=4.65%
7      39      42             -3                          3         9         3/39=7.69%
8              40
Total                                                     14        54        34.39%

MAD = 14/4 = 3.5
MSE = 54/4 = 13.5
MAPE = 34.39%/4 = 8.60%

Moving average for 4th week = (39+44+40)/3 =41


Moving average for 5th week = (44+40+45)/3 =43
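The whole table can be reproduced with a few lines of code. This is a sketch, assuming a 3-week moving average in which the forecast for a week is the mean of the three preceding weeks of sales; the variable names are illustrative.

```python
sales = [39, 44, 40, 45, 38, 43, 39]   # weeks 1..7
window = 3

# forecasts[k] is the forecast for week window + 1 + k (weeks 4..8).
forecasts = [sum(sales[i - window:i]) / window
             for i in range(window, len(sales) + 1)]

# Only weeks 4..7 have both an actual value and a forecast;
# zip stops before the week-8 forecast, which has no actual yet.
actuals = sales[window:]
errors = [a - f for a, f in zip(actuals, forecasts)]

mad = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
mape = sum(abs(e) / a for e, a in zip(errors, actuals)) / len(errors) * 100

print("forecasts (weeks 4-8):", forecasts)
print(f"MAD = {mad}, MSE = {mse}, MAPE = {mape:.2f}%")
```

The last forecast in the list is the week-8 value of 40, which cannot be scored until the week-8 sales figure is known.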
Forecasting Example
Measuring Predictive Performance
The ideal model samples randomly from the population and measures
performance in terms of mean error, error rate or distance.
The objective is correct prediction on future cases.
Random Training and Testing
Error estimate
For the ideal model, performance is measured in terms of mean errors on sample
test cases.

Technically, the error rate for classification is a proportion, but in large samples the
error rate is equivalent to a mean.

For regression, we extrapolate this analysis of sample mean error and variance:

merr = (1/n) * sum over i of erri

where merr is the mean error measure, either mad or mse, and erri is the error
distance for case i, either |yi - yi'| or (yi - yi')^2.
Comparing Results for Error Measures
Performance is measured in terms of mean error (merr) on independent test
cases (A and B).
To compare two results, for example the results of two different prediction
methods, the standard hypothesis testing model can determine whether the
difference between merr(A) and merr(B) is significant.
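A sketch of such a comparison, assuming absolute error distances (mad-style) and sig = 2. Test set A reuses the half-yearly sales figures from the regression example; test set B and its predictions are hypothetical values invented for illustration, as are the helper names.

```python
from math import sqrt
from statistics import mean, variance

def abs_errors(observed, predicted):
    """Per-case error distances erri = |yi - yi'|."""
    return [abs(y - yp) for y, yp in zip(observed, predicted)]

def compare_merr(err_a, err_b, sig=2.0):
    """Return (merr(A), merr(B), whether the difference is significant)."""
    merr_a, merr_b = mean(err_a), mean(err_b)
    se = sqrt(variance(err_a) / len(err_a) + variance(err_b) / len(err_b))
    return merr_a, merr_b, abs(merr_a - merr_b) > sig * se

# Test set A: observed and predicted sales from the regression example.
observed_a = [125, 132, 115, 137, 122, 130]
method_a = [128, 117, 105, 125, 126, 138]

# Test set B: hypothetical independent cases and a second method's predictions.
observed_b = [120, 135, 118, 140, 125, 128]
method_b = [122, 131, 115, 135, 124, 125]

merr_a, merr_b, significant = compare_merr(
    abs_errors(observed_a, method_a), abs_errors(observed_b, method_b)
)
print(f"merr(A) = {merr_a:.2f}, merr(B) = {merr_b:.2f}, significant: {significant}")
```

Passing squared errors instead of absolute errors turns the same comparison into one based on mse.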
