Statistical Evaluation of Big Data

Uploaded by

Shivanee Ningthoujam

Statistical modelling

• Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a mathematical representation (or mathematical model) of observed data.
• When data analysts apply various statistical models to the data they are
investigating, they are able to understand and interpret the information
more strategically.
• This practice allows analysts to identify relationships between variables, make predictions about future sets of data, and visualize that data so that non-analysts and stakeholders can consume and leverage it.
• Better preparation of data for analysis.

Most common techniques fall into the following two groups:

• Supervised learning, including regression and classification models.
• Unsupervised learning, including clustering algorithms and association
rules.
Big data moves the evaluation of results closer to the idealized model.
• A statistical model embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). It represents, often in considerably idealized form, the data-generating process.
Idealized Model
New cases will be drawn randomly from the same population. Cases need not have unique measurements, i.e. two cases can be identical.
The idealized model is very useful for evaluating hypotheses and experimental procedures.
Random sampling: a random sample is a subset of a statistical population in which each member of the population has an equal probability of being chosen.

Random samples are used to avoid bias and other unwanted effects.
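
As a minimal sketch of random sampling, the snippet below draws a sample in which every member of the population is equally likely to be chosen. The population of nine numbered members, the sample size and the helper name `draw_random_sample` are illustrative assumptions, not part of the original slides.

```python
import random


def draw_random_sample(population, k, seed=None):
    """Return k distinct members; every member is equally likely to be picked."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical population: members numbered 1..9.
population = list(range(1, 10))
sample = draw_random_sample(population, 3, seed=0)
print(sample)
```

Fixing the seed is only for reproducibility; leaving it as `None` gives a different sample on each run.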

Figure: a population of nine members (1 to 9), from which the members 2, 5 and 7 are drawn as a random sample.
Classical Statistical Comparison and Evaluation
• The most widely used evaluation technique is hypothesis testing.
• A statistical hypothesis is a hypothesis that is testable on the basis of observed data modelled as the realised values taken by a collection of random variables.
• This method is also used to evaluate results of polls and
medical studies.
• Measures to evaluate: the average value (mean), the average squared difference from the mean (variance), etc.
• These measures may not be adequate for prediction but
can estimate the shape of the population curve from
the sampling data.
Sampling distribution
Suppose that we draw all possible samples of size n from a given population and compute a statistic (mean, proportion, standard deviation, etc.) for each sample. The probability distribution of this statistic is called the sampling distribution.
• The standard deviation of this statistic is called the standard error (se).
• I.e. the sampling distribution shows every possible result a statistic can take in every possible sample from a population, and how often each result happens.
• The variance of the sample mean is var/n, where var is the sample variance and n is the number of cases in the sample.
• The square root of this sampling variance is the standard error of the mean.
• E.g. two independent samples, A and B, are taken. The means of the two samples are computed and we check whether the difference in means is significant.
• The significance test is computed from sample A with n1 cases and sample B with n2 cases. The difference in means is measured in numbers of standard errors (se) and compared with a threshold sig, typically 2:

$$se(A-B) = \sqrt{\frac{var(A)}{n_1} + \frac{var(B)}{n_2}}, \qquad \frac{|mean(A) - mean(B)|}{se(A-B)} > sig$$

• The hypothesis testing model tells us whether the difference between two results can be attributed to chance.
• Comparisons of predictive performance can be made using the hypothesis testing model.
• With large samples the standard error is small, so even a small difference in means is likely to be statistically significant.
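
The relationship between the sampling distribution and the standard error can be checked with a small simulation. This is only a sketch: the normally distributed population, its size, the sample size of 25 and the 2,000 repetitions are all assumptions chosen for illustration, and repeated random samples stand in for "all possible samples".

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 10,000 measurements (mean 50, S.D. 10).
population = [rng.gauss(50, 10) for _ in range(10_000)]

n = 25             # cases per sample
repetitions = 2_000

# Approximate the sampling distribution of the mean with many random samples.
sample_means = [
    statistics.mean(rng.sample(population, n)) for _ in range(repetitions)
]

# Standard error observed in the simulation vs. the sqrt(var / n) rule.
empirical_se = statistics.stdev(sample_means)
theoretical_se = statistics.pstdev(population) / math.sqrt(n)

print(f"empirical se   = {empirical_se:.2f}")
print(f"theoretical se = {theoretical_se:.2f}")
```

The two printed values should be close to each other, which is the sense in which the square root of var/n estimates the spread of the sample mean.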
E.g. Significance testing

The standard deviation (S.D.) indicates how far the individual responses to a question vary, or deviate, from the mean.
Example

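| No. of cases / Respondent | Sample A (Product Quality) | Sample B (Product Reliability) |
|---|---|---|
| T1 | 3 | 1 |
| T2 | 3 | 1 |
| T3 | 3 | 1 |
| T4 | 3 | 1 |
| T5 | 4 | 5 |
| T6 | 4 | 5 |
| T7 | 3 | 5 |
| T8 | 3 | 5 |
| T9 | 3 | 5 |
| T10 | 3 | 5 |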
Mean for sample A = 3.2, Var(A) = 0.18, S.D. for sample A = 0.4
Mean for sample B = 3.4, Var(B) = 4.27, S.D. for sample B = 2.1
(Var is the sample variance, computed with n - 1 in the denominator.)

$se(A-B) = \sqrt{0.18/10 + 4.27/10} \approx 0.67$

$|3.2 - 3.4| / 0.67 \approx 0.30 > sig$ (sig value typically 2), which is false.

A difference in means is statistically significant only when it exceeds sig standard errors. Here the difference in means is minimal and, measured in standard errors, is less than the sig value, hence it is not statistically significant.
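
The significance test for the product quality and product reliability responses can be sketched in Python as follows. The function name `significance_ratio` is a hypothetical helper, and the sketch assumes the sample variance is computed with n - 1 in the denominator, which is what `statistics.variance` does.

```python
import math
import statistics


def significance_ratio(sample_a, sample_b):
    """Difference in means, expressed in standard errors of that difference."""
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    se_diff = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return abs(statistics.mean(sample_a) - statistics.mean(sample_b)) / se_diff


# Responses T1..T10 from the example table.
quality = [3, 3, 3, 3, 4, 4, 3, 3, 3, 3]
reliability = [1, 1, 1, 1, 5, 5, 5, 5, 5, 5]

SIG = 2  # typical threshold, in standard errors
ratio = significance_ratio(quality, reliability)

# The ratio is about 0.30, well below SIG, so the difference is not significant.
print(f"ratio = {ratio:.2f}, significant = {ratio > SIG}")
```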
• Sample A has 5 cases with values 160, 150, 165, 170, 140 and sample B has 5 cases with values 100, 90, 110, 95, 120. Find the sample variance and standard error, and test the significance.

| Sample A | Sample B |
|---|---|
| 160 | 100 |
| 150 | 90 |
| 165 | 110 |
| 170 | 95 |
| 140 | 120 |

• Compute the mean of sample A and of sample B.
• Compute Var and S.D. for each sample.
• Apply hypothesis testing with sig = 2.
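
One way to work this exercise through is shown below, assuming the sample variance uses n - 1 in the denominator and that the standard error of the difference is the square root of var(A)/n1 + var(B)/n2.

$$
\begin{aligned}
mean(A) &= \tfrac{160+150+165+170+140}{5} = 157, \qquad mean(B) = \tfrac{100+90+110+95+120}{5} = 103 \\
var(A) &= \tfrac{3^2 + 7^2 + 8^2 + 13^2 + 17^2}{4} = \tfrac{580}{4} = 145, \qquad S.D.(A) = \sqrt{145} \approx 12.04 \\
var(B) &= \tfrac{3^2 + 13^2 + 7^2 + 8^2 + 17^2}{4} = \tfrac{580}{4} = 145, \qquad S.D.(B) = \sqrt{145} \approx 12.04 \\
se(A-B) &= \sqrt{\tfrac{145}{5} + \tfrac{145}{5}} = \sqrt{58} \approx 7.62 \\
\tfrac{|mean(A) - mean(B)|}{se(A-B)} &= \tfrac{54}{7.62} \approx 7.09 > 2
\end{aligned}
$$

The difference in means is about 7 standard errors, so under these assumptions it is statistically significant.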
Predicting true-or-false: classification
Classification is the most common application of
computer-based prediction.
The typical problem is to distinguish between two
classes.
Error rates
Performance is measured by keeping track of the number of mistakes that are made on sample cases.
The sample error rate (erate) is the percentage of classifications that are incorrect. It is given by:

$$erate = \frac{\text{number of errors}}{\text{number of cases}} \times 100$$
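
A minimal sketch of the error rate computation for a two-class problem follows. The labels are hypothetical data invented for illustration, and `error_rate` is an assumed helper name.

```python
def error_rate(true_labels, predicted_labels):
    """Percentage of cases whose predicted class differs from the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")
    errors = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return 100 * errors / len(true_labels)


# Hypothetical true-or-false outcomes for five sample cases.
true_labels = [True, False, True, True, False]
predicted_labels = [True, True, True, False, False]

erate = error_rate(true_labels, predicted_labels)
print(erate)  # 2 mistakes out of 5 cases -> 40.0
```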

Forecasting Numbers: Regression

Regression is also called function approximation. The objective is to predict a number. These numbers can be real numbers or ordered numbers, not categories or labels.
The performance can be measured in terms of distance from the true
value.

Line of Best fit: minimizes the distance between each individual point
and the regression line.
Distance Measures
The objective of regression is to minimize the distance between the
true/observed value for case i, yi and the predicted value yi’.
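Two measures of distance are commonly used: the classical regression measure, mean squared error (mse), and mean absolute distance (mad):

$$mse = \frac{1}{n}\sum_{i=1}^{n} (y_i - y_i')^2, \qquad mad = \frac{1}{n}\sum_{i=1}^{n} |y_i - y_i'|$$

where n is the number of cases.

The mean absolute distance is the more intuitive measure and is less sensitive to outliers.
The square root of mse (rmse) is never smaller than mad and is usually somewhat larger.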
Example
Calculate the distance measures for the following data:

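| No. of cases (half-yearly sales) | Observed value (in Lakhs) | Predicted value (in Lakhs) |
|---|---|---|
| January | 125 | 128 |
| February | 132 | 117 |
| March | 115 | 105 |
| April | 137 | 125 |
| May | 122 | 126 |
| June | 130 | 138 |

$mse = \frac{1}{6}\{(125-128)^2 + (132-117)^2 + \dots\}$
$mad = \frac{1}{6}\{|125-128| + |132-117| + \dots\}$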
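
The distance measures for the half-yearly sales data can be computed with a short sketch like the one below. The function names are assumptions made for illustration; the observed and predicted values are taken from the example.

```python
import math

# Observed and predicted half-yearly sales (in Lakhs), January to June.
observed = [125, 132, 115, 137, 122, 130]
predicted = [128, 117, 105, 125, 126, 138]


def mean_squared_error(obs, pred):
    """Average of the squared differences between observed and predicted."""
    return sum((y - yp) ** 2 for y, yp in zip(obs, pred)) / len(obs)


def mean_absolute_distance(obs, pred):
    """Average of the absolute differences between observed and predicted."""
    return sum(abs(y - yp) for y, yp in zip(obs, pred)) / len(obs)


mse = mean_squared_error(observed, predicted)      # 558 / 6 = 93.0
mad = mean_absolute_distance(observed, predicted)  # 52 / 6, about 8.67
rmse = math.sqrt(mse)                              # about 9.64

print(f"mse = {mse:.2f}, mad = {mad:.2f}, rmse = {rmse:.2f}")
```

For this data rmse comes out larger than mad, mainly because the 15-Lakh miss in February is weighted more heavily once it is squared.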
• Computing error measures and Moving Average
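| Week | Sales | 3MA forecast | Error = Actual - Forecast | Absolute error | Squared error | Absolute % error |
|---|---|---|---|---|---|---|
| 1 | 39 | | | | | |
| 2 | 44 | | | | | |
| 3 | 40 | | | | | |
| 4 | 45 | 41 | 45 - 41 = 4 | 4 | 16 | 4/45 = 8.89% |
| 5 | 38 | 43 | -5 | 5 | 25 | 5/38 = 13.16% |
| 6 | 43 | 41 | 2 | 2 | 4 | 2/43 = 4.65% |
| 7 | 39 | 42 | -3 | 3 | 9 | 3/39 = 7.69% |
| 8 | | 40 | | | | |
| Total / 4 | | | | 14/4 = 3.5 | 54/4 = 13.5 | 34.39%/4 = 8.60% |

MAD = 3.5, MSE = 13.5, MAPE = 8.60%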

Moving average for 4th week = (39+44+40)/3 =41

Moving average for 5th week = (44+40+45)/3 =43
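
The three-period moving average and its error measures can be reproduced with a short sketch. The helper name `moving_average_forecasts` is an assumption; the weekly sales figures are the ones from the example.

```python
# Weekly sales for weeks 1..7.
sales = [39, 44, 40, 45, 38, 43, 39]


def moving_average_forecasts(values, window=3):
    """Forecast each period as the mean of the previous `window` actuals.

    The result starts at period window + 1 and ends with a forecast for the
    first period that has no actual value yet.
    """
    return [
        sum(values[t - window:t]) / window
        for t in range(window, len(values) + 1)
    ]


forecasts = moving_average_forecasts(sales)  # weeks 4..8: 41, 43, 41, 42, 40
actuals = sales[3:]                          # weeks 4..7

# zip stops at week 7, the last week that has an actual value.
errors = [a - f for a, f in zip(actuals, forecasts)]
n = len(errors)

mad = sum(abs(e) for e in errors) / n                               # 3.5
mse = sum(e ** 2 for e in errors) / n                               # 13.5
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actuals)) / n   # about 8.60

print(forecasts)
print(f"MAD = {mad}, MSE = {mse}, MAPE = {mape:.2f}%")
```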
Forecasting Example
Measuring Predictive Performance
The ideal model samples randomly from the population and measures performance in terms of mean error, error rate or distance.
The objective is correct prediction on future cases.
Random Training and Testing
Error estimate
For the ideal model, performance is measured in terms of mean errors on sample
test cases.

Technically, the error rate for classification is a proportion, but in large samples the
error rate is equivalent to a mean.

For regression, we extend this analysis to the sample mean error and its variance:

$$merr = \frac{1}{n}\sum_{i=1}^{n} err_i$$

where merr is the mean error measure, either mad or mse, and $err_i$ is the error distance for case i, either $|y_i - y_i'|$ or $(y_i - y_i')^2$.
Comparing Results for Error Measures
Performance is measured in terms of the mean error (merr) on independent test cases (A and B).
To compare two results, for example the results of two different prediction methods, the standard hypothesis testing model can determine whether the difference between merr(A) and merr(B) is significant.
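
As a rough sketch of such a comparison, the snippet below applies the same standard-error test to the per-case errors of two prediction methods. Both error lists are hypothetical, the function name is an assumption, and with so few test cases a threshold of 2 standard errors is only a rough guide.

```python
import math
import statistics


def merr_differs_significantly(errors_a, errors_b, sig=2.0):
    """True if merr(A) and merr(B) differ by more than `sig` standard errors."""
    merr_a = statistics.mean(errors_a)
    merr_b = statistics.mean(errors_b)
    se_diff = math.sqrt(
        statistics.variance(errors_a) / len(errors_a)
        + statistics.variance(errors_b) / len(errors_b)
    )
    return abs(merr_a - merr_b) / se_diff > sig


# Hypothetical absolute errors |yi - yi'| of two methods on independent test cases.
errors_a = [3, 15, 10, 12, 4, 8]
errors_b = [2, 4, 3, 5, 1, 3]

print(merr_differs_significantly(errors_a, errors_b))
```

The same function works for squared errors, in which case merr is the mse of each method instead of the mad.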
