
PFC - Workshop 08: Statistics

This document discusses statistical analysis techniques for summarizing and understanding data. It describes calculating the mean, standard deviation, and performing linear regression on data sets. It provides examples of programs that would read data from files to calculate these statistics and determine the slope, y-intercept, and correlation coefficient from a linear regression analysis.

Uploaded by

Nhân Trung

PFC - Workshop 08

Statistics

In this workshop, you are to work with data values stored in a file. 

Learning Outcome

Upon successful completion of this workshop, you will have demonstrated the ability to read sequential
text files.

Introduction

Statistics helps summarize data in interesting and meaningful ways.  The data may consist of values
from a variety of sources.  Statistics identifies what is common and the degree of variation about what
is common.  In cases where a data set has several aspects, statistics determines the extent to which one
aspect of the set is related to another aspect of the set. 

For example, consider a video store with a large selection of rental videos at various prices.  We
determine the average price of a video in the store, the degree of price variation about this average and
then infer something about the clientele who rent videos from that store. 

If we suspect that the price of each video was determined by the date of its original release, we use
statistics to determine numerically if there is indeed any correlation between the rental price and the
original release date.  Moreover, we determine the degree of any such correlation. 

Statistics Calculator

Design and code a program that calculates statistical measures for a set of data values stored in a
file.  The program prompts for and accepts the name of the data file, then reads each record in the
file and calculates the mean and the standard deviation of the data values.  The data file contains one
floating-point value per record.  Each record is delimited by a newline.
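
The record format described above can be read with a short loop. The following is a minimal sketch in Python, not a prescribed solution; the function name `read_values` and the decision to skip blank lines are assumptions made for illustration.

```python
def read_values(path):
    """Return the floating-point values stored one per line in the file at `path`."""
    values = []
    with open(path) as data_file:
        for record in data_file:
            record = record.strip()
            if record:  # assumption: ignore blank records, e.g. a trailing empty line
                values.append(float(record))
    return values
```

A record that is not a valid number makes `float` raise `ValueError`; a complete program would decide whether to report or skip such records.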

For a sample containing n values, the key statistical indicators include

1. Mean (m)

$$ m = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

2. Sum of the squares of the values (ss)

$$ ss = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 $$

3. Variance ($d^2$)

$$ d^2 = \frac{ss}{n} - m^2 $$

4. Standard deviation (d)

$$ d = \sqrt{\frac{ss}{n} - m^2} $$
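
These four indicators need only three running totals: the count n, the sum of the values, and the sum of their squares ss. Here is a minimal Python sketch under that reading; the name `summarize` and the guard against a slightly negative variance caused by rounding are assumptions, not part of the workshop text.

```python
import math

def summarize(values):
    """Return (n, mean, standard deviation) using the running-sum formulas."""
    n = 0
    total = 0.0  # x1 + x2 + ... + xn
    ss = 0.0     # x1^2 + x2^2 + ... + xn^2
    for x in values:
        n += 1
        total += x
        ss += x * x
    if n == 0:
        raise ValueError("no data values")
    m = total / n
    variance = ss / n - m * m
    # Rounding can push a near-zero variance slightly below zero; clamp it.
    d = math.sqrt(max(variance, 0.0))
    return n, m, d
```

Because the formulas divide by n rather than n - 1, this is the population form of the standard deviation. The totals are accumulated in a single pass, so the values could equally be processed one record at a time as they are read from the file.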

The program output might look something like: 

Statistics Calculator
=====================
Enter the name of the data file : sample_1.dat 

The number of data values read from this file was 39 
Their statistical mean is 8.08
Their standard deviation is 2.15

Regression Analysis Calculator


Design and code a program that prompts for and accepts the name of a data file and then reads data
value pairs from the file, while performing a linear regression analysis on the set of data value pairs. 

The data file contains two values per record, with the two fields delimited by whitespace and each
record delimited by a newline.
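
Reading two whitespace-delimited fields per record could look like the sketch below. It is written in Python as an illustration only; `read_pairs` is a hypothetical name, and silently skipping records that do not have exactly two fields is an assumption.

```python
def read_pairs(path):
    """Return two lists (xs, ys) read from a file holding one 'x y' pair per line."""
    xs, ys = [], []
    with open(path) as data_file:
        for record in data_file:
            fields = record.split()  # splits on any run of spaces or tabs
            if len(fields) != 2:
                continue  # assumption: skip blank or malformed records
            xs.append(float(fields[0]))
            ys.append(float(fields[1]))
    return xs, ys
```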

Linear regression analysis models the statistical relationship between two sets of data values. 
Typically, we use the method of least squares to determine the line that best fits the scatter of data
points on an x-y graph.  The formulas for the line are:

Linear equation:

$$ y = a x + b $$

5. Slope (a)

$$ a = \frac{(x_1 - m_x)(y_1 - m_y) + (x_2 - m_x)(y_2 - m_y) + \dots + (x_n - m_x)(y_n - m_y)}{(x_1 - m_x)^2 + (x_2 - m_x)^2 + \dots + (x_n - m_x)^2} $$

6. y-intercept (b)

$$ b = m_y - a \, m_x $$

where m denotes the mean, which is given by

$$ m_x = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} $$

$$ m_y = \frac{y_1 + y_2 + y_3 + \dots + y_n}{n} $$
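
The slope and y-intercept formulas translate almost directly into code. The Python sketch below assumes the data pairs are already held in two equal-length lists; `least_squares` is a hypothetical name, and rejecting input whose x values are all identical is an added assumption, since the slope is undefined in that case.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a * x + b."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need the same non-zero number of x and y values")
    mx = sum(xs) / n
    my = sum(ys) / n
    # Numerator: sum of (xi - mx)(yi - my); denominator: sum of (xi - mx)^2.
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = numerator / denominator
    b = my - a * mx
    return a, b
```

This version makes two passes over the data, one for the means and one for the sums of deviations, which is why the pairs are kept in memory rather than processed as they are read.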

The statistical measure that indicates how well our best fit line models the relationship between the
variables is called the correlation coefficient.

7. Pearson correlation coefficient (r)

$$ r = \frac{\dfrac{x_1 y_1 + x_2 y_2 + \dots + x_n y_n}{n} - m_x m_y}{d_x \, d_y} $$

where d denotes the standard deviation, which is given by

$$ d_x^2 = \frac{s_x}{n} - m_x^2 $$

$$ d_y^2 = \frac{s_y}{n} - m_y^2 $$

where s denotes the sum of the squares, which is given by

$$ s_x = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 $$

$$ s_y = y_1^2 + y_2^2 + y_3^2 + \dots + y_n^2 $$
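
The correlation coefficient can be computed from the same kinds of totals as the mean and standard deviation, plus the sum of the products of each pair. A minimal Python sketch follows; `correlation` is a hypothetical name, and the clamping of tiny negative variances and the error raised for a constant data set are assumptions added for robustness.

```python
import math

def correlation(xs, ys):
    """Return the Pearson correlation coefficient r of the paired values."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need the same non-zero number of x and y values")
    mx = sum(xs) / n
    my = sum(ys) / n
    sx = sum(x * x for x in xs)               # sum of the squares of x
    sy = sum(y * y for y in ys)               # sum of the squares of y
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of the products x * y
    # Clamp at zero in case rounding makes a variance slightly negative.
    dx = math.sqrt(max(sx / n - mx * mx, 0.0))
    dy = math.sqrt(max(sy / n - my * my, 0.0))
    if dx == 0 or dy == 0:
        raise ValueError("r is undefined when either data set is constant")
    return (sxy / n - mx * my) / (dx * dy)
```

A value of r near 1 or -1 indicates that the points lie close to a straight line; a value near 0 indicates little linear relationship.
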
Your program output might look something like: 

Regression Analysis
===================
Enter the name of the data file : sample_2.dat

The slope of the least squares fit is 0.26
The y-intercept of the least squares fit is 26.94
The correlation coefficient is 0.91

What is the age of your car in months ? 32
You can expect a stopping distance of 35.13 metres

The data in the file "sample_2.dat" represent the minimum stopping distances, in metres, of cars aged
between 9 months and 76 months travelling at 40 km/h.
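
The last two lines of the sample run suggest that the program evaluates the fitted line at the age entered by the user. The Python sketch below assumes that x is the car's age in months and y the stopping distance in metres; `predict` is a hypothetical helper name.

```python
def predict(a, b, x):
    """Evaluate the fitted line y = a * x + b at x."""
    return a * x + b

# Coefficients as printed (rounded to two decimals) in the sample run.
estimate = predict(0.26, 26.94, 32)
print(f"You can expect a stopping distance of {estimate:.2f} metres")
```

With the rounded coefficients the estimate is about 35.26 metres, not the 35.13 shown in the sample run. The sample program presumably used the unrounded slope and y-intercept for the prediction and rounded only when printing.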
