PFC - Workshop 08: Statistics
Statistics
In this workshop, you are to work with data values stored in a file.
Learning Outcome
Upon successful completion of this workshop, you will have demonstrated the ability to read sequential
text files.
Introduction
Statistics helps summarize data in interesting and meaningful ways. The data may consist of values
from a variety of sources. Statistics identifies what is common and the degree of variation about what
is common. In cases where a data set has several aspects, statistics determines the extent to which one
aspect of the set is related to another aspect of the set.
For example, consider a video store with a large selection of rental videos at various prices. We
determine the average price of a video in the store, the degree of price variation about this average and
then infer something about the clientele who rent videos from that store.
If we suspect that the price of each video was determined by the date of its original release, we use
statistics to determine numerically if there is indeed any correlation between the rental price and the
original release date. Moreover, we determine the degree of any such correlation.
Statistics Calculator
Design and code a program that calculates statistical measures for a set of data values stored in a
file. The program prompts for and accepts the name of the data file and reads each record in the file,
calculating the mean and the standard deviation of the data values. The data file contains one
floating-point value per record. Each record is delimited by a newline.
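The reading step described above can be sketched as follows. This is a minimal sketch, not a prescribed solution: the function name readValues and its parameters are hypothetical, and it assumes that reading should stop at the end of the file or at the first record that is not a number.

```c
#include <stdio.h>

/* Hypothetical helper: reads one floating-point value per record from
 * the named text file and accumulates the count, the sum and the sum
 * of the squares of the values.
 * Returns 1 on success, 0 if the file could not be opened. */
int readValues(const char *filename, int *n, double *sum, double *sumSq)
{
    FILE *fp = fopen(filename, "r");
    double x;

    if (fp == NULL)
        return 0;

    *n = 0;
    *sum = 0.0;
    *sumSq = 0.0;

    /* fscanf returns the number of fields converted, so the loop ends
     * at end of file or at the first record that is not a number */
    while (fscanf(fp, "%lf", &x) == 1) {
        (*n)++;
        *sum += x;
        *sumSq += x * x;
    }

    fclose(fp);
    return 1;
}
```

Accumulating the sum and the sum of squares while reading means the values themselves never need to be stored, so the file can be of any length.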
1. Mean (m)
m = ( x1 + x2 + x3 + ... + xn ) / n
2. Sum of the squares of the values (ss)
ss = x1^2 + x2^2 + x3^2 + ... + xn^2
3. Variance (d^2)
d^2 = ( ss / n ) - m^2
4. Standard deviation (d)
d = sqrt( ( ss / n ) - m^2 )
Statistics Calculator
=====================
Enter the name of the data file : sample_1.dat
The number of data values read from this file was 39
Their statistical mean is 8.08
Their standard deviation is 2.15
For regression analysis, the data file contains two values per record, with the two fields delimited
by whitespace and each record delimited by a newline.
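Reading two whitespace-delimited values per record might look like the sketch below. The function name readPairs, the use of arrays and the capacity parameter max are assumptions made for illustration; the workshop does not say how the values should be stored.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to max records of two values each
 * into the arrays x and y.
 * Returns the number of complete records read, or -1 if the file
 * could not be opened. */
int readPairs(const char *filename, double x[], double y[], int max)
{
    FILE *fp = fopen(filename, "r");
    int n = 0;

    if (fp == NULL)
        return -1;

    /* each %lf skips leading whitespace, which covers both the
     * delimiter between fields and the newline between records */
    while (n < max && fscanf(fp, "%lf %lf", &x[n], &y[n]) == 2)
        n++;

    fclose(fp);
    return n;
}
```

Checking that fscanf converted exactly two fields stops the loop on an incomplete or malformed record instead of storing a partial pair.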
Linear regression analysis models the statistical relationship between two sets of data values.
Typically, we use the method of least squares to determine the line that best fits the scatter of data
points on an x-y graph. The formulas for the line are:
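Linear equation:
y = a * x + b
5. slope (a)
a = ( sxy / n - mx * my ) / ( ssx / n - mx^2 )
where sxy = x1 * y1 + x2 * y2 + ... + xn * yn, ssx is the sum of the squares of the x values, and
mx and my are the means of the x values and the y values
6. y-intercept (b)
b = my - a mx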
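A least-squares fit can be sketched as below, using the standard slope formula a = ( mean of x*y - mx * my ) / ( mean of x^2 - mx^2 ) together with b = my - a * mx. The function name fitLine and its return convention are hypothetical.

```c
/* Hypothetical helper: fits y = a * x + b to n data points by the
 * method of least squares.
 * Returns 1 on success, 0 if the slope is undefined (fewer than two
 * points, or all x values equal). */
int fitLine(const double x[], const double y[], int n, double *a, double *b)
{
    double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;
    double mx, my, denom;
    int i;

    if (n < 2)
        return 0;

    for (i = 0; i < n; i++) {
        sx += x[i];
        sy += y[i];
        sxy += x[i] * y[i];
        sxx += x[i] * x[i];
    }

    mx = sx / n;                 /* mean of the x values */
    my = sy / n;                 /* mean of the y values */
    denom = sxx / n - mx * mx;   /* variance of the x values */

    if (denom == 0.0)
        return 0;

    *a = (sxy / n - mx * my) / denom;
    *b = my - *a * mx;
    return 1;
}
```

For the points (1, 3), (2, 5), (3, 7), (4, 9), which lie exactly on a line, this gives a slope of 2 and a y-intercept of 1.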
The statistical measure that indicates how well our best fit line models the relationship between the
variables is called the correlation coefficient.
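r = ( sxy / n - mx * my ) / ( dx * dy )
where sxy is the sum of the products xi * yi, mx and my are the means of the x values and the y
values, and dx and dy are their standard deviations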
Regression Analysis
===================
Enter the name of the data file : sample_2.dat
The data in the file "sample_2.dat" represents the minimum stopping distance in metres of cars aged
between 9 months and 76 months travelling at 40 km/h.