Basic Statistics: Simple Linear Regression

Uploaded by

ashish_bifaas

Basic Statistics

Over the last few years I have been collecting statistics books. I never took statistics in college
since everyone I knew who took introductory statistics hated the class. Perhaps the reason for
this is that statistics is poorly taught. Or perhaps people find statistics boring because they do not
see how to apply statistics to applications that they find interesting.

Perhaps only sex or our survival focuses our fascination as much as money. Statistical
mathematics is an indispensable tool in finance (e.g., portfolio analysis, risk analysis, statistical
arbitrage). My interest in statistics arose from working on wavelet techniques for filtering and
analyzing financial data (financial time series, like stock market daily close prices). Statistics
provides a tool to understand whether a particular modeling technique may make money.

The rest of this web page consists of some notes on basic statistical functions. These statistical
functions are not only useful for data analysis, but they are also building blocks for other
techniques (like the calculation of the Hurst exponent). I've done little more than list the
equations. In some cases I list the reference where I got the equation. See the reference list below
for a list of statistics books. The C++ code that implements these functions is published here as
well.

Simple Linear Regression

Linear regression plots a line through a cloud of points. This is shown in the plot below.
The data points are from section 6.1 of Statistics Manual by Crow, Davis and Maxfield, Dover
Press.

Given a set of values, x_0 ... x_{N-1} and y_0 ... y_{N-1}, linear regression calculates the constant
coefficients a and b for the line

$$ y = a + b x $$

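The coefficient b (the slope of the line) is calculated with the equation

$$ b = \frac{\sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=0}^{N-1} (x_i - \bar{x})^2} $$

where $\bar{x}$ is the mean of the x values and $\bar{y}$ is the mean of the y values.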

Once the coefficient b is known, the coefficient a, the y-intercept, can be calculated from the
means $\bar{x}$ and $\bar{y}$ using the equation

$$ a = \bar{y} - b \bar{x} $$

When calculating the linear regression it is also important to calculate the error statistics, to give
a measure of how closely the linear regression line fits the data. The standard deviation of the
points around the regression line is

$$ s = \sqrt{\frac{\sum_{i=0}^{N-1} \left( y_i - (a + b x_i) \right)^2}{N - 2}} $$

From Cartoon Guide to Statistics and Introductory Statistics for Business and Economics

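The standard error of the regression coefficient b is

$$ s_b = \frac{s}{\sqrt{\sum_{i=0}^{N-1} (x_i - \bar{x})^2}} $$

where s is the standard deviation of the points around the regression line.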

From Cartoon Guide to Statistics and Introductory Statistics for Business and Economics
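
The regression calculation above can be sketched in a few lines of C++. This is a minimal illustration, not the published source code: the names LineFit and fitLine are made up for this example, and the N - 2 divisor for the standard deviation around the line is the usual textbook form, assumed here.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Result of a least-squares fit of the line y = a + b*x.
// LineFit and fitLine are illustrative names, not the author's API.
struct LineFit {
    double a;        // y-intercept
    double b;        // slope
    double stdDev;   // standard deviation of the points around the line
    double stdErrB;  // standard error of the slope b
};

LineFit fitLine(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    if (y.size() != n || n < 3) {
        throw std::invalid_argument("fitLine needs at least 3 paired points");
    }
    const double count = static_cast<double>(n);

    // Means of the x and y values.
    double meanX = 0.0;
    double meanY = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= count;
    meanY /= count;

    // Sums of squares and cross products around the means.
    double sxx = 0.0;
    double sxy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - meanX;
        sxx += dx * dx;
        sxy += dx * (y[i] - meanY);
    }
    if (sxx == 0.0) {
        throw std::invalid_argument("all x values are identical");
    }

    LineFit fit{};
    fit.b = sxy / sxx;              // slope
    fit.a = meanY - fit.b * meanX;  // intercept

    // Sum of squared residuals around the fitted line.
    double sse = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double residual = y[i] - (fit.a + fit.b * x[i]);
        sse += residual * residual;
    }
    fit.stdDev = std::sqrt(sse / (count - 2.0));  // assumed N - 2 divisor
    fit.stdErrB = fit.stdDev / std::sqrt(sxx);
    return fit;
}
```

Calling fitLine with the points (1,3), (2,5), (3,7), (4,9) should give a = 1 and b = 2 with both error statistics equal to zero, since those points lie exactly on a line.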

Correlation

Calculation of the regression line is closely related to the calculation of the correlation. Two
values are positively correlated when a change in one tends to be accompanied by a change, in the
same direction, in the other. For example, if a respected stock market analyst publishes a report
that states that she believes that a stock will go up, there is some probability that the report
will be positively correlated with a rise in the stock. Two values are anti-correlated when a change
in one tends to be accompanied by a change in the opposite direction in the other. For example, a
positive change in the polls for a politician, W, who favors deficits may be negatively correlated
with the direction of US dollar currency futures, D. Or, put more simply, as deficits go up, the
value of the dollar in foreign exchange tends to go down. The correlation coefficient, r, has a
value -1 <= r <= 1. A correlation of 0 means that there is no linear relationship between the two
values. If 0 < r <= 1, then there is a positive correlation. If -1 <= r < 0, then there is a
negative correlation.

The value R2 is calculated using the equation

From Statistics Manual by Crow et al

The correlation is the square root of R2, times the sign of the regression coefficient b (i.e., the
slope of the line).
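
As a rough sketch of the relationship just described, the C++ below computes R2 as the squared sum of cross products divided by the product of the two sums of squares, then applies the sign of the slope. This is the standard textbook form and is assumed here; it is not necessarily written the way Crow et al give it. The function name correlation is made up for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Correlation coefficient r for paired samples (illustrative sketch).
double correlation(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    if (y.size() != n || n < 2) {
        throw std::invalid_argument("correlation needs at least 2 paired points");
    }
    const double count = static_cast<double>(n);

    double meanX = 0.0;
    double meanY = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= count;
    meanY /= count;

    double sxx = 0.0;  // sum of squared x deviations
    double syy = 0.0;  // sum of squared y deviations
    double sxy = 0.0;  // sum of cross products
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if (sxx == 0.0 || syy == 0.0) {
        throw std::invalid_argument("x or y values are all identical");
    }

    const double r2 = (sxy * sxy) / (sxx * syy);
    // The slope b = sxy / sxx has the same sign as sxy, because sxx > 0.
    const double sign = (sxy < 0.0) ? -1.0 : 1.0;
    return sign * std::sqrt(r2);
}
```

Points that lie exactly on a rising line give r = 1, and points on a falling line give r = -1.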

The Autocorrelation Function (ACF)

Autocorrelation is the correlation of a data set with itself, offset by some number of values. For
example, autocorrelation with an offset of 5 correlates the data set {s0, s1, s2, s3 ... sn-5}
with {s5, s6, s7, s8 ... sn}. The autocorrelation function is the set of autocorrelations with offsets 1,
2, 3, 4 .. limit, where limit <= n/2.

The equation for the autocorrelation function (ACF) is

This function is related to the autocovariance, with a forward step of k elements.

The ACF equation I've used here (and implemented in software) is a modified version of the
ACF given in Non-linear time series models in empirical finance by Philip Hans Franses and
Dick van Dijk, Cambridge University Press, 2000, Section 2.2. Any errors are mine. I have
gotten the same plots as Franses has published, using his data, so I have some reason to believe
that my version is a correct translation.

Standard Deviation

The standard deviation of a data distribution is calculated from the equation below. Here the
standard deviation is calculated from the square root of the "unbiased" estimate of the variance
(where n-1 rather than n is used as a divisor). Some statistics textbooks use a form of the
standard deviation where n is the divisor, but calculators and statistics software use the version
shown here. Again, in the equation below, $\bar{x}$ is the mean.

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=0}^{n-1} (x_i - \bar{x})^2} $$
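
A short C++ sketch of this calculation, using the n-1 divisor described above. The function name sampleStdDev is made up for this example and is not taken from the published source.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Standard deviation from the "unbiased" variance estimate (n - 1 divisor).
double sampleStdDev(const std::vector<double>& values) {
    const std::size_t n = values.size();
    if (n < 2) {
        throw std::invalid_argument("sampleStdDev needs at least 2 values");
    }

    double mean = 0.0;
    for (double v : values) {
        mean += v;
    }
    mean /= static_cast<double>(n);

    double sumSquares = 0.0;
    for (double v : values) {
        sumSquares += (v - mean) * (v - mean);
    }
    return std::sqrt(sumSquares / static_cast<double>(n - 1));
}
```

For the values {1, 2, 3} the mean is 2, the sum of squared deviations is 2, and the result is 1. Dividing by n instead would give about 0.816, which is the difference between the two textbook forms.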

Math Packages vs. Writing Your Own Code

So far I've implemented all the statistics functions I've used in either C++ or Java. This has the
advantage that I've developed a fairly complete understanding of the functions I've implemented.
Math packages like MatLab, Mathematica and S+ provide support for basic statistics and include
packages that support additional functions like wavelets. These packages also provide some level
of support for plotting the results. So far I have not been able to justify the significant license
fees for this software. Although implementing software forces a better understanding of the
statistical functions being used, it is very time consuming. These statistical functions are a tool
for what I'm trying to accomplish, rather than an end in themselves. So better math support is
becoming increasingly attractive. There are three open source (or "public domain") packages that
look interesting:

1. Octave

To quote the Octave web page:

GNU Octave is a high-level language, primarily intended for numerical
computations. It provides a convenient command line interface for solving linear
and nonlinear problems numerically, and for performing other numerical
experiments using a language that is mostly compatible with Matlab. It may also
be used as a batch-oriented language.

MatLab seems to be one of the most, if not the most, popular mathematics environments.
In the book Ripples in Mathematics, which I used for reference for much of my wavelet
work, many of the examples are given in MatLab code.

2. The R Project for Statistical Computing

R is a free software version of the S+ mathematics environment. I used S+ for work I did
on wavelets while I was working at Prediction Company. The version of S+ we were
using was heavily modified for the local environment (Prediction Company has a source
license). This made it difficult to migrate to new versions of S+. So I'm not sure whether
my experience with S+ is representative of the current product. I was not very happy with
the S+ graphics environment. Also, S+ seems to be used much less than MatLab.

Although S+ may have limitations, the R Project does look interesting. The
documentation looks good and it has a more "professional feel" than Octave.

3. NIST Dataplot

Dataplot has been developed by the US National Institute of Standards and Technology.
Development of the software was started by James J. Filliben in 1978. It now includes a
fairly sophisticated plotting and data analysis environment. The software is supported on
virtually every major platform.

Copyright and use of the associated software

You may use this source code without limitation and without fee as long as you include:

This software was written and is copyrighted by Ian Kaplan, Bear Products International,
www.bearcave.com, 2001.

This software is provided "as is", without any warranty or claim as to its usefulness. Anyone who
uses this source code uses it at their own risk. Nor is any support provided by Ian Kaplan and
Bear Products International. Please send any bug fixes or suggested source changes to:
[email protected]

Software

The C++ source code that implements the basic statistics functions described here (along with
C++ code to calculate the probability density function, which is not described here) can be
downloaded by clicking here. This file is in uncompressed tar format.

The documentation for this source code, generated by Doxygen, can be found here.

Books

• The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith, Harper Collins,
1993

I'm always a bit embarrassed to list this book as a reference, since the title sounds so non-
serious. But this is one of the best concise introductions to statistics that I've found. The
book is an easy introduction to basic statistics, although it can be a bit shallow. The
Cartoon Guide to Statistics includes many of the basic statistics equations.

• Introductory Statistics for Business and Economics, Fourth Edition, Thomas H. Wonnacott and
Ronald J. Wonnacott, John Wiley and Sons, 1990

This is an excellent statistics book which provides a clear introduction without shying
away from equations, but avoiding proofs. This book has more depth than The Cartoon
Guide to Statistics and covers topics like multiple linear regression.

• "Social Science" majors usually have to take a statistics course as part of their degree
requirements. In some cases they choose their major because they did not like (or do well
in) college math. So there are statistics books for social science majors that approach
statistics without much in the way of equations. While these books usually provide a very
readable introduction to statistics, they make poor references in many cases, because in a
reference you want to find the equation. Two "social science" statistics books that I've
purchased are:
1. Statistics: A Bayesian Perspective by Donald A. Berry, Wadsworth, 1996.

This book is readable and provides basic statistics equations, in some cases. But
like the book below it avoids any notation that might be encountered in calculus
(e.g., ).

2. Statistics, Third Edition, by Freedman, Pisani and Purves, Norton, 1998

This book is the worse of the two. By staying away from equations, the discussion
is needlessly obscure. You get derivations like this one:

This attempts to describe the unbiased standard deviation. Clearly the equation
above is simpler. I ended up getting rid of this book.

• Principles of Statistics by M.G. Bulmer, Dover Press, 1979

Frequently I want a quick discussion of a statistical technique, where the author is not
afraid to give the equations and even use basic calculus. However, I don't want the
"statistics for math majors" which is heavy on proofs. This book nicely fits my
requirements. In some cases the explanations are a bit obscure and I've had to read some
sections a few times to understand the material. Dover Press has been reprinting math
books at very reasonable prices, so the book is reasonably priced as well.

• Statistics Manual by Crow, Davis and Maxfield, Dover Press, 1960

This is an even more concise catalog of statistical techniques. Each technique includes a
brief explanation and the basic equations. This book was written before computing power
was either affordable or widely available. Some of the simplifications for the equations
make sense if you are doing the calculation by hand, but offer no real advantage in
software. As with all Dover math books, this is reasonably priced.

• Against the Gods: The Remarkable Story of Risk by Peter L. Bernstein, 1996

This excellent book discusses the origins of statistics (the gambling salons in Europe)
and practical applications of statistical theory to insurance and finance.

• Capital Ideas: The Improbable Origins of Modern Wall Street by Peter L. Bernstein,
1993

Another great book by Peter Bernstein. This book provides one of the best overviews of
modern economic market theory and quantitative finance (which includes statistical
techniques). Quantitative finance is driven by a desire for profit, so it is a field that moves
rapidly as very smart people apply leading edge mathematical techniques to market
trading and modeling. As a result, Capital Ideas is somewhat out of date, but it still is
worth reading.

Web Based References

• Fast Median Finding Algorithms

When George W. Bush was selling the second of his tax cuts to the people of the United
States and their representatives in Congress he talked about the average tax cut that a US
family would receive. The problem with the number quoted by Bush was that while a
high income family would receive a large tax cut, a low income family would receive
very little. What "W" should have used was the median, rather than the mean, or average.
This would have yielded a number much closer to what most families would actually
receive.

If we have an ordered list of numbers, the median is the number in the middle. If the
number of values is even, then the median is the value midway between the two numbers closest
to the middle (their mean). The obvious way to find the median is to sort the numbers. A fast sort
algorithm has a time complexity of O(n log n). Once the numbers are sorted we can
directly find the median at the middle index of the number list.

C.A.R. Hoare, who invented the Quick Sort algorithm and developed a lot of the theory of
communicating sequential processes, also developed a fast median finding algorithm that
will find the median in O(n) time on average. In An Efficient Algorithm for the Approximate Median
Selection Problem (in pdf format), October 1999, Battiato et al. describe an algorithm that is
even faster (basically N).

• Chi Square Tutorial (PDF) by Prof. Jeff Connor-Linton, Georgetown University
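
The two ways of finding the median discussed under Fast Median Finding Algorithms can be sketched in C++ as follows. The function names are made up for this example. The selection version uses std::nth_element from the standard library, which stands in here for a Hoare-style selection algorithm; it runs in linear time on average, but it is not claimed to be Hoare's algorithm itself or the approximate algorithm of Battiato et al.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Median by sorting the whole list: O(n log n).
// The vector is taken by value so the caller's data is left untouched.
double medianBySort(std::vector<double> values) {
    if (values.empty()) {
        throw std::invalid_argument("median of an empty list is undefined");
    }
    std::sort(values.begin(), values.end());
    const std::size_t mid = values.size() / 2;
    if (values.size() % 2 == 1) {
        return values[mid];
    }
    // Even count: the value midway between the two middle numbers.
    return 0.5 * (values[mid - 1] + values[mid]);
}

// Median by selection: only the middle element is put in its sorted
// position, which takes O(n) time on average.
double medianBySelect(std::vector<double> values) {
    if (values.empty()) {
        throw std::invalid_argument("median of an empty list is undefined");
    }
    const std::size_t mid = values.size() / 2;
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    const double upper = values[mid];
    if (values.size() % 2 == 1) {
        return upper;
    }
    // After nth_element everything left of mid is <= values[mid], so the
    // lower middle value is the largest element in that left part.
    const double lower = *std::max_element(values.begin(), values.begin() + mid);
    return 0.5 * (lower + upper);
}
```

With the made-up values {100, 200, 300, 400, 100000}, both functions return 300 while the mean is 20200, which is the kind of gap between median and average described in the tax cut example.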

Acknowledgments

I want to thank Keith Dyke who, in a time which seems increasingly long ago, in a land far away
from where I am now, patiently explained basic statistics to me and helped me gain some facility
with the S+ statistics package.

Ian Kaplan
May 2003
Revised:
