
101827-FS2018-0: Programming with MATLAB: Advanced course

Felix Wichmann
Neural Information Processing Group and Bernstein Center for Computational Neuroscience, Eberhard Karls Universität Tübingen & Max Planck Institute for Intelligent Systems, Tübingen

3: fitting, regression & computational statistics
I am doing psychology and not mathematics or physics—so why should I care
about MATLAB and mathematics?
First, this course will be super-light on mathematical details.
Of course, quantitative analysis of data is everywhere in most areas of
psychological research, either openly or disguised in “cookbook-recipe-
style” statistics.
In particular, some of you may be planning to do EEG or functional imaging,
which generates lots of very high-dimensional data, for which statistical
analysis can be challenging.
Software packages (e.g. SPM or Brain Voyager) can do a lot of work for
you, but are not always applicable, or sometimes “overkill” for a simple
analysis. In addition, it is important to understand what you are doing, and
packages often hide many of the important details and assumptions from
you.

3
Using MATLAB to fit functions to your data: Linear functions

[Figure: reaction time plotted against task difficulty]

4
Using MATLAB to fit functions to your data: polynomials

[Figure: reaction time plotted against task difficulty]

5
Using MATLAB to fit functions to your data: general nonlinear functions

[Figure: lab measurement plotted against day after first treatment]

6
Plan for today, part 1
First, you download the file RegressionData.mat from ILIAS, and read about
the very convenient MATLAB function polyfit (help polyfit).
Second, try and fit a straight line to the data y and x1 in the above data
file (a minimal starting sketch follows after this list). Plot data and your fitted line—does it look “good”?
Third, try and fit a parabola (polynomial of degree 2) to the data y and x2.
Plot data and your fitted parabola—does it look “good”?
Fourth, try and fit a polynomial to the data y and x3. Plot data and your
fitted polynomial—does it look “good”? Which order of a polynomial do
you need to fit to get a “good” fit?
Finally, study the FunctionFittingDemoScript.m and make sure you
understand exactly what is going on; ideally, generate new datasets
yourself and try to fit them.
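
As a starting point, here is a minimal sketch of the straight-line fit, assuming RegressionData.mat indeed contains vectors x1 and y as described above (check the actual variable names after loading):

    load('RegressionData.mat');            % should provide x1, x2, x3 and y
    p  = polyfit(x1, y, 1);                % fit a straight line (polynomial of degree 1)
    xf = linspace(min(x1), max(x1), 100);  % fine grid for plotting the fitted line
    figure; hold on;
    plot(x1, y, 'ko');                     % the raw data
    plot(xf, polyval(p, xf), 'r-');        % the fitted line
    xlabel('x1'); ylabel('y');

For the parabola and the higher-order polynomials, only the third argument of polyfit (the degree) needs to change.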

7
Statistics
How many of you have taken a statistics class before?
How many of you like statistics?
Science is (often) about formulating and testing hypotheses.
Testing hypotheses requires statistics.
In addition, some knowledge of statistics is useful for interpreting
information in your everyday life.

8
Conventional statistical analyses can be really annoying
a) Identify the problem you want to solve.
I want to find out if attention decreases detection thresholds. I need to
test if detection thresholds were significantly lower in the attended than in
the unattended condition.
b) Decide which test you need to use. (You might have to ask a colleague/
flatmate that does statistics/friend/supervisor/use google/read a book.)
To test this, you will need a t-test/Chi-Square-test-for-variance/Q-test/F-
test/Kruskal-Wallis-test/Two-proportion-Z-test/Jarque-Bera-test/
Kolmogorov-Smirnov-test/Ansari-Bradley-test/Student-t-test/Bla-Bla-
Bla-test...
c) Find out which function in which software implements the test, and how
to use it.
You need the function ttest2 from the MATLAB statistics toolbox, and you
will need to pass your data x and y, the significance level (as probability,
not percent) and the tailedness (left/right/both).
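
For illustration only (the variable names attended and unattended are hypothetical, not from the course data), such a call could look like this:

    % Are thresholds lower in the attended condition?
    % 'Tail','left' tests the alternative that the mean of the first input is smaller.
    [h, p] = ttest2(attended, unattended, 'Alpha', 0.05, 'Tail', 'left');
    % h == 1 means the null hypothesis of equal means is rejected at the 5% level;
    % p is the corresponding p-value.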
9
Conventional statistical analyses can be really annoying (cont’d)
d) Your annoying colleague/supervisor/referee/examiner asks “Are the
assumptions of your test satisfied?”
Is the data Gaussian with equal variances? Are your residuals uncorrelated
and homoscedastic? Do you have enough data for a test that is only valid
asymptotically?
Testing whether the assumptions are truly satisfied might require another
test … thus you have to go back to b.

10
A brief glance at the history of statistical testing
Student’s t-test was developed in 1908 by William S. Gosset under the pseudonym “Student”. He was a chemist working at the Guinness Brewery in Dublin.
Statistical testing dates back to John Arbuthnot, 1694, and the first (known) lecture series on it was given by Karl Pearson in 1893.
Picture source: Wikipedia

There were no computers in 1908. Statistical tests had to be such that they
could be done with pen, paper and lists of numbers.
To make this possible, one needed to write down mathematical formulas which
describe the distribution of the test statistic. Thus, tests were tailor-made to
specific simplifying assumptions (e.g. the data are Gaussian), and often used
approximations which are valid for large N (asymptotics).

11
A brief glance at the history of statistical testing (cont’d)
Thus, for classical tests you have to …
1. know the name of the test
2. check its distributional assumptions
3. (sometimes) live with the fact that they are only approximations.

12
Good news: With computers (and MATLAB), statistics can be much easier

Key question in statistics: Could this effect just come about by chance?

If the null-hypothesis were true—i.e. there is no difference between condition a and b—would we still observe similar—or more extreme—effects just by chance?
You play dice with a friend, and he has suspiciously many ‘sixes’. Out of 60
attempts, he had 15 sixes. How can you test whether his die is fair?
If you have a fair die (and lots of time), play this game 1000 times. If you
observe 15 (or more) sixes only rarely (e.g. less than 5% of the time), then
his die is ‘significantly unfair’.

13
Good news: With computers (and MATLAB), statistics can be much easier (cont’d)

In MATLAB, we can simulate data for which the null-hypothesis is true—MATLAB is our perfectly fair die—and it is very fast! See the demo IsThisDieFair.m

Classical approach: Binomial test, p-value is about 6.5%.
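
As a rough sketch of the simulation idea (not the actual IsThisDieFair.m; the numbers below are the ones from the example):

    nGames   = 10000;                     % number of simulated games with a fair die
    nRolls   = 60;                        % rolls per game
    observed = 15;                        % sixes observed in the real game
    rolls    = randi(6, nRolls, nGames);  % each column is one game of 60 rolls
    sixes    = sum(rolls == 6);           % number of sixes per simulated game
    pSim     = mean(sixes >= observed);   % fraction of games at least as extreme
    % For comparison, the classical binomial test (Statistics Toolbox):
    pBinom   = 1 - binocdf(observed - 1, nRolls, 1/6);   % roughly 0.065

The simulated p-value pSim should come out close to the binomial-test value quoted above.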


Note: I am not saying that classical statistical methods are bad. If you
know the name of the test that is appropriate for your problem, and you
know how to use it, go ahead.
In fact, MATLAB has functions for many classical statistical tests in the
Statistics Toolbox. Type “available hypothesis tests” into the
MATLAB documentation search.

14
Machine Learning
Comparatively new sub-branch of computational statistics jointly
developed in computer science and statistics.
Machine learning is empirical inference performed by computers based on
past observations and learning algorithms: Machine learning algorithms
are mainly concerned with discovering hidden structure in data in order to
predict novel data—exploratory methods.
Machine learning—and in particular kernel methods as well as
convolutional deep neural networks—has proven successful whenever
there is an abundance of empirical data but a lack of explicit knowledge
of how the data were generated.

15
Regularisation & Cross-validation
Find a compromise between complexity and classification performance (or
goodness-of-fit in classical statistics).
Penalise complex functions via a regularisation term or regulariser.
Cross-validate the results (leave-one-out or 10-fold typically used).

16
Polynomial Curve Fitting

[Figure: N = 10 data points t plotted against x]

N = 10 datapoints (training set): x = (x_1, …, x_N) and t = (t_1, …, t_N)

Prediction game: t* for a new x*

Choose a polynomial: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$

Figure (1.2) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
Error Function

[Figure: the error is measured between each data point t_n and the prediction y(x_n, w) of the fitted curve]

$E(\mathbf{w}) = \sum_{i=1}^{N} \{ y(x_i, \mathbf{w}) - t_i \}^2$

$E_{\mathrm{RMS}} = \sqrt{E(\mathbf{w})/N}$
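
As a tiny illustration (the variable names tHat and t are hypothetical, not from the course code), both quantities can be computed in one line each:

    E    = sum((tHat - t).^2);      % sum-of-squares error E(w); tHat holds the model predictions
    Erms = sqrt(E / numel(t));      % root-mean-square error, comparable across dataset sizes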
[Figure: polynomial fit with M = 0]

Figure (1.4a) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
[Figure: polynomial fit with M = 1]

Figure (1.4b) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
[Figure: polynomial fit with M = 3]

Figure (1.4c) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
[Figure: polynomial fit with M = 9]

Note: Excellent fit to the data—error free … but a poor representation of the green curve: over-fitting

Figure (1.4d) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
[Figure: training and test E_RMS as a function of the polynomial order M]

$E_{\mathrm{RMS}} = \sqrt{E(\mathbf{w})/N}$

Figure (1.5) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
More data typically helps …

[Figure: the M = 9 fit for N = 10, N = 15 and N = 100 data points]

Figure (1.4d and 1.6) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
… or appropriate regularisation …

[Figure: the M = 9 fit with no regulariser (λ = 0), with a reasonable regulariser (ln λ = −18), and with too large a regulariser (ln λ = 0)]

$E(\mathbf{w}) = \sum_{i=1}^{N} \{ y(x_i, \mathbf{w}) - t_i \}^2 + \lambda \|\mathbf{w}\|^2$

with $\|\mathbf{w}\|^2 = \mathbf{w}^{\mathrm{T}} \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$

Figure (1.4d and 1.7) taken from Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
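
A minimal sketch of minimising this regularised error in closed form (the variable names, and the use of x and t as column vectors, are assumptions for illustration; this is not code from the course materials):

    M      = 9;
    lambda = exp(-18);                      % the regulariser weight from the middle panel
    Phi    = x(:) .^ (0:M);                 % N-by-(M+1) design matrix of powers of x (needs R2016b+)
    w      = (Phi' * Phi + lambda * eye(M + 1)) \ (Phi' * t(:));  % regularised least squares
    tHat   = Phi * w;                       % fitted values at the training inputs

Note that eye(M + 1) penalises all coefficients including w_0, matching the definition of ||w||^2 on the slide.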
Two basic types of statistical analyses cover many scenarios you will come across in your research
1) What are the error bars on this quantity? [Is there a significant
difference between A and B?—You can answer this once you have the
error bars for A, B]: Bootstrap Test
2) I have two models (one of which might have more parameters than the
other), which of them is a better model of the data? Cross-Validation
Today, we will discuss example situations for each of the two questions. We
will give “general recipes” for how to address each of the two questions in
MATLAB.


Caveat: Of course, this is a gross simplification. There are cases which do
not quite fit into either of the two boxes.
There are also situations which do fit into one of the boxes, but do require
a more complicated analysis than the ones described here.

26
Scenario 1: What are the error bars on this quantity?
Examples:
I have measured percentage correct of 40%—how accurate is this
measurement?
My average measurement is 30 seconds—what is a 90% confidence
region for this measurement?
The median of my data is 15.3—is this significantly bigger than 10 or not?
Is height correlated with IQ?
Strategy: Bootstrap test. Take random subsets of the data (with
replacement), calculate the quantity of interest on each subset, get a histogram
across the different subsets and derive error bars/confidence regions from the
percentiles of the histogram. See bootstrapDemoScript.m
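
A minimal bootstrap sketch (not the actual bootstrapDemoScript.m; data is a hypothetical vector, and prctile from the Statistics Toolbox plays the role of the supplied Percentile.m):

    nBoot      = 2000;                      % number of bootstrap resamples
    n          = numel(data);
    bootMedian = zeros(nBoot, 1);
    for b = 1:nBoot
        idx = randi(n, n, 1);               % resample indices with replacement
        bootMedian(b) = median(data(idx));  % quantity of interest on this resample
    end
    ci = prctile(bootMedian, [5 95]);       % 90% confidence interval from the percentiles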

27
Scenario 2: I have two models (one of which might have more parameters than the
other), which of them is a better model of the data?
Examples:
Can the dependence of reaction time on the stimulus intensity be
described using a linear function, or do I need a quadratic function?
I have a new saliency model. Is it better at predicting eye movements than
previously developed models?
How well can I decode from my fMRI/EEG what stimulus the subject saw?
Catch: When comparing models of different complexity—i.e. with different
numbers of parameters—the model with more parameters will have an
unfair advantage (better goodness-of-fit).
For example, a quadratic function will always fit the data better than a
line.

28
Scenario 2: I have two models (one of which might have more parameters than the
other), which of them is a better model of the data? (cont’d)
We are interested in generalisation ability—science is a prediction game—
not just fitting the data (‘over-fitting’).
Strategy: Cross-validation. Fit parameters on one subset of the data
(‘training set’), evaluate goodness of fit on other subset (‘test set’).
Repeat.
K-fold cross-validation. Split the data into K non-overlapping subsets. Take the
first subset as test set and the other K-1 as training set; then take the second
subset as test set, and so on, until each subset has served as the test set once.
Leave-One-Out cross-validation. If you have N data-points, take all but one
data-points for training and one data-point as test-set, repeat this
procedure N times.
Cross-validation is extremely important for high-dimensional data or
models!
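
As a minimal illustration (not the actual crossvalidationDemoScript.m; x and y are hypothetical column vectors of equal length), leave-one-out cross-validation comparing a line with a parabola could look like this:

    N   = numel(y);
    err = zeros(N, 2);                             % test errors for degrees 1 and 2
    for i = 1:N
        train = true(N, 1);  train(i) = false;     % leave out the i-th data point
        for degree = 1:2
            p = polyfit(x(train), y(train), degree);        % fit on the training set
            err(i, degree) = (polyval(p, x(i)) - y(i))^2;   % squared error on the left-out point
        end
    end
    meanErr = mean(err);                           % mean test error for each model

The model with the smaller mean test error generalises better, regardless of how many parameters it has.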

29
Plan for today, part 2
Computational statistics in MATLAB: Bootstrap and cross-validation.
First, you download the file DemoData.mat from ILIAS, and a support
function called Percentile.m, as well as the following three m-files:
IsThisDieFair.m, bootstrapDemoScript.m,
crossvalidationDemoScript.m

Second, you play with the demo scripts and try and ensure you understand both
i.) the logic behind the bootstrap and cross-validation, as well as
ii.) the MATLAB code implementing them.

30
Homework
First, read and work through all the files and demos I have supplied you
with and make sure you understand what is happening.
Second, modify the crossvalidationDemoScript.m script and
implement 2-fold as well as 10-fold cross-validation.
Third, modify the script again such that the user can set a CONSTANT at the
top of the file, choosing to do either leave-one-out, 2-fold or 10-fold cross-
validation (one possible way of assigning data points to folds is sketched below).
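
A hypothetical sketch of such a fold assignment, with a single constant controlling K (the course script may organise this differently; y is assumed to hold the data):

    K     = 10;                             % set to 2, 10, or numel(y) for leave-one-out
    N     = numel(y);
    folds = mod(randperm(N), K) + 1;        % random fold label (1..K) for every data point
    for k = 1:K
        testIdx  = (folds == k);            % the k-th fold is the test set
        trainIdx = ~testIdx;                % the remaining K-1 folds form the training set
        % ... fit on trainIdx, evaluate on testIdx ...
    end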

31
