Data Driven Modelling
Data Driven Modelling using MATLAB
Shan He
School for Computational Science
University of Birmingham
Module 06-23836: Computational Modelling with MATLAB
Outline of Topics
What is data driven modelling?
Regression Analysis in MATLAB
Artificial Neural Networks
Conclusion
What is data driven modelling?
For equation-based and agent-based models, we assume the model is
known.
However, sometimes we have a large amount of data but very
little prior knowledge.
In that case, finding the model in the first place is the most
difficult and important question.
This is the subject of a new research field: data driven modelling (DDM).
In DDM, a model is built from the data, based on connections
between the system state variables (e.g., input, internal and
output variables), with only limited assumptions about the system.
Goals/purposes of data driven modelling
Extract and recognize patterns in data
Interpret or explain observations
Test validity of hypotheses
Search the space of hypotheses
Tasks of data driven modelling
Classification: the task consists of assigning a class
to an input data point.
Association: the association between the variables
characterising the system is identified and then used in
subsequent prediction.
Regression: the task consists of predicting a real
value associated with an input data point.
Clustering: groups of data points with high within-group
similarity are determined.
It is new and old!
It was previously called observational modelling.
It was based on methods from statistics, e.g., regression.
These methods usually cannot handle nonlinear systems.
In recent years, machine learning techniques have been applied.
We will learn how to use regression and Artificial Neural
Networks to build data-driven models in MATLAB.
Data driven modelling process
Data preparation: obtain, check and clean the data.
Feature selection: needed if you have high-dimensional data.
Specify assumptions based on domain knowledge.
Develop a model based on the assumptions.
Specify a loss function, e.g., the mean squared error
between the model output and the real data.
Use an optimisation algorithm to minimise the loss on the training data.
Test the model on held-out test data.
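The steps above can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: the data are synthetic, the quadratic model form stands in for "domain knowledge", and the 70/30 split is an arbitrary choice. None of these come from the lecture.

```matlab
% Minimal sketch of the process on synthetic data (all values are assumptions).
rng(0);                                        % reproducible random numbers

% 1. Data preparation: synthetic noisy "measurements"
x = linspace(-1, 1, 200)';
y = 1 + 2*x - 3*x.^2 + 0.1*randn(size(x));

% Split into training (70%) and test (30%) data
idx      = randperm(numel(x));
nTrain   = round(0.7 * numel(x));
trainIdx = idx(1:nTrain);
testIdx  = idx(nTrain+1:end);

% 2. Assumption: the system is quadratic, y = b1 + b2*x + b3*x^2
design = @(v) [ones(numel(v), 1), v, v.^2];    % builds the design matrix

% 3. Loss: mean squared error. For a model that is linear in its
%    coefficients, the backslash operator gives the least-squares solution.
beta = design(x(trainIdx)) \ y(trainIdx);

% 4. Test on data the fit has never seen
trainMSE = mean((y(trainIdx) - design(x(trainIdx)) * beta).^2);
testMSE  = mean((y(testIdx)  - design(x(testIdx))  * beta).^2);
fprintf('train MSE = %.4f, test MSE = %.4f\n', trainMSE, testMSE);
```

A test error much larger than the training error would suggest that the model is overfitting or that the assumptions do not hold.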
What tools can we use?
Statistics:
  Linear regression
  Nonlinear regression
  Logistic regression
  Probit regression
Machine Learning techniques:
  Decision tree
  Artificial Neural Network
  Nearest Neighbours
  Support Vector Machine
  Association rule learning
Linear regression analysis in MATLAB
For linear regression, we can use polynomial curve fitting.
MATLAB function: p = polyfit(x,y,n)
It finds the coefficients of a polynomial p(x) of degree n that
fits the data, p(x(i)) to y(i), in a least squares sense.
The output p is a row vector of length n+1 containing the
polynomial coefficients in descending powers:
$p(x) = p_1 x^n + p_2 x^{n-1} + \dots + p_n x + p_{n+1}$
To evaluate the polynomial at the data points: y = polyval(p,x)
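A minimal sketch of the two calls on synthetic data may help. The true slope, intercept and noise level below are made-up values chosen for illustration only.

```matlab
% Fit a straight line (degree 1) to noisy synthetic data.
rng(1);                                  % reproducible noise
x = linspace(0, 10, 50);
y = 2*x + 1 + 0.5*randn(size(x));        % assumed "true" line plus noise

p    = polyfit(x, y, 1);                 % p(1) is the slope, p(2) the intercept
yFit = polyval(p, x);                    % model output at the data points

residuals = y - yFit;
fprintf('slope = %.3f, intercept = %.3f, RMSE = %.3f\n', ...
        p(1), p(2), sqrt(mean(residuals.^2)));
```

Because the coefficients are returned in descending powers, the highest-order term always comes first in p.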
A very simple example: fitting error function
Regression: we aim to fit data points sampled from the error
function. erf(x) is twice the integral of the Gaussian
distribution with mean 0 and variance 1/2:
$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$
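One way this fit might be set up is sketched below. The sampling interval [0, 2.5] and the polynomial degree 6 are assumptions made for illustration; the slide does not state which values were used.

```matlab
% Fit a polynomial to samples of erf and check it inside and outside the data range.
x = (0:0.1:2.5)';                        % assumed sampling interval
y = erf(x);

p = polyfit(x, y, 6);                    % assumed degree
f = polyval(p, x);
maxErrInside = max(abs(y - f));          % error on the fitted interval

% Evaluate beyond the data: a polynomial cannot follow erf(x) -> 1
xWide       = (0:0.1:5)';
fWide       = polyval(p, xWide);
maxErrWide  = max(abs(erf(xWide) - fWide));

fprintf('max error on [0,2.5]: %.2e, on [0,5]: %.2e\n', maxErrInside, maxErrWide);
```

The fit is close on the interval that supplied the data but degrades quickly outside it, which is a useful reminder that a data-driven model is only trustworthy near the data it was built from.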
A more complex example: fitting traffic data
Hourly traffic counts at three intersections for a single day.
Regression: we aim to fit the data with polyfit and evaluate the fitted curve with polyval
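A possible starting point is sketched below. It assumes the data are the count.dat demo file that ships with MATLAB (a 24-by-3 matrix of hourly counts), and the polynomial degree is an arbitrary choice; neither detail is confirmed by the slide.

```matlab
% Assumption: count.dat (MATLAB demo data) is on the path; load creates "count".
load count.dat
hours = (1:24)';
c1    = count(:, 1);                     % first intersection

% Centring and scaling x (third output mu) improves numerical conditioning
% for higher-degree polynomials.
degree     = 6;                          % assumed degree, tune as needed
[p, S, mu] = polyfit(hours, c1, degree);
fit1       = polyval(p, hours, S, mu);

rmse = sqrt(mean((c1 - fit1).^2));
fprintf('degree %d fit, RMSE = %.2f vehicles/hour\n', degree, rmse);
```

Repeating the fit for several degrees and comparing the errors is a simple way to see the trade-off between underfitting and overfitting.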
Logistic regression
Sometimes called the logistic model or logit model.
Can be used for predicting the outcome of a binary dependent
variable: Classification.
MATLAB function: b = glmfit(X,y,distr)
Output: a vector b of coefficient estimates (including an
intercept term by default) for a generalized linear regression
of the responses in y on the predictors in X, using the
distribution distr. For logistic regression, distr is 'binomial'.
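A minimal sketch of logistic regression with glmfit follows. It requires the Statistics Toolbox; the data are synthetic, and the "true" coefficients and the 0.5 decision threshold are assumptions made for illustration.

```matlab
% Logistic regression on synthetic two-feature data.
rng(0);
n     = 500;
X     = randn(n, 2);                              % two continuous predictors
trueB = [-0.5; 2; -1];                            % assumed intercept and weights
prob  = 1 ./ (1 + exp(-(trueB(1) + X * trueB(2:3))));
y     = double(rand(n, 1) < prob);                % binary response (0 or 1)

% 'binomial' uses the logit link by default; b(1) is the intercept.
b = glmfit(X, y, 'binomial');

% Predicted probabilities, then class labels with a 0.5 threshold
pHat     = glmval(b, X, 'logit');
yPred    = pHat >= 0.5;
accuracy = mean(yPred == y);
fprintf('training accuracy = %.1f%%\n', 100 * accuracy);
```

The accuracy here is measured on the training data only; a fair assessment would use a held-out test set.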
Australian Credit Card Assessment
Task: to assess applications to an Australian bank for a credit
card based on a number of attributes.
2 classes: granted (44.5% of the instances) or denied (55.5%
of the instances)
14 attributes: names and values have been changed to
meaningless symbols to protect confidentiality of the data.
Mixed-type inputs: there are 5 continuous, 4 binary and 5
nominal attributes.
Many missing values.
Military Trauma survival prediction
What are Artificial Neural Networks (ANNs)?
[Figure: network diagram with Input, Hidden Layer and Output]
ANN: Mathematical model or computational model inspired
by biological neural networks.
Consists of an interconnected group of artificial neurons
Non-linear statistical data modelling tools:
  Model complex relationships between inputs and outputs;
  Discover patterns in data.
Can be used for classification, association, regression and
clustering.
In MATLAB, ANNs are provided by the Neural Network Toolbox.
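As a rough illustration of modelling a nonlinear input-output relationship, the sketch below trains a small network on synthetic data. It assumes the Neural Network Toolbox is installed; the target function, noise level and hidden layer size are arbitrary choices, not values from the lecture.

```matlab
% Nonlinear regression with a small network on synthetic data.
rng(0);
x = linspace(-pi, pi, 200);              % 1-by-N: each column is one sample
t = sin(2*x) + 0.05*randn(size(x));      % assumed nonlinear target plus noise

net = fitnet(10);                        % one hidden layer with 10 neurons (assumed)
net.trainParam.showWindow = false;       % suppress the training GUI
net = train(net, x, t);                  % by default splits into train/val/test

yHat   = net(x);                         % network output
netMSE = mean((t - yHat).^2);
fprintf('MSE over all samples = %.4f\n', netMSE);
```

A straight-line fit to the same data would leave a large error, which is the sense in which ANNs are non-linear modelling tools.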
Example: Prediction of number of sun spots
The sunspot series is a record of activity on the surface of the
sun.
Important: telecommunications will be disrupted by a
sufficiently large solar flare.
Time series data for sunspot activity over the last 300 years.
Sunspot activity is cyclical, reaching a maximum about every
11 years.
Challenging: sunspot series is nonlinear, non-stationary and
non-Gaussian
Prediction of sunspot number by ANNs
Task: we use recorded sunspot data to train our ANN to
predict the sunspot number from the sunspot numbers of the
previous 3 years.
Training data: sunspot numbers from 1705 to 1884
Test data: sunspot numbers from 1884 to 1987
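The task could be set up roughly as follows. This sketch assumes the sunspot.dat demo file that ships with MATLAB (columns: year, sunspot number) and the Neural Network Toolbox. The network size is an arbitrary choice, and because the slide lists 1884 in both ranges, it is assigned to the training set here.

```matlab
% Assumption: sunspot.dat (MATLAB demo data) is on the path; load creates "sunspot".
load sunspot.dat
year = sunspot(:, 1);
s    = sunspot(:, 2);
n    = numel(s);

% Each column of X holds the three previous years; T holds the year to predict.
X     = [s(1:n-3)'; s(2:n-2)'; s(3:n-1)'];   % 3-by-(n-3) inputs
T     = s(4:n)';                             % 1-by-(n-3) targets
tYear = year(4:n)';                          % year of each target

trainIdx = tYear >= 1705 & tYear <= 1884;    % 1884 assigned to training (assumption)
testIdx  = tYear >  1884 & tYear <= 1987;

rng(0);
net = feedforwardnet(5);                     % 5 hidden neurons (assumed)
net.trainParam.showWindow = false;           % suppress the training GUI
net = train(net, X(:, trainIdx), T(trainIdx));

pred    = net(X(:, testIdx));
testMSE = mean((pred - T(testIdx)).^2);
fprintf('test MSE (1885 to 1987) = %.1f\n', testMSE);
```

Plotting pred against T(testIdx) over tYear(testIdx) shows how well the network tracks the roughly 11-year cycle.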
New direction for ANNs: Deep Learning
ANNs fell out of favour in the 1990s because they were slow and
inefficient.
In 2006, Prof. Geoff Hinton made a breakthrough: deep
learning.
It excels at unsupervised learning, e.g., recognising handwritten
words.
Key idea: learn categories incrementally, e.g., lower-level
categories (letters) first, then higher-level categories (words).
Google and Microsoft, along with other big names, have
jumped on the bandwagon.
Microsoft Project: Speech Recognition
Conclusion
If you know the underlying mechanisms of the system (even
partially), DO NOT use data-driven modelling methods.
How to choose your tools: start from simple tools and move to more complex ones:
Regression, then Decision Tree, then ANNs (SVM, Random Forest),
then hybrid methods, e.g., Evolutionary ANNs.
Also consider interpretability: simpler tools are usually easier to interpret.
Assignment
Based on the sunspot number prediction example, use linear
regression (polyfit) and ANNs to model the Hudson's Bay
Company fur record data.
Investigate how to use a decision tree for the Australian Credit Card
Assessment problem. Compare the results with those of ANNs and
logistic regression.