Predictive Analytics and Big Data with MATLAB
MATLAB
Introduction
Predictive Modeling
– Supervised Machine Learning
– Time Series Modeling
[Workflow diagram: Databases, Datafeeds, Financial Modeling, Application Development, Applications, Production, Scale]
Financial Modeling Workflow
Predictive Modeling
Explore and Prototype
– Data Analysis & Visualization
– Financial Modeling
– Application Development
Agenda
Introduction
Predictive Modeling
– Supervised Machine Learning
– Time Series Modeling
Input/Predictors → Predictive model → Output/Response
Examples
– EL = f(T, t, DP, ...)
– Forecast prices/returns
– Stress testing
Challenges
Predictive Modeling Workflow
Classes of Response Variables
Structure: Sequential, Non-Sequential
Type: Continuous, Categorical
Examples
– Classification techniques
[Figure: Misclassification Rate. Percentage of "No" and "Yes" responses misclassified, by classifier: Neural Net, Logistic Regression, Discriminant Analysis, k-nearest Neighbors, Naive Bayes, Support VMs, Decision Trees, TreeBagger, Reduced TB]
– ARIMA modeling
– GARCH modeling
[Figure: Realized vs Median Forecasted Path. S&P 500, Original Data, May-01 to May-12]
Getting Started with Predictive Modeling
Example – Bank Marketing Campaign
Goal:
– Predict if customer would subscribe to bank term deposit based on different attributes
[Figure: Bank Marketing Campaign Misclassification Rate. Percentage of "No" and "Yes" responses misclassified, by classifier]
Approach:
– Reduce model complexity
– Use classifier for prediction
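The approach above (train a classifier, then measure how often each response class is misclassified) could be sketched as follows. This is only an illustration, not the code behind the slides: it assumes Statistics and Machine Learning Toolbox, substitutes synthetic data for the bank marketing data set, and picks a decision tree arbitrarily. All variable names are hypothetical.

```matlab
% Synthetic stand-in for the campaign data (hypothetical predictors and response).
rng(1);                                   % reproducible random numbers
n = 1000;
X = randn(n, 3);                          % three numeric customer attributes
score = X(:,1) - 0.5*X(:,2) + 0.5*randn(n, 1);
Y = categorical(score > 0.8, [false true], {'No', 'Yes'});

% Hold out 30% of the observations for testing.
cv = cvpartition(Y, 'HoldOut', 0.3);
Xtrain = X(training(cv), :);   Ytrain = Y(training(cv));
Xtest  = X(test(cv), :);       Ytest  = Y(test(cv));

% Train a classifier and use it for prediction.
mdl   = fitctree(Xtrain, Ytrain);
Ypred = predict(mdl, Xtest);

% Percentage of each true class that was misclassified (rows of C are true classes).
[C, order]  = confusionmat(Ytest, Ypred);
misclassPct = 100 * (1 - diag(C) ./ sum(C, 2));
disp(table(order, misclassPct, 'VariableNames', {'Response', 'PctMisclassified'}));
```

Swapping `fitctree` for another fitting function would produce one more bar group in a chart like the one on the slide.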
Regression
Classification
Example – Bank Marketing Campaign
aid discovery
Quick prototyping; focus on modeling, not programming
[Figure: Misclassification Rate. Percentage of "No" and "Yes" responses misclassified, by classifier]
Example – Time Series Modeling and
Forecasting for the S&P 500 Index
Goal:
combined ARIMA/GARCH
process and forecast on test data
volatility
[Figure: Realized vs All Forecasted Paths. S&P 500, Original Data and Simulated Data]
[Figure: S&P 500, Original Data, May-01 to May-12]
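A combined ARIMA/GARCH fit followed by simulated forecast paths over a test period might look like the sketch below. It assumes Econometrics Toolbox and uses synthetic returns in place of S&P 500 data; the model orders (AR(1) with GARCH(1,1)), the starting level `P0`, and all variable names are illustrative assumptions rather than the settings used in the presentation.

```matlab
rng(0);
% Synthetic daily log returns standing in for index returns (assumption).
trueMdl = arima('Constant', 5e-4, 'AR', 0.1, ...
    'Variance', garch('Constant', 2e-6, 'GARCH', 0.85, 'ARCH', 0.1));
r = simulate(trueMdl, 1500);

% Split into a training period and a test period.
nTrain = 1200;
rTrain = r(1:nTrain);
rTest  = r(nTrain+1:end);

% Fit an AR(1) conditional mean with a GARCH(1,1) conditional variance.
mdl    = arima('ARLags', 1, 'Variance', garch(1, 1));
estMdl = estimate(mdl, rTrain, 'Display', 'off');

% Simulate many return paths over the test period, conditioned on the training data.
nPaths  = 200;
horizon = numel(rTest);
simR    = simulate(estMdl, horizon, 'NumPaths', nPaths, 'Y0', rTrain);

% Convert log returns to levels and compare the realized path with the median path.
P0         = 1000;                          % hypothetical level at the end of training
simP       = P0 * exp(cumsum(simR, 1));     % one column per simulated path
realized   = P0 * exp(cumsum(rTest));
medianPath = median(simP, 2);

plot(1:horizon, realized, 'k', 1:horizon, medianPath, 'r');
legend('Realized', 'Median forecasted path');
```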
Models for Time Series Data
development
[Figure: S&P 500, Original Data and Simulated Data, May-01 to May-12]
Agenda
Introduction
Predictive Modeling
– Supervised Machine Learning
– Time Series Modeling
Financial Modeling Workflow
Scale
Challenges of Big Data
Volume
– The amount of data
Velocity
– The speed data is generated/analyzed
Variety
– Range of data types and sources
Value
– What business intelligence can be obtained from the data?
Big Data Capabilities in MATLAB
[Diagram: Scale (RAM) against Complexity (Embarrassingly Parallel to Non-Partitionable), showing 64bit Workstation, datastore, MapReduce, Consulting]
Techniques for Big Data in MATLAB
[Diagram: Scale (RAM to Hard drive) against Complexity (Embarrassingly Parallel to Non-Partitionable), showing 64bit Workstation]
Memory Usage Best Practices
[Diagram: Scale (RAM) against Complexity (Embarrassingly Parallel to Non-Partitionable), showing parfor]
Parallel Computing with MATLAB
[Diagram: MATLAB Desktop (Client) connected to multiple Workers]
Example: Analyzing an Investment Strategy
When to Use parfor
Data Characteristics
– The data for each iteration must
fit in memory
– Loop iterations must be independent
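As an illustration of these two conditions, the sketch below sweeps a strategy parameter with `parfor`. Each iteration needs only data that fits in memory and does not depend on any other iteration. The moving-average rule, the synthetic prices, and the variable names are assumptions made for this example, not the investment strategy from the presentation. Without Parallel Computing Toolbox the loop still runs, just serially.

```matlab
% Hypothetical parameter sweep: evaluate a moving-average rule for several window lengths.
rng(0);
prices  = 100 * exp(cumsum(0.01 * randn(2000, 1)));   % synthetic price series
returns = diff(log(prices));                          % daily log returns
windows = 5:5:100;                                    % candidate window lengths
sharpe  = zeros(size(windows));

parfor k = 1:numel(windows)
    ma       = movmean(prices, [windows(k)-1 0]);     % trailing moving average
    position = prices(1:end-1) > ma(1:end-1);         % long when price is above its average
    stratRet = position .* returns;                   % return earned over the following day
    sharpe(k) = sqrt(252) * mean(stratRet) / std(stratRet);
end

[bestSharpe, idx] = max(sharpe);
fprintf('Best window: %d days (Sharpe %.2f)\n', windows(idx), bestSharpe);
```

Each iteration writes only to its own element `sharpe(k)`, which is what lets MATLAB distribute the iterations in any order.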
Techniques for Big Data in MATLAB
[Diagram: Scale (Hard drive) against Complexity (Embarrassingly Parallel to Non-Partitionable)]
Parallel Computing – Distributed Memory
[Diagram: multiple machines, each with its own cores and RAM]
spmd blocks
spmd
% single program across workers
end
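A slightly fuller spmd block might look like the following sketch. It assumes Parallel Computing Toolbox with a parallel pool available; `labindex` and `gplus` are the long-standing names (newer releases also offer `spmdIndex` and `spmdPlus`). The data and variable names are hypothetical.

```matlab
spmd
    % Each worker builds and summarizes its own block of data.
    localData = rand(1000, 1) + labindex;   % labindex is this worker's index
    localSum  = sum(localData);
    localN    = numel(localData);

    % Combine the partial results across all workers.
    totalSum = gplus(localSum);
    totalN   = gplus(localN);
end

% On the client, spmd variables are Composite objects; index them per worker.
overallMean = totalSum{1} / totalN{1};
disp(overallMean);
```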
Example: Airline Delay Analysis
Data
– Airline On-Time Statistics
– 123.5M records, 29 fields
Analysis
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate
predictive models
When to Use Distributed Memory
Data Characteristics
– Data must fit in collective
memory across machines
Compute Platform
– Prototype (subset of data) on desktop
– Run on a cluster or cloud
Analysis Characteristics
– Distributed arrays support a subset of functions
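A minimal sketch of the distributed-array workflow, assuming Parallel Computing Toolbox and an open parallel pool (local workers stand in for a cluster here). The array is small enough to check against an in-memory result, which mirrors the "prototype on a subset" step; variable names are hypothetical.

```matlab
% Small in-memory array standing in for data too large for one machine.
X = rand(1e5, 4);

% Partition the array across the workers in the pool.
D = distributed(X);

% Supported functions are called the same way as on ordinary arrays.
colMeans = gather(mean(D, 1));     % gather brings the result back to the client

% Compare with the in-memory computation.
disp(max(abs(colMeans - mean(X, 1))));
```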
Techniques for Big Data in MATLAB
[Diagram: Scale (RAM to Hard drive) against Complexity (Embarrassingly Parallel to Non-Partitionable), showing datastore]
Access Big Data
datastore
Incrementally read
subsets of the data
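airdata = datastore('*.csv');   % treat all matching CSV files as one data set
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};   % read only these columns
data = read(airdata);   % read the next chunk of rows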
Example: Determine unique tickers
When to Use datastore
Data Characteristics
– Text files, databases, or stored in the
Hadoop Distributed File System (HDFS)
Analysis Characteristics
– Load, Analyze, Discard workflows
– Incrementally read chunks of data,
process within a while loop
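The load, analyze, discard pattern can be sketched with a `while hasdata` loop. The example below finds the unique tickers across several files. The CSV files, column names, and tiny `ReadSize` are made up so that the sketch is self-contained; they are not the data from the presentation.

```matlab
% Create two small CSV files so the sketch is self-contained (hypothetical data).
folder = tempname;
mkdir(folder);
T1 = table({'AAA'; 'BBB'; 'AAA'}, [10.1; 20.2; 10.3], 'VariableNames', {'Ticker', 'Price'});
T2 = table({'CCC'; 'BBB'},        [30.5; 20.4],       'VariableNames', {'Ticker', 'Price'});
writetable(T1, fullfile(folder, 'trades1.csv'));
writetable(T2, fullfile(folder, 'trades2.csv'));

% Treat all files as one pool of data and read only the column that is needed.
ds = datastore(fullfile(folder, '*.csv'));
ds.SelectedVariableNames = {'Ticker'};
ds.ReadSize = 2;                       % rows per chunk; tiny here to force several reads

tickers = cell(0, 1);
while hasdata(ds)
    chunk   = read(ds);                        % load the next chunk
    tickers = union(tickers, chunk.Ticker);    % analyze: keep only the unique values
end                                            % the chunk is overwritten on the next pass
disp(tickers);
```

Only the running set of unique values stays in memory, so the loop's footprint depends on `ReadSize` rather than on the total size of the files.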
Reading in Part of a Dataset from Files
MAT file
– Load and save part of a variable using matfile
Binary file
– Read and write directly to/from file using memmapfile
Databases
– ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)
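The first two options could be exercised as in the sketch below, which uses temporary files and made-up data. `matfile` and `memmapfile` are base MATLAB functions; the file names and variable names are hypothetical.

```matlab
% --- MAT file: load and save part of a variable with matfile ---
matName = [tempname '.mat'];
A = magic(1000);
save(matName, 'A', '-v7.3');         % Version 7.3 MAT-files support efficient partial access

m = matfile(matName, 'Writable', true);
[nRows, nCols] = size(m, 'A');       % query the size without loading the variable
block = m.A(1:10, 1:5);              % load only a 10-by-5 block
m.A(1:10, 1:5) = zeros(10, 5);       % write only that block back to the file

% --- Binary file: map the file into memory with memmapfile ---
binName = [tempname '.bin'];
fid = fopen(binName, 'w');
fwrite(fid, 1:100, 'double');        % 100 doubles written to disk
fclose(fid);

mm = memmapfile(binName, 'Format', 'double');
x  = mm.Data(11:20);                 % only the requested elements are read
```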
Techniques for Big Data in MATLAB
[Diagram: Scale (RAM to Hard drive) against Complexity (Embarrassingly Parallel to Non-Partitionable), showing MapReduce]
Analyze Big Data
mapreduce
MapReduce
Challenges
Challenges
Mean
– Coupling between rows
Covariance
– Coupling between rows
– Coupling between columns
Approach
Calculate mean/cov
MapReduce
Combine mean/cov
Scale Hadoop
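One way to realize "calculate per chunk, then combine" is to have each map call emit sums that add across chunks, and let the reducer add them up. The sketch below does this for the mean and covariance. It is an assumed formulation for illustration, not the code from the presentation: the data is synthetic, the function and field names are hypothetical, and the one-pass covariance formula is simple rather than numerically robust.

```matlab
% Small CSV file standing in for a large data set (hypothetical data).
rng(0);
X = randn(500, 3) * [1 0.5 0; 0 1 0.3; 0 0 1];
fname = [tempname '.csv'];
writetable(array2table(X, 'VariableNames', {'A', 'B', 'C'}), fname);

ds = datastore(fname);
ds.ReadSize = 100;                        % several chunks, so the mapper runs several times

out = mapreduce(ds, @statsMapper, @statsReducer, 'Display', 'off');
res = readall(out);                       % table with Key and Value variables
s   = res.Value{1};

mu = s.sumX / s.n;                                    % combined mean
C  = (s.sumXX - s.n * (mu' * mu)) / (s.n - 1);        % combined covariance

function statsMapper(data, ~, intermKVStore)
    % Per-chunk statistics: row count, column sums, and cross-products.
    x = data{:, :};
    add(intermKVStore, 'stats', ...
        struct('n', size(x, 1), 'sumX', sum(x, 1), 'sumXX', x' * x));
end

function statsReducer(~, intermValIter, outKVStore)
    % Combine the per-chunk statistics by addition.
    total = struct('n', 0, 'sumX', 0, 'sumXX', 0);
    while hasnext(intermValIter)
        s = getnext(intermValIter);
        total.n     = total.n + s.n;
        total.sumX  = total.sumX + s.sumX;
        total.sumXX = total.sumXX + s.sumXX;
    end
    add(outKVStore, 'stats', total);
end
```

Because the mapper only ever sees one chunk, the coupling between rows and columns is handled entirely by quantities that can be summed across chunks.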
The Big Data Platform
[Diagram: Datastore over HDFS nodes (Node, Data), each running Map and Reduce tasks under MapReduce; MATLAB MapReduce Code runs on the MATLAB runtime]
Solution
Datastore
– Treat multiple files as a pool of data
– Parse data in chunks to determine unique values
MapReduce
– Group, filter, and calculate summary statistics
Hadoop
– Algorithm is the same as the one developed on desktop
– Easily deploy to Hadoop using interactive tools
MATLAB Interactive Environment
– Debugger and profiler
– Validate algorithms using built-in functions for rapid prototyping
Big Data Summary
– MapReduce
– Distributed arrays
Prototype code for your cluster
no algorithm changes
Agenda
Introduction
Predictive Modeling
– Supervised Machine Learning
– Time Series Modeling
Financial Modeling Workflow
Deploy
Hadoop
Share
Reporting
Applications
Deployed Applications
MATLAB Production Server
Integrating with IT systems
[Diagram: MATLAB Compiler SDK; Excel; Desktop Applications; Application Server; Database Server; Pricing; Risk; Analytics]
Benefits of the MATLAB Production Server
Summary
Financial Modeling Workflow
[Diagram: products across the workflow: Database, Datafeed, Trading, Financial, Statistics & Machine Learning, Optimization, Report Generator, MATLAB Compiler, MATLAB Compiler SDK, Production Server, MATLAB]
MATLAB Documentation
– Strategies for Efficient Use of Memory
– Resolving "Out of Memory" Errors
Training Services mathworks.com/training
Classroom Training
– Customized curriculum
– Usually 2-5 day consecutive format
Live Online
– Flexible scheduling
– Full or Half Day Sessions
Self-Paced
– Learn whenever you want and at your own pace
– Online discussion boards and live trainer chats
Training Roadmap
Risk Management
Content for On-site Customization
Optimization Techniques
Asset Allocation
A global team of experts supporting every stage of tool and process integration
Continuous Improvement
Automation
Process and Technology
Standardization
Full Application
Deployment
Process Assessment
Component
Deployment
Advisory Services
Jumpstart
Migration Planning
Q&A