Predictive Analytics and Big Data with MATLAB

The document discusses predictive analytics and big data analysis using MATLAB, covering topics such as predictive modeling, supervised machine learning, and time series modeling. It outlines workflows for financial modeling, challenges associated with big data, and techniques for handling large datasets, including parallel computing and distributed processing. The document also provides examples of applications, such as predicting customer responses and analyzing investment strategies.

Predictive Analytics and Big Data with MATLAB

Ian McKenna, Ph.D.

© 2015 The MathWorks, Inc.


Agenda

- Introduction
- Predictive Modeling
  – Supervised Machine Learning
  – Time Series Modeling
- Big Data Analysis
  – Load, Analyze, Discard workflows
  – Scale computations with parallel computing
  – Distributed processing of large data sets
- Moving to Production with MATLAB

2
Financial Modeling Workflow

[Diagram: Access (Files, Databases, Datafeeds) → Explore and Prototype (Data Analysis & Visualization, Financial Modeling, Application Development) → Deploy/Share (Reporting, Applications, Production), spanning small and big data, with Scale underneath]

3
Financial Modeling Workflow

[Diagram: the Explore and Prototype (Predictive Modeling) stage highlighted — Data Analysis & Visualization, Financial Modeling, Application Development]

4
Agenda

- Introduction
- Predictive Modeling
  – Supervised Machine Learning
  – Time Series Modeling
- Big Data Analysis
  – Load, Analyze, Discard workflows
  – Scale computations with parallel computing
  – Distributed processing of large data sets
- Moving to Production with MATLAB

7
What is Predictive Modeling?

- Use of mathematical language to make predictions about the future

  Input/Predictors → Predictive model → Output/Response

- Examples: EL = f(T, t, DP, ...), trading strategies, electricity demand

8
Why develop predictive models?

- Forecast prices/returns
- Price complex instruments
- Analyze impact of predictors (sensitivity analysis)
- Stress testing
- Gain economic/market insight
- And many more reasons

9
Challenges

- Significant technical expertise required
- No "one size fits all" solution
- Locked into black-box solutions
- Time required to conduct the analysis

10
Predictive Modeling Workflow

Train: iterate until you find the best model

  LOAD DATA (filters, summary statistics) → PREPROCESS DATA (PCA, cluster analysis) → SUPERVISED LEARNING (classification, regression) → MODEL

Predict: integrate trained models into applications

  NEW DATA → MODEL → PREDICTION

11
Classes of Response Variables

  Structure         Type
  Sequential        Continuous
  Non-Sequential    Categorical

13
Examples

- Predicting Customer Response (Classification Learner App)
  – Classification techniques
  – Measure accuracy and compare models
  [Chart: Bank Marketing Campaign — misclassification rate (%) per classifier, split by Yes/No misclassified]

- Predicting S&P 500
  – ARIMA modeling
  – GARCH modeling
  [Chart: Realized vs. Median Forecasted Path, S&P 500, May-01 to May-12]

14
Getting Started with Predictive Modeling

- Perform common tasks interactively
  – Classification Learner App
  – Neural Net App

16
Example – Bank Marketing Campaign

- Goal:
  – Predict whether a customer will subscribe to a bank term deposit based on different attributes

- Approach:
  – Train classifiers using different models
  – Measure accuracy and compare models
  – Reduce model complexity
  – Use the classifier for prediction

  [Chart: Bank Marketing Campaign — misclassification rate (%) per classifier, split by Yes/No misclassified]

Data set downloaded from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

21
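The approach above can be sketched in a few lines of MATLAB. This is a minimal illustration, not the presenter's code: it assumes the Statistics and Machine Learning Toolbox and that the UCI data has been read into a table `bank` whose yes/no response variable is named `y`; the file name is a hypothetical path, and a decision tree stands in for one of the candidate models.

```matlab
% Minimal sketch: train one classifier on the bank marketing data and
% measure its misclassification rate on a held-out set.
bank = readtable('bank-full.csv', 'Delimiter', ';');   % hypothetical path

cv   = cvpartition(height(bank), 'HoldOut', 0.3);      % 70/30 split
tree = fitctree(bank(training(cv), :), 'y');           % one candidate model

pred     = predict(tree, bank(test(cv), :));
missRate = mean(~strcmp(pred, bank.y(test(cv))));      % misclassification rate
```

Swapping `fitctree` for `fitcdiscr`, `fitcknn`, or `fitcnb` and comparing `missRate` across fits reproduces the kind of model comparison shown in the chart.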
Classification Techniques

Regression:
  Linear Regression, Non-linear Regression (GLM, Logistic), Neural Networks, Ensemble Methods, Decision Trees

Classification:
  Support Vector Machines, Discriminant Analysis, Nearest Neighbor, Naive Bayes

22
Example – Bank Marketing Campaign

- Numerous predictive models with rich documentation
- Interactive visualizations and apps to aid discovery
- Built-in parallel computing support
- Quick prototyping; focus on modeling, not programming

  [Chart: Bank Marketing Campaign — misclassification rate (%) per classifier, split by Yes/No misclassified]

26
Example – Time Series Modeling and Forecasting for the S&P 500 Index

- Goal:
  – Model the S&P 500 time series as a combined ARIMA/GARCH process and forecast on test data

- Approach:
  – Fit an ARIMA model to S&P 500 returns and estimate parameters
  – Fit a GARCH model for S&P 500 volatility
  – Perform statistical tests for time series attributes, e.g. stationarity

  [Charts: Realized vs. All Forecasted Paths and Realized vs. Median Forecasted Path, S&P 500, May-01 to May-12]

27
Models for Time Series Data

Conditional Mean Models:
  AR – Autoregressive
  MA – Moving Average
  ARIMA – Integrated
  ARIMAX – eXogenous inputs
  VARMA – Vector ARMA
  VARMAX – eXogenous inputs
  VEC – Vector Error Correcting

Conditional Variance Models:
  ARCH
  GARCH
  EGARCH
  GJR

Non-Linear Models:
  NAR Neural Network
  NARX Neural Network

State Space Models:
  Time Varying
  Time Invariant

Regression with ARIMA errors

28
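A rough sketch of how a conditional mean model and a conditional variance model combine into one fitted process (assumes the Econometrics Toolbox and a return vector `r`; the ARMA(1,1)/GARCH(1,1) parameter choices are illustrative, not the presenter's):

```matlab
% Composite model: ARMA(1,1) conditional mean with GARCH(1,1) variance.
mdl          = arima('ARLags', 1, 'MALags', 1);
mdl.Variance = garch(1, 1);

fit = estimate(mdl, r);               % fit both parts to the return series

% Forecast 30 steps ahead, conditioning on the observed sample.
[yF, yMSE] = forecast(fit, 30, 'Y0', r);
```

Setting the `Variance` property to a `garch` model is what ties the two halves together; `estimate` then fits the mean and volatility parameters jointly.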
Example – Time Series Modeling and Forecasting for the S&P 500 Index

- Numerous ARIMAX and GARCH modeling techniques with rich documentation
- Interactive visualizations
- Code parallelization to maximize computing resources
- Rapid exploration & development

  [Charts: Realized vs. All Forecasted Paths and Realized vs. Median Forecasted Path, S&P 500, May-01 to May-12]

29
Agenda

- Introduction
- Predictive Modeling
  – Supervised Machine Learning
  – Time Series Modeling
- Big Data Analysis
  – Load, Analyze, Discard workflows
  – Scale computations with parallel computing
  – Distributed processing of large data sets
- Moving to Production with MATLAB

34
Financial Modeling Workflow

[Diagram: Access (Files, Databases, Datafeeds) → Explore and Prototype (Data Analysis & Visualization, Financial Modeling, Application Development) → Deploy/Share (Reporting, Applications, Production), spanning small and big data, with Scale underneath]

35
Financial Modeling Workflow

[Diagram: the Scale stage highlighted]

36
Challenges of Big Data

"Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications." (Wikipedia)

- Volume – the amount of data
- Velocity – the speed at which data is generated/analyzed
- Variety – the range of data types and sources
- Value – what business intelligence can be obtained from the data?

37
Big Data Capabilities in MATLAB

Memory and Data Access:
- 64-bit processors
- Memory-mapped variables
- Disk variables
- Databases (native ODBC interface, database datastore, fetch in batches, scrollable cursors)
- Datastores

Programming Constructs:
- Streaming
- Block processing
- Parallel-for loops
- GPU arrays
- SPMD and distributed arrays
- MapReduce

Platforms:
- Desktop (multicore, GPU)
- Clusters
- Cloud computing (MDCS on EC2)
- Hadoop

38
Techniques for Big Data in MATLAB

[Diagram: techniques arranged by scale (64-bit workstation RAM up to hard drive) and complexity (embarrassingly parallel to non-partitionable): parfor; SPMD, distributed memory; datastore; MapReduce; consulting beyond that]

39
Techniques for Big Data in MATLAB

[Diagram: the same scale/complexity axes, shown before any technique is highlighted]

40
Memory Usage Best Practices

- Expand workspace: 64-bit MATLAB
- Use the appropriate data storage
  – Categorical arrays
  – Be aware of the overhead of cells and structures
  – Use only the precision you need
  – Sparse matrices
- Minimize data copies
  – In-place operations, if possible
  – Use nested functions
  – Inherit data using object handles

41
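The storage guidance above is easy to see with `whos`. The element sizes in the comments are the standard MATLAB ones; the variable names and sizes are illustrative only:

```matlab
x  = rand(1e6, 1);              % double precision: 8 bytes per element
xs = single(x);                 % single precision: 4 bytes, if adequate
c  = repmat({'yes'}, 1e6, 1);   % cell array of strings: per-cell overhead
g  = categorical(c);            % categorical: compact integer codes
S  = sparse(1e4, 1e4);          % all-zero sparse matrix: near-zero storage
whos x xs c g S                 % compare the Bytes column
```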
Techniques for Big Data in MATLAB

[Diagram: parfor highlighted — data fits in RAM, embarrassingly parallel]

43
Parallel Computing with MATLAB

[Diagram: MATLAB Desktop (Client) dispatching work to multiple workers]

44
Example: Analyzing an Investment Strategy

- Optimize portfolios against a target benchmark
- Analyze and report performance over time
- Backtest over a 20-year period, parallelizing the 3-month rebalances

45
When to Use parfor

- Data characteristics
  – The data for each iteration must fit in memory
  – Loop iterations must be independent
- Transition from desktop to cluster with minimal code changes
- Speed up analysis on big data

48
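A parfor backtest in the spirit of the rebalancing example might look like the sketch below. `rebalancePortfolio` is a hypothetical user function standing in for the per-period strategy logic; nothing here is the presenter's actual code.

```matlab
parpool;                      % start a pool of local workers (once per session)

nPeriods = 80;                % e.g. 20 years of 3-month rebalances
ret = zeros(nPeriods, 1);

parfor k = 1:nPeriods
    % Each iteration touches only period k's data, so iterations are
    % independent and fit in memory -- the two parfor requirements above.
    ret(k) = rebalancePortfolio(k);   % hypothetical strategy function
end
```

Because `parfor` has the same syntax as `for`, the same loop runs unchanged on a desktop pool or a cluster, which is the "minimal code changes" point above.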
Techniques for Big Data in MATLAB

[Diagram: SPMD, distributed memory highlighted — data spans the RAM of multiple machines]

49
Parallel Computing – Distributed Memory

[Diagram: using more computers — each machine contributes its cores (Core 1–4) and its RAM to the pool]

50
spmd blocks

    spmd
        % single program across workers
    end

- Mix parallel and serial code in the same function
- Single Program runs simultaneously across workers
- Multiple Data spread across multiple workers

52
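A sketch of the Single Program, Multiple Data idea (assumes the Parallel Computing Toolbox; `labindex`/`numlabs` are the worker identifiers in the MATLAB releases this deck targets):

```matlab
spmd
    % Single program: this block runs on every worker simultaneously.
    fprintf('Worker %d of %d\n', labindex, numlabs);
end

% Multiple data: spread one large array across the workers' memory,
% then bring only the small result back to the client.
X = distributed.rand(1e4);     % elements live on the workers, not the client
m = gather(mean(X(:)));        % small scalar returned to the client
```

The design point is that `X` never has to fit in the client's RAM; only the `gather`ed result does.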
Example: Airline Delay Analysis

- Data
  – Airline On-Time Statistics
  – 123.5M records, 29 fields
- Analysis
  – Calculate delay patterns
  – Visualize summaries
  – Estimate & evaluate predictive models

53
When to Use Distributed Memory

- Data characteristics
  – Data must fit in the collective memory across machines
- Compute platform
  – Prototype (on a subset of data) on the desktop
  – Run on a cluster or cloud
- Analysis characteristics
  – Distributed arrays support a subset of functions

54
Techniques for Big Data in MATLAB

[Diagram: datastore highlighted — data on the hard drive, embarrassingly parallel]

55
Access Big Data
datastore

- Easily specify a data set
  – Single text file (or collection of text files)
- Preview data structure and format
- Select data to import using column names
- Incrementally read subsets of the data

    airdata = datastore('*.csv');
    airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};

    data = read(airdata);

56
Example: Determine Unique Tickers

- 15 years of daily S&P 500 data
- Data in multiple files of different sizes
- Many irrelevant columns in the data set

57
When to Use datastore

- Data characteristics
  – Text files, databases, or data stored in the Hadoop Distributed File System (HDFS)
- Analysis characteristics
  – Load, Analyze, Discard workflows
  – Incrementally read chunks of data; process within a while loop

58
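The Load, Analyze, Discard workflow reduces to a while loop over the datastore. This sketch accumulates the mean of one column without ever holding the full data set in memory; the file pattern and `ArrDelay` variable follow the airline example and are assumptions:

```matlab
ds = datastore('*.csv');
ds.SelectedVariableNames = {'ArrDelay'};

total = 0;  n = 0;
while hasdata(ds)
    t = read(ds);                          % Load: next chunk as a table
    x = t.ArrDelay(~isnan(t.ArrDelay));    % Analyze: drop missing values
    total = total + sum(x);
    n     = n + numel(x);
end                                        % Discard: chunk goes out of scope
avgDelay = total / n;                      % partial sums combine exactly
```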
Reading in Part of a Dataset from Files

- Text file, ASCII file
  – Read part of a collection of files using datastore
- MAT-file
  – Load and save part of a variable using matfile
- Binary file
  – Read and write directly to/from the file using memmapfile
- Databases
  – ODBC- and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)

59
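For the MAT-file and binary cases, a short sketch (the file names, variable name `X`, and the flat-doubles layout are assumptions for illustration):

```matlab
% MAT-file: read a block of rows of variable X without loading the file.
m     = matfile('bigdata.mat');
block = m.X(1:10000, :);          % only these rows come off the disk

% Binary file: map a flat file of doubles into the address space.
mm    = memmapfile('prices.bin', 'Format', 'double');
first = mm.Data(1:1000);          % pages are read on demand
```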
Techniques for Big Data in MATLAB

[Diagram: MapReduce highlighted — data on the hard drive, extending to non-partitionable problems]

60
Analyze Big Data
mapreduce

- MapReduce: a programming technique to analyze big data
  – mapreduce uses a datastore to process data in small chunks that individually fit into memory
- mapreduce on the desktop
  – Access data on HDFS
  – Integrates with Parallel Computing Toolbox
- mapreduce with Hadoop
  – Run on Hadoop using MATLAB Distributed Computing Server
  – Deploy to Hadoop using MATLAB Compiler

    ********************************
    *      MAPREDUCE PROGRESS      *
    ********************************
    Map   0% Reduce   0%
    Map  20% Reduce   0%
    Map  40% Reduce   0%
    Map  60% Reduce   0%
    Map  80% Reduce   0%
    Map 100% Reduce   0%
    Map 100% Reduce  25%
    Map 100% Reduce  50%
    Map 100% Reduce  75%
    Map 100% Reduce 100%

61
MapReduce

[Diagram: Data Store → Map → Shuffle & Sort → Reduce. Daily (Date, Ticker, Return) records are mapped to key/value pairs keyed by date (e.g. key 3-Jan → {AIG, GE}; key 4-Jan → {YHOO, INTC}; key 5-Jan → {AMZN, GE, YHOO}), then reduced per key to the unique tickers for that date]

62
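The diagram above corresponds to a mapper that keys tickers by date and a reducer that deduplicates per key. A sketch, assuming files matching `sp500_*.csv` with Date and Ticker columns read as strings (the file pattern and function names are illustrative):

```matlab
ds = datastore('sp500_*.csv');
ds.SelectedVariableNames = {'Date', 'Ticker'};

out = mapreduce(ds, @tickerMapper, @tickerReducer);
readall(out)                    % one list of unique tickers per date

function tickerMapper(data, ~, kvStore)
    % Emit one (date, ticker) pair per record in this chunk.
    addmulti(kvStore, cellstr(data.Date), cellstr(data.Ticker));
end

function tickerReducer(dateKey, valIter, outStore)
    vals = {};
    while hasnext(valIter)
        vals{end+1} = getnext(valIter);   %#ok<AGROW>
    end
    add(outStore, dateKey, unique(vals));
end
```

The shuffle/sort step between mapper and reducer is handled by `mapreduce` itself; the reducer only ever sees all values for one key.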
Example: Calculate Covariance of the S&P 500 Using MapReduce

- 15 years of daily S&P 500 returns stored in multiple files
- Use all the data to calculate the mean and covariance
- Computation must scale to 1-minute bars for 30 years of data

63
Challenges

- Multiple files of differing sizes

64
Challenges

- How do we read/partition this data set if it doesn't fit in memory?

    Date        Ticker  Open    High    Low     Close   Volume    Return
    3-Jan-2000  AIG     107.13  107.44  103     103.94  166500    NaN
    3-Jan-2000  AMZN    87.25   89.56   79.05   89.56   16117600  NaN
    3-Jan-2000  GE      147.25  148     144     144     22121400  -0.040
    8-Jan-2000  AMZN    81.5    89.56   79.05   89.38   16117600  NaN
    4-Jan-2000  AIG     101.5   102.13  98.31   98.63   364000    -0.051
    Jan 4,2000  YHOO    464.5   500.12  442     443     69868800  -0.067
    4-Jan-2000  INTC    85.44   87.88   82.25   92.94   51019600  -0.046
    4-Jan-2000  GE      147.25  148     144     144     22121400  -0.040
    8-Jan-2000  GE      143.12  146.94  142.63  145.67  19873200  0.013

- Missing data (explicit/implicit)

65
Challenges

- Mean
  – Coupling between rows
- Covariance
  – Coupling between rows
  – Coupling between columns

66
Approach

- Reading in chunks – do we have a full column of data?

    Long form:                      Wide form:
    Date        Ticker  Return      Date        AIG     AMZN  GE     YHOO
    3-Jan-2000  AIG     -0.012      3-Jan-2000  -0.012  NaN   0.051  NaN
    3-Jan-2000  AMZN    NaN         4-Jan-2000  0.097   NaN   NaN    -0.035
    3-Jan-2000  GE      0.051
    4-Jan-2000  AMZN    NaN
    4-Jan-2000  AIG     0.097
    4-Jan-2000  YHOO    -0.035
    4-Jan-2000  GE      NaN

- Solution: convert to tabular (wide) form with all columns
- Further memory savings (ticker/date not repeated)

67
Approach

- Goal: calculate mean/covariance for big data sets

  Data Store (S&P 500 data files 1…N)
    → MapReduce: determine unique tickers, tabular conversion, validate
    → MapReduce: calculate mean/cov per chunk, combine mean/cov
    → Scale with Hadoop

68
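The "calculate then combine" step works because per-chunk sums and counts reduce exactly to the global mean. A sketch of that mapper/reducer pair (function names are illustrative; covariance follows the same pattern with cross-product accumulators):

```matlab
function meanMapper(data, ~, kvStore)
    % Per-chunk partial result: [sum, count] over the valid returns.
    x = data.Return(~isnan(data.Return));
    add(kvStore, 'partial', [sum(x), numel(x)]);
end

function meanReducer(~, valIter, outStore)
    s = 0;  n = 0;
    while hasnext(valIter)
        v = getnext(valIter);
        s = s + v(1);  n = n + v(2);      % combine partials exactly
    end
    add(outStore, 'mean', s / n);
end
```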
The Big Data Platform

[Diagram: a datastore over HDFS, with Map and Reduce tasks running on each data node]

- HDFS
  – Fault-tolerant distributed data storage
- MapReduce
  – Take the computation to the data

70
Deployed Applications with Hadoop

[Diagram: MATLAB MapReduce code deployed with the MATLAB runtime to Hadoop; each HDFS data node runs its own Map and Reduce tasks]

72
Solution

- datastore
  – Treat multiple files as a pool of data
  – Parse data in chunks to determine unique values
- mapreduce
  – Group, filter, and calculate summary statistics
- Hadoop
  – Algorithm is the same as the one developed on the desktop
  – Easily deploy to Hadoop using interactive tools
- MATLAB interactive environment
  – Debugger and profiler
  – Validate algorithms using built-in functions for rapid prototyping

75
Big Data Summary

- Access portions of data with datastore
- Cluster-ready programming constructs
  – parfor
  – SPMD
  – MapReduce
  – Distributed arrays
- Prototype code for your cluster
  – Transition from desktop to cluster with no algorithm changes

[Diagram: MATLAB Desktop (Client) submitting work through a scheduler to a cluster]

77
Agenda

- Introduction
- Predictive Modeling
  – Supervised Machine Learning
  – Time Series Modeling
- Big Data Analysis
  – Load, Analyze, Discard workflows
  – Scale computations with parallel computing
  – Distributed processing of large data sets
- Moving to Production with MATLAB

78
Financial Modeling Workflow

[Diagram: Access (Files, Databases, Datafeeds) → Explore and Prototype (Data Analysis & Visualization, Financial Modeling, Application Development) → Deploy/Share (Reporting, Applications, Production), spanning small and big data, with Scale underneath]

79
Financial Modeling Workflow

[Diagram: the Deploy stage highlighted — Share via Reporting and Applications (Desktop, Enterprise, Web); Production, including Hadoop]

80
Deployed Applications

- Example: Portfolio optimization and simulation
- Example: Day-ahead system load forecasting

81
MATLAB Production Server

- Enterprise framework for running packaged MATLAB programs

  [Diagram: web and application servers sending requests to the MATLAB Production Server's Request Broker & Program Manager]

- Scalable & reliable
  – Service large numbers of concurrent requests
- Use with web, database & application servers
  – Easily integrates with IT systems (Java, .NET, C++, Python)

85
Integrating with IT Systems

[Diagram: algorithms packaged with MATLAB Compiler SDK run on the MATLAB Production Server — e.g. portfolio optimization, pricing, and risk analytics — serving web applications via a web server, Excel® desktop applications, an application server, and a database server]

87
Benefits of the MATLAB Production Server

- Reduce the cost of building and deploying in-house analytics
  – Quants/analysts/financial modelers do not have to rewrite code in another language
  – Update deployed models easily without restarting the server
  – Single environment for model development and testing
- IT can efficiently integrate models/analytics into production systems
  – Centrally manage packaged MATLAB programs
  – Handoff from quant to IT only requires function signatures
  – Easily support analytics built with multiple releases of MATLAB
  – Run multiple instances of MATLAB Production Server simultaneously

89
Summary

Challenge: Time (loss of productivity)
  MATLAB solution: Rapid analysis and application development — easily access big data sets, interactive exploratory analysis and visualization, apps to get started, debugger

Challenge: No "one-size-fits-all"
  MATLAB solution: Multiple algorithms and programming constructs — regression, machine learning, time series modeling, parfor, MapReduce, datastore

Challenge: Big data and scaling
  MATLAB solution: Work on the desktop and scale to clusters — Hadoop support, no algorithm changes required

Challenge: Time to deploy & integrate
  MATLAB solution: Ease of deployment, leveraging the enterprise — push-button deployment into production

96
Financial Modeling Workflow

Access → Research and Quantify → Share
  Files / Data Analysis and Visualization / Reporting
  Databases / Financial Modeling / Applications
  Datafeeds / Application Development / Production

[Product map: Spreadsheet Link EX, Database, Datafeed, Trading; Financial Instruments, Econometrics, Financial, Statistics & Machine Learning, Optimization, Neural Networks, Curve Fitting; Report Generator, Production Server, MATLAB Compiler SDK, MATLAB Compiler; Parallel Computing, MATLAB Distributed Computing Server; all built on MATLAB]

98
Learn More: Predictive Modeling with MATLAB

To learn more, visit:
www.mathworks.com/machine-learning

- Basket Selection using Stepwise Regression
- Classification in the presence of missing data
- Regression with Boosted Decision Trees
- Hierarchical Clustering

101
Learn More: Big Data

- MATLAB Documentation
  – Strategies for Efficient Use of Memory
  – Resolving "Out of Memory" Errors
- Big Data with MATLAB
  – www.mathworks.com/discovery/big-data-matlab.html
- MATLAB MapReduce and Hadoop
  – www.mathworks.com/discovery/matlab-mapreduce-hadoop.html

102
Training Services (mathworks.com/training)

- Classroom Training
  – Customized curriculum
  – Usually 2-5 day consecutive format
- Live Online
  – Flexible scheduling
  – Full- or half-day sessions
- Self-Paced
  – Learn whenever you want and at your own pace
  – Online discussion boards and live trainer chats

CPE APPROVED PROVIDER: earn one CPE credit per hour of content.

103
Training Roadmap

MATLAB for Financial Applications

Data Analysis and Modeling:
- Statistical Methods
- Machine Learning
- Time-Series Modeling (Econometrics)

Application Development:
- Programming Techniques
- Interactive User Interfaces
- Parallel Computing

Content for on-site customization:
- Risk Management
- Optimization Techniques
- Asset Allocation
- Interfacing with Databases
- Interfacing with Excel

104
Consulting Services
Accelerating return on investment

A global team of experts supporting every stage of tool and process integration:

- Advisory Services
- Process Assessment
- Migration Planning
- Jumpstart
- Component Deployment
- Full Application Deployment
- Process and Technology Standardization
- Process and Technology Automation
- Continuous Improvement

Spanning research, advanced engineering, product engineering teams, and supplier involvement.

105
Q&A

106
