Introduction To Data Science
Today’s Presenter
Charlie Harrison
Salford Systems – Pioneering Predictive Analytics and Machine Learning
CLASSIFICATION
Predict a qualitative value
TIME SERIES
Predict future values based on past values
SURVIVAL ANALYSIS
Predict time until occurrence
Have you downloaded SPM 8.2? After this webinar, we’ll give you access to the dataset used so you
can try it out for yourself.
https://info.salford-systems.com/spm-8-download
Every algorithm covered today was created or co-created by Dr. Breiman or Dr. Friedman.
Ease of Use
Salford’s models don’t require coding
Accuracy of Prediction
Salford’s models stand the test of time and are used by some of the biggest
corporations in the world
Defensibility of Models
Salford’s models are defensible internally to executive stakeholders and
externally to regulators
opportunities by narrowing down which factors have the most impact on your outcome

Some of the most common applications include:
• Fraud Prevention
• Risk Reduction in Credit Scoring and Loan Default
• Optimizing Marketing Campaigns
• Improving Operations

INDUSTRIES
FINANCIAL SERVICES: Does level of education impact credit risk?
HEALTH CARE: Does body weight influence the risk of heart disease?

FUNCTIONAL AREAS
SALES: What promotions are most effective?
MARKETING: Does customer satisfaction influence loyalty?
SPM ENGINE
• AUTOMATIC VARIABLE SELECTION
• AUTOMATIC MISSING VALUE/OUTLIER HANDLING
• AUTOMATIC INTERACTION DETECTION
• AUTOMATIC MODELING OF LOCAL EFFECTS
• INVARIANT TO MONOTONE TRANSFORMATIONS OF PREDICTORS
• PREDICTIVE PERFORMANCE
• INTERPRETABILITY
10/24/2017
Manufacturing
We will use an algorithm called gradient boosting to do this. TreeNet® software will be used. TreeNet is
unique in that its code was originally written by Jerome Friedman, the creator of gradient boosting.
Link: https://archive.ics.uci.edu/ml/datasets/SECOM
Applying CART
1. Build the model in SPM
[Figure: CART tree with splits on SIGNAL_60, SIGNAL_359, SIGNAL_246, SIGNAL_158, and SIGNAL_21]
CART automatically:
1. Selects variables
2. Models nonlinear relationships
3. Models local effects
4. Models interactions
5. Handles missing values
The No Data Optimal Rule classifies every observation as one class. More specifically, the class chosen for the No Data Optimal Rule is the class that has the lowest cost compared to the other(s).

Terminal node          Class      Cases   %
N = 7  (W = 7.00)      Circle     1       14.3
                       Triangle   6       85.7
N = 12 (W = 12.00)     Circle     9       75.0
                       Triangle   3       25.0

[Figure: predicted class in each node under CART and under the No Data Optimal Rule]

Relative Cost = 0.44

Good: If the relative cost is close to zero (closer is better), then CART is better than the No Data Optimal Rule.
Bad: If the relative cost is equal to 1, then the CART error is the same as that of the No Data Optimal Rule, which means that CART is no better than just predicting every observation as the same class.
The relative cost can be greater than 1, which is especially bad; more generally, values around 1 should be considered “bad”.
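The relative cost of .44 in the two-node example can be reproduced by hand. The sketch below is only an illustration and makes two assumptions that the slides do not state: every misclassification costs the same, and every case carries equal weight (consistent with W matching N in each node). Under those assumptions the lowest-cost class is simply the majority class. SPM's own calculation may also account for priors and custom costs.

```python
# Class counts per terminal node, read off the two-node example.
nodes = [
    {"Circle": 1, "Triangle": 6},   # N = 7
    {"Circle": 9, "Triangle": 3},   # N = 12
]

def misclassified(counts):
    """Cases that do not belong to the majority class of these counts."""
    return sum(counts.values()) - max(counts.values())

# CART predicts the majority class separately within each terminal node.
cart_errors = sum(misclassified(node) for node in nodes)

# No Data Optimal Rule: a single class for every observation
# (assumed here to be the overall majority class).
totals = {}
for node in nodes:
    for label, n in node.items():
        totals[label] = totals.get(label, 0) + n
baseline_errors = misclassified(totals)

relative_cost = cart_errors / baseline_errors
print(f"CART errors: {cart_errors}, baseline errors: {baseline_errors}")
print(f"Relative cost: {relative_cost:.2f}")
```

With these counts CART misclassifies 4 of 19 cases and the single-class rule misclassifies 9 of 19, so the ratio is 4/9, which rounds to 0.44.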
CART Confusion Matrix
Use the Confusion Matrix to assess CART and the
types of correct or incorrect predictions that it
makes.
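As a rough illustration of what a confusion matrix tabulates, the sketch below cross-counts actual against predicted classes. The (actual, predicted) pairs are expanded from the two-node Circle/Triangle example, assuming each node predicts its majority class; the layout is a plain-Python stand-in and not SPM's report format.

```python
from collections import Counter

# (actual, predicted) pairs expanded from the two-node example's counts.
pairs = (
    [("Circle", "Triangle")] * 1 + [("Triangle", "Triangle")] * 6   # node predicting Triangle
    + [("Circle", "Circle")] * 9 + [("Triangle", "Circle")] * 3     # node predicting Circle
)

labels = ["Circle", "Triangle"]
matrix = Counter(pairs)   # keys are (actual, predicted), values are case counts

# Rows are actual classes, columns are predicted classes.
print(f"{'actual':<10}" + "".join(f"{'pred ' + p:>15}" for p in labels))
for actual in labels:
    row = "".join(f"{matrix[(actual, p)]:>15}" for p in labels)
    print(f"{actual:<10}{row}")

# Share of each actual class that was predicted correctly.
class_accuracy = {
    a: matrix[(a, a)] / sum(matrix[(a, p)] for p in labels) for a in labels
}
for label, acc in class_accuracy.items():
    print(f"{label}: {acc:.1%} correct")
```

The off-diagonal cells separate the two kinds of mistake (Circles called Triangles versus Triangles called Circles), which a single error rate hides.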
Hotspot Detection
[Figure: CART tree with a highlighted terminal node; splits on SIGNAL_60, SIGNAL_359, SIGNAL_247, SIGNAL_111, and SIGNAL_158]
2. Translate CART into C (or Java, PMML, or SAS) and deploy your
CART model in your environment in order to make predictions
in real-time.
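To give a feel for why a tree translates so directly into deployable code, here is a hand-written sketch of a small tree expressed as nested conditions. It is not output from SPM, and it is written in Python rather than C, Java, PMML, or SAS purely for illustration. The variable names echo signals from the example tree, while the function name, every threshold, and the class labels are hypothetical placeholders.

```python
def predict_class(row):
    """Hand-written stand-in for an exported tree; NOT generated by SPM.

    All split values and labels below are made-up placeholders.
    """
    if row["SIGNAL_60"] <= 10.0:          # hypothetical split value
        if row["SIGNAL_359"] <= 0.5:      # hypothetical split value
            return "pass"
        return "fail"
    # Right branch of the root: one more hypothetical split.
    return "fail" if row["SIGNAL_246"] > 3.0 else "pass"

# Scoring one new observation needs only a few comparisons.
example = {"SIGNAL_60": 5.0, "SIGNAL_359": 0.1, "SIGNAL_246": 0.0}
print(predict_class(example))
```

Each terminal node corresponds to one path through the conditions, so scoring needs no modeling library at run time.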
Applying TreeNet
1. Understanding the model: Partial Dependency Plots
2. Choosing the number of trees (set the maximum number of trees such that the
error no longer meaningfully declines; SPM will choose the optimal number for
you)
[Figure: model shown at 10, 50, 100, 150, 200, 400, and 600 trees]
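The idea of growing trees until the error stops declining, then keeping the best count, can be sketched in a few lines. This is a minimal illustration and not TreeNet: it assumes squared-error loss, single-split trees (stumps), one synthetic predictor, a hold-out sample for measuring error, and hypothetical settings (learn_rate, max_trees) chosen only for the demo.

```python
import math
import random

def fit_stump(xs, residuals):
    """Best single-split regression tree for the current residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    return best[1:]

def stump_predict(stump, x):
    threshold, lmean, rmean = stump
    return lmean if x <= threshold else rmean

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def make_data(n):
    """Synthetic data: a noisy sine curve (an assumption for the demo)."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    return xs, [math.sin(x) + random.gauss(0, 0.3) for x in xs]

random.seed(0)
train_x, train_y = make_data(80)
valid_x, valid_y = make_data(80)

learn_rate, max_trees = 0.1, 200
base = sum(train_y) / len(train_y)        # start from the training mean
train_pred = [base] * len(train_x)
valid_pred = [base] * len(valid_x)
valid_errors = []                         # hold-out error after each tree

for _ in range(max_trees):
    # For squared error, the negative gradient is the residual.
    residuals = [y - p for y, p in zip(train_y, train_pred)]
    stump = fit_stump(train_x, residuals)
    train_pred = [p + learn_rate * stump_predict(stump, x)
                  for p, x in zip(train_pred, train_x)]
    valid_pred = [p + learn_rate * stump_predict(stump, x)
                  for p, x in zip(valid_pred, valid_x)]
    valid_errors.append(mse(valid_pred, valid_y))

best_trees = valid_errors.index(min(valid_errors)) + 1
print(f"Best number of trees: {best_trees} "
      f"(hold-out MSE {min(valid_errors):.3f})")
```

If the best count lands at or near max_trees, the error was probably still falling and the maximum should be raised before trusting the choice.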
Global Score =
Model experimentation and optimization routines are pre-packaged for you in SPM, so you
never have to write even a single line of code. We want you to spend time on solving problems,
not troubleshooting while loops and function calls!
We will discuss this more in the second webinar, but we will provide one example.
2. Translate CART into C (or Java, PMML, or SAS) and deploy your
CART model in your environment in order to make predictions
in real-time.
Mining the customer credit using classification and regression tree and multivariate
adaptive regression splines
We’ll provide you a link to the dataset used today in a follow-up email.
Download a trial version of SPM: https://info.salford-systems.com/spm-8-download
Schedule a demo and we’ll walk you through the example shown today.
Mining the customer credit using classification and regression tree and multivariate adaptive regression splines
Computational Statistics & Data Analysis: http://www.sciencedirect.com/science/article/pii/S016794730400355X
Factors Associated With Increased Reading Frequency in Children Exposed to Reach Out and Read
Academic Pediatrics: http://www.sciencedirect.com/science/article/pii/S1876285915002752
This paper used Random Forests® software to pick the factors.
Using Random Forests to Provide Predicted Species Distribution Maps as a Metric for Ecological Inventory & Monitoring Programs
Applications of Computational Intelligence in Biology: https://link.springer.com/chapter/10.1007/978-3-540-78534-7_9
Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues
Iberian Conference on Pattern Recognition and Image Analysis: https://link.springer.com/chapter/10.1007/978-3-540-72849-8_61