Introduction To Data Science

Real-world demonstration

For the beginner modeler


Revisit Today’s Webinar Materials
For anyone who may have been running late
or wanted to reference these materials, we are
happy to provide the presentation and a link
to the recording of the webinar.

Expect to hear from us after the presentation!

© Minitab Inc. 10/24/2017 2


Today’s Discussion (10/24)
• Quick Refresher – What can Machine Learning do for you
• Salford Systems – Pioneering Predictive Analytics and Machine Learning
• Manufacturing Defects Dataset: Applied Examples (CART, TreeNet, Random Forest)

Today’s Presenter
Charlie Harrison
Charlie is part of Salford’s Data Scientist Team, and has been providing customer support and training for several years. His favorite thing about Data Science is proving theoretical results.


What Can Machine Learning Do For You?
• Explore Data
• Discover the Most Important Features
• Find the Most Important Relationships in Factors & Response
• Predict Future Observations
• Solve Your Problem


How Broad and Deep is the Application Potential?
Machine learning methods can be applied in almost any context. The following is a brief selection of industry and functional examples:

INDUSTRIES
• Manufacturing: Manufacturing Defects, Preventative Maintenance
• Financial Services: Loan Defaults, Fraud Prevention
• Health Care: Disease Prevention, Genetics
• Other Industries: Insurance Claims, Environmental Impacts

FUNCTIONAL AREAS
• Sales: Customer Churn, Cross-Sell/Upsell
• Marketing: Customer Segmentation, Marketing Lift


CLASSIFICATION MODELS using CART, Gradient Boosting & Random Forests
• REGRESSION: Predict a quantitative value
• CLASSIFICATION: Predict a qualitative value
• UNSUPERVISED LEARNING: Clustering
• TIME SERIES: Predict future values based on past values
• SURVIVAL ANALYSIS: Predict time until occurrence


What Do You Need to Get Started?
• Sufficient Data
• Pick the Right Problem
• Solve with the Right Tool

Have you downloaded SPM 8.2? After this webinar, we’ll give you access to the dataset used so you can try it out for yourself.

https://info.salford-systems.com/spm-8-download


Salford Systems



Salford’s Legacy in Pioneering Predictive Analytics &
Machine Learning
Salford’s solutions are innovative, reliable and robust because they were created and
are implemented by inventors and pioneers of Predictive Analytics & Machine
Learning (PAML):
• Dr. Jerome Friedman (Professor of Statistics, Stanford)
• Dr. Leo Breiman (Professor of Statistics, UC Berkeley)

The algorithms covered today were created or co-created by Dr. Breiman or Dr. Friedman.


Salford Stands Out Against Competitors
Salford solutions are distinguished in particular by their:

Ease of Use
Salford’s models don’t require coding

Accuracy of Prediction
Salford’s models stand the test of time and are used by some of the biggest
corporations in the world

Defensibility of Models
Salford’s models are defensible internally to executive stakeholders and
externally to regulators



Suite of Solutions – Data Science Toolkit
Time- and market-tested predictive modeling tools including everything from market-leading decision tree and classification engines to advanced interaction detection and automation to state-of-the-art machine learning capabilities.

SPM Software Suite
• CART: Decision trees
• MARS: Nonlinear regression
• Random Forests: Data ensemble bagging
• TreeNet: Gradient boosting
• RuleLearner: Rule ensemble
• ISLE: Model compression
• GPS: Regularized regression


Why Do Classification Models Matter?
Classification methods are a simple, effective and accurate approach to solving an organization’s most difficult problems and uncovering new opportunities by narrowing down which factors have the most impact on your outcome.

Some of the most common applications include:
• Fraud Prevention
• Risk Reduction in Credit Scoring and Loan Default
• Optimizing Marketing Campaigns
• Improving Operations

Example questions by industry:
• MANUFACTURING: What machine signals are predictive of defects?
• FINANCIAL SERVICES: Does level of education impact credit risk?
• HEALTH CARE: Does body weight influence the risk of heart disease?

Example questions by functional area:
• SALES: What promotions are most effective?
• MARKETING: Does customer satisfaction influence loyalty?


Machine Learning Terminology

Response Variable = Dependent Variable = Target Variable
This is what we are trying to predict.
Examples: default vs. no default, air pressure, number of claims, etc.

Predictor Variables = Predictors = Factors
This is what we use to predict the response.
Example: I will use two predictors, level of education and work experience, to predict income, which is the target variable.

Algorithm = Method Used = Technique
This is the method that we will use to both predict the target variable and discover the relationships, if any, between the predictors and the target.
Examples: CART decision trees, gradient boosted trees, Random Forests, LASSO, Elastic Net, MARS, Support Vector Machines (SVMs), and Neural Networks.

Putting It All Together

Example 1
• Target Variable: Defect
• Predictor Variables: Signal 1, Signal 2, …, Signal 590
• Algorithm: Logistic Regression
• Model form (schematic): $$\text{Defect} = \beta_0 + \beta_1\,\text{Signal}_1 + \cdots + \beta_{590}\,\text{Signal}_{590}$$
  In logistic regression this linear expression models the log-odds of a defect rather than the defect label itself.

Example 2
• Target Variable: Defect
• Predictor Variables: Signal 1, Signal 2, …, Signal 590
• Algorithm: CART decision tree
• Model form: Defect = a CART decision tree on the signals [tree diagram not reproduced]


Hands-on Practice



Manufacturing Defects
Let’s Get Started . . .

Live Demo: Open SPM

MANUFACTURING DATA SET
A manufacturing process involves myriad machines, and the information concerning the operation of the machines is recorded. There are 590 metrics recorded from the machines from the start of the process to the end and we’ll refer to these metrics as “signals.”

1. What signals, if any, are predictive of manufacturing defects?
2. If signals are predictive of defects, then how are these signals related to the likelihood of manufacturing defects?


CART and Random Forests

A Random Forest prediction is really just an average of CART tree predictions. When you build a Random Forest model just keep this picture in the back of your mind:

[Table not reproduced: SPM engines compared on predictive performance, automatic variable selection, automatic missing value/outlier handling, automatic interaction detection, automatic modeling of local effects, invariance to monotone transformations of predictors, and interpretability.]

Manufacturing Defects
Solving Problems with Machine Learning: Machine Settings and Manufacturing Defects


A manufacturing process involves myriad machines, and the information concerning the operation of the
machines is recorded. There are 590 metrics recorded from the machines from the start of the process to
the end and we’ll refer to these metrics as “signals.”

We will try to answer two primary questions:


1. What signals, if any, are predictive of manufacturing defects?
2. If signals are predictive of defects, then how are these signals related to the likelihood of manufacturing
defects?

We will use an algorithm called gradient boosting to do this. TreeNet® software will be used. TreeNet is
unique in that its code was originally written by Jerome Friedman, the creator of gradient boosting.



Dataset Citations
Manufacturing Defect Dataset: Michael McCann and Adrian
Johnston donated the dataset to the UCI Machine Learning
Repository in 2008:

Link: https://archive.ics.uci.edu/ml/datasets/SECOM


CART
Let’s apply CART to the manufacturing defect dataset.

Applying CART
1. Build the model in SPM
2. Understand CART Relative Cost
3. Find the most interesting rules that are predictive of manufacturing defects using Hotspot Detection
4. Using the model: Generating manufacturing defect predictions and deploying CART outside of SPM

[Figure: CART tree for the defect data. The root splits on SIGNAL_294; lower splits use SIGNAL_293, SIGNAL_359, SIGNAL_60, SIGNAL_246, SIGNAL_158, SIGNAL_21, SIGNAL_247, SIGNAL_111, SIGNAL_66, SIGNAL_311, SIGNAL_112 and SIGNAL_549.]


CART Review
CART is a decision tree algorithm that divides
the data so that the dependent variable can be
predicted more accurately

CART automatically:
1. Selects variables
2. Models nonlinear relationships
3. Models local effects
4. Models interactions
5. Handles missing values


CART: Relative Cost

[Figure: example CART tree on 25 records]

Node 1 (N = 25): Class = Circle; Circle 16 (64.0%), Triangle 9 (36.0%); split on X2 <= -0.49
• X2 <= -0.49: Terminal Node 1 (N = 6): Class = Circle; Circle 6 (100.0%), Triangle 0 (0.0%)
• X2 > -0.49: Node 2 (N = 19): Class = Circle; Circle 10 (52.6%), Triangle 9 (47.4%); split on X1 <= 0.23
  • X1 <= 0.23: Terminal Node 2 (N = 7): Class = Triangle; Circle 1 (14.3%), Triangle 6 (85.7%)
  • X1 > 0.23: Terminal Node 3 (N = 12): Class = Circle; Circle 9 (75.0%), Triangle 3 (25.0%)

$$\text{Relative Cost} = \frac{\text{Overall Misclassification Cost Using a CART Tree}}{\text{Overall Misclassification Cost Using the No Data Optimal Rule}}$$

For this tree, Relative Cost = .44

The No Data Optimal Rule classifies every observation as one class. More specifically, the class chosen for the No Data Optimal Rule is the class that has the lowest cost compared to the other(s).

Good: If the relative cost is closer to zero (closer is better) then CART is better than the No Data Optimal Rule.

Bad: If the relative cost is equal to 1 then the CART error is the same as the No Data Optimal Rule, which means that CART is no better than just predicting every observation as the same class.

The relative cost can be greater than 1, which is especially bad and, more generally, values around 1 should be considered “bad.”
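The .44 on this slide can be reproduced from the node counts in the example tree. The sketch below assumes every misclassification costs the same (unit costs) and that the counts are the ones the cost is computed on; the variable and function names are made up for illustration.

```python
# Class counts in each terminal node of the example tree on the slide.
terminal_nodes = {
    "Terminal Node 1": {"Circle": 6, "Triangle": 0},
    "Terminal Node 2": {"Circle": 1, "Triangle": 6},
    "Terminal Node 3": {"Circle": 9, "Triangle": 3},
}

def misclassified(counts):
    """Records outside the majority class (assumes unit cost per error)."""
    return sum(counts.values()) - max(counts.values())

n_total = sum(sum(c.values()) for c in terminal_nodes.values())  # 25 records

# CART: each terminal node predicts its own majority class.
cart_errors = sum(misclassified(c) for c in terminal_nodes.values())  # 0 + 1 + 3

# No Data Optimal Rule: predict the root's majority class for every record.
root_counts = {"Circle": 16, "Triangle": 9}
baseline_errors = misclassified(root_counts)  # 9 Triangles are missed

relative_cost = (cart_errors / n_total) / (baseline_errors / n_total)
print(round(relative_cost, 2))  # 0.44
```

CART makes 4 errors where the No Data Optimal Rule makes 9, so the ratio is 4/9, which rounds to the .44 shown.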
CART Confusion Matrix
Use the Confusion Matrix to assess CART and the types of correct or incorrect predictions that it makes.

• CART correctly predicted “No Defect” 935 times
• CART correctly predicted “Defect” 57 times
• CART incorrectly predicted “Defect” when there was actually no defect 528 times (we call this a false positive)
• CART incorrectly predicted “No Defect” when there actually was a defect 47 times (we call this a false negative)
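The four counts can be turned into the usual summary rates. This sketch reads 935 as true negatives, 57 as true positives (correct “Defect” predictions), 528 as false positives and 47 as false negatives; the metric names are standard, but which of them SPM reports is not stated in the slides.

```python
tn = 935  # predicted "No Defect", actually no defect
tp = 57   # predicted "Defect", actually a defect
fp = 528  # predicted "Defect", actually no defect (false positive)
fn = 47   # predicted "No Defect", actually a defect (false negative)

total = tn + tp + fp + fn            # all scored records
actual_defects = tp + fn             # records that really were defects

accuracy = (tp + tn) / total         # share of all predictions that are right
sensitivity = tp / (tp + fn)         # share of real defects that were caught
specificity = tn / (tn + fp)         # share of good units left alone
precision = tp / (tp + fp)           # share of "Defect" alarms that are real

print(f"records={total}, defects={actual_defects}")
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f}")
print(f"specificity={specificity:.3f} precision={precision:.3f}")
```

Because defects are rare (104 of 1,567 records), accuracy alone is a weak summary; sensitivity and precision show the trade-off between missed defects and false alarms.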


CART: Variable Selection & Importance
There were 590 variables available to be selected by CART.

• 13 variables appear in the tree
• 79 variables are used in the model (i.e. 13 variables used in the tree and 66 used to handle missing values via surrogate splits)


CART: Hotspot Detection
Recall: a CART tree can be thought of as a collection of rules. Each rule defines a path to a terminal node.

For large CART trees, is there an easy way to find the “most interesting” rules? Yes, use Hotspot Detection.

[Figure: CART tree for the defect data, rooted at SIGNAL_294.]


CART: Hotspot Detection
Hotspot Detection computes summary information about each terminal node (every rule leads to a terminal node) and displays the information conveniently to the user.

Use this information to easily and efficiently find the most important rules in your CART tree.


CART: Using Hotspot Detection
Here terminal node 5 has the largest class count and a lift value of around 2.5. This means that the probability of a “Defect” in this node is about 2.5 times the probability of a “Defect” in the overall sample.

What rule leads to terminal node 5?


CART Hotspot Interpretation
If Signal 294 <= 368.82 and Signal 293 > .006 and Signal 60 > 1.51 and Signal 246 <= 1.42 and Signal 247 > 2.98 then we predict “Defect”.

If the machine signals satisfy this rule then the probability of a defect is about 2.5 times the overall probability of a defect.
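Written as code, the rule for terminal node 5 is a single chain of conditions. This is a hand-written sketch, not SPM output: the function name, the dictionary layout and the example values are hypothetical, and missing signal values (which CART handles with surrogate splits) are not covered.

```python
def in_hotspot_node_5(signals):
    """True when a record satisfies the rule that leads to terminal node 5.

    `signals` is a hypothetical dict mapping signal number -> measured value.
    """
    return (
        signals[294] <= 368.82
        and signals[293] > 0.006
        and signals[60] > 1.51
        and signals[246] <= 1.42
        and signals[247] > 2.98
    )

# Made-up readings, chosen only to exercise the rule.
record = {294: 350.0, 293: 0.01, 60: 2.0, 246: 1.0, 247: 3.5}
print(in_hotspot_node_5(record))  # True
```

A record that passes this check falls in the node where defects are about 2.5 times as likely as in the overall sample.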


CART: Hotspot Detection
Focus Class: the class (i.e. “Defect” or “No Defect”) that you want to generate the hotspot report for. I set the focus class to be “Defect.”

$$\text{Lift} = \frac{\text{Probability of Focus Class in the terminal node}}{\text{Probability of Focus Class Overall}}$$

Node Class Count = number of records in the sample that fall into the node

• If Lift = 1, then the probability of a “Defect” in the terminal node is the same as it is in the overall sample.
• If Lift = 2, then the probability of a “Defect” is twice as high in the terminal node as it is in the overall sample.
• If Lift = .5, then the probability of a “Defect” is half as high in the terminal node as it is in the overall sample.
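A small sketch of the lift calculation. The overall figures (104 defects in 1,567 records) follow from the confusion-matrix counts earlier in the deck; the terminal-node counts are hypothetical, picked only so the result lands near the 2.5 quoted for terminal node 5.

```python
def lift(focus_in_node, node_size, focus_overall, sample_size):
    """Probability of the focus class in a node divided by its overall probability."""
    p_node = focus_in_node / node_size
    p_overall = focus_overall / sample_size
    return p_node / p_overall

# Hypothetical node: 20 defects among 120 records.
# Overall: 104 defects among 1567 records.
node_lift = lift(focus_in_node=20, node_size=120,
                 focus_overall=104, sample_size=1567)
print(round(node_lift, 2))  # 2.51
```

Lift says how concentrated the focus class is; the node class count still matters, because a high lift on a handful of records is weak evidence.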




Deploying CART
If you want to use CART to generate predictions, you have two primary options:

1. Generate Predictions inside of SPM
2. Translate CART into a programming language and deploy it in your environment


Generating Predictions Inside of SPM
Let’s suppose that you have a
set of machine signal values
(i.e. you know the values for
Signal 1 – Signal 590) and you
want to predict if there will be
a product defect (i.e. you don’t
know the “STATUS” value)



Deploying CART via Code Translations
A CART model is fundamentally a collection of rules where each rule is an if-then statement (also else-if statements etc.). We can then take these if-then statements and translate them into different programming languages. In SPM we can translate into 4 languages: C, PMML, Java, and SAS.

Use the code to generate CART predictions in other applications/programs or to make predictions in real-time.
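To show what “a tree is a set of if-then statements” looks like, here is the small Circle/Triangle tree from the Relative Cost slide written out by hand. Python is used purely for illustration; SPM's translators target C, PMML, Java and SAS, and the code they generate will look different. The function name is hypothetical.

```python
def predict_shape(x1, x2):
    """Hand translation of the three-terminal-node example tree."""
    if x2 <= -0.49:
        return "Circle"    # Terminal Node 1: 6 of 6 records are Circles
    elif x1 <= 0.23:
        return "Triangle"  # Terminal Node 2: 6 of 7 records are Triangles
    else:
        return "Circle"    # Terminal Node 3: 9 of 12 records are Circles

print(predict_shape(x1=0.0, x2=-1.0))  # Circle
print(predict_shape(x1=0.0, x2=0.5))   # Triangle
print(predict_shape(x1=1.0, x2=0.5))   # Circle
```

Each path from the root to a terminal node becomes one branch, so the deployed model needs no modeling software at prediction time.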




CART: Finding Rules
CART automatically gave us a set of interpretable rules that are
predictive of manufacturing defects. Now we will need to determine
what the signals actually measure and determine if we can control
the inputs that drive the settings.



CART: Generating Predictions
1. Use CART to predict if there will or will not be a product defect
inside of SPM.

2. Translate CART into C (or Java, PMML, or SAS) and deploy your
CART model in your environment in order to make predictions
in real-time.



TreeNet Gradient Boosting
Let’s apply the gradient boosting algorithm using TreeNet® software

Applying TreeNet
1. Understanding the model: Partial Dependency Plots

2. Choosing the number of trees (set the maximum number of trees such that the
error no longer meaningfully declines; SPM will choose the optimal number for
you)

3. Choosing the number of nodes with Automate NODES

4. Discover important interactions with interaction reporting

5. Making predictions and deploying the model.



Gradient Boosting Review
Idea: fit a CART tree to the error of the previous model and use this new prediction to update the model


Gradient Boosting: Why it works
How does TreeNet model this curve? It makes small improvements (i.e. the learning rate is a small number that “shrinks” the model updates). The small improvements, taken together, produce an accurate model.

[Figure: the fitted curve after 1, 10, 50, 100, 150, 200, 400 and 600 trees. Note: Noise ~ N(0,1)]
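The “many small improvements” idea can be sketched in a few lines. This is not TreeNet's code: it is a toy version that assumes a single predictor, squared-error loss, and two-node trees (stumps), with a sine curve plus N(0,1) noise standing in for the curve on the slide. All function names are made up.

```python
import math
import random

def fit_stump(xs, residuals):
    """Best single-split regression tree (two terminal nodes) on one predictor."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, (xs[order[k - 1]] + xs[order[k]]) / 2, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Each tree is fit to the current errors; a shrunken copy is added to the model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees, errors = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # error of the model so far
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    model = lambda x: base + learning_rate * sum(t(x) for t in trees)
    return model, errors

random.seed(0)
xs = [i / 10 for i in range(60)]
ys = [math.sin(x) + random.gauss(0, 1) for x in xs]      # curve plus N(0,1) noise
model, errors = boost(xs, ys)
print(f"training MSE after 1 tree: {errors[0]:.3f}, after 100 trees: {errors[-1]:.3f}")
```

The error printed here is training error, which can only go down as trees are added; choosing the number of trees should be based on test or cross-validated error, which is the selection the slides say SPM performs for you.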




Manufacturing Defects
Most Important Signals
TreeNet, like CART, automatically selects the most important variables (i.e. the signals).

Steps
1. Import the dataset
2. Select “TreeNet Gradient Boosting Machine”
3. Set variables
4. Click “Start”
5. View variable importance measures

Of the 590 signals, TreeNet automatically identifies 299 of them as useful (you can actually run a series of variable “shaving” experiments to see if you can reduce the number of variables used even more).




Manufacturing Defects
How are Most Important Signals Related to the Likelihood of Product Defects?

The plots on the right are generated automatically from a TreeNet model, so you only have to click two buttons to see the plots.

The plots are ordered in terms of the variable importance (most important first).


Manufacturing Defects
Most Important Signal: Signal 60
This plot tells us that, after accounting for the other 299 variables in the model, the likelihood of a product defect increases once Signal 60 has values beyond 3.25. Once Signal 60 reaches about 13.3, the likelihood of a defect remains constant.

[Figure: partial dependency plot for Signal 60, annotated at Signal_60=3.25 and Signal_60=13.3]

TreeNet automatically discovered this relationship. Now we have a few questions to answer:
• What does Signal 60 actually measure?
• What machine settings have an effect on Signal 60? To what extent, if any, can we control these settings?
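“After accounting for the other variables” is what a partial dependency plot does: it varies one predictor while averaging the model's prediction over the observed values of everything else. The sketch below follows the usual averaging definition; whether SPM computes its plots exactly this way is an assumption. The toy model, its coefficients and the three data rows are invented so that the curve rises after 3.25 and flattens at 13.3, as the slide describes.

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction over the data with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the data itself is untouched
            modified[feature] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a fitted model's predicted defect probability."""
    ramp = min(max(row["signal_60"] - 3.25, 0.0), 13.3 - 3.25)
    return 0.05 + 0.01 * ramp + 0.001 * row["signal_334"]

rows = [
    {"signal_60": 1.0, "signal_334": 10.0},
    {"signal_60": 8.0, "signal_334": 30.0},
    {"signal_60": 20.0, "signal_334": 50.0},
]
grid = [0.0, 3.25, 8.0, 13.3, 20.0]
curve = partial_dependence(toy_model, rows, "signal_60", grid)
for value, avg in zip(grid, curve):
    print(f"signal_60={value:>5}: average predicted probability {avg:.4f}")
```

The curve is flat up to 3.25, rises, and is flat again from 13.3 onward, which is the shape to look for when reading the Signal 60 plot.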


Manufacturing Defects
Most Important Two-Way Interaction: Signal 60 and Signal 334

The most important two-way interaction in the model is between Signal 60 and Signal 334.

The red and orange areas in the plot on the right mean that the likelihood of a defect is higher. When Signal 60 is between about 15 and 150 and Signal 334 is between 30 and 100, then the likelihood of a defect is higher.

[Figure: two-way interaction plot for Signal 60 and Signal 334; annotation: “Defect is more likely”]

Follow-up questions for identifying the machine settings that affect the signals:
• What do the two signals measure?
• What machine settings, if any, have an effect on Signal 60 and Signal 334?


Interaction Statistics: Global Score
Use the Global Score to find the most important two-way interactions in the model. The Global Score for a pair of variables tells you the percentage of the total variation in the predicted response that is accounted for by the two-way interaction between two variables. A value of 5.66 means that 5.66% of the variation in the predicted response is accounted for by the interaction between Signal 60 and Signal 334.

$$\text{Global Score} = 100 \times \frac{\text{Variation in the Predicted Response Due to the Two-Way Interaction}}{\text{Total Variation in the Predicted Response}}$$


Using the Interaction Statistics: Next Webinar
One way to leverage the interaction statistics is to allow only interactions between the pairs of variables deemed to be “important” by the TreeNet interaction statistics and disallow interactions among the unimportant variables. If we do this and the model error does not change meaningfully then we can be more confident that the interaction is real (i.e. not noise!). We will talk more about this in Webinar 5.




Manufacturing Defects
Solving the Problem: Predicting Future Observations & Running Simulations

Engineers can predict the likelihood of a defect based on the signal values:
1. Take data (i.e. hypothetical signal values or estimated signal values given the machine settings) and substitute the values into the TreeNet model
2. TreeNet will generate the probability of a defect based on the signal values supplied.

[Diagram: Proposed Machine Settings → Hypothetical (or estimated) Signal Values → Predicted probability of “Defect” and the predicted class: “Defect” or “No Defect.”]

If we can predict signal values based on the machine settings, then we could predict the probability of a defect based on chosen machine settings.


Generating Predictions in SPM
We can generate predictions
inside of SPM just like CART
(the same is true for Random
Forests, MARS, etc.)

Click the “Score” button



Deploying TreeNet via Code Translations
A TreeNet model is fundamentally a collection of rules where each rule is an if-then statement (also else-if statements etc.). We can then take these if-then statements and translate them into different programming languages. In SPM we can translate into 4 languages: C, PMML, Java, and SAS.

Use the code to generate TreeNet predictions in other applications/programs or to make predictions in real-time.




Manufacturing Defects
Solving the Problem: Understanding the relationship of signals and the likelihood of defects

Use TreeNet gradient boosting to
1. View signals that are useful in predicting defects (or, conversely, non-defects; signals that are not important are either rarely used in the model or not used at all)
2. Visually understand the relationship between the likelihood of a defect and a signal
3. Visually understand the nature of the interactions that are important in the model.


Optimizing Models with SPM Automates
One way to choose the optimal value for a model parameter in TreeNet is to run an experiment: build multiple TreeNet models with identical settings, except that the value of one parameter changes each time.

Model experimentation and optimization routines are pre-packaged for you in SPM, so you never have to write even a single line of code. We want you to spend time on solving problems, not troubleshooting while loops and function calls!

We will discuss this more in the second webinar, but we will provide one example.
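For readers curious what such an experiment amounts to, the loop below is a minimal sketch of a one-parameter sweep. It is not how SPM implements its Automates; `sweep` and `fake_test_error` are hypothetical names, and the error function is an invented stand-in for “build a model with this many terminal nodes and return its test error”.

```python
def sweep(build_and_score, values):
    """Build one model per candidate value; return the best value and all scores."""
    scores = {value: build_and_score(value) for value in values}
    best = min(scores, key=scores.get)   # lowest error wins
    return best, scores

def fake_test_error(nodes):
    """Made-up error curve with its minimum at 6 terminal nodes."""
    return (nodes - 6) ** 2 / 100 + 0.25

best_nodes, scores = sweep(fake_test_error, [2, 4, 6, 9])
for nodes, error in scores.items():
    print(f"terminal nodes={nodes}: test error={error:.2f}")
print("best:", best_nodes)
```

Everything except the swept parameter stays fixed, so differences in error can be attributed to that parameter alone.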


Automate NODES
The number of terminal nodes in each tree in the TreeNet model controls the extent to which the model can capture interactions.

Use Automate NODES to easily find the optimal number of terminal nodes in each tree. Here the optimal number of terminal nodes is 6 (this is actually the default value).




Random Forests: Review
Idea: fit CART trees to
independent bootstrap samples
and combine the predictions
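
The bootstrap-and-combine idea can be sketched with very small trees. This toy version assumes one predictor and two-node trees, so it shows only the bagging half of the method; Random Forests also draw a random subset of predictors at each split, which cannot be shown with a single predictor. The function names and the 0/1 data are invented.

```python
import random

def fit_stump(xs, ys):
    """Two-node regression tree on one predictor (mean of y on each side)."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[k - 1][0] + pairs[k][0]) / 2, lm, rm)
    _, threshold, lm, rm = best
    if threshold is None:
        return lambda x: overall
    return lambda x: lm if x <= threshold else rm

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit each tree to its own bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n records with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]  # 1.0 marks a "Defect" in this toy data
forest = bagged_stumps(xs, ys)
print(forest(2), forest(9.5), forest(17))
```

Far from the class boundary the trees agree; near it they split in slightly different places, so the average behaves like an estimated probability.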



Random Forest Output
For smaller datasets (i.e. <10,000 records) we can compute a variety
of useful metrics including outlier statistics.



Optimizing Random Forests: Automate RFNPREDS
Use Automate RFNPREDS to conveniently find the optimal value for the random variable subset size.

Here the optimal size is
$$2 \times \sqrt{\text{number of predictors}} = 2 \times \sqrt{590} \approx 49$$


Other Machine Learning Applications



Manufacturing: Value Creation through Machine Learning Application
INDUSTRIES: MANUFACTURING

Organizations Gain Efficiencies Through Smarter Lean Adoption
Identifying challenges and the benefits of LEAN implementation in small to medium sized companies using CART.

Implementation of lean manufacturing in Saudi manufacturing organizations: an empirical study
Proceedings of the 2011 International Conference on Materials and Products Manufacturing Technology: https://eprints.qut.edu.au/46594/1/2011011893_Karim_ePrints.pdf


Financial Services: Value Creation through Machine Learning Application
INDUSTRIES: FINANCIAL SERVICES

Improving Credit Scoring in Highly-Competitive Environment
Accurate credit scoring using CART and TreeNet is critical for financial services and is increasingly competitive. Less risk is assumed as future instances of loan default are predicted.

Mining the customer credit using classification and regression tree and multivariate adaptive regression splines
Computational Statistics & Data Analysis: http://www.sciencedirect.com/science/article/pii/S016794730400355X


Healthcare: Value Creation through Machine Learning Application
INDUSTRIES: HEALTHCARE

Predicting Lung Cancer for High Risk Patients
Medical researchers were looking to improve lung cancer detection through blood testing. CART analysis was leveraged to predict which patients had cancer given the serum biomarkers.

Panel of Serum Biomarkers for the Diagnosis of Lung Cancer
Journal of Clinical Oncology: http://ascopubs.org/doi/full/10.1200/JCO.2007.13.5392


Continue To Use Machine Learning On Your Own
Practice, Practice, Practice
• We’ll provide you a link to the dataset used today in a follow up email
• Download a trial version of SPM: https://info.salford-systems.com/spm-8-download
• Schedule a demo and we’ll walk you through the example shown today

Feeling Stuck? We Can Help!
• Check out our other training materials online: https://www.salford-systems.com/resources/training-videos
• If you need help getting started, give us a shout: [email protected]


Ready For More? Join Our Next Webinar
Tuesday October 31, 2017 @ 10 am (PDT):
Real-world demonstration for the advanced modeler
Register: http://info.salford-systems.com/datascience101webinarseries

In this webinar I am going to explain how to leverage powerful Machine Learning algorithms in detail using SPM software.


Appendix



CART® Software Applications

Predicting Return to Work with Data Mining
Society of Actuaries: https://www.soa.org/files/research/projects/data-mining.pdf

Implementation of lean manufacturing in Saudi manufacturing organizations: an empirical study
Proceedings of the 2011 International Conference on Materials and Products Manufacturing Technology: https://eprints.qut.edu.au/46594/1/2011011893_Karim_ePrints.pdf

Assessing the prediction of employee productivity: a comparison of OLS vs. CART
International Journal of Productivity and Quality Management: http://www.inderscienceonline.com/doi/abs/10.1504/IJPQM.2011.042511

Mining the customer credit using classification and regression tree and multivariate adaptive regression splines
Computational Statistics & Data Analysis: http://www.sciencedirect.com/science/article/pii/S016794730400355X

Panel of Serum Biomarkers for the Diagnosis of Lung Cancer
Journal of Clinical Oncology: http://ascopubs.org/doi/full/10.1200/JCO.2007.13.5392

Automated urban land-use classification with remote sensing
International Journal of Remote Sensing: http://www.tandfonline.com/doi/abs/10.1080/01431161.2012.714510


Random Forest® Software Applications

Mapping Oil and Gas Development Potential in the US Intermountain West and Estimating Impacts to Species
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0007400

Random Forests applied as a soil spatial predictive model in arid Utah
Digital Soil Mapping: http://link.springer.com/content/pdf/10.1007/978-90-481-8863-5.pdf#page=188

Factors Associated With Increased Reading Frequency in Children Exposed to Reach Out and Read
Academic Pediatrics: http://www.sciencedirect.com/science/article/pii/S1876285915002752
This paper used Random Forests® software to pick the factors

Using Random Forests to Provide Predicted Species Distribution Maps as a Metric for Ecological Inventory & Monitoring Programs
Applications of Computational Intelligence in Biology: https://link.springer.com/chapter/10.1007/978-3-540-78534-7_9

Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues
Iberian Conference on Pattern Recognition and Image Analysis: https://link.springer.com/chapter/10.1007/978-3-540-72849-8_61
