Analytic Solver
Data Science Reference Guide
Copyright
Software copyright 1991-2025 by Frontline Systems, Inc.
User Guide copyright 2025 by Frontline Systems, Inc.
GRG/LSGRG Solver: Portions copyright 1989 by Optimal Methods, Inc. SOCP Barrier Solver: Portions
copyright 2002 by Masakazu Muramatsu. LP/QP Solver: Portions copyright 2000-2010 by International
Business Machines Corp. and others. Neither the Software nor this User Guide may be copied, photocopied,
reproduced, translated, or reduced to any electronic medium or machine-readable form without the express
written consent of Frontline Systems, Inc., except as permitted by the Software License agreement below.
Trademarks
Frontline Solvers®, XLMiner®, Analytic Solver®, Risk Solver®, Premium Solver®, Solver SDK®, and RASON®
are trademarks of Frontline Systems, Inc. Windows and Excel are trademarks of Microsoft Corp. Gurobi is a
trademark of Gurobi Optimization, Inc. Knitro is a trademark of Artelys. MOSEK is a trademark of MOSEK
ApS. OptQuest is a trademark of OptTek Systems, Inc. XpressMP is a trademark of FICO, Inc. Firefox is a
trademark of the Mozilla Foundation in the U.S. and other countries.
Patent Pending
Systems and Methods for Automated Risk Analysis of Machine Learning Models.
Acknowledgements
Thanks to Dan Fylstra and the Frontline Systems development team for a 25-year cumulative effort to build the
best possible optimization and simulation software for Microsoft Excel. Thanks to Frontline’s customers who
have built many thousands of successful applications, and have given us many suggestions for improvements.
Risk Solver Pro and Risk Solver Platform have benefited from reviews, critiques, and suggestions from several
risk analysis experts:
• Sam Savage (Stanford Univ. and AnalyCorp Inc.) for Probability Management concepts including SIPs,
SLURPs, DISTs, and Certified Distributions.
• Sam Sugiyama (EC Risk USA & Europe LLC) for evaluation of advanced distributions, correlations, and
alternate parameters for continuous distributions.
• Savvakis C. Savvides for global bounds, censor bounds, base case values, the Normal Skewed distribution
and new risk measures.
How to Order
Contact Frontline Systems, Inc., P.O. Box 4288, Incline Village, NV 89450.
Tel (775) 831-0300 Fax (775) 831-0314 Email [email protected] Web http://www.solver.com
Table of Contents

Generate Data
    Introduction
    Generate Data Example
        Results
    Generate Data Options
        Variables
        Selected Variables
        Metalog Terms
        Metalog Selection Test
        Fit Correlation
        Generate Sample
        Random Seed
        Random Generator
        Sampling Method
        Random Streams
        Reports and Charts
Exploring Data
    Introduction
    Analyze Data
        Analyze Data Example
        Bin Details View
        Chart Settings View
        Analyze Data Report
        Analyze Data Options
    Feature Selection
        Feature Selection Example
        Feature Selection Options
            Variables listbox
            Continuous Variables listbox
            Categorical Variables listbox
            Output Variable
            Output Variable Type
            Discretize predictors
            Discretize output variable
            Pearson correlation
            Spearman rank correlation
            Kendall concordance
Introduction
Analytic Solver Data Science V2025 Q1 comes in two versions: Analytic
Solver Desktop – a traditional “COM add-in” that works only in Microsoft
Excel for Windows PCs (desktops and laptops), and Analytic Solver Cloud – a
modern “JavaScript add-in” that works in Excel for Windows and Excel for
Macintosh (desktops and laptops), and also in Excel for the Web (formerly
Excel Online) using Web browsers such as Chrome, Firefox and Safari. Your
license gives you access to both versions, and your Excel workbooks and
optimization, simulation and data science models work in both versions, no
matter where you save them (though OneDrive is most convenient).
This Reference Guide gives step-by-step instructions on how to utilize the data
science and predictive methods and algorithms included in both the Desktop and
Cloud applications. Nearly all features in Analytic Solver Desktop are also
included in the Cloud app; this guide documents the few key differences
between the two products.
Ribbon Overview
Analytic Solver Data Science, previously known as XLMiner™, is a
comprehensive data science software package for use in the Cloud or as an
add-in to Excel. Data science is a
discovery-driven data analysis technology used for identifying patterns and
relationships in data sets. With overwhelming amounts of data now available
from transaction systems and external data sources, organizations are presented
with increasing opportunities to understand their data and gain insights into it.
Data science is still an emerging field, and is a convergence of fields like
statistics, machine learning, and artificial intelligence.
Often, there is more than one approach to a problem. Analytic Solver Data
Science is a tool belt to help you get started quickly, offering a variety of
methods to analyze your data. It has extensive coverage of statistical and
machine learning techniques for classification, prediction, affinity analysis and
data exploration and reduction.
• Click the Model button to display the Solver Task Pane. This feature
allows you to quickly navigate through datasets and worksheets containing
Analytic Solver Data Science results.
• Click the Get Data button to draw a random sample of data, or summarize
data from (i) an Excel worksheet, (ii) the PowerPivot “spreadsheet data
model” which can hold 10 to 100 million rows of data in Excel, (iii) an
external SQL database such as Oracle, DB2 or SQL Server, or (iv) a dataset
with up to billions of rows, stored across many hard disks in an external Big
Data compute cluster running Apache Spark (https://spark.apache.org/).
• You can use the Data Analysis group of buttons to explore your data, both
visually and through methods like cluster analysis, transform your data with
methods like Principal Components, Missing Value imputation, Binning
continuous data, and Transforming categorical data, or use the Text Mining
feature to extract information from text documents.
• Use the Time Series group of buttons for time series forecasting, using both
Exponential Smoothing (including Holt-Winters) and ARIMA (Auto-
Regressive Integrated Moving Average) models, the two most popular time
series forecasting methods from classical statistics. These methods forecast
a single data series forward in time.
• The Data Science group of buttons gives you access to a broad range of
methods for prediction, classification and affinity analysis, from both
classical statistics and data science. These methods use multiple input
variables to predict an outcome variable or classify the outcome into one of
several categories. Ensemble Methods are available for use with all data
science and regression learners. The Find Best Model feature allows you to
input your data once and run all classification or regression learners at
one time.
• Use the Predict button to build prediction models using Multiple Linear
Regression (with variable subset selection and diagnostics), k-Nearest
Neighbors, Regression Trees, and Neural Networks. Use Ensemble
Methods with Regression Trees and Neural Networks to create more
accurate prediction models. Use Find Best Model to run all 4 regression
methods and 3 ensemble methods at once, and select the best fit model.
• Use the Classify button to build classification models with Discriminant
Analysis, Logistic Regression, k-Nearest Neighbors, Classification Trees,
Naïve Bayes, and Neural Networks. Use Ensemble Methods with
Classification Trees and Neural Networks to create more accurate
classification models. Use Find Best Model to run all 6 classification
methods and 3 ensemble methods at once, and select the best fit model.
• Use the Associate button to perform affinity analysis (“what goes with
what” or market basket analysis) using Association Rules.
Differences
• Minor Ribbon Differences:
o The Text Mining icon is located in the Text section of the Ribbon
o Standard Partitioning is located in the Partition section of the
Ribbon
o The Tools section of the Ribbon includes Score
• A new icon, License, has been added. Click this icon to manage your
Analytic Solver licenses. See the section "License in the Cloud Apps"
within this guide for more information.
• The Options button has been removed. All menu items previously
appearing on this menu now appear on the License or Help menus.
• Workflows created in Data Science Cloud are not supported in Analytic
Solver Desktop.
These options, fields and command buttons appear on most Analytic Solver
Data Science dialogs.
Worksheet
The active worksheet appears in this field.
Workbook
The active workbook appears in this field.
# Rows # Cols
The number of rows and columns in the dataset appear in these two fields,
respectively.
Selected Variables
Variables listed in this field will be included in the output. Select the
desired variables listed in the Variables In Input Data listbox, then click the
> button to shift variables to the Selected Variables field.
Help
Click this command button to open the Analytic Solver Data Science Help
text file.
Next
Click this command button to progress to the next tab of the dialog.
OK/Finish
Click this command button to initiate the desired method and produce the
output report.
Cancel
Click this command button to close the open dialog without saving any
options or creating an output report.
References
See below for a list of references cited when compiling this guide.
Websites
1. The Data & Analysis Center for Software. <https://www.thecsiac.com>
2. NEC Research Institute Research Index: The NECI Scientific Literature
Digital Library.
<http://www.iicm.tugraz.at/thesis/cguetl_diss/literatur/Kapitel02/URL/NEC
/cs.html>.
3. Thearling, Kurt. Data Mining and Analytic Technologies.
<http://www.thearling.com>
Books
1. Anderberg, Michael R. Cluster Analysis for Applications. Academic Press
(1973).
2. Berry, Michael J. A., Gordon S. Linoff. Mastering Data Mining. Wiley
(2000).
3. Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone.
Classification and Regression Trees. Chapman & Hall/CRC (1998).
4. Han, Jiawei, Micheline Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers (2000).
5. Hand, David, Heikki Mannila, Padhraic Smyth. Principles of Data Mining.
MIT Press, Cambridge (2001).
6. Hastie, Trevor, Robert Tibshirani, Jerome Friedman. The Elements of
Statistical Learning: Data Mining, Inference, and Prediction. Springer, New
York (2001).
7. Shmueli, Galit, Nitin R. Patel, Peter C. Bruce. Data Mining for Business
Intelligence. Wiley, New Jersey (2010).
Introduction
Large amounts of data are being generated and collected continuously from a multitude of sources every minute of
every day. From your toothbrush to your vehicle GPS to Twitter/Facebook/Google/Yahoo, data is everywhere.
Being able to make decisions based on this information requires the ability to extract trends and patterns that can be
buried deeply within the numbers.
Generally these large datasets contain millions of records (rows) requiring multiple gigabytes or terabytes of storage
space across multiple hard drives in an external compute cluster. Analytic Solver Platform V2016-R3 enables users,
for the first time, to ‘pull’ sampled and summarized data into Excel from compute clusters running Apache Spark,
the open-source software widely embraced by Big Data vendors and users.
See the Analytic Solver Data Science User Guide for a complete step by step example illustrating how to sample and
summarize big data using Analytic Solver. See the Big Data Options section below for a comprehensive
explanation of each option that appears on the Big Data dialog tabs.
Credentials
If your dataset is located on Amazon S3, click Credentials to enter your
Access and Secret Keys.

Schema
When All Variables is selected for this option, all columns (features) in the
dataset are selected for the analysis without the need for the user to select
the variables. When Select Variables is selected, the Infer Schema command
button is enabled. Once Infer Schema is clicked, the schema (variables) will
be inferred from the dataset on the cluster and listed in the Variables grid.
Users may use the > and < buttons to select variables for inclusion in the
sample.

Variables
Variables available for inclusion in the sample appear here. Use the >
button to select variables to be included in the sample.

Selected Variables
Variables transferred here will be included in the sample. Use the < button
to remove variables from the sample.
Submit
Clicking Submit sends a request for sampling to the compute cluster but does
not wait for completion. The result is output containing the Job ID and basic
information about the submitted job, so that different submissions may be
identified. This information can be used at any later time to query the
status of the job and to generate reports based on the results of the
completed job.

Run
Sends a request for sampling to the Apache Spark compute cluster where the
Frontline Systems access server is installed and waits for the results. Once
the job is completed and results are returned to the Analytic Solver Data
Science client, a report is inserted into the Model tab of the Analytic
Solver Task Pane under Data Science – Results – Sampling.

Cancel
Click this command button to close the open dialog without saving any options
or creating an output report.
Data Format
Analytic Solver Data Science can process data from Hadoop Distributed File
System (HDFS), local file systems that are visible to the Spark cluster, and
Amazon S3. Performance is best with HDFS, and it is recommended that you load data
from a local file system or Amazon S3 into HDFS. If the local file system is
used, the data must be accessible at the same path on all Spark workers, either via
a network path, or because it was copied to the same location on all workers.
At present, Analytic Solver Data Science can process data in Apache Parquet and
CSV (delimited text) formats. Performance is far better with Parquet, which
stores data in a compressed, columnar representation; it is highly recommended
that you convert CSV data to Parquet before you seek to sample or summarize the
data.
Track Record IDs
If this option is selected, data records in the resulting sample will carry
the correct ordinal IDs that correspond to the original data records, so that
records can be matched. Note: Selecting this option may significantly
increase running time, so it should be applied only when necessary.
Sample with Replacement
When selected, records in the dataset may be chosen for inclusion in the
sample multiple times.
Random Seed
If an integer value appears for Random Seed, Analytic Solver Data Science
will use this value to set the sampling random number seed. Setting the
random number seed to a nonzero value ensures that the same sequence of
random numbers is used each time the dataset is sampled. The default value
is “12345”. If left blank, the random number generator is initialized from
the system clock, so the random sample will contain different records from
run to run. If you need the results from successive samples to be strictly
comparable, set the seed by typing the desired number into the box. This
option accepts positive integers with up to 7 digits.
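The effect of a fixed seed can be sketched in a few lines of Python; this is an illustrative analogy using the standard library, not Analytic Solver's internal generator:

```python
import random

def sample_records(records, k, seed=None):
    """Draw a simple random sample of k records without replacement.
    A fixed integer seed makes successive runs reproducible; seed=None
    initializes the generator from system entropy instead, so each
    run produces a different sample."""
    rng = random.Random(seed)
    return rng.sample(records, k)

data = list(range(100))
first = sample_records(data, 10, seed=12345)
second = sample_records(data, 10, seed=12345)
assert first == second  # same seed -> identical sample on every run
```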
Exact Sampling
When this option is selected, Analytic Solver Data Science will return a
fixed-size sampled subset of data according to the setting for Desired
Sample Size.

Desired Sample Size
Enter the number of records to be included in the sample.
Approximate Sampling
When this option is selected, the size of the resulting sample is determined
by the value entered for Desired Sample Fraction. Approximate sampling is
much faster than Exact Sampling, and the resulting fraction is usually very
close to the Desired Sample Fraction, so this option should be preferred over
Exact Sampling whenever possible. Even if the resulting sample deviates
slightly from the desired size, this is easy to correct in Excel.

Desired Sample Fraction
This is the expected size of the sample as a fraction of the dataset's size.
If Sample with Replacement is selected, the value for Desired Sample Fraction
must be greater than 0. If sampling without replacement (i.e., Sample with
Replacement is not selected), the Desired Sample Fraction becomes the
probability that each element is chosen and, as a result, must be between 0
and 1.
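The "probability that each element is chosen" behavior can be illustrated in Python: each record is kept independently with the desired fraction as its probability, so the realized sample size is close to, but rarely exactly, the target. This is a conceptual sketch, not the cluster-side implementation:

```python
import random

def approximate_sample(records, fraction, seed=12345):
    """Approximate sampling without replacement: keep each record
    independently with probability `fraction`.  The sample size is
    close to fraction * len(records), but not fixed in advance."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

data = list(range(100_000))
chosen = approximate_sample(data, 0.10)
realized = len(chosen) / len(data)  # close to, but rarely exactly, 0.10
```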
See the Sample Big Data, Data tab above for all option explanations except Group Variables.
Summarize Big Data Options: Data tab

Group Variables
Group Variables are variables from the dataset that are treated as key
variables for aggregation. In the screenshot above, two variables have been
selected as Group Variables: Year and UniqueCarrier. The records will be
grouped so that all records with the same Year and UniqueCarrier are included
in the same group, and then all aggregate functions for each group will be
calculated.
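The group-then-aggregate behavior can be sketched in plain Python. The rows below are hypothetical, mimicking the Year/UniqueCarrier airline columns mentioned above; the aggregate shown (mean of a delay column) is just one example of the functions the dialog can compute:

```python
from collections import defaultdict

# Hypothetical rows: (Year, UniqueCarrier, ArrDelay)
rows = [
    (2007, "AA", 12.0), (2007, "AA", 3.0),
    (2007, "WN", -2.0), (2008, "AA", 8.0),
]

# Group on the key variables Year and UniqueCarrier...
groups = defaultdict(list)
for year, carrier, delay in rows:
    groups[(year, carrier)].append(delay)

# ...then compute an aggregate (here, the mean) for each group.
means = {key: sum(v) / len(v) for key, v in groups.items()}
print(means[(2007, "AA")])  # -> 7.5
```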
Introduction
Sampling
A statistician often comes across huge volumes of information from which he or
she wants to draw inferences. Since time and cost limitations make it impossible
to go through every entry in these enormous datasets, statisticians must resort to
sampling techniques. These sampling techniques choose a reduced sample or
subset from the complete dataset. The statistician can then perform his or her
statistical procedures on this reduced dataset, saving much time and money.
Let’s review a few statistical terms. The entire dataset is called the population.
A sample is the portion of the population that is actually examined. A good
sample should be a true representation of the population to avoid forming
misleading conclusions. Various methods and techniques have been developed
to ensure a representative sample is chosen from the population. A few are
discussed here.
• Simple Random Sampling This is probably the simplest method for
obtaining a good sample. A simple random sample of, say, size n, is
chosen from the population in such a way that every possible set of n
items from the population has an equal chance of being included in the
sample. Thus simple random sampling not only avoids
bias in the choice of individual items but also gives every possible
sample an equal chance.
The Data Sampling utility in Analytic Solver Data Science offers the
user the freedom to choose sample size, seed for randomization, and
sampling with or without replacement.
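Conceptually, a simple random sample without replacement can be drawn in one line of Python; the population size of 289 echoes the example dataset used later in this chapter:

```python
import random

population = list(range(1, 290))  # 289 records, as in the example dataset

# random.sample draws without replacement: every possible 30-record
# subset of the population is equally likely to be returned.
rng = random.Random(12345)
chosen = rng.sample(population, 30)

assert len(set(chosen)) == 30  # no record appears twice
```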
• Stratified Random Sampling In this technique, the population is first
divided into groups of similar items. These groups are called strata.
Each stratum, in turn, is sampled using simple random sampling. These
samples are then combined to form a stratified random sample.
The Data Sampling utility in Analytic Solver Data Science offers the
user the freedom to choose a sorting seed for randomization and
sampling with or without replacement. The desired sample size can be
prefixed by the user depending on which method is being chosen for
stratified random sampling.
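The two-stage procedure described above (partition into strata, then simple random sampling within each stratum) can be sketched in Python. The `stratum_of` key function and the equal-per-stratum rule are illustrative choices:

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_per_stratum, seed=12345):
    """Stratified random sampling: partition records into strata,
    then draw a simple random sample from each stratum and combine."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)
    out = []
    for members in strata.values():
        out.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return out

# Two strata of very different sizes: 50 "A" records and 8 "B" records.
records = [("A", i) for i in range(50)] + [("B", i) for i in range(8)]
result = stratified_sample(records, stratum_of=lambda r: r[0], n_per_stratum=8)
print(len(result))  # 8 from each of the 2 strata -> 16
```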
Analytic Solver Data Science Desktop allows sampling from a worksheet,
database, or file folder. Analytic Solver Data Science Cloud does not support
sampling from a database or file folder.
The Sampling result will be inserted in the Model tab of the Analytic Solver task
pane under Transformations – Sample From Worksheet. A portion of the output
is shown below.
The output indicates "True" for Sample with Replacement. As a result, the
desired sample size may exceed the number of records in the input data (a
sample size of 300 drawn from 289 records). Looking closely at the ID column,
you'll see that multiple records have been sampled more than once.
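This is easy to reproduce conceptually in Python: sampling with replacement lets the sample size exceed the number of input records, at the cost of duplicate draws:

```python
import random

rng = random.Random(12345)
population = list(range(289))            # 289 input records
drawn = rng.choices(population, k=300)   # with replacement: k may exceed 289

assert len(drawn) == 300
assert len(set(drawn)) < 300             # some IDs appear more than once
```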
Click Get Data – Worksheet and select all variables under Variables, click > to
include them in the sample data. Select Stratified random sampling.
Click the down arrow next to Stratum Variable and select v8. The strata number
is automatically displayed once you select v8. Keep the default setting selected,
Proportional to stratum size. Then click OK.
If a sample with an equal number of records for each stratum but of bigger size
is desired, use the same options above with Sampling with Replacement
selected.
Click Get Data -- Worksheet once again. Select all variables under Variables,
click > to include them in the sample data. Select Sample with replacement
and Stratified random sampling. Select V8 for the Stratum variable. Select
Equal from each stratum, please specify #records and enter 20. Though the
smallest stratum size is 8 in this dataset, we can acquire more records for
each stratum because Sample with Replacement is selected. Since the output
sample has 20 records per stratum, the number of records in the sampled
data is 160 (20 records per stratum for 8 strata).
Click Get Data -- Worksheet one last time. Select all variables under
Variables, click > to include them in the sample data. Select Stratified random
sampling. Select V8 for the Stratum variable and select Equal from each
stratum, # records = smallest stratum size. The edit box to the right of
this option displays the smallest stratum size, 8. Keeping all other options
the same, click OK. The output, found under
Transformations -- Sample From Worksheet in the Analytic Solver task pane, is
below.
Since the output sample has 8 records per stratum, the Sample Size is 64 (8
records per stratum for 8 strata).
Variables
This list box contains the names of the variables in the selected data range. If the
first row of the range contains the variable names, then these names appear in
this list box. If the first row of the dataset does not contain the headers, then
Analytic Solver Data Science lists the variable names using its default naming
convention. In this case the first column is named Var1; the second column is
named Var2 and so on. To select a variable for sampling, select the variable,
then click the ">" button. Use the CTRL key to select multiple variables.
Stratum Variable
Select the variable to be used for stratified random sampling by clicking the
down arrow and selecting the desired variable. As the user selects the variable
name, Analytic Solver Data Science displays the #Strata that variable contains
in a box to the left and the smallest stratum size in a field beside the option
Equal from each stratum, #records = smallest stratum size. (Note: Analytic
Solver Comprehensive and Data Science support an unlimited number of
variables each having an unlimited number of distinct values. Versions of
Analytic Solver with basic limits support variables with 2 to 30 distinct values.)
Click the down arrow next to Data Source and select MS-Access, then click
Connect to database.
Files
The files contained within the file folder as selected for Directory will appear
here. Click the > command button to move individual files or the >> button to
move the entire collection to the Selected Files listbox.
Selected Files
The text files listed here have been selected for import or sampling.
Set Seed
This option initializes the random number generator. Setting the random
number seed to a nonzero value (any number of your choice is OK) ensures that
the same sequence of random numbers is used each time the sample of
documents is selected. The default value is “12345”. When the seed is zero, the
random number generator is initialized from the system clock, so the sequence
of documents selected will be different each time a sample is taken. If you need
the results from successive runs to be strictly comparable, you should set the
seed. To do this, select the checkbox next to the Set Seed edit box, or type the
number you want into the box. This option is selected by default when Sample
from selected files is enabled. This option accepts both positive and negative
integers with up to 9 digits.
Output
If Write file paths is selected, pointers to the file locations are stored on the
FileSampling output sheet. If Write file contents is selected, the content of each
text document will be written to a cell on the FileSampling output, up to a
maximum of 32,767 characters.
Introduction
The Synthetic Data Generation feature included in Analytic Solver Data
Science allows users to generate synthetic data through automated Metalog
probability distribution selection and parameter fitting, Rank Correlation
or Copula fitting, and random sampling. This can be beneficial in several
situations, such as when the actual training data is limited, or when the
data owner is unwilling to release the actual, full dataset but agrees to
supply a limited copy or a synthetic version that statistically resembles
the properties of the actual dataset.
This process consists of three main steps.
1. Fit and select a marginal probability distribution to each feature – by automated
and semi-automated search within the family of bounded, semi-bounded or
unbounded Metalog distributions.
2. Identify correlations among features, by using Rank Correlation or one of
the available Copulas – Clayton, Gumbel, Frank, Student, or Gauss.
3. Generate the random sample consistent with the best-fit probability distributions
and correlations.
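The three steps above can be sketched in Python. This is an illustrative, simplified stand-in, not Analytic Solver's implementation: empirical quantile functions replace the fitted Metalog distributions, and a Gauss copula driven by rank correlations replaces the Rank Correlation/Copula fitting options. Only numpy and the standard library are used:

```python
import numpy as np
from math import erf

def synthesize(data, n_samples, seed=12345):
    """Generate synthetic rows resembling `data` (n_rows x n_features):
    rank correlation + Gauss copula for dependence, empirical quantile
    functions for the marginals."""
    n, k = data.shape
    rng = np.random.default_rng(seed)
    # Step 2 stand-in: rank (Spearman-style) correlation between features.
    ranks = data.argsort(axis=0).argsort(axis=0)
    corr = np.corrcoef(ranks, rowvar=False)
    # Step 3a: correlated uniforms via a Gauss copula on that correlation.
    z = rng.standard_normal((n_samples, k)) @ np.linalg.cholesky(corr).T
    u = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))  # normal CDF
    # Steps 1 + 3b stand-in: push uniforms through each empirical marginal.
    return np.column_stack([np.quantile(data[:, j], u[:, j]) for j in range(k)])

raw = np.random.default_rng(0).normal(size=(300, 2))
raw[:, 1] += 0.8 * raw[:, 0]          # make the two features correlated
synthetic = synthesize(raw, 1000)     # 1000 synthetic rows, 2 features
```

The synthetic rows stay within the observed range of each feature (a property of the empirical marginals; a fitted Metalog can extrapolate beyond the data when bounds are relaxed) and preserve the positive correlation between the two features.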
In addition to the generated synthetic data, Analytic Solver Data Science can
optionally provide the details of the fitting process – fitted coefficients and
goodness-of-fit metrics for all fitted candidate Metalog distributions, selected
distribution for each feature and fitted correlation matrix.
To further explore the original/synthetic data and compare them, one may easily
compute basic and advanced statistics for original and/or synthetic data,
including but not limited to percentiles and Six Sigma metrics.
Notes: Supported only in Analytic Solver Data Science and Analytic Solver
Comprehensive. This button will be disabled in all other product licenses.
4. Select all continuous variables under Variables and then use the > to move
them to Selected Variables.
Recall that categorical variables are not supported. Anaemia, diabetes,
high_blood_pressure, sex, smoking and death_event are all categorical
variables that will not be included in the example.
Variables section of Generate Synthetic Data tab
6. For this example, select Auto and leave Metalog Selection Test at the
default setting of Anderson-Darling.
Metalog Terms:
• If Fixed is selected, Analytic Solver will attempt to fit and use the
Metalog distribution with the specified number of terms entered into
the # Terms column. (Only 1 distribution will be fit.) If Fixed is
selected, Metalog Selection Test is disabled.
• If Auto is selected, Analytic Solver will attempt to fit all possible
Metalog distributions, up to the entered value for Max Terms, and
select and utilize the best Metalog distribution according to the
goodness-of-fit test selected in the Metalog Selection Test menu.
Click the down arrow on the right of Fitting Options to enter either the
maximum number of terms (if Auto is selected) or the exact number of
terms (if Fixed is selected) for each variable as well as a lower and/or upper
bound. By default the lower and upper bounds are set to the variable’s
minimum and maximum values, respectively. If no lower or upper bound is
entered, Analytic Solver will fit a semi- (with one bound present) or
unbounded (with no bounds present) Metalog function.
Use the Terms ↑ and Terms ↓ buttons to increment or decrement # Terms for
all variables at once. Click the Min/Max as bounds button to remove or add
lower and upper bounds.
Metalog Selection Test: Click the down arrow to select the desired
Goodness-of-Fit test used by Analytic Solver. The Goodness of Fit test is
used to select the best Metalog form for each data variable among the
candidate distributions containing a different number of terms, from 2 to the
value entered for Max Terms. The default Goodness-of-Fit test is
Anderson-Darling.
Metalog Selection Test menu
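For readers curious what fitting a Metalog with a fixed number of terms involves, the sketch below fits coefficients by ordinary least squares, following the published Metalog term pattern (1, L, (y-1/2)L, (y-1/2), ... with L = ln(y/(1-y))). This is a minimal numpy illustration, not Analytic Solver's internal fitting code; the mid-point plotting positions and the unbounded Metalog form are assumptions:

```python
import numpy as np

def metalog_basis(y, k):
    """Basis terms of the k-term unbounded Metalog quantile function."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    L = np.log(y / (1.0 - y))
    c = y - 0.5
    cols = [np.ones_like(y), L, c * L, c]
    power = 2
    while len(cols) < k:                # higher-order terms: c^p, c^p * L
        cols.append(c ** power)
        if len(cols) < k:
            cols.append(c ** power * L)
        power += 1
    return np.column_stack(cols[:k])

def fit_metalog(x, k):
    """Least-squares fit of k Metalog coefficients to the sorted data,
    using mid-point plotting positions (i - 1/2)/n as probabilities."""
    x = np.sort(np.asarray(x, dtype=float))
    y = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    coeffs, *_ = np.linalg.lstsq(metalog_basis(y, k), x, rcond=None)
    return coeffs

def metalog_quantile(coeffs, y):
    """Evaluate the fitted quantile function Q(y)."""
    return metalog_basis(y, len(coeffs)) @ coeffs

data = np.random.default_rng(7).normal(10.0, 2.0, size=500)
a4 = fit_metalog(data, k=4)
median_fit = float(metalog_quantile(a4, 0.5)[0])  # close to the sample median
```

In the Auto setting described above, one such fit would be performed for each candidate term count up to Max Terms, and a goodness-of-fit test such as Anderson-Darling would pick the winner.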
Results
With the selected options shown in the screenshot above, two worksheets will
be inserted into the workbook: SyntheticData_Output and SyntheticData_Sample.
At the top of both worksheets is the Output Navigator. Click any link to easily
navigate to that section of the worksheet.
SyntheticData_Output: Output Navigator
SyntheticData_Output
Scroll down to the Inputs section of the SyntheticData_Output worksheet to
view all user inputs including the data source, the selected variables, the
distribution fitting parameters, correlation fitting parameters, sampling
parameters and display parameters.
• Metalog Coefficients: This table shows the fitted coefficients for all
feasible Metalog distributions that Analytic Solver Data Science
attempted to fit for each variable. The best distribution (as decided by
the chosen Goodness-of-Fit test) that will be used in sample generation
(if requested) is highlighted in red.
Note that it is not guaranteed that all possible Metalog distributions will
be fit. As shown in the screenshot below, not all variables have exactly
5 Metalog distributions for 1, 2, 3, 4 and 5 terms.
SyntheticData_Output: Metalog Coefficients report
SyntheticData_Sample
Click the SyntheticData_Sample worksheet to view the synthetic data for each
selected variable. (Recall that this data is available because Generate Data was
selected on the Parameters tab on the Generate Data dialog.) This worksheet
also includes the Output Navigator as described above for the
SyntheticData_Output worksheet.
Recall that to produce this synthetic data, Analytic Solver:
1. Fit each selected variable's original data to a Metalog distribution,
using either a fixed number of terms or the automatic search option.
2. Fit a correlation to all variables, using either Rank Correlation or one
of the five available copulas.
3. Generated the trial values (i.e. synthetic data) using the fitted
distributions and correlations.
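The three steps can be sketched in Python. This is an illustrative simplification, not the product's algorithm: empirical quantiles stand in for the fitted Metalog marginals, and a Gaussian copula driven by the Spearman rank correlation stands in for the product's correlation-fitting options.

```python
import numpy as np
from scipy import stats

def generate_synthetic(data, n_trials, seed=0):
    # data: (n_obs, n_vars) array of original values
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    d = data.shape[1]
    # Step 2: Spearman rank-correlation matrix of the original data
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    rho = np.corrcoef(ranks, rowvar=False)
    # Map Spearman rho to the latent normal (Pearson) correlation
    r = 2.0 * np.sin(np.pi * rho / 6.0)
    np.fill_diagonal(r, 1.0)
    # Step 3: correlated uniforms via a Gaussian copula...
    z = rng.multivariate_normal(np.zeros(d), r, size=n_trials)
    u = stats.norm.cdf(z)
    # ...pushed through each marginal's inverse CDF (Step 1 stand-in:
    # the empirical quantile function instead of a fitted Metalog)
    out = np.empty((n_trials, d))
    for j in range(d):
        out[:, j] = np.quantile(data[:, j], u[:, j])
    return out
```

The synthetic sample preserves each variable's marginal shape and the pairwise rank correlations of the original data.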
The screenshot below displays the first 10 trial values generated for each
selected variable: age, creatinine_phosphokinase, ejection_fraction, platelets,
serum_creatinine, serum_sodium and time.
If the Frequency Chart Option was selected on the Parameters tab of the
Generate Data dialog, a dialog containing frequency charts for both the original
data and the synthetic data, for each of the selected variables is displayed
immediately when the SyntheticData_Sample worksheet is opened. This chart is
discussed in depth in the Analyze Data section of this guide.
From here, each chart may be selected to open a larger, more detailed,
interactive chart. If this chart is closed by clicking the X in the upper
right-hand corner, simply click a different tab in the workbook and then
click back to the SyntheticData_Sample tab to reopen it.
Since Metalog Curves was selected on the Parameters tab of the Generate Data
dialog, the curve for each fitted Metalog function is displayed on each chart.
Double-click any of the charts (for this example, the original data
ejection_fraction chart) to open a detailed, interactive frequency chart for
the original variable data.
To overlay the generated synthetic data on top of the Original data, click
Original in the upper right hand corner and select both checkboxes in the Data
dialog.
Notice in the screenshot below that both the Original and Synthetic data appear
in the chart together, and statistics for both data appear on the right.
To remove either the Original or the Synthetic data from the chart, click
Original/Synthetic in the top right and then uncheck the data type to be removed.
This chart behaves the same as the interactive chart in the Analyze Data feature
found on the Explore menu.
• Use the mouse to hover over any of the bars in the graph to populate
the Bin and Frequency headings at the top of the chart.
• When displaying either Original or Synthetic data (not both), red
vertical lines will appear at the 5th and 95th percentile values in all
three charts (Frequency, Cumulative Frequency and Reverse Cumulative
Frequency), effectively displaying the 90% confidence interval. The
middle percentage is the percentage of all the variable values that lie
within the 'included' area, i.e. the darker shaded area. The two
percentages on each end are the percentages of all variable values that
lie outside of the 'included' area, or the "tails", i.e. the lighter
shaded area. Percentile values can be altered by moving either red
vertical line to the left or right.
• Click the down arrow next to Statistics to view Percentiles for each
type of data along with Six Sigma indices. Use the Chart Options view
to manually select the number of bins to use in the chart, as well as to
set personalization options.
• Click the down arrow next to Statistics to view Bin Details for each bin in
the chart.
Bin: If viewing a chart with a single variable, only one grid will be
displayed on the Bin Details pane. This grid displays important bin statistics
such as frequency, relative frequency, sum and absolute sum.
Bin Details View with continuous (scale) variables
Variables
All variables in the data source data range are listed in this field. If the first row
in the dataset contains headings, select First Row Contains Headers.
Selected Variables
Select a variable(s) in the Variables field, then click > to move the variable(s) to
the Selected Variables field. Synthetic data will be generated for the variables
appearing in this field.
Metalog Terms
• If Fixed is selected, Analytic Solver will attempt to fit and use the
Metalog distribution with the specified number of terms entered into
the # Terms column. (Only 1 distribution will be fit.) If Fixed is
selected, Metalog Selection Test is disabled.
• If Auto is selected, Analytic Solver will attempt to fit all possible
Metalog distributions, up to the entered value for Max Terms, and
select and utilize the best Metalog distribution according to the
goodness-of-fit test selected in the Metalog Selection Test menu.
Click the down arrow on the right of Fitting Options to enter either the
maximum number of terms (if Auto is selected) or the exact number of terms (if
Fixed is selected) for each variable as well as a lower and/or upper bound. By
default, the lower and upper bounds are set to the variable's minimum and
maximum values, respectively.
Use the Terms ↑ and Terms ↓ buttons to increment or decrement # Terms for
all variables at once. Click the Min/Max as bounds button to remove or add
lower and upper bounds.
Fit Correlation
Select Fit Correlation to fit a correlation between the variables. If this option is
left unchecked, correlation fitting will not be performed.
• If Rank is selected, Analytic Solver will use the Spearman rank-order
correlation coefficient to compute a correlation matrix that includes all
selected variables.
• Selecting Copula opens the Copula Options dialog where you can select and
drag five types of copulas into a desired order of priority.
Correlation Fitting section of the Generate Data dialog
Click Select All to select all 5 copula types, or click Deselect All to
uncheck all 5 copulas.
Generate Sample
Select Generate Sample to generate synthetic data for each selected variable.
Use the Sample Size field to increase the size of the sample generated.
If this option is left unchecked, variable data will still be fitted to a
Metalog distribution (and correlations, if Fit Correlation is selected), but
no synthetic data will be generated.
Click Advanced to open the Sampling Options dialog.
Sampling Options dialog
From this dialog, users can set the Random Seed, Random Generator, Sampling
Method and Random Streams.
Random Seed
Setting the random number seed to a nonzero value (any number of your
choice is OK) ensures that the same sequence of random numbers is used
for each simulation. When the seed is zero or the field is left empty, the
random number generator is initialized from the system clock, so the
sequence of random numbers will be different each time the data is generated.
Random Generator
Use this menu to select a random number generation algorithm. Analytic
Solver Data Science includes an advanced set of random number generation
capabilities.
Computer-generated numbers are never truly “random,” since they are
always computed by an algorithm – they are called pseudorandom numbers.
A random number generator is designed to quickly generate sequences of
numbers that are as close to statistically independent as possible.
Eventually, an algorithm will generate the same number seen sometime
earlier in the sequence, and at this point the sequence will begin to repeat.
The period of the random number generator is the number of values it can
generate before repeating.
A long period is desirable, but there is a tradeoff between the length of the
period and the degree of statistical independence achieved within the
period. Hence, Analytic Solver Data Science offers a choice of four random
number generators:
o Park-Miller "Minimal" Generator with Bayes-Durham shuffle and
safeguards. This generator has a period of 2^31-2. Its properties are
good, but the following choices are usually better.
o Combined Multiple Recursive Generator of L'Ecuyer (L'Ecuyer-
CMRG). This generator has a period of 2^191, and excellent statistical
independence of samples within the period.
o Well Equidistributed Long-period Linear (WELL) generator of
Panneton, L'Ecuyer and Matsumoto. This generator combines a long
period of 2^1024 with very good statistical independence.
o Mersenne Twister (default setting) generator of Matsumoto and
Nishimura. This generator has the longest period of 2^19937-1, but the
samples are not as "equidistributed" as for the WELL and L'Ecuyer-
CMRG generators.
o HDR Random Number Generator, designed by Doug Hubbard. Permits
data generation running on various computer platforms to generate
identical or independent streams of random numbers.
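The seed behavior described above can be illustrated in NumPy (not the product itself), using its Mersenne Twister implementation: the same nonzero seed reproduces the identical stream, while a different seed produces a different stream.

```python
import numpy as np

# Same seed twice -> identical streams; different seed -> different stream
a = np.random.Generator(np.random.MT19937(12345)).random(5)
b = np.random.Generator(np.random.MT19937(12345)).random(5)
c = np.random.Generator(np.random.MT19937(54321)).random(5)
```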
Sampling Method
Use this option group to select Monte Carlo, Latin Hypercube, or Sobol
RQMC sampling.
o Monte Carlo: In standard Monte Carlo sampling, numbers generated by
the chosen random number generator are used directly to obtain sample
values. With this method, the variance or estimation error in computed
samples is inversely proportional to the square root of the number of
trials (controlled by the Sample Size); hence to cut the error in half,
four times as many trials are required.
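The 1/sqrt(n) error scaling can be checked numerically. The sketch below (illustrative, using NumPy) measures the spread of the sample-mean estimate at two sample sizes; quadrupling the trials roughly halves the error.

```python
import numpy as np

rng = np.random.default_rng(7)

def mc_std_err(n, reps=2000):
    # Spread of the Monte Carlo mean estimate across many repetitions
    means = rng.standard_normal((reps, n)).mean(axis=1)
    return means.std()

ratio = mc_std_err(250) / mc_std_err(1000)   # expect a ratio near 2.0
```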
Analytic Solver Data Science provides two other sampling methods that
can significantly improve the 'coverage' of the sample space, and thus
reduce the estimation error for a given sample size: Latin Hypercube and
Sobol RQMC sampling.
Random Streams
Use this option group to select either a Single Stream shared by all
variables or an Independent Stream (the default) for each variable.
If Single Stream is selected, a single sequence of random numbers is
generated. Values are taken consecutively from this sequence to obtain
samples for each selected variable. This introduces a subtle dependence
between the samples for all distributions in one trial. In many applications,
the effect is too small to make a difference – but in some cases, better
results are obtained if independent random number sequences (streams) are
used for each distribution in the model. Analytic Solver Data Science offers
this capability for Monte Carlo sampling and Latin Hypercube sampling; it
does not apply to Sobol numbers.
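The idea of independent streams can be sketched in NumPy (illustrative, not the product's mechanism): derive one independent child generator per variable from a single root seed, so each column's samples come from its own sequence.

```python
import numpy as np

# One independent child stream per variable, derived from one root seed
root = np.random.SeedSequence(42)
streams = [np.random.default_rng(s) for s in root.spawn(3)]
columns = [g.random(100) for g in streams]   # one column per variable
```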
Introduction
The Explore menu gives you access to Analytic Solver's new Analyze Data
tool, Dimensionality Reduction via Feature Selection, and the ability to explore
your data using charts such as Bar Charts, Line Charts, Scatterplots, Boxplots,
Histograms, Parallel Coordinates, Scatter Plot Matrices and Variable Plots.
Analyze Data
In the latest release of Analytic Solver, users now have access to the Analyze
Data feature located on the Explore menu. With the Analyze Data application,
users can generate a multivariate chart for any number of scale (continuous) or
categorical variables. This feature can be used as a standalone application and
can be particularly useful as a step in understanding your data while in the
process of building your data science model. This feature allows you to look at
your data not as static historical data but as a realization of an uncertain variable
such as what you would encounter in simulation modeling.
Double-clicking the preview chart will display a detailed view of a histogram for
scale (continuous) variables and bar charts for categorical variables, allowing
users to view their data from a frequency perspective, such as a historical sample
from a possible probability distribution. Various statistics pertaining to each
variable’s range of values are displayed on the right, such as count, mean,
standard deviation, etc.
Detailed view for the Alcohol scale (continuous) variable
3. The top of the dialog displays the Data Source information: the
worksheet name (Data), the workbook name (Wine.xlsx), the data range
(A1:N179) and the number of rows and columns in the dataset.
Data Source section on the Analyze Data dialog
5. Click Write report to write all computed statistics for each variable to the
Statistics worksheet. (The Statistics worksheet will be inserted to the right
of the Data tab.) If this checkbox is left unchecked, no report will be
inserted into the workbook. The preview dialog will be displayed, and
detailed charts will be available if double-clicked. Once the dialog is
closed, it will not persist in the workbook. (The application would have to
be re-run to re-open the chart.)
Options section on the Analyze Data dialog
7. Click Finish.
Results
After Finish is clicked in the Analyze Data dialog, a new Statistics worksheet
is inserted to the right of the Data tab and an Analyze Data Results dialog appears
displaying a bar chart (for categorical variables) or histogram (for continuous or
scale variables) for each variable included in the analysis.
Double-click any chart to display a more detailed view of the chart and various
computed statistics, including Six Sigma statistics and percentiles.
Display Placement: Click the title bar of the multivariate dialog to drag to a
new location.
Malic_Acid chart view
Percentile Bars
Tabs: The Analyze Data dialog contains three tabs: Frequency, Cumulative
Frequency, and Reverse Cumulative Frequency. Each tab displays different
information about the distribution of variable values.
Hovering over a bar in any of the three charts will populate the Bin and
Frequency headings at the top of the chart.
Cumulative Frequency Chart: Bins containing the range of values for the
variable appear on the horizontal axis; the relative cumulative frequency of
occurrence of the bin values appears on the left vertical axis, while the actual
cumulative frequency of the bin values appears on the right vertical axis.
Reverse Cumulative Frequency Chart: Bins containing the range of values for
the variable appear on the horizontal axis, similar to the Cumulative Frequency
chart, but the frequencies are accumulated in the reverse direction.
Statistics View
The Statistics tab displays numeric values for several summary statistics,
computed from all values for the specified variable. The statistics shown on the
pane below were computed for the Malic Acid variable.
Statistics Pane
All statistics appearing on the Statistics pane are briefly described below.
Statistics
• Mean, the average of all the values.
• Standard Deviation, the square root of the variance.
• Variance, which describes the spread of the distribution of values.
• Skewness, which describes the asymmetry of the distribution of values.
The values displayed here represent 99 equally spaced points on the Cumulative
Frequency chart: in the Percentile column, the numbers rise smoothly from 0 to
1.0, and in the Value column, the corresponding values from the horizontal axis
are shown. For example, the 75th Percentile value is a number such that
three-quarters of the values in the dataset are less than or equal to this value.
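The 99-point percentile table can be reproduced in NumPy (an illustrative check, using hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(size=1000)
pct = np.percentile(values, np.arange(1, 100))   # 1st..99th percentiles
share_below_p75 = (values <= pct[74]).mean()     # about three-quarters
```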
Six Sigma View
Selecting Six Sigma from the menu displays various computed Six Sigma
measures. In this display, the red vertical lines on the chart are the Lower
Specification Limit (LSL) and the Upper Specification Limit (USL) which are
initially set equal to the 5th and 95th percentile values, respectively.
These functions compute values related to the Six Sigma indices used in
manufacturing and process control. For more information on these functions,
see the Appendix located at the end of this guide.
• SigmaCP calculates the Process Capability.
• SigmaCPK calculates the Process Capability Index.
• SigmaCPKLower calculates the one-sided Process Capability Index based
on the Lower Specification Limit.
• SigmaCPKUpper calculates the one-sided Process Capability Index based
on the Upper Specification Limit.
• SigmaCPM calculates the Taguchi Capability Index.
• SigmaDefectPPM calculates the Defect Parts per Million statistic.
• SigmaDefectShiftPPM calculates the Defective Parts per Million statistic
with a Shift.
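The first two indices can be sketched using the standard textbook formulas (shown for illustration only; the product's exact formulas are documented in the Appendix):

```python
import numpy as np

def sigma_cp(values, lsl, usl):
    # Process Capability: (USL - LSL) / (6 * sigma)
    return (usl - lsl) / (6.0 * np.std(values, ddof=1))

def sigma_cpk(values, lsl, usl):
    # Process Capability Index: distance from the mean to the nearest
    # specification limit, divided by 3 * sigma
    m = np.mean(values)
    s = np.std(values, ddof=1)
    return min(usl - m, m - lsl) / (3.0 * s)
```

For a centered process, SigmaCPK approaches SigmaCP; an off-center mean lowers SigmaCPK.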
The controls are divided into three groups: Binning, Method and Style.
• Binning: Applies to the number of bins in the chart.
• Output Navigator: Click any of the links to jump to that section of the
report.
• Inputs: This section contains information pertaining to the data source and
the variables included in the data analysis.
• Parameters: If you find that the Preview dialog is taking a long time to
open, you can edit the Sample Data Fraction % here. Simply enter a
smaller percentage to speed up the opening of the dialog.
Scroll down to the Six Sigma section of the report to see all 19 Six Sigma
statistics and indices.
Analyze Data Report: Six Sigma
Finally, scroll down to Percentiles to view all 99 percentile values from 0.01 to
0.99.
Analyze Data Report: Percentiles
Data Source
This portion of the dialog includes data relevant to the data source such as the
Worksheet name, the Workbook name, the Data range in the worksheet and the
number of rows and columns in the data range.
Feature Selection
Dimensionality Reduction is the process of deriving a lower-dimensional
representation of original data, that still captures the most significant
relationships, to be used to represent the original data in a model. This domain
can be divided into two branches, feature selection and feature extraction.
Feature selection attempts to discover a subset of the original variables while
Feature Extraction attempts to map a high-dimensional model to a lower-
dimensional space. In past versions, Analytic Solver Data Science contained
only one feature extraction tool that could be used outside of a classification
or regression method: Principal Components Analysis (Transform – Principal
Components on the Data Science ribbon). For more information on Principal
Components Analysis, please see the chapter of the same name.
In V2015, a new tool for Dimensionality Reduction was introduced: Feature
Selection. Feature Selection attempts to identify the best subset of variables (or
features) out of the available variables (or features) to be used as input to a
classification or regression method. The main goal of Feature Selection is
threefold – to “clean” the data, to eliminate redundancies, and to quickly identify
the most relevant and useful information hidden within the data thereby
reducing the scale or dimensionality of the data. Feature Selection results in an
enhanced ability to explore the data, visualize the data and in some cases to
make some previously infeasible analytic models feasible.
One important issue in Feature Selection is how to define the "best" subset. If
using a supervised learning technique (classification/regression model), the
"best" subset would result in a model with the lowest misclassification rate or
residual error. This presents a different question – which classification or
regression method should be used to evaluate each candidate subset? Instead,
the Feature Selection tool ranks features using model-independent filter
metrics:
• Correlation-based
o Pearson product-moment correlation
o Spearman rank correlation
o Kendall concordance
• Statistical/probabilistic independence metrics
o Chi-square statistic
o Cramer’s V
o F-statistic
o Fisher score
o Welch’s statistic
• Information-theoretic metrics
o Gain Ratio
o Mutual Information (Information Gain)
The applicability of the Feature Selection metrics can be described by the
following table:
"N" means that the metric can be applied naturally, and "D" means that features
and/or the outcome variable must be discretized before applying the particular
filter.
As a result, depending on the variables (features) selected and the type of
problem chosen in the first dialog, various metrics will be available or disabled
in the second dialog.
VARIABLE                  DESCRIPTION
AGE                       Age of patient
ANAEMIA                   Decrease of red blood cells or hemoglobin (boolean)
CREATININE_PHOSPHOKINASE  Level of the CPK enzyme in the blood (mcg/L)
DIABETES                  If the patient has diabetes (boolean)
EJECTION_FRACTION         Percentage of blood leaving the heart at each contraction (percentage)
HIGH_BLOOD_PRESSURE       If the patient has hypertension (boolean)
PLATELETS                 Platelets in the blood (kiloplatelets/mL)
SERUM_CREATININE          Level of serum creatinine in the blood (mg/dL)
SERUM_SODIUM              Level of serum sodium in the blood (mEq/L)
SEX                       Woman (0) or man (1)
SMOKING                   If the patient smokes or not (boolean)
TIME                      Follow-up period (days)
DEATH_EVENT               If the patient deceased during the follow-up period (boolean)
1 Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020). (link)
Click the Measures tab or click Next to open the Measures dialog.
Since we have continuous variables, Discretize predictors is enabled. When this
option is selected, Analytic Solver Data Science will transform continuous
variables into discrete, categorical data in order to be able to calculate statistics,
as shown in the table in the Introduction to this chapter.
This dataset contains both continuous (or real-valued) features and categorical
features which puts this dataset into the following category.
Click Finish.
Two worksheets are inserted to the right of the heart_failure_clinical_records
worksheet: FS_Output and FS_Top_Features.
Click the FS_Top_Features tab.
In the Data Science Cloud app, click the Charts icon on the Ribbon to open the Charts dialog, then
select FS_Top_Features for Worksheet and Feature Importance Chart for Chart.
The Feature Importance Plot ranks the variables by most important or relevant
according to the selected measure. In this example, we see that the
ejection_fraction, serum_creatinine, age, serum_sodium and
creatinine_phosphokinase are the top five most important or relevant variables
according to the Chi-Squared statistic. It’s beneficial to examine the Feature
Selection Importance Plot in order to quickly identify the largest drops or
“elbows” in feature relevancy (importance) and select the optimal number of
variables for a given classification or regression model.
Note: We could have limited the number of variables displayed on the plot by
selecting Number of features and then specifying the number of desired
variables. This is useful when the number of input variables is large or we are
particularly interested in a specific number of highly-ranked features.
Run your mouse over each bar in the graph to see the Variable name and
Importance factor, in this case Chi-Square, in the top of the dialog.
Click the X in the upper right hand corner to close the dialog, then click
FS_Output tab to open the Feature Selection report.
Figure 6: Feature Selection: Statistics Table
The Detailed Feature Selection Report displays each computed metric selected
on the Measures tab: Chi-squared statistic, Chi-squared P-Value, Cramer’s V,
Mutual Information, and Gain Ratio.
Mutual Information
Sort the Mutual Information column by largest to smallest value. This statistic
measures how much information the presence/absence of a term contributes to
making the correct classification decision.2 The closer the value to 1, the more
contribution the feature provides.
Figure 9: Mutual Information Statistic
When compared to the Chi-squared and Cramer's V statistics, the top four most
significant variables calculated for Mutual Information are the same:
ejection_fraction, serum_creatinine, age, and serum_sodium.
Gain Ratio
Finally, sort the Gain Ratio from largest to smallest. (Recall that the larger the
gain ratio value, the larger the evidence for the feature to be relevant in the
classification model.)
2 https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
While this statistic's rankings differ from the first four statistics' rankings,
ejection_fraction, age and serum_creatinine are still ranked in the top four
positions.
The Feature Selection tool has allowed us to quickly explore and learn about our
data. We now have a pretty good idea of which variables are the most relevant
or most important to our classification or prediction model, how our variables
relate to each other and to the output variable, and which data attributes would
be worth extra time and money in future data collection. Interestingly, for this
example, most of our ranking statistics have (mostly) agreed on the most
important or relevant features, with strong evidence. We computed and
examined various metrics and statistics, and for some (where p-values can be
computed) we've seen statistical evidence that the test of interest succeeded
with a definitive conclusion. In this example, we've observed that several
variables (or features) were consistently ranked in the top 3-4 most important
variables by most of the measures produced by Analytic Solver Data Science's
Feature Selection tool. However, this will not always be the case. On some
datasets you will find that the ranking statistics and metrics produce competing
rankings. In cases such as these, further analysis may be required.
Fitting the Model
See the Analytic Solver User Guide for an extension of this example which fits a
model to the heart_failure dataset using the top variables found by Feature
Selection and compares that model to a model fit using all variables in the
dataset.
Output Variable
Click the > command button to select the Output Variable. This variable may
be continuous or categorical. If the variable contains more than 10 unique
values, the output variable will be considered “continuous”. If the variable
contains fewer than 10 unique values, the output variable will be considered
“categorical”.
The applicability of the Feature Selection metrics can be described by the
following table:
"N" means that metrics can be applied naturally, and “D” means that features
and/or the outcome variable must be discretized before applying the particular
filter. As a result, depending on the variables (features) selected and the type of
problem chosen in the first dialog, various metrics will be available or disabled
in this dialog.
Discretize predictors
When this option is selected, Analytic Solver Data Science will transform
continuous variables listed under Continuous Variables on the Data Source tab
into categorical variables.
Click the Advanced command button to open the Predictor Discretization -
Advanced dialog. Here the Maximum number of bins can be selected. Analytic
Solver Data Science will assign records to the bins based on if the variable’s
value falls within the interval of the bin (if Equal interval is selected for Bins to
be made with) or on an equal number of records in each bin (if Equal Count is
selected for Bins to be made with). These settings will be applied to each of the
variables listed under Continuous Variables on the Data Source tab.
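The two binning rules can be illustrated in Python with hypothetical data and 3 bins (np.digitize and np.quantile are used here for illustration; the product's exact edge placement may differ):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 50.0, 60.0])

# Equal interval: bin edges evenly spaced over [min, max]
edges_int = np.linspace(x.min(), x.max(), 4)[1:-1]
bins_int = np.digitize(x, edges_int)

# Equal count: edges at quantiles, so each bin holds roughly the same
# number of records
edges_cnt = np.quantile(x, [1 / 3, 2 / 3])
bins_cnt = np.digitize(x, edges_cnt)
```

With skewed data like this, equal-interval bins can be very uneven while equal-count bins stay balanced.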
Kendall concordance
Kendall concordance, also known as Kendall’s tau coefficient, is also used to
measure the level of association between two variables. A tau value of +1
signifies perfect agreement and a -1 indicates complete disagreement. If a
variable and the outcome variable are independent, then one could expect the
Kendall tau to be approximately zero.
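The two endpoint cases described above can be checked with SciPy:

```python
from scipy.stats import kendalltau

# Perfectly concordant and perfectly discordant variable pairs
tau_agree, _ = kendalltau([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
tau_oppose, _ = kendalltau([1, 2, 3, 4, 5], [50, 40, 30, 20, 10])
```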
Welch’s Test
Welch’s Test is a two-sample test (i.e. applicable for binary classification
problems) that is used to check the hypothesis that two populations with
possibly unequal variances have equal means. When used with the Feature
Selection tool, a large T-statistic value (in conjunction with a small p-value)
would provide sufficient evidence that the distribution of values for each of the
two classes is distinct and that the variable may have enough discriminative
power to be included in the classification model.
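Welch's test is available in SciPy as the unequal-variance t-test. A sketch with hypothetical class data (the class values and parameters below are invented for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
class0 = rng.normal(0.0, 1.0, 200)   # feature values in class 0
class1 = rng.normal(1.0, 2.0, 200)   # shifted mean, unequal variance
# equal_var=False selects Welch's t-test
t_stat, p_value = ttest_ind(class0, class1, equal_var=False)
```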
F-Statistic
F-Test tests the hypothesis of at least one sample mean being different from
other sample means assuming equal variances among all samples. If the
variance between the two samples is large with respect to the variance within the
sample, the F-statistic will be large. Specifically for Feature Selection purposes,
it is used to test if a particular feature is able to separate the records from
different target classes by examining between-class and within-class variances.
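The F-statistic for one feature across several target classes can be computed with SciPy's one-way ANOVA (the class values below are illustrative):

```python
from scipy.stats import f_oneway

# One feature's values grouped by three target classes; well-separated
# class means with small within-class spread give a large F
f_stat, p_value = f_oneway([1.0, 2.0, 3.0],
                           [7.0, 8.0, 9.0],
                           [13.0, 14.0, 15.0])
```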
Chi-Squared
The Chi-squared test statistic is used to assess the statistical independence of
two events. When applied to Feature Selection, it is used as a test of
independence to assess whether the assigned class is independent of a particular
variable. The minimum value for this statistic is 0. The higher the Chi-Squared
statistic, the stronger the evidence against independence, and thus the more
relevant the variable.
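The test can be computed with SciPy from a contingency table of a (discretized) feature against the class label (the counts below are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: feature bins; columns: class 0 / class 1 (hypothetical counts)
table = np.array([[30, 10],
                  [10, 30]])
chi2, p, dof, expected = chi2_contingency(table)
```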
Cramer's V
Cramer’s V is a variation of the Chi-Squared statistic that also measures the
association between two discrete nominal variables. This statistic ranges from 0
to 1 with 0 indicating no association between the two variables and 1 indicating
complete association (the two variables are equal).
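Cramer's V can be derived from the Chi-squared statistic using the standard formula (shown for illustration; the product's implementation details are not specified here):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))
```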
Mutual information
Mutual information is the degree of two variables' mutual dependence or the
amount of uncertainty in variable 1 that can be reduced by incorporating
knowledge about variable 2. Mutual Information is non-negative and is equal to
zero if the two variables are statistically independent. Also, it is always less
than the entropy (amount of information contained) in each individual variable.
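For two discrete variables, mutual information can be computed directly from the joint distribution (an illustrative NumPy sketch using the natural logarithm):

```python
import numpy as np

def mutual_information(x, y):
    # I(X;Y) = sum p(x,y) * ln( p(x,y) / (p(x) * p(y)) )
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xv.size, yv.size))
    np.add.at(joint, (xi, yi), 1.0)       # joint counts
    p = joint / joint.sum()               # joint probabilities
    px = p.sum(axis=1, keepdims=True)     # marginal of x
    py = p.sum(axis=0, keepdims=True)     # marginal of y
    nz = p > 0                            # skip zero-probability cells
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())
```

Identical variables give MI equal to the shared entropy; independent variables give 0.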
Gain ratio
This ratio, ranging from 0 to 1, is defined as the mutual information (or
information gain) normalized by the feature entropy. This normalization helps
address the problem of overestimating features with many values but the
normalization overestimates the relevance of features with low entropy. It is a
good practice to consider both mutual information and gain ratio for deciding on
feature rankings. The larger the gain ratio, the larger the evidence for the feature
to be relevant in a classification model.
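The normalization can be sketched as follows (illustrative; assumes a non-constant feature so its entropy is positive):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a discrete variable (natural log)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def gain_ratio(feature, target):
    # Mutual information normalized by the feature's own entropy
    fv, fi = np.unique(feature, return_inverse=True)
    tv, ti = np.unique(target, return_inverse=True)
    joint = np.zeros((fv.size, tv.size))
    np.add.at(joint, (fi, ti), 1.0)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    mi = float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())
    return mi / entropy(feature)
```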
Gini index
The Gini index measures a variable’s ability to distinguish between classes. The
maximum value of the index for binary classification is 0.5. The smaller the
Gini index, the more relevant the variable.
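One common definition of the Gini measure is the Gini impurity, sketched below for illustration (how the product aggregates it per feature is not specified here):

```python
import numpy as np

def gini_index(labels):
    # Gini impurity: 1 - sum_i p_i^2; 0.5 is the two-class maximum
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())
```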
Number of features
Enter a value here ranging from 1 to the number of features selected in the
Continuous and Categorical Variables listboxes on the Data Source tab. This
value, along with the Rank By option setting, will be used to determine the
variables included in the Top Features Table and Feature Importance Plot. This
option has a default setting of “2”.
Chart Wizard
To create a chart, you can invoke the Chart Wizard by clicking Explore on the
Data Science ribbon. A description of each chart type follows.
Bar Chart
The bar chart is one of the easiest and most effective plots to create and understand.
The best application for this type of chart is comparing an individual statistic
(i.e. mean, count, etc.) across a group of variables. The bar height represents the
statistic while the bars represent the different variable groups. An example of a
bar chart is shown below.
Line Chart
A line chart is best suited for time series datasets. In the example below, the line
chart plots the number of airline passengers from January 1949 to December
1960. (The X-axis is the number of months, starting with January 1949 as "1".)
Scatterplot
One of the most common, effective and easy to create plots is the scatterplot.
These graphs are used to compare the relationships between two variables and
are useful in identifying clusters and variable “overlap”.
Variable Plot
Analytic Solver Data Science’s Variables graph simply plots each selected
variable’s distribution. See below for an example.
o To change the variables plotted on the X Axis, use the Filters pane.
Uncheck a variable to remove it from the chart; check it to include it in
the chart. The data range may be altered by clicking the up and down
buttons on the spinner fields or by moving the sliders left (to
decrement) or right (to increment).
On charts that explicitly include the variable tag field, use these tags to
add to or remove variables from the chart. For more information on the
tag field, see below.
o To change the plotted metric on the Y Axis, click the down arrow next
to Y Axis.
o To add a 2nd chart to the window, simply click the New Chart icon at
the top, left of the dialog.
Click the open icon to open the chart and the delete icon to delete the chart.
You can also open an existing chart by clicking the Existing Charts tab
on the opening screen of the Chart Wizard.
Charts are saved by worksheet. Select the desired worksheet to view the saved charts.
o Click the back arrow to go back to the chart type selection dialog. Note
that the current chart will be lost if not saved.
o Chart changes are tracked. Confirmation dialogs will be displayed
when going back, closing, or opening an existing chart while the
current chart has unsaved changes. The chart title will display an *
when unsaved changes have been applied to the chart.
o Use the Data field to control which data to include in the chart.
o First Row Contains Headers is selected by default. This option should
be selected if the first row in the data range contains column headings.
o Use the icons on the top right of the chart to collapse the current chart
or print the current chart.
Uncheck class 4 under the X-Var filter to remove this class from the plot.
• To select a different variable on the X-axis, click the down arrow next
to X-Axis and select the desired variable from the menu, say X-Var.
Y1 plotted on the Y axis
The solid line denotes the Median. The box reaches from the 25th Percentile to
the 75th Percentile. The “whiskers” denote the extreme minimum and
maximum values, excluding outliers. In this example, there is one outlier below
the minimum “whisker”. Outliers are values that fall outside of the range:
• Min: 25th Percentile – 1.5 * IQR – Any data point falling below this
limit is considered an outlier. In this example, there is one such outlier,
shown above.
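The 1.5 * IQR fence rule that the box plot's whiskers use can be sketched directly; this is an illustrative sketch using Python's standard library, not Analytic Solver's charting code, and the function name is hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
    the same fences used by the box plot's whiskers."""
    xs = sorted(values)
    # quantiles(..., n=4) returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lo or x > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 14, 11, 45]))  # [45]
```

Any value returned by this function would fall beyond a whisker and be drawn as an individual outlier point.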
Note that the Filters pane applies to both charts. The bottom graph charts the Y1
and Y2 values for (only) X variable 3.
Histogram Example
The example below illustrates the use of Analytic Solver Data Science’s chart
wizard in drawing a histogram of the Utilities dataset. Click Help – Example
Models on the Data Science ribbon to open the example dataset, Utilities.xlsx
under Forecasting/Data Science Examples. Select a cell within the dataset, say
A2, and then click Explore – Chart Wizard on the Data Science ribbon. Select
Histogram, and then click Next.
• On the opening chart, select X1 for the X Axis to create a histogram of
the values for the x1 variable.
• Number of Bins: Move the slider right or left to increase or decrease
the number of bins in the chart.
• For the Y Axis, select Frequency, Cumulative Frequency or Reverse
Cumulative Frequency.
Set Color By to x7. This new chart indicates the corresponding value for x7 for
each utility. For example, the 5th bin consists of 1 value, 0.96 for the Pacific
utility. The color of this bin indicates that the value of x7 for this specific utility
is 0.9.
The y-axis plots the number of passengers and the x-axis plots the month. This
plot shows that, as the months progress, the number of airline passengers
follows an increasing trend with yearly seasonal dips.
Remove all variables from the chart except Indy500 and Daytona 500.
The first thing that we notice is the range of each of the races that are indicated
at the top and bottom of each vertical line. The range of ratings for the Indy 500
has a high of 10.9 and a low of 2.3 whereas the range for the Daytona 500 is
11.3 to 4.4. As a result, this chart already conveys that the viewership for the
Daytona 500 is larger than the viewership for the Indy 500.
When looking at the observations for each feature, this chart shows that in most
years the viewership of the Indy 500 was low whereas the viewership for the
Daytona 500 was high. There are just four years where high ratings for the
Indy 500 were recorded. In those same years, the ratings for the Daytona 500
were correspondingly lower.
ScatterPlot Example
The example below illustrates the use of Analytic Solver Data Science’s chart
wizard in drawing a Scatterplot using the Iris.xlsx dataset. Click Help –
Example Models on the Data Science ribbon to open the example dataset,
Iris.xlsx from Forecasting/Data Science Examples. Select a cell within the
dataset, say A2, and then click Explore – Chart Wizard on the Data Science
ribbon. Select Scatter Plot.
Note that the graph shows that Iris flowers with small, medium or large petal widths
belong to the same Iris species.
Click the down arrow next to X Axis (at the top, left) and select Petal_length from the
menu.
Notice that Iris flowers with small, medium or large petal widths generally have the
same (small, medium or large) petal lengths.
This chart clearly shows that, of the three varieties of Iris flowers, the Setosa variety
has the smallest petal widths and lengths, the Virginica variety has the largest petal
widths and lengths, and the Versicolor variety lies between the other two.
Alternatively, select the down arrow next to Color By and select Species_No to color
each data point by species type.
The first four continuous variables in the dataset are included in the chart by
default. (Note: Only continuous variables may be included in the Variable
chart.)
To open a previously saved chart, click the Existing Charts tab and then select
the desired chart.
Open a new chart.
Note on data selection: When a single cell is selected that is located inside or adjacent to a dataset,
the entire dataset will be used for the “Data” cell range. If the user has selected multiple cells in a
single rectangular region, those cells, excluding any bounding empty cells, will be used for the
“Data” cell range. This behavior was added to allow users to create a chart on a subset of a
worksheet dataset without being forced to edit the Data address after the chart has been
created.
Note on chart changes: Chart changes are tracked. Confirmation dialogs will be displayed
when going back, closing, or opening an existing chart while the current chart has unsaved
changes. The chart title will display an * when unsaved changes have been applied to the chart.
The following options appear on the following chart types: bar, box plot,
histogram, line, scatter plot and variable.
X Axis: Click the down arrow to select the variable(s) to appear on the x-axis.
Y Axis: Click the down arrow to select the metric to appear on the y-axis.
Panel By: If the dataset contains categorical variables, click the down arrow
next to Panel By to display the data by category in separate charts. Each
category will be displayed in a separate chart, as shown below.
Introduction
Analytic Solver Data Science’s Missing Data Handling utility allows you to
detect missing values in a dataset and handle them in a way you specify.
Analytic Solver Data Science considers an observation to be missing if the cell
is empty or contains an invalid formula. Analytic Solver Data Science also
allows you to indicate specific data that you want designated as "missing" or
"corrupt".
Analytic Solver Data Science offers several different methods for dealing with
missing values. Each variable can be assigned a different “treatment”. For
example, if there is a missing value, then the entire record could be deleted or
the missing value could be replaced by an estimated mean/median/mode of the
bin or even with a value that you specify. The available options depend on the
variable type.
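The single-column treatments described above (mean, median, mode, or a user-specified value) can be sketched in plain Python. This is an illustrative sketch, not Analytic Solver's implementation; the `MISSING` marker set and the function name are assumptions made for the example.

```python
import statistics

# Illustrative markers for "missing or invalid" cells (assumed for this sketch)
MISSING = {None, "", "#NAME?"}

def impute(column, treatment, user_value=None):
    """Apply one treatment to a single column.
    'treatment' is one of: 'mean', 'median', 'mode', 'value'."""
    valid = [v for v in column if v not in MISSING]
    if treatment == "mean":
        fill = statistics.mean(valid)
    elif treatment == "median":
        fill = statistics.median(valid)
    elif treatment == "mode":
        fill = statistics.mode(valid)   # most frequent valid value
    elif treatment == "value":
        fill = user_value               # user-specified replacement
    return [fill if v in MISSING else v for v in column]

print(impute([1, None, 4], "mean"))              # [1, 2.5, 4]
print(impute(["aa", "dd", None, "dd"], "mode"))  # ['aa', 'dd', 'dd', 'dd']
```

The Delete Record treatment works across a whole row rather than one column, so it is omitted from this per-column sketch.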
In the following examples, we will explore the various ways in which Analytic
Solver Data Science can treat missing or invalid values in a dataset.
Open the Missing Data Handling dialog by clicking Data Science – Transform –
Missing Data Handling. Confirm that Example 1 is displayed for Worksheet.
Now select Variable_3 in the Variables field and click the down arrow next to
Mean under How do you want to handle missing values for the selected
variable(s).
Click OK to transform the data. See the newly inserted Imputation1 worksheet
for the results, shown below.
In the Variable_1 column, invalid or missing values have been replaced with the
mean calculated from the remaining values in the column (12.34, 34, 44, -433,
43, 34, 6743, 3, 4 & 3). The cells containing missing or invalid values in the
Variable_3 column have been replaced by the median of the remaining values in
that column (12, 33, 44, 66, 33, 66, 22, 88, 55 & 79). The invalid data for
Variable_2 remains since no treatment was selected for this variable.
In the Example 3 dataset, Variable_3 has been replaced with date values.
The missing values in the Variable_2 column have been replaced by the mode of
the valid values (dd) even though, in this instance, the data is non-numeric.
(Remember, the mode is the most frequently occurring value in the Variable_2
column.)
In the Variable_3 column, the third and ninth records contained missing values.
As you can see, they have been replaced by the mode for that column, 2-Feb-01.
Open the Missing Data Handling dialog. Confirm that Example 4 is displayed
for Worksheet.
Select Variable_1, then click the down arrow next to Select treatment under
How do you want to handle missing values for the selected variable(s), then
select User specified value. In the field that appears directly to the right of User
specified value, enter 100, then click Apply to selected variable(s). Repeat
these steps for Variable_2. Then click OK.
Open the Missing Data Handling dialog. Confirm that Example 5 is displayed
for Worksheet or Data Source within the Data Source group.
Select Missing values are represented by this value and enter -999 in the field
that appears directly to the right of the option.
Select Variable_1 in the Variables field, click the down arrow next to Select
treatment and choose Mean from the menu, then click Apply to selected
variable(s).
Select Variable_2 in the Variables field, click the down arrow next to No
Treatment and choose User specified value from the menu. Enter “zzz” for the
value then click Apply to selected variable(s).
Finally, select Variable_3 in the Variables field, click the down arrow next to
User specified value and choose Mode from the menu. Click Apply to selected
variable(s).
Note that in the Variable_1 column, the specified missing code (-999) was
replaced by the mean of the column (in record 12). In the Variable_2 column,
the missing values have been replaced by the user specified value of “zzz” in
records 5 and 7, and for variable_3, by the mode of the column in record 9.
Let’s take a look at one more dataset, Example 6, of Examples.xlsx.
Records 7 and 12 have been deleted since Delete Record was chosen for the
treatment of missing values for Variable_1. In the Variable_2 column, the
missing values in records 2 and 11 have been replaced by the mode of the
column, "dd". (Remember, record 7 (which included #NAME for Variable_2)
Variables
Each variable and its selected treatment option are listed here.
Reset
Resets treatment for all variables listed in the Variables field. Also, deselects
the Overwrite Existing Worksheet option if selected.
OK
Click to run the Missing Data Handling feature of Analytic Solver Data Science.
Introduction
Analytic Solver Data Science contains two techniques for transforming
continuous data: Binning and Rescaling.
Click Finish. The Bin_Output2 output sheet displays the number of bins
created and the number of records assigned to each bin.
The Bin_Output3 worksheet displays the 4 intervals and the number of records
along with the range of values assigned to each bin: Bin 1 (96 to 135), Bin 2
(135 to 174), Bin 3 (174 to 213), and Bin 4 (213 to 252).
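The equal-width intervals shown in that output (four bins of width 39 over the range 96 to 252) can be reproduced with a short sketch. This is an illustrative calculation, not Analytic Solver's binning code, and the function name is hypothetical.

```python
def equal_width_bins(values, k):
    """Split the range of `values` into k equal-width intervals and
    return each value's bin number (1..k)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    assign = []
    for v in values:
        # the maximum value falls in the last bin, not a (k+1)-th one
        b = min(int((v - lo) / width) + 1, k)
        assign.append(b)
    return assign

# Range 96..252 split into 4 bins of width 39, as in Bin_Output3:
print(equal_width_bins([96, 134, 135, 200, 252], 4))  # [1, 1, 2, 3, 4]
```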
Click Done to accept the random partition defaults. For more information on
partitioning, see the Random Data Partitioning chapter that occurs later in this
guide.
• Under Rescaling: Fitting, Select Adjusted Normalization.
• Leave the Correction option set to the default of 0.01.
• Select Show Fitted Statistics to include in the output.
Click the Fitted Statistics link to navigate to the Fitted Statistics table located on
the Rescaling output sheet. Shift and Scale values are inferred from the training
data. Each formula below can be rearranged into the form (x - shift)/scale.
Other partitions and new data are then rescaled using the statistics of the data
features in the training set.
Click the Transformed: Training link on the Output Navigator to display the
rescaled variable values for the Training partition.
Note: Unselected variables are appended to the rescaled variables in the
Transformed: Training and Transformed: Validation data tables to maintain the
complete input data.
Click the Transformed: Validation link on the Output Navigator to display the
rescaled variable values for the Validation partition.
Rescaling Options
See below for an explanation of options on all three tabs of the Rescaling dialog:
Data, Parameters and Transformation tabs.
The following options appear on all three tabs of the Rescaler dialog.
Help: Click the Help button to access documentation on all Rescaling
options.
Data Source
Worksheet: Click the down arrow to select the desired worksheet where the
dataset is contained.
Workbook: Click the down arrow to select the desired workbook where the
dataset is contained.
Data range: Select or enter the desired data range within the dataset. This data
range may either be a portion of the dataset or the complete dataset.
#Columns: Displays the number of columns in the data range. This option is
read only.
Variables
First Row Contains Headers: Select this checkbox if the first row in the
dataset contains column headings.
Variables: This field contains the list of the variables, or features, included in
the data range.
Selected Variables: This field contains the list of variables, or features, to be
included in Rescaling.
• To include a variable in Rescaling, select the variable in the
Variables list, then click > to move the variable to the Selected
Variables list.
• To remove a variable as a selected variable, click the variable in the
Selected Variables list, then click < to move the variable back to the
Variables list.
Rescaling: Fitting
Use Rescaling to normalize one or more features in your data. Many Data
Science workflows include feature scaling/normalization during the data
preprocessing stage. Along with this general-purpose facility, you can access
rescaling functionality directly from the dialogs for Supervised Algorithms
available in the Analytic Solver Data Science application.
Analytic Solver Data Science provides the following methods for feature
scaling: Standardization, Normalization, Adjusted Normalization and Unit
Norm.
• Standardization makes the feature values have zero mean and unit
variance. (x−mean)/std.dev.
• Normalization scales the data values to the [0,1]
range. (x−min)/(max−min)
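Both formulas reduce to the (x - shift)/scale form mentioned earlier, with the shift and scale statistics fitted on the training partition and reused for other partitions. The following is an illustrative sketch of that pattern, not Analytic Solver code; the function names are hypothetical.

```python
import statistics

def fit_standardize(train):
    """Standardization: shift = mean, scale = std. dev. (training data only)."""
    return statistics.mean(train), statistics.stdev(train)

def fit_normalize(train):
    """Normalization to [0,1]: shift = min, scale = max - min."""
    return min(train), max(train) - min(train)

def rescale(values, shift, scale):
    """Every method reduces to (x - shift) / scale."""
    return [(v - shift) / scale for v in values]

train = [2.0, 4.0, 6.0, 8.0]
shift, scale = fit_normalize(train)     # shift = 2.0, scale = 6.0
print(rescale(train, shift, scale))     # [0.0, 0.333..., 0.666..., 1.0]
# Validation or new data reuses the training statistics:
print(rescale([5.0], shift, scale))     # [0.5]
```

Note that the validation value is rescaled with the training partition's shift and scale, exactly as described for the Fitted Statistics table above.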
New Data
See the Scoring New Data chapter in the Data Science User Guide for more
information on scoring new data within a worksheet or database.
Introduction
Analysts often deal with data that is not numeric. Non-numeric data values can
be alphanumeric (a mix of text and numbers) or numeric values with no numerical
significance (such as a postal code). Such variables are called 'categorical'
variables, where every unique value of the variable is a separate 'category'.
Categorical variables can be nominal or ordinal. Nominal variable values have
no order, for example, True/False or Male/Female. Values for an ordinal
variable have a clear order but no fixed unit of measurement, e.g., grade levels
(Kindergarten, First, Second, Third, Fourth, Fifth) or a size chart of 1, 2, 3, 4, 5.
Dealing with categorical data poses some limitations. For example, if your data
contains a multitude of categories, you might want to combine several categories
into one or perhaps you may want to use a data science technique that does not
directly handle untransformed categorical variables.
Analytic Solver Data Science provides options to transform data in the
following ways:
1. By Creating Dummy Variables: When this feature is used, a non-numeric
variable (column) is transformed into several new numeric or binary
variables (columns).
Imagine a variable called Language which has data values English, French,
German and Spanish. Running this transformation will result in the creation
of four new variables: Language_English, Language_French,
Language_German, and Language_Spanish. Each of these variables will
take on values of either 0 or 1 depending on the value of the Language
variable in the record. For instance, if in a particular record Language =
German, then among the dummy variables created, Language_German
will be 1 while the other Language_XXX variables will be set to zero.
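The Language example can be sketched in a few lines; this is an illustrative sketch of dummy (indicator) variable creation, not Analytic Solver's implementation, and the function name is hypothetical.

```python
def make_dummies(column, name):
    """Expand one categorical column into 0/1 indicator columns,
    one per distinct value, as in the Language example above."""
    categories = sorted(set(column))
    header = [f"{name}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in column]
    return header, rows

header, rows = make_dummies(["German", "English", "German"], "Language")
print(header)   # ['Language_English', 'Language_German']
print(rows)     # [[0, 1], [1, 0], [0, 1]]
```

Each record gets a 1 in exactly one of the new columns, corresponding to its original category value.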
2. Create Category Scores: In this feature, a string variable is converted into
a new numeric, categorical variable.
3. Reduce Categories: This utility helps you create a new categorical
variable that reduces the number of categories. You can reduce the number
of categories “by frequency” or “manually”.
There are two different options to choose from.
A. Option 1 assigns categories 1 through n - 1 to the n - 1
most frequently occurring categories, and assigns
category n to all remaining categories.
B. Option 2 maps multiple distinct category values in the
original column to a new category variable between 1 and
n, where n is the number of distinct values in the variable.
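Option 1, reduction by frequency, can be sketched as follows; this is an illustrative sketch, not Analytic Solver's implementation, and the function name is hypothetical.

```python
from collections import Counter

def reduce_by_frequency(column, n):
    """Assign 1..n-1 to the n-1 most frequent values; everything
    else is lumped into the last category, n (Option 1 above)."""
    ranked = [v for v, _ in Counter(column).most_common(n - 1)]
    mapping = {v: i + 1 for i, v in enumerate(ranked)}
    return [mapping.get(v, n) for v in column]

data = ["a", "a", "a", "b", "b", "c", "d"]
print(reduce_by_frequency(data, 3))  # [1, 1, 1, 2, 2, 3, 3]
```

Here "a" (3 instances) becomes category 1, "b" (2 instances) becomes category 2, and the remaining values are lumped into category 3.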
Note: See the Analytic Solver User Guide for data limitations in Analytic
Solver Comprehensive/Data Science.
Click OK and view the output, Encoding, which is inserted on the Model tab of
the Analytic Solver Task Pane under Data Science – Transformations – Create
Dummies.
Analytic Solver Data Science has sorted the values of the Species_name variable
alphabetically and then assigned values of 1, 2 or 3 to each record depending on
the species type. (Starting from 1 because we selected Assign numbers 1,2,3....
To have Analytic Solver Data Science start from 0, select the option Assign
numbers 0, 1, 2,… on the Create Category Scores dialog.) A variable,
Factorized_Species_name is created to store these assigned numbers. Analytic
Solver Data Science has converted this dataset to an entirely numeric dataset.
…then select the Manually radio button under the Assign Category heading.
All unique values of the Petal_length variable are now listed. Select all
categories with Values less than 2 (so Value = 1 to 1.9); click the down arrow
next to Category and select 1, then click Apply.
Repeat these steps for categories with values from 3 to 3.9 and apply a Category
number of 2. Continue repeating these steps until values ranging from 4 through
4.9 are assigned a category number of 3, values ranging from 5 through 5.9 are
assigned a category number of 4, and values ranging from 6 through 6.9 are
assigned a category number of 5.
In the output, Analytic Solver Data Science has assigned new categories as
shown in the column, Reduced-Petal_Length, based on the choices made in the
Reduce Categories dialog.
There are 22 unique values for Petal_width and Analytic Solver Data Science
has classified the Petal_width variable using 12 different categories. The most
frequently appearing value is 0.2 (with 29 instances) which has been assigned to
category 1. The second most frequently appearing value is 1.3 (with 13
instances) which has been assigned to category 2. See the chart below for all
category assignments.
Value Number of Instances Assigned Category
0.2 29 1
1.3 13 2
1.8 12 3
1.5 12 4
2.3 8 5
1.4 8 6
0.4 7 7
Data Range
Either type the cell address directly into this field or, using the reference button,
select the required data range from the worksheet or dataset. If the cell pointer
(active cell) is already somewhere in the data range, Analytic Solver Data
Science automatically picks up the contiguous data range surrounding the active
cell. When the data range is selected Analytic Solver Data Science displays the
number of records in the selected range.
Variables
This list box contains the names of the variables in the selected data range. To
select a variable, simply click to highlight, then click the > button. Use the
CTRL or SHIFT keys to select multiple variables.
Variables to be factored
This list box contains the names of the input variables or the variables that will
be replaced with dummy variables. To remove a variable, simply click to
highlight, then click the < button. Use the CTRL or SHIFT keys to select
multiple variables.
Category variable
Click the down arrow to select the desired variable for category reduction.
Assign Category
If By frequency is selected, incrementally increasing category numbers will be
assigned to each category as the number of instances decreases, until category
n - 1 is assigned. All remaining values will then be lumped into the last
category, n. If this option is selected, the Limit number of categories to option
will be enabled.
If Manually is selected, Analytic Solver Data Science allows you to assign a
specific category number to single or multiple categories using the Assign
Category ID dropdown menu. If this option is selected, the Category option
will be enabled.
Assign Category ID
If Manually is selected, Assign Category ID is enabled. Click the down arrow to
select the Category number to assign to each unique value for the variable. This
list will contain values from 1 to n where n is the maximum number of distinct
values contained in the variable. Click Apply to apply this mapping, or Reset to
start over.
Reset
Click the Reset command button to reset all categories in the variable to
unassigned.
Introduction
In the data science field, databases with large numbers of variables are routinely
encountered. In most cases, the size of the database can be reduced by removing
highly correlated or superfluous variables. The accuracy and reliability of a
classification or regression model produced from this resultant database will be
improved by the removal of these redundant and unnecessary variables. In
addition, superfluous variables increase the data-collection and data-processing
costs of deploying a model on a large database. As a result, one of the first steps
in data science should be finding ways to reduce the number of independent or
input variables used in the model (otherwise known as dimensionality) without
sacrificing accuracy.
Dimensionality Reduction is the process of reducing the number of variables to
be used as input in a regression or classification model. This domain can be
divided into two branches: feature selection and feature extraction. Feature
selection attempts to discover a subset of the original variables, while feature
extraction attempts to map a high-dimensional model to a lower-dimensional
space. In the past, Analytic Solver (previously referred to as XLMiner) only
contained a feature extraction tool, Principal Components Analysis (Transform –
Principal Components). However, in V2015, a new feature selection tool was
introduced, Feature Selection. This chapter explains Analytic Solver Data
Science’s Principal Components Analysis functionality. For more information
on Analytic Solver Data Science’s Feature Selection tool, please see the
previous chapter, “Feature Selection”.
Principal component analysis (PCA) is a mathematical procedure that
transforms a number of (possibly) correlated variables into a smaller number of
uncorrelated variables called principal components. The objective of principal
component analysis is to reduce the dimensionality (number of variables) of the
dataset but retain as much of the original variability in the data as possible. The
first principal component accounts for the largest share of the variability in the
data, the second principal component accounts for the largest share of the
remaining variability, and so on.
A principal component analysis is concerned with explaining the variance-
covariance structure of a high-dimensional random vector through a few linear
combinations of the original component variables. Consider a database X with m
rows and n columns (here, a 4x3 matrix X):
X11 X12 X13
X21 X22 X23
X31 X32 X33
X41 X42 X43
1. The first step in reducing the number of columns (variables) in the X matrix
using the Principal Components Analysis algorithm is to find the mean of
each column.
(X11 + X21 + X31 + X41)/4 = Mu1
where the coefficient vectors l1, l2, etc. are chosen such that they satisfy the
following conditions:
First Principal Component = the linear combination l1'X that maximizes
Var(l1'X), subject to ||l1|| = 1
Second Principal Component = the linear combination l2'X that maximizes
Var(l2'X), subject to ||l2|| = 1
and Cov(l1'X, l2'X) = 0
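The computation behind these conditions (center each column, form the covariance matrix, then take its eigenvalues) can be carried out directly. The sketch below uses a small two-column toy dataset (not the X matrix above) so that the eigenvalues of the 2x2 covariance matrix have a closed form; it is an illustrative sketch, not Analytic Solver's PCA code.

```python
import math

# Two-column toy data (rows = records), standing in for the X matrix above.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

n = len(X)
# Step 1: subtract each column's mean (Mu1, Mu2).
mu = [sum(row[j] for row in X) / n for j in range(2)]
C = [[x - mu[0], y - mu[1]] for x, y in X]

# Step 2: sample covariance matrix of the centered columns.
def cov(a, b):
    return sum(C[i][a] * C[i][b] for i in range(n)) / (n - 1)
s11, s22, s12 = cov(0, 0), cov(1, 1), cov(0, 1)

# Eigenvalues of the 2x2 covariance matrix in closed form; the larger
# one is the variance captured by the first principal component.
tr, det = s11 + s22, s11 * s22 - s12 * s12
lam1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2
lam2 = (tr - math.sqrt(tr * tr - 4 * det)) / 2
print(round(lam1 / (lam1 + lam2), 3))  # share of variance explained by PC1
```

For this toy data the first component captures the large majority of the total variance, which is exactly what the Explained Variance output described later reports for each component.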
The output from PCA_Scores1 is shown below. This table holds the weighted
averages of the normalized variables (after each variable’s mean is subtracted).
(This matrix is described in the 2nd step of the PCA algorithm - see Introduction
above.) Again, we are looking for the magnitude or absolute value of each figure
in the table.
Data Source
Worksheet: Click the down arrow to select the desired worksheet where the
dataset is contained.
Workbook: Click the down arrow to select the desired workbook where the
dataset is contained.
Data range: Select or enter the desired data range within the dataset. This data
range may either be a portion of the dataset or the complete dataset.
#Columns: Displays the number of columns in the data range. This option is
read only.
#Rows: Displays the number of rows in the data range. This option is read
only.
Variables
First Row Contains Headers: Select this checkbox if the first row in the
dataset contains column headings.
Variables In Input Data: This field contains the list of the variables, or
features, included in the data range.
Selected Variables: This field contains the list of variables, or features, to be
included in PCA.
• To include a variable in PCA, select the variable in the Variables In
Input Data list, then click > to move the variable to the Selected
Variables list.
• To remove a variable as a selected variable, click the variable in the
Selected Variables list, then click < to move the variable back to the
Variables In Input Data list.
PCA: Model
Select the number of principal components displayed in the output.
# components
Specify a fixed number of components by selecting this option and entering an
integer value from 1 to n where n is the number of Input variables selected in
the Data tab. This option is selected by default; the default value of n is equal to
the number of input variables. This value can be decreased to 1.
3 Shmueli, Galit, Nitin R. Patel, and Peter C. Bruce. Data Mining for Business Intelligence. 2nd ed. New Jersey: Wiley, 2010.
PCA: Display
Select the type of output to be inserted into the workbook in this section of the
Parameters tab. If no output is selected, by default, PCA will output three
tables: Inputs, Principal Components and Explained Variance on the
PCA_Output worksheet.
Inputs: This table displays the options selected on both tabs of the Principal
Components Analysis dialog. This table is accessible by clicking the Inputs link
in the Output Navigator.
Principal Components: This table displays how each variable affects each
component.
In the example below, the maximum magnitude element for Component1
corresponds to x2 (-0.5712). This signifies that the first principal component is
measuring the effect of x2 on the utility companies. Likewise, the second
component appears to be measuring the effect of x6 on the utility companies
(maximum magnitude = |-0.6031|). This table is accessible by clicking the
Principal Components link in the Output Navigator.
Explained Variance: This table includes 3 columns: Eigenvalue, Variance, %
and Cumulative Variance. This table is accessible by clicking the Explained
Variance link in the Output Navigator.
• The Eigenvalue column displays the eigenvalues computed from the
covariance matrix, listed in order from largest to smallest.
Components with larger eigenvalues capture more of the variance and
should be retained; components with smaller eigenvalues can be
removed according to the user’s preference.
• The Variance, % column displays the variance attributed by each
Component. In the example below, Component1 accounts for 27.16%
of the variance while the second component accounts for 23.75%.
• The Cumulative Variance column displays the cumulative variance. In
the example below, Components 1 and 2 account for more than 50% of
the total variation.
Show Q - Statistics
If this option is selected, Analytic Solver Data Science will include Q-Statistics
in the output worksheet, PCA_Stats. Q-statistics (or residuals) measure the
difference between the sample data and the projection of the model onto the
sample data. These statistics can also be used to determine whether any outliers
exist in the data. Low Q-statistic values indicate a well-fitting model. This table
is also accessible by clicking the Q-Statistic link in the Output Navigator.
Introduction
Cluster Analysis, also called data segmentation, has a variety of goals which all
relate to grouping or segmenting a collection of objects (also called observations,
individuals, cases, or data rows) into subsets or "clusters". These “clusters” are
grouped in such a way that the observations within each cluster are more
closely related to one another than to objects assigned to different clusters.
Central to cluster analysis is the notion of the degree of similarity (or
dissimilarity) between the individual objects being clustered. There are two major
methods of clustering: hierarchical clustering and k-means clustering.
This chapter explains the k-Means Clustering algorithm. (See the Hierarchical
Clustering chapter for information on this type of clustering analysis.) The goal of
this process is to divide the data into a set number of clusters (k) and to assign each
record to a cluster while minimizing the dispersion within each cluster. A non-
hierarchical approach to forming good clusters is to specify a desired number of
clusters, say, k, then assign each case (object) to one of k clusters so as to minimize
a measure of dispersion within the clusters. A very common measure is the sum of
distances or sum of squared Euclidean distances from the mean of each cluster. The
problem can be set up as an integer programming problem but because solving
integer programs with a large number of variables is time consuming, clusters are
often computed using a fast, heuristic method that generally produces good (but not
necessarily optimal) solutions. The k-Means algorithm is one such method.
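The heuristic just described (choose initial centroids, assign each point to its nearest centroid, recompute centroids, repeat) can be sketched in plain Python. This is an illustrative sketch, not the algorithm as implemented in Analytic Solver; the function name and toy data are assumptions.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=12345):
    """Plain k-Means: pick k random centroids, then alternate between
    assigning points to the nearest centroid and recomputing the
    centroid (mean) of each cluster."""
    rng = random.Random(seed)          # fixed seed => reproducible run
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:                # keep old centroid if cluster empties
                centroids[j] = tuple(
                    sum(v) / len(members) for v in zip(*members))
    return centroids, clusters

# Two well-separated blobs of three points each:
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Running the sketch several times with different seeds and keeping the assignment with the smallest sum of squared distances mirrors the multiple-starts idea that the #Starts option described in this chapter is based on.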
Inputs
Click Data Science – Cluster – k-Means Clustering to open the k-Means
Clustering dialog.
Select all variables under Variables except Type, then click the > button to shift
the selected variables to the Selected Variables field.
Partition Data
Analytic Solver Data Science includes the
ability to partition a dataset from within the
k-Means Clustering method by clicking
Partition Data on the Parameters tab.
Analytic Solver Data Science will partition
your dataset (according to the partition
options you set) immediately before
running k-Means Clustering. If partitioning
has already occurred on the dataset, this
option will be disabled. For more
information on partitioning options, please
see the Data Science Partitioning chapter
that appears later in this guide. This
example does not perform partitioning on
the dataset.
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this feature, see the
Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide. If rescaling has already been
performed, this button will be disabled. This example does not utilize the
Rescale Data feature.
Afterwards, click Next to advance to the next tab.
Enter 8 for # Clusters to instruct the k-Means Clustering algorithm to form 8
cohesive groups of observations in the Wine data. One can use the results of
Hierarchical Clustering or several different values of k to understand the best
setting for # Clusters.
Enter 10 for # Iterations. This option limits the number of iterations for the k-
Means Clustering algorithm.
Random Seed is initialized to the default setting of “12345”. This option
initializes the random number generator that is used to assign the initial cluster
centroids. Setting the random number seed to a positive value ensures the
reproducibility of the analysis.
Increase # Starts to 5. The final result of the k-Means Clustering algorithm
depends on the initial choice of the cluster centroids. The best assignment
(based on Sum of Squared Distances) is chosen as an initialization for further
k-Means iterations.
Leave Cluster Centers selected under Fitting and Fitting Metrics selected
under Training Data. For more information on the remaining output options,
see the k-Means Clustering Options section immediately following this example.
Click Next to advance to the Scoring tab. Notice that Training, under Score
Partitioned Data, is selected by default. Validation and Testing are disabled
under Partitioned Data since partitioning was not performed on the dataset.
Analytic Solver will score the training dataset. See the Scoring chapter in the
Analytic Solver User Guide for information on scoring new data In worksheet or
In Database.
k-Means Clustering dialog, Scoring tab
Click Finish.
The k-Means Clustering method starts with k initial clusters. The algorithm
proceeds by alternating between two steps: "assignment" – where each record is
assigned to the cluster with the nearest centroid, and "update" – where new
cluster centroids are recomputed based on the partitioning found in the
"assignment" step.
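The alternating "assignment"/"update" steps just described can be sketched in a few lines of Python. This is an illustrative sketch only, not Analytic Solver's implementation; the seed and iteration limit simply echo the defaults used in this example.

```python
# Minimal k-Means sketch: alternate "assignment" and "update" steps.
import math
import random

def kmeans(records, k, max_iters=10, seed=12345):
    rng = random.Random(seed)            # fixed seed -> reproducible centroids
    centroids = rng.sample(records, k)   # initial cluster centers
    for _ in range(max_iters):
        # Assignment step: each record goes to the cluster with nearest centroid.
        clusters = [[] for _ in range(k)]
        for rec in records:
            nearest = min(range(k), key=lambda j: math.dist(rec, centroids[j]))
            clusters[nearest].append(rec)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged before the iteration limit
            break
        centroids = new_centroids
    return centroids, clusters
```

Running this on two well-separated groups of points recovers the groups regardless of which records the seed picks as initial centroids.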
Results
The results of the clustering method, KMC_Output and KMC_TrainingClusters,
are inserted to the right of the Data worksheet.
Help: Click the Help button to access documentation on all k-Means Clustering
options.
Cancel: Click the Cancel button to close the dialog without running k-Means
Clustering.
Next: Click the Next button to advance to the next tab.
Finish: Click Finish to accept all option settings on all three dialogs, and run k-
Means Clustering.
Data Source
Worksheet: Click the down arrow to select the desired worksheet where the
dataset is contained.
Workbook: Click the down arrow to select the desired workbook where the
dataset is contained.
Data range: Select or enter the desired data range within the dataset. This data
range may either be a portion of the dataset or the complete dataset.
#Columns: Displays the number of columns in the data range. This option is
read only.
#Rows In Training Set, Validation Set and Test Set: Displays the number of
rows in the training, validation and/or test partitions, if they exist. This option is
read only.
Variables
First Row Contains Headers: Select this checkbox if the first row in the
dataset contains column headings.
Variables: This field contains the list of the variables, or features, included in
the data range.
Selected Variables: This field contains the list of variables, or features, to be
included in k-Means Clustering.
• To include a variable in k-Means Clustering, select the variable in the
Variables list, then click > to move the variable to the Selected
Variables list.
• To remove a variable as a selected variable, click the variable in the
Selected Variables list, then click < to move the variable back to the
Variables list.
Preprocessing
Analytic Solver Data Science allows partitioning to be performed on the
Parameters tab for k-Means Clustering, if the active data set is un-partitioned.
If the active data set has already been partitioned, this button will be disabled.
Clicking the Partition Data button opens the following dialog. Select Partition
Data on the dialog to enable the partitioning options. See the Partitioning
chapter for descriptions of each Partitioning option shown in the dialog below.
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. If the input data is normalized, k-Means
# Clusters
Enter the number of final cohesive groups of observations (k) to be formed here.
The number of clusters should be at least 1 and at most one less than the number
of observations in the data range. This value should be based on your knowledge
of the data and the number of projected clusters. One can use the results of
Hierarchical Clustering or several values of k to understand the best value for #
Clusters. The default value for this option is 2.
# Iterations
This option limits the number of iterations for the k-Means Clustering algorithm.
Even if the convergence criterion has not yet been met, the cluster adjustment will
stop once the limit on # Iterations has been reached. The default value for this
option is 10.
Random Seed
This option initializes the random number generator that is used to assign the
initial cluster centroids. Setting the random number seed to a positive value
ensures that the same sequence of random
numbers is used each time the initial cluster centroids are calculated. The
default value is “12345”. The minimum value for this option is 1. To set the
seed, type the number you want into the box. This option accepts positive
integers with up to 9 digits.
# Starts
Enter a value greater than 1 for this option to specify the number of desired
starting points. The final result of the k-Means Clustering algorithm depends
on the initial choice of the cluster centroids. The “random starts” heuristic runs
the initialization from several randomly chosen sets of centroids; the best
assignment (based on Sum of Squared Distances) is used to initialize further
k-Means iterations.
k-Means Display
Use these options to display various output for k-Means Clustering. Output
options under Fitting apply to the fitting of the model. Output options under
Training are related to the application of the fitted model to the training
partition.
o Cluster Centers
The Cluster Centers table displays detailed information about
the clusters formed by the k-Means Clustering algorithm.
This table contains the coordinates of the cluster centers found
by the algorithm.
Accessible by clicking the Clusters Centers link on the Output
Navigator.
Example of Cluster Centers output
o Inter-Cluster Distances
This table displays the distance between each cluster center.
Accessible by clicking the Inter-Cluster Distances link on the
Output Navigator.
o Fitting Metrics
Fitting Metrics: This table lists several metrics computed
from the training dataset and is accessible by clicking the
Fitting Metrics: Training link on the Output Navigator.
(Recall that our training dataset is the full dataset since we did
not partition into training, validation or test partitions.)
• Avg Within-Cluster Distance – Average total
distance from each record to the corresponding
cluster center, for each cluster.
• Within Cluster SS – Sum of squared distances between
the records and the corresponding cluster centers, for
each cluster. This statistic measures cluster
compactness. (The algorithm is trying to minimize
this measure.)
• Between Cluster SS – Sum of squared distances between
the cluster centers and the total sample mean, weighted
by the number of records within each cluster. (Between
Cluster SS measures cluster separation, which the
algorithm is trying to maximize. Since the total sum of
squares is fixed, this is equivalent to minimizing
Within Cluster SS.)
o Record-Cluster Distances
Appends the distance of each record to each cluster in the
Clusters: Training output table. Records are assigned to the
"closest" cluster, i.e. the one with the nearest centroid. For
example, in the 4th record, the final cluster assignment is 4
since the distance to that cluster centroid is the closest
(119.34).
Example of Record-Cluster Distances output
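The fitting metrics described above can be computed directly from the cluster assignments and centroids. The sketch below uses common textbook definitions (sum of squared distances to the centroid for Within Cluster SS; size-weighted squared distances from centroids to the grand mean for Between Cluster SS); Analytic Solver's exact formulas may differ slightly.

```python
# Compute Avg Within-Cluster Distance, Within Cluster SS, and Between
# Cluster SS from clusters (lists of coordinate tuples) and their centroids.
import math

def fitting_metrics(clusters, centroids):
    all_recs = [r for cl in clusters for r in cl]
    grand_mean = tuple(sum(d) / len(all_recs) for d in zip(*all_recs))
    avg_within, within_ss, between_ss = [], 0.0, 0.0
    for cl, c in zip(clusters, centroids):
        dists = [math.dist(r, c) for r in cl]
        avg_within.append(sum(dists) / len(cl))        # Avg Within-Cluster Distance
        within_ss += sum(d * d for d in dists)         # Within Cluster SS
        between_ss += len(cl) * math.dist(c, grand_mean) ** 2  # Between Cluster SS
    return avg_within, within_ss, between_ss
```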
Scoring Tab
Select the desired partition to apply k-Means Clustering, if the partition exists.
If Validation and/or Testing partitions do not exist, then these two options will
be disabled.
Introduction
Cluster Analysis, also called data segmentation, has a variety of goals. All
relate to grouping or segmenting a collection of objects (also called
observations, individuals, cases, or data rows) into subsets or "clusters", such
that those within each cluster are more closely related to one another than
objects assigned to different clusters. Central to cluster analysis is the notion
of the degree of similarity (or dissimilarity) between the individual objects being
clustered. There are two major methods of clustering -- hierarchical clustering
and k-means clustering. (See the k-Means Clustering chapter for information on
that type of clustering analysis.)
In hierarchical clustering the data are not partitioned into a particular cluster in
a single step. Instead, a series of partitions takes place, which may run from a
single cluster containing all objects to n clusters each containing a single object.
Hierarchical Clustering is subdivided into agglomerative methods, which
proceed by a series of fusions of the n objects into groups, and divisive
methods, which separate n objects successively into finer groupings. The
hierarchical clustering technique employed by Analytic Solver Data Science is
an Agglomerative technique. Hierarchical clustering may be represented by a
two dimensional diagram known as a dendrogram which illustrates the fusions
or divisions made at each successive stage of analysis. An example of such a
dendrogram is given below:
Agglomerative methods
An agglomerative hierarchical clustering procedure produces a series of
partitions of the data, Pn, Pn-1, …, P1. The first, Pn, consists of n single-object
'clusters'; the last, P1, consists of a single group containing all n cases.
At each particular stage the method joins the two clusters which are closest
together (most similar). (At the first stage, this amounts to joining together the
two objects that are closest together, since at the initial stage each cluster has
one object.)
Differences between methods arise because of the different methods of defining
distance (or similarity) between clusters. Several agglomerative techniques are
featured in Analytic Solver’s Hierarchical Clustering. See the description of
each occurring further down in this chapter.
An economist analyzing this data might first begin her analysis by building a
detailed cost model of the various utilities. However, to save a considerable
amount of time and effort, she could instead cluster similar types of utilities,
build a detailed cost model for just one “typical” utility in each cluster, then
from there, scale up from these models to estimate results for all utilities. This
example will do just that.
Click Cluster -- Hierarchical Clustering to bring up the Hierarchical
Clustering dialog.
Analytic Solver Data Science will create four clusters using the group average
linkage method. The output HC_Output, HC_Clusters and HC_Dendrogram
are inserted to the right of the Data worksheet.
HC_Output Worksheet
The top portion of the output simply displays the options selected on the
Hierarchical Clustering dialog tabs.
HC_Dendrogram Output
Click the HC_Dendrogram worksheet tab to view the clustering dendrogram. A
dendrogram is a diagram that illustrates the hierarchical association between the
clusters.
The Sub Cluster IDs are listed along the x-axis (in an order convenient for
showing the cluster structure). The y-axis measures inter-cluster distance.
Consider Cluster IDs 3 and 8 -- they have an inter-cluster distance of 2.753.
(Hover over the horizontal connecting line to see the Between-Cluster Distance.)
No other cases have a smaller inter-cluster distance, so 3 and 8 are joined into
one cluster, indicated by the horizontal line linking them.
Next, we see that cases 1 and 5 have the next smallest inter-cluster distance, so
they are joined into a 2nd cluster.
The next smallest inter-cluster distance is between the newly formed 3/8 and 1/5
clusters. This process repeats until all subclusters have been formed into 1
cluster.
If we draw a horizontal line through the diagram at any level on the y-axis (the
distance measure), the vertical cluster lines that intersect the horizontal line
indicate clusters whose members are at least that close to each other. If we draw
a horizontal line at distance = 3.8, for example, we see that there are 4 clusters.
HC_Output Worksheet
As in the example above, the top of the HC_Output worksheet is the Inputs
portion, which displays the choices selected on both tabs of the Hierarchical
Clustering dialog.
Scroll down to the Clustering Stages table. As discussed above, this table
details the history of the cluster formation. At the beginning, each individual
case was considered its own cluster, # clusters = # cases. At stage 1, below,
clusters (i.e. cases) 4 and 10 were found to be closer together than any other two
clusters (i.e. cases), so 4 absorbed 10. At stage 2, clusters 4 and 15 are found to
be closer together than any other two clusters, so 4 absorbed 15. At this point
there is one cluster with three cases (cases 4, 10 and 15), and 20 additional
clusters that still have just one case in each. This process continues until there is
just one cluster at stage 21.
Data Source
Worksheet: Click the down arrow to select the desired worksheet where the
dataset is contained.
Variables
First Row Contains Headers: Select this checkbox if the first row in the
dataset contains column headings.
Variables: This field contains the list of the variables, or features, included in
the data range.
Selected Variables: This field contains the list of variables, or features, to be
included in Hierarchical Clustering.
• To include a variable in Hierarchical Clustering, select the variable in
the Variables list, then click > to move the variable to the Selected
Variables list.
• To remove a variable as a selected variable, click the variable in the
Selected Variables list, then click < to move the variable back to the
Variables list.
Data Type
The Hierarchical clustering method can be used on raw data as well as the data
in Distance Matrix format. Choose the appropriate option to fit your dataset. If
Raw Data is chosen, Analytic Solver Data Science computes the similarity
matrix before clustering.
Dissimilarity Measures
Hierarchical clustering uses the Euclidean Distance as the dissimilarity measure
when working on raw numeric data.
When the data is binary, the remaining two options, Jaccard's coefficients and
Matching coefficient are enabled.
Suppose we have binary values for all the xij ’s. See the table below for
individual i’s and j’s.
Clustering Method
Single linkage clustering
One of the simplest agglomerative hierarchical clustering methods is single
linkage, also known as the nearest neighbor technique. The defining feature of
this method is that distance between groups is defined as the distance between
the closest pair of objects, where only pairs consisting of one object from each
group are considered.
In the single linkage method, D(r,s) is computed as
D(r,s) = Min { d(i,j) : where object i is in cluster r and object j is in cluster s }
Here the distance between every possible object pair (i,j) is computed, where
object i is in cluster r and object j is in cluster s. The minimum value of these
distances is said to be the distance between clusters r and s. In other words, the
distance between two clusters is given by the value of the shortest link between
the clusters.
At each stage of hierarchical clustering, the clusters r and s, for which D(r,s) is
minimum, are merged.
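The single-linkage rule above translates directly into code. The sketch below is illustrative only (clusters are represented as lists of coordinate tuples):

```python
# Single-linkage distance: the smallest pairwise distance between an object
# in cluster r and an object in cluster s.
import math

def single_linkage(cluster_r, cluster_s):
    return min(math.dist(i, j) for i in cluster_r for j in cluster_s)
```

At each agglomerative stage, the pair of clusters with the smallest `single_linkage` value would be merged.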
McQuitty's Method
When this procedure is selected, at each step, when two clusters are to be joined,
the distance from the new cluster to an existing cluster is computed as the average
of the distances from each of the two joining clusters to that existing cluster.
Median Method
The Median Method also uses averaging when calculating the distance between
two records or observations. However, this method uses the median instead of
the mean.
Draw Dendrogram
Select this option to have Analytic Solver Data Science create a dendrogram to
illustrate the clustering process.
Number of Clusters
Recall that the agglomerative method of hierarchical clustering continues to
form clusters until only one cluster is left. This option lets you stop the process
at a given number of clusters.
Introduction
Text mining is the automated analysis of a single document or a collection of
documents (a corpus) to extract non-trivial information from it. Text mining
typically involves transforming unstructured textual data into a structured
representation by analyzing the patterns derived from the text. The results can
be analyzed to discover interesting knowledge, some of
which would only be found by a human carefully reading and analyzing the text.
Typical widely-used tasks of Text Mining include but are not limited to
Automatic Text Classification/Categorization, Topic Extraction, Concept
Extraction, Documents/Terms Clustering, Sentiment Analysis, Frequency-based
Analysis and many more. Some of these tasks could not be completed by a
human, which makes Text Mining a particularly useful and applicable tool in
modern Data Science. Analytic Solver Data Science takes an integrated
approach to text mining as it does not totally separate analysis of unstructured
data from traditional data science techniques applicable for structured
information. While Analytic Solver Data Science is a very powerful tool for
analyzing text only, it also offers automated treatment of mixed data, i.e.
combination of multiple unstructured and structured fields. This is a particularly
useful feature that has many real-world applications, such as analyzing
maintenance reports, evaluation forms, insurance claims, etc. Analytic Solver
Data Science uses the “bag of words” model – the simplified representation of
text, where the precise grammatical structure of text and exact word order is
disregarded. Instead, syntactic, frequency-based information is preserved and is
used for text representation. Although such assumptions might be harmful for
some specific applications of Natural Language Processing (NLP), it has been
proven to work very well for applications such as Text Categorization, Concept
Extraction and others, which are the particular areas addressed by Analytic
Solver Data Science’s Text Mining capabilities. It has been shown in many
theoretical/empirical studies that syntactic similarity often implies semantic
similarity. One way to access syntactic relationships is to represent text in terms
of the Generalized Vector Space Model (GVSM). The advantage of such a
representation is a meaningful mapping of text to a numeric space; the
disadvantage is that some semantic elements, e.g. the order of words, are lost
(recall the bag-of-words assumption).
Input to Text Miner (the Text Mining tool within Analytic Solver Data Science)
could be of two main types – a few relatively large documents (e.g. several books)
or a relatively large number of smaller documents (e.g. a collection of emails, news
articles, product reviews, comments, tweets, Facebook posts, etc.). While
Analytic Solver Data Science is capable of analyzing large text documents, it is
particularly effective for large corpuses of relatively small documents.
Obviously, this functionality has a limitless number of applications – for instance,
email spam detection, topic extraction in articles, automatic rerouting of
correspondence, sentiment analysis of product reviews and many more.
The input for text mining is a dataset on a worksheet, with at least one column
that contains free-form text (or file paths to documents in a file system
containing free-form text), and, optionally, other columns that contain traditional
The selected file paths are now in random order, but we will need to categorize
the “Autos” and “Electronics” files in order to be able to identify them later. To
do this, we’ll use Excel to sort the rows by the file path: Select columns C
through D and rows 23 through 323, then choose Sort from the Data tab. In the
Sort dialog, select column D, where the file paths are located, and click OK.
On the Data Science Platform Ribbon tab, click the Text icon to bring up the
Text Miner dialog. Select TextVar in the Variables list box, and click the upper
> button to move it to the Selected Text Variables list box. By doing so, we are
selecting the text in the documents as input to the Text Miner model. Ensure that
“Text variables contain file paths” is checked.
Click the Next button, or click the Pre-Processing tab at the top.
Leave the default setting for Analyze all terms selected under Mode. When this
option is selected, Analytic Solver Data Science will examine all terms in the
document. A “term” is defined as an individual entity in the text, which may or
may not be an English word. A term can be a word, number, email, URL, etc.
Terms are separated by all possible delimiting characters (i.e. \, ?, ', `, ~, |, \r,
\n, \t, :, !, @, #, $, %, ^, &, *, (, ), [, ], {, }, <>, _, ;, =, -, +, \) with some
exceptions related to stopwords, synonyms, exclusion terms and boilerplate.
Note: Exceptions are related not to how terms are separated but as to whether
they are split based on the delimiter. For example: URL's contain many
characters such as "/", ";", etc. Text Miner will not tokenize on these characters
in the URL but will consider the URL as a whole and will remove the URL if
selected for removal. (See below for more information.)
If Analyze specified terms only is selected, the Edit Terms button will be
enabled. If you click this button, the Edit Exclusive Terms dialog opens. Here
you can add and remove terms to be considered for text mining. All other terms
will be disregarded. For example, if we wanted to mine each document for a
specific part name such as “alternator” we would click Add Term on the Edit
Exclusive Terms dialog, then replace “New term” with “alternator” and click
Done to return to the Pre-Processing dialog. During the text mining process,
Analytic Solver Data Science would analyze each document for the term
“alternator”, excluding all other terms.
Leave both Start term/phrase and End term/phrase empty under Text
Location. If this option is used, text appearing before the first occurrence of the
Start Phrase will be disregarded and similarly, text appearing after End Phrase
(if used) will be disregarded. For example, if text mining the transcripts from a
Live Chat service, you would not be particularly interested in any text appearing
before the heading “Chat Transcript” or after the heading “End of Chat
Transcript”. Thus you would enter “Chat Transcript” into the Start Phrase field
and “End of Chat Transcript” into the End Phrase field.
Leave the default setting for Stopword removal. Click Edit to view a list of
commonly used words that will be removed from the documents during pre-
processing. To remove a word from the Stopword list, simply highlight the
desired word, then click Remove Stopword. To add a new word to the list,
click Add Stopword; a new term, “stopword”, will be added. Double-click the
term to edit it.
Analytic Solver Data Science also allows additional stopwords to be added or
existing ones to be removed via a text document (*.txt) by using the Browse button to
navigate to the file. Terms in the text document can be separated by a space, a
comma, or both. If we were supplying our three terms in a text document, rather
than in the Edit Stopwords dialog, the terms could be listed as: subject
emailterm from or subject,emailterm,from or subject, emailterm, from. If we
had a large list of additional stopwords, this would be the preferred way to enter
the terms.
We can take the email issue one step further and completely remove the term
“emailtoken” from the collection. Click Add Exclusion Term and edit
“exclusionterm” to “emailtoken”.
To remove a term from the exclusion list, highlight the term and click Remove
Exclusion Term.
We could have also entered these terms into a text document (*.txt) and added
the terms all at once by using the Browse button to navigate to the file and
import the list. Terms in the text document can be separated by a space, a
comma, or both. If, for example we were supplying excluded terms in a
document rather than in the Edit Exclusion List dialog, we would enter the terms
as: subject emailtoken from, or subject,emailtoken,from, or subject, emailtoken,
from. If we had a large list of terms to be excluded, this would be the preferred
way to enter the terms.
Analytic Solver Data Science also allows the combining of synonyms and full
phrases by clicking Advanced within Vocabulary Reduction. Select Synonym
reduction at the top of the dialog to replace synonyms such as
“car”, “automobile”, “convertible”, “vehicle”, “sedan”, “coupe”,
“subcompact”, and “jeep” with “auto”. Click Add Synonym and replace
“rootterm” with “auto” then replace “synonym list” with “car, automobile,
convertible, vehicle, sedan, coupe, subcompact, jeep” (without the quotes). During pre-
processing, Analytic Solver Data Science will replace the terms “car”,
“automobile”, “convertible”, “vehicle”, “sedan”, “coupe”, “subcompact” and
“jeep” with the term “auto”. To remove a synonym from the list, highlight the
term and click Remove Synonym.
Analytic Solver Data Science also allows the combining of words into phrases
that indicate a singular meaning such as “station wagon” which refers to a
specific type of car rather than two distinct tokens – station and wagon. To add
a phrase in the Vocabulary Reduction – Advanced dialog, select Phrase
reduction and click Add Phrase. The term “phrasetoken” will appear; click
to edit and enter “wagon”. Click “phrase” to edit and enter “station wagon”. If
supplying phrases through a text file (*.txt), each line of the file must be of the
form phrasetoken:phrase or using our example, wagon:station wagon. If we had
a large list of phrases, this would be the preferred way to enter the terms.
Enter 200 for Maximum Vocabulary Size. Analytic Solver Data Science will
reduce the number of terms in the final vocabulary to the top 200 most
frequently occurring in the collection.
Leave the default selection for Normalize case. When this option is checked,
Analytic Solver Data Science converts all text to a consistent (lower) case, so
that Term, term, TERM, etc. are all normalized to a single token “term” before
any processing, rather than creating three independent tokens with different
case. Without case normalization, these variants are counted as separate tokens,
which can dramatically distort the frequency distributions of the corpus and lead
to biased results.
Enter 3 for Remove terms occurring in less than _% of documents and 97 for
Remove terms occurring in more than _% of documents. For many text mining
applications, the goal is to identify terms that are useful for discriminating
between documents. If a particular term occurs in all or almost all documents, it
may not be possible to highlight the differences. If a term occurs in very few
documents, it will often indicate great specificity of this term, which is not very
useful for some Text Mining purposes.
Enter 20 for Maximum term length. Terms that contain more than 20 characters
will be excluded from the text mining analysis and will not be present in the
final reports. This option can be extremely useful for removing some parts of
text which are not actual English words, for example, URLs or computer-
generated tokens, or to exclude very rare terms such as Latin species or disease
names, e.g. Pneumonoultramicroscopicsilicovolcanoconiosis.
It’s also possible to create your own scheme by clicking the Advanced command
button to open the Term Document Matrix – Advanced dialog. Here users can
select their own choices for local weighting, global weighting, and
normalization. Please see the table below for definitions regarding options for
Term Frequency, Document Frequency and Normalization.
Notations:
• tf_td – frequency of term t in a document d;
• df_t – document frequency of term t;
• lw_td – local weighting of term t in a document d;
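Using this notation, one common combination of these choices is TF-IDF: a logarithmic local weighting multiplied by the inverse document frequency as the global weighting. The sketch below is illustrative only (the dialog's other weighting and normalization options follow the same pattern of local weight times global weight):

```python
# Build a TF-IDF weighted term-document matrix from tokenized documents.
# lw_td = 1 + log(tf_td); global weight = log(n / df_t).
import math
from collections import Counter

def tfidf_matrix(docs):  # docs: list of token lists
    n = len(docs)
    tf = [Counter(d) for d in docs]                       # tf_td per document
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for c in tf if t in c) for t in vocab} # df_t
    return [
        [(1 + math.log(c[t])) * math.log(n / df[t]) if c[t] else 0.0
         for t in vocab]
        for c in tf
    ], vocab
```

Note that a term appearing in every document gets weight 0 — exactly the behavior motivating the "remove terms occurring in more than _% of documents" option above.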
Leave Perform latent semantic indexing selected (the default). When this option
is selected, Analytic Solver Data Science will use Latent Semantic Indexing
(LSI) to detect patterns in the associations between terms and concepts to
discover the meaning of the document.
Select Maximum number of concepts and increment the counter to 20. Doing
so will tell Analytic Solver Data Science to retain the 20 most
significant concepts. If Automatic is selected, Analytic Solver Data Science will
calculate the importance of each concept, take the difference between each and
report any concepts above the largest difference. For example if three concepts
were identified (Concept1, Concept2, and Concept3) and given importance
factors of 10, 8, and 2, respectively, Analytic Solver Data Science would keep
Concept1 and Concept2 since the difference between Concept2 and Concept 3
(8-2=6) is larger than the difference between Concept1 and Concept2 (10-
8=2). If Minimum percentage explained is selected, Analytic Solver Data
Science will identify the concepts with singular values that, when taken
together, sum to the minimum percentage explained, 90% is the default.
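The Automatic rule described above (keep the concepts that come before the largest drop between consecutive importance values) can be sketched as follows; this is an illustration of the described logic, not Analytic Solver's code:

```python
# Keep the concepts preceding the largest gap between consecutive
# importance (singular) values, assumed sorted largest-first.
def auto_concepts(singular_values):
    drops = [singular_values[i] - singular_values[i + 1]
             for i in range(len(singular_values) - 1)]
    cut = drops.index(max(drops)) + 1  # everything before the biggest gap
    return singular_values[:cut]
```

For the 10, 8, 2 example in the text, the gap of 6 between the second and third values exceeds the gap of 2 between the first and second, so the first two concepts are kept.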
Keep Term frequency table selected (the default) under Preprocessing Summary
and select Zipf’s plot. Increase the Most frequent terms to 20 and select
Maximum corresponding documents. The Term frequency table will include
the top 20 most frequently occurring terms. The first column, Collection
Frequency, displays the number of times the term appears in the collection. The
2nd column, Document Frequency, displays the number of documents that
include the term. The third column, Top Documents, displays the top 5
documents where the corresponding term appears the most frequently. The Zipf
Plot graphs the document frequency against the term ranks, in descending order
of frequency. Zipf’s law states that the frequency of terms used in free-form
text drops off sharply, following a power law, i.e. people tend to use a relatively
small number of words extremely frequently and a large number of words very rarely.
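A rank/frequency tabulation of the kind the Zipf Plot visualizes can be produced in a couple of lines (illustrative sketch; the token list here is hypothetical):

```python
# Term frequencies ordered from most to least frequent (rank 1, 2, ...).
from collections import Counter

def rank_frequencies(tokens):
    return sorted(Counter(tokens).values(), reverse=True)
```

Plotting these values against their rank (typically on log-log axes) shows the steep drop Zipf's law predicts.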
Keep Show documents summary selected and check Keep a short excerpt under
Documents. Analytic Solver Data Science will produce a table displaying the
document ID, length of the document, number of terms and 20 characters of the
text of the document.
Click the Finish button to run the Text Mining analysis. Result worksheets are
inserted to the right.
Select the TM_Output tab. The Term Count table shows that the original term
count in the documents was reduced by 16.02% by the removal of stopwords
and excluded terms, synonym and phrase replacement, and other specified
preprocessing procedures.
Scroll down to the Documents table. This table lists each document with its
length and number of terms; if Keep a short excerpt is selected on the Output
Options tab and a value is present for Number of characters, an excerpt from
each document is also displayed.
Click the TM_Vocabulary tab to view the Final List of Terms table. This table
contains the top 20 terms occurring in the document collection, the number of
documents that include the term and the top 5 document IDs where the
corresponding term appears most frequently. In this list we see terms such as
“car”, “power”, “engine”, “drive”, and “dealer” which suggests that many of the
documents, even the documents from the electronic newsgroup, were related to
autos.
When you click on the TM_Vocabulary tab, the Zipf Plot opens. We see that
our collection of documents obeys the power law stated by Zipf (see above): as
we move from left to right on the graph (terms ranked from most frequent to
least frequent), the document frequencies drop quite steeply. Hover over each
data point to see detailed information about the term corresponding to that
point.
Note: To view these charts in the Cloud app, click the Charts icon on the
Ribbon, select the desired worksheet, in this case TM_Vocabulary, for
Worksheet and the desired chart for Chart.
Click the TM_LSASummary tab to view the Concept Importance and Term
Importance tables. The first table, the Concept Importance table, lists each
concept, its singular value, the cumulative singular value, and the % singular
value explained. (The number of concepts extracted is the minimum of the
number of documents (985) and the number of terms (limited to 200).) These
values are used to determine which concepts should be used in the Concept –
Document Matrix, the Concept – Term Matrix, and the Scree Plot, according to
the user’s selection on the Representation tab. In this example, we entered “20”
for Maximum number of concepts.
When you click the TM_LSASummary tab, the Scree Plot opens. This plot
gives a graphical representation of the contribution or importance of each
concept. The largest “drop” or “elbow” in the plot appears between the 1st and
2nd concept. This suggests that the first top concept explains the leading topic in
our collection of documents. Any remaining concepts have significantly
reduced importance. However, we can always select more than 1 concept to
increase the accuracy of the analysis – it is advised to examine the Concept
Importance table and the “Cumulative Singular Value” in particular to identify
how many top concepts capture enough information for your application.
You can examine all extracted concepts by changing the axes on the scatter
plot: click the down-pointing arrow next to Concept 1 to change the concept on
the X axis, or the right-pointing arrow next to Concept 2 to change the concept
on the Y axis. Use your touchscreen or your mouse scroll wheel to zoom in and
out.
Double click TM_LSA_CTM to display the Concept – Term Matrix which lists
the top 20 most important concepts along the top of the matrix and the top 200
most frequently appearing terms down the side of the matrix.
When you click on the TM_LSA_CTM tab, the Term-Concept Scatter Plot
opens. This graph is a visual representation of the Concept – Term Matrix: it
displays all terms from the final vocabulary in terms of two concepts. To
examine a different pair of concepts, click the down-pointing arrow next to
Concept 1 and the right-pointing arrow next to Concept 2 to change the
concepts on either axis. Use your touchscreen or mouse wheel to zoom in or
out.
The TFIDF_Stored and LSA_Stored output sheets are used to process new
documents using an existing text mining model. See the section below,
Processing New Documents Based on an Existing Text Mining Model, to find
out how to score new text documents using an existing Text Mining model.
Note: When adding additional documents to an existing text mining model,
Analytic Solver Data Science will not extract new terms or phrases from these
new documents. Rather, Analytic Solver Data Science will first use the
vocabulary from the model to build a Term-Document Matrix and then, if
requested, will use transformation matrices to map documents in the new data
onto the existing semantic space extracted from the “base” model. Please see
below for an example explaining how to add additional documents to an existing
Text Mining model.
From here, we can use any of the six classification algorithms to classify our
documents according to some term or concept using the Term – Document
Matrix.
Click OK. FileSampling1 is inserted into the Solver Task Pane. We will again
sort the documents by type (electronic or auto) using Microsoft Excel’s Sort
functionality (on the Data menu).
To extract concepts for new data based on the LSA model (i.e., to produce the
CDM, or Concept-Document Matrix), we will score the term-document matrix.
Click Score on the Data Science ribbon to bring up the Select New Data Sheet
& Stored Model Sheet dialog.
Select LSA_Stored for Worksheet under Stored Model. Select Match By Name
to match the terms from the Stored Model sheet (LSA_Stored) with the terms
from the term document matrix.
Click OK to score the term document matrix. The output is the Concept-
Document matrix.
Variables
Variables contained in this listbox are text variables included within a dataset
with at least one column that contains free-form text (or file paths to documents
containing free-form text), and optionally other columns that contain traditional
structured data.
Match Selected
Select one text variable from the Selected Text Variables and Model Text
Variables listbox, then select Match Selected to manually map variables from
the dataset to the existing model. The match will appear under Model Text
Variables.
Unmatch Selected
Select a set of matched variables under Model Text Variables and click
Unmatch Selected to unmatch the pair.
Unmatch All
Click Unmatch All to unmatch all previously matched variables under Model
Text Variables.
Match By Name
Click Match By Name to match all variables in the Selected Text Variables
listbox with variables of the same name in the Model Text Variables listbox.
Match Sequentially
Click Match Sequentially to match all variables, in order as listed, in the
Selected Text Variables listbox with variables, in order as listed, in the Model
Text Variables listbox.
Start term/phrase
If this option is used, text appearing before the first occurrence of the Start
Phrase will be disregarded and similarly, text appearing after End Phrase (if
used) will be disregarded. For example, if text mining the transcripts from a
Live Chat service, you would not be particularly interested in any text appearing
before the heading “Chat Transcript” or after the heading “End of Chat
Transcript”. Thus you would enter “Chat Transcript” into the Start Phrase field
and “End of Chat Transcript” into the End Phrase field.
Stopword removal
If selected (the default), over 300 commonly used words/terms (such as a, to,
the, and, etc.) will be removed from the document collection during
preprocessing. Click the Edit command button to view the list of terms. To
remove a word from the Stopword list, simply highlight the desired word, then
click Remove Stopword. To add a new word to the list, click Add Stopword; a
new term, “stopword”, will be added. Double-click the term to edit it.
Analytic Solver Data Science also allows additional stopwords to be added, or
existing ones to be removed, via a text document (*.txt) by using the Browse button to
navigate to the file. Terms in the text document can be separated by a space, a
comma, or both. If we were supplying our three terms in a text document, rather
than in the Edit Stopwords dialog, the terms could be listed as: subject
emailterm from or subject,emailterm,from or subject, emailterm, from. If we
had a large list of additional stopwords, this would be the preferred way to enter
the terms.
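The file format described above is simple to parse. A minimal sketch of that rule (the parsing logic is our illustration; Analytic Solver’s internal parser is not exposed):

```python
import re

# Hypothetical contents of an additional-stopwords *.txt file; terms may be
# separated by spaces, commas, or both.
file_text = "subject emailterm,from"

# Split on any run of commas and/or whitespace, dropping empty entries.
stopwords = [t for t in re.split(r"[,\s]+", file_text.strip()) if t]
print(stopwords)  # → ['subject', 'emailterm', 'from']
```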
Click Done to close the Edit Stopwords dialog and return to the Pre-Processing
tab.
Synonym Reduction
Select Synonym reduction at the top of the dialog to replace synonyms such as
“car”, “automobile”, “convertible”, “vehicle”, “sedan”, “coupe”, “subcompact”,
and “jeep” with “auto”. Click Add Synonym and replace “rootterm” with the
term to be substituted (“auto”), then replace “synonym list” with the list of
synonyms, i.e., “car, automobile, convertible, vehicle, sedan, coupe,
subcompact, jeep”. During pre-processing, Analytic Solver Data Science will
replace the terms “car”, “automobile”, “convertible”, “vehicle”, “sedan”,
“coupe”, “subcompact”, and “jeep” with the term “auto”.
Phrase Reduction
Analytic Solver Data Science also allows the combining of words into phrases
that indicate a singular meaning such as “station wagon” which refers to a
specific type of car rather than two distinct tokens – station and wagon. To add
a phrase in the Vocabulary Reduction – Advanced dialog, select Phrase
reduction and click Add Phrase. The term “phrasetoken” will appear. Click it
to edit and enter the term that will replace the phrase, i.e., “wagon”. Click
“phrase” to edit and enter the phrase that will be substituted, i.e., “station
wagon”. If
supplying phrases through a text file (*.txt), each line of the file must be of the
form phrasetoken:phrase or using our example, wagon:station wagon. If we had
a large list of phrases, this would be the preferred way to enter the terms.
Perform stemming
Stemming is the practice of stripping words down to their “stems” or “roots”.
For example, stemming terms such as “argue”, “argued”, “argues”, “arguing”,
and “argus” would result in the stem “argu”, while “argument” and
“arguments” would stem to “argument”. The stemming algorithm utilized in
Analytic Solver Data Science is “smart” in the sense that while “running”
would be stemmed to “run”, “runner” would not. Analytic Solver Data
Science uses the Porter
Stemmer 2 algorithm for the English Language. For more information on this
algorithm, please see the Webpage: http://tartarus.org/martin/PorterStemmer/
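The suffix-stripping idea can be illustrated with a toy Python function. This is not the Porter Stemmer 2 algorithm (which applies many more rules); it merely mimics its behavior on the “argue” examples above:

```python
def toy_stem(word, min_stem=3):
    """Crude suffix stripping -- a toy illustration, not Porter Stemmer 2."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        # Strip the first matching suffix, but never below min_stem characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

terms = ["argue", "argued", "argues", "arguing", "argus", "arguments"]
print([toy_stem(t) for t in terms])
# → ['argu', 'argu', 'argu', 'argu', 'argu', 'argument']
```

The real Porter 2 stemmer handles far more English morphology, including the “running”/“runner” distinction noted above.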
Normalize case
When this option is checked, Analytic Solver Data Science converts all text to a
consistent (lower) case, so that Term, term, TERM, etc. are all normalized to a
single token “term” before any processing, rather than being treated as three
independent tokens with different case. Leaving case unnormalized can
dramatically distort the frequency distributions of the corpus, leading to biased
results.
Normalize URL’s
If selected, URLs appearing in the document collection will be replaced with the
term, “urltoken”. URLs do not normally add any meaning, but it is sometimes
interesting to know how many URLs are included in a document. This option is
not selected by default.
Notations:
• tf_td – frequency of term t in a document d;
• df_t – document frequency of term t;
• lw_td – local weighting of term t in a document d;
• gw_td – global weighting of term t in a document d;
• n_d – normalization of the vector of terms representing the document d;
• N – total number of documents in the collection;
• cf_t – collection frequency of term t;
• p_td – estimated probability of term t appearing in a document d
  (p_td = tf_td / cf_t);
• g̅_d – vector of terms representing the document d.
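The counts behind these notations can be sketched in a few lines of Python (the three tokenized documents below are hypothetical):

```python
from collections import Counter

# Hypothetical toy collection of tokenized documents.
docs = [["car", "engine", "car"], ["engine", "power"], ["car", "dealer"]]
N = len(docs)                                   # total number of documents

tf = [Counter(d) for d in docs]                 # tf_td: term frequency per document
cf = Counter(t for d in docs for t in d)        # cf_t: collection frequency
df = Counter(t for d in docs for t in set(d))   # df_t: document frequency

# p_td = tf_td / cf_t: estimated probability of term t appearing in document d.
p = [{t: freq / cf[t] for t, freq in doc_tf.items()} for doc_tf in tf]

print(cf["car"], df["car"], p[0]["car"])  # → 3 2 0.6666666666666666
```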
Concept-Document Matrix
The Concept – Document Matrix is enabled when Perform latent semantic
indexing is selected on the Representation tab. The most important concepts
will be listed across the top of the matrix and the documents will be listed down
the left side of the matrix. The number of concepts is controlled by the setting
for Concept Extraction – Latent Semantic indexing on the Representation tab:
Automatic, Maximum number of concepts, or Minimum percentage explained. If
a concept appears in a document, the singular value decomposition weight is
placed in the corresponding column indicating the importance of the concept in
the document. If Perform latent semantic indexing is selected, this option will
also be selected by default.
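The construction of a Concept – Document Matrix via singular value decomposition can be sketched with NumPy. The tiny term-document matrix below is hypothetical, and Analytic Solver’s exact weighting and scaling may differ:

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

# SVD: A = U @ diag(s) @ Vt; the singular values s rank concept importance.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                      # keep the top-k concepts
concept_doc = np.diag(s[:k]) @ Vt[:k, :]   # Concept-Document Matrix (k x docs)
concept_term = U[:, :k]                    # Concept-Term loadings (terms x k)

# Cumulative share of singular value explained -- the quantity behind the
# "Minimum percentage explained" option.
explained = np.cumsum(s) / s.sum()
```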
Full vocabulary
This option is enabled only when Term frequency table is selected. If selected,
the full vocabulary list will be displayed in the term frequency table.
Zipf’s plot
The Zipf Plot graphs the document frequency against the term ranks (terms
ranked in descending order of frequency). Typically the terms in a document
collection follow Zipf’s law, which states that the frequency of terms used in
free-form text drops exponentially. In other “words” (pun intended), when we
speak we tend to use a few words a lot but most words very rarely. Hover over
each point in the plot to see the most frequently occurring terms in the
document collection. This option is not selected by default.
Scree Plot
This plot gives a graphical representation of the contribution or importance of
each concept according to the setting for Maximum number of concepts. Find
the largest “drop” or “elbow” in the plot to discover the leading topics in the
document collection. When moving from left to right on the x-axis, the
importance of each concept will diminish. This information may be used to
limit the number of concepts (as variables) used as inputs into a classification
model. This option is not selected by default.
Concept importance
This table displays the total number of concepts extracted, the Singular Value
for each, the Cumulative Singular Value and the % of Singular Value explained
which is used when Minimum percentage explained is selected for Concept
Extraction – Latent Semantic Indexing on the Representation tab. This option is
not selected by default.
Term Importance
This table displays each term along with its Importance as calculated by singular
value decomposition. This option is not selected by default.
Introduction
Time series datasets contain a set of observations generated sequentially in time.
Organizations of all types and sizes utilize time series datasets for analysis and
forecasting: predicting next year’s sales figures, raw material demand, monthly
airline bookings, etc.
Autocorrelation (ACF)
Autocorrelation (ACF) is the correlation between neighboring observations in a
time series. When determining if an autocorrelation exists, the original time
series is compared to the “lagged” series. This lagged series is simply the
original series moved one time period forward (x_n vs. x_n+1). Suppose there are 5
time based observations: 10, 20, 30, 40, and 50. When lag = 1, the original
series is moved forward one time period. When lag = 2, the original series is
moved forward two time periods.
r_k = [ Σ_{t=k+1..n} (Y_t − Ȳ)(Y_{t−k} − Ȳ) ] / [ Σ_{t=1..n} (Y_t − Ȳ)² ],
where k = 0, 1, 2, …, n
where Y_t is the Observed Value at time t, Ȳ is the mean of the Observed
Values, and Y_{t−k} is the value at Lag k.
For example, using the values above, the autocorrelation for Lag-1 and Lag - 2
can be calculated as follows.
Ȳ = (10 + 20 + 30 + 40 + 50) / 5 = 30
r1 = ((20 – 30) * (10 – 30) + (30 – 30) * (20 – 30) + (40 – 30) * (30 – 30) + (50 –
30) * (40 – 30)) / ((10 – 30)² + (20 – 30)² + (30 – 30)² + (40 – 30)² + (50 – 30)²)
= 0.4
r2 = ((30 – 30) * (10 – 30) + (40 – 30) * (20 – 30) + (50 – 30) * (30 – 30)) / ((10
– 30)² + (20 – 30)² + (30 – 30)² + (40 – 30)² + (50 – 30)²) = −0.1
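The hand calculation above can be reproduced with a short Python function implementing the same formula:

```python
def acf(y, k):
    """Autocorrelation of series y at lag k, per the formula above."""
    n = len(y)
    ybar = sum(y) / n
    # Numerator: products of deviations for the series and its lag-k shift.
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n))
    # Denominator: sum of squared deviations over the whole series.
    den = sum((v - ybar) ** 2 for v in y)
    return num / den

series = [10, 20, 30, 40, 50]
print(acf(series, 1), acf(series, 2))  # → 0.4 -0.1
```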
The two red horizontal lines on the graph below delineate the Upper confidence
level (UCL) and the Lower confidence level (LCL). If the data is random, then
the plot should be within the UCL and LCL. If the plot exceeds either of these
two levels, as seen in the plot above, then it can be presumed that some
correlation exists in the data.
After Analytic Solver Data Science fits the model, various results will be
available. The quality of the model can be evaluated by comparing the time plot
of the actual values with the forecasted values. If both curves are close, then it
can be assumed that the model is a good fit. The model should expose any
trends and seasonality, if any exist. If the residuals are random then the model
can be assumed a good fit. However, if the residuals exhibit a trend, then the
model should be refined. Fitting an ARIMA model with parameters (0,1,1) will
give the same results as exponential smoothing. Fitting an ARIMA model with
parameters (0,2,2) will give the same results as double exponential smoothing.
Partitioning
To avoid overfitting the data and to be able to evaluate the predictive
performance of the model on new data, we must first partition the data into
training and validation sets using Analytic Solver Data Science’s time series
partitioning utility. After the data is partitioned, ACF, PACF, and ARIMA can
be applied to the dataset.
Note in the output above, the partitioning method is sequential (rather than
random). The first 50 observations have been assigned to the training set and
the remaining 21 observations have been assigned to the validation set.
Open the Lag Analysis dialog by clicking ARIMA – Lag Analysis. Select CA
under Variables in input data, then click > to move the variable to Selected
Variables.
First, let's take a look at the ACF charts. Note that on each chart, the
autocorrelation decreases as the number of lags increases. This suggests that a
definite pattern does exist in each partition. However, since the pattern does not repeat, it can be
assumed that no seasonality is included in the data. In addition, both charts
appear to exhibit a similar pattern.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select TS_Lags for Worksheet and ACF/ACVF/PACF
Training/Validation Data for Chart.
All three charts suggest that a definite pattern exists in the data, but no
seasonality. In addition, both datasets exhibit the same behavior in both the
training and validation sets which suggests that the same model could be
appropriate for each. Now we are ready to fit the model.
The ARIMA model accepts three parameters: p – the number of autoregressive
terms, d – the number of non-seasonal differences, and q – the number of lagged
errors (moving averages).
Recall that the ACF plot showed no seasonality in the data, which means that
the autocorrelation is almost static, decreasing as the number of lags increases.
This suggests setting q = 0, since there appear to be no lagged errors. The
PACF plot displayed a large value for the first lag but minimal values for
successive lags. This suggests setting p = 1. With most datasets, setting d = 1
is sufficient, or can at least be a starting point.
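The differencing parameter d is just repeated first differencing. A minimal sketch (the series below is hypothetical):

```python
def difference(y, d=1):
    """Apply d rounds of first differencing -- the 'd' in ARIMA(p, d, q)."""
    for _ in range(d):
        y = [y[i] - y[i - 1] for i in range(1, len(y))]
    return y

# A linear trend disappears after one round of differencing, which is why
# d = 1 is often a sufficient starting point.
print(difference([10, 20, 30, 40, 50], d=1))  # → [10, 10, 10, 10]
```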
Click back to the TSPartition tab and then click ARIMA – ARIMA Model to
bring up the Time Series – ARIMA dialog.
Select CA under Variables in input data then click > to move the variable to the
Selected Variable field. Under Nonseasonal Parameters set Autoregressive (p)
to 1, Difference (d) to 1 and Moving Average (q) to 0.
Click OK on the ARIMA-Advanced Options dialog and again on the Time Series
– ARIMA dialog. Analytic Solver Data Science calculates and displays various
parameters and charts in four output sheets, Arima_Output, Arima_Fitted,
Arima_Forecast and Arima_Stored. Click the Arima_Output tab to view the
Output Navigator.
Analytic Solver has calculated the constant term and the AR1 term for our
model, as seen above. These are the constant and φ1 terms of our forecasting
equation. See the following output of the Chi-square test.
The very small p-values for the constant term (1.119E-7) and AR1 term (1.19e-
89) suggest that the model is a good fit to our data.
Click the Fitted link on the Output Navigator. This output displays the actual
and fitted values and the resulting residuals for the training partition. As shown
in the graph below, the Actual and Forecasted values match up fairly well. The
usefulness of the model in forecasting will depend upon how close the actual
and forecasted values are in the Forecast, which we will inspect later.
Use your mouse to select a point on the graph to compare the Actual value to the
Forecasted value.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select Arima_Fitted for Worksheet and ACF/ACVF/PACF
Training/Validation Data for Chart.
With the exception of Lag1, the majority of the lags in the PACF and ACF
charts are either clearly within the UCL and LCL bands or just outside of these
bands. This suggests that the residuals are random and are not correlated.
Click the Forecast link on the Output Navigator to display the Forecast Data
table and charts.
The options below appear on the Time Series Partition Data tab.
Time variable
Select a time variable from the available variables and click the > button. If a
Time Variable is not selected, Analytic Solver will assign one to the partitioned
data.
Selected variable
The selected variable appears here.
Parameters: Validation
Enter the minimum and maximum lags for the Validation Data here. The # lags
for the Validation Data set should be >= 1 and < N where N is the number of
records in the Validation dataset.
Selected Variable
Select the desired variable to be included in the ARIMA model by clicking the >
button.
Period
If Fit seasonal model is selected, this option is enabled. Seasonality in a dataset
appears as patterns at specific periods in the time series.
Nonseasonal Parameters
Enter the nonseasonal parameters here for Autoregressive (p), Difference (d),
and Moving Average (q).
Seasonal Parameters
Enter the Seasonal parameters here for Autoregressive (P), Difference (D), and
Moving Average (Q).
Variance-covariance matrix
Analytic Solver Data Science will include the variance-covariance matrix in the
output if this option is selected. This option is selected by default.
Produce forecasts
If this option is selected, Analytic Solver Data Science will display the desired
number of forecasts. If the data has been partitioned, Analytic Solver will
display the forecasts on the validation data.
Number of forecasts
If Produce forecasts is selected and a non-partitioned dataset is being used, this
option is enabled. The maximum number of forecasts is 100.
Introduction
Data collected over time is likely to show some form of random variation.
"Smoothing techniques" can be used to reduce or cancel the effect of these
variations. These techniques, when properly applied, will “smooth” out the
random variation in the time series data to reveal any underlying trends that may
exist.
Analytic Solver Data Science features four different smoothing techniques:
Exponential, Moving Average, Double Exponential, and Holt Winters. The first
two techniques, Exponential and Moving Average, are relatively simple
smoothing techniques and should not be performed on datasets involving
seasonality. The last two techniques are more advanced techniques which can
be used on datasets involving seasonality.
Exponential smoothing
Exponential smoothing is one of the more popular smoothing techniques due to
its flexibility, ease in calculation and good performance. As in Moving Average
Smoothing, a simple average calculation is used. Exponential Smoothing,
however, assigns exponentially decreasing weights starting with the most recent
observations. In other words, new observations are given relatively more weight
in the average calculation than older observations. Analytic Solver Data
Science utilizes the formulas below in the Exponential Smoothing tool.
s_0 = x_0
s_t = αx_{t−1} + (1 − α)s_{t−1},  t > 0
where
• the original observations are denoted by {x_t}, starting at t = 0
• α is the smoothing factor, which lies between 0 and 1
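As an aside, the recursion is straightforward to implement; a minimal sketch (the data and alpha value below are hypothetical, not from this example):

```python
def exp_smooth(x, alpha=0.2):
    """Simple exponential smoothing:
    s_0 = x_0;  s_t = alpha * x_{t-1} + (1 - alpha) * s_{t-1}, for t > 0."""
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t - 1] + (1 - alpha) * s[t - 1])
    return s

print(exp_smooth([10, 20, 30, 40], alpha=0.2))
```

Because each smoothed value folds in the previous one, the weight on older observations decays geometrically by a factor of (1 − α) per step.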
Then click OK to partition the data into training and validation sets.
(Partitioning is optional. Smoothing techniques may be run on full unpartitioned
datasets.) The result of the partition, TSPartition, is inserted right of the Airpass
worksheet.
Click Smoothing – Exponential to open the Exponential Smoothing dialog.
Select Month as the Time Variable, unless already selected. Select Passengers
as the Selected variable and also Produce Forecast on validation.
Click OK to accept the partitioning defaults and create the two sets (Training
and Validation). TSPartition is inserted right of the Income worksheet. Click
Smoothing – Exponential from the Data Science ribbon to open the
Exponential Smoothing dialog.
Select Year for Time Variable if it has not already been selected. Select CA as
the Selected Variable and Produce forecast on validation.
The smoothing parameter (Alpha) determines the magnitude of weights assigned
to the observations. For example, a value close to 1 would result in the most
recent observations being assigned the largest weights and the earliest
observations being assigned the smallest weights. A value close to 0 would
result in the earliest observations being assigned the largest weights and the
latest observations being assigned the smallest weights. As a result, the value of
Alpha depends on how much influence the most recent observations should have
on the model.
Analytic Solver includes the Optimize feature that will choose the Alpha
parameter value that results in the minimum residual mean squared error. It is
recommended that this feature be used carefully as it can often lead to a model
that is overfit to the training set. An overfit model rarely exhibits high
predictive accuracy in the validation set.
Expo1 is inserted right of the Expo worksheet. Analytic Solver used an Alpha =
0.9976…
which results in an MSE of 22,110.2 for the Training Set and an MSE of
1.93E08 for the Validation Set. Although an alpha of 0.9976 did result in lower
values, the MSE in both the training and validation sets indicates the model is
still not a good fit.
Then click OK to accept the partitioning defaults and create the two partitions
(Training and Validation). The output, TSPartition, will be inserted right of the
Income sheet.
Click Smoothing – Moving Average from the Data Science ribbon to open the
Moving Average Smoothing dialog. Year is already selected for Time Variable.
Select CA as the Selected variable and check the Produce forecast on
validation checkbox.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select MovingAvg for Worksheet and Time Series Training Data or
Time Series Validation Data for Chart.
Then click OK to partition the data into training and validation sets. TSPartition
will be inserted right of the Airpass worksheet.
Click Smoothing – Double Exponential to open the Double Exponential
Smoothing dialog.
Select Month as the Time Variable, if not already selected. Select Passengers
as the Selected variable, then check Produce Forecast on validation to test the
forecast on the validation set.
This example uses the defaults for both the Alpha and Trend parameters.
However, Analytic Solver Data Science includes the Optimize feature, which
will choose the best values for these parameters. These parameters result in an
MSE of 450.7 for the Training Set and an MSE of 8477.64 for the Validation
Set. Again, the model created with the parameters from the Optimize algorithm
appears to result in a better fit than a model created with the default parameters.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select DoubleExp1 for Worksheet and Time Series Training Data or
Time Series Validation Data for Chart.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select MulHoltWinters for Worksheet and Time Series Training Data
or Time Series Validation Data for Chart.
If you inspect the MSE (Mean Squared Error) term in the Error Measures
(Validation) table, you’ll see that this value is fairly high. In addition, the peaks
for the Forecast data appear to lag behind the peaks in the Validation data. This
suggests that our Trend (Beta) parameter is too large.
Let’s go back and try the Multiplicative method one more time using the
Optimize feature. This feature will choose the best values for the Alpha, Beta,
and Gamma parameters.
Let’s try the Additive model again using the Optimize feature. Click back to
TSPartition and then click Smoothing – Holt-Winters – Additive on the Data
Science ribbon.
Select Month for Time variable and Passengers for Selected variable, then
enter 12 for Period. Produce Forecast on Validation is selected by default.
Select Optimize to run the Optimize algorithm which will pick the best values
for the three parameters, Alpha, Beta, and Gamma.
Notice the parameter values chosen by the Optimize algorithm were 0.858 for
Alpha, 0.00351 for Beta, and 0.917 for Gamma. Scroll down to view the results
of the model fitting.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select AddHoltWinters1 for Worksheet and Time Series Training Data
or Time Series Validation Data for Chart.
Note: To view these two charts in the Cloud app, click the Charts icon on the
Ribbon, select NoTrendHoltWinters for Worksheet and Time Series Training
Data or Time Series Validation Data for Chart.
Let’s try the No Trend model again using the Optimize feature. Click back to
TSPartition, then click Smoothing – Holt-Winters – No Trend on the Data
Science ribbon.
Notice the parameter values chosen by the Optimize algorithm were 0.984 for
Alpha and 0.233 for Gamma. Scroll down to view the results of the model
fitting.
Time variable
Select a variable associated with time from the Variables in input data list box.
Output Options
If applying this smoothing technique to partitioned data, the option Produce
forecast on validation will appear. Otherwise, the option Produce forecast will
appear. If selected, Analytic Solver Data Science will include a forecast on the
output results.
Optimize
Select this option to have Analytic Solver choose the Alpha value that
minimizes the residual mean squared errors in the training and validation sets.
Take care when using this feature, as it can result in an overfitted model. This
option is not selected by default.
Level (Alpha)
Enter the smoothing parameter here. This parameter is used in the weighted
average calculation and can be from 0 to 1. A value of 1 or close to 1 will result
in the most recent observations being assigned the largest weights and the
earliest observations being assigned the smallest weights. A value of 0 or close
to 0 will result in the most recent observations being assigned the smallest
weights and the earliest observations being assigned the largest weights. The
default value is 0.2.
Optimize
Select this option to select the Alpha and Beta values that minimize the residual
mean squared errors in the training and validation sets. Take care when using
this feature as this option can result in an overfitted model. This option is not
selected by default.
Level (Alpha)
Enter the smoothing parameter here. This parameter is used in the weighted
average calculation and can be from 0 to 1. A value of 1 or close to 1 will result
in the most recent observations being assigned the largest weights and the
earliest observations being assigned the smallest weights in the weighted
average calculation. A value of 0 or close to 0 will result in the most recent
observations being assigned the smallest weights and the earliest observations
being assigned the largest weights in the weighted average calculation. The
default is 0.2.
Trend (Beta)
The Double Exponential Smoothing technique includes an additional parameter,
Beta, to contend with trends in the data. This parameter is also used in the
weighted average calculation and can be from 0 to 1. A value of 1 or close to 1
will result in the most recent observations being assigned the largest weights and
the earliest observations being assigned the smallest weights in the weighted
average calculation. A value of 0 or close to 0 will result in the most recent
observations being assigned the smallest weights and the earliest observations
being assigned the largest weights in the weighted average calculation. The
default is 0.15.
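The interaction of Alpha and Beta can be seen in a compact sketch of double exponential (Holt) smoothing. This is one common textbook formulation, offered only as an illustration; Analytic Solver’s exact recursion and initialization may differ:

```python
def double_exp_smooth(x, alpha=0.2, beta=0.15, horizon=1):
    """Double exponential (Holt) smoothing: a level updated with Alpha
    and a trend updated with Beta, then extrapolated for forecasts."""
    level, trend = x[0], x[1] - x[0]      # naive initialization
    for t in range(1, len(x)):
        prev_level = level
        level = alpha * x[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecasts extrapolate the final level and trend.
    return [level + h * trend for h in range(1, horizon + 1)]

# On a perfectly linear series the trend is recovered exactly.
print(double_exp_smooth([10, 20, 30, 40], horizon=2))
```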
Period
Enter the number of periods that make up one season. The value for # Complete
seasons will be automatically filled.
Optimize
Select this option to select the Alpha, Beta, and Gamma values that minimize
the residual mean squared errors in the training and validation sets. Take care
when using this feature as this option can result in an overfitted model. This
option is not selected by default.
Level (Alpha)
Enter the smoothing parameter here. This parameter is used in the weighted
average calculation and can be from 0 to 1. A value of 1 or close to 1 will result
in the most recent observations being assigned the largest weights and the
earliest observations being assigned the smallest weights in the weighted
average calculation. A value of 0 or close to 0 will result in the most recent
observations being assigned the smallest weights and the earliest observations
being assigned the largest weights in the weighted average calculation. The
default is 0.2.
Trend (Beta)
The Holt-Winters Smoothing technique also utilizes the Trend parameter, Beta, to contend
with trends in the data. This parameter is also used in the weighted average
calculation and can be from 0 to 1. A value of 1 or close to 1 will result in the
most recent observations being assigned the largest weights and the earliest
observations being assigned the smallest weights in the weighted average
calculation. A value of 0 or close to 0 will result in the most recent observations
being assigned the smallest weights and the earliest observations being assigned
the largest weights in the weighted average calculation. The default is 0.15.
This option is not included on the No Trend Model dialog.
Produce Forecast
If this option is selected, Analytic Solver Data Science will include a forecast on
the output results.
# Forecasts
If applying this smoothing technique to an unpartitioned dataset, this option is
enabled. Enter the desired number of forecasts here.
Introduction
One very important issue when fitting a model is how well the newly created
model will behave when applied to new data. To address this issue, the dataset
can be divided into multiple partitions: a training partition used to create the
model, a validation partition to test the performance of the model and, if desired,
a third test partition. Partitioning is performed randomly, to protect against a
biased partition, according to proportions specified by the user or according to
rules concerning the dataset type. For example, when creating a time series
forecast, data is partitioned by chronological order.
Training Set
The training dataset is used to train or build a model. For example, in a linear
regression, the training dataset is used to fit the linear regression model, i.e. to
compute the regression coefficients. In a neural network model, the training
dataset is used to obtain the network weights. After fitting the model on the
training dataset, the performance of the model should be tested on the validation
dataset.
Validation Set
Once a model is built using the training dataset, the performance of the model
must be validated using new data. If the training data itself was utilized to
compute the accuracy of the model fit, the result would be an overly optimistic
estimate of the accuracy of the model. This is because the training or model
fitting process ensures that the accuracy of the model for the training data is as
high as possible -- the model is specifically suited to the training data. To obtain
a more realistic estimate of how the model would perform with unseen data, we
must set aside a part of the original data and not include this set in the training
process. This dataset is known as the validation dataset.
To validate the performance of the model, Analytic Solver Data Science
measures the discrepancy between the actual observed values and the predicted
value of the observation. This discrepancy is known as the error in prediction
and is used to measure the overall accuracy of the model.
Test Set
The validation dataset is often used to fine-tune models. For example, you might
try out neural network models with various architectures and test the accuracy of
each on the validation dataset to choose the best performer among the competing
architectures. In such a case, when a model is finally chosen, its accuracy with
the validation dataset is still an optimistic estimate of how it would perform with
unseen data. This is because the final model has come out as the winner among
the competing models based on the fact that its accuracy with the validation
dataset is highest. As a result, it is a good idea to set aside yet another portion of
data which is used neither in training nor in validation. This set is known as
the test set.
Random Partitioning
In simple random sampling, every observation in the main dataset has equal
probability of being selected for the partition dataset. For example, if you
specify 60% for the training dataset, then 60% of the total observations are
randomly selected for the training dataset. In other words, each observation has
a 60% chance of being selected.
Random partitioning uses the system clock as a default to initialize the random
number seed. Alternatively, the random seed can be manually set which will
result in the same observations being chosen for the training/validation/test sets
each time a standard partition is created.
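The behavior described above can be sketched with the standard library's
seeded generator (an illustration only; Analytic Solver's sampler is its own
implementation):

```python
import random

def standard_partition(n_rows, pct_train=0.6, seed=12345):
    """Randomly assign row indices to training/validation sets.
    A fixed seed makes the assignment repeatable across runs."""
    rng = random.Random(seed)      # seeded generator, like Set Seed
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_train = round(n_rows * pct_train)
    return rows[:n_train], rows[n_train:]
```

Calling the function twice with the same seed returns the identical split,
which is exactly why fixing the seed makes successive partitions comparable.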
Partition Options
It is no longer always necessary to partition a dataset before running a
classification or regression algorithm. Rather, you can now perform partitioning
on the Parameters tab for each classification or regression method.
If the active data set is un-partitioned, the Partition Data command button will
be enabled. If the active data set has already been partitioned, this button will be
disabled. Clicking the Partition Data button opens the following dialog. Select
Partition Data on the dialog to enable the partitioning options.
It is also possible for the user to specify the set to which each observation
should be assigned. In column O, enter a “t”, “v” or “s” to indicate the assignment of each
record to either the training dataset (t), the validation dataset (v), or the test
dataset (s), as shown in the screenshot below.
Wine Dataset with Partition Variable
Click Partition – Standard Partition on the Data Science ribbon to open the
Standard Data Partition dialog.
Select Use Partition Variable in the Partitioning options section, select
Partition Variable in the Variables list box, then click > next to Use Partition
Variable. Analytic Solver Data Science will use the values in the Partition
Variable column to create the training, validation, and test sets. Records with a
“t” in the O column will be designated as training records. Records with a “v”
in the O column will be designated as validating records and records with an “s”
in this column will be designated as testing records. Now highlight all
remaining variables in the list box and click > to include them in the partitioned
data.
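The assignment logic can be sketched as follows, with the flag letters
playing the role of the Partition Variable column (an illustrative helper,
not part of the product):

```python
def split_by_partition_variable(records, flags):
    """Split records by a user-supplied partition flag:
    "t" = training, "v" = validation, "s" = test."""
    sets = {"t": [], "v": [], "s": []}
    for rec, flag in zip(records, flags):
        sets[flag].append(rec)
    return sets["t"], sets["v"], sets["s"]
```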
Set Seed
Random partitioning uses the system clock as a default to initialize the random
number seed. By default this option is selected to specify a seed for random
number generation for the partitioning. Setting this option will result in the same
records being assigned to the same set on successive runs. The default seed
entry is 12345.
Specify percentages
If Pick up rows randomly is selected under Partitioning options, this option will
be enabled. Select this option to manually enter percentages for training set,
validation set and test sets. Records will be randomly allocated to each set
according to these percentages.
Equal percentages
If Pick up rows randomly is selected under Partitioning options, this option will
be enabled. If this option is selected, Analytic Solver Data Science will allocate
33.33% of the records in the database to each set: training, validation, and test.
Output variable
Select the output variable from the variables listed in the Variables in the
partition data list box.
#Classes
After the output variable is chosen, the number of classes (distinct values) for
the output variable will be displayed here. Analytic Solver Data Science
supports a class size of 2.
Introduction
Analytic Solver Data Science includes comprehensive, powerful support for data
science and machine learning. Using these tools, you can train or fit your data to
a wide range of statistical and machine learning models: Classification and
regression trees, neural networks, linear and logistic regression, discriminant
analysis, naïve Bayes, k-nearest neighbors and more. But the task of choosing
and comparing these models, and selecting parameters for each one was up to
you.
With the new Find Best Model options, you can automate this work as well!
Find Best Model uses methods similar to those in (expensive high-end) tools
like DataRobot and RapidMiner, to automatically choose types of ML models
and their parameters, validate and compare them according to criteria that you
choose, and deliver the model that best fits your data.
See the Analytic Solver User Guide to find a complete walk-through of this
feature. Each classification learner may be run independently. The rest of this
chapter contains explanations of each option contained on the Find Best Model
dialogs: Data, Parameters and Scoring.
Data Tab
The Data tab is where the data source is listed, the input and output variables are
selected and the Success Class and Probability are set.
Data Source
Workbook: Click the down arrow to select the workbook where the Find Best
Model: Classification method will be applied.
Worksheet: Click the down arrow to select the worksheet where the Find Best
Model: Classification method will be applied.
Data range: Select the range where the data appears on the selected
worksheet.
#Columns: (Read-only) The number of columns in the data range.
# Rows In: Training Set: (Read-only) The number of rows in the training
partition.
# Rows In: Validation Set: (Read-only) The number of rows in the validation
partition.
# Rows In: Test Set: (Read-only) The number of rows in the test partition.
Variables
First Row Contains Headers: Select this option if the first row of the
dataset contains column headings.
Variables in Input Data: Variables contained in the dataset.
Selected Variables: Variables appearing under Selected Variables will be
treated as continuous.
Categorical Variables: Variables appearing under Categorical Variables will
be treated as categorical.
Output Variable: Select the output variable, or the variable to be
classified, here.
Preprocessing
Partition Data: Analytic Solver Data Science includes the ability to
partition a dataset from within a classification or prediction method by
clicking Partition Data on the Parameters tab. Analytic Solver Data Science
will partition your dataset (according to the partition options you set)
immediately before running the classification method. If partitioning has
already occurred on the dataset, this option will be disabled. For more
information on partitioning, please see the Data Science Partitioning
chapter.
Rescale Data: Use Rescaling to normalize one or more features in your data
during the data preprocessing stage. Analytic Solver Data Science provides
the following methods for feature scaling: Standardization, Normalization,
Adjusted Normalization and Unit Norm. For more information on this feature,
see the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
If Rescale Data has been selected on the Rescaling dialog, users can still
manually use the “Min/Max as bounds” button within the Fitting Options
section of the Simulation tab, to populate the parameter grid with the bounds
from the original data, not the rescaled data. Note that the “Min/Max as
bounds” feature is available for the user’s convenience. Users must still be
aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Hidden Layer
Nodes in the hidden layer receive input from the input layer. The output of the hidden nodes is
a weighted sum of the input values. This weighted sum is computed with weights that are
initially set at random values. As the network “learns”, these weights are adjusted. This
weighted sum is used to compute the hidden node’s output using a transfer function. The
default selection is Sigmoid.
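The paragraph above can be sketched as a single hidden node in Python (a
simplified illustration of the weighted sum plus transfer function, not the
product's network code):

```python
import math

def hidden_node_output(inputs, weights, bias=0.0):
    """A hidden node's output: the Sigmoid transfer function applied to
    the weighted sum of its inputs. The weights start at random values
    and are adjusted as the network learns."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # Sigmoid, the default selection
```

The Sigmoid squashes any weighted sum into the interval (0, 1); a weighted
sum of zero maps to exactly 0.5.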
Output Layer
As in the hidden layer output calculation (explained in the above paragraph), the output layer is
also computed using the same transfer function as described for Activation: Hidden Layer.
The default selection is Sigmoid.
Training Parameters
Click Training Parameters to open the Training Parameters dialog to specify parameters related
to the training of the Neural Network algorithm.
Stopping Rules
Click Stopping Rules to open the Stopping Rules dialog. Here users can specify a
comprehensive set of rules for stopping the algorithm early plus cross-validation on the training
error.
Weak Learner
Adaboost Variant
In AdaBoost.M1 (Freund), the constant is calculated as:
αb = ln((1 − eb)/eb)
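For instance, a weak learner with error rate eb = 0.2 receives weight
ln(0.8/0.2) ≈ 1.386, while a learner no better than chance (eb = 0.5)
receives weight 0. A one-line sketch of the formula:

```python
import math

def adaboost_m1_alpha(error):
    """AdaBoost.M1 (Freund) learner weight: alpha_b = ln((1 - e_b) / e_b).
    Learners better than chance (e_b < 0.5) get positive weight."""
    return math.log((1 - error) / error)
```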
Weak Learner
Decision Tree: For more information on each parameter, see the Regression
Tree Method chapter within the Analytic Solver Reference Guide.
Tree Growth Levels, Nodes, Splits, Records in Terminal Nodes: In the Tree
Growth section, select Levels, Nodes, Splits, and Records in Terminal Nodes.
Values entered for these options limit tree growth, i.e. if 10 is entered
for Levels, the tree will be limited to 10 levels.
Prune: If a validation partition exists, this option is enabled. When this
option is selected, Analytic Solver Data Science will prune the tree using
the validation set. Pruning the tree using the validation set reduces the
error from over-fitting the tree to the training data.
Click Tree for Scoring to select the Tree type used for scoring: Fully Grown, Best Pruned,
Minimum Error, User Specified or Number of Decision Nodes.
Neural Network: For more information on each parameter, see the Neural
Network Regression Method chapter within the Analytic Solver Reference
Guide.
Architecture: Click Add Layer to add a hidden layer. To delete a layer,
click Remove Layer. Once the layer is added, enter the desired Neurons.
Hidden Layer: Nodes in the hidden layer receive input from the input layer.
The output of the hidden nodes is a weighted sum of the input values. This
weighted sum is computed with weights that are initially set at random
values. As the network “learns”, these weights are adjusted. This weighted
sum is used to compute the hidden node’s output using a transfer function.
The default selection is Sigmoid.
Stopping Rules: Click Stopping Rules to open the Stopping Rules dialog. Here
users can specify a comprehensive set of rules for stopping the algorithm
early plus cross-validation on the training error.
Bagging Ensemble Method: For more information on each parameter, see the
Ensemble Methods chapter within the Analytic Solver Reference Guide.
Number of Weak Learners: This option controls the number of “weak”
regression models that will be created. The ensemble method will stop when
the number of regression models created reaches the value set for this
option. The algorithm will then compute the average of the predictions from
all weak learners and assign the resulting value to each record.
Weak Learner
Under Ensemble: Common click the down arrow beneath Weak Learner to select one of the four
featured classifiers: Linear Regression, k-NN, Neural Networks or Decision Tree. The
command button to the right will be enabled. Click this command button to control various
option settings for the weak learner.
Random Seed for Bootstrapping: Enter an integer value to specify the seed
for random resampling of the training data for each weak learner. Setting
the random number seed to a nonzero value (any number of your choice is OK)
ensures that the same sequence of random numbers is used each time the
dataset is chosen for the classifier. The default value is “12345”. If left
blank, the random number generator is initialized from the system clock, so
the sequence of random numbers will be different in each calculation. If you
need the results from one run of the algorithm to the next to be strictly
comparable, you should set the seed. To do this, type the desired number
into the box. This option accepts both positive and negative integers with
up to 9 digits.
Boosting Ensemble Method: For more information on each parameter, see the
Boosting Regression Ensemble Method chapter within the Analytic Solver
Reference Guide.
Number of Weak Learners: See description above.
Weak Learner
Step Size: The Adaboost algorithm minimizes a loss function using the
gradient descent method. The Step size option is used to ensure that the
algorithm does not descend too far when moving to the next step. It is
recommended to leave this option at the default of 0.3, but any number
between 0 and 1 is acceptable. A Step size setting closer to 0 results in
the algorithm taking smaller steps to the next point, while a setting closer
to 1 results in the algorithm taking larger steps towards the next point.
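The effect of the step size can be sketched as a shrinkage factor applied to
each successive correction (a simplified illustration of gradient-style
boosting, not Frontline's exact algorithm):

```python
def boosted_prediction(base_value, corrections, step_size=0.3):
    """Accumulate weak-learner corrections scaled by the step size:
    values near 0 take small, cautious steps; values near 1 take
    large steps toward the target."""
    pred = base_value
    for correction in corrections:
        pred += step_size * correction
    return pred
```

With step size 0.3, two unit corrections move a base prediction of 10.0 to
10.6; with step size 1.0, the same corrections move it to 12.0.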
Random Trees Ensemble Method: For more information on each parameter, see
the Boosting Classification Ensemble Method chapter within the Analytic
Solver Reference Guide.
Number of Weak Learners: See description above.
Weak Learner
Number of Randomly Selected Features: The Random Trees ensemble method works
by training multiple “weak” classification trees using a fixed number of
randomly selected features then taking the mode of each class to create a
“strong” classifier.
where
ŷi is the predicted value for observation i,
yi is the actual value for observation i, and
ȳ is the mean of the y values.
Specificity: Specificity is defined as the proportion of negative
classifications that were actually negative.
SSE: Sum of Squared Errors – the sum of the squares of the differences
between the actual and predicted values.
SSE = ∑(ŷi − yi)², summed over i = 1 to n
Sensitivity: Sensitivity is defined as the proportion of positive cases that
were classified correctly as positive.
MSE: Mean Squared Error – the average of the squared differences between the
actual and predicted values.
MSE = (1/n) ∑(ŷi − yi)², summed over i = 1 to n
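The two error measures can be computed directly from paired actual/predicted
values; this short Python sketch mirrors the formulas above (illustrative
only):

```python
def sse(actual, predicted):
    """Sum of Squared Errors: sum of (yhat_i - y_i)^2 over all observations."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """Mean Squared Error: SSE averaged over the n observations."""
    return sse(actual, predicted) / len(actual)
```

For example, actuals [1, 2, 3] against predictions [1, 3, 5] give
SSE = 0 + 1 + 4 = 5 and MSE = 5/3.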
Introduction
Linear discriminant analysis (LDA) is a generative classifier; it models the joint
probability distribution of the input and target variables. As a result, this
classifier can “generate” new input variables given the target variable.
The discriminant analysis model is built using a set of observations for which
the classes are known. This set of observations is sometimes referred to as the
training set. Based on the training set, the technique constructs a set of linear
functions of the predictors, known as discriminant functions, such that L = b1x1
+ b2x2 + … + bnxn + c, where the b's are discriminant coefficients, the x's are
the input variables or predictors and c is a constant.
These discriminant functions are used to predict the class of a new observation
with an unknown class. For a k class problem, k discriminant functions are
constructed. Given a new observation, all k discriminant functions are evaluated
and the observation is assigned to the class with the largest discriminant function
value.
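The classify-by-largest-discriminant rule follows directly from the linear
form L = b1x1 + b2x2 + … + bnxn + c; the sketch below uses made-up
coefficients purely for illustration:

```python
def discriminant_score(x, b, c):
    """Evaluate one linear discriminant L = b1*x1 + ... + bn*xn + c."""
    return sum(bi * xi for bi, xi in zip(b, x)) + c

def classify(x, class_functions):
    """Assign the class whose discriminant function value is largest.
    class_functions maps class label -> (coefficients b, constant c)."""
    return max(class_functions,
               key=lambda k: discriminant_score(x, *class_functions[k]))
```

For a k class problem, class_functions would hold k entries, one
discriminant function per class.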
Discriminant analysis assumes that:
1. The data is normally distributed.
2. Means of each class are specific to that class.
3. All classes have a common covariance matrix.
If these assumptions are realized, DA generates a linear decision boundary.
The latest version of Analytic Solver Data Science now contains Quadratic
Discriminant Analysis (QDA). QDA produces a quadratic decision boundary,
rather than a linear decision boundary. While QDA also assumes that the data is
normally distributed, QDA does not assume that all classes share the same
covariance matrix.
QDA is a more flexible technique when compared to LDA. QDA's performance
improves over LDA when the class covariance matrices are disparate. Since
each class has a different covariance matrix, the number of parameters that must
be estimated increases significantly as the number of dimensions (predictors)
increases. As a result, LDA might be a better choice than QDA on datasets
with small numbers of observations and large numbers of classes. It’s
advisable to try both techniques to determine which one performs best on
your data. You can easily switch between LDA and QDA simply by changing the
Type option.
Variable Description
age Age of patient
anaemia Decrease of red blood cells or hemoglobin (boolean)
creatinine_phosphokinase Level of the CPK enzyme in the blood (mcg/L)
diabetes If the patient has diabetes (boolean)
ejection_fraction Percentage of blood leaving the heart at each contraction (percentage)
high_blood_pressure If the patient has hypertension (boolean)
platelets Platelets in the blood (kiloplatelets/mL)
serum_creatinine Level of serum creatinine in the blood (mg/dL)
serum_sodium Level of serum sodium in the blood (mEq/L)
sex Female (0) or Male (1)
smoking If the patient smokes or not (boolean)
time Follow-up period (days)
DEATH_EVENT If the patient was deceased during the follow-up period (boolean)
All supervised algorithms include a new Simulation tab. This tab uses the
functionality from the Generate Data feature (described in the What’s New
section of the Analytic Solver Data Science User Guide) to generate synthetic
data based on the training partition, and uses the fitted model to produce
predictions for the synthetic data. The resulting report, DA_Simulation, will
contain the synthetic data, the predicted values and the Excel-calculated
Expression column, if present. In addition, frequency charts containing the
Predicted, Training, and Expression (if present) sources or a combination of any
pair may be viewed, if the charts are of the same type. Since this new
functionality does not support categorical variables, only continuous
variables will be included in the model.
Inputs
1. First, we’ll need to perform a standard partition, as explained in the
previous chapter, using percentages of 60% training and 40% validation.
STDPartition will be inserted to the right of the Data worksheet. (For more
information on how to partition a dataset, please see the previous Data
Science Partitioning chapter.)
If the first option is selected, Empirical, Analytic Solver Data Science will
assume that the probability of encountering a particular class in the dataset
is the same as the frequency with which it occurs in the training data.
If the second option is selected, Uniform, Analytic Solver Data Science will
assume that all classes occur with equal probability.
Select the third option, Manual, to manually enter the desired class and
probability values of .3 for Class 0 and .7 for Class 1, as shown in the
screenshot above.
Click Done to close the dialog.
9. Keep the default setting for Type under Discriminant Analysis: Fitting, to
use linear discriminant analysis. See the options descriptions below for
more information on linear vs quadratic Discriminant Analysis.
10. Select Canonical Variate Analysis. When this option is selected, Analytic
Solver Data Science produces the canonical variates for the data based on
an orthogonal representation of the original variates. This has the effect of
choosing a representation which maximizes the distance between the
different groups. For a k class problem there are k-1 Canonical variates.
Typically, only a subset of the canonical variates is sufficient to
discriminate between the classes. For this example, we have two canonical
variates which means that if we replace the four original predictors by just
two predictors, X1 and X2, (which are actually linear combinations of the
four original predictors) the discrimination based on these two predictors
will perform just as well as the discrimination based on the original
predictors.
For more information on the remaining options shown on this dialog in the
Distribution Fitting, Correlation Fitting and Sampling sections, see the
Generate Data chapter that appears earlier in this guide.
17. Click Finish to run Discriminant Analysis on the example dataset.
Output Worksheets
Output sheets containing the Discriminant Analysis results will be inserted into
your active workbook to the right of the STDPartition worksheet.
DA_Output
This result worksheet includes 4 segments: Output Navigator, Inputs, Linear
Discriminant Functions and Canonical Variates.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
DA_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Discriminant Analysis dialog.
DA_TrainingScore
Click the DA_TrainingScore tab to view the newly added Output Variable
frequency chart, the Training: Classification Summary and the Training:
Classification Details report. All calculations, charts and predictions on this
worksheet apply to the Training data.
Note: To view charts in the Cloud app, click the Charts icon on the Ribbon, select a
worksheet under Worksheet and a chart under Chart.
To see both the actual and predicted frequency, click Prediction and select
Actual. This change will be reflected on all charts.
Click Predicted/Actual to change view
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct and %Correct): 59.78% - Refers to the ability of
the classifier to predict a class label correctly.
• Specificity: 0.44 - Also called the true negative rate, measures the
percentage of failures correctly identified as failures
Specificity (SPC) or True Negative Rate =TN / (FP + TN)
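These rate definitions follow directly from the confusion matrix counts; a
minimal sketch (the counts used below are made up for illustration, not
taken from this example's output):

```python
def specificity(tn, fp):
    """True negative rate: TN / (FP + TN)."""
    return tn / (fp + tn)

def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)
```

For instance, a model with 44 true negatives and 56 false positives has
specificity 44/100 = 0.44.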
DA_ValidationScore
Click the DA_ValidationScore tab to view the newly added Output Variable
frequency chart, the Validation: Classification Summary and the Validation:
Classification Details report. All calculations, charts and predictions on this
worksheet apply to the Validation data.
• Frequency Charts: The output variable frequency chart opens
automatically once the DA_ValidationScore worksheet is selected. To close
this chart, click the “x” in the upper right hand corner. To reopen, click
onto another tab and then click back to the DA_ValidationScore tab.
Click the Frequency chart to display the frequency for both the predicted
and actual values of the output variable, along with various statistics such as
count, number of classes and the mode. Select Relative Frequency from the
drop-down menu, on the right, to see the relative frequencies of the
output variable for both actual and predicted. See above for more
information on this chart.
• Classification Summary: This report contains the confusion matrix for the
validation data set.
DA_ValidationScore: Classification Summary
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct and %Correct): 56.67% - Refers to the ability of
the classifier to predict a class label correctly.
Decile-wise Lift Chart, ROC Curve, and Lift Charts for Valid. Partition
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted in decreasing order using the predicted output variable value.
After sorting, the actual outcome values of the output variable are cumulated
and the lift curve is drawn as the cumulative number of cases in decreasing
probability (on the x-axis) vs the cumulative number of true positives on the
y-axis. The baseline (red line connecting the origin to the end point of the blue
line) is a reference line. For a given number of cases on the x-axis, this line
represents the expected number of successes if no model existed, and instead
cases were selected at random. This line can be used as a benchmark to measure
the performance of the fitted model. The greater the area between the lift curve
and the baseline, the better the model. In the Training Lift chart, if we selected
100 cases as belonging to the success class and used the fitted model to pick the
members most likely to be successes, the lift curve tells us that we would be
right on about 52 of them. Conversely, if we selected 100 random cases, we
could expect to be right on about 35 (34.63) of them. In the Validation Lift
chart, if we selected 50 cases as belonging to the success class and used the
fitted model to pick the members most likely to be successes, the lift curve tells
us that we would be right on about 23 of them. Conversely, if we selected 50
random cases, we could expect to be right on about 14 (14.167) of them.
The decile-wise lift curve is drawn as the decile number versus the
cumulative actual output variable value divided by the decile's mean output
variable value. The bars in this chart indicate the factor by which the
model outperforms a random assignment, one decile at a time. Records are
sorted by their predicted values (scores) and divided into ten equal-sized
bins or deciles. The first decile contains the 10% of patients that are most
likely to experience catastrophic heart failure. The 10th or last decile
contains the 10% of patients that are least likely to experience
catastrophic heart failure. Ideally, the decile-wise lift chart should
resemble a staircase with the 1st decile as the tallest bar, the 2nd decile
as the 2nd tallest, the 3rd decile as the 3rd tallest, all the way down to
the last or 10th decile as the smallest bar. This “staircase” conveys that
the model has correctly “binned” the records from most likely to least
likely.
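The decile computation can be sketched as follows, assuming a list of
predicted scores and 0/1 actual outcomes (an illustration only, not the
software's implementation):

```python
def decile_lift(scores, actuals, n_bins=10):
    """Sort records by predicted score (descending), split into equal
    bins, and divide each bin's success rate by the overall rate."""
    ranked = [a for _, a in sorted(zip(scores, actuals), reverse=True)]
    overall = sum(actuals) / len(actuals)
    size = len(ranked) // n_bins
    return [sum(ranked[i * size:(i + 1) * size]) / size / overall
            for i in range(n_bins)]
```

A perfect ranking of a 50/50 outcome yields lift 2.0 in the top deciles and
0.0 in the bottom ones; lift always averages 1.0 across the deciles.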
Select Lift Chart (Alternative) to display Analytic Solver Data Science's new
Lift Chart. Each of these charts consists of an Optimum Classifier curve, a
Fitted Classifier curve, and a Random Classifier curve. The Optimum Classifier
curve plots a hypothetical model that would provide perfect classification for
our data. The Fitted Classifier curve plots the fitted model and the Random
Classifier curve plots the results from using no model or by using a random
guess (i.e. for x% of selected observations, x% of the total number of positive
observations are expected to be correctly classified).
The Alternative Lift Chart plots Lift against the Predictive Positive Rate or
Support.
Click the down arrow and select Gain Chart from the menu. In this chart, the
True Positive Rate or Sensitivity is plotted against the Predictive Positive Rate
or Support.
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field, on the Simulation tab of
the Discriminant Analysis dialog
IF([@ejection_fraction]<20, [@DEATH_EVENT], “EF>=20”)
The results in this column are either 0, 1, or EF>=20.
• DEATH_EVENT = 0 indicates that the patient had an ejection_fraction < 20
but did not suffer catastrophic heart failure.
• DEATH_EVENT = 1 in this column indicates that the patient had an
ejection_fraction < 20 and did suffer catastrophic heart failure.
• EF>=20 indicates that the patient had an ejection fraction of 20 or
greater.
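The Excel expression above can be mirrored for a single record in Python
(the function name is our own; the column names come from the dataset):

```python
def expression(ejection_fraction, death_event):
    """Python equivalent of the Excel expression
    IF([@ejection_fraction]<20, [@DEATH_EVENT], "EF>=20")."""
    return death_event if ejection_fraction < 20 else "EF>=20"
```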
The remainder of the data in this report is synthetic data, generated using the
Generate Data feature described in the chapter with the same name, that appears
earlier in this guide. Note: If the data had been rescaled, i.e. Rescale Data was
selected on the Parameters tab, the data shown in this table would have been fit using the
rescaled data.
The chart that is displayed once this tab is selected, contains frequency
information pertaining to the output variable in the actual data, the synthetic data
and the expression, if it exists. In the screenshot below, the bars in the darker
shade of blue are based on the synthetic data. The bars in the lighter shade of
blue are based on the training data.
In the synthetic data (the columns in the darker shade of blue), about 36% of
patients survived while about 64% of patients succumbed to the complications
of heart failure and in the training partition, about 32% of patients survived while
about 68% of the patients did not.
Notice that the Relative Bin Difference curve is flat. Click the down arrow next
to Frequency and select Bin Details. This view explains why the curve is flat.
Notice that the absolute difference for both bins is 4.16%.
Click Prediction (Simulation)/Prediction (Training), uncheck Prediction
(Simulation)/Prediction (Training), then select Expression
(Simulation)/Expression (Training) to change the chart view.
Click Expression (Simulation)/Expression (Training) to change the Data view
The chart displays the results of the expression in both datasets. This chart
shows that 2 patients with an ejection fraction less than 20 are predicted to
survive and 1 patient is not.
Frequency Chart with Expression column
Click the down arrow next to Frequency to change the chart view to Relative
Frequency or to change the look by clicking Chart Options. Statistics on the
right of the chart dialog are discussed earlier in this section. For more
information on the generated synthetic data, see the Generate Data chapter that
appears earlier in this guide.
Data Source
Worksheet: Click the down arrow to select the desired worksheet where the
dataset is contained.
Workbook: Click the down arrow to select the desired workbook where the
dataset is contained.
Data range: Select or enter the desired data range within the dataset. This data
range may either be a portion of the dataset or the complete dataset.
#Columns: Displays the number of columns in the data range. This option is
read only.
#Rows In: Training Set, Validation Set, Test Set: Displays the number of
rows in each partition, if it exists. This option is read only.
Variables
First Row Contains Headers: Select this checkbox if the first row in the
dataset contains column headings.
Variables In Input Data: This field contains the list of the variables, or
features, included in the data range.
Selected Variables: This field contains the list of variables, or features, to be
included in DA.
Number of Classes
(Read Only) This value is the number of classes in the output variable.
Binary Classification
Set the Success Class and the Success Probability Cutoff here.
Success Class: This option is selected by default. Select the class to be
considered a “success” or the significant class in the Lift Chart. This option is
enabled when the number of classes in the output variable is equal to 2.
Success Probability Cutoff: Enter a value between 0 and 1 here to denote the
cutoff probability for success. If the calculated probability for success for an
observation is greater than or equal to this value, then a “success” (or a 1) will
be predicted for that observation. If the calculated probability for success for an
observation is less than this value, then a “non-success” (or a 0) will be
predicted for that observation. The default value is 0.5. This option is only
enabled when the # of classes is equal to 2.
Discriminant Analysis dialog, Parameters tab
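The cutoff rule described above can be sketched in a few lines of illustrative Python (the function name is ours, not part of the product; 0.5 matches the dialog's default):

```python
# Sketch of the Success Probability Cutoff rule: predict "success" (1)
# when the calculated probability meets or exceeds the cutoff.
def classify(prob_success, cutoff=0.5):
    """Return 1 ("success") when prob_success >= cutoff, else 0."""
    return 1 if prob_success >= cutoff else 0

# Probabilities of 0.2, 0.5, and 0.8 with the default cutoff:
labels = [classify(p) for p in (0.2, 0.5, 0.8)]
```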
Preprocessing
Partition Data
Partitioning dialog
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
classification method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this feature, see the
Rescale Continuous Data section within the Transform Continuous Data chapter
that occurs earlier in this guide.
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
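Two of the rescaling methods named above can be sketched with stdlib Python (an illustration of the standard formulas, not the product's code; Adjusted Normalization and Unit Norm are omitted):

```python
import statistics

def standardize(values):
    """Standardization: (x - mean) / stdev."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def normalize(values):
    """Normalization: rescale to the [0, 1] range via min/max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = normalize([2.0, 4.0, 6.0])    # [0.0, 0.5, 1.0]
z = standardize([2.0, 4.0, 6.0])       # [-1.0, 0.0, 1.0]
```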
Prior Probability
Click Prior Probability to open the dialog to the left. Three options appear in
the Prior Probability Dialog: Empirical, Uniform and Manual.
• If the first option is selected, Empirical, Analytic Solver Data Science
will assume that the probability of encountering a particular class in the
dataset is the same as the frequency with which it occurs in the training
data.
• If the second option is selected, Uniform, Analytic Solver Data Science
will assume that all classes occur with equal probability.
• Select the third option, Manual, to manually enter the desired
probability for each class. Probabilities must sum up to 1.
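The Empirical and Uniform options can be illustrated with a short sketch (function names are ours; the product computes these internally):

```python
from collections import Counter

def empirical_priors(labels):
    """Empirical option: class frequencies in the training data."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: c / n for cls, c in counts.items()}

def uniform_priors(classes):
    """Uniform option: equal probability for every class."""
    return {cls: 1 / len(classes) for cls in classes}

labels = [0, 0, 0, 1]
emp = empirical_priors(labels)   # {0: 0.75, 1: 0.25}
uni = uniform_priors([0, 1])     # {0: 0.5, 1: 0.5}
```

With the Manual option, the user-entered probabilities would take the place of either dictionary, subject to the sum-to-1 requirement.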
Show DA Model
Select this option to display the functions that define each class in the output.
Simulation Tab
All supervised algorithms include a new Simulation tab in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.) This tab uses the functionality from the Generate Data feature
(described earlier in this guide) to generate synthetic data based on the training
partition, and uses the fitted model to produce predictions for the synthetic data.
The resulting report, DA_Simulation, will contain the synthetic data, the
predicted values and the Excel-calculated Expression column, if present. In
addition, frequency charts containing the Prediction (Simulation)/Prediction
(Training) or Expression (Simulation)/Expression (Training) sources or a
combination of any pair may be viewed, if the charts are of the same type.
Check Simulate Response Prediction to enable the options on the Simulation
tab.
Evaluation: Select Calculate Expression to append an Expression column to
the frequency chart displayed on the DA_Simulation output tab. Expression can
be any valid Excel formula that references a variable and the response as
[@COLUMN_NAME]. Click the Expression Hints button for more information
on entering an expression.
Introduction
Logistic Regression is a regression model where the dependent (target) variable
is categorical. Analytic Solver Data Science provides the functionality to fit a
Logistic Model for binary classification problems, i.e. where the dependent
variable contains exactly two classes. The fitted model can be used to estimate
the posterior probability of the binary outcome based on one or more predictors
(features or independent variables). Examples of such binary outcomes could be
a college acceptance or rejection, loan application approval or rejection, or
classification of a tumor being benign or cancerous.
Logistic Regression is a popular and powerful classification method widely used
in various fields due to the model’s simplicity and high interpretability. Analytic
Solver Data Science implements highly efficient algorithms for Logistic
Regression fitting and scoring procedures, which makes this method applicable
for large datasets. It’s important to note that Logistic Regression is a linear
model and cannot capture the non-linear relationships in the data.
Technically, the Logistic Regression fitting procedure aims to fit the coefficients
(b_i) of a linear combination of predictor variables (X_i) to estimate the log
odds of the binary outcome, i.e. a logit transformation of probability of a
particular outcome (p).
Note the similarity between the formulations of Linear and Logistic Regression.
Both define the response as a linear combination of predictor variables.
However, the linear model predicts a continuous response, which can take any
real value, while Logistic Regression requires a response (probability) to be
bounded in [0,1] range. This is achieved through the logit transformation as
shown below.
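The logit transformation and its inverse can be sketched as follows (illustrative coefficients only, not fitted values):

```python
import math

def logit(p):
    """Log odds of probability p: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def logistic(z):
    """Inverse logit: maps any real value z into the (0, 1) range."""
    return 1 / (1 + math.exp(-z))

# Illustrative coefficients b0 (intercept) and b1 for one predictor x:
b0, b1 = -1.0, 0.5
x = 3.0
p = logistic(b0 + b1 * x)   # predicted probability of success, in (0, 1)
```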
Inputs
1. First, we partition the data into training and validation sets using the
Standard Data Partition defaults of 60% of the data randomly allocated to
the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Science
Partitioning chapter.
Output Worksheets
Output sheets containing the Logistic Regression results will be inserted into
your active workbook to the right of the STDPartition worksheet.
LogReg_Output
This result worksheet includes 7 segments: Output Navigator, Inputs,
Regression Summary, Predictor Screening, Coefficients, Variance-Covariance
Matrix of Coefficients and Multicollinearity Diagnostics.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
LogReg_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Logistic Regression dialog.
LogReg_Output: Inputs
The multiple R-squared value shown here is the r-squared value for a
logistic regression model, defined as
R2 = (D0 - D)/D0,
where D is the Deviance based on the fitted model and D0 is the deviance
based on the null model. The null model is defined as the model containing
no predictor variables apart from the constant.
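The deviance-based R-squared above reduces to a one-line computation; a minimal sketch with made-up deviance values:

```python
def deviance_r2(deviance_fitted, deviance_null):
    """R2 = (D0 - D)/D0, where D is the fitted-model deviance and
    D0 is the null-model (intercept-only) deviance."""
    return (deviance_null - deviance_fitted) / deviance_null

# Illustrative deviances (not from a real run):
r2 = deviance_r2(deviance_fitted=120.0, deviance_null=200.0)   # 0.4
```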
• Predictor Screening: Scroll down to the Predictor Screening report. In
Analytic Solver Data Science, a preprocessing feature selection step is
included to take advantage of automatic variable screening and elimination
using Rank-Revealing QR Decomposition. This allows Analytic Solver
Data Science to identify the variables causing multicollinearity, rank
deficiencies and other problems that would otherwise cause the algorithm to
fail. Information about “bad” variables is used in Variable Selection and
Multicollinearity Diagnostics and in computing other reported statistics.
Included and excluded predictors are shown in the table below. In this
model there were no excluded predictors. All predictors were eligible to
enter the model passing the tolerance threshold of 5.26E-10. This denotes a
tolerance beyond which a variance-covariance matrix is not exactly
singular to within machine precision. The test is based on the diagonal
elements of the triangular factor R resulting from Rank-Revealing QR
Decomposition. Predictors that do not pass the test are excluded.
Note: If a predictor is excluded, the corresponding coefficient estimates
will be 0 in the regression model and the variance-covariance matrix
would contain all zeros in the rows and columns that correspond to the
excluded predictor. Multicollinearity diagnostics, variable selection and
other remaining output will be calculated for the reduced model.
The design matrix may be rank-deficient for several reasons. The most
common cause of an ill-conditioned regression problem is the presence of
feature(s) that can be exactly or approximately represented by a linear
combination of other feature(s). For example, assume that among
predictors you have 3 input variables X, Y, and Z where Z = a * X + b * Y
where a and b are constants. This will cause the design matrix to not have a
full rank. Therefore, one of these 3 variables will not pass the threshold for
entrance and will be excluded from the final regression model.
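The Z = a * X + b * Y situation can be demonstrated with a small rank computation (a plain Gaussian-elimination sketch, standing in for the Rank-Revealing QR test the product actually uses):

```python
def matrix_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination (sufficient for a small demo)."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and abs(m[r][col]) > tol:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Z is an exact linear combination of X and Y (a = 2, b = 3), so the
# three-column design matrix has only two independent columns.
X = [1.0, 2.0, 4.0, 7.0]
Y = [3.0, 1.0, 5.0, 2.0]
Z = [2 * x + 3 * y for x, y in zip(X, Y)]
rank = matrix_rank(list(zip(X, Y, Z)))   # 2, not 3
```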
This table contains the coefficient estimate, the standard error of the
coefficient, the p-value, the odds ratio for each variable (which is simply e^x,
where x is the value of the coefficient) and the confidence interval for the odds.
(Note that for the Intercept term, the Odds Ratio is calculated as e^0.)
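The odds-ratio column is a direct exponentiation of each coefficient estimate; a one-line sketch:

```python
import math

def odds_ratio(coef):
    """Odds ratio for a coefficient estimate b is e**b."""
    return math.exp(coef)

# A coefficient of ln(2) doubles the odds; an intercept's e**0 is 1.
doubled_odds = odds_ratio(math.log(2.0))
intercept_or = odds_ratio(0.0)
```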
Note: If a variable has been eliminated by Rank-Revealing QR
Decomposition, the variable will appear in red in the Coefficients table with
a 0 Coefficient, Std. Error, CI Lower, CI Upper, and RSS Reduction and
N/A for the t-Statistic and P-Values.
• Variance-Covariance Matrix of Coefficients: This square matrix contains
the variances of the fitted model’s coefficient estimates in the center
diagonal elements and the pair-wise covariances between coefficient
estimates in the non-diagonal elements.
LogReg_Output: Variance - Covariance Matrix of Coefficients
LogReg_FS
Since we selected Perform Feature Selection on the Feature Selection dialog,
Analytic Solver Data Science has produced the following output on the
LogReg_FS tab which displays the variables that are included in the subsets.
This table contains the two subsets with the highest Residual Sum of Squares
values.
LogReg_FS: Feature Selection
In this table, every model includes a constant term (since Fit Intercept was
selected) and one or more variables as the additional coefficients. We can use
any of these models for further analysis simply by clicking the hyperlink under
Subset ID in the far left column. The Logistic Regression dialog will open.
Click Finish to run Logistic Regression using the variable subset as listed in the
table.
The choice of model depends on the calculated values of various error values
and the probability. RSS is the residual sum of squares, or the sum of squared
deviations between the predicted probability of success and the actual value (1
or 0). "Mallows Cp" is a measure of the error in the best subset model, relative
to the error incorporating all variables. Adequate models are those for which Cp
is roughly equal to the number of parameters in the model (including the
constant), and/or Cp is at a minimum. "Probability" is a quasi hypothesis test of
the proposition that a given subset is acceptable; if Probability < .05 we can rule
out that subset.
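The RSS definition above, and one common form of Mallows Cp (an assumption on our part; the guide does not spell out the exact variant used), can be sketched as:

```python
def rss(actual, predicted_prob):
    """Residual sum of squares between actual 0/1 outcomes and
    predicted success probabilities."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted_prob))

def mallows_cp(rss_subset, mse_full, n, p):
    """Common Mallows Cp form: RSS/MSE_full - n + 2p, where p counts
    parameters including the constant (illustrative, not the
    product's documented formula)."""
    return rss_subset / mse_full - n + 2 * p

error = rss([1, 0, 1], [0.9, 0.2, 0.6])             # 0.01 + 0.04 + 0.16
cp = mallows_cp(rss_subset=10.0, mse_full=0.5, n=30, p=6)
```

A subset with cp close to p (here, close to 6) would be judged adequate under the rule of thumb quoted above.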
The considerations about RSS, Cp and Probability in this example would lead us
to believe that the subset with 12 coefficients is the best model in this example.
Bin Details: See pertinent information about each bin in the chart.
Chart Options: Use this view to change the color of the bars in the chart.
Chart Options View
• To see both the actual and predicted frequency, click Prediction and select
Actual. This change will be reflected on all charts.
Click Prediction, then select Actual
• True Positive cases (TP) are the number of cases classified as belonging to
the Success class that actually were members of the Success class.
• False Negative cases (FN) are the number of cases that were classified as
belonging to the Failure class when they were actually members of the
Success class (i.e. if a cancerous tumor is considered a "success", then
imagine patients with cancerous tumors who were told their tumors were
benign).
• False Positive (FP) cases were assigned to the Success class but were
actually members of the Failure group (i.e. patients who were told they
tested positive for cancer when, in fact, their tumors were benign).
• True Negative (TN) cases were correctly assigned to the Failure group.
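The four cell counts, and the Accuracy metric reported in the Classification Summary, follow directly from these definitions; a minimal tallying sketch (function names are ours):

```python
def confusion_counts(actual, predicted, success=1):
    """Tally TP, FN, FP, TN as defined above."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == success and p == success)
    fn = sum(1 for a, p in pairs if a == success and p != success)
    fp = sum(1 for a, p in pairs if a != success and p == success)
    tn = sum(1 for a, p in pairs if a != success and p != success)
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

actual    = [1, 1, 0, 0]
predicted = [1, 0, 1, 0]
tp, fn, fp, tn = confusion_counts(actual, predicted)
acc = accuracy(tp, fn, fp, tn)
```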
• Classification Summary: This report contains the confusion matrix for the
validation data set.
LogReg_ValidationScore: Classification Summary
Decile-wise Lift Chart, ROC Curve, and Lift Charts for Valid. Partition
Click the down arrow and select Gain Chart from the menu. In this chart, the
True Positive Rate or Sensitivity is plotted against the Predictive Positive Rate
or Support.
LogReg_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, LogReg_Simulation, when Simulate Response Prediction is selected
on the Simulation tab of the Logistic Regression dialog.
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field on the Simulation tab of
the Logistic Regression dialog.
Expression: IF([@RM]>5,[@CAT. MEDV],"Tracts <= 5 Rooms")
The rest of the data in this report is synthetic data, generated using the Generate
Data feature described in the chapter with the same name, that appears earlier in
this guide.
The chart that is displayed once this tab is selected contains frequency
information pertaining to the output variable in the actual data and the synthetic
data. In the screenshot below, the bars in the darker shade of blue are based on
the synthetic data. The bars in the lighter shade of blue are based on the
predicted values for the training partition. In the synthetic data, about 70% of
the housing tracts where CAT. MEDV = 0, have more than 5 rooms and about
85% of the housing tracts in the training partition where CAT. MEDV = 0 have
more than 5 rooms.
Frequency Chart for LogReg_Simulation output
Click the down arrow next to Frequency and select Bin Details. Notice that the
absolute difference in each bin is the same. Hence the flat Relative Bin
Difference curve in the chart.
Expression view
This chart shows that the relative bin differences are decreasing. Only about
15% of the housing tracts in the synthetic data were predicted as having 5
rooms or less. Less than 5% of the housing tracts in the training data were
predicted as having 5 rooms or less.
Selected Variables
Variables listed here will be utilized in the Logistic Regression algorithm.
Weight Variable
One major assumption of Logistic Regression is that each observation provides
equal information. Analytic Solver Data Science offers an opportunity to
provide a Weight variable. Using a Weight variable allows the user to allocate a
weight to each record. A record with a large weight will influence the model
more than a record with a smaller weight.
Output Variable
Select the variable whose outcome is to be predicted. The classes in the output
variable must be equal to 2.
Number of Classes
Displays the number of classes in the Output variable.
Success Class
This option is selected by default. Select the class to be considered a “success”
or the significant class in the Lift Chart. This option is enabled when the
number of classes in the output variable is equal to 2.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
classification method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
Note: Rescaling has no substantial effect in Logistic Regression other than proportional
scaling.
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Prior Probability
Click Prior Probability to open the dialog below. Three options appear in the
Prior Probability Dialog: Empirical, Uniform and Manual.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by selecting Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
classification method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Fit Intercept
When this option is selected, the default setting, Analytic Solver Data Science
will fit the Logistic Regression intercept. If this option is not selected, Analytic
Solver Data Science will force the intercept term to 0.
Iterations (Max)
Estimating the coefficients in the Logistic Regression algorithm requires an
iterative non-linear maximization procedure. You can specify a maximum
number of iterations to prevent the program from getting lost in very lengthy
iterative loops. This value must be an integer greater than 0 and less than or
equal to 100 (1 <= value <= 100).
Multicollinearity Diagnostics
At times, variables can be highly correlated with one another which can result in
large standard errors for the affected coefficients. Analytic Solver Data Science
will display information useful in dealing with this problem if Multicollinearity
Diagnostics is selected.
Analysis Of Coefficients
When this option is selected, Analytic Solver Data Science will produce a table
with all coefficient information such as the Estimate, Odds, Standard Error, etc.
When this option is not selected, Analytic Solver Data Science will only print
the Estimates.
Feature Selection
When you have a large number of predictors and you would like to limit the
model to only significant variables, click Feature Selection to open the Feature
Selection dialog and select Perform Feature Selection at the top of the dialog.
Analytic Solver Data Science offers five different selection procedures for
selecting the best subset of variables.
• Backward Elimination in which variables are eliminated one at a time,
starting with the least significant. If this procedure is selected, FOUT
is enabled. A statistic is calculated when variables are eliminated. For
a variable to leave the regression, the statistic’s value must be less than
the value of FOUT (default = 2.71).
• Forward Selection in which variables are added one at a time, starting
with the most significant. If this procedure is selected, FIN is enabled.
On each iteration of the Forward Selection procedure, each variable is
examined for the eligibility to enter the model. The significance of
variables is measured as a partial F-statistic. Given a model at a current
iteration, we perform an F Test, testing the null hypothesis stating that
the regression coefficient would be zero if added to the existing set of
variables, and an alternative hypothesis stating otherwise. Each variable
is examined to find the one with the largest partial F-Statistic. The
decision rule for adding this variable into a model is: Reject the null
hypothesis if the F-Statistic for this variable exceeds the critical value
chosen as a threshold for the F Test (FIN value), or Accept the null
hypothesis if the F-Statistic for this variable is less than a threshold. If
the null hypothesis is rejected, the variable is added to the model and
selection continues in the same fashion, otherwise the procedure is
terminated.
• Sequential Replacement in which variables are sequentially replaced
and replacements that improve performance are retained.
• Stepwise selection is similar to Forward selection except that at each
stage, Analytic Solver Data Science considers dropping variables that
are not statistically significant. When this procedure is selected, the
Stepwise selection options FIN and FOUT are enabled. In the stepwise
selection procedure a statistic is calculated when variables are added or
eliminated. For a variable to come into the regression, the statistic’s
value must be greater than the value for FIN (default = 3.84). For a
variable to leave the regression, the statistic’s value must be less than
the value of FOUT (default = 2.71). The value for FIN must be greater
than the value for FOUT.
• Best Subsets where searches of all combinations of variables are
performed to observe which combination has the best fit. (This option
can become quite time consuming depending on the number of input
variables.) If this procedure is selected, Number of best subsets is
enabled.
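The Forward Selection decision rule described above amounts to a threshold test on the largest partial F-statistic; a minimal sketch of one step (variable names and F values are illustrative):

```python
def forward_step(f_statistics, fin=3.84):
    """One Forward Selection iteration: find the candidate with the
    largest partial F-statistic and admit it only if it exceeds FIN;
    otherwise the procedure terminates."""
    best_var = max(f_statistics, key=f_statistics.get)
    if f_statistics[best_var] > fin:
        return best_var     # null hypothesis rejected: variable enters
    return None             # no candidate passes: selection stops

# Illustrative partial F-statistics for three candidate variables:
candidates = {"X1": 5.2, "X2": 2.9, "X3": 4.1}
entered = forward_step(candidates)
```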
Simulation Tab
All supervised algorithms include a new Simulation tab in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.) This tab uses the functionality from the Generate Data feature
Introduction
K-nearest neighbors is a simple but powerful classifier. This method classifies a
given record based on the predominant classification of its "k" nearest neighbor
records.
The k-Nearest Neighbors Classifier performs the following steps for each record
in the dataset.
1. The Euclidean Distance between the given record and all remaining records
is calculated. In order for this distance measure to be accurate, all variables
must be scaled appropriately.
2. The classification of the "k" nearest neighbors is examined. The
predominant classification is assigned to the given row.
3. This procedure is repeated for all remaining rows.
Analytic Solver Data Science allows the user to select a maximum value for k
and builds models in parallel on all values of k up to the maximum specified
value. Additional scoring can be performed on the best of these models.
As k increases, the computing time will also increase. If a high value of k is
selected, such as 18 or 20, the risk of underfitting the data is high. Conversely, a
low value of k, such as 1 or 2, runs the risk of overfitting the data. In most
applications, k is on the order of tens rather than hundreds or thousands.
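The three steps above can be sketched in a few lines of Python (an illustration of the algorithm, not the product's implementation; the training data is assumed to be appropriately scaled):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by the predominant class among its k nearest
    training records, using Euclidean distance."""
    # Step 1: distance from the query record to every training record.
    dists = sorted((math.dist(x, query), label) for x, label in train)
    # Step 2: majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny illustrative training set of (features, class) pairs:
train = [((1.0, 1.0), "a"), ((1.3, 0.9), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.8), "b")]
label = knn_classify(train, (1.2, 1.1), k=3)   # two "a" votes beat one "b"
```

Step 3 is just this call repeated for every remaining row.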
Inputs
1. Partition the data using a standard partition with percentages of 60%
training and 40% validation (the default settings for the Automatic choice).
For more information on how to partition a dataset, please see the previous
Data Science Partitioning chapter.
For this example, click Done to select the default of Empirical and close the
dialog.
k-Nearest Neighbors dialog, Parameters tab
11. Click Next to advance to the Simulation tab. This tab is disabled in Analytic
Solver Optimization, Analytic Solver Simulation and Analytic Solver
Upgrade.
12. Select Simulate Response Prediction to enable all options on the
Simulation tab of the k-Nearest Neighbors dialog.
Simulation tab: All supervised algorithms include a new Simulation tab.
This tab uses the functionality from the Generate Data feature (described
earlier in this guide) to generate synthetic data based on the training
partition, and uses the fitted model to produce predictions for the synthetic
data. The resulting report, _Simulation, will contain the synthetic data, the
predicted values and the Excel-calculated Expression column, if present. In
addition, frequency charts containing the Predicted, Training, and
Expression (if present) sources or a combination of any pair may be viewed,
if the charts are of the same type.
Output Worksheets
Worksheets containing the results are inserted to the right of the STDPartition
worksheet.
KNNC_Output
Double click the KNNC_Output sheet. The Output Navigator, which appears at
the top of each output worksheet, is followed by the Inputs section containing
all of our inputs.
Scroll down a bit further to view the Search log. (This output is produced
because we selected Search 1..k on the Parameters tab. If this option had not
been selected, this output would not be produced.)
Search Log output
The Search Log for the different k's lists the % Misclassification errors for all
values of k for the validation data set, if present. The k with the smallest %
Misclassification is selected as the “Best k”. Scoring is performed later using
this best value of k.
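Selecting the "Best k" from such a log is a simple minimization; a sketch with made-up misclassification percentages:

```python
# Illustrative Search Log: k -> % misclassification on the validation set.
search_log = {1: 6.7, 3: 3.3, 5: 1.7, 7: 3.3}

# The "Best k" is the k with the smallest % misclassification.
best_k = min(search_log, key=search_log.get)   # 5 in this example
```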
Bin Details: This view displays information pertaining to each bin in the
chart.
Chart Options: Use this view to change the color of the bars in the chart.
Chart Options View
• To see both the actual and predicted frequency, click Prediction and select
Actual. This change will be reflected on all charts.
Click Predicted/Actual to change view
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct = 85 and %Correct = 94.4%): Refers to the ability
of the classifier to predict a class label correctly.
• Classification Details: This table displays how each observation in the
training data was classified. The probability values for success in each
record are shown after the predicted class and actual class columns.
KNNC_ValidationScore
Click the KNNC_ValidationScore tab to view the newly added Output Variable
frequency chart, the Validation: Classification Summary and the Validation:
Classification Details report. All calculations, charts and predictions on this
worksheet apply to the Validation data.
• Frequency Charts: The output variable frequency chart opens
automatically once the KNNC_ValidationScore worksheet is selected. To
close this chart, click the “x” in the upper right hand corner. To reopen,
click onto another tab and then click back to the KNNC_ValidationScore
tab.
Click the Frequency chart to display the frequency for both the predicted
and actual values of the output variable, along with various statistics such as
count, number of classes and the mode. Select Relative Frequency from
the drop down menu on the right to see the relative frequencies of the
output variable for both actual and predicted. See above for more
information on this chart.
KNNC_ValidationScore Frequency Chart
• Classification Summary: This report contains the confusion matrix for the
validation data set.
When the fitted model was applied to the Validation partition, 1 record was
misclassified.
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct = 59/60 and %Correct = 98.3%): Refers to the
ability of the classifier to predict a class label correctly.
• Classification Details: This table displays how each observation in the
validation data was classified. The probability values for success in each
record are shown after the predicted class and actual class columns. Note
that the largest PostProb value depicts the predicted value.
KNNC_ValidationScore: Validation: Classification Details
KNNC_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, KNNC_Simulation, when Simulate Response Prediction is selected
on the Simulation tab of the k-Nearest Neighbors dialog in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.)
This report contains the synthetic data, the actual output variable values for the
training partition and the Excel-calculated Expression column, if populated in
the dialog. A chart is also displayed with the option to switch between the
Synthetic, Training, and Expression sources or a combination of two, as long as
they are of the same type.
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field on the Simulation tab of
the k-Nearest Neighbors dialog:
IF([@Sepal_length]>6, [@Species_name], “Sepal_length <= 6”)
The results in this column are either Setosa, Verginica, or Versicolor, if a
record's Sepal_length is greater than 6, or "Sepal_length <= 6", if the record's
Sepal_length is less than or equal to 6.
The remainder of the data in this report is synthetic data, generated using the
Generate Data feature described in the chapter of the same name, which appears
earlier in this guide.
The chart that is displayed once this tab is selected contains frequency
information pertaining to the actual output variable in the training partition, the
synthetic data and the expression, if it exists. In the screenshot below, the bars
in the darker shade of blue are based on the synthetic data. The bars in the
lighter shade of blue are based on the actual values in the training partition.
Frequency Chart for KNNC_Simulation output
This chart compares the number of records in the synthetic data vs. the result of
the expression on the synthetic data. The dark blue columns represent the
predictions in the synthetic data. In the 100 synthetic data records, 39 records
are predicted to be classified as Setosa, 35 are predicted to be classified as
Virginica and 26 are predicted to be classified as Versicolor. The light blue
columns represent the result of the expression as applied to the synthetic data.
In other words, out of the 39 records in the synthetic data predicted to be classified
as Setosa, only 4 are predicted to have Sepal_length values greater than 6.
Click the down arrow next to Frequency to change the chart view to Relative
Frequency or to change the look by clicking Chart Options. Statistics on the
right of the chart dialog are discussed earlier in this section. For more
information on the generated synthetic data, see the Generate Data chapter that
appears earlier in this guide.
For information on Stored Model Sheets, in this example DA_Stored, please
refer to the “Scoring New Data” chapter within the Analytic Solver Data
Science User Guide.
Output variable
The variable to be classified is entered here.
Number of Classes
The number of classes in the output variable appears here.
Success Class
This option is selected by default. Select the class to be considered a “success”
or the significant class in the Lift Chart. This option is enabled when the
number of classes in the output variable is equal to 2.
# Neighbors (k)
This is the parameter k in the k-Nearest Neighbor algorithm.
Prior Probabilities
Analytic Solver Data Science can incorporate prior assumptions about how
frequently the different classes occur. Three options are available: Empirical,
Uniform and Manual.
• If Empirical is selected, Analytic Solver Data Science will assume that
the probability of encountering a particular class in the dataset is the
same as the frequency with which it occurs in the training data.
• If Uniform is selected, Analytic Solver Data Science will assume that
all classes occur with equal probability.
• If Manual is selected, the user can enter the desired class and
probability value.
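The difference between the Empirical and Uniform options can be illustrated with a short sketch using toy labels (not from the example dataset). The Empirical option simply uses training-set frequencies, while the Uniform option ignores them:

```python
from collections import Counter

def empirical_priors(labels):
    """Empirical option: class prior = class frequency in the training data."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: c / n for cls, c in counts.items()}

def uniform_priors(classes):
    """Uniform option: every class gets the same prior probability."""
    return {cls: 1 / len(classes) for cls in classes}

# Toy training labels, purely for illustration
train_labels = ["Setosa"] * 6 + ["Virginica"] * 3 + ["Versicolor"] * 1
print(empirical_priors(train_labels))
# {'Setosa': 0.6, 'Virginica': 0.3, 'Versicolor': 0.1}
print(uniform_priors(["Setosa", "Virginica", "Versicolor"]))
```

The Manual option corresponds to supplying the prior dictionary yourself instead of computing it.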
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by selecting Partition Options on the
Parameters tab. If this option is selected, Analytic Solver Data Science will
partition your dataset (according to the partition options you set) immediately
before running the classification method. If partitioning has already occurred on
the dataset, this option will be disabled. For more information on partitioning,
please see the Data Science Partitioning chapter.
Introduction
Classification tree methods (also known as decision tree methods) are a good
choice when the data science task is classification or prediction of outcomes.
The goal of this algorithm is to generate rules that can be easily understood,
explained, and translated into SQL or a natural query language.
A classification tree labels records and assigns them to discrete classes, and
can also provide a measure of confidence that the classification is correct. The
tree is built through a process known as binary recursive partitioning. This is an
iterative process of splitting the data into partitions, and then splitting it up
further on each of the branches.
Initially, a training set is created where the classification label (say, "purchaser"
or "non-purchaser") is known (pre-classified) for each record. In the next step,
the algorithm systematically assigns each record to one of two subsets on some
basis (for example, income >= $75,000 or income < $75,000). The objective is
to attain as homogeneous a set of labels (say, "purchaser" or "non-purchaser") as
possible in each partition. This splitting (or partitioning) is then applied to each
of the new partitions. The process continues until no more useful splits can be
found. The heart of the algorithm is the rule that determines the initial split
(see figure below).
As explained above, the process starts with a training set consisting of pre-
classified records (target field or dependent variable with a known class or label
such as "purchaser" or "non-purchaser"). The goal is to build a tree that
distinguishes among the classes. For simplicity, assume that there are only two
target classes and that each split is a binary partition. The splitting criterion
easily generalizes to multiple classes, and any multi-way partitioning can be
achieved through repeated binary splits. To choose the best splitter at a node, the
algorithm considers each input field in turn. In essence, each field is sorted.
Then, every possible split is tried and considered, and the best split is the one
which produces the largest decrease in diversity of the classification label within
the resulting partitions.
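The split search described in the paragraph above can be sketched as follows. This illustration uses Gini impurity as one common measure of label diversity; the guide does not specify the exact measure Analytic Solver uses, so treat this as a sketch of the general technique, with toy data:

```python
def gini(labels):
    """Gini impurity of a set of class labels (0 = perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Sort one input field, try every threshold, and return the split that
    most reduces the weighted impurity of the two resulting partitions."""
    pairs = sorted(zip(values, labels))
    n, parent = len(pairs), gini(labels)
    best = (None, 0.0)  # (threshold, impurity decrease)
    for i in range(1, n):
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        if parent - child > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, parent - child)
    return best

income = [40, 50, 60, 80, 90, 120]                  # toy input field ($000s)
label = ["non-purchaser"] * 3 + ["purchaser"] * 3   # toy class labels
print(best_split(income, label))  # (70.0, 0.5) — a clean split at income >= 70
```

Repeating this search on each resulting partition is exactly the binary recursive partitioning described above.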
Inputs
1. First, we partition the data into training and validation sets using the
Standard Data Partition defaults of 60% of the data randomly allocated to
the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Science
Partitioning chapter.
Standard Data Partition dialog
For more information on the remaining options shown on this dialog in the
Distribution Fitting, Correlation Fitting and Sampling sections, see the
Generate Data chapter that appears earlier in this guide.
19. Click Finish to run Classification Trees on the example dataset. Output
worksheets are inserted to the right of the STDPartition worksheet.
Output
Output containing the results from Classification Trees will be inserted into the
active workbook to the right of the STDPartition worksheet and also in the
Model tab of the task pane under Reports – Classification Tree.
CT_Output
This result worksheet includes 4 segments: Output Navigator, Inputs, Training
Log, Prune Log and Feature Importance.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
CT_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Classification Tree dialog.
• Training Log and Prune Log: The training log shows the
misclassification (error) rate as each additional node is added to the tree,
starting with 0 nodes and ending with 17. The final error rate is -1.56E-17,
effectively zero. Scoring will be performed using this tree.
Note that since scoring on the training data will be performed using this tree, the total % Error
for the Confusion Matrix on the CT_TrainingScore output sheet will be equal to the error rate
of the fully grown tree, reported as a percentage.
Analytic Solver Data Science chooses the number of decision nodes for the
pruned tree and the minimum error tree from the values of Validation MSE.
In the Prune Log shown above, the smallest Validation MSE belongs to
the trees with 4, 5, 6, 7, 8, 9, 10, 11 and 12 decision nodes. When there is a
tie, meaning when multiple trees have the exact same error rate, the tree
with the smallest number of nodes is selected. In this case, the tree with four
(4) decision nodes is the Minimum Error Tree, the tree with the smallest
misclassification error in the validation dataset.
• Feature Importance: Select Feature Importance to include the Features
Importance table in the output. This table displays the variables that are
included in the model along with their Importance value. The larger the
Importance value, the bigger the influence the variable has on the predicted
classification. In this instance, the census tracts with homes with many
rooms will be predicted as having a larger selling price.
CT_FullTree
Click CT_FullTree to view the full tree.
Recall that the objective of this example is to classify each case as a 0 (low
median value) or a 1 (high median value). Consider the top decision node
(denoted by a circle). The label above this node indicates the variable
represented at this node (i.e. the variable selected for the first split), in this case
RM (Average # of Rooms). The value inside the node indicates the split
threshold. (Hover over the decision node to read the decision rule.) If the RM
value for a specific record is greater than or equal to 6.78 (RM >= 6.78), the
record will be assigned to the right node. If the RM value for the record is less
than 6.78, the value will be assigned to the left node. There are 51 records with
values for the RM variable greater than or equal to 6.78 while 253 records
contained RM values less than 6.78. We can think of records with an RM value
less than 6.78 (RM < 6.78) as tentatively classified as "0" (low median value).
Any record where RM >= 6.78 can be tentatively classified as a "1" (high
median value).
Let’s follow the tree as it descends to the left for a couple levels. The 253
records with RM values less than 6.78 are further split as we move down the
tree. The second split occurs with the LSTAT variable (percent of the
population that is of lower socioeconomic status). The LSTAT values for 4
records (out of 253) fell below the split value of 4.07. These records are
tentatively classified as a “1” – high median value. The LSTAT values for the
remaining 249 records are greater than or equal to 4.07, and are tentatively
classified as “0" – low median value.
Following the tree to the left, the 4 records with a LSTAT value < 4.07 are split
on the CRIM variable node, CRIM = per capita crime rate by town. Records
with CRIM values greater than or equal to 0.33 are classified as a 1, and records with
CRIM values less than 0.33 are classified as a 0, in the terminal nodes. No
further splits occur on terminal nodes.
Node ID 1: The first entry in this table shows a split on the RM variable with a
split value of 6.776 (rounded to 6.78). The 304 total records in the training
partition and 202 records in the validation partition were split between nodes 2
(LeftChild ID) and 3 (Rightchild ID).
Node ID 2:
• In the training partition, 253 records were assigned to this node (from
node 1) which has a “0” value (Response). These cases were split on
the LSTAT variable using a value of 4.07: 249 records were assigned
to node 5 and 4 records were assigned to node 4.
• In the Validation Partition, 155 records were assigned to this node
(from node 1). These cases were split on the same variable (LSTAT)
and value (4.07): 154 records were assigned to node 5 and 1 record
was assigned to node 4.
Node ID 4:
• In the training partition, 4 records, assigned from Node 2, were split on
the CRIM variable using a value of 0.33. This node has a tentative
classification of 1 (Response). Three records were assigned to node 9
and classified as 1; one record was assigned to node 8 and classified as a
0. Both nodes 8 and 9 are terminal nodes.
• In the validation partition, 1 record was assigned from Node 2. This
record was assigned to terminal node 8 using the CRIM variable and a
value of 0.33 and classified as 0.
The table can be used to follow the tree all the way down to level 33.
CT_BestTree
Click the CT_BestTree tab to view the Best Pruned Tree and the Rules for the
Best Pruned Tree.
The Validation Partition records are split in the Tree according to the following
rules:
• Node 1: 202 cases were split using the RM variable with a value of
6.78.
o 155 records were assigned to Node 2, a terminal node, and
classified as 0.
o 47 records were assigned to Node 3, a decision node, and
tentatively classified as 1.
• Node 3: 47 cases were split using the LSTAT variable with a value of
9.65.
o 43 records were assigned to Node 4, a terminal node, and
classified as 1.
o 4 records were assigned to Node 5, a decision node, and
tentatively classified as 0.
• Node 5: 4 records were split using the RAD variable with a value of
5.5.
o All 4 records were assigned to node 7 and classified as 0.
CT_MinErrorTree
Click CT_MinErrorTree to view the Minimum Error Tree.
The Validation Partition records are split in the Min Error Tree according to the
following rules:
• Node 1: 202 records were split using the RM variable with a value of
6.78.
o 155 records were assigned to Node 2, a decision node, and
tentatively classified as 0.
o 47 records were assigned to Node 3, a decision node, and
tentatively classified as 1.
• Node 2: 155 records were split using the LSTAT variable with a value
of 4.07.
o 1 record was assigned to Node 4, a terminal node, and
classified as 1.
o 154 records were assigned to Node 5, a terminal node, and
classified as 0.
• Node 3: 47 records were split using the LSTAT variable (again) with a
value of 9.65.
o 43 records were assigned to node 6, a terminal node, and
classified as 1.
o 4 records were assigned to node 7, a decision node, and
classified as 0.
• Node 7: 4 records were split using the RAD variable using a value of
5.5.
o All 4 records were assigned to node 9, a terminal node, and
classified as 0.
Bin Details: Use this view to find metrics related to each bin in the chart.
Bin Details view
Chart Options: Use this view to change the color of the bars in the chart.
Chart Options View
• To see both the actual and predicted frequency, click Prediction and select
Actual. This change will be reflected on all charts.
Selecting Prediction/Actual
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct and %Correct): 100%. Refers to the ability of the
classifier to predict a class label correctly.
• Specificity: 1. Also called the true negative rate, measures the
percentage of failures correctly identified as failures.
Specificity (SPC) or True Negative Rate = TN / (FP + TN)
• Recall (or Sensitivity): 1. Measures the percentage of actual positives
which are correctly identified as positive (i.e. the proportion of people
who experienced catastrophic heart failure who were predicted to have
catastrophic heart failure).
Sensitivity or True Positive Rate (TPR) = TP / (TP + FN)
• Precision: 1. The probability of correctly identifying a randomly
selected record as one belonging to the Success class.
Precision = TP / (TP + FP)
• F-1 Score: 1. Fluctuates between 1 (a perfect classification) and 0;
defines a measure that balances precision and recall.
F1 = 2 * TP / (2 * TP + FP + FN)
• Success Class and Success Probability: Selected on the Data tab of the
Classification Tree dialog.
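All of these metrics follow directly from the four cells of the confusion matrix. The sketch below uses made-up counts rather than values from the example output:

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from a 2x2 confusion matrix, as defined above."""
    return {
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
        "specificity": tn / (fp + tn),          # true negative rate
        "recall":      tp / (tp + fn),          # sensitivity / true positive rate
        "precision":   tp / (tp + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),
    }

# Toy matrix: 40 true positives, 10 false negatives, 5 false positives, 45 true negatives
print(classification_metrics(tp=40, fn=10, fp=5, tn=45))
```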
CT_ValidationScore
Click the CT_ValidationScore tab to view the newly added Output Variable
frequency chart, the Validation: Classification Summary and the Validation:
Classification Details report. All calculations, charts and predictions on this
worksheet apply to the Validation data.
• Frequency Charts: The output variable frequency chart opens
automatically once the CT_ValidationScore worksheet is selected. To close
this chart, click the “x” in the upper right hand corner. To reopen, click
onto another tab and then click back to the CT_ValidationScore tab.
Click the Frequency chart to display the frequency for both the predicted
and actual values of the output variable, along with various statistics such as
count, number of classes and the mode. Select Relative Frequency from
the drop down menu, on the right, to see the relative frequencies of the
output variable for both actual and predicted. See above for more
information on this chart.
CT_ValidationScore Frequency Chart
• Classification Summary: This report contains the confusion matrix for the
validation data set.
Decile-wise Lift Chart, ROC Curve, and Lift Charts for Valid Partition
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted in decreasing order using the predicted output variable value.
After sorting, the actual outcome values of the output variable are cumulated
and the lift curve is drawn as the cumulative number of cases in decreasing
probability (on the x-axis) vs the cumulative number of true positives on the y-
axis. The baseline (red line connecting the origin to the end point of the blue
line) is a reference line. For a given number of cases on the x-axis, this line
represents the expected number of successes if no model existed, and instead
cases were selected at random. This line can be used as a benchmark to measure
the performance of the fitted model. The greater the area between the lift curve
and the baseline, the better the model. In the Training Lift chart, if we selected
100 cases as belonging to the success class and used the fitted model to pick the
members most likely to be successes, the lift curve tells us that we would be
right on all of them. Conversely, if we selected 100 random cases, we could
expect to be right on about 15 of them.
The decile-wise lift curve is drawn as the decile number versus the cumulative
actual output variable value divided by the decile's mean output variable value.
The bars in this chart indicate the factor by which the model outperforms a
random assignment, one decile at a time. Refer to the validation graph above.
In the first decile, taking the most expensive predicted housing prices in the
dataset, the predictive performance of the model is about 5 times better than
simply assigning a random predicted value.
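The decile-wise calculation can be sketched as below, with toy scores and outcomes. Each bar is the mean actual outcome in a decile (records sorted by descending predicted score) divided by the overall mean outcome:

```python
def decile_lift(scores, actuals, n_deciles=10):
    """Decile-wise lift: mean actual outcome per decile (sorted by descending
    predicted score) divided by the overall mean outcome."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    overall = sum(actuals) / len(actuals)
    size = len(scores) // n_deciles
    lifts = []
    for d in range(n_deciles):
        idx = order[d * size:(d + 1) * size]
        lifts.append((sum(actuals[i] for i in idx) / size) / overall)
    return lifts

# Toy data: 20 records whose scores track the 0/1 outcome; quartiles for brevity
scores  = [0.9, 0.8, 0.7, 0.6] * 5
actuals = [1, 1, 0, 0] * 5
print(decile_lift(scores, actuals, n_deciles=4))  # [2.0, 2.0, 0.0, 0.0]
```

A lift of 2.0 in the first bar means the model finds successes at twice the rate of random assignment in that slice of the data.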
Select Lift Chart (Alternative) to display Analytic Solver Data Science's new
Lift Chart. Each of these charts consists of an Optimum Classifier curve, a
Fitted Classifier curve, and a Random Classifier curve. The Optimum Classifier
curve plots a hypothetical model that would provide perfect classification for
our data. The Fitted Classifier curve plots the fitted model and the Random
Classifier curve plots the results from using no model or by using a random
guess (i.e. for x% of selected observations, x% of the total number of positive
observations are expected to be correctly classified).
The Alternative Lift Chart plots Lift against the Predictive Positive Rate or
Support.
Lift Chart (Alternative) and Gain Chart for Training Partition
CT_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, CT_Simulation, when Simulate Response Prediction is selected on
the Simulation tab of the Classification Tree dialog in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.)
This report contains the synthetic data, the predicted values for the training
partition (using the fitted model) and the Excel – calculated Expression results,
if populated in the dialog. A chart is displayed with the option to switch
between the Predicted Simulation and Training sources and the Expression
results for the Simulation and Training data, or a combination of two as long as
they are of the same type.
Synthetic Data
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field on the Simulation tab of
the Classification Tree dialog:
IF([@RM]>=5,[@CAT.MEDV],"Tracts < 5 Rooms")
The results in this column are either 0, 1, or "Tracts < 5 Rooms".
The remainder of the data in this report is synthetic data, generated using the
Generate Data feature described in the chapter with the same name, that appears
earlier in this guide.
The chart that is displayed once this tab is selected, contains frequency
information pertaining to the output variable in the training partition and the
synthetic data. In the screenshot below, the bars in the darker shade of blue are
based on the synthetic data. The bars in the lighter shade of blue are based on
the predictions for the training partition. In the synthetic data, a little over 70%
of the census tracts are predicted to have a classification equal to 0, or low
median value, while almost 30% of census tracts are predicted to have a
classification equal to 1, or high median value.
Data dialog
Classification Tree dialog, Data tab
Selected Variables
Variables selected to be included in the output appear here.
Output Variable
The dependent variable or the variable to be classified appears here.
Categorical Variables
Place categorical variables from the Variables listbox to be included in the
model by clicking the > command button. This classification algorithm will
accept non-numeric categorical variables.
Number of Classes
Displays the number of classes in the Output variable.
Success Class
This option is selected by default. Select the class to be considered a “success”
or the significant class in the Lift Chart. This option is enabled when the
number of classes in the output variable is equal to 2.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
classification method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Rescale Data
Rescaling Data dialog
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
Notes on Rescaling and the Simulation functionality
If Rescale Data is turned on, i.e. if Rescale Data is selected on the Rescaling dialog as
shown in the screenshot to the left, then “Min/Max as bounds” on the Simulation tab will
not be turned on by default. A warning will be reported in the Log on the CT_Simulation
output sheet, as shown below.
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Prior Probability
Three options appear in the Prior Probability Dialog: Empirical, Uniform and
Manual.
• If the first option is selected, Empirical, Analytic Solver Data Science
will assume that the probability of encountering a particular class in the
dataset is the same as the frequency with which it occurs in the training
data.
• If the second option is selected, Uniform, Analytic Solver Data Science
will assume that all classes occur with equal probability.
• Select the third option, Manual, to manually enter the desired class and
probability value.
Trees to Display
Select Trees to Display to select the types of trees to display: Fully Grown, Best
Pruned, Minimum Error or User Specified.
• Select Fully Grown to “grow” a complete tree using the training data.
• Select Best Pruned to create a tree with the fewest number of nodes,
subject to the constraint that the error be kept below a specified level
(minimum error rate plus the standard error of that error rate).
Introduction
Suppose your data consists of fruits, described by their color and shape.
Bayesian classifiers operate by saying "If you see a fruit that is red and round,
which type of fruit is it most likely to be? In the future, classify red and round
fruit as that type of fruit."
A difficulty arises when you have more than a few variables and classes – an
enormous number of observations (records) would be required to estimate these
probabilities.
The Naive Bayes classification method avoids this problem by not requiring a
large number of observations for each possible combination of the variables.
Rather, the variables are assumed to be independent of one another and,
therefore, the probability that a fruit that is red, round, firm, 3" in diameter, etc.
will be an apple can be calculated from the independent probabilities that a fruit
is red, that it is round, that it is firm, that it is 3" in diameter, etc.
In other words, Naïve Bayes classifiers assume that the effect of a variable value
on a given class is independent of the values of other variables. This assumption
is called class conditional independence and is made to simplify the
computation. In this sense, it is considered to be “Naïve”.
This assumption is a fairly strong assumption and is often not applicable.
However, bias in estimating probabilities often may not make a difference in
practice -- it is the order of the probabilities, not their exact values, which
determine the classifications.
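Under the class-conditional independence assumption, training and classification reduce to counting and multiplying probabilities. The following toy sketch (hypothetical fruit data, with Laplace smoothing added here to avoid zero probabilities, a common but not guide-specified refinement) illustrates the idea:

```python
import math
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Tiny categorical Naive Bayes: class priors plus per-variable
    conditional counts, assuming class-conditional independence."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)
    for rec, c in zip(records, labels):
        for var, val in rec.items():
            cond[(c, var)][val] += 1
    return priors, cond

def classify(rec, priors, cond):
    """Score each class by summing log prior and log conditional probabilities."""
    scores = {}
    for c, p in priors.items():
        log_p = math.log(p)
        for var, val in rec.items():
            counts = cond[(c, var)]
            # Laplace smoothing so unseen values do not zero out the product
            log_p += math.log((counts[val] + 1) / (sum(counts.values()) + 2))
        scores[c] = log_p
    return max(scores, key=scores.get)

fruit = [{"color": "red", "shape": "round"}, {"color": "yellow", "shape": "long"},
         {"color": "red", "shape": "round"}, {"color": "yellow", "shape": "long"}]
kinds = ["apple", "banana", "apple", "banana"]
priors, cond = train_nb(fruit, kinds)
print(classify({"color": "red", "shape": "round"}, priors, cond))  # apple
```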
Studies comparing classification algorithms have found the Naïve Bayesian
classifier to be comparable in performance with classification tree and neural
network classifiers. It has also been found that these classifiers exhibit high
accuracy and speed when applied to large databases.
A more technical description of the Naïve Bayesian classification method
follows.
Bayes Theorem
Let X be the data record (case) whose class label is unknown. Let H be some
hypothesis, such as "data record X belongs to a specified class C." For
classification, we want to determine P (H|X) -- the probability that the
hypothesis H holds, given the observed data record X.
P (H|X) is the posterior probability of H conditioned on X. For example, the
probability that a fruit is an apple, given the condition that it is red and round. In
contrast, P(H) is the prior probability, or apriori probability, of H. In this
example P(H) is the probability that any given data record is an apple, regardless
of how the data record looks. The posterior probability, P(H|X), is based on
more information (namely, the observed record X) than the prior probability,
P(H), which is independent of X.
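Bayes' theorem ties these quantities together as P(H|X) = P(X|H) * P(H) / P(X). The numbers below are purely illustrative, not taken from any example dataset:

```python
# Bayes' rule with hypothetical fruit probabilities.
p_apple = 0.30                   # prior P(H): any record is an apple
p_red_round_given_apple = 0.80   # likelihood P(X|H): apples that are red and round
p_red_round = 0.40               # evidence P(X): any fruit is red and round

# Posterior P(H|X): probability a red, round fruit is an apple
p_apple_given_red_round = p_red_round_given_apple * p_apple / p_red_round
print(round(p_apple_given_red_round, 10))  # 0.6
```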
In this example, we will classify pilots on whether they are fit to fly based on
various physical and psychological tests. The output variable, TestRes/Var1
equals 1 if the pilot is fit and 0 if not.
First, we partition the data into training and validation sets using the Standard
Data Partition defaults of 60% of the data randomly allocated to the Training Set
and 40% of the data randomly allocated to the Validation Set. For more
information on partitioning a dataset, see the Data Science Partitioning chapter.
Click Classify – Naïve Bayes. The following Naïve Bayes dialog appears.
Select Var2, Var3, Var4, Var5, and Var6 as Selected Variables and
TestRes/Var1 as the Output Variable. The Number of Classes statistic will be
automatically updated with a value of 2 when the Output Variable is selected.
This indicates that the Output variable, TestRes/Var1, contains two classes, 0
and 1.
Choose the value that will be the indicator of “Success” by clicking the down
arrow next to Success Class. In this example, we will use the default of 1
indicating that a value of “1” will be specified as a “success”.
Enter a value between 0 and 1 for Success Probability Cutoff. If the Probability
of success (probability of the output variable = 1) is less than this value, then a 0
will be entered for the class value, otherwise a 1 will be entered for the class
value. In this example, we will keep the default of 0.5.
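The cutoff rule can be expressed in a couple of lines, using hypothetical success probabilities:

```python
def apply_cutoff(success_probs, cutoff=0.5):
    """Assign class 1 when the predicted probability of success meets or
    exceeds the cutoff, otherwise class 0 (hypothetical probabilities)."""
    return [1 if p >= cutoff else 0 for p in success_probs]

print(apply_cutoff([0.92, 0.31, 0.50, 0.07]))  # [1, 0, 1, 0]
```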
NB_TrainingScore
Click Training: Classification Details in the Output Navigator to open the
NB_TrainingScore output worksheet. Immediately, the Output Variable
frequency chart appears. The worksheet contains the Training: Classification
Summary and the Training: Classification Details reports. All calculations,
charts and predictions on this worksheet apply to the Training data.
Note: To view charts in the Cloud app, click the Charts icon on the Ribbon, select a
worksheet under Worksheet and a chart under Chart.
Bin Details: Use this view to find metrics related to each bin in the chart.
Chart Options: Use this view to change the color of the bars in the chart.
Chart Options View
• To see both the actual and predicted frequency, click Prediction and select
Actual. This change will be reflected on all charts.
Selecting Prediction/Actual
Confusion Matrix

                   Predicted Class
Actual Class         1       0
        1           TP      FN
        0           FP      TN
TP stands for True Positive. These are the number of cases classified as
belonging to the Success class that actually were members of the Success class.
FN stands for False Negative. These are the number of cases that were
classified as belonging to the Failure class when they were actually members of
the Success class (i.e. patients with cancerous tumors who were told their tumors
were benign). FP stands for False Positive. These cases were assigned to the
Success class but were actually members of the Failure group (i.e. patients who
were told they tested positive for cancer when, in fact, their tumors were benign).
TN stands for True Negative. These cases were correctly assigned to the Failure
group.
Precision is the probability of correctly identifying a randomly selected record
as one belonging to the Success class (i.e. the probability of correctly identifying
a random patient with cancer as having cancer). Recall (or Sensitivity)
measures the percentage of actual positives which are correctly identified as
positive (i.e. the proportion of people with cancer who are correctly identified as
having cancer). Specificity (also called the true negative rate) measures the
percentage of failures correctly identified as failures (i.e. the proportion of
people with no cancer being categorized as not having cancer.) The F-1 score,
which fluctuates between 1 (a perfect classification) and 0, defines a measure
that balances precision and recall.
Precision = TP/(TP+FP)
NB_ValidationScore
Click the link for Validation: Classification Summary in the Output Navigator
to open the Classification Summary for the validation partition.
NB_LogDensity
Click the NB_LogDensity tab to view the Log Densities for each partition.
Log PDF, or Logarithm of Unconditional Probability Density, is the distribution
of the predictors marginalized over the classes and is computed as:
Log PDF(x) = log Σ (c = 1..C) P(Class c) * PDF(x | Class c)
NB_Output
Click the Prior Conditional Probability: Training link to display the table
below. This table shows the probabilities for each case by variable. For
example, for Var2, 21% of the records where Var2 = 0 were assigned to Class 0,
57% of the records where Var2 = 1 were assigned to Class 0 and 21% of the
records where Var2 = 2 were assigned to Class 0.
Decile-wise Lift Chart, ROC Curve, and Lift Charts for Valid. Partition
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted in decreasing order using the predicted output variable value.
After sorting, the actual outcome values of the output variable are cumulated
and the lift curve is drawn as the cumulative number of cases in decreasing
probability (on the x-axis) vs the cumulative number of true positives on the y-
axis. The baseline (red line connecting the origin to the end point of the blue
line) is a reference line. For a given number of cases on the x-axis, this line
represents the expected number of successes if no model existed, and instead
cases were selected at random. This line can be used as a benchmark to measure
the performance of the fitted model. The greater the area between the lift curve
and the baseline, the better the model. In the Training Lift chart, if we selected
10 cases as belonging to the success class and used the fitted model to pick the
members most likely to be successes, the lift curve tells us that we would be
right on about 9 of them. Conversely, if we selected 10 random cases, we could
expect to be right on about 4 of them. The Validation Lift chart tells us that we
could expect the Random model to perform as well as, or better than, our fitted
model on the validation partition.
The decile-wise lift curve is drawn as the decile number versus the cumulative
actual output variable value divided by the decile's mean output variable value.
The bars in this chart indicate the factor by which the model outperforms a
random assignment, one decile at a time. Refer to the validation graph above.
In the first decile, the predictive performance of the model is about 1.8 times
better than simply assigning a random predicted value.
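The decile-wise calculation can be sketched as follows; the function name and inputs are assumptions for illustration, not Analytic Solver's code:

```python
# Sketch of a decile-wise lift computation.
# 'scores' are predicted success probabilities; 'actual' are 0/1 outcomes.
def decile_lift(scores, actual, n_deciles=10):
    # Rank records by predicted probability, highest first
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda p: -p[0])]
    n = len(ranked)
    overall_rate = sum(actual) / n
    size = n // n_deciles
    lifts = []
    for d in range(n_deciles):
        chunk = ranked[d * size:(d + 1) * size]
        # Ratio of this decile's success rate to the overall success rate
        lifts.append((sum(chunk) / len(chunk)) / overall_rate)
    return lifts  # a first-decile value near 1.8 means 1.8x a random pick
```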
The ROC curve was updated in V2017. This new chart compares the
performance of the classifier (Fitted Classifier) with an Optimum Classifier
Curve and a Random Classifier curve. The Optimum Classifier Curve plots a
hypothetical model that would provide perfect classification results. The best
possible classification performance is denoted by a point at the top left of the
chart.
Select Lift Chart (Alternative) to display Analytic Solver Data Science's new
Lift Chart. Each of these charts consists of an Optimum Classifier curve, a
Fitted Classifier curve, and a Random Classifier curve. The Optimum Classifier
curve plots a hypothetical model that would provide perfect classification for
our data. The Fitted Classifier curve plots the fitted model and the Random
Classifier curve plots the results from using no model or by using a random
guess (i.e. for x% of selected observations, x% of the total number of positive
observations are expected to be correctly classified).
The Alternative Lift Chart plots Lift against the Predictive Positive Rate or
Support.
Lift Chart (Alternative) and Gain Chart for Training Partition
Selected Variables
Variables selected to be included in the output appear here.
Output Variable
The dependent variable or the variable to be classified appears here.
Success Class
This option is selected by default. Select the class to be considered a “success”
or the significant class in the Lift Chart. This option is enabled when the
number of classes in the output variable is equal to 2.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. If this option is selected, Analytic Solver Data Science will
partition your dataset (according to the partition options you set) immediately
before running the classification method. If partitioning has already occurred on
the dataset, this option will be disabled. For more information on partitioning,
please see the Data Science Partitioning chapter.
Prior Probability
Click Prior Probability. Three options appear in the Prior Probability Dialog:
Empirical, Uniform and Manual.
Laplace Smoothing
If a particular realization of some feature never occurs in a given class in the
training partition, then the corresponding frequency-based prior conditional
probability estimate will be zero. For example, assume that you have trained a
model to classify emails using the Naïve Bayes Classifier with 2 classes: work
and personal. Assume that the model rates one email as having a high
probability of belonging to the "personal" class. Now assume that there is a 2nd
email that is the same as the previous email, but this email includes one word
that is different. Now, if this one word was not present in any of the “personal”
emails in the training partition, the estimated probability would be
zero. Consequently, the resulting product of all probabilities will be zero,
leading to a loss of all the strong evidence of this email to belong to a “personal”
class. To mitigate this problem, Analytic Solver Data Science allows you to
specify a small correction value, known as a pseudocount, so that no probability
estimate is ever set to 0. Normalizing the Naïve Bayes classifier in this way is
called Laplace smoothing. Pseudocount set to zero is equivalent to no
smoothing. There are arguments in the literature which support a pseudocount
value of 1, although in practice, fractional values are often used. When Laplace
Smoothing is selected, Analytic Solver Data Science will accept any positive
value for pseudocount.
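The smoothed estimate can be sketched as follows, where pseudocount plays the role of the correction value described above (the function and variable names are illustrative):

```python
# Laplace-smoothed conditional probability estimate (a sketch).
def smoothed_prob(count_value_in_class, count_class, n_values, pseudocount=1.0):
    """P(feature = value | class) with a pseudocount so the estimate is never 0.
    n_values is the number of distinct values the feature can take."""
    return (count_value_in_class + pseudocount) / (count_class + pseudocount * n_values)

# A word that never appears in any 'personal' email still gets a small
# nonzero probability (here 1/1050) instead of zeroing out the product:
p = smoothed_prob(0, 50, 1000, pseudocount=1.0)
```

Setting pseudocount to zero reproduces the unsmoothed frequency estimate.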
Introduction
Artificial neural networks are relatively crude electronic networks of "neurons"
based on the neural structure of the brain. They process records one at a time,
and "learn" by comparing their classification of the record (which, at the outset,
is largely arbitrary) with the known actual classification of the record. The errors
from the initial classification of the first record are fed back into the network and
used to modify the network's algorithm the second time around, and so on for
many iterations.
Roughly speaking, a neuron in an artificial neural network consists of:
1. A set of input values (xi) and associated weights (wi)
2. A function (g) that sums the weighted inputs and maps the result to an
output (y).
Neurons are organized into layers: input, hidden and output. The input layer is
composed not of full neurons, but rather consists simply of the record’s values
that are inputs to the next layer of neurons. The next layer is the hidden layer.
Several hidden layers can exist in one neural network. The final layer is the
output layer, where there is one node for each class. A single sweep forward
through the network results in the assignment of a value to each output node,
and the record is assigned to the class node with the highest value.
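The forward sweep described above can be sketched as follows, using the logistic sigmoid as the transfer function (a simplified illustration with assumed names, not Analytic Solver's implementation):

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs, squashed by the logistic sigmoid transfer function
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def classify(record, hidden_weights, hidden_biases, output_weights, output_biases):
    # One sweep forward: input values -> hidden layer -> output layer
    hidden = [neuron(record, w, b) for w, b in zip(hidden_weights, hidden_biases)]
    outputs = [neuron(hidden, w, b) for w, b in zip(output_weights, output_biases)]
    # The record is assigned to the class node with the highest value
    return outputs.index(max(outputs))
```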
First, we partition the data into training and validation sets using a Standard
Data Partition with percentages of 60% of the data randomly allocated to the
Training Set and 40% of the data randomly allocated to the Validation Set. For
more information on partitioning a dataset, see the Data Science Partitioning
chapter.
Note: When selecting a rescaling technique, it's recommended that you apply
Normalization ([0,1]) if Sigmoid is selected for Hidden Layer Activation and
Adjusted Normalization ([-1,1]) if Hyperbolic Tangent is selected for Hidden
Layer Activation. This applies to both classification and regression. Since we
will be using Logistic Sigmoid for Hidden Layer Activation, Normalization
was selected.
Click Prior Probability. Three options appear in the Prior Probability Dialog:
Empirical, Uniform and Manual.
If the first option is selected, Empirical, Analytic Solver Data Science will
assume that the probability of encountering a particular class in the dataset is the
same as the frequency with which it occurs in the training data.
If the second option is selected, Uniform, Analytic Solver Data Science will
assume that all classes occur with equal probability.
Select the third option, Manual, to manually enter the desired class and
probability value.
Click Stopping Rules to open the Stopping Rules dialog. Here users can specify
a comprehensive set of rules for stopping the algorithm early plus cross-
validation on the training error. For more information on these options, please
see the Neural Network Classification Options section below. For now, simply
click Done to accept the option defaults and close the dialog.
Keep the default selections for the Hidden Layer and Output Layer options. See
the Neural Network Classification Options section below for more information
on these options.
NNC_Output
Click NNC_Output to open the first output sheet.
The top section of the output includes the Output Navigator which can be used
to quickly navigate to various sections of the output. The Data, Variables, and
Parameters/Options sections of the output all reflect inputs chosen by the user.
A little further down is the Architecture Search Error Log, a portion is shown
below.
Notice the number of networks trained and reported in the Error Report was 90
(# Networks Trained = MIN{100, 10 × (1 + 8)} = 90).
This report may be sorted by each column by clicking the arrow next to each
column heading. Click the arrow next to Validation % Error and select Sort
Smallest to Largest from the menu. Then click the arrow next to Training %
Error and do the same to display all networks with 0% Error in both the Training
and Validation sets.
Click a Net ID hyperlink, say Net 2, to bring up the Neural Network
Classification dialog. Click Finish to run the Neural Net Classification method
with Manual Architecture using the input and option settings specified for Net 2.
Scroll down on the NNC_Output sheet to see the confusion matrices for each
Neural Network listed in the table above. Here are the confusion matrices for Net 2.
The Error Report provides the total number of errors in the classification -- % Error, %
Sensitivity (true positive rate), and % Specificity (true negative rate) -- produced by each
network ID for the Training and Validation Sets. This report may be sorted by column
by clicking the arrow next to each column heading.
Sensitivity and Specificity measures are unique to the Error Report when the Output
Variable contains only two categories. Typically, these two categories are labeled as
success and failure, where one is more important than the other (e.g., whether a tumor is
cancerous or benign). Sensitivity (true positive rate) measures the
percentage of actual positives that are correctly identified as positive (i.e., the proportion
of people with cancer who are correctly identified as having cancer). Specificity (true
negative rate) measures the percentage of failures correctly identified as failures (i.e., the
proportion of people with no cancer being categorized as not having cancer). The two
are calculated as Sensitivity = TP/(TP+FN) and Specificity = TN/(TN+FP), using the
values displayed in the Confusion Matrix.
When viewing Net ID 10, this network has one hidden layer containing 10 nodes.
For this neural network, the percentage of errors in the Training Set is 3.95%, and the
percentage of errors in the Validation Set is 5.45%. The percent sensitivity is 87.25%
and 89.19% for the training partition and validation partition, respectively. This means
that 87.25% of the actual positive records in the Training Set, and 89.19% of those in
the Validation Set, were correctly identified as positive.
The percentage specificity is 87.23% for the Training Set, and 95.76% in the Validation
Set. This means that 87.23% of the actual negative records in the Training Set and
95.76% of those in the Validation Set were correctly identified as negative. In the case
of a cancer diagnosis, we would prefer that the sensitivity be higher, or much closer to
100%, as it could potentially be fatal if a person with cancer was diagnosed as not
having cancer.
Inputs
1. Select a cell on the StdPartition worksheet, then click Classify – Neural
Network – Manual Network on the Data Science ribbon. The Neural
Network Classification dialog appears.
2. Select Type as the Output variable and the remaining variables as
Selected Variables. Since the Output variable contains three classes (A, B,
and C) to denote the three different wineries, the options for Classes in the
Output Variable are disabled. (The options under Classes in the Output
Variable are only enabled when the number of classes is equal to 2.)
If the first option is selected, Empirical, Analytic Solver Data Science will
assume that the probability of encountering a particular class in the dataset
is the same as the frequency with which it occurs in the training data.
If the second option is selected, Uniform, Analytic Solver Data Science will
assume that all classes occur with equal probability.
Select the third option, Manual, to manually enter the desired class and
probability value.
Click Done to close the dialog and accept the default setting, Empirical.
7. Click Training Parameters to open the Training Parameters dialog. See
the Neural Network Options section below for more information on these
options. For now, click Done to accept the default settings and close the
dialog.
8. Click Stopping Rules to open the Stopping Rules dialog. Here users can
specify a comprehensive set of rules for stopping the algorithm early plus
cross-validation on the training error. Again, see the example above or the
Neural Network Options section below for more information on these
parameters. For now, click Done to accept the default settings and close the
dialog.
Stopping Rules dialog
Output Worksheets
Output worksheets are inserted to the right of the STDPartition worksheet.
NNC_Output
Scroll down to the Inputs section of the output sheet. This section includes all of
the input selections from the Neural Network Classification dialog.
Click NNC_Output1 to view the Output Navigator. Click any link within the
table to navigate to the report. Each output worksheet includes the Output
Navigator at the top of the sheet.
Scroll down to the Inputs section. This section includes all the inputs selected on
the Neural Network Classification dialog.
Recall that a key element in a neural network is the weights for the connections
between nodes. In this example, we chose to have one hidden layer containing 6
neurons. Analytic Solver Data Science's output includes a section with the final
values for the weights between the input layer and the hidden layer, between
hidden layers, and between the last hidden layer and the output layer. This
information is useful for viewing the "insides" of the neural network; however, it
is unlikely to be of use to the data analyst end-user. Displayed above are the
final connection weights between the input layer and the hidden layer for our
example.
NNC_TrainLog
Click the Training Log link in the Output Navigator or click the NNC_TrainLog
output tab, to display the Neural Network Training Log. This log displays the
Sum of Squared errors and Misclassification errors for each epoch or iteration of
the Neural Network. Thirty epochs, or iterations, were performed.
NNC_TrainingScore
Click the NNC_TrainingScore tab to view the newly added Output Variable
frequency chart, the Training: Classification Summary and the Training:
Classification Details report. All calculations, charts and predictions on this
worksheet apply to the Training data.
Note: To view charts in the Cloud app, click the Charts icon on the Ribbon, select the
desired worksheet under Worksheet and the desired chart under Chart.
• To see both the actual and predicted frequency, click Prediction and select
Actual. This change will be reflected on all charts.
Click Predicted/Actual to change view
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct = 76 and %Correct = 71.03%): Refers to the
ability of the classifier to predict a class label correctly.
NNC_ValidationScore
Click the NNC_ValidationScore tab to view the newly added Output Variable
frequency chart, the Validation: Classification Summary and the Validation:
Classification Details report. All calculations, charts and predictions on this
worksheet apply to the Validation data.
• Frequency Charts: The output variable frequency chart opens
automatically once the NNC_ValidationScore worksheet is selected. To
close this chart, click the “x” in the upper right hand corner. To reopen,
click onto another tab and then click back to the NNC_ValidationScore tab.
To change the placement of the chart, grab the title bar and move to the
desired location on the screen.
Click the Frequency chart to display the frequency for both the predicted
and actual values of the output variable, along with various statistics such as
count, number of classes and the mode. Select Relative Frequency from the
drop-down menu on the right to see the relative frequencies of the output
variable for both actual and predicted values. See above for more
information on this chart.
NNC_ValidationScore Frequency Chart
• Classification Summary: This report contains the confusion matrix for the
validation data set.
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct = 51 and %Correct = 71.8%): Refers to the ability
of the classifier to predict a class label correctly.
• Classification Details: This table displays how each observation in the
validation data was classified. The probability values for success in each
record are shown after the predicted class and actual class columns.
Records assigned to a class other than what was predicted are highlighted in
red.
NNC_ValidationScore: Validation: Classification Details
NNC_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, NNC_Simulation, when Simulate Response Prediction is selected on
the Simulation tab of the Neural Network Classification dialog in Analytic
Solver Comprehensive and Analytic Solver Data Science. (This feature is not
supported in Analytic Solver Optimization, Analytic Solver Simulation or
Analytic Solver Upgrade.)
This report contains the synthetic data, the training partition predictions (using
the fitted model) and the Excel-calculated Expression column, if populated in
the dialog. A dialog is displayed with the option to switch between the
frequency charts.
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field on the Simulation tab of
the Neural Network Classification dialog:
IF([@Alcohol]<10, [@Type], "Alcohol >=10")
The results in this column are either A, B, C or Alcohol >= 10 depending on the
alcohol content for each record in the synthetic data.
The remainder of the data in this report is synthetic data, generated using the
Generate Data feature described in the chapter with the same name, that appears
earlier in this guide.
The chart that is displayed once this tab is selected contains frequency
information pertaining to the output variable in the training partition and the
synthetic data. The chart below displays frequency information for the predicted
values in the synthetic data.
Prediction Frequency Chart for NNC_Simulation output
In the synthetic data, 25 records were classified as Type A, 66 as Type B and 9
as Type C.
Click Prediction (Simulation) and select Prediction (Training) in the Data dialog
to display a frequency chart based on the Training partition.
In this chart, the columns in the darker shade of blue relate to the predicted wine
type in the synthetic, or simulated data. The columns in the lighter shade of blue
relate to the predicted wine type in the training partition.
Note the red Relative Bin Differences curve. Click the arrow next to Frequency
and select Bin Details from the menu. This view reports the absolute differences
between corresponding bins in the chart.
Click Prediction (Simulation)/Prediction (Training) and select Expression
(Simulation) and Expression (Training) in the Data dialog to display a chart of
the results for the expression that was entered on the Simulation tab.
The columns in darker blue display the wine type for each record in the
simulated, or synthetic, data. In the simulated data, 25% of the records in the
data were assigned to type A, 66% were assigned to type B and 9% were
assigned to type C. There were no records in the simulated data where the
alcohol content was less than 10. As a result, the Expression values for all
records in the synthetic data are labeled "Alcohol >= 10".
Click the down arrow next to Frequency to change the chart view to Relative
Frequency or to change the look by clicking Chart Options. Statistics on the
right of the chart dialog are discussed earlier in this section. For more
information on the generated synthetic data, see the Generate Data chapter that
appears earlier in this guide.
For information on Stored Model Sheets, in this example NNC_Stored, please
refer to the “Scoring New Data” chapter within the Analytic Solver Data
Science User Guide.
The above error report gives the total number of errors, % Error, % Sensitivity
(also known as true positive rate) and % Specificity (also known as true negative
rate) in the classification produced by each network ID for the training and
validation datasets separately. As shown in the Automatic Neural Network
Classification section above, this report may be sorted by column by clicking the
arrow next to each column heading. In addition, click the Net ID hyperlinks to
re-run the Neural Network Classification method with Manual Architecture with
the input and option settings as specified in the specific Net ID.
Let’s take a look at Net ID 5. This network has two hidden layers, each
containing one node. For this neural network, the percentage of errors in the
training data is 11.16% and the percentage of errors in the validation data is
10.375%. Click the Net5 link in cell C68 to open the Neural Network
Classification dialog. Notice that the Selected Variables and Output Variable
have been prefilled on the Data tab. If you click Next, you will find that all
Parameters have also been prefilled on the Parameters tab. Click Finish to fit the
model.
Selected Variables
Variables selected to be included in the output appear here.
Categorical Variables
Move categorical variables from the Variables list box into the model by
clicking the > command button. The Neural Network Classification algorithm
will accept non-numeric categorical variables.
Output Variable
The dependent variable or the variable to be classified appears here.
Number of Classes
Displays the number of classes in the Output variable.
Success Class
This option is selected by default. Click the drop down arrow to select the value
to specify a “success”. This option is only enabled when the # of classes is
equal to 2.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
classification method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Rescale Data
Click Rescale Data to open the Rescaling dialog.
On-the-fly Rescaling Dialog
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e., Rescaling) and review the
bounds to make sure that all are appropriate.
Hidden Layers/Neurons
Click Add Layer to add a hidden layer. To delete a layer, click Remove Layer.
Once the layer is added, enter the desired Neurons.
Hidden Layer
Nodes in the hidden layer receive input from the input layer. The output of the
hidden nodes is a weighted sum of the input values. This weighted sum is
computed with weights that are initially set at random values. As the network
“learns”, these weights are adjusted. This weighted sum is used to compute the
hidden node’s output using a transfer function. The default selection is Sigmoid.
Select Sigmoid (the default setting) to use a logistic function for the transfer
function, with a range of 0 to 1. This function has a “squashing effect” on very
small or very large values but is almost linear in the range where the value of the
function is between 0.1 and 0.9.
Select Hyperbolic Tangent to use the tanh function for the transfer function, the
range being -1 to 1. If more than one hidden layer exists, this function is used
for all layers.
ReLU (Rectified Linear Unit) is a widely used choice for hidden layers. This
function applies a max(0,x) function to the neuron values. When used instead of
logistic sigmoid or hyperbolic tangent activations, some adjustments to the
Neural Network settings are typically required to achieve good performance,
such as significantly decreasing the learning rate and increasing the number of
learning epochs and network parameters.
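The three transfer functions described above can be summarized in a short sketch for reference:

```python
import math

def sigmoid(x):
    # Logistic function: output in (0, 1), nearly linear for mid-range values
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Hyperbolic tangent: output in (-1, 1)
    return math.tanh(x)

def relu(x):
    # Rectified linear unit: max(0, x)
    return max(0.0, x)
```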
Prior Probability
Click Prior Probability. Three options appear in the Prior Probability Dialog:
Empirical, Uniform and Manual.
Prior Probability Dialog
If the first option is selected, Empirical, Analytic Solver Data Science will
assume that the probability of encountering a particular class in the dataset is the
same as the frequency with which it occurs in the training data.
If the second option is selected, Uniform, Analytic Solver Data Science will
assume that all classes occur with equal probability.
Select the third option, Manual, to manually enter the desired class and
probability value.
Training Parameters
Click Training Parameters to open the Training Parameters dialog to specify
parameters related to the training of the Neural Network algorithm.
Training Parameters Dialog
Learning Rate
This is the multiplying factor for the error correction during
backpropagation; it is roughly equivalent to the step size in gradient
descent. A low value produces slow but steady learning; a high value
produces rapid but erratic learning. Values for the learning rate typically
range from 0.1 to 0.9.
Weight Decay
To prevent over-fitting of the network on the training data, set a weight
decay to penalize the weight in each iteration. Each calculated weight
will be multiplied by (1-decay).
Error Tolerance
The error in a particular iteration is backpropagated only if it is greater
than the error tolerance. Typically error tolerance is a small value in the
range from 0 to 1.
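A simplified, single-weight sketch of how these three parameters interact during one backpropagation step (illustrative only; the actual update is applied across all network weights):

```python
# Sketch of one weight update using Learning Rate, Weight Decay and
# Error Tolerance (names and the single-weight simplification are assumptions).
def update_weight(w, gradient, error, learning_rate=0.5,
                  weight_decay=0.0, error_tolerance=0.01):
    if abs(error) <= error_tolerance:
        return w                          # error too small: not backpropagated
    w = w - learning_rate * gradient      # error correction scaled by learning rate
    return w * (1.0 - weight_decay)       # decay penalizes the weight each iteration
```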
Stopping Rules
Click Stopping Rules to open the Stopping Rules dialog. Here users can specify
a comprehensive set of rules for stopping the algorithm early plus cross-
validation on the training error.
Number of Epochs
An epoch is one sweep through all records in the training set. Use this
option to set the number of epochs to be performed by the algorithm.
Bagging, or bootstrap aggregating, was one of the first ensemble algorithms ever
to be written. It is a simple algorithm, yet very effective. Bagging generates
several training data sets by using random sampling with replacement (bootstrap
sampling), applies the classification algorithm to each dataset, then takes the
majority vote amongst the models to determine the classification of the new
data. The biggest advantage of bagging is the relative ease with which the
algorithm can be parallelized, which makes it a better selection for very large
datasets.
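The procedure can be sketched as follows; fit stands for any classifier-fitting routine you supply, and the names are illustrative rather than Analytic Solver's implementation:

```python
import random
from collections import Counter

# Bagging sketch: bootstrap-sample the training set, fit one model per sample,
# then classify new data by majority vote among the models.
def bagging_predict(train, fit, x, n_models=25, seed=12345):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # sampling with replacement
        model = fit(sample)                           # each fit is independent -> parallelizable
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]        # majority vote
```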
e_b = Σ_{i=1}^{n} w_b(i) I(C_b(x_i) ≠ y_i)
α_b = ln((1 − e_b)/e_b)
In AdaBoost.M1 (Breiman), the constant is calculated as:
α_b = (1/2) ln((1 − e_b)/e_b)
Afterwards, the weights are all readjusted to sum to 1. As a result, the weights
assigned to the observations that were classified incorrectly are increased and
the weights assigned to the observations that were classified correctly are
decreased. This adjustment forces the next classification model to put more
emphasis on the records that were misclassified. (This α constant is also used in
the final calculation which will give the classification model with the lowest
error more influence.) This process repeats until b = Number of weak learners
(controlled by the User). The algorithm then computes the weighted sum of
votes for each class and assigns the “winning” classification to the record.
Boosting generally yields better models than bagging; however, it has the
disadvantage of not being parallelizable. As a result, if the number of weak
learners is large, boosting may not be suitable.
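One boosting round, as described above, can be sketched as follows (names are illustrative; correct marks which records the weak learner classified correctly):

```python
import math

# Sketch of one AdaBoost.M1 round: compute the weighted error e_b and the
# constant alpha_b, then up-weight misclassified records and renormalize.
def adaboost_round(weights, correct, variant="freund"):
    e = sum(w for w, ok in zip(weights, correct) if not ok)   # weighted error
    alpha = math.log((1 - e) / e)
    if variant == "breiman":
        alpha *= 0.5                                          # Breiman's 1/2 factor
    new_w = [w * (math.exp(alpha) if not ok else 1.0)         # boost the misclassified
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha                  # weights sum to 1 again
```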
Classification Ensemble methods are very powerful methods and typically result
in better performance than a single tree. This feature addition in Analytic Solver
Data Science (introduced in V2015) will provide users with more accurate
classification models and should be considered.
5. Select CAT. MEDV under Variables In Input Data, then click > next to
Output Variable, to select this variable as the output variable. This variable
is derived from the scale MEDV variable.
6. Choose the value that will be the indicator of “Success” by clicking the
down arrow next to Success Class. In this example, we will use the default
of 1.
7. Enter a value between 0 and 1 for Success Probability Cutoff. If the
Probability of success is less than this value, then a 0 will be entered for the
class value, otherwise a 1 will be entered for the class value. In this
example, we will keep the default of 0.5.
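The cutoff rule amounts to a one-line test (a sketch; the default cutoff of 0.5 is shown):

```python
# Success Probability Cutoff: below the cutoff -> class 0, otherwise -> class 1
def assign_class(p_success, cutoff=0.5):
    return 1 if p_success >= cutoff else 0
```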
All options will be left at their default values. For more information on
these options, see the Logistic Regression chapter that occurs earlier in this
guide.
11. Leave the default setting for Random Seed for Bootstrapping at “12345”.
Analytic Solver Data Science will use this value to set the bootstrapping
random number seed. Setting the random number seed to a nonzero value
(any number of your choice is OK) ensures that the same sequence of
random numbers is used each time the dataset is chosen for the classifier.
12. Select Show Weak Learner Models to display the weak learner models in
the output.
15. Click Next to advance to the Simulation tab. This tab is disabled in Analytic
Solver Optimization, Analytic Solver Simulation and Analytic Solver
Upgrade.
Select Simulate Response Prediction to enable all options on the
Simulation tab.
Simulation tab: All supervised algorithms include a new Simulation tab.
This tab uses the functionality from the Generate Data feature (described
earlier in this guide) to generate synthetic data based on the training
partition, and uses the fitted model to produce predictions for the synthetic
data. The resulting report, CBagging_Simulation, will contain the synthetic
data, the predicted values and the Excel-calculated Expression column, if
present. In addition, frequency charts containing the Predicted, Training,
and Expression (if present) sources or a combination of any pair may be
viewed, if the charts are of the same type.
For more information on the remaining options shown on this dialog in the
Distribution Fitting, Correlation Fitting and Sampling sections, see the
Generate Data chapter that appears earlier in this guide.
16. Click Finish.
Output
Output from the Bagging algorithm will be inserted to the right of the Data
worksheet.
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Bagging Classification dialog.
The Classification Summary displays the confusion matrix for the Training
Partition.
• True Positive: 43 records belonging to the Success class were correctly
assigned to that class.
• False Negative: 4 records belonging to the Success class were
incorrectly assigned to the Failure class.
• True Negative: 251 records belonging to the Failure class were
correctly assigned to this same class.
• False Positive: 6 records belonging to the Failure class were
incorrectly assigned to the Success class.
The total number of misclassified records was 10 (4 + 6) which results in an
error equal to 3.29%.
CBagging_ValidationScore
Click the CBagging_ValidationScore tab, then scroll down to view the
Classification Summary and Classification Details Reports for the Validation
partition, as well as the Frequency charts. For detailed information on each of
these components, see the Logistic Regression chapter that appears earlier in
this guide.
• Frequency Chart: This chart shows the frequency for both the predicted
and actual values of the output variable, along with various statistics such as
count, number of classes and the mode.
The Classification Summary displays the confusion matrix for the Validation
Partition.
• True Positive: 33 records belonging to the Success class were correctly
assigned to that class.
• False Negative: 4 records belonging to the Success class were
incorrectly assigned to the Failure class.
• True Negative: 153 records belonging to the Failure class were
correctly assigned to this same class.
• False Positive: 12 records belonging to the Failure class were
incorrectly assigned to the Success class.
The total number of misclassified records was 16 (12 + 4) which results in
an error equal to 7.92%.
CBagging_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, CBagging_Simulation, when Simulate Response Prediction is
selected on the Simulation tab of the Bagging Classification dialog in Analytic
Solver Comprehensive and Analytic Solver Data Science. (This feature is not
supported in Analytic Solver Optimization, Analytic Solver Simulation or
Analytic Solver Upgrade.)
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field, on the Simulation tab of
the Bagging Classification dialog
IF([@RM]>6, [@CAT. MEDV], "RM<=6")
The Expression column will contain each record's predicted score for the CAT.
MEDV variable or the string "RM<=6".
The remainder of the data in this report is synthetic data, generated using the
Generate Data feature described in the chapter with the same name, that appears
earlier in this guide.
The chart that is displayed once this tab is selected contains frequency
information pertaining to the predicted values for the output variable in the
training partition, the synthetic data and the expression, if present.
In the screenshot below, the bars in the darker shade of blue are based on the
Prediction, or synthetic, data as generated in the table above for the CAT.
MEDV variable. The bars in the lighter shade of blue display the frequency of
the predictions for the CAT. MEDV variable in the training partition.
Frequency Chart for CBagging_Simulation output
The red Relative Bin Differences curve indicates that the absolute differences
for each bin are equal. Click the down arrow next to Frequency and select Bin
Details to view.
11. Click Next to advance to the Boosting Classification Scoring tab. Summary
Report is selected by default under both Score Training Data and Score
Validation Data.
• Select Detailed Report under both Score Training Data and Score
Validation Data to produce a detailed assessment of the
performance of the tree in both sets.
Output
Output from the Ensemble Methods algorithm will be inserted to the right.
CBoosting_Output
This result worksheet includes 3 segments: Output Navigator, Inputs and the
Boosting Model.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
CBoosting_Output: Output Navigator
CBoosting_TrainingScore
Click the CBoosting_TrainingScore tab, then scroll down to view the
Classification Summary and Classification Details Reports for the Training
partition, as well as the Frequency charts. For detailed information on each of
these components, see the Classification Tree chapter that appears earlier in
this guide.
The Classification Summary displays the confusion matrix for the Training
Partition.
• True Positive: 47 records belonging to the Success class were correctly
assigned to that class.
• False Negative: 0 records belonging to the Success class were
incorrectly assigned to the Failure class.
Metrics
The following metrics are computed using the values in the confusion
matrix.
• Accuracy (#Correct and %Correct): 100.00% - Refers to the ability of
the classifier to predict a class label correctly.
• Specificity: 1.0 - Also called the true negative rate, measures the
percentage of failures correctly identified as failures.
Specificity (SPC) or True Negative Rate = TN / (FP + TN)
• Recall (or Sensitivity): 1.0 - Measures the percentage of actual
positives which are correctly identified as positive (i.e. the proportion
of people who experienced catastrophic heart failure who were
predicted to have catastrophic heart failure).
Sensitivity or True Positive Rate (TPR) = TP/(TP + FN)
• Precision: 1.0 - The probability of correctly identifying a randomly
selected record as one belonging to the Success class
Precision = TP/(TP+FP)
• F-1 Score: 1.0 - Ranges between 1 (a perfect classification) and 0, and
defines a measure that balances precision and recall.
F1 = 2 * TP / (2 * TP + FP + FN)
• Success Class and Success Probability: Selected on the Data tab of the
Boosting Classification dialog.
• Classification Details: This table displays how each observation in the
training data was classified. The probability values for success in each
record are shown after the predicted class and actual class columns.
Records assigned to a class other than the actual class (misclassified
records) are highlighted in red.
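The metric formulas listed above (accuracy, specificity, recall, precision, F1) can be collected into one small helper. This is an illustrative Python sketch, not Analytic Solver code; the counts in the usage line are the Bagging validation confusion-matrix values reported earlier in this chapter:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the summary metrics defined above from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    specificity = tn / (fp + tn)          # true negative rate
    recall      = tp / (tp + fn)          # sensitivity / true positive rate
    precision   = tp / (tp + fp)
    f1          = 2 * tp / (2 * tp + fp + fn)
    return accuracy, specificity, recall, precision, f1

# Counts from the Bagging validation partition shown earlier in this chapter.
acc, spc, rec, prec, f1 = classification_metrics(tp=33, fp=12, tn=153, fn=4)
print(f"Accuracy {acc:.2%}, Specificity {spc:.3f}, Recall {rec:.3f}, "
      f"Precision {prec:.3f}, F1 {f1:.3f}")
```

Note that the accuracy here, 92.08%, is the complement of the 7.92% error reported for that partition.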
CBoosting_ValidationScore
Click the CBoosting_ValidationScore tab, then scroll down to view the
Classification Summary and Classification Details Reports for the Validation
partition, as well as the Frequency charts. For detailed information on each of
these components, see the Classification Tree chapter that appears earlier in
this guide.
• Frequency Chart: This chart shows the frequency for both the predicted
and actual values of the output variable, along with various statistics such as
count, number of classes and the mode.
The Classification Summary displays the confusion matrix for the Validation
Partition.
• True Positive: 33 records belonging to the Success class were correctly
assigned to that class.
• False Negative: 4 records belonging to the Success class were
incorrectly assigned to the Failure class.
• True Negative: 160 records belonging to the Failure class were
correctly assigned to this same class.
• False Positive: 5 records belonging to the Failure class were
incorrectly assigned to the Success class.
The total number of misclassified records was 9 (5 + 4) which results in an
error equal to 4.46%.
CBoosting_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, CBoosting_Simulation, when Simulate Response Prediction is
selected on the Simulation tab of the Boosting Classification dialog in Analytic
Solver Comprehensive and Analytic Solver Data Science. (This feature is not
supported in Analytic Solver Optimization, Analytic Solver Simulation or
Analytic Solver Upgrade.)
The data in this report is synthetic, generated using the Generate Data
feature described in the chapter of the same name that appears earlier in this
guide.
The chart that is displayed once this tab is selected contains frequency
information pertaining to the predictions for the output variable in the training
partition, the synthetic data and the expression, if present.
In the screenshot below, the bars in the darker shade of blue are based on the
Prediction, or synthetic, data as generated in the table above for the CAT.
MEDV variable. The bars in the lighter shade of blue display the frequency of
the predictions for the CAT. MEDV variable in the training partition.
Frequency Chart for CBoosting_Simulation output
Click the down arrow next to Frequency to change the chart view to Relative
Frequency, to change the look by clicking Chart Options or to see details of each
bin listed in the chart. Statistics on the right of the chart dialog are discussed
earlier in the Classification Tree chapter. For more information on the generated
synthetic data, see the Generate Data chapter that appears earlier in this guide.
Analytic Solver Data Science generates CBoosting_Stored along with the other
output worksheets. Please refer to the “Scoring New Data” chapter in the
Analytic Solver Data Science User Guide for details.
Inputs
1. First, click Partition – Standard Partition to partition the dataset into
training and validation sets using the default percentages of 60%
allocated to the Training Set and 40% allocated to the Validation Set.
Figure 1: Standard Data Partition dialog
The first time that the model is fit, only two features (ejection_fraction and
serum_creatinine) will be utilized.
3. With the StdPartition workbook selected, click Classify – Ensemble –
Random Trees to open the Random Trees: Classification dialog.
4. Select the two Variables from Variables In Input Data (ejection_fraction
and serum_creatinine) and click the right pointing arrow to the left of
Selected Variables to add these two variables to the model. Then take
similar steps to select DEATH_EVENT as the Output Variable.
5. Leave Success Class as "1" and Success Probability Cutoff at 0.5 under
Binary Classification.
The Random Trees: Classification dialog should be similar to the one
pictured in Figure 3 below.
Outputs
Five worksheets are inserted to the right of the STDPartition tab:
CRandTrees_Output, CRandTrees_TrainingScore,
CRandTrees_ValidationScore, CRandTrees_Simulation and
CRandTrees_Stored.
CRandTrees_Output reports the input data, output data, and parameter settings.
CRandTrees_TrainingScore reports the confusion matrix, calculated metrics and
the actual classification by row for the training partition.
CRandTrees_ValidationScore reports the confusion matrix, calculated metrics
and the actual classification by row for the validation partition.
CRandTrees_Simulation contains the automated risk analysis simulation results.
CRandTrees_Stored contains the stored model which can be used to apply the
fitted model to new data. See the Scoring chapter within the Analytic Solver
Data Science User Guide for an example of scoring new data using the stored
model.
CRandTrees_TrainingScore
Click CRandTrees_TrainingScore to view the Classification Summary and
the new output variable frequency chart for the Training partition.
Since Frequency Chart was selected on the Scoring tab of the Random Trees
dialog, a frequency chart is displayed when the worksheet is opened.
Click Prediction in the upper right of the dialog, and select Prediction and
Actual checkboxes to display frequency information between the Actual
(Training) partition and the predicted values (Prediction). This chart quickly
displays the Frequency of records labeled as 0 (survivors) and 1 (patients who
succumbed to the complications of heart disease). Click the down arrow next to
Frequency to view the Relative Frequency chart.
The overall error for the training partition was 18.99% with 18 surviving patients
reported as deceased and 16 deceased patients reported as survivors.
• Accuracy: 81.01% -- Accuracy refers to the ability of the classifier to
predict a class label correctly.
• Specificity: 0.846 – (True Negatives)/(True Negatives + False Positives)
Specificity is defined as the proportion of actual negatives that were
classified as negative, or the fraction of actual survivors that were
classified as survivors. In this model, 99 surviving patients were classified
correctly as survivors. There were 18 false positives, or 18 actual survivors
classified incorrectly as deceased.
• Sensitivity or Recall: 0.742 – (True Positive)/(True Positive + False
Negative)
Sensitivity is defined as the proportion of positive cases that were
classified correctly as positive, or the proportion of actually deceased
patients that were classified as deceased. In this model, 46 deceased
patients were correctly classified as deceased. There were 16 false
negatives, or 16 actual deceased patients incorrectly classified as
survivors.
Note: Since the objective of this model is to correctly classify which
patients will succumb to heart failure, this is an important statistic: it is
critical for a physician to be able to accurately predict which patients
require mitigation.
• Precision: 0.719 – (True Positives)/(True Positives + False Positives)
Precision is defined as the proportion of positive results that are true
positive. In this model, 46 actual deceased patients were classified correctly
as deceased. There were 18 false positives or 18 actual survivors classified
incorrectly as deceased.
• F-1 Score: 0.730 – 2 x (Precision * Sensitivity)/(Precision + Sensitivity)
The F-1 Score provides a statistic to balance between Precision and
Sensitivity, especially if an uneven class distribution exists, as in this
example, (99 survivors vs 46 deceased). The closer the F-1 score is to 1
(the upper bound) the better the precision and recall.
CRandTrees_ValidationScore
Click the CRandTrees_ValidationScore tab to view the Summary Results for
the Validation partition.
The Frequency Chart quickly displays how the fitted model performed on the
validation partition.
The overall error for the validation partition was 24.17% with 19 false positives
(surviving patients reported as deceased) and 10 false negatives (deceased
patients reported as survivors).
Figure 7: Validation: Classification Summary
The overall error for the validation partition was 30.83% with 26 false positives
(surviving patients reported as deceased) and 11 false negatives (deceased
patients reported as survivors).
Note the following metrics:
• Accuracy: 69.17%
• Specificity: 0.698
• Sensitivity or Recall: 0.676
CRandTrees_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, CRandTrees_Simulation, when Simulate Response Prediction is
selected on the Simulation tab of the Random Trees dialog in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.)
This report contains the synthetic data, the prediction (using the fitted model)
and the Excel-calculated Expression column, if populated in the dialog. Users
can switch between the Predicted, Training, and Expression sources, or a
combination of any two, as long as they are of the same type.
Synthetic Data
Note the first column in the output, Expression. This column was inserted into
the Synthetic Data results because Calculate Expression was selected and an
Excel function was entered into the Expression field on the Simulation tab of
the Random Trees dialog:
IF([@ejection_fraction]<=20, [@DEATH_EVENT], "EF>20")
The results in this column are either 0, 1, or "EF>20".
• DEATH_EVENT = 0 indicates that the patient had an ejection_fraction
<= 20 but did not suffer catastrophic heart failure.
• DEATH_EVENT = 1 in this column indicates that the patient had an
ejection_fraction <= 20 and did suffer catastrophic heart failure.
• EF>20 indicates that the patient had an ejection fraction of greater than
20.
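For readers who prefer code to Excel formulas, the expression's per-record logic can be mirrored as below. This is a hypothetical sketch; the dictionary keys simply stand in for the worksheet columns:

```python
def expression_column(record):
    """Mirror of IF([@ejection_fraction]<=20, [@DEATH_EVENT], "EF>20")."""
    if record["ejection_fraction"] <= 20:
        return record["DEATH_EVENT"]   # 0 = survived, 1 = catastrophic heart failure
    return "EF>20"

# Example records (hypothetical values, for illustration only):
print(expression_column({"ejection_fraction": 15, "DEATH_EVENT": 1}))  # -> 1
print(expression_column({"ejection_fraction": 45, "DEATH_EVENT": 0}))  # -> EF>20
```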
The remainder of the data in this report is synthetic data, generated using the
Generate Data feature described in the chapter with the same name, that appears
earlier in this guide.
The chart that is displayed once this tab is selected contains frequency
information pertaining to the predictions of the output variable in the training
partition, the synthetic data and the expression, if present.
The bars in the darker shade of blue display the frequency of labels in the
Simulation, or synthetic, data. In the synthetic data, 55 “patients” are predicted
to survive and the remaining are not.
The bars in the lighter shade display the frequency information for the training
partition’s predicted values where 115 records were labeled as 0 (survivors) and
64 records were labeled as 1 (non-survivors).
The chart below reveals the results of the expression as applied to each dataset.
The bars in the darker shade of blue display the frequency of labels in the
Simulation, or synthetic, data. In the synthetic data, 6 surviving “patients” have
an ejection fraction less than or equal to 20 while 7 “patients” with an
ejection_fraction less than or equal to 20 did not survive.
The bars in the lighter shade display the frequency information for the training
partition’s predicted values where 15 “patients”, or records, had ejection
fractions less than 20; 1 patient was predicted to survive and 14 were not.
Columns labeled as “EF>20” contain the remainder of the records where the
ejection fraction for each patient is larger than 20.
Frequency Chart for CRandTrees_Simulation output
                                        Training                                          Validation
Variables                               Error %  % Correct  Spec.  Recall  Prec.  F1     Error %  % Correct  Spec.  Recall  Prec.  F1
ejection_fraction, serum_creatinine     18.99    81.01      0.846  0.742   0.719  0.730  30.83    69.17      0.698  0.676   0.469  0.554
+ age                                   2.23     97.765     0.974  0.984   0.953  0.968  32.5     67.5       0.686  0.647   0.449  0.530
+ serum_sodium                          0.5587   99.44      0.991  1.00    0.984  0.992  0.5587   99.44      0.991  1.00    0.984  0.992
+ high_blood_pressure (categorical)     3.35     96.65      0.966  0.968   0.938  0.952  27.50    72.5       0.767  0.618   0.512  0.56
+ anaemia (categorical)                 2.79     97.21      0.983  0.952   0.967  0.959  31.67    68.33      0.744  0.529   0.45   0.486
+ serum_phosphokinase                   1.676    98.324     0.974  1.00    0.954  0.976  34.167   65.83      0.721  0.50    0.414  0.453
+ platelets                             2.235    97.765     0.967  1.00    0.939  0.969  29.167   70.833     0.674  0.794   0.491  0.607
+ smoking (categorical)                 1.678    98.324     0.974  1.00    0.953  0.976  30.833   69.167     0.686  0.706   0.471  0.565
+ sex (categorical)                     1.117    98.88      0.982  1.00    0.969  0.984  29.167   70.833     0.709  0.706   0.490  0.578
+ diabetes (categorical)                2.23     97.77      0.966  1.00    0.939  0.969  29.17    70.833     0.779  0.529   0.486  0.507
Selected Variables
Variables selected to be included in the output appear here.
Output Variable
The dependent variable or the variable to be classified appears here.
Number of Classes
Displays the number of classes in the Output variable.
Success Class
This option is selected by default. Click the drop-down arrow to select the value
that specifies a "success". This option is only enabled when the number of
classes is equal to 2.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Click Partition Data to open the Partitioning dialog. Analytic
Solver Data Science will partition your dataset (according to the partition
options you set) immediately before running the classification method. If
partitioning has already occurred on the dataset, this option will be disabled.
For more information on partitioning, please see the Data Science Partitioning
chapter.
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
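As a rough illustration of what two of these rescaling methods do (these are the textbook definitions of Standardization and min-max Normalization; Analytic Solver's exact formulas are documented in the chapter referenced above):

```python
import statistics

def standardize(xs):
    """Standardization: center on the mean, scale by the standard deviation."""
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

def normalize(xs):
    """Normalization (min-max): rescale values onto the [0, 1] interval."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

feature = [2.0, 4.0, 6.0, 8.0]
print(normalize(feature))     # first value maps to 0.0, last to 1.0
print(standardize(feature))   # mean of the result is 0
```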
On-the-fly Rescaling dialog
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Weak Learner
Under Ensemble: Classification click the down arrow beneath Weak Learner to
select one of the six featured classifiers: Discriminant Analysis, Logistic
Regression, k-NN, Naïve Bayes, Neural Networks, or Decision Trees. After a
weak learner is chosen, the command button to the right will be enabled. Click
this command button to control various option settings for the weak learner.
AdaBoost Variant
The difference in the algorithms is the way in which the weights assigned to
each observation or record are updated. (Please refer to the section Ensemble
Methods in the Introduction to the chapter.)
In AdaBoost.M1 (Freund), the constant is calculated as:
α_b = ln((1 − e_b) / e_b)
In AdaBoost.M1 (Breiman), the constant is calculated as:
α_b = (1/2) ln((1 − e_b) / e_b)
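The two variants differ only by the factor of 1/2 applied to the logarithm. A small sketch, where e_b is the weak learner's weighted error on boosting round b:

```python
import math

def alpha_freund(e_b):
    """AdaBoost.M1 (Freund): alpha_b = ln((1 - e_b) / e_b)."""
    return math.log((1 - e_b) / e_b)

def alpha_breiman(e_b):
    """AdaBoost.M1 (Breiman): alpha_b = (1/2) ln((1 - e_b) / e_b)."""
    return 0.5 * math.log((1 - e_b) / e_b)

# A weak learner with 20% weighted error gets twice the weight under Freund:
print(alpha_freund(0.2), alpha_breiman(0.2))
```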
Random Trees Ensemble Methods dialog, Parameters tab
Bagging Classification Dialog, Parameters tab
Please see below for options unique to the Bagging – Parameters tab.
Introduction
Linear regression is performed on a dataset either to predict the response
variable based on the predictor variable, or to study the relationship between the
response variable and predictor variables. For example, using linear regression,
the crime rate of a state can be explained as a function of demographic factors
such as population, education, male-to-female ratio, etc.
This procedure performs linear regression on a selected dataset that fits a linear
model of the form
Y = b0 + b1X1 + b2X2 + ... + bkXk + e
where Y is the dependent variable (response), X1, X2, ..., Xk are the independent
variables (predictors) and e is the random error. b0, b1, b2, ..., bk are known as
the regression coefficients, which are estimated from the data. The multiple
linear regression algorithm in Analytic Solver Data Science chooses regression
coefficients to minimize the difference between the predicted and actual values.
See the Analytic Solver Data Science User Guide for a step-by-step example on
how to use Linear Regression to predict housing prices using the example
dataset, Boston_Housing.xlsx.
Selected Variables
Variables listed here will be utilized in the Analytic Solver Data Science output.
Weight Variable
One major assumption of Linear Regression is that each observation provides
equal information. Analytic Solver Data Science offers an opportunity to
provide a Weight variable. Using a Weight variable allows the user to allocate a
weight to each record. A record with a large weight will influence the model
more than a record with a smaller weight.
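As an illustration of the idea (not Analytic Solver's internal algorithm), weighted least squares for a single predictor scales each record's contribution to the fit by its weight:

```python
def weighted_simple_regression(x, y, w):
    """Weighted least squares for y = b0 + b1*x; records with larger
    weights pull the fitted line toward themselves."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Equal weights reproduce ordinary least squares on a perfect line y = 2x:
print(weighted_simple_regression([1, 2, 3], [2, 4, 6], [1, 1, 1]))  # -> (0.0, 2.0)
```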
Linear Regression Dialog, Parameters tab
See below, for option explanations included on the Linear Regression
Parameters tab.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
regression method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
Note: Rescaling has minimal effect in Regression methods. The coefficient estimates will be
scaled proportionally with the data, producing the same results with or without rescaling. This
feature is included on this dialog for consistency.
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users must
still be aware of any possible data transformations (i.e. Rescaling) and review the bounds to
make sure that all are appropriate.
Fit Intercept
When this option is selected, the default setting, Analytic Solver Data Science
will fit the Linear Regression intercept and a constant term will be included in
the model. If this option is not selected, Analytic Solver Data Science will force
the intercept term to 0 and a constant term will not be included in the equation.
Feature Selection
When you have a large number of predictors and you would like to limit the
model to only significant variables, click Feature Selection to open the Feature
Selection dialog and select Perform Feature Selection at the top of the dialog.
Maximum Subset Size can take on values of 1 up to N where N is the number of
Selected Variables. If no Categorical Variables exist, the default for this option
is N. If one or more Categorical Variables exist, the default is "15".
Analytic Solver Data Science offers five different selection procedures for
selecting the best subset of variables.
• Backward Elimination, in which variables are eliminated one at a time,
starting with the least significant. If this procedure is selected, FOUT
is enabled; a statistic is calculated as variables are eliminated.
Regression Display
Under Regression: Display, select all desired display options to include each in
the output.
Under Statistics, the following display options are present.
• ANOVA
• Variance-Covariance Matrix
• Multicollinearity Diagnostics
Under Advanced, the following display options are present.
• Analysis of Coefficients
• Analysis of Residuals
• Influence Diagnostics
Linear Regression Dialog, Scoring tab
See below, for option explanations included on the Linear Regression Scoring
tab.
Introduction
In the k-nearest-neighbor regression method, the training data set is used to
predict the value of a variable of interest for each member of a "target" data set.
The structure of the data generally consists of a variable of interest ("amount
purchased," for example), and a number of additional predictor variables (age,
income, location, etc.).
1. For each row (case) in the target data set (the set to be predicted), locate the
k closest members (the k nearest neighbors) of the training data set. A
Euclidean Distance measure is used to calculate how close each member of
the training set is to the target row that is being examined.
2. Find the weighted sum of the variable of interest for the k nearest neighbors
(the weights are the inverse of the distances).
3. Repeat this procedure for the remaining rows (cases) in the target set.
4. Additionally, Analytic Solver Data Science allows the user to select a
maximum value for k; it builds models in parallel on all values of k (up to
the specified maximum) and performs scoring on the best of these models.
Computing time increases as k increases, but the advantage is that higher values
of k provide “smoothing” that reduces vulnerability to noise in the training data.
Typically, k is in the tens rather than the hundreds or thousands.
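The steps above can be sketched as a short function. This is an illustrative implementation, not Analytic Solver's code; it uses Euclidean distance and inverse-distance weights as described:

```python
import math

def knn_predict(target_row, train_X, train_y, k):
    """Predict the variable of interest for one target row as the
    inverse-distance-weighted average of its k nearest training neighbors."""
    dists = [(math.dist(target_row, x), y) for x, y in zip(train_X, train_y)]
    dists.sort(key=lambda d: d[0])
    neighbors = dists[:k]
    # An exact match (distance 0) fixes the prediction to that neighbor's value.
    for d, y in neighbors:
        if d == 0:
            return y
    weights = [1 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Tiny illustration with two predictors (hypothetical values):
X = [(1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
y = [10.0, 20.0, 100.0]
print(knn_predict((1.5, 1.5), X, y, k=2))  # -> 15.0 (two equidistant neighbors)
```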
Input
1. Click Help – Example Models on the Data Science ribbon, then click
Forecasting/Data Science Examples. Click the Boston Housing link to
open Boston_Housing.xlsx. This dataset contains 14 variables, the
description of each is given in the Description worksheet included within
the example workbook. The dependent variable MEDV is the median value
of a dwelling. The objective of this example is to predict the value of this
variable. A portion of the dataset is shown below.
All supervised algorithms include a new Simulation tab. This tab uses the
functionality from the Generate Data feature (described in the Generate
Data section appearing earlier in this guide) to generate synthetic data based
on the training partition, and uses the fitted model to produce predictions for
the synthetic data. The resulting report, KNNP_Simulation, will contain the
synthetic data, the predicted values and the Excel-calculated Expression
column, if present.
2. Partition the data into training and validation sets using the Standard Data
Partition defaults with percentages of 60% of the data randomly allocated to
the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Science
Partitioning chapter.
Note: If using Analytic Solver Desktop, the STDPartition worksheet is inserted into
the Model tab of the Analytic Solver task pane under Transformations -- Data
Partition and the data used in the partition will appear under Data, as shown in the
screenshot below.
10. Click Next to advance to the Simulation tab. This tab is disabled in Analytic
Solver Optimization, Analytic Solver Simulation and Analytic Solver
Upgrade.
KNNP_Output
This result worksheet includes 3 segments: Output Navigator, Inputs and the
Search Log.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
KNNP_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the k-Nearest Neighbors Prediction dialog.
KNNP_Output: Inputs
• Search Log: Scroll down KNNP_Output to the Search Log report (shown
below). As per our specifications, Analytic Solver Data Science has
calculated the RMS error for all values of k and denoted the value of k with
the smallest RMS Error. The validation partition will be scored using this
value of k.
• Frequency Charts: The output variable frequency chart for the training
partition opens automatically once the KNNP_TrainingScore worksheet is
selected. To close this chart, click the “x” in the upper right hand corner of
the chart. To reopen, click onto another tab and then click back to the
KNNP_TrainingScore tab. To move the dialog to a new location on the
screen, simply grab the title bar and drag the dialog to the desired location.
This chart displays a detailed, interactive frequency chart for the Actual
variable data and the Predicted data, for the training partition.
Predicted values for training partition, MEDV variable
Notice in the screenshot below that both the Actual and Prediction data
appear in the chart together, and statistics for both data appear on the right.
Statistics Pane
Select Bin Details from the drop down menu to view Bin Details for
each bin in the chart. Use the Chart Options view to manually select
the number of bins to use in the chart, as well as to set personalization
options.
As discussed above, see the Analyze Data section of the Exploring Data
chapter in the Data Science Reference Guide for an in-depth discussion of
this chart as well as descriptions of all statistics, percentiles, bin details and
six sigma indices.
• Prediction Summary: A key interest in a data-mining context will be the
predicted and actual values for the MEDV variable along with the residual
(difference) for each predicted value in the Training partition.
The Training: Prediction Summary report summarizes the prediction error.
The first number, the total sum of squared errors, is the sum of the squared
deviations (residuals) between the predicted and actual values. The second
is the average of the squared residuals, the third is the square root of the
average of the squared residuals and the fourth is the average deviation. All
these values are calculated for the best k, i.e. k=6. Note that the algorithm
perfectly predicted the correct median selling price for each census tract in
the training partition.
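The four numbers in the Prediction Summary are simple functions of the residuals. A sketch, taking "average deviation" to mean the mean absolute residual (an assumption) and using hypothetical actual vs. predicted MEDV values:

```python
import math

def prediction_summary(actual, predicted):
    """SSE, MSE, RMSE and average deviation, as in the Prediction Summary report."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r ** 2 for r in residuals)             # total sum of squared errors
    mse = sse / len(residuals)                       # average of the squared residuals
    rmse = math.sqrt(mse)                            # root of the average squared residual
    avg_dev = sum(abs(r) for r in residuals) / len(residuals)
    return sse, mse, rmse, avg_dev

# Hypothetical actual vs. predicted values, for illustration only:
print(prediction_summary([24.0, 21.6, 34.7], [23.0, 22.6, 34.7]))
```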
• Prediction Details displays the predicted value, the actual value and the
difference between them (the residuals), for each record.
KNNP_ValidationScore
A key interest in a data-mining context will be the predicted and actual values
for the MEDV variable along with the residual (difference) for each predicted
value in the Validation partition.
KNNP_ValidationScore displays the newly added Output Variable frequency
chart, the Validation: Prediction Summary and the Validation: Prediction
Details report. All calculations, charts and predictions on the
KNNP_ValidationScore output sheet apply to the Validation partition.
• Frequency Charts: The output variable frequency chart for the validation
partition opens automatically once the KNNP_ValidationScore worksheet is
selected. This chart displays a detailed, interactive frequency chart for the
Actual variable data and the Predicted data, for the validation partition. For
more information on this chart, see the KNNP_TrainingScore explanation
above.
Decile-Wise Lift Chart, RROC Curve and Lift Chart from Validation
Partition
Select Lift Chart (Alternative) to display Analytic Solver Data Science's new
Lift Chart. Each of these charts consists of an Optimum Predictor curve, a
Fitted Predictor curve, and a Random Predictor curve. The Optimum Predictor
curve plots a hypothetical model that would provide a perfect fit to the data.
The Fitted Predictor curve plots the fitted model and the Random Predictor
curve plots the results from using no model or by using a random guess.
The Alternative Lift Chart plots Lift against % Cases. The Gain Chart plots the
Gain Ratio against % Cases.
KNNP_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, KNNP_Simulation, when Simulate Response Prediction is selected
on the Simulation tab of the k-Nearest Neighbors dialog in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.)
This report contains the synthetic data, the predicted values for the training
partition (using the fitted model) and the Excel-calculated Expression column,
if populated in the dialog. Users can switch between the Predicted, Training,
and Expression sources or a combination of two, as long as they are of the same
type.
Synthetic Data
The data contained in the Synthetic Data report is synthetic data, generated
using the Generate Data feature described in the chapter with the same name,
that appears earlier in this guide.
To change the data view, click the Prediction (Simulation) button. Select
Prediction (Training) and Prediction (Simulation) to add the training data to the
chart.
Data Dialog
In the chart below, the darker blue bars represent the predictions for the
synthetic data while the lighter blue bars represent the predictions for the
training data.
Prediction (Simulation) and Prediction (Training) Frequency chart for MEDV variable
KNNP_Stored
For information on Stored Model Sheets, in this example KNNP_Stored, please
refer to the “Scoring New Data” chapter that appears later in this guide.
Selected Variables
Select variables to be included in the model here.
Output Variable
Select the continuous variable whose outcome is to be predicted here.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by selecting Partition Options on the
Parameters tab. If this option is selected, Analytic Solver Data Science will
partition your dataset (according to the partition options you set) immediately
before running the prediction method. If partitioning has already occurred on
the dataset, this option will be disabled. For more information on partitioning,
please see the Data Science Partitioning chapter.
k-Nearest Neighbors Regression Dialog,
Parameters tab
Rescale Data
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
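The four rescaling methods named above are commonly defined as follows. This is a sketch of the standard formulas, not necessarily Analytic Solver's exact implementation (see the Rescale Continuous Data chapter for the definitions the software uses).

```python
import math

def standardize(xs):                    # (x - mean) / std
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def normalize(xs):                      # maps to [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def adjusted_normalize(xs):             # maps to [-1, 1]
    lo, hi = min(xs), max(xs)
    return [2 * (x - lo) / (hi - lo) - 1 for x in xs]

def unit_norm(xs):                      # scales to Euclidean length 1
    length = math.sqrt(sum(x * x for x in xs))
    return [x / length for x in xs]

print(normalize([10.0, 20.0, 30.0]))    # [0.0, 0.5, 1.0]
```

Because rescaling changes the range of each feature, bounds taken from the original data (via “Min/Max as bounds”) will not match the rescaled values, which is why the guide cautions users to review them.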
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
# Neighbors (k)
This is the parameter k in the k-nearest neighbor algorithm. If the number of
observations (rows) is less than 50 then the value of k should be between 1 and
the total number of observations (rows). If the number of rows is greater than
50, then the value of k should be between 1 and 50. The default value is 1.
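The role of k can be illustrated with a minimal k-nearest-neighbors prediction sketch: the predicted value for a record is the average output value of its k closest training records. Euclidean distance and the sample data below are illustrative assumptions, not Analytic Solver's implementation.

```python
# Minimal k-NN prediction for a numeric output variable.

def knn_predict(train_X, train_y, query, k):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    ranked = sorted(zip(train_X, train_y), key=lambda rec: dist(rec[0], query))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / k             # average of the k nearest outputs

train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]
print(knn_predict(train_X, train_y, [2.5], k=2))   # average of 20 and 30
```

Small k fits the training data closely (k=1 reproduces each training record exactly), while larger k smooths the prediction over more neighbors, which is why the software searches over candidate values for the best k.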
Introduction
As with all regression techniques, Analytic Solver Data Science assumes the
existence of a single output (response) variable and one or more input
(predictor) variables. The output variable is numerical. The general regression
tree building methodology allows input variables to be a mixture of continuous
and categorical variables. A decision tree is generated where each decision node
in the tree contains a test on some input variable's value. The terminal nodes of
the tree contain the predicted output variable values.
A Regression tree may be considered a variant of decision trees, designed to
approximate real-valued functions rather than being used for classification.
Methodology
A Regression tree is built through a process known as binary recursive
partitioning. This is an iterative process that splits the data into partitions or
“branches”, and then continues splitting each partition into smaller groups as the
method moves up each branch.
Initially, all records in the training set (the pre-classified records that are used to
determine the structure of the tree) are grouped into the same partition. The
algorithm then begins allocating the data into the first two partitions or
“branches”, using every possible binary split on every field. The algorithm
selects the split that minimizes the sum of the squared deviations from the mean
in the two separate partitions. This splitting “rule” is then applied to each of the
new branches. This process continues until each node reaches a user-specified
minimum node size and becomes a terminal node. (If the sum of squared
deviations from the mean in a node is zero, then that node is considered a
terminal node even if it has not reached the minimum size.)
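The split criterion described above can be sketched as an exhaustive search over one field: every candidate binary split is scored by the total sum of squared deviations from the mean within the two resulting partitions, and the smallest total wins. An illustration of the criterion, not Analytic Solver's implementation; the data values are hypothetical.

```python
def ssd(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (split_value, total_SSD) minimizing within-partition SSD."""
    best = None
    for threshold in sorted(set(xs))[1:]:           # candidate split points
        left  = [y for x, y in zip(xs, ys) if x <  threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = ssd(left) + ssd(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

xs = [1.0, 2.0, 3.0, 8.0, 9.0]
ys = [10.0, 11.0, 9.0, 30.0, 31.0]
print(best_split(xs, ys))   # splits the low-valued group from the high one
```

In the full algorithm this search runs over every field, and the winning split is applied recursively to each new branch until the stopping rules above are met.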
2. Partition the data into training and validation sets using the Standard Data
Partition defaults of 60% of the data randomly allocated to the Training Set
and 40% of the data randomly allocated to the Validation Set. For more
information on partitioning a dataset, see the Data Science Partitioning
chapter.
8. Select Show Feature Importance. This table shows the relative importance
of each feature, measured as the reduction of the error criterion during
tree growth.
9. Leave Maximum Number of Levels at the default setting of 7. This option
specifies the maximum number of levels in the tree to be displayed in the
output. Select Trees to Display to select the types of trees to display:
Fully Grown, Best Pruned, Minimum Error or User Specified.
• Select Fully Grown to “grow” a complete tree using the training data.
• Select Best Pruned to create a tree with the fewest number of nodes,
subject to the constraint that the error be kept below a specified level
(minimum error rate plus the standard error of that error rate).
• Select Minimum error to produce a tree that yields the minimum
classification error rate when tested on the validation data.
• To create a tree with a specified number of decision nodes select User
Specified and enter the desired number of nodes.
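The Best Pruned selection rule described above can be sketched as follows. The candidate list of pruned trees, with their validation errors and standard errors, is hypothetical; this illustrates only the selection logic.

```python
# Pick the tree with the fewest nodes whose error stays within one
# standard error of the minimum-error tree.

def best_pruned(candidates):
    """candidates: list of (num_nodes, error, std_error) tuples."""
    _, min_err, min_se = min(candidates, key=lambda c: c[1])
    eligible = [c for c in candidates if c[1] <= min_err + min_se]
    return min(eligible, key=lambda c: c[0])

candidates = [(15, 4.0, 0.5), (9, 4.2, 0.6), (5, 4.4, 0.7), (3, 6.0, 0.9)]
print(best_pruned(candidates))   # (5, 4.4, 0.7): fewest nodes within 4.0 + 0.5
```

The Minimum Error tree is simply the candidate with the lowest validation error; Best Pruned trades a small amount of error for a substantially smaller tree.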
Select Fully Grown, Best Pruned, and Minimum Error.
Select the tree(s) to display in the output.
Output Worksheets
Output sheets containing the results for Regression Tree will be inserted into
your active workbook to the right of the STDPartition worksheet.
RT_Output
Output from the prediction method will be inserted to the right of the STDPartition worksheet.
RT_Output includes 4 segments: Output Navigator, Inputs, Training Log and
Feature Importance.
• Output Navigator: The Output Navigator appears at the top of each output
worksheet. Use this feature to quickly navigate to all reports included in the
output.
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Regression Tree dialog.
• Training Log: Scroll down to the Training log (shown below) to see the
mean-square error (MSE) at each stage of the tree for both the training and
validation data sets. The MSE value is the average of the squares of the
errors between the predicted and observed values in the sample. The
training log shows that the training MSE continues reducing as the tree
continues to split.
Training Log with Best Pruned Tree and Minimum Error Tree markers
Feature Importance: This table displays the variables that are included in the
model along with their Importance value. The larger the Importance value, the
bigger the influence the variable has on the predicted classification. In this
instance, the census tracts with homes with many rooms will be predicted as
having a larger selling price.
RT_FullTree
To view the Full Grown Tree, either click the Fully Grown Tree link in the
Output Navigator or click the RT_FullTree worksheet tab. Recall that the Fully
Grown Tree is the tree used to fit the Regression Tree model (using the Training
data) and the tree used to score the validation partition.
RT_BestTree
To view the Best Pruned Tree, either click the Best Pruned Tree link in the
Output Navigator or click the RT_BestTree worksheet tab. Recall that the Best
Pruned Tree is the smallest tree that has an error within one standard error of the
minimum error tree.
The path from above can be followed through the Best Pruned Tree Rules table.
Node 1: 202 cases from the validation partition are assigned to nodes 2 (96
cases) and 3 (106 cases) using the LSTAT variable with a split value of 9.725.
Node 3: 106 cases from the validation partition are assigned to nodes 6 (66
cases) and 7 (30 cases) using the RM variable with a split value of 7.011.
Node 6: 73 cases from the validation partition are assigned to this terminal
node. The predicted response is equal to 19.261.
Node 7: 33 cases from the validation partition are assigned to this terminal
node. The predicted response is equal to 12.616.
To add the actual data, click Prediction, then select both Prediction and
Actual.
Click to add Actual data
Notice in the screenshot below that both the Original and Synthetic data
appear in the chart together, and statistics for both data appear on the right.
As you can see from this chart, the fitted regression model perfectly
predicted the values for the output variable, MEDV, in the training
partition.
To remove either the Prediction or Actual data from the chart, click
Prediction/Actual in the top right and then uncheck the data type to be
removed.
This chart behaves the same as the interactive chart in the Analyze Data
feature found on the Explore menu (described in the Analyze Data chapter
that appears earlier in this guide).
• Click the down arrow next to Statistics to view Percentiles for each
type of data along with Six Sigma indices.
• Click the down arrow next to Statistics to view Bin Details to find
information pertaining to each bin in the chart.
• Use the Chart Options view to manually select the number of bins to
use in the chart, as well as to set personalization options.
• Training: Prediction Summary: The Prediction Summary tables contain
summary information for the training partition. These reports contain the
total sum of squared errors, the mean squared error, the root mean square
error (RMS error, or the square root of the mean squared error), and the
average error (which is much smaller, since negative and positive errors
tend to cancel each other out unless squared first).
Small error values in both datasets suggest that the Single Tree method has
created a very accurate predictor. However, in general, these errors are not
great measures. RROC curves (discussed below) are much more
sophisticated and provide more precise information about the accuracy of
the predictor.
In this example, we see that the fitted model perfectly predicted the
value for the output variable in all training partition records.
RT_ValidationScore
Another key interest in a data-mining context will be the predicted and actual
values for the MEDV variable along with the residual (difference) for each
predicted value in the Validation partition.
RT_ValidationScore displays the newly added Output Variable frequency chart,
the Validation: Prediction Summary and the Validation: Prediction Details report.
All calculations, charts and predictions on the RT_ValidationScore output sheet
apply to the Validation partition.
• Frequency Charts: The output variable frequency chart for the validation
partition opens automatically once the RT_ValidationScore worksheet is
selected. This chart displays a detailed, interactive frequency chart for the
Actual variable data and the Predicted data, for the validation partition. For
more information on this chart, see the RT_TrainingScore explanation
above.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted in descending order using the predicted output variable value.
After sorting, the actual outcome values of the output variable are cumulated
and the lift curve is drawn as the number of cases versus the cumulated value.
The baseline (red line connecting the origin to the end point of the blue line) is
drawn as the number of cases versus the average of actual output variable values
multiplied by the number of cases. The decilewise lift curve is drawn as the
decile number versus the cumulative actual output variable value divided by the
decile's mean output variable value. The bars in this chart indicate the factor
by which the model outperforms a random assignment, one decile at a time.
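The lift curve construction described above can be sketched as: sort the records in descending order of the predicted value, cumulate the actual values, and take the overall average actual value times the case count as the baseline. The values below are illustrative, not Solver output.

```python
def lift_curve(predicted, actual):
    """Return (cumulated actuals, baseline) per case count, sorted by
    descending predicted value."""
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    avg = sum(actual) / len(actual)
    lift, baseline, running = [], [], 0.0
    for count, i in enumerate(order, start=1):
        running += actual[i]
        lift.append(running)            # cumulated actual values
        baseline.append(avg * count)    # random-model reference line
    return lift, baseline

predicted = [50.0, 10.0, 30.0]
actual    = [48.0, 12.0, 30.0]
print(lift_curve(predicted, actual))
```

A model with real predictive power pulls the lift curve above the baseline in the early cases; both curves necessarily meet at the final case.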
Select Lift Chart (Alternative) to display Analytic Solver Data Science's new
Lift Chart. Each of these charts consists of an Optimum Classifier curve, a
Fitted Classifier curve, and a Random Classifier curve. The Optimum Classifier
curve plots a hypothetical model that would provide perfect classification for
our data. The Fitted Classifier curve plots the fitted model and the Random
Classifier curve plots the results from using no model or by using a random
guess (i.e. for x% of selected observations, x% of the total number of positive
observations are expected to be correctly classified).
The Alternative Lift Chart plots Lift against % Cases.
Lift Chart (Alternative) and Gain Chart for Training Partition
Click the down arrow and select Gain Chart from the menu. In this chart, the
Gain Ratio is plotted against the % Cases.
RT_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, RT_Simulation, when Simulate Response Prediction is selected on
the Simulation tab of the Regression Tree dialog in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.)
This report contains the synthetic data, the predicted values for the training data
(using the fitted model) and the Excel-calculated Expression column, if
populated in the dialog. Users can switch between the Predicted, Training, and
Expression sources or a combination of two, as long as they are of the same
type.
Synthetic Data
The data contained in the Synthetic Data report is synthetic data, generated
using the Generate Data feature described in the chapter with the same name,
that appears earlier in this guide.
In the chart below, the dark blue bars display the frequencies for the synthetic
data and the light blue bars display the frequencies for the predicted values in
the Training partition.
Bin Details pane
Prediction (Simulation) and Prediction (Training) Frequency chart for MEDV variable
The Relative Bin Differences curve charts the absolute differences between the
data in each bin. Click the down arrow next to Statistics to view the Bin Details
pane to display the calculations.
Regression Tree Dialog, Data tab
Selected variables
Variables listed here will be utilized in the Analytic Solver Data Science output.
Output Variable
Select the variable whose outcome is to be predicted here.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
regression method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Rescale Data
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data. Note
that the “Min/Max as bounds” feature is available for the user’s convenience. Users must still
be aware of any possible data transformations (i.e. Rescaling) and review the bounds to make
sure that all are appropriate.
Tree Growth
In the Tree Growth section, select Levels, Nodes, Splits, and Records in
Terminal Nodes. Values entered for these options limit tree growth, i.e. if 10
is entered for Levels, the tree will be limited to 10 levels.
Simulation Tab
All supervised algorithms include a new Simulation tab in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.) This tab uses the functionality from the Generate Data feature
(described earlier in this guide) to generate synthetic data based on the training
partition, and uses the fitted model to produce predictions for the synthetic data.
The resulting report, RT_Simulation, will contain the synthetic data, the
predicted values and the Excel-calculated Expression column, if present. In
addition, frequency charts containing the Predicted, Training, and Expression (if
present) sources or a combination of any pair may be viewed, if the charts are of
the same type.
Evaluation: Select Calculate Expression to append an Expression column onto
the frequency chart displayed on the RT_Simulation output tab. Expression can
be any valid Excel formula that references a variable and the response as
[@COLUMN_NAME]. Click the Expression Hints button for more information
on entering an expression.
Introduction
Artificial neural networks are relatively crude electronic networks of "neurons"
based on the neural structure of the brain. They process records one at a time,
and "learn" by comparing their prediction of the record (which, at the outset, is
largely arbitrary) with the known actual record. The error from the initial
prediction of the first record is fed back to the network and used to modify the
network’s algorithm for the second iteration. These steps are repeated multiple
times.
Roughly speaking, a neuron in an artificial neural network is
1. A set of input values (xi) with associated weights (wi)
2. An input function (g) that sums the weighted inputs and maps the result
to an output function (y).
Neurons are organized into layers: input, hidden and output. The input layer is
composed not of full neurons, but simply of the values in a record that are inputs
to the next layer of neurons. The next layer is the hidden layer of which there
could be several. The final layer is the output layer, where there is one node for
each class. A single sweep forward through the network results in the
assignment of a value to each output node. The record is assigned to the class
node with the highest value.
Feedforward, Back-Propagation
The feedforward, back-propagation architecture was developed in the early
1970's by several independent sources (Werbos; Parker; Rumelhart, Hinton and
Williams). This independent co-development was the result of a proliferation of
articles and talks at various conferences which stimulated the entire industry.
Currently, this synergistically developed back-propagation architecture is the
most popular, effective, and easy-to-learn model for complex, multi-layered
networks. Its greatest strength is in non-linear solutions to ill-defined problems.
The typical back-propagation network has an input layer, an output layer, and at
least one hidden layer. Theoretically, there is no limit on the number of hidden
layers but typically there are just one or two. Some studies have shown that the
total number of layers needed to solve problems of any complexity is 5 (one
input layer, three hidden layers and an output layer). Each layer is fully
connected to the succeeding layer.
As noted above, the training process normally uses some variant of the Delta
Rule, which starts with the calculated difference between the actual outputs and
the desired outputs. Using this error, connection weights are increased in
proportion to the error times a scaling factor for global accuracy. This
means that the inputs, the output, and the desired output all must be present at
the same processing element. The most complex part of this algorithm is
determining which input contributed the most to an incorrect output and how to
modify the input to correct the error. (An inactive node would not contribute to
the error and would have no need to change its weights.) To solve this problem,
training inputs are applied to the input layer of the network, and desired outputs
are compared at the output layer. During the learning process, a forward sweep
is made through the network, and the output of each element is computed layer
by layer. The difference between the output of the final layer and the desired
output is back-propagated to the previous layer(s), usually modified by the
derivative of the transfer function. The connection weights are normally
adjusted using the Delta Rule. This process proceeds for the previous layer(s)
until the input layer is reached.
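The Delta Rule described above can be illustrated for a single sigmoid unit: the error (desired minus actual output), scaled by the derivative of the transfer function and a learning rate, adjusts each connection weight. A toy sketch of the update idea, not the full multi-layer back-propagation algorithm; inputs, target, and learning rate are hypothetical.

```python
import math

def sigmoid(g):
    return 1.0 / (1.0 + math.exp(-g))

def delta_rule_update(weights, inputs, desired, rate=0.5):
    out = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
    # error times the derivative of the sigmoid transfer function
    delta = (desired - out) * out * (1.0 - out)
    return [w + rate * delta * x for x, w in zip(inputs, weights)]

weights = [0.2, -0.1]
for _ in range(2000):                   # repeated training sweeps
    weights = delta_rule_update(weights, [1.0, 0.5], desired=0.9)
out = sigmoid(sum(x * w for x, w in zip([1.0, 0.5], weights)))
print(round(out, 3))                    # approaches the desired output 0.9
```

In the multi-layer case, this same error signal is propagated backward through each hidden layer, modified by the transfer-function derivative at each element, as the text describes.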
First, we partition the data into training and validation sets using the Standard
Data Partition defaults with percentages of 60% of the data randomly allocated
to the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Science
Partitioning chapter.
Keep the defaults for both Hidden Layer and Output Layer. See the Neural
Network Regression Options section below for more information on these
options.
The top section includes the Output Navigator which can be used to quickly
navigate to various sections of the output. The Data, Variables, and
Parameters/Options sections of the output all reflect inputs chosen by the user.
Scroll down to the Error Report, a portion is shown below. This report displays
each network created by the Automatic Architecture algorithm and can be sorted
by each column by clicking the down arrow next to each column heading.
Click the down arrow next to the last column heading, Validation:MSE (Mean
Squared Error for the Validation dataset), and select Sort Smallest to Largest
from the menu. Note: Sorting is not supported in AnalyticSolver.com.
Immediately, the records in the table are sorted by smallest value to largest value
according to the Validation: MSE values.
Inputs
This example will use the same partitioned dataset to illustrate the use of the
Manual Network Architecture selection.
1. Click back to the STDPartition sheet and then click Predict – Neural
Network – Manual Network on the Data Science ribbon.
2. Select MEDV as the Output variable and the remaining variables as
Selected Variables (except the CAT.MEDV, CHAS and Record ID
variables).
The last variable, CAT.MEDV, is a discrete classification of the MEDV
variable and will not be used in this example. CHAS is a categorical
variable which will also not be used in this example.
Output
Output sheets containing the results of the Neural Network will be inserted into
your active workbook to the right of the STDPartition worksheet.
NNP_Output
This result worksheet includes 3 segments: Output Navigator, Inputs and
Neuron Weights.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
NNP_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Neural Network Regression dialog.
NNP_Output, Inputs Report
NNP_TrainLog
Click the Training Log link on the Output Navigator or the NNP_TrainLog
worksheet tab to display the following log.
During an epoch, each training record is fed forward in the network and
classified. The error is calculated and is back propagated for the weights
correction. Weights are continuously adjusted during the epoch. The sum of
squares error is computed as the records pass through the network, so the log
does not report the sum of squares error after the final weight adjustment.
Scoring of the training data is performed using the final weights, so the
training classification error may not exactly match the last epoch error in
the Epoch log.
NNP_TrainingScore
Click the NNP_TrainingScore tab to view the newly added Output Variable
frequency chart, the Training: Prediction Summary and the Training:
Prediction Details report. All calculations, charts and predictions on this
worksheet apply to the Training data.
To add the Actual data to the chart, click Prediction in the upper right hand
corner and select both checkboxes in the Data dialog.
Click Prediction to add Actual data to the interactive chart.
Notice in the screenshot below that both the Prediction and Actual data appear
in the chart together, and statistics for both appear on the right.
• Use the mouse to hover over any of the bars in the graph to populate
the Bin and Frequency headings at the top of the chart.
• When displaying either Prediction or Actual data (not both), red
vertical lines will appear at the 5% and 95% percentile values in all
three charts (Frequency, Cumulative Frequency and Reverse
Cumulative Frequency), effectively displaying the 90% confidence
interval. The middle percentage is the percentage of all the variable
values that lie within the ‘included’ area, i.e. the darker shaded area.
The two percentages on each end are the percentages of all variable
values that lie outside of the ‘included’ area, or the “tails”, i.e. the
lighter shaded area. Percentile values can be altered by moving either
red vertical line to the left or right.
Frequency chart with percentage markers moved
• Click the down arrow next to Statistics to view Percentiles for each
type of data along with Six Sigma indices.
Reverse Cumulative Frequency chart and Six Sigma indices displayed.
• Click the down arrow next to Statistics to view Bin Details to display
information related to each bin.
Reverse Cumulative Frequency chart and Bin Details pane displayed
• Use the Chart Options view to manually select the number of bins to
use in the chart, as well as to set personalization options.
As discussed above, see the Analyze Data section of the Exploring Data
chapter for an in-depth discussion of this chart as well as descriptions of all
statistics, percentiles, bin metrics and six sigma indices.
NNP_ValidationScore
Another key interest in a data-mining context will be the predicted and actual
values for the MEDV variable along with the residual (difference) for each
predicted value in the Validation partition.
NNP_ValidationScore displays the newly added Output Variable frequency
chart, the Validation: Prediction Summary and the Validation: Prediction
Details report. All calculations, charts and predictions on the
NNP_ValidationScore output sheet apply to the Validation partition.
• Frequency Charts: The output variable frequency chart for the validation
partition opens automatically once the NNP_ValidationScore worksheet is
selected. This chart displays a detailed, interactive frequency chart for the
Actual variable data and the Predicted data, for the validation partition. For
more information on this chart, see the NNP_TrainingScore explanation
above.
RROC charts, shown below, are better indicators of fit. Read on to view
how these more sophisticated tools can tell us about the fit of the neural
network to our data.
NNP_TrainingLiftChart & NNP_ValidationLiftChart
Click the NNP_TrainingLiftChart and NNP_ValidationLiftChart tabs to
view the lift charts and Regression ROC charts for both the training and
validation datasets.
Lift charts and Regression ROC Curves are visual aids for measuring model
performance. Lift Charts consist of a lift curve and a baseline. The greater the
area between the lift curve and the baseline, the better the model. RROC
(regression receiver operating characteristic) curves plot the performance of
regressors by graphing over-estimations (or predicted values that are too high)
versus underestimations (or predicted values that are too low.) The closer the
curve is to the top left corner of the graph (in other words, the smaller the area
above the curve), the better the performance of the model.
Note: To view these charts in the Cloud app, click the Charts icon on the
Ribbon, select NNP_TrainingLiftChart or NNP_ValidationLiftChart for
Worksheet and Decile Chart, ROC Chart or Gain Chart for Chart.
Decile-wise Lift Chart, RROC Curve and Lift Chart for Valid. Partition
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted using the predicted output variable value. After sorting, the
actual outcome values of the output variable are cumulated and the lift curve is
drawn as the number of cases versus the cumulated value. The baseline (red line
connecting the origin to the end point of the blue line) is drawn as the number of
cases versus the average of actual output variable values multiplied by the
number of cases.
The decilewise lift curve is drawn as the decile number versus the cumulative
actual output variable value divided by the decile's mean output variable value.
The bars in this chart indicate the factor by which the NNP model outperforms a
random assignment, one decile at a time. Typically, this graph will have a
"stairstep" appearance - the bars will descend in order from left to right. This
means that the model is "binning" the records correctly, from highest priced to
lowest. However, in this example, the left most bars are shorter than bars
appearing to the right. This type of graph indicates that the model might not be
a good fit to the data. Additional analysis is required.
The Regression ROC curve (RROC) was updated in V2017. This new chart
compares the performance of the regressor (Fitted Classifier) with an Optimum
Classifier Curve. The Optimum Classifier Curve plots a hypothetical model that
would provide perfect prediction results. The best possible prediction
performance is denoted by a point at the top left of the graph at the intersection
of the x and y axis. This point is sometimes referred to as the “perfect
classification”. Area Over the Curve (AOC) is the space in the graph that
appears above the ROC curve and is calculated using the formula σ²·n²/2,
where n is the number of records. The smaller the AOC, the better the
performance of the model.
In V2017, two new charts were introduced: a new Lift Chart and the Gain
Chart. To display these new charts, click the down arrow next to Lift Chart
(Original), in the Original Lift Chart, then select the desired chart.
Click the down arrow and select Gain Chart from the menu. In this chart, the
Gain Ratio is plotted against the % Cases.
NNP_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, NNP_Simulation, when Simulate Response Prediction is selected on
the Simulation tab of the Neural Network Regression dialog in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.)
The data contained in the Synthetic Data report is synthetic data, generated
using the Generate Data feature described in the chapter of the same name
that appears earlier in this guide.
The chart displayed once this tab is selected contains frequency
information pertaining to the output variable in the training data, the synthetic
data and the expression, if it exists. (Recall that no expression was entered in
this example.)
Frequency Chart for Prediction (Simulation) data
In the chart below, the dark blue bars display the frequencies for the synthetic
data and the light blue bars display the frequencies for the predicted values in
the Training partition.
The Relative Bin Differences curve charts the absolute differences between the
data in each bin. Click the down arrow next to Statistics to view the Bin Details
pane to display the calculations.
Click the down arrow next to Frequency to change the chart view to Relative
Frequency or to change the look by clicking Chart Options. Statistics on the
right of the chart dialog are discussed earlier in this section. For more
information on the generated synthetic data, see the Generate Data chapter that
appears earlier in this guide.
For information on NNP_Stored, please see the “Scoring New Data” chapter
within the Analytic Solver Data Science User Guide.
Selected Variables
Variables listed here will be utilized in the Analytic Solver Data Science output.
Categorical Variables
Place categorical variables from the Variables listbox to be included in the
model by clicking the > command button. The Neural Network Regression
algorithm will accept non-numeric categorical variables.
Output Variable
Select the variable whose outcome is to be predicted here.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Analytic Solver Data Science will partition your dataset
(according to the partition options you set) immediately before running the
regression method. If partitioning has already occurred on the dataset, this
option will be disabled. For more information on partitioning, please see the
Data Science Partitioning chapter.
Neural Network Regression dialog, Parameters tab
Rescale Data
Click Rescale Data to open the Rescaling dialog.
Use Rescaling to normalize one or more features in your data during the data
preprocessing stage. Analytic Solver Data Science provides the following
methods for feature scaling: Standardization, Normalization, Adjusted
Normalization and Unit Norm. For more information on this new feature, see
the Rescale Continuous Data section within the Transform Continuous Data
chapter that occurs earlier in this guide.
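The four feature-scaling methods can be sketched with their standard textbook definitions; the function names are illustrative, not Analytic Solver APIs:

```python
import math

def standardize(x):           # z-score: (x - mean) / sample stdev
    m = sum(x) / len(x)
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))
    return [(v - m) / s for v in x]

def normalize(x):             # rescale to the range [0, 1]
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def adjusted_normalize(x):    # rescale to the range [-1, 1]
    lo, hi = min(x), max(x)
    return [2 * (v - lo) / (hi - lo) - 1 for v in x]

def unit_norm(x):             # divide by the Euclidean norm of the vector
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]
```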
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Hidden Layers/Neurons
Click Add Layer to add a hidden layer. To delete a layer, click Remove Layer.
Once the layer is added, enter the desired Neurons.
Hidden Layer
Nodes in the hidden layer receive input from the input layer. The output of the
hidden nodes is a weighted sum of the input values. This weighted sum is
computed with weights that are initially set at random values. As the network
“learns”, these weights are adjusted. This weighted sum is used to compute the
hidden node’s output using a transfer function.
Select Sigmoid (the default setting) to use a logistic function for the transfer
function with a range of 0 to 1. This function has a “squashing effect” on very
small or very large values but is almost linear in the range where the value of the
function is between 0.1 and 0.9.
Select Hyperbolic Tangent to use the tanh function for the transfer function, the
range being -1 to 1. If more than one hidden layer exists, this function is used
for all layers.
ReLU (Rectified Linear Unit) is a widely used choice for hidden layers. This
activation function applies max(0,x) function to the neuron values. When used
instead of logistic sigmoid or hyperbolic tangent activations, some adjustments
to the Neural Network settings are typically required to achieve good
performance, such as significantly decreasing the learning rate and increasing
the number of learning epochs and network parameters.
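The three transfer functions, and a hidden node's weighted-sum-then-transfer output, can be sketched as follows (illustrative only; Analytic Solver computes these internally):

```python
import math

def sigmoid(x):   # logistic function: squashes values into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):      # hyperbolic tangent: squashes values into (-1, 1)
    return math.tanh(x)

def relu(x):      # rectified linear unit: max(0, x)
    return max(0.0, x)

def node_output(inputs, weights, bias, transfer=sigmoid):
    # A hidden node: weighted sum of inputs, passed through a transfer function.
    return transfer(sum(w * v for w, v in zip(weights, inputs)) + bias)
```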
Training Parameters
Click Training Parameters to open the Training Parameters dialog to specify
parameters related to the training of the Neural Network algorithm.
Learning Rate
This is the multiplying factor for the error correction during
backpropagation; it is roughly equivalent to the learning rate for the
neural network. A low value produces slow but steady learning, a high
value produces rapid but erratic learning. Values for the step size
typically range from 0.1 to 0.9.
Weight Decay
To prevent over-fitting of the network on the training data, set a weight
decay to penalize the weight in each iteration. Each calculated weight
will be multiplied by (1-decay).
Error Tolerance
The error in a particular iteration is backpropagated only if it is greater
than the error tolerance. Typically error tolerance is a small value in the
range from 0 to 1.
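A minimal sketch of how these three parameters interact in a single weight update, assuming a simple gradient-style error correction (the helper names are illustrative, not the product's internals):

```python
def update_weight(w, gradient, learning_rate=0.3, decay=0.0):
    # Error correction scaled by the learning rate, then the weight is
    # shrunk by multiplying by (1 - decay) to discourage over-fitting.
    w = w - learning_rate * gradient
    return w * (1 - decay)

def should_backpropagate(error, tolerance=0.01):
    # The error is only propagated back when it exceeds the tolerance.
    return abs(error) > tolerance
```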
Stopping Rules
Click Stopping Rules to open the Stopping Rules dialog. Here users can specify
a comprehensive set of rules for stopping the algorithm early plus cross-
validation on the training error.
Number of Epochs
An epoch is one sweep through all records in the training set. Use this
option to set the number of epochs to be performed by the algorithm.
Simulation Tab
All supervised algorithms include a new Simulation tab in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.) This tab uses the functionality from the Generate Data feature
(described earlier in this guide) to generate synthetic data based on the training
partition, and uses the fitted model to produce predictions for the synthetic data.
The resulting report, NNP_Simulation, will contain the synthetic data, the
predicted values and the Excel-calculated Expression column, if present. In
addition, frequency charts containing the Predicted, Training, and Expression (if
present) sources or a combination of any pair may be viewed, if the charts are of
the same type.
Evaluation: Select Calculate Expression to amend an Expression column onto
the frequency chart displayed on the NNP_Simulation output tab. Expression can
be any valid Excel formula that references a variable and the response as
[@COLUMN_NAME]. Click the Expression Hints button for more information
on entering an expression.
Bagging, or bootstrap aggregating, was one of the first ensemble algorithms ever
to be written. It is a simple algorithm, yet very effective. Bagging generates
several training data sets by using random sampling with replacement (bootstrap
sampling), applies the regression model to each dataset, then takes the average
amongst the models to calculate the predictions for the new data. The biggest
advantage of bagging is the relative ease with which the algorithm can be
parallelized, which makes it a better selection for very large datasets.
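The bagging procedure can be sketched as follows; `fit` stands in for any weak-learner training routine (an assumption of this sketch), and predictions are averaged across the bootstrap models:

```python
import random

def bagging_predict(train_x, train_y, fit, new_x, n_models=10, seed=0):
    # Fit one model per bootstrap sample (sampling with replacement),
    # then average the models' predictions for the new record.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(len(train_y)) for _ in range(len(train_y))]
        model = fit([train_x[i] for i in idx], [train_y[i] for i in idx])
        preds.append(model(new_x))
    return sum(preds) / len(preds)
```

Each bootstrap model can be trained independently, which is why bagging parallelizes so easily.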
eb = Σ (i = 1 to n) wb(i) · I(Cb(xi) ≠ yi)
The error of the regression model in the bth iteration is used to calculate the
constant αb. This constant is used to update the weight wb(i). In AdaBoost.M1
(Freund), the constant is calculated as:
αb= 1/2ln((1-eb)/eb)
Afterwards, the weights are all readjusted to sum to 1. As a result, the weights
assigned to the observations that were assigned inaccurate predicted values are
increased and the weights assigned to the observations that were assigned
accurate predicted values are decreased. This adjustment forces the next
regression model to put more emphasis on the records that were assigned
inaccurate predictions. (This α constant is also used in the final calculation
which will give the regression model with the lowest error more influence.)
This process repeats until b = Number of weak learners (controlled by the User).
The algorithm then computes the weighted average among all weak learners and
assigns that value to the record. Boosting generally yields better models than
bagging; however, it has the disadvantage that it is not parallelizable. As a
result, if the number of weak learners is large, boosting would not be suitable.
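The reweighting step described above, following the eb and αb formulas, can be sketched as:

```python
import math

def boosting_reweight(weights, inaccurate):
    # inaccurate[i] is True when record i received an inaccurate prediction.
    # eb: weighted error of the bth weak learner.
    e_b = sum(w for w, bad in zip(weights, inaccurate) if bad)
    # AdaBoost.M1 (Freund): alpha_b = 1/2 ln((1 - eb) / eb)
    alpha_b = 0.5 * math.log((1 - e_b) / e_b)
    # Up-weight the inaccurate records, down-weight the accurate ones.
    new_w = [w * math.exp(alpha_b if bad else -alpha_b)
             for w, bad in zip(weights, inaccurate)]
    total = sum(new_w)
    # Readjust the weights so they sum to 1.
    return [w / total for w in new_w], alpha_b
```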
Ensemble Methods are very powerful methods and typically result in better
performance than a single tree. This feature addition in Analytic Solver Data
Science (introduced in V2015) provides users with more accurate prediction
models and should be considered over the single tree method.
Input
1. Open the example dataset by clicking Help – Example Models –
Forecasting / Data Science Examples – Boston Housing. A portion of the
dataset is shown below. Neither the CHAS variable nor the CAT. MEDV
variable will be utilized in this example.
2. First, we partition the data into training and validation sets using the
Standard Data Partition defaults with percentages of 60% of the data
randomly allocated to the Training Set and 40% of the data randomly
allocated to the Validation Set. For more information on partitioning a
dataset, see the Data Science Partitioning chapter.
Standard Data Partitioning dialog
Output
Output sheets containing the results of the Boosting Prediction method will be
inserted into the active workbook, to the right of the STDPartition worksheet.
RBoosting_Output
This result worksheet includes 3 segments: Output Navigator, Inputs and
Boosting Model.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
RBoosting_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Boosting Regression dialog.
• Boosting Model: Click the Boosting Model link on the Output Navigator
to view the Boosting model for each weak learner. Recall that the default is
"10" on the Parameters tab.
To add the Actual data to the chart, click Prediction in the upper right hand
corner and select both checkboxes in the Data dialog.
Click Prediction to add Actual data to the interactive chart.
To remove either the Original or the Synthetic data from the chart, click
Original/Synthetic in the top right and then uncheck the data type to be removed.
This chart behaves the same as the interactive chart in the Analyze Data feature
found on the Explore menu.
• Use the mouse to hover over any of the bars in the graph to populate
the Bin and Frequency headings at the top of the chart.
• When displaying either Prediction or Actual data (not both), red
vertical lines will appear at the 5% and 95% percentile values in all
three charts (Frequency, Cumulative Frequency and Reverse
Cumulative Frequency), effectively displaying the 90th confidence
interval. The middle percentage is the percentage of all the variable
values that lie within the ‘included’ area, i.e. the darker shaded area.
The two percentages on each end are the percentages of all variable
values that lie outside of the ‘included’ area, or the “tails”, i.e. the
lighter shaded area. Percentile values can be altered by moving either
red vertical line to the left or right.
Frequency chart with percentage markers moved
• Click the down arrow next to Statistics to view Percentiles for each
type of data along with Six Sigma indices.
Reverse Cumulative Frequency chart and Six Sigma indices displayed.
• Click the down arrow next to Statistics to view Bin Details to display
information related to each bin in the chart.
Bin Details view
RBoosting_ValidationScore
Another key interest in a data-mining context will be the predicted and actual
values for the MEDV variable along with the residual (difference) for each
predicted value in the Validation partition.
RBoosting_ValidationScore displays the newly added Output Variable
frequency chart, the Validation: Prediction Summary and the Validation:
Prediction Details report. All calculations, charts and predictions on the
RBoosting_ValidationScore output sheet apply to the Validation partition.
• Frequency Charts: The output variable frequency chart for the validation
partition opens automatically once the RBoosting_ValidationScore
worksheet is selected. This chart displays a detailed, interactive frequency
chart for the Actual variable data and the Predicted data, for the validation
partition. For more information on this chart, see the
RBoosting_TrainingScore explanation above.
Validation Partition Frequency Chart
RROC charts, shown below, are better indicators of fit. Read on to view
how these more sophisticated tools can tell us about the fit of the neural
network to our data.
Decile-wise Lift Chart, RROC Curve and Lift Chart for Valid. Partition
RBoosting_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, RBoosting_Simulation, when Simulate Response Prediction is
selected on the Simulation tab of the Boosting Regression dialog.
This report contains the synthetic data, the predicted values for the training data
(using the fitted model) and the Excel – calculated Expression column, if
populated in the dialog. Users can switch between the Predicted, Training, and
Expression sources or a combination of two, as long as they are of the same
type.
Synthetic Data
The data contained in the Synthetic Data report is synthetic data, generated
using the Generate Data feature described in the chapter of the same name
that appears earlier in this guide.
The chart displayed once this tab is selected contains frequency
information pertaining to the output variable in the training data, the synthetic
data and the expression, if it exists. (Recall that no expression was entered in
this example.)
In the chart below, the dark blue bars display the frequencies for the synthetic
data and the light blue bars display the frequencies for the predicted values in
the Training partition.
Prediction (Simulation) and Prediction (Training) Frequency chart for MEDV variable
The Relative Bin Differences curve charts the absolute differences between the
data in each bin. Click the down arrow next to Statistics to view the Bin Details
pane to display the calculations.
Click the down arrow next to Frequency to change the chart view to Relative
Frequency or to change the look by clicking Chart Options. Statistics on the
Input
1. Click Predict – Ensemble – Bagging on the Data Science ribbon. The
Bagging – Data tab appears.
2. As in the example above, select MEDV as the Output variable and the
remaining variables as Selected Variables (except the CAT.MEDV,
CHAS and Record ID variables). (See screenshot of Boosting Regression
dialog, data tab above.)
3. Click Next to advance to the next tab.
4. Select the down arrow beneath Weak Learner and select Neural Network
from the menu. A command button will appear to the right of the Weak
Learner menu labeled Neural Network. Click this button and then Add
Layer twice to add two layers with 5 and 3 neurons, respectively. For more
information on any of these options, see the Neural Network chapter that
appears earlier in this Guide. Click Done to return to the Parameters tab.
Bagging Weak Learner
Output
Output sheets containing the results of the Bagging Prediction method will be
inserted into the active workbook, to the right of the STDPartition worksheet.
RBagging_Output
This result worksheet includes 3 segments: Output Navigator, Inputs and
Bagging Model.
• Output Navigator: The Output Navigator appears at the top of all result
worksheets. Use this feature to quickly navigate to all reports included in
the output.
RBagging_Output: Output Navigator
• Inputs: Scroll down to the Inputs section to find all inputs entered or
selected on all tabs of the Bagging Regression dialog.
• Bagging Model: Click the Bagging Model link on the Output Navigator
to view the Bagging model for each weak learner. Recall that the default is
"10" on the Parameters tab.
RBagging_TrainingScore
Click the RBagging_TrainingScore tab to view the newly added Output Variable
frequency chart, the Training: Prediction Summary and the Training:
Prediction Details report. All calculations, charts and predictions on this
worksheet apply to the Training data.
Decile-wise Lift Chart, RROC Curve and Lift Chart for Valid. Partition
The data contained in the Synthetic Data report is synthetic data, generated
using the Generate Data feature described in the chapter of the same name
that appears earlier in this guide.
The chart displayed once this tab is selected contains frequency
information pertaining to the output variable in the training data, the synthetic
data and the expression, if it exists. (Recall that no expression was entered in
this example.)
Frequency Chart for Prediction (Simulation) data
The Relative Bin Differences curve charts the absolute differences between the
data in each bin. Click the down arrow next to Statistics to view the Bin Details
pane to display the calculations.
Click the down arrow next to Frequency to change the chart view to Relative
Frequency or to change the look by clicking Chart Options. Statistics on the
right of the chart dialog are discussed earlier in this section. For more
information on the generated synthetic data, see the Generate Data chapter that
appears earlier in this guide.
See the “Scoring New Data” chapter in the Analytic Solver Data Science User
Guide for information on the Stored Model Sheet, RBagging_Stored.
Continue on with the Random Trees Regression Example in the next
section to compare the results among the ensemble methods.
Input
1. Click Predict – Ensemble – Random Trees on the Data Science ribbon.
Output
The output of the Ensemble Methods algorithm is inserted at the end of the
workbook.
RRandTrees_Output
This worksheet contains three sections: the Output Navigator, Inputs and
Boosting Model.
• Output Navigator: Double click RRandTrees_Output to view the Output
Navigator, which is inserted at the top of each output worksheet. Click any
link in this table to navigate to various sections of the output.
• Boosting Model: Click the Boosting Model link on the Output Navigator
to view the Boosting model for each weak learner. Recall that the default is
"10" on the Parameters tab.
Decile-wise Lift Chart, RROC Curve and Lift Chart for Valid. Partition
RRandTrees_Simulation
As discussed above, Analytic Solver Data Science generates a new output
worksheet, RRandTrees_Simulation, when Simulate Response Prediction is
selected on the Simulation tab of the Random Trees Regression dialog in
Analytic Solver Comprehensive and Analytic Solver Data Science. (This feature
is not supported in Analytic Solver Optimization, Analytic Solver Simulation or
Analytic Solver Upgrade.)
This report contains the synthetic data, the predicted values for the training data
(using the fitted model) and the Excel – calculated Expression column, if
populated in the dialog. Users can switch between the Predicted, Training, and
Expression sources or a combination of two, as long as they are of the same
type.
Synthetic Data
The data contained in the Synthetic Data report is synthetic data, generated
using the Generate Data feature described in the chapter of the same name
that appears earlier in this guide.
In the chart below, the dark blue bars display the frequencies for the synthetic
data and the light blue bars display the frequencies for the predicted values in
the Training partition.
Prediction (Simulation) and Prediction (Training) Frequency chart for MEDV variable
Selected Variables
Variables selected to be included in the output appear here.
Categorical Variables
Place categorical variables from the Variables listbox to be included in the
model by clicking the > command button. Ensemble Methods will accept non-
numeric categorical variables.
Output Variable
Boosting Regression dialog, Parameters tab
The dependent variable, or the variable to be predicted, appears here.
Please see below for options appearing on the Boosting – Parameters tab.
Partition Data
Analytic Solver Data Science includes the ability to partition a dataset from
within a classification or prediction method by clicking Partition Data on the
Parameters tab. Click Partition Data to open the Partitioning dialog. Analytic
Solver Data Science will partition your dataset (according to the partition
options you set) immediately before running the regression method. If
partitioning has already occurred on the dataset, this option will be disabled.
Rescale Data
Recall that the Euclidean distance measurement performs best when each
variable is rescaled. Here you can select how you want to standardize your
variables using Standardization, Normalization, Adjusted Normalization and
Unit Norm.
"On-the-fly" Rescaling dialog
If Rescale Data has been selected on the Rescaling dialog, users can still manually use the
“Min/Max as bounds” button within the Fitting Options section of the Simulation tab, to
populate the parameter grid with the bounds from the original data, not the rescaled data.
Note that the “Min/Max as bounds” feature is available for the user’s convenience. Users
must still be aware of any possible data transformations (i.e. Rescaling) and review the
bounds to make sure that all are appropriate.
Weak Learner
Under Ensemble: Regression click the down arrow beneath Weak Learner to
select one of the four featured classifiers: Linear Regression, k-NN, Neural
Networks, or Decision Trees. After a weak learner is chosen, the command
button to the right will be enabled. Click this command button to control
various option settings for the weak learner.
Step Size
The Adaboost algorithm minimizes a loss function using the gradient descent
method. The Step size option is used to ensure that the algorithm does not
descend too far when moving to the next step. It is recommended to leave this
option at the default of 0.3, but any number between 0 and 1 is acceptable. A
Step size setting closer to 0 results in the algorithm taking smaller steps to the
next point, while a setting closer to 1 results in the algorithm taking larger steps
towards the next point.
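The role of the step size can be illustrated with plain gradient descent on a one-dimensional loss; each step moves against the gradient, scaled by the step size so the algorithm does not overshoot (an illustrative sketch, not the Adaboost implementation itself):

```python
def gradient_descent(grad, x0, step_size=0.3, n_steps=60):
    # Repeatedly step against the gradient, scaled by the step size.
    x = x0
    for _ in range(n_steps):
        x = x - step_size * grad(x)
    return x

# Minimizing f(x) = (x - 2)^2, whose gradient is 2(x - 2): the iterates
# converge to the minimum at x = 2 with the default step size of 0.3.
minimum = gradient_descent(lambda x: 2 * (x - 2), 10.0)
```

A step size near 1 would take larger jumps per iteration; a step size near 0 would converge more slowly but more steadily.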
Please see below for options unique to the Random Trees – Parameters tab.
Random Trees Regression dialog, Parameters tab
Number of Randomly Selected Features
The Random Trees ensemble method works by training multiple “weak”
classification trees using a fixed number of randomly selected features, then
taking the mode of each class to create a “strong” classifier. The option Number
of randomly selected features controls the fixed number of randomly selected
features in the algorithm. The default setting is 3.
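The fixed-size random feature selection can be sketched as follows; `random_trees_feature_subsets` is an illustrative helper that draws one subset per weak tree:

```python
import random

def random_trees_feature_subsets(features, n_trees=10, n_selected=3, seed=0):
    # Draw one fixed-size random subset of features for each weak tree.
    rng = random.Random(seed)
    return [rng.sample(features, n_selected) for _ in range(n_trees)]
```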
Simulation Tab
All supervised algorithms include a new Simulation tab in Analytic Solver
Comprehensive and Analytic Solver Data Science. (This feature is not supported
in Analytic Solver Optimization, Analytic Solver Simulation or Analytic Solver
Upgrade.) This tab uses the functionality from the Generate Data feature
(described earlier in this guide) to generate synthetic data based on the training
partition, and uses the fitted model to produce predictions for the synthetic data.
The resulting report, RRandTrees_Simulation, will contain the synthetic data, the
predicted values and the Excel-calculated Expression column, if present. In
addition, frequency charts containing the Predicted, Training, and Expression (if
present) sources or a combination of any pair may be viewed, if the charts are of
the same type.
Evaluation: Select Calculate Expression to amend an Expression column onto
the frequency chart displayed on the RRandTrees_Simulation output tab. Expression can
be any valid Excel formula that references a variable and the response as
[@COLUMN_NAME]. Click the Expression Hints button for more information
on entering an expression.
Introduction
The goal of association rules mining is to recognize associations and/or
correlations among large sets of data items. A typical and widely-used example
of association rules mining is the Market Basket Analysis. Most ‘market basket’
databases consist of a large number of transaction records where each record
lists all items purchased by a customer during a trip through the check-out line.
Data is easily and accurately collected through the bar-code scanners.
Supermarket managers are interested in determining what foods customers
purchase together, like, for instance, bread and milk, bacon and eggs, wine and
cheese, etc. This information is useful in planning store layouts (placing items
optimally with respect to each other), cross-selling promotions, coupon offers,
etc.
Association rules provide results in the form of "if-then" statements. These rules
are computed from the data and, unlike the if-then rules of logic, are
probabilistic in nature. The “if” portion of the statement is referred to as the
antecedent and the “then” portion of the statement is referred to as the
consequent.
In addition to the antecedent (the "if" part) and the consequent (the "then" part),
an association rule contains two numbers that express the degree of uncertainty
about the rule. In association analysis the antecedent and consequent are sets of
items (called itemsets) that are disjoint, meaning they do not have any items in
common. The first number is called the support which is simply the number of
transactions that include all items in the antecedent and consequent. (The
support is sometimes expressed as a percentage of the total number of records in
the database.) The second number is known as the confidence which is the ratio
of the number of transactions that include all items in the consequent as well as
the antecedent (namely, the support) to the number of transactions that include
all items in the antecedent. For example, assume a supermarket database has
100,000 point-of-sale transactions, out of which 2,000 include both items A and
B and 800 of these include item C. The association rule "If A and B are
purchased then C is purchased on the same trip" has a support of 800
transactions (alternatively 0.8% = 800/100,000) and a confidence of 40%
(=800/2,000). In other words, support is the probability that a randomly selected
transaction from the database will contain all items in the antecedent and the
consequent. Confidence is the conditional probability that a randomly selected
transaction will include all the items in the consequent given that the transaction
includes all the items in the antecedent.
Lift is one more parameter of interest in the association analysis. Lift is the ratio
of Confidence to Expected Confidence. Expected Confidence, in the example
above, is the confidence under the assumption that buying A and B does not
enhance the probability of buying C, i.e. the number of transactions that include
the consequent divided by the total number of transactions. Suppose the total
number of transactions for C is 5,000. Expected Confidence is computed as 5%
(5,000/100,000), and Lift, the ratio of Confidence to Expected Confidence, is 8
(40%/5%). Hence, Lift is a value that provides information about the increase in
probability of the "then" (consequent) given the "if" (antecedent).
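The support, confidence, and lift calculations can be verified with a short sketch using the numbers from the supermarket example (the helper name is illustrative):

```python
def rule_metrics(n_antecedent, n_both, n_consequent, n_total):
    support = n_both / n_total                    # P(antecedent and consequent)
    confidence = n_both / n_antecedent            # P(consequent | antecedent)
    expected_confidence = n_consequent / n_total  # P(consequent)
    lift = confidence / expected_confidence
    return support, confidence, lift

# The example above: 100,000 transactions, 2,000 contain both A and B,
# 800 of those also contain C, and 5,000 contain C in total.
# This yields support 0.8%, confidence 40%, and lift 8.
s, c, l = rule_metrics(2000, 800, 5000, 100000)
```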
Rule 27 indicates that if a Cook book and a Reference book are purchased, then
with 80% confidence a Child book will also be purchased. The A - Support
indicates that the rule has the support of 305 transactions, meaning that 305
people bought a Cook book and a Reference book. The C - Support column
indicates the number of transactions involving the purchase of Child books.
Data Source
Worksheet: The worksheet name containing the dataset.
Workbook: The workbook name containing the dataset.
Data range: The selected data range.
#Rows: (Read only) The number of rows in the dataset.
#Cols: (Read only) The number of columns in the dataset.
First Row Contains Headers: Select this checkbox if the first row of the dataset
contains column headings.
SigmaCP
A Six Sigma index, SigmaCP predicts what the process is capable of producing
if the process mean is centered between the lower and upper limits. This index
assumes the process output is normally distributed.
Cp = (UpperSpecificationLimit − LowerSpecificationLimit) / (6σ̂)
SigmaCPK
A Six Sigma index, SigmaCPK predicts what the process is capable of
producing if the process mean is not centered between the lower and upper
limits. This index assumes the process output is normally distributed and will be
negative if the process mean falls outside of the lower and upper specification
limits.
Cpk = MIN(UpperSpecificationLimit − μ̂, μ̂ − LowerSpecificationLimit) / (3σ̂)
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
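The Cp and Cpk formulas above translate directly to code; the function names are illustrative:

```python
def sigma_cp(usl, lsl, sigma):
    # Cp: capability when the process mean is centered between the limits.
    return (usl - lsl) / (6 * sigma)

def sigma_cpk(usl, lsl, mu, sigma):
    # Cpk: capability for an off-center process mean; negative when the
    # mean falls outside the specification limits.
    return min(usl - mu, mu - lsl) / (3 * sigma)
```

For example, with limits 4 and 10 and a standard deviation of 1, a centered mean of 7 gives Cp = Cpk = 1, while shifting the mean toward a limit lowers Cpk.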
SigmaCPKLower
A Six Sigma index, SigmaCPKLower calculates the one-sided Process
Capability Index based on the lower specification limit. This index assumes the
process output is normally distributed.
Cp,lower = (μ̂ − LowerSpecificationLimit) / (3σ̂)
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
SigmaCPKUpper
A Six Sigma index, SigmaCPKUpper calculates the one-sided Process
Capability Index based on the upper specification limit. This index assumes the
process output is normally distributed.
Cp,upper = (UpperSpecificationLimit − μ̂) / (3σ̂)
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
SigmaDefectPPM
A Six Sigma index, SigmaDefectPPM calculates the Defective Parts per Million.
DPMO = (δ⁻¹((LowerSpecificationLimit − μ̂) / σ̂) + 1 − δ⁻¹((UpperSpecificationLimit − μ̂) / σ̂)) × 1,000,000
where μ̂ is the process mean, σ̂ is the standard deviation of the process and δ⁻¹
is the standard normal inverse cumulative distribution function.
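A sketch of the defective-parts-per-million calculation, evaluating the two tail probabilities with the standard normal distribution (here via Python's `statistics.NormalDist`; the function name is illustrative):

```python
from statistics import NormalDist

def defect_ppm(lsl, usl, mu, sigma):
    # Probability mass below the lower limit plus the mass above the
    # upper limit, scaled to defective parts per million.
    nd = NormalDist()
    p_below = nd.cdf((lsl - mu) / sigma)
    p_above = 1 - nd.cdf((usl - mu) / sigma)
    return (p_below + p_above) * 1_000_000
```

For a centered process with limits three standard deviations from the mean, this gives roughly 2,700 defective parts per million.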
SigmaDefectShiftPPM
A Six Sigma index, SigmaDefectShiftPPM calculates the Defective Parts per
Million with an added shift.
DPMOShift = (δ⁻¹((LowerSpecificationLimit − μ̂) / σ̂ − Shift) + 1 − δ⁻¹((UpperSpecificationLimit − μ̂) / σ̂ − Shift)) × 1,000,000
where μ̂ is the process mean, σ̂ is the standard deviation of the process and δ⁻¹ is
the standard normal inverse cumulative distribution function.
SigmaDefectShiftPPMLower
A Six Sigma index, SigmaDefectShiftPPMLower calculates the Defective Parts
per Million, with a shift, below the lower specification limit.
DPMOshift,lower = δ⁻¹((LowerSpecificationLimit − μ̂) / σ̂ − Shift) × 1,000,000
where μ̂ is the process mean, σ̂ is the standard deviation of the process and δ⁻¹
is the standard normal inverse cumulative distribution function.
SigmaDefectShiftPPMUpper
A Six Sigma index, SigmaDefectShiftPPMUpper calculates the Defective Parts
per Million, with a shift, above the upper specification limit.
DPMOshift,upper = (1 − δ⁻¹((UpperSpecificationLimit − μ̂) / σ̂ − Shift)) × 1,000,000
where μ̂ is the process mean, σ̂ is the standard deviation of the process and δ⁻¹
is the standard normal inverse cumulative distribution function.
SigmaLowerBound
A Six Sigma index, SigmaLowerBound calculates the Lower Bound as a
specific number of standard deviations below the mean and is defined as:
\[ \hat{\mu} - \hat{\sigma} \times \#\mathit{StandardDeviations} \]
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
SigmaProbDefectShift
A Six Sigma index, SigmaProbDefectShift calculates the Probability of Defect,
with a shift, outside of the upper and lower limits. This statistic is defined as:
\[ \Phi\left(\frac{\text{LowerSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} - \mathit{Shift}\right) + 1 - \Phi\left(\frac{\text{UpperSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} - \mathit{Shift}\right) \]
where 𝜇̂ is the process mean, 𝜎̂ is the standard deviation of the process, and Φ is the standard normal cumulative distribution function.
SigmaProbDefectShiftLower
A Six Sigma index, SigmaProbDefectShiftLower calculates the Probability of
Defect, with a shift, outside of the lower limit. This statistic is defined as:
\[ \Phi\left(\frac{\text{LowerSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} - \mathit{Shift}\right) \]
where 𝜇̂ is the process mean, 𝜎̂ is the standard deviation of the process, and Φ is the standard normal cumulative distribution function.
SigmaProbDefectShiftUpper
A Six Sigma index, SigmaProbDefectShiftUpper calculates the Probability of
Defect, with a shift, outside of the upper limit. This statistic is defined as:
\[ 1 - \Phi\left(\frac{\text{UpperSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} - \mathit{Shift}\right) \]
where 𝜇̂ is the process mean, 𝜎̂ is the standard deviation of the process, and Φ is the standard normal cumulative distribution function.
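The three SigmaProbDefectShift statistics are simply the tail probabilities themselves; the corresponding PPM statistics are these values scaled by one million. A minimal Python sketch (illustrative, not Analytic Solver code):

```python
from statistics import NormalDist

_phi = NormalDist().cdf  # standard normal cumulative distribution function

def prob_defect_shift_lower(mean, stdev, lsl, shift):
    """Probability of falling below the lower specification limit."""
    return _phi((lsl - mean) / stdev - shift)

def prob_defect_shift_upper(mean, stdev, usl, shift):
    """Probability of falling above the upper specification limit."""
    return 1 - _phi((usl - mean) / stdev - shift)

def prob_defect_shift(mean, stdev, lsl, usl, shift):
    """Total probability of defect: sum of the two tails."""
    return (prob_defect_shift_lower(mean, stdev, lsl, shift)
            + prob_defect_shift_upper(mean, stdev, usl, shift))

p = prob_defect_shift(0.0, 1.0, -3.0, 3.0, 0.0)
print(round(p * 1_000_000))  # 2700 -- the DPMO statistics are these probabilities x 1e6
```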
SigmaSigmaLevel
A Six Sigma index, SigmaSigmaLevel calculates the Process Sigma Level with
a shift, where 𝜇̂ is the process mean, 𝜎̂ is the standard deviation of the
process, Φ is the standard normal cumulative distribution function, and Φ⁻¹ is
the standard normal inverse cumulative distribution function.
SigmaUpperBound
A Six Sigma index, SigmaUpperBound calculates the Upper Bound as a specific
number of standard deviations above the mean and is defined as:
\[ \hat{\mu} + \hat{\sigma} \times \#\mathit{StandardDeviations} \]
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
SigmaYield
A Six Sigma index, SigmaYield calculates the Six Sigma Yield with a shift, or
the fraction of the process that is free of defects. This statistic is defined as:
\[ \Phi\left(\frac{\text{UpperSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} - \mathit{Shift}\right) - \Phi\left(\frac{\text{LowerSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} - \mathit{Shift}\right) \]
where 𝜇̂ is the process mean, 𝜎̂ is the standard deviation of the process, and Φ is the standard normal cumulative distribution function.
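Yield is the probability mass between the two limits, so yield and the probability of defect sum to one. A Python sketch under the same illustrative assumptions as above:

```python
from statistics import NormalDist

_phi = NormalDist().cdf  # standard normal cumulative distribution function

def sigma_yield(mean, stdev, lsl, usl, shift):
    """Fraction of output inside the limits: Phi(z_upper - shift) - Phi(z_lower - shift)."""
    return (_phi((usl - mean) / stdev - shift)
            - _phi((lsl - mean) / stdev - shift))

# An unshifted 3-sigma process:
y = sigma_yield(0.0, 1.0, -3.0, 3.0, 0.0)
print(round(y, 4))  # 0.9973 -- complements the ~2700 DPMO defect rate
```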
SigmaZLower
A Six Sigma index, SigmaZLower calculates the number of standard deviations
of the process that the lower limit is below the mean of the process. This
statistic is defined as:
\[ \frac{\hat{\mu} - \text{LowerSpecificationLimit}}{\hat{\sigma}} \]
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
SigmaZMin
A Six Sigma index, SigmaZMin calculates the minimum of SigmaZLower and
SigmaZUpper. This statistic is defined as:
\[ \frac{\min(\hat{\mu} - \text{LowerSpecificationLimit},\; \text{UpperSpecificationLimit} - \hat{\mu})}{\hat{\sigma}} \]
SigmaZUpper
A Six Sigma index, SigmaZUpper calculates the number of standard deviations
of the process that the upper limit is above the mean of the process. This
statistic is defined as:
\[ \frac{\text{UpperSpecificationLimit} - \hat{\mu}}{\hat{\sigma}} \]
where 𝜇̂ is the process mean and 𝜎̂ is the standard deviation of the process.
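The three Z statistics express each limit's distance from the mean in process standard deviations, with SigmaZMin taking the nearer (worse) side. A short illustrative sketch (function names and numbers are my own, not from the guide):

```python
# Sketch of SigmaZLower, SigmaZUpper, and SigmaZMin (illustrative).

def sigma_z_lower(mean, stdev, lsl):
    """Standard deviations from the mean down to the lower limit."""
    return (mean - lsl) / stdev

def sigma_z_upper(mean, stdev, usl):
    """Standard deviations from the mean up to the upper limit."""
    return (usl - mean) / stdev

def sigma_z_min(mean, stdev, lsl, usl):
    """The smaller of the two distances -- the side most likely to produce defects."""
    return min(sigma_z_lower(mean, stdev, lsl),
               sigma_z_upper(mean, stdev, usl))

# An off-center process: mean 11 between limits 7 and 13.
print(sigma_z_lower(11.0, 1.0, 7.0))      # 4.0
print(sigma_z_upper(11.0, 1.0, 13.0))     # 2.0
print(sigma_z_min(11.0, 1.0, 7.0, 13.0))  # 2.0 -- the upper limit is the nearer one
```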