Using Driverless AI
Release 1.8.4.1
H2O.ai
3 Key Features
3.1 Flexibility of Data and Deployment
3.2 NVIDIA GPU Acceleration
3.3 Automatic Data Visualization (Autovis)
3.4 Automatic Feature Engineering
3.5 Automatic Model Documentation
3.6 Time Series Forecasting
3.7 NLP with TensorFlow
3.8 Automatic Scoring Pipelines
3.9 Machine Learning Interpretability (MLI)
3.10 Automatic Reason Codes
3.11 Bring Your Own Recipe (BYOR) Support
4 Supported Algorithms
4.1 Constant Model
4.2 Decision Tree
4.3 FTRL
4.4 GLM
4.5 Isolation Forest
4.6 LightGBM
4.7 RuleFit
4.8 TensorFlow
4.9 XGBoost
4.10 References
5 Driverless AI Workflow
6 Driverless AI Licenses
6.1 About Licenses
6.2 Adding Licenses for the First Time
6.3 Updating Licenses
7.2 To sudo or Not to sudo
7.3 Note about nvidia-docker 1.0
7.4 Deprecation of nvidia-smi
7.5 New nvidia-container-runtime-hook Requirement for PowerPC Users
7.6 Note About CUDA Versions
7.7 Note About Authentication
7.8 Note About Shared File Systems
7.9 Backup Strategy
7.10 Upgrade Strategy
16.4 Dataset Details
16.5 Downloading Datasets
16.6 Splitting Datasets
16.7 Visualizing Datasets
17 Experiments
17.1 Before You Begin
17.2 Experiment Settings
17.3 Expert Settings
17.4 Scorers
17.5 New Experiments
17.6 Completed Experiment
17.7 Experiment Graphs
17.8 Model Scores
17.9 Experiment Summary
17.10 Viewing Experiments
29.4 The Scoring Service
29.5 Python Scoring Pipeline FAQ
29.6 Troubleshooting Python Environment Issues
40.6 NLP Expert Settings
40.7 A Typical NLP Example: Sentiment Analysis
45 FAQ
45.1 General
45.2 Installation/Upgrade/Authentication
45.3 Data
45.4 Connectors
45.5 Recipes
45.6 Experiments
45.7 Feature Transformations
45.8 Predictions
45.9 Deployment
45.10 Time Series
45.11 Logging
47.2 Examples
49 References
H2O Driverless AI is an artificial intelligence (AI) platform for automatic machine learning. Driverless AI automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy, comparable to expert data scientists, but in much shorter time thanks to end-to-end automation. Driverless AI also offers automatic visualizations and machine learning interpretability (MLI). Especially in regulated industries, model transparency and explanation are just as important as predictive performance. Modeling pipelines (feature engineering and models) are exported (in full fidelity, without approximations) both as Python modules and as Java standalone scoring artifacts.
Driverless AI runs on commodity hardware. It was also specifically designed to take advantage of graphical processing units (GPUs), including multi-GPU workstations and servers such as IBM's Power9-GPU AC922 server and the NVIDIA DGX-1, for order-of-magnitude faster training.
This document describes how to install and use Driverless AI. For more information about Driverless AI, see https://www.h2o.ai/products/h2o-driverless-ai/.
For a third-party review, see https://www.infoworld.com/article/3236048/machine-learning/review-h2oai-automates-machine-learning.html.
Have Questions?
If you have questions about using Driverless AI, post them on Stack Overflow using the driverless-ai tag at http://stackoverflow.com/questions/tagged/driverless-ai.
You can also post questions on the H2O.ai Community Slack workspace in the #driverlessai channel. If you have not
signed up for the H2O.ai Community Slack workspace, you can do so here: https://www.h2o.ai/community/.
CHAPTER
ONE
RELEASE NOTES
H2O Driverless AI is a high-performance, GPU-enabled, client-server application for the rapid development and deployment of state-of-the-art predictive analytics models. It reads tabular data from various sources and automates data visualization, grand-master level automatic feature engineering, model validation (overfitting and leakage prevention), model parameter tuning, model interpretability, and model deployment. H2O Driverless AI currently targets common regression, binomial classification, and multinomial classification applications including loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, and predictive asset maintenance models. It also handles time-series problems for individual or grouped time-series, such as weekly sales predictions per store and department, with time-causal feature engineering and validation schemes. The ability to model unstructured data is coming soon.
High-level capabilities:
• Client/server application for rapid experimentation and deployment of state-of-the-art supervised machine learning models
• User-friendly GUI
• Python and R client API
• Automatically creates machine learning modeling pipelines for highest predictive accuracy
• Automates data cleaning, feature selection, feature engineering, model selection, model tuning, ensembling
• Automatically creates stand-alone batch scoring pipeline for in-process scoring or client/server scoring via
HTTP or TCP protocols in Python
• Automatically creates stand-alone (MOJO) low latency scoring pipeline for in-process scoring or client/server
scoring via HTTP or TCP protocols, in C++ (with R and Python runtimes) and Java (runs anywhere)
• Multi-GPU and multi-CPU support for powerful workstations and NVidia DGX supercomputers
• Machine Learning model interpretation module with global and local model interpretation
• Automatic Visualization module
• Multi-user support
• Backward compatibility
Problem types supported:
• Regression (continuous target variable like age, income, price or loss prediction, time-series forecasting)
• Binary classification (0/1 or “N”/”Y”, for fraud prediction, churn prediction, failure prediction, etc.)
• Multinomial classification (“negative”/”neutral”/”positive” or 0/1/2/3 or 0.5/1.0/2.0 for categorical target variables, for prediction of membership type, next-action, product recommendation, sentiment analysis, etc.)
Data types supported:
1.1 Architecture
1.2 Roadmap
Available here
• Add option for dynamic port allocation
• Documentation for AWS community AMI
• Various bug fixes (MLI UI)
Available here
• New Features:
– Added ‘Scores’ tab in experiment page to show detailed tuning tables and scores for models and folds
– Added Constant Model (constant predictions) and use it as reference model by default
– Show score of global constant predictions in experiment summary as reference
– Added support for setting up mutual TLS for Driverless AI
– Added option to use client/personal certificate as an authentication method
• Documentation Updates:
– Added sections for enabling mTLS and Client Certificate authentication
– Constant Model is now included in the list of Supported Algorithms
– Added a section describing the Model Scores page
– Improved the C++ Scoring Pipeline documentation describing the process for importing datatable
– Improved documentation for the Java Scoring Pipeline
• Bug fixes:
– Fix refitting of final pipeline when new features are added
– Various bug fixes
Available here
• Added option to upload experiment artifacts to a configured disk location
• Various bug fixes (correct feature engineering from time column, migration for brain restart)
Available here
• New Features:
– Decision Tree model
– Automatically enabled for accuracy <= 7 and interpretability >= 7
– Supports all problem types: regression/binary/multiclass
– Using LightGBM GPU/CPU backend with MOJO
– Visualization of tree splits and leaf node decisions as part of pipeline visualization
– Per-Column Imputation Scheme (experimental)
– Select one of [const, mean, median, min, max, quantile] imputation scheme at start of experiment
– Select method of calculation of imputation value: either on entire dataset or inside each pipeline’s training
data split
– Disabled by default and must be enabled at startup time to be effective
– Show MOJO size and scoring latency (for C++/R/Python runtime) in experiment summary
– Automatically prune low weight base models in final ensemble (based on interpretability setting) to reduce
final model complexity
– Automatically convert non-raw github URLs for custom recipes to raw source code URLs
• Improvements:
– Speed up feature evolution for time-series and low-accuracy experiments
– Improved accuracy of feature evolution algorithm
– Feature transformer interpretability, total count, and importance accounted for in genetic algorithm’s model
and feature selection
– Binary confusion matrix in ROC curve of experiment page is made consistent with Diagnostics (flipped
positions of TP/TN)
– Only include custom recipes in Python scoring pipeline if the experiment uses any custom recipes
– Additional documentation (new OpenID config options, JDBC data connector syntax)
– Improved AutoReport’s transformer descriptions
– Improved progress reporting during Autoreport creation
– Improved speed of automatic interaction search for imbalanced multiclass problems
– Improved accuracy of single final model for GLM and FTRL
– Allow config_overrides to be a list/vector of parameters for R client API
– Disable early stopping for Random Forest models by default, and expose new ‘rf_early_stopping’ mode
(optional)
– Create identical example data (again, as in 1.8.0 and before) for all scoring pipelines
– Upgraded versions of datatable and Java
– Installed graphviz in Docker image, now get .png file of pipeline visualization in MOJO package and Autoreport. Note: For RPM/DEB/TAR SH installs, user can install graphviz to get this optional functionality
• Documentation Updates:
– Added a simple example for modifying a dataset by recipe using live code
– Added a section describing how to impute datasets (experimental)
– Added Decision Trees to list of supported algorithms
– Fixed examples for enabling JDBC connectors
– Added information describing how to use a JDBC driver that is not tested in house
– Updated the Missing Values Handling topic to include sections for “Clustering in Transformers” and “Isolation Forest Anomaly Score Transformer”
– Improved the “Fold Column” description
• Bug Fixes:
– Fix various reasons why final model score was too far off from best feature evolution score
– Delete temporary files created during test set scoring
– Fixed target transformer tuning (was potentially mixing up target transformers between feature evolution
and final model)
– Fixed tensorflow_nlp_have_gpus_in_production=true mode
– Fixed partial dependence plots for missing datetime values and no longer show them for text columns
– Fixed time-series GUI for quarterly data
– Feature transformer exploration limited to no more than 1000 new features (Small data on 10/10/1 would
try too many features)
– Fixed Kaggle pipeline building recipe to try more input features than 8
– Fixed cursor placement in live code editor for custom data recipe
– Show correct number of cross-validation splits in pipeline visualization if have more than 10 splits
– Fixed parsing of datetime in MOJO for some datetime formats without ‘%d’ (day)
Available here
• Bugfix for time series experiments with quarterly data when launched from GUI
Available here
• New Features:
– Full set of scoring metrics and corresponding downloadable holdout predictions for experiments with
single final models (time-series or i.i.d)
– MLI Updates:
* BYOR (bring your own recipe) in Python: pandas, numpy, datatable, third-party libraries for fast
prototyping of connectors and data preprocessing inside DAI
* data connectors, cleaning, filtering, aggregation, augmentation, feature engineering, splits, etc.
* can create one or multiple datasets from scratch or from existing datasets
* interactive code editor with live preview
* example code at https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.1/data
– Visualization of final scoring pipeline (Experimental)
* In-GUI display of graph of feature engineering, modeling and ensembling steps of entire machine
learning pipeline
* Addition to Autodoc
– Time-Series:
* Ability to specify which features will be unavailable at test time for time-series experiments
* Custom user-provided train/validation splits (by start/end datetime for each split) for time-series experiments
* Back-testing metrics for time-series experiments (regression and classification, with and without lags)
based on rolling windows (configurable number of windows)
– MOJO:
* PyTorch MOJO (C++/Py/R) for custom recipes based on BERT/DistilBERT NLP models (available
upon request)
• Improvements:
– Accuracy:
* Automatic pairwise interaction search (+,-,*,/) for numeric features (“magic feature” finder)
* Improved accuracy for time series experiments with low interpretability
* Improved leakage detection logic
* Improved genetic algorithm heuristics for feature evolution (more exploration)
– Time-Series Recipes:
* Faster feature evolution part for non-time-series experiments with single final model
* Faster binary imbalanced models for very high class imbalance by limiting internal number of resampling bags
Available here
• Improve speed and memory usage for feature engineering
• Improve speed of leakage and shift detection, and improve accuracy
• Improve speed of AutoVis under high system load
• Improve speed for experiments with large user-given validation data
• Improve accuracy of ensembles for regression problems
• Improve creation of Autoreport (only one background job per experiment)
• Improve sampling techniques for ImbalancedXGBoost and ImbalancedLightGBM models, and disable them by default since they can be slower
• Add Python/R/C++ MOJO support for FTRL and RandomForest
• Add native categorical handling for LightGBM in CPU mode
• Add monotonicity constraints support for LightGBM
• Add Isolation Forest Anomaly Score transformer (outlier detection)
• Re-enable One-Hot-Encoding for GLM models
• Add lexicographical label encoding (disabled by default)
• Add ability to further train user-provided pretrained embeddings for TensorFlow NLP transformers, in addition
to fine-tuning the rest of the neural network graph
• Add timeout for BYOR acceptance tests
• Add log and notifications for large shifts in final model variable importances compared to tuning model
• Add more expert control over time series feature engineering
• Add ability for recipes to be uploaded in bulk as entire (or part of) github repository or as links to python files
on page
Available here
• Added two new models with internal sampling techniques for imbalanced binary classification problems: ImbalancedXGBoost and ImbalancedLightGBM
• Added support for rolling-window based predictions for time-series experiments (2 options: test-time augmentation or re-fit)
• Added support for setting logical column types for a dataset (to override type detection during experiments)
• Added ability to set experiment name at start of experiment
• Added leakage detection for time-series problems
• Added JDBC connector
• MOJO updates:
– Added Python/R/C++ MOJO support for TensorFlow model
– Added Python/R/C++ MOJO support for TensorFlow NLP transformers: TextCNN, CharCNN, BiGRU,
including any pretrained embeddings if provided
– Reduced memory usage for MOJO creation
– Increased speed of MOJO creation
– Configuration options for MOJO and Python scoring pipelines now have 3-way toggle: “on”/”off”/”auto”
• MLI updates:
– Added disparate impact analysis (DIA) for MLI
– Allow MLI scoring pipeline to be built for datasets with column names that need to be sanitized
– Date-aware binning for partial dependence and ICE in MLI
• Improved generalization performance for time-series modeling with regularization techniques for lag-based features
• Improved “predicted vs actual” plots for regression problems (using adaptive point sizes)
• Fix bug in datatable for manipulations of string columns larger than 2GB
• Fixed download of predictions on user-provided validation data
• Fix bug in time-series test-time augmentation (work-around was to include entire training data in test set)
• Honor the expert settings flag to enable detailed traces (disable again by default)
• Various bug fixes
Available here
• ML Core updates:
– Speed up schema detection
– DAI now drops rows with missing values when diagnosing regression problems
– Speed up column type detection
– Fixed growth of individuals
– Fixed n_jobs for predict
– Target column is no longer included in predictors for skewed datasets
– Added an option to prevent users from downloading data files locally
– Improved UI split functionality
– A new “max_listing_items” config option to limit the number of items fetched in listing pages
• Model Ops updates:
– MOJO runtime upgraded to version 2.1.3 which supports perpetual MOJO pipeline
– Upgraded deployment templates to version matching MOJO runtime version
• MLI updates:
– Fix to MLI schema builder
– Fix parsing of categorical reason codes
– Added ability to handle integer time column
• Various bug fixes
Available here
• Support for Bring Your Own Recipe (BYOR) for transformers, models (algorithms) and scorers
• Added protobuf-based MOJO scoring runtime libraries for Python, R and Java (standalone, low-latency)
• Added local REST server as one-click deployment option for MOJO scoring pipeline, in addition to AWS
Lambda endpoint
• Added R client package, in addition to Python client
• Added Project workspace to group datasets and experiments and to visually compare experiments and create
leaderboards
• Added download of imported datasets as .csv
• Recommendations for columnar transformations in AutoViz
• Improved scalability and performance
• Ability to provide max. runtime for experiments
• Create MOJO scoring pipeline by default if the experiment configuration allows (for convenience, enables local/cloud deployment options without user input)
• Support for user provided pre-trained embeddings for TensorFlow NLP models
• Support for holdout splits lacking some target classes (can happen when a fold column is provided)
• MLI updates:
– Added residual plot for regression problems (keeping all outliers intact)
– Added confusion matrix as default metric display for multinomial problems
– Added Partial Dependence (PD) and Individual Conditional Expectation (ICE) plots for Driverless AI models in MLI GUI
– Added ability to search by ID column in MLI GUI
– Added ability to run MLI PD/ICE on all features
– Added ability to handle multiple observations for a single time column in MLI TS by taking the mean of
the target and prediction where applicable
– Added ability to handle integer time column in MLI TS
– MLI TS will use train holdout predictions if there is no test set provided
• Faster import of files with “%Y%m%d” and “%Y%m%d%H%M” time format strings, and files with lots of text
strings
• Fix units for RMSPE scorer to be a percentage (multiply by 100)
• Allow non-positive outcomes for MAPE and SMAPE scorers
• Improved listing in GUI
• Allow zooming in GUI
• Upgrade to TensorFlow 1.13.1 and CUDA 10 (and CUDA is part of the distribution now, to simplify installation)
• Add CPU-support for TensorFlow on PPC
• Documentation updates:
– Added documentation for new features including
* Projects
* Custom Recipes
* C++ MOJO Scoring Pipelines
* R Client API
* REST Server Deployment
– Added information about variable importance values on the experiments page
– Updated documentation for Expert Settings
– Updated “Tips n Tricks” with new Scoring Pipeline tips
• Various bug fixes
Available here
• Included an Audit log feature
• Fixed support for decimal types for parquet files in MOJO
• Autodoc can order PDP/ICE by feature importance
• Session Management updates
• Upgraded datatable
• Improved reproducibility
• Model diagnostics now uses a weight column
• MLI can build surrogate models on all the original features or on all the transformed features that DAI uses
• Internal server cache now respects usernames
• Fixed an issue with time series settings
• Fixed an out of memory error when loading a MOJO
• Fixed Python scoring package for TensorFlow
• Added OpenID configurations
• Documentation updates:
– Updated the list of artifacts available in the Experiment Summary
– Clarified language in the documentation for unsupported (but available) features
– For the Terraform requirement in deployments, clarified that only Terraform versions in the 0.11.x release
are supported, and specifically 0.11.10 or greater
– Fixed link to the Miniconda installation instructions
• Various bug fixes
Available here
• This version provides PPC64le artifacts
• Improved stability of datatable
• Improved path filtering in the file browser
• Fixed units for RMSPE scorer to be a percentage (multiply by 100)
• Fixed segmentation fault on Ubuntu 18 with installed font package
• Fixed IBM Spectrum Conductor authentication
• Fixed handling of EC2 machine credentials
• Fixed Lag transformer configuration
• Fixed KDB and Snowflake Error Reporting
• Gradually reduce number of used workers for column statistics computation in case of failure.
• Hide default Tornado header exposing used version of Tornado
• Documentation updates:
– Added instructions for installing via AWS Marketplace
– Improved documentation for installing via Google Cloud
– Improved FAQ documentation
– Added Data Sampling documentation topic
• Various bug fixes
Available here
• Fix in AWS role handling.
Available here
• Several fixes for MLI (partial dependence plots, Shapley values)
• Improved documentation for model deployment, time-series scoring, AutoVis and FAQs
Available here
• Speed up calculation of column statistics for date/datetime columns using certain formats (now uses
‘max_rows_col_stats’ parameter)
• Added computation of standard deviation for variable importances in experiment summary files
• Added computation of shift of variable importances between feature evolution and final pipeline
• Fix link to MLI Time-Series experiment
• Fix display bug for iteration scores for long experiments
• Fix display bug for early finish of experiment for GLM models
• Fix display bug for k-LIME when target is skewed
• Fix display bug for forecast horizon in MLI for Time-Series
• Fix MLI for Time-Series for single time group column
• Fix in-server scoring of time-series experiments created in 1.5.0 and 1.5.1
Available here
• Added support for splitting datasets by time via time column containing date, datetime or integer values
• Added option to disable file upload
• Require authentication to download experiment artifacts
• Automatically drop predictor columns from training frame if not found in validation or test frame and warn
• Improved performance by using physical CPU cores only (configurable in config.toml)
• Added option to not show inactive data connectors
• Various bug fixes
Available here
• Added word-level bidirectional GRU TensorFlow models for NLP features
• Added character-level CNN TensorFlow models for NLP features
• Added support to import multiple individual datasets at once
• Added support for holdout predictions for time-series experiments
• Added support for regression and multinomial classification for FTRL (in addition to binomial classification)
• Improved scoring for time-series when test data contains actual target values (missing target values will be
predicted)
• Reduced memory usage for LightGBM models
• Improved performance for feature engineering
• Improved speed for TensorFlow models
• Improved MLI GUI for time-series problems
• Fix final model fold splits when fold_column is provided
• Various bug fixes
Available here
• Fix MOJO for GLM
• Add back .csv file of experiment summary
• Improve collection of pipeline timing artifacts
• Clean up Docker tag
Available here
• Added model diagnostics (interactive model metrics on new test data incl. residual analysis for regression)
• Added FTRL model (Follow The Regularized Leader)
• Added Kolmogorov-Smirnov metric (degree of separation between positives and negatives)
• Added ability to retrain (only) the final model on new data
• Added one-hot encoding for low-cardinality categorical features, for GLM
• Added choice between 32-bit (now default) and 64-bit precision
• Added system information (CPU, GPU, disk, memory, experiments)
• Added support for time-series data with many more time gaps, and with weekday-only data
• Added one-click deployment to Amazon Lambda
• Added ability to split datasets randomly, with option to stratify by target column or group by fold column
• Added support for OpenID authentication
• Added connector for BlueData
• Improved responsiveness of the GUI under heavy load situations
• Improved speed and reduce memory footprint of feature engineering
• Improved performance for RuleFit models and enable GPU and multinomial support
• Improved auto-detection of temporal frequency for time-series problems
• Improved accuracy of final single model if external validation provided
• Improved final pipeline if external validation data is provided (add ensembling)
• Improved k-LIME in MLI by using original features deemed important by DAI instead of all original features
• Improved MLI by using 3-fold CV by default for all surrogate models
• Improved GUI for MLI time series (integrated help, better integration)
• Added ability to view MLI time series logs while MLI time series experiment is running
• PDF version of the Automatic Report (AutoDoc) is now replaced by a Word version
• Various bug fixes (GLM accuracy, UI slowness, MLI UI, AutoVis)
Available here
• Support for IBM Power architecture
• Speed up training and reduce size of final pipeline
• Reduced resource utilization during training of final pipeline
• Display test set metrics (ROC, ROCPR, Gains, Lift) in GUI in addition to validation metrics (if test set provided)
• Show location of best threshold for Accuracy, MCC and F1 in ROC curves
• Add relative point sizing for scatter plots in AutoVis
• Fix file upload and add model checkpointing in python client API
• Various bug fixes
Available here
• Improved integration of MLI for time-series
• Reduced disk and memory usage during final ensemble
• Allow scoring and transformations on previously imported datasets
• Enable checkpoint restart for unfinished models
• Add startup checks for OpenCL platforms for LightGBM on GPUs
• Improved feature importances for ensembles
• Faster dataset statistics for date/datetime columns
• Faster MOJO batch scoring
• Fix potential hangs
• Fix ‘not in list’ error in MOJO
• Fix NullPointerException in MLI
• Fix outlier detection in AutoVis
• Various bug fixes
Available here
• Enable LightGBM by default (now with MOJO)
• LightGBM tuned for GBM decision trees, Random Forest (rf), and Dropouts meet Multiple Additive Regression
Trees (dart)
• Add ‘isHoliday’ feature for time columns
• Add ‘time’ column type for date/datetime columns in data preview
• Add support for binary datatable file ingest in .jay format
• Improved final ensemble (each model has its own feature pipeline)
Available here
• Fix ‘Broken pipe’ failures for TensorFlow models
• Fix time-series problems with categorical features and interpretability >= 8
• Various bug fixes
Available here
• Added LightGBM models - now have [XGBoost, LightGBM, GLM, TensorFlow, RuleFit]
• Added TensorFlow NLP recipe based on CNN Deeplearning models (sentiment analysis, document classification, etc.)
• Added MOJO for GLM
• Added detailed confusion matrix statistics
• Added more expert settings
• Improved data exploration (columnar statistics and row-based data preview)
• Improved speed of feature evolution stage
• Improved speed of GLM
• Report single-pass score on external validation and test data (instead of bootstrap mean)
• Reduced memory overhead for data processing
• Reduced number of open files - fixes ‘Bad file descriptor’ error on Mac/Docker
• Simplified Python client API
• Query any data point in the MLI UI from the original dataset due to “on-demand” reason code generation
• Enhanced k-means clustering in k-LIME by only using a subset of features. See the k-LIME documentation for more information.
• Report k-means centers for k-LIME in MLI summary for better cluster interpretation
• Improved MLI experiment listing details
• Various bug fixes
Available here
• MOJO Java scoring pipeline for time-series problems
• Multi-class confusion matrices
• AUCMACRO Scorer: Multi-class AUC via macro-averaging (in addition to the default micro-averaging)
• Expert settings (configuration override) for each experiment from GUI and client APIs.
• Support for HTTPS
• Improved downsampling logic for time-series problems (if enabled through accuracy knob settings)
• LDAP readonly access to Active Directory
• Snowflake data connector
• Various bug fixes
• Added LIME-SUP (alpha) to MLI as alternative to k-LIME (local regions are defined by decision tree instead
of k-means)
• Added RuleFit model (alpha), now have [GBM, GLM, TensorFlow, RuleFit] - TensorFlow and RuleFit are
disabled by default
• Added Minio (private cloud storage) connector
• Added support for importing folders from S3
• Added ‘Upload File’ option to ‘Add Dataset’ (in addition to drag & drop)
• Predictions for binary classification problems now have 2 columns (probabilities per class), for consistency with
multi-class
• Improved model parameter tuning
• Improved feature engineering for time-series problems
• Improved speed of MOJO generation and loading
• Improved speed of time-series related automatic calculations in the GUI
• Time-Series recipe
• Low-latency standalone MOJO Java scoring pipelines (now beta)
• Enable Elastic Net Generalized Linear Modeling (GLM) with lambda search (and GPU support), for interpretability>=6 and accuracy<=5 by default (alpha)
• Enable TensorFlow (TF) Deep Learning models (with GPU support) for interpretability=1 and/or multi-class
models (alpha, enable via config.toml)
• Support for pre-tuning of [GBM, GLM, TF] models for picking best feature evolution model parameters
• Support for final ensemble consisting of mix of [GBM, GLM, TF] models
• Automatic Report (AutoDoc) in PDF and Markdown format as part of summary zip file
• Interactive tour (assistant) for first-time users
• MLI now runs on experiments from previous releases
• Surrogate models in MLI now use 3 folds by default
• Improved small data recipe with up to 10 cross-validation folds
• Improved accuracy for binary classification with imbalanced data
• Additional time-series transformers for interactions and aggregations between lags and lagging of non-target columns
• Faster creation of MOJOs
• Progress report during data ingest
• Normalize binarized multi-class confusion matrices by class count (global scaling factor)
• Improved parsing of boolean environment variables for configuration
• Various bug fixes
• Speed up MOJO pipeline creation and disable MOJO by default (still alpha)
• Improved memory management on GPUs
• Support for optional 32-bit floating-point precision for reduced memory footprint
• Added logging of test set scoring and data transformations
• Various bug fixes
• If MOJO fails to build, no MOJO will be available, but experiment can still succeed
• MOJO scoring pipeline for Java standalone cross-platform low-latency scoring (alpha)
• Various bug fixes
• Fix test set scoring bug for data with an ID column (introduced in 1.0.23)
• Allow renaming of MLI experiments
• Ability to limit maximum number of cores used for datatable
• Print validation scores and error bars across final ensemble model CV folds in logs
• Various UI improvements
• Various bug fixes
• Support for Gains and Lift curves for binomial and multinomial classification
• Support for multi-GPU single-model training for large datasets
• Improved recipes for large datasets (faster and less memory/disk usage)
• Improved recipes for text features
• Increased sensitivity of interpretability setting for feature engineering complexity
• Disable automatic time column detection by default to avoid confusion
• Automatic column type conversion for test and validation data, and during scoring
• Improved speed of MLI
• Improved feature importances for MLI on transformed features
• Added ability to download each MLI plot as a PNG file
• Added support for dropped columns and weight column to MLI stand-alone page
• Fix serialization of bytes objects larger than 4 GiB
• Fix failure to build scoring pipeline with ‘command not found’ error
• Various UI improvements
• Various bug fixes
• ROC curve display for binomial and multinomial classification, with confusion matrices and threshold/F1/MCC
display
• Training/Validation/Test data shift detectors
• Added AUCPR scorer for multinomial classification
• Improved handling of imbalanced binary classification problems
• Configuration file for runtime limits such as cores/memory/harddrive (for admins)
• Various GUI improvements (ability to rename experiments, re-run experiments, logs)
• Various bug fixes
• Fix hang during final ensemble (accuracy >= 5) for larger datasets
• Allow scoring of all models built in older versions (>= 1.0.13) in GUI
• More detailed progress messages in the GUI during experiments
• Fix scoring pipeline to only use relative paths
• Error bars in model summary are now +/- 1*stddev (instead of 2*stddev)
• Added RMSPE scorer (RMS Percentage Error)
• Added SMAPE scorer (Symmetric Mean Abs. Percentage Error)
• Added AUCPR scorer (Area under Precision-Recall Curve)
• Gracefully handle inf/-inf in data
• Various UI improvements
• Various bug fixes
• Fix migration from version 1.0.15 and earlier (partial, for experiments only)
• Added model summary download from GUI
• Restructured and renamed logs archive, and add model summary to it
• Fix regression in AutoVis in 1.0.16 that led to slowdown
• Various bug fixes
• Added support for validation dataset (optional, instead of internal validation on training data)
• Standard deviation estimates for model scores (+/- 1 std.dev.)
• Computation of all applicable scores for final models (in logs only for now)
• Standard deviation estimates for MLI reason codes (+/- 1 std.dev.) when running in stand-alone mode
• Added ability to abort MLI job
• Improved final ensemble performance
• Improved outlier visualization
• Updated H2O-3 to version 3.16.0.4
• More readable experiment names
• Various speedups
• Various bug fixes
• Improved performance
• Improved estimate of generalization performance for final ensemble by removing leakage from target encoding
• Added API for re-fitting and applying feature engineering on new (potentially larger) data
• Remove access to pre-transformed datasets to avoid unintended leakage issues downstream
• Added mean absolute percentage error (MAPE) scorer
• Enforce monotonicity constraints for binary classification and regression models if interpretability >= 6
• Use squared Pearson correlation for R^2 metric (instead of coefficient of determination) to avoid negative values
• Separated HTTP and TCP scoring pipeline examples
• Reduced size of h2oai_client wheel
• No longer require weight column for test data if it was provided for training data
• Improved accuracy of final modeling pipeline
• Include H2O-3 logs in downloadable logs.zip
• Updated H2O-3 to version 3.16.0.2
• Various bug fixes
• Support for time column for causal train/validation splits in time-series datasets
• Automatic detection of the time column from temporal correlations in data
• MLI improvements, dedicated page, selection of datasets and models
• Improved final ensemble meta-learner
• Test set score now displayed in experiment listing
• Original response is preserved in exported datasets
• Various bug fixes
• Sharing of GPUs between experiments - can run multiple experiments at the same time while sharing GPU
resources
• Persistence of experiments and data - can stop and restart the application without loss of data
• Support for weight column for optional user-specified per-row observation weights
• Support for fold column for user-specified grouping of rows in train/validation splits
• Higher accuracy through model tuning
• Faster training - overall improvements and optimization in model training speed
• Separate log file for each experiment
• Ability to delete experiments and datasets from the GUI
• Improved accuracy for regression tasks with very large response values
• Faster test set scoring - Significant improvements in test set scoring in the GUI
• Various bug fixes
• Various speedups
• Results are now reproducible
• Various bug fixes
TWO
Over the last several years, machine learning has become an integral part of many organizations’ decision-making
processes at various levels. With not enough data scientists to fill the increasing demand for data-driven business
processes, H2O.ai offers Driverless AI, which automates several time consuming aspects of a typical data science
workflow, including data visualization, feature engineering, predictive modeling, and model explanation.
H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid
deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources and from a
variety of external data sources, and it automates data visualization and the construction of predictive models.
Driverless AI also includes robust Machine Learning Interpretability (MLI), which incorporates a number of contemporary approaches to increase the transparency and accountability of complex models by providing model results in a
human-readable format.
Driverless AI targets business applications such as loss-given-default, probability of default, customer churn, campaign
response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or
in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)
Visit https://www.h2o.ai/driverless-ai/ to download your free 21-day evaluation copy.
How do you frame business problems in a data set for Driverless AI?
The data that is read into Driverless AI must contain one entity per row, like a customer, patient, piece of equipment,
or financial transaction. That row must also contain information about what you will be trying to predict using similar
data in the future, like whether that customer in the row of data used a promotion, whether that patient was readmitted
to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether
that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI
runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon
described by the provided dataset.
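For example, a training table with one customer per row and a labeled target column might look like the toy frame below (the column names are invented for this illustration):

import pandas as pd

# One entity (customer) per row, plus a labeled target column ("used_promotion")
# describing the outcome we want to predict on similar data in the future.
train = pd.DataFrame({
    "customer_id":    [101, 102, 103, 104],
    "age":            [34, 52, 29, 41],
    "num_purchases":  [3, 12, 1, 7],
    "region":         ["east", "west", "east", "south"],
    "used_promotion": [1, 0, 0, 1],
})
print(train)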
How do you use Driverless AI results to create commercial value?
Commercial value is generated by Driverless AI in a few ways.
• Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using
automation and state-of-the-art computing power to accomplish tasks in just minutes or hours instead of the
weeks or months that it can take humans.
• Like in many other industries, automation leads to standardization of business processes, enforces best practices,
and eventually drives down the cost of delivering the final product – in this case a predictive model.
• Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process.
In large organizations, value from predictive modeling is typically realized when a predictive model is moved
from a data analyst’s or data scientist’s development environment into a production deployment setting. In this
setting, the model is running on live data and making quick and automatic decisions that make or save money.
Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.
Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a
Driverless AI model can be explained to business users, so the system is viable even for regulated industries.
THREE
KEY FEATURES
Driverless AI works across a variety of data sources including Hadoop HDFS, Amazon S3, and more. Driverless AI
can be deployed everywhere including all clouds (Microsoft Azure, AWS, Google Cloud) and on premises on any
system, but it is ideally suited for systems with GPUs, including IBM Power 9 with GPUs built in.
Driverless AI is optimized to take advantage of GPU acceleration to achieve up to 40X speedups for automatic machine
learning. It includes multi-GPU algorithms for XGBoost, GLM, K-Means, and more. GPUs allow for thousands of
iterations of model features and optimizations.
For datasets, Driverless AI automatically selects and generates the data plots that are most relevant from a statistical perspective, based on the most relevant data statistics. These visualizations help users get a quick understanding of their data prior to starting the model building process. They are also useful for understanding the composition of very large datasets and for seeing trends or even possible issues, such as large numbers of missing values or significant outliers that could impact modeling results. See Visualizing Datasets for more information.
Feature engineering is the secret weapon that advanced data scientists use to extract the most accurate results from
algorithms. H2O Driverless AI employs a library of algorithms and feature transformations to automatically engineer
new, high value features for a given dataset. (See Driverless AI Transformations for more information.) Included in
the interface is an easy-to-read variable importance chart that shows the significance of original and newly engineered
features.
To explain models to business users and regulators, data scientists and data engineers must document the data, algorithms, and processes used to create machine learning models. Driverless AI provides an Autoreport (Autodoc) for each experiment, relieving the user of the time-consuming task of documenting and summarizing the workflow used when building machine learning models. The Autoreport includes details about the data used, the validation schema selected, model and feature tuning, and the final model created. With this capability in Driverless AI, practitioners can focus more on drawing actionable insights from the models and save weeks or even months in the development, validation, and deployment process.
Driverless AI also provides a number of autodoc_ configuration options, giving users even more control over the output of the Autoreport. (Refer to the Sample Config.toml File topic for information about these configuration options.)
Click here to download and view a sample experiment report in Word format.
Time series forecasting is one of the biggest challenges for data scientists. These models address key use cases,
including demand forecasting, infrastructure monitoring, and predictive maintenance. Driverless AI delivers superior
time series capabilities to optimize for almost any prediction time window. Driverless AI incorporates data from
numerous predictors, handles structured character data and high-cardinality categorical variables, and handles gaps in
time series data and other missing values. See Time Series in Driverless AI for more information.
Text data can contain critical information to inform better predictions. Driverless AI automatically converts short text
strings into features using powerful techniques like TFIDF. With TensorFlow, Driverless AI can also process larger
text blocks and build models using all available data to solve business problems like sentiment analysis, document
classification, and content tagging. See NLP in Driverless AI for more information.
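To give a feel for what TFIDF featurization does, the sketch below applies scikit-learn's TfidfVectorizer to a few short strings. It only illustrates the general idea and is not Driverless AI's internal NLP pipeline (it assumes a recent scikit-learn release):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great product, fast shipping",
    "terrible support, product broke",
    "fast support and great service",
]
# Each short text string becomes a sparse numeric feature vector that
# downstream models (or TensorFlow networks) can consume.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)
print(features.shape)                      # (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())  # learned vocabulary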
For completed experiments, Driverless AI automatically generates both Python scoring pipelines and new ultra-low
latency automatic scoring pipelines. The new automatic scoring pipeline is a unique technology that deploys all feature
engineering and the winning machine learning model in a highly optimized, low-latency, production-ready Java code
that can be deployed anywhere. See Scoring Pipelines Overview for more information.
Driverless AI provides robust interpretability of machine learning models to explain modeling results in a human-
readable format. In the MLI view, Driverless AI employs a host of different techniques and methodologies for interpreting and explaining the results of its models. A number of charts are generated automatically (depending on
experiment type), including K-LIME, Shapley, Variable Importance, Decision Tree Surrogate, Partial Dependence,
Individual Conditional Expectation, Sensitivity Analysis, NLP Tokens, NLP LOCO, and more. Additionally, you can
download a CSV of LIME and Shapley reasons codes from this view. See MLI Overview for more information.
In regulated industries, an explanation is often required for significant decisions relating to customers (for example,
credit denial). Reason codes show the key positive and negative factors in a model's scoring decision in simple language. Reason codes are also useful in other industries, such as healthcare, because they can provide insights into model decisions that can drive additional testing or investigation.
Driverless AI allows you to import custom recipes (BYOR) for MLI algorithms, feature engineering (transformers),
scorers, data, and configuration. You can use your custom recipes in combination with or instead of all built-in recipes.
This allows you to have greater influence over the Driverless AI Automatic ML pipeline and gives you control over
the optimization choices that Driverless AI makes. See Appendix A: Custom Recipes for more information.
FOUR
SUPPORTED ALGORITHMS
A Constant Model predicts the same constant value for any input data. The constant value is computed by optimizing
the given scorer. For example, for MSE/RMSE, the constant is the (weighted) mean of the target column. For MAE, it
is the (weighted) median. For other scorers like MAPE or custom scorers, the constant is found with an optimization
process. For classification problems, the constant probabilities are the observed priors.
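A minimal numeric sketch of how such a constant can be found for two common scorers (an illustration only; Driverless AI's own optimization also covers observation weights, other scorers, and classification priors):

import numpy as np

def constant_prediction(y, scorer, weights=None):
    """Return the single best constant value for the given scorer (sketch)."""
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    if scorer in ("mse", "rmse"):
        return np.average(y, weights=w)   # weighted mean minimizes squared error
    if scorer == "mae":
        order = np.argsort(y)             # weighted median minimizes absolute error
        cum = np.cumsum(w[order])
        return y[order][np.searchsorted(cum, cum[-1] / 2.0)]
    raise NotImplementedError("other scorers need a numeric optimization step")

y = [1.0, 2.0, 2.0, 10.0]
print(constant_prediction(y, "rmse"))  # 3.75 (mean)
print(constant_prediction(y, "mae"))   # 2.0 (median)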
A constant model is meant as a baseline reference model. If it ends up being used in the final pipeline, a warning will
be issued because that would indicate a problem in the dataset or target column (e.g., when trying to predict a random
outcome).
A Decision Tree is a single (binary) tree model that splits the training data population into sub-groups (leaf nodes) with
similar outcomes. No row or column sampling is performed, and the tree depth and method of growth (depth-wise or
loss-guided) is controlled by hyper-parameters.
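For intuition, a single tree can be sketched with scikit-learn as below; note that Driverless AI builds its Decision Tree model on a LightGBM backend, so this is purely illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
# One binary tree with its depth capped by a hyper-parameter and no row/column sampling.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # human-readable splits and leaf-node decisions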
4.3 FTRL
Follow the Regularized Leader (FTRL) is a DataTable implementation [1] of the FTRL-Proximal online learning
algorithm proposed in [4]. This implementation uses a hashing trick and Hogwild approach [3] for parallelization.
FTRL supports binomial and multinomial classification for categorical targets, as well as regression for continuous
targets.
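A rough sketch using the open-source datatable package's Ftrl model (assuming a recent datatable release; parameter names and defaults may differ from what Driverless AI uses internally):

import datatable as dt
from datatable.models import Ftrl

# Tiny binary-classification example; FTRL learns online, one row at a time.
train = dt.Frame(x1=[0, 1, 0, 1, 1, 0], x2=[1, 1, 0, 0, 1, 0])
target = dt.Frame(y=[1, 1, 0, 0, 1, 0])

model = Ftrl(nepochs=5)
model.fit(train, target)
print(model.predict(train))  # predicted probabilities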
4.4 GLM
Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions.
GLMs are an extension of traditional linear models. They have gained popularity in statistical data analysis due
to:
• the flexibility of the model structure unifying the typical regression methods (such as linear regression and
logistic regression for binary classification)
• the recent availability of model-fitting software
• the ability to scale well with large datasets
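As a simple illustration of the GLM family (using scikit-learn rather than Driverless AI's internal implementation, which uses elastic net with lambda search), a Poisson regression for count data looks like this:

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
# Counts drawn from an exponential-family (Poisson) distribution with a log link.
y = rng.poisson(lam=np.exp(0.3 * X[:, 0] - 0.5 * X[:, 1] + 0.1))

glm = PoissonRegressor(alpha=1e-3).fit(X, y)
print("estimated coefficients:", glm.coef_)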
Isolation Forest is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of
that selected feature. This split depends on how long it takes to separate the points. Random partitioning produces
noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for
particular samples, they are highly likely to be anomalies.
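An illustrative outlier-detection sketch with scikit-learn's IsolationForest follows; Driverless AI surfaces this idea as the Isolation Forest Anomaly Score transformer, and the code below is not its internal implementation:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=-8, high=8, size=(10, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = iso.decision_function(X)  # lower score => shorter average path => more anomalous
print("most anomalous rows:", np.argsort(scores)[:10])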
4.6 LightGBM
LightGBM is a gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. It was specifically designed for lower memory usage, faster training speed, and higher efficiency. Similar to XGBoost,
it is one of the best gradient boosting implementations available. It is also used for fitting Random Forest, DART
(experimental), and Decision Tree models inside of Driverless AI.
Note: LightGBM with GPUs is not currently supported on Power.
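A minimal LightGBM sketch using its scikit-learn API is shown below; the boosting_type flag switches between the GBM ("gbdt"), DART ("dart"), and Random Forest ("rf") modes mentioned above. This is illustrative only and not Driverless AI's tuned configuration:

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# boosting_type can be "gbdt" (default GBM) or "dart"; "rf" additionally
# requires bagging_fraction < 1 and bagging_freq > 0.
model = lgb.LGBMClassifier(boosting_type="gbdt", n_estimators=200, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))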
4.7 RuleFit
The RuleFit [2] algorithm creates an optimal set of decision rules by first fitting a tree model, and then fitting a Lasso
(L1-regularized) GLM model to create a linear model consisting of the most important tree leaves (rules).
Note: MOJOs are not currently available for RuleFit models.
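The two-stage idea can be sketched roughly as follows (a simplified illustration of the RuleFit approach, not Driverless AI's implementation): fit a small tree ensemble, treat each leaf as a rule indicator, then fit an L1-regularized linear model on those indicators.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Stage 1: fit a tree ensemble; each leaf defines a candidate rule.
gbm = GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

# Stage 2: encode each row's leaf membership as binary rule features.
leaves = gbm.apply(X).reshape(len(X), -1)
rules = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)

# Stage 3: a Lasso keeps only the most useful rules in the final linear model.
lasso = Lasso(alpha=0.05, max_iter=10000).fit(rules.toarray(), y)
print("rules kept:", int(np.sum(lasso.coef_ != 0)), "of", rules.shape[1])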
4.8 TensorFlow
TensorFlow is an open source software library for performing high performance numerical computation. Driverless
AI includes a TensorFlow NLP recipe based on CNN Deeplearning models.
Note: MOJOs are not currently available for TensorFlow models.
4.9 XGBoost
XGBoost is a supervised learning algorithm that implements a process called boosting to yield accurate models. Boosting refers to the ensemble learning technique of building many models sequentially, with each new model attempting
to correct for the deficiencies in the previous model. In tree boosting, each new model that is added to the ensemble is
a decision tree. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science
problems in a fast and accurate way. For many problems, XGBoost is one of the best gradient boosting machine
(GBM) frameworks today. Driverless AI supports XGBoost GBM and XGBoost DART (experimental) models.
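A short XGBoost sketch via its scikit-learn wrapper is shown below; the booster parameter selects between the standard boosted trees ("gbtree") and the DART ("dart") variant noted above. It is illustrative only, not the configuration Driverless AI chooses:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# booster="gbtree" is the usual XGBoost GBM; booster="dart" enables the DART variant.
model = XGBClassifier(booster="gbtree", n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))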
4.10 References
FIVE
DRIVERLESS AI WORKFLOW
SIX
DRIVERLESS AI LICENSES
A valid license is required for running Driverless AI and for running the scoring pipelines.
Driverless AI is licensed per single named user. Therefore, in order to have different users run experiments simultaneously, they would each need a license. Driverless AI manages the GPU(s) that it is given and ensures that different experiments from different users can run safely simultaneously and don't interfere with each other. So when two licensed users log in with different credentials, neither of them will see the other's experiment. Similarly, if a licensed user logs in using a different set of credentials, that user will not see any previously run experiments.
A license file to run Driverless AI can be added in one of three ways when starting Driverless AI.
• Specifying the license.sig file during launch in native installs
• Using the DRIVERLESS_AI_LICENSE_FILE and DRIVERLESS_AI_LICENSE_KEY environment variables
when starting the Driverless AI Docker image
• Uploading your license in the Web UI
By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are installing
Driverless AI programmatically, you can copy a license key file to that location. If no license key is found, the
application will prompt you to add one via the Web UI.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_LICENSE_FILE="/license/license.sig" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
or
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_LICENSE_KEY="Y0uRl1cens3KeyH3re" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
If Driverless AI does not locate a license.sig file during launch, then the UI will prompt you to enter your license key
after you log in the first time.
Click the Enter License button, and then paste the entire license into the provided License Key entry field. Click
Save when you are done. Upon successful completion, you will be able to begin using Driverless AI.
You can also export the license file when running the scoring pipeline:
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh
Driverless AI requires a license to be specified in order to run the MOJO Scoring Pipeline. The license can be specified
in one of the following ways:
• Via an environment variable:
– DRIVERLESS_AI_LICENSE_FILE: Path to the Driverless AI license file, or
– DRIVERLESS_AI_LICENSE_KEY: The Driverless AI license key (Base64 encoded string)
• Via a system property of JVM (-D option):
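As a sketch, the JVM system property form might look like the following; the property name and the ExecuteMojo invocation are assumptions, so check the MOJO Scoring Pipeline documentation for the exact names:
# Assumed property name; pass the license file location when invoking the MOJO runtime
java -Dai.h2o.mojos.runtime.license.file=/path/to/license.sig \
-cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv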
Driverless AI deployment pipelines to AWS Lambdas automatically set the license key as an environment variable
based on the license key that was used in Driverless AI.
If your current Driverless AI license has expired, you must update it in order to continue running Driverless AI, run the scoring pipelines, access pipelines deployed to AWS Lambdas, and so on.
Similar to adding a license for the first time, you can update your license for running Driverless AI either by replacing
your current license.sig file or via the Web UI.
Update the license key in your /opt/h2oai/dai/home/.driverlessai/license.sig file by replacing the existing license with
your new one.
If your license is expired, the Web UI will prompt you to enter a new one. The steps are the same as adding a license
for the first time via the Driverless AI Web UI.
For the Python Scoring Pipeline, simply include the updated license file when setting the environment variable in
Python. Refer to the above Python Scoring Pipeline section for adding licenses.
For the MOJO Scoring Pipeline, the updated license file can be specified using an environment variable, using a system
property of JVM, or via an application classpath. This is the same as adding a license for the first time. Refer to the
above MOJO Scoring Pipeline section for adding licenses.
The Driverless AI deployment pipeline to AWS Lambdas explicitly sets the license key as an environment variable.
Replace the expired license key with your updated one.
SEVEN
Please review the following information before you begin installing Driverless AI. Be sure to also review the Sizing
Requirements in the next section before beginning the installation.
Driverless AI is tested most extensively on Chrome and Firefox. For the best user experience, we recommend using
the latest version of Chrome. You may encounter issues if you use other browsers or earlier versions of Chrome and/or
Firefox.
Many of the installation steps show sudo prepended to various commands. Note that sudo may not always be required, but the steps documented here are the steps that we followed in-house.
If you have nvidia-docker 1.0 installed, you need to remove it and all existing GPU containers. Refer to https://github.
com/NVIDIA/nvidia-docker/blob/master/README.md for more information.
PowerPC users are now required to install the nvidia-container-runtime-hook when running in Docker.
Refer to https://github.com/nvidia/nvidia-docker#rhel-docker for more information. The IBM Docker installation
steps have been updated to reflect this information.
Your host environment must have CUDA 10.0 or later with NVIDIA drivers >= 410 installed (GPU only). Driverless
AI ships with its own CUDA libraries, but the driver must exist in the host environment. Go to https://www.nvidia.
com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series driver.
The default authentication setting in Driverless AI is “unvalidated.” In this case, Driverless AI will accept any login
and password combination, it will not validate whether the password is correct for the specified login ID, and it will
connect to the system as the user specified in the login ID. This is true for all instances, including Cloud, Docker, and
native instances.
We recommend that you configure authentication. Driverless AI provides a number of authentication options, includ-
ing LDAP, PAM, Local, and None. Refer to Configuring Authentication for information on how to enable a different
authentication method.
Note: Driverless AI is also integrated with IBM Spectrum Conductor and supports authentication from Conductor.
Contact [email protected] for more information about using IBM Spectrum Conductor authentication.
If your environment uses a shared file system, then you must set the following configuration option:
datatable_strategy='write'
The above can be specified in the config.toml file (for native installs) or specified as an environment variable (Docker
image installs).
This configuration is required because, in some cases, Driverless AI can fail to read files during an experiment. The
write option will allow Driverless AI to properly read and write data from shared file systems to disk.
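As a sketch for Docker image installs, assuming the usual convention that config.toml options map to environment variables with a DRIVERLESS_AI_ prefix:
# Assumed environment-variable form of datatable_strategy for Docker image installs
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_DATATABLE_STRATEGY="write" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG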
We recommend that you periodically stop Driverless AI and back up your Driverless AI tmp directory, even if you are
not upgrading.
EIGHT
For the best (and intended-as-designed) experience, install Driverless AI on modern data center hardware with GPUs
and CUDA support. Use Pascal or Volta GPUs with maximum GPU memory for best results. (Note that the older K80
and M60 GPUs available in EC2 are supported and very convenient, but not as fast.)
Driverless AI supports local, LDAP, and PAM authentication. Authentication can be configured by setting environment
variables or via a config.toml file. Refer to the Configuring Authentication section for more information. Note that the
default authentication method is “unvalidated.”
Driverless AI also supports HDFS, S3, Google Cloud Storage, Google Big Query, KDB, Minio, and Snowflake access.
Support for these data sources can be configured by setting environment variables for the data connectors or via a
config.toml file. Refer to the Enabling Data Connectors section for more information.
Driverless AI requires a minimum of 5 GB of system memory in order to start experiments and a minimum of 5
GB of disk space in order to run a small experiment. Note that these limits can be changed in the config.toml file. We
recommend that you have lots of system CPU memory (64 GB or more) and 1 TB of free disk space available.
For Docker installs, we recommend 1 TB of free disk space. Driverless AI uses approximately 38 GB. In addition,
the unpacking/temp files require space on the same Linux mount /var during installation. Once DAI runs, the mounts
from the Docker container can point to other file system mount points.
If you are running Driverless AI with GPUs, be sure that your GPU has compute capability >=3.5 and at least 4GB of
RAM. If these requirements are not met, then Driverless AI will switch to CPU-only mode.
We recommend that your tmp directory has at least 500 GB to 1 TB of space. The tmp directory holds all experiments
and all datasets. We also recommend that you use SSDs (preferably NVMe).
If you are running Driverless AI on a Linux machine, we recommend setting the overcommit memory to 0. The setting
can be changed by the following command:
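A minimal sketch of the standard Linux command for this; vm.overcommit_memory is a kernel parameter, not a Driverless AI option:
# Set the kernel overcommit policy to 0 (heuristic overcommit, the default)
sudo sh -c 'echo 0 > /proc/sys/vm/overcommit_memory'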
This is the default value, and it indicates that the Linux kernel is free to overcommit memory. If this value is set to
2, then the Linux kernel will not overcommit memory. In this case, the memory requirements of Driverless AI may
surpass the memory allocation limit, which would prevent the experiment from completing.
This section provides installation steps for Linux x86_64 environments. This includes information for Docker image installs, RPM, DEB, and TAR SH installs, as well as cloud installations.
To simplify local installation, Driverless AI is provided as a Docker image for the following system combinations:
Note: CUDA 10.0 or later with NVIDIA drivers >= 410 is required (GPU only).
For the best performance, including GPU support, use nvidia-docker. For a lower-performance experience without
GPUs, use regular docker (with the same docker image).
These installation steps assume that you have a license key for Driverless AI. For information on how to obtain a
license key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained, you will be prompted to paste the
license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license
folder that you will create during the installation process.
Install on Ubuntu
This section describes how to install the Driverless AI Docker image on Ubuntu. The installation steps vary depending
on whether your system has GPUs or if it is CPU only.
Environment
3. Install nvidia-docker2 (if not already installed). More information is available at https://github.com/NVIDIA/
nvidia-docker/blob/master/README.md.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
4. Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.
com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver:
nvidia-smi
5. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
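For example, following the directory naming used in the RHEL install section of this guide:
# Set up directory with the version name
mkdir dai_rel_VERSION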
6. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.
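A sketch of this step; the archive name follows the dai-docker-centos7-x86_64-VERSION.tar.gz pattern used elsewhere in this guide, so adjust it to match your downloaded file:
# cd into the new directory and load the Driverless AI Docker image
cd dai_rel_VERSION
docker load < dai-docker-centos7-x86_64-VERSION.tar.gz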
7. Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
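For example (this is the same command shown later in this guide for the RHEL and native installs):
sudo nvidia-persistenced --persistence-mode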
8. Set up the data, log, license, and tmp directories on the host machine (within the new directory):
# Set up the data, log, license, and tmp directories on the host machine (within the new directory)
mkdir data
mkdir log
mkdir license
mkdir tmp
9. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
10. Run docker images to find the image tag.
11. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:
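A sketch of the start command, modeled on the docker run examples shown elsewhere in this guide; adjust the mounted directories and port to your environment:
# Start the Driverless AI Docker image
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG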
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
http://Your-Driverless-AI-Host-Machine:12345
3. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
4. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.
5. Set up the data, log, license, and tmp directories on the host machine (within the new directory):
6. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
7. Run docker images to find the new image tag.
8. Start the Driverless AI Docker image and replace TAG below with the image tag. Note that GPU support will
not be available.
To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.
This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:
3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load Driverless AI
version. If necessary, replace VERSION with your image.
5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:
# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp
At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
Install on RHEL
This section describes how to install the Driverless AI Docker image on RHEL. The installation steps vary depending
on whether your system has GPUs or if it is CPU only.
Environment
Note: Refer to the following links for more information about using RHEL with GPUs. These links describe how to
disable automatic updates and specific package updates. This is necessary in order to prevent a mismatch between the
NVIDIA driver and the kernel, which can lead to GPU failures.
• https://access.redhat.com/solutions/2372971
• https://www.rootusers.com/how-to-disable-specific-package-updates-in-rhel-centos/
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Note: As of this writing, Driverless AI has only been tested on RHEL version 7.4.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.
2. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.
com/engine/installation/linux/docker-ee/rhel/.
Alternatively, you can run on Docker CE.
3. Install nvidia-docker2 (if not already installed). More information is available at https://github.com/NVIDIA/
nvidia-docker/blob/master/README.md.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Note: If you would like the nvidia-docker service to automatically start when the server is rebooted, then run the following command. If you do not run this command, you will have to remember to start the nvidia-docker service manually; otherwise, the GPUs will not appear as available.
sudo systemctl enable nvidia-docker
Alternatively, if you have installed Docker CE above, you can install nvidia-docker with:
curl -s -L https://nvidia.github.io/nvidia-docker/centos7/x86_64/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
4. Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.
com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver.
nvidia-docker run --rm nvidia/cuda nvidia-smi
5. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
# Set up directory with the version name
mkdir dai_rel_VERSION
6. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.
# cd into the new directory
cd dai_rel_VERSION
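# Load the Driverless AI Docker image (the archive name below is an assumption; use your downloaded file name)
docker load < dai-docker-centos7-x86_64-VERSION.tar.gz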
7. Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
sudo nvidia-persistenced --persistence-mode
8. Set up the data, log, license, and tmp directories on the host machine (within the new directory):
# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp
9. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
10. Run docker images to find the image tag.
11. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:
# Start the Driverless AI Docker image
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
This section describes how to install and start the Driverless AI Docker image on RHEL. Note that this uses Docker
EE and not NVIDIA Docker. GPU support will not be available.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Note: As of this writing, Driverless AI has only been tested on RHEL version 7.4.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.
com/engine/installation/linux/docker-ee/rhel/.
Alternatively, you can run on Docker CE.
2. On the machine that is running Docker EE, retrieve the Driverless AI Docker image from https://www.h2o.ai/
download/.
3. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
4. Load the Driverless AI Docker image inside the new directory. The following example shows how to load
Driverless AI version. Replace VERSION with your image.
5. Set up the data, log, license, and tmp directories (within the new directory):
# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp
6. Copy data into the data directory on the host. The data will be visible inside the Docker container at /<user-
home>/data.
7. Run docker images to find the image tag.
8. Start the Driverless AI Docker image and replace TAG below with the image tag. Note that GPU support will
not be available.
$ docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-x86_64:TAG
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.
This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:
3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load Driverless AI
version. If necessary, replace VERSION with your image.
5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:
# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp
At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
Driverless AI is supported on the following NVIDIA DGX products, and the installation steps for each platform are
the same.
• NVIDIA GPU Cloud
• NVIDIA DGX-1
• NVIDIA DGX-2
• NVIDIA DGX Station
Environment
Note: These installation instructions assume that you are running on an NVIDIA DGX machine. Driverless AI is only
available in the NGC registry for DGX machines.
1. Log in to your NVIDIA GPU Cloud account at https://ngc.nvidia.com/registry. (Note that NVIDIA Compute is
no longer supported by NVIDIA.)
2. In the Registry > Partners menu, select h2oai-driverless.
3. At the bottom of the screen, select one of the H2O Driverless AI tags to retrieve the pull command.
4. On your NVIDIA DGX machine, open a command prompt and use the specified pull command to retrieve the
Driverless AI image. For example:
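A sketch of the pull command; the exact tag comes from the registry page in step 3, so latest here is only a placeholder:
docker pull nvcr.io/h2oai/h2oai-driverless-ai:latest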
5. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
6. Set up the data, log, license, and tmp directories on the host machine:
# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp
7. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
8. Enable persistence of the GPU. Note that this only needs to be run once. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
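For example (this is the same persistence command used in the other install sections of this guide):
sudo nvidia-persistenced --persistence-mode
Then start the Driverless AI Docker image, replacing TAG below with the image tag: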
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
nvcr.io/h2oai/h2oai-driverless-ai:TAG
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
http://Your-Driverless-AI-Host-Machine:12345
Stopping Driverless AI
Upgrading Driverless AI
The steps for upgrading Driverless AI on an NVIDIA DGX system are similar to the installation steps.
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Note: Use Ctrl+C to stop Driverless AI if it is still running.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
1. On your NVIDIA DGX machine, create a directory for the new Driverless AI version.
2. Copy the data, log, license, and tmp directories from the previous Driverless AI directory into the new Driverless
AI directory.
3. Run docker pull nvcr.io/h2oai/h2oai-driverless-ai:latest to retrieve the latest Driverless AI version.
4. Start the Driverless AI Docker image.
5. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
For Linux machines that will not use the Docker image or DEB, an RPM installation is available for the following
environments:
• x86_64 RHEL 7, CentOS 7, or SLES 12
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the
license folder that you will create during the installation process.
Environment
Requirements
• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
– /opt/h2oai/dai/tmp: Experiments and imported data are stored here
– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, then use the standard journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are
installing Driverless AI programmatically, you can copy a license key file to that location. If no license key is
found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.
Installing Driverless AI
Run the following commands to install the Driverless AI RPM. Replace VERSION with your specific version.
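For example (the file name pattern matches the optional command shown below):
# Install Driverless AI.
sudo rpm -i dai-VERSION.rpm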
By default, the Driverless AI processes are owned by the ‘dai’ user and ‘dai’ group. You can optionally specify a different service user and group as shown below. Replace <myuser> and <mygroup> as appropriate.
# Temporarily specify service user and group when installing Driverless AI.
# rpm saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files.
sudo DAI_USER=myuser DAI_GROUP=mygroup rpm -i dai-VERSION.rpm
Starting Driverless AI
If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
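For example (this is the same persistence command shown in the DEB install section):
sudo nvidia-persistenced --persistence-mode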
Install OpenCL
OpenCL is required in order to run LightGBM on GPUs. Run the following for CentOS 7/RHEL 7-based systems using yum and x86_64.
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm
Stopping Driverless AI
Upgrading Driverless AI
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
Uninstalling Driverless AI
# Uninstall.
sudo rpm -e dai
CAUTION! At this point you can optionally completely remove all remaining files, including the database. (This
cannot be undone.)
For Linux machines that will not use the Docker image or RPM, a DEB installation is available for x86_64 Ubuntu
16.04/18.04.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the
license folder that you will create during the installation process.
Environment
Requirements
• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
– /opt/h2oai/dai/tmp: Experiments and imported data are stored here
– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, then use the standard journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are
installing Driverless AI programmatically, you can copy a license key file to that location. If no license key is
found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.
If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
sudo nvidia-persistenced --persistence-mode
Install OpenCL
OpenCL is required in order to run LightGBM on GPUs. Run the following for Ubuntu-based systems.
sudo apt-get install opencl-headers clinfo ocl-icd-opencl-dev
Run the following commands to install the Driverless AI DEB. Replace VERSION with your specific version.
# Install Driverless AI.
sudo dpkg -i dai_VERSION.deb
By default, the Driverless AI processes are owned by the ‘dai’ user and ‘dai’ group. You can optionally specify a
different service user and group as shown below. Replace <myuser> and <mygroup> as appropriate.
# Temporarily specify service user and group when installing Driverless AI.
# dpkg saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files.
sudo DAI_USER=myuser DAI_GROUP=mygroup dpkg -i dai_VERSION.deb
Starting Driverless AI
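A minimal sketch, assuming you are using the systemd services installed by the package; the dai unit name refers to the wrapper service listed above:
# Start Driverless AI via the systemd wrapper service
sudo systemctl start dai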
Stopping Driverless AI
Upgrading Driverless AI
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
Uninstalling Driverless AI
CAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot
be undone):
Common Problems
Driverless AI fails to start with the message "Segmentation fault (core dumped)" on Ubuntu 18.
This problem is caused by the font NotoColorEmoji.ttf, which cannot be processed by the Python matplotlib
library. A workaround is to disable the font by renaming it. (Do not use fontconfig because it is ignored by matplotlib.)
The following will print out the command that should be executed.
sudo find / -name "NotoColorEmoji.ttf" 2>/dev/null | xargs -I{} echo sudo mv {} {}.backup
The Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive.
This form of installation does not require a privileged user to install or to run.
This artifact has the same compatibility matrix as the RPM and DEB packages (combined); it just comes packaged slightly differently. See those sections for a full list of supported environments.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in.
Requirements
Installing Driverless AI
Run the following commands to install the Driverless AI TAR SH. Replace VERSION with your specific version.
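A sketch of running the self-extracting archive; the file name dai-VERSION.sh is an assumption, so substitute the name of the file you downloaded:
# Make the archive executable and run it as an unprivileged user
chmod +x dai-VERSION.sh
./dai-VERSION.sh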
You may now cd to the unpacked directory and optionally make changes to config.toml.
Starting Driverless AI
If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
Install OpenCL
OpenCL is required in order to run LightGBM on GPUs. Run the following for CentOS 7/RHEL 7-based systems using yum and x86_64.
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm
less log/dai.log
less log/h2o.log
less log/procsy.log
less log/vis-server.log
Stopping Driverless AI
Uninstalling Driverless AI
To uninstall Driverless AI, just remove the directory created by the unpacking process. By default, all files for Driver-
less AI are contained within this directory.
Upgrading Driverless AI
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
To simplify cloud installation, Driverless AI is provided as an AMI for the following cloud platforms:
• AWS AMI
• Azure Image
• Google Cloud
The installation steps for AWS, Azure, and Google Cloud assume that you have a license key for Driverless AI. For
information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained,
you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig
file and place it in the license folder that you will create during the installation process.
Install on AWS
Driverless AI can be installed on Amazon AWS using the AWS Marketplace AMI or the AWS Community AMI.
Consider the following when choosing between the AWS Marketplace and AWS Community AMIs:
• Docker based
• Not certified by AWS
• Will typically have an up-to-date version of Driverless AI for both LTS and latest stable releases
• Base Driverless AI installation on Docker does not feature preset configurations
A Driverless AI AMI is available in the AWS Marketplace beginning with Driverless AI version 1.5.2. This section
describes how to install and run Driverless AI through the AWS Marketplace.
Environment
Installation Procedure
4. Scroll down to review/edit your region and the selected infrastructure and pricing.
8. Review the configuration and choose a method for launching Driverless AI. Be sure to also review the Usage
Instructions. This button provides you with the login and password for launching Driverless AI. Scroll
down to the bottom of the page and click Launch when you are done.
You will receive a “Success” message when the image launches successfully.
Starting Driverless AI
This section describes how to start Driverless AI after the Marketplace AMI has been successfully launched.
1. Navigate to the EC2 Console.
2. Select your instance.
3. Open another browser and launch Driverless AI by navigating to https://<public IP of the instance>:12345.
4. Sign in to Driverless AI with the username and password provided in the Usage Instructions. You will be
prompted to enter your Driverless AI license key the first time that you log in.
The EC2 instance will continue to run even when you close the aws.amazon.com portal. To stop the instance:
1. On the EC2 Dashboard, click the Running Instances link under the Resources section.
2. Select the instance that you want to stop.
3. In the Actions drop down menu, select Instance State > Stop.
4. A confirmation page will display. Click Yes, Stop to stop the instance.
Note that the first offering of the Driverless AI Marketplace image was 1.5.2. As such, it is only possible to upgrade
to versions greater than that.
Perform the following steps if you are upgrading to a Driverless AI Marketplace image version greater than 1.5.2. Replace dai_NEWVERSION.deb below with the new Driverless AI version (for example, dai_1.5.4_amd64.deb). Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and
/etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP environment variables
during an upgrade.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Environment
3. Select the EC2 option under the Compute section to open the EC2 Dashboard.
4. Click the Launch Instance button under the Create Instance section.
5. Under Community AMIs, search for h2oai, and then select the version that you want to launch.
6. On the Choose an Instance Type page, select GPU compute in the Filter by dropdown. This will ensure that
your Driverless AI instance will run on GPUs. Select a GPU compute instance from the available options. (We
recommend at least 32 vCPUs.) Click the Next: Configure Instance Details button.
7. Specify the Instance Details that you want to configure. Create a VPC or use an existing one, and ensure that
“Auto-Assign Public IP” is enabled and associated to your subnet. Click Next: Add Storage.
8. Specify the Storage Device settings. Note again that Driverless AI requires 10 GB to run and will stop working
if less than 10 GB is available. The machine should have a minimum of 30 GB of disk space. Click Next: Add
Tags.
9. If desired, add a unique Tag name to identify your instance. Click Next: Configure Security Group.
10. Add the following security rules to enable SSH access to Driverless AI, then click Review and Launch.
13. Upon successful completion, a message will display informing you that your instance is launching. Click the
View Instances button to see information about the instance including the IP address. The Connect button on
this page provides information on how to SSH into your instance.
14. Open a Terminal window and SSH into the IP address of the AWS instance. Replace the DNS name below with
your instance DNS.
ssh -i "mykeypair.pem" [email protected]
15. If you selected a GPU-compute instance, then you must enable persistence and optimizations of the GPU. The
commands vary depending on the instance type. Note also that these commands need to be run once every
reboot. Refer to the following for more information:
• http://docs.nvidia.com/deploy/driver-persistence/index.html
• https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/optimize_gpu.html
• https://www.migenius.com/articles/realityserver-on-aws
# g3:
sudo nvidia-persistenced --persistence-mode
sudo nvidia-smi -acp 0
sudo nvidia-smi --auto-boost-permission=0
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac "2505,1177"
# p2:
sudo nvidia-persistenced --persistence-mode
sudo nvidia-smi -acp 0
sudo nvidia-smi --auto-boost-permission=0
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac "2505,875"
# p3:
sudo nvidia-persistenced --persistence-mode
16. At this point, you can copy data into the data directory on the host machine using scp. (Note that the data folder
already exists.) For example:
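A sketch, reusing the key pair and host name from the SSH step above; the dataset name and destination folder are examples only:
# Copy a local dataset into the data folder on the EC2 instance
scp -i "mykeypair.pem" train.csv [email protected]:data/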
http://Your-Driverless-AI-Host-Machine:12345
The EC2 instance will continue to run even when you close the aws.amazon.com portal. To stop the instance:
1. On the EC2 Dashboard, click the Running Instances link under the Resources section.
2. Select the instance that you want to stop.
3. In the Actions drop down menu, select Instance State > Stop.
4. A confirmation page will display. Click Yes, Stop to stop the instance.
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
The following example shows how to upgrade from 1.2.2 or earlier to the current version. Upgrading from these earlier
versions requires an edit to the start and h2oai scripts.
1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:
2. wget the newer image. The command below retrieves version 1.2.2:
wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.2.2-6/x86_64-centos7/dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz
3. In the /home/ubuntu/scripts/ folder, edit both the start.sh and h2oai.sh scripts to use the newer image.
4. Use the docker load command to load the image:
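For example, using the archive retrieved in step 2:
docker load < dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz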
5. Optionally run docker images to ensure that the new image is in the registry.
6. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
2. wget the newer image. Replace VERSION and BUILD below with the Driverless AI version.
wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/VERSION-BUILD/x86_64-centos7/dai-docker-centos7-x86_64-VERSION.tar.gz
4. In the new AMI, locate the DAI_RELEASE file, and edit that file to match the new image tag.
5. Stop and then start Driverless AI.
h2oai stop
h2oai start
When installing via AWS, you can also enable role-based authentication.
In Driverless AI, it is possible to enable role-based authentication via the IAM role. This is a two-step process that
involves setting up AWS IAM and then starting Driverless AI by specifying the role in the config.toml file or by setting
the AWS_USE_EC2_ROLE_CREDENTIALS environment variable to True.
1. Create an IAM role. This IAM role should have a Trust Relationship with Principal Trust Entity set to your
Account ID. For example, the trust relationship for Account ID 524466471676 would look like:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::524466471676:root"
},
"Action": "sts:AssumeRole"
}
]
}
Driverless AI Setup
Update the aws_use_ec2_role_credentials config variable in the config.toml file or start Driverless AI using
the AWS_USE_EC2_ROLE_CREDENTIALS environment variable.
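A minimal sketch of the two options; the exact value format expected by config.toml is an assumption, so verify it against your configuration reference:
# Option 1: set the environment variable before starting Driverless AI
export AWS_USE_EC2_ROLE_CREDENTIALS=true
# Option 2 (native installs): enable the option in /etc/dai/config.toml
# aws_use_ec2_role_credentials = true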
Resources
Install on Azure
This section describes how to install the Driverless AI image from Azure.
Note: Prior versions of the Driverless AI installation and upgrade on Azure were done via Docker. This is no longer
the case as of version 1.5.2.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Environment
• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
– /opt/h2oai/dai/tmp: Experiments and imported data are stored here
– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, then use the standard journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are
installing Driverless AI programmatically, you can copy a license key file to that location. If no license key is
found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
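For example, to stage a license key programmatically, something like the following can be used (the source path is a placeholder; adjust the owner to your actual service user and group):

# Copy an existing license key into the default location
sudo mkdir -p /opt/h2oai/dai/home/.driverlessai
sudo cp /path/to/license.sig /opt/h2oai/dai/home/.driverlessai/license.sig
sudo chown -R dai:dai /opt/h2oai/dai/home/.driverlessai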
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other three services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.
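For systemd-managed installs, the wrapper service can typically be controlled as sketched below (the unit names follow the list above; journalctl is used for logs as noted earlier):

# Start, stop, and inspect the Driverless AI wrapper service
sudo systemctl start dai
sudo systemctl stop dai
sudo systemctl status dai

# Follow the logs of the main Driverless AI process
sudo journalctl -u dai-dai -f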
1. Log in to your Azure portal at https://portal.azure.com, and click the Create a Resource button.
2. Search for and select H2O DriverlessAI in the Marketplace.
3. Click Create. This launches the H2O DriverlessAI Virtual Machine creation process.
5. On the Size tab, select your virtual machine size. Specify the HDD disk type and select a configuration. We recommend using an N-Series type, which comes with a GPU. Also note that Driverless AI requires 10 GB of free space in order to run and will stop working if less than 10 GB is available. We recommend a minimum of 30 GB of disk space. Click OK when you are done.
6. On the Settings tab, select or create the Virtual Network and Subnet where the VM is going to be located and
then click OK.
7. The Summary tab performs a validation on the specified settings and will report back any errors. When the
validation passes successfully, click Create to create the VM.
8. After the VM is created, it will be available under the list of Virtual Machines. Select this Driverless AI VM to
view the IP address of your newly created machine.
9. Connect to Driverless AI with your browser using the IP address retrieved in the previous step.
http://Your-Driverless-AI-Host-Machine:12345
The Azure instance will continue to run even when you close the Azure portal. To stop the instance:
1. Click the Virtual Machines left menu item.
2. Select the checkbox beside your DriverlessAI virtual machine.
3. On the right side of the row, click the ... button, then select Stop. (Note that you can then restart this by selecting Start.)
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
It is not possible to upgrade from version 1.2.2 or earlier to the latest version. You have to manually remove the 1.2.2 container and then reinstall the latest Driverless AI version. Be sure to back up your data before doing this.
1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:
2. wget the newer image. Replace VERSION and BUILD below with the Driverless AI version.
wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/VERSION-BUILD/x86_64-centos7/dai-docker-centos7-x86_64-VERSION.tar.gz
Upgrading to versions 1.5.2 and later is no longer done via Docker. Instead, perform the following steps if you are
upgrading to version 1.5.2 or later. Replace dai_NEWVERSION.deb below with the new Driverless AI version
(for example, dai_1.6.1_amd64.deb). Note that this upgrade process inherits the service user and group from
/etc/dai/User.conf and /etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP
environment variables during an upgrade.
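A minimal sketch of that flow, assuming a systemd-managed install and that dai_NEWVERSION.deb has already been downloaded to the machine:

# Stop Driverless AI and back up the tmp directory (backup path is an example)
sudo systemctl stop dai
sudo cp -a /opt/h2oai/dai/tmp /opt/h2oai/dai/tmp.backup

# Install the newer package; the existing service user and group settings are inherited
sudo dpkg -i dai_NEWVERSION.deb

# Start Driverless AI again
sudo systemctl start dai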
Note about upgrading to 1.7.x: As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have
CUDA 10.0 or later with NVIDIA drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA
libraries, but the driver must exist in the host environment. Go to https://www.nvidia.com/Download/index.aspx to get
the latest NVIDIA Tesla V/P/K series driver.
This section describes how to install and start Driverless AI in a Google Compute environment using the GCP Marketplace. This assumes that you already have a Google Cloud Platform account. If you don't have an account, go to https://console.cloud.google.com/getting-started to create one.
If you are trying GCP for the first time and have just created an account, please check your Google Compute Engine (GCE) resource quota limits. By default, GCP allocates a maximum of 8 CPUs and no GPUs. Our default recommendation for launching Driverless AI is 32 CPUs, 120 GB RAM, and 2 P100 NVIDIA GPUs. You can change these settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/compute/quotas for more information, including information on how to check your quota and request additional quota.
Installation Procedure
3. On the Marketplace page, search for Driverless and select the H2O.ai Driverless AI offering. The following
page will display.
4. Click Launch on Compute Engine. (If necessary, refer to Google Compute Instance Types for information
about machine and GPU types.)
• Select a zone that has p100s or k80s (such as us-east1-)
• Optionally change the number of cores and amount of memory. (This defaults to 32 CPUs and 120
GB RAM.)
5. A summary page displays when the compute engine is successfully deployed. This page includes the instance
ID and the username (always h2oai) and password that will be required when starting Driverless AI. Click on
the Instance link to retrieve the external IP address for starting Driverless AI.
b. SSH into the machine running Driverless AI, and verify that the service_account.json file is in the
/etc/dai/ folder.
c. Restart the machine for the changes to take effect.
Perform the following steps to upgrade the Driverless AI Google Platform offering. Replace dai_NEWVERSION.
deb below with the new Driverless AI version (for example, dai_1.6.1_amd64.deb). Note that this upgrade
process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not need to man-
ually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
This section describes how to install and start Driverless AI from scratch using a Docker container in a Google
Compute environment.
This installation assumes that you already have a Google Cloud Platform account. If you don’t have an account,
go to https://console.cloud.google.com/getting-started to create one. In addition, refer to Google’s Machine Types
documentation for information on Google Compute machine types.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
If you are trying GCP for the first time and have just created an account, please check your Google Compute Engine
(GCE) resource quota limits. By default, GCP allocates a maximum of 8 CPUs and no GPUs. You can change these
settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/
compute/quotas for more information, including information on how to check your quota and request additional quota.
Installation Procedure
5. Create a Firewall rule for Driverless AI. On the Google Cloud Platform left navigation panel, select VPC
network > Firewall rules. Specify the following settings:
• Specify a unique name and Description for this instance.
• Change the Targets dropdown to All instances in the network.
• Specify the Source IP ranges to be 0.0.0.0/0.
• Under Protocols and Ports, select Specified protocols and ports and enter the following: tcp:
12345.
Click Create at the bottom of the form when you are done.
6. On the VM Instances page, SSH to the new VM Instance by selecting Open in Browser Window from the SSH
dropdown.
7. H2O provides a script for you to run in your VM instance. Open an editor in the VM instance (for example,
vi). Copy one of the scripts below (depending on whether you are running GPUs or CPUs). Save the script as
install.sh.
# SCRIPT FOR GPUs ONLY
apt-get -y update
apt-get -y --no-install-recommends install \
curl \
apt-utils \
software-properties-common
# (software-properties-common provides add-apt-repository, which is used below)
add-apt-repository -y ppa:graphics-drivers/ppa
add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/
˓→ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install -y \
nvidia-384 \
nvidia-modprobe \
docker-ce
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get install -y docker-ce
chmod +x install.sh
sudo ./install.sh
mkdir ~/tmp
mkdir ~/log
mkdir ~/data
mkdir ~/scripts
mkdir ~/license
mkdir ~/demo
mkdir -p ~/jupyter/notebooks
10. Add your Google Compute user name to the Docker container.
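If this step refers to docker group membership (an assumption; the step text above is terse), a sketch would be:

# Allow your user to run docker without sudo; takes effect after logging out/in or rebooting
sudo usermod -aG docker <your-google-compute-username>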
sudo reboot
14. If you are running CPUs, you can skip this step. Otherwise, you must enable persistence of the GPU. Note that
this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/
deploy/driver-persistence/index.html.
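One common way to enable persistence mode, assuming nvidia-smi is available on the instance:

# Enable GPU persistence mode (must be re-run after every reboot)
sudo nvidia-smi -pm 1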
15. Start the Driverless AI Docker image with nvidia-docker run (GPUs) or docker run (CPUs). Note that you must have write privileges for the folders that are created below. You can replace 'pwd' with the path to /home/<username> or start with sudo nvidia-docker run. Replace TAG with the Docker image tag (run docker images if necessary). Also, refer to Using Data Connectors with the Docker Image for information on how to add the GCS and GBQ data connectors to your Driverless AI instance.
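The run command itself is not shown above; a sketch patterned on the other Docker examples in this guide (directory names match those created earlier, and TAG is your image tag). After it starts, you should see the welcome banner shown below.

nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG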
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
http://Your-Driverless-AI-Host-Machine:12345
The Google Compute Engine instance will continue to run even when you close the portal. You can stop the instance
using one of the following methods:
Stopping in the browser
1. On the VM Instances page, click on the VM instance that you want to stop.
2. Click Stop at the top of the page.
3. A confirmation page will display. Click Stop to stop the instance.
Stopping in Terminal
SSH into the machine that is running Driverless AI, and then run the following:
h2oai stop
Upgrading Driverless AI
This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:
3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load Driverless AI
version. If necessary, replace VERSION with your image.
5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:
# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp
At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
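The commands for steps 2, 4, and 7 above are not shown; a minimal consolidated sketch, assuming the x86_64 CentOS 7 image (directory and file names are examples):

# Step 2: set up a directory for the new version
mkdir dai_rel_VERSION

# Step 4: load the new image inside that directory
cd dai_rel_VERSION
docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

# Step 7: start the new image using the same run command as the original install,
# substituting the new TAG reported by docker images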
This section provides installation steps for IBM Power environments. This includes information for Docker image, RPM, DEB, and TAR SH installs.
Notes:
• Ubuntu is not fully tested on Power.
• OpenCL and LightGBM with GPUs are not supported on Power currently.
To simplify local installation, Driverless AI is provided as a Docker image for the following system combination:
Notes:
• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• OpenCL and LightGBM with GPUs are not supported on Power currently.
For the best performance, including GPU support, use nvidia-docker2. For a lower-performance experience without
GPUs, use regular docker (with the same docker image).
These installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that you will create during the installation process.
This section describes how to install and start the Driverless AI Docker image on RHEL for IBM Power LE systems
with GPUs. Note that nvidia-docker has limited support for ppc64le machines. More information about nvidia-docker
support for ppc64le machines is available here.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.
2. Add the following to cuda-rhel7.repo in /etc/yum.repos.d/:
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/ppc64le
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/ppc64le/7fa2af80.pub
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
7. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
8. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.
9. Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
10. Set up the data, log, and license directories on the host machine (within the new directory):
# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp
11. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
12. Run docker images to find the image tag.
13. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:
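The command is not shown above; a sketch based on the CPU-only example later in this section, using nvidia-docker and the ppc64le image (TAG is your image tag). After it starts, you should see the welcome banner shown below.

nvidia-docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-ppc64le:TAG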
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
This section describes how to install and start the Driverless AI Docker image on RHEL for IBM Power LE systems
with CPUs. Note that this uses Docker EE and not NVIDIA Docker. GPU support will not be available.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Note: As of this writing, Driverless AI has only been tested on RHEL version 7.4.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Install and start Docker CE.
2. On the machine that is running Docker EE, retrieve the Driverless AI Docker image from https://www.h2o.ai/driverless-ai-download/.
3. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
4. Load the Driverless AI Docker image inside the new directory. The following example shows how to load
Driverless AI. Replace VERSION with your image.
5. Set up the data, log, license, and tmp directories (within the new directory):
# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp
6. Copy data into the data directory on the host. The data will be visible inside the Docker container at /<user-home>/data.
7. Run docker images to find the image tag.
8. Start the Driverless AI Docker image and replace TAG below with the image tag. Note that GPU support will
not be available.
$ docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-ppc64le:TAG
--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------
To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.
This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:
3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load Driverless AI
version. If necessary, replace VERSION with your image.
5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:
# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp
At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
For IBM machines that will not use the Docker image or DEB, an RPM installation is available for ppc64le RHEL 7.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that you will create during the installation process.
Note: OpenCL and LightGBM with GPUs are not supported on Power currently.
Requirements
• RedHat 7
• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• cuDNN >= 7.4.1 (Required only if using TensorFlow.)
• Driverless AI RPM, available from https://www.h2o.ai/download/
• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
Installing Driverless AI
Run the following commands to install the Driverless AI RPM. Replace VERSION with your specific version.
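The basic command is not shown here; a sketch (the package file name depends on the build you downloaded):

# Install Driverless AI with the default 'dai' service user and group
sudo rpm -i dai-VERSION.rpm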
By default, the Driverless AI processes are owned by the ‘dai’ user and ‘dai’ group. You can optionally specify a different service user and group as shown below. Replace <myuser> and <mygroup> as appropriate.
# Temporarily specify service user and group when installing Driverless AI.
# rpm saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files.
sudo DAI_USER=myuser DAI_GROUP=mygroup rpm -i dai-VERSION.rpm
Starting Driverless AI
If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
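A sketch of a typical start sequence on a systemd system (nvidia-smi persistence mode is one common way to satisfy the persistence requirement; GPU systems only):

# Enable GPU persistence (re-run after every reboot)
sudo nvidia-smi -pm 1

# Start Driverless AI via systemd
sudo systemctl start dai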
Stopping Driverless AI
Upgrading Driverless AI
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
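The commands themselves are not shown here; a minimal sketch of a typical RPM upgrade on a systemd system (back up the tmp directory first, as noted in the warnings above):

# Stop Driverless AI and back up the tmp directory (backup path is an example)
sudo systemctl stop dai
sudo cp -a /opt/h2oai/dai/tmp /opt/h2oai/dai/tmp.backup

# Upgrade the package and restart
sudo rpm -U dai-NEWVERSION.rpm
sudo systemctl start dai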
Uninstalling Driverless AI
# Uninstall.
sudo rpm -e dai
CAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot
be undone):
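A sketch of that cleanup; again, this cannot be undone:

sudo rm -rf /opt/h2oai/dai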
The Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive.
This form of installation does not require a privileged user to install or to run.
This artifact has the same compatibility matrix as the RPM and DEB packages (combined); it just comes packaged slightly differently. See those sections for a full list of supported environments.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in.
Note: OpenCL and LightGBM with GPUs are not supported on Power currently.
Requirements
Installing Driverless AI
Run the following commands to install the Driverless AI TAR SH. Replace VERSION with your specific version.
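A sketch of that step (the archive name is an example; the installer unpacks into the current directory):

# Make the self-extracting archive executable and run it
chmod +x dai-VERSION.sh
./dai-VERSION.sh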
You may now cd to the unpacked directory and optionally make changes to config.toml.
Starting Driverless AI
If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
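A sketch of a typical start sequence for the TAR SH install (run-dai.sh is the run script referenced elsewhere in this guide), after which the log files listed below can be inspected:

# Enable GPU persistence (GPU systems only; re-run after every reboot)
sudo nvidia-smi -pm 1

# From the unpacked Driverless AI directory, start Driverless AI
./run-dai.sh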
less log/dai.log
less log/h2o.log
less log/procsy.log
less log/vis-server.log
Stopping Driverless AI
Uninstalling Driverless AI
To uninstall Driverless AI, just remove the directory created by the unpacking process. By default, all files for Driverless AI are contained within this directory.
Upgrading Driverless AI
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
For default IBM Power9 systems with RHEL 7 installed, be sure to open port 12345 in the firewall. For example:
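With firewalld on RHEL 7, a sketch would be (adjust the zone to your configuration):

sudo firewall-cmd --zone=public --add-port=12345/tcp --permanent
sudo firewall-cmd --reload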
Some users may find it necessary to grow their disk. An example describing how to add disk space to a virtual
machine is available at https://www.geoffstratton.com/expand-hard-disk-ubuntu-lvm. The steps for an IBM Power9
system with RHEL 7 would be similar.
8.4 Mac OS X
This section describes how to install, start, stop, and upgrade the Driverless AI Docker image on Mac OS X. Note that
this uses regular Docker and not NVIDIA Docker.
Notes:
• GPU support is not available on Mac OS X.
• Scoring is not available on Mac OS X.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that you will create during the installation process.
Caution:
• This is an extremely memory-constrained environment for experimental purposes only. Stick to small datasets!
For serious use, please use Linux.
• Be aware that there are known performance issues with Docker for Mac. More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#technology.
8.4.1 Environment
4. On the File Sharing tab, verify that your macOS directories (and their subdirectories) can be bind mounted into Docker containers. More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#namespaces.
5. Set up a directory for the version of Driverless AI within the Terminal, replacing VERSION below with your
Driverless AI Docker image version:
mkdir dai_rel_VERSION
6. With Docker running, open a Terminal and move the downloaded Driverless AI image to your new directory.
7. Change directories to the new directory, then load the image using the following command. This example shows
how to load Driverless AI. Replace VERSION with your image. Note that this process may take some time to
complete.
cd dai_rel_VERSION
docker load < dai-docker-centos7-x86_64-VERSION.tar.gz
8. Set up the data, log, license, and tmp directories (within the new Driverless AI directory):
mkdir data
mkdir log
mkdir license
mkdir tmp
9. Optionally copy data into the data directory on the host. The data will be visible inside the Docker container at
/data. You can also upload data after starting Driverless AI.
10. Run docker images to find the image tag.
11. Start the Driverless AI Docker image (still within the new Driverless AI directory). Replace TAG below with
the image tag. Note that GPU support will not be available.
docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.
This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.
Upgrade Steps
1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:
3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load Driverless AI
version. If necessary, replace VERSION with your image.
5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:
# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp
At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
This section describes how to install, start, stop, and upgrade Driverless AI on a Windows 10 Pro machine. The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that you will create during the installation process.
The recommended way of installing Driverless AI on Windows is via WSL Ubuntu. Running a Driverless AI Docker
image on Windows is also possible but not preferred.
Notes:
• GPU support is not available on Windows.
• Scoring is not available on Windows.
Caution: This should be used only for experimental purposes and only on small data. For serious use, please use
Linux.
8.5.2 Environment
This section describes how to install the Driverless AI DEB on Windows 10 using Windows Subsystem for Linux
(WSL).
Requirements
Installation Procedure
The Driverless AI Windows DEB cannot be upgraded. In order to run a newer version, you must first uninstall the prior version and then install the newer one.
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Run the following commands to uninstall a prior version.
At this point, follow the previous installation procedure to install a newer version of Driverless AI.
Notes:
• Installing the Driverless AI Docker image on Windows is not the recommended method for running Driverless
AI. RPM and DEB installs are preferred.
• Be aware that there are known issues with Docker for Windows. More information is available here: https://github.com/docker/for-win/issues/188.
• Consult with your Windows System Admin if
– Your corporate environment does not allow third-party software installs
– You are running Windows Defender
– Your machine is not running with Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Requirements
• Windows 10 Pro
Installation Procedure
• Adjust the amount of memory given to Docker to be at least 10 GB. Driverless AI won’t run at all
with less than 10 GB of memory.
• Optionally adjust the number of CPUs given to Docker.
You can adjust these settings by clicking on the Docker whale in your taskbar (look for hidden tasks, if
necessary), then selecting Settings > Shared Drive and Settings > Advanced as shown in the following
screenshots. Don’t forget to Apply the changes after setting the desired memory value. (Docker will
restart.) Note that if you cannot make changes, stop Docker and then start Docker again by right clicking
on the Docker icon on your desktop and selecting Run as Administrator.
4. Open a PowerShell terminal and set up a directory for the version of Driverless AI on the host machine, replacing
VERSION below with your Driverless AI Docker image version:
md dai_rel_VERSION
5. With Docker running, navigate to the location of your downloaded Driverless AI image. Move the downloaded
Driverless AI image to your new directory.
6. Change directories to the new directory, then load the image using the following command. This example shows
how to load Driverless AI. Replace VERSION with your image.
cd dai_rel_VERSION
docker load -i .\dai-docker-centos7-x86_64-VERSION.tar.gz
7. Set up the data, log, license, and tmp directories (within the new directory).
md data
md log
md license
md tmp
8. Copy data into the /data directory. The data will be visible inside the Docker container at /data.
9. Run docker images to find the image tag.
10. Start the Driverless AI Docker image. Be sure to replace path_to_ below with the entire path to the location
of the folders that you created (for example, “c:/Users/user-name/driverlessai_folder/data”), and replace TAG
with the Docker image tag. Note that this is regular Docker, not NVIDIA Docker. GPU support will not be
available.
docker run --pid=host --init --rm --shm-size=256m -p 12345:12345 -v path_to_data:/data -v path_to_log:/log -v path_to_license:/license -v path_to_tmp:/tmp h2oai/dai-centos7-x86_64:TAG
To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.
This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.
Requirements
As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
Upgrade Steps
1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:
3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load Driverless AI
version. If necessary, replace VERSION with your image.
5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:
# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp
At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
Using the config.toml File
Admins can edit a config.toml file when starting the Driverless AI Docker image. The config.toml file includes all possible configuration options that would otherwise be specified in the nvidia-docker run command. This file is located in a folder on the container. You can make updates to environment variables directly in this file. Driverless AI will use the updated config.toml file when starting from native installs. Docker users can specify the updated config.toml file when starting the Driverless AI Docker image.
The configuration engine reads and overrides variables in the following order:
1. h2oai/config/config.toml - This is an internal file that is not visible.
2. config.toml - Place this file in a folder or mount it in a Docker container and specify the path in the “DRIVERLESS_AI_CONFIG_FILE” environment variable.
3. Environment variable - Configuration variables can also be provided as environment variables. They must have
the prefix DRIVERLESS_AI_ followed by the variable name in all caps. For example, “authentication_method”
can be provided as “DRIVERLESS_AI_AUTHENTICATION_METHOD”.
1. Copy the config.toml file from inside the Docker image to your local filesystem.
2. Edit the desired variables in the config.toml file. Save your changes when you are done.
3. Start Driverless AI with the DRIVERLESS_AI_CONFIG_FILE environment variable. Make sure this points to
the location of the edited config.toml file so that the software finds the configuration file.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"
2. Edit the desired variables in the config.toml file. Save your changes when you are done.
3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
For reference, below is a copy of the standard config.toml file included with this version of Driverless AI. The sections that follow describe some examples showing how to set different environment variables, data connectors, authentication, and notifications.
Setting Environment Variables
Driverless AI provides a number of environment variables that can be passed when starting Driverless AI or specified
in a config.toml file. The complete list of variables is in the Using the config.toml File section. The steps for specifying
variables vary depending on whether you installed a Driverless AI RPM, DEB, or TAR SH or whether you are running
a Docker image.
Each property must be prepended with DRIVERLESS_AI_. The example below starts Driverless AI with environment
variables that enable S3 and HDFS access (without authentication).
nvidia-docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="local" \
-e DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="<htpasswd_file_location>" \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
The config.toml file is available in the /etc/dai folder after the RPM or DEB is installed (or in the unpacked directory for the TAR SH). Edit the desired variables in this file, and then restart Driverless AI.
The example below shows the configuration options in the config.toml file to set when enabling S3 and HDFS access (without authentication).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Specify the desired configuration options to enable S3 and HDFS access (without authentication).
# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google BigQuery, remember to configure gcs_path_to_service_account_json below
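# A minimal sketch of the corresponding setting for this example (adjust the list to the connectors you need):
enabled_file_systems = "file,s3,hdfs"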
# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam : Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server, look
# for additional settings under LDAP settings
# local: Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "local"
3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Linux RPM or DEB with systemd
sudo systemctl start dai
# Linux TAR SH
./run-dai.sh
Data Connectors
Driverless AI provides various data connectors for external data sources. Data sources are exposed in the form of file systems. Each file system is prefixed by a unique prefix. For example:
• To reference data on S3, use s3://.
• To reference data on HDFS, use the prefix hdfs://.
• To reference data on Azure Blob Store, use https://<storage_name>.blob.core.windows.net.
• To reference data on BlueData Datatap, use dtap://.
• To reference data on Google BigQuery, make sure you know the Google BigQuery dataset and the table that
you want to query. Use a standard SQL query to ingest data.
• To reference data on Google Cloud Storage, use gs://
• To reference data on kdb+, use the hostname and the port http://<kdb_server>:<port>
• To reference data on Minio, use http://<endpoint_url>.
• To reference data on Snowflake, use a standard SQL query to ingest data.
• To access a SQL database via JDBC, use a SQL query with the syntax associated with your database.
Refer to the following sections for more information:
Available file systems can be configured via the enabled_file_systems property. Note that each property must
be prepended with DRIVERLESS_AI_. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs,gcs,gbq,kdb,minio,snow,dtap,azrbs" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-ppc64le:TAG
The sections that follow show examples describing how to use environment variables to enable HDFS, S3, Google Cloud Storage, Google BigQuery, Minio, Snowflake, kdb+, Azure Blob Store, BlueData DataTap, and JDBC data sources.
11.1.1 S3 Setup
Driverless AI allows you to explore S3 data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with S3.
Start Driverless AI
The following sections describe how to enable the S3 data connector when starting Driverless AI in Docker. This can be done by specifying each environment variable in the nvidia-docker run command or by editing the configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the S3 data connector and disables authentication. It does not pass any S3 access key or secret; however, it configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv. Replace TAG below with the image tag.
nvidia-docker run \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3" \
-p 12345:12345 \
--init -it --rm \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example enables the S3 data connector with authentication by passing an S3 access key ID and an access key.
It also configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference data
stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv. Replace TAG below
with the image tag.
nvidia-docker run \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3" \
-e DRIVERLESS_AI_AWS_ACCESS_KEY_ID="<access_key_id>" \
-e DRIVERLESS_AI_AWS_SECRET_ACCESS_KEY="<access_key>" \
-p 12345:12345 \
--init -it --rm \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure S3 options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables S3 with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, upload, s3"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Driverless AI allows you to explore HDFS data sources from within the Driverless AI application. This section
provides instructions for configuring Driverless AI to work with HDFS.
• CDH 5.4
• CDH 5.5
• CDH 5.6
• CDH 5.7
• CDH 5.8
• CDH 5.9
• CDH 5.10
• CDH 5.13
• CDH 5.14
• CDH 5.15
• CDH 5.16
• CDH 6.0
• CDH 6.1
• CDH 6.2
• CDH 6.3
• HDP 2.2
• HDP 2.3
• HDP 2.4
• HDP 2.5
• HDP 2.6
• HDP 3.0
• HDP 3.1
• hdfs_config_path: The location of the HDFS config folder path. This folder can contain multiple config files.
• hdfs_auth_type: Selects HDFS authentication. Available values are:
– principal: Authenticate with HDFS with a principal user.
– keytab: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos
keytab needs to be owned by the DAI user.
– keytabimpersonation: Login with impersonation using a keytab.
– noauth: No authentication needed.
• key_tab_path: The path of the principal key tab file. For use when hdfs_auth_type=principal.
• hdfs_app_principal_user: The Kerberos application principal user.
• hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.
– -Djava.security.krb5.conf
– -Dsun.security.krb5.debug
– -Dlog4j.configuration
• hdfs_app_classpath: The HDFS classpath.
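For native installs, these options map directly to config.toml entries; a minimal sketch for keytab authentication (all values are placeholders):

enabled_file_systems = "file,upload,hdfs"
hdfs_auth_type = "keytab"
key_tab_path = "/path/to/<<keytabname>>"
hdfs_app_principal_user = "<<user@kerberosrealm>>"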
Start Driverless AI
This section describes how to enable the HDFS data connector when starting Driverless AI in Docker. This can be done by specifying each environment variable in the nvidia-docker run command or by editing the configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS configuration file; however, it configures Docker DNS by passing the name and IP of the HDFS name node. This allows users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Notes:
• If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If
the time difference between clients and DCs is 5 minutes or more, Kerberos failures will occur.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user;
otherwise Driverless AI will not be able to read the keytab, will fall back to simple authentication, and will
therefore fail.
This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the environment variable DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER to reference a user
for whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab' \
-e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Notes:
• If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
• Logins are case sensitive when keytab-based impersonation is configured.
The example:
• Sets the authentication type to keytabimpersonation.
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER variable, which references a user for
whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation' \
-e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables HDFS with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port,
which defaults to 12347, also has to be changed.
• enabled_file_systems = "file, upload, hdfs"
• procsy_ip = "127.0.0.1"
• procsy_port = 8080
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
The following example shows how to build an H2O-3 Hadoop image and run Driverless AI on that image. This
example uses CDH 6.0. Change the H2O_TARGET to specify a different platform.
1. Clone and then build H2O-3 for CDH 6.0.
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew clean build -x test
export H2O_TARGET=cdh6.0
export BUILD_HADOOP=true
./gradlew clean build -x test
Driverless AI allows you to explore Azure Blob Store data sources from within the Driverless AI application. This
section describes how to enable the Azure Blob Store data connector in Docker environments.
azure_blob_account_name: The Microsoft Azure Storage account name. This should be the DNS prefix created
when the account was created (for example, “mystorage”).
azure_blob_account_key: Specify the account key that maps to your account name.
azure_connection_string: Optionally specify a new connection string. With this option, you can include an
override for a host, port, and/or account name. For example,
azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=<account_name>;AccountKey=<account_key>;BlobEndpoint=http://<host>:<port>/<account_name>;"
Start Driverless AI
This section describes how to enable the Azure Blob Store data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the Azure Blob Store data connector. This allows users to reference data stored on your Azure
storage account using the account name, for example: https://mystorage.blob.core.windows.net. Re-
place TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,azrbs" \
-e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_NAME="mystorage" \
-e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_KEY="<access_key>" \
-p 12345:12345 \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure Azure Blob Store options in the config.toml file, and then specify that file when
starting Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, azrbs"
• azure_blob_account_name = "mystorage"
• azure_blob_account_key = "<account_key>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This section provides instructions for configuring Driverless AI to work with BlueData DataTap.
• dtap_key_tab_path: The path of the principal key tab file. For use when
dtap_auth_type=principal.
• dtap_app_principal_user: The Kerberos app principal user (recommended).
• dtap_app_login_user: The user ID of the current user (for example, user@realm).
• dtap_app_jvm_args: JVM args for DTap distributions. Separate each argument with spaces.
• dtap_app_classpath: The DTap classpath.
Start Driverless AI
This section describes how to enable the BlueData DataTap data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the DataTap data connector and disables authentication. It does not pass any configuration file;
however it configures Docker DNS by passing the name and IP of the DTap name node. This allows users to reference
data stored in DTap directly using the name node address, for example: dtap://name.node/datasets/iris.
csv or dtap://name.node/datasets/. (Note: The trailing slash is currently required for directories.) Replace
TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,dtap" \
-e DRIVERLESS_AI_DTAP_AUTH_TYPE='noauth' \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Notes:
• If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If
the time difference between clients and DCs is 5 minutes or more, Kerberos failures will occur.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user;
otherwise Driverless AI will not be able to read the keytab, will fall back to simple authentication, and will
therefore fail.
This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the environment variable DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER to reference a user
for whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,dtap" \
-e DRIVERLESS_AI_DTAP_AUTH_TYPE='keytab' \
-e DRIVERLESS_AI_DTAP_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Notes:
• If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
The example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER variable, which references a user for
whom the keytab was created (usually in the form of user@realm).
• Configures the DRIVERLESS_AI_DTAP_APP_LOGIN_USER variable, which references a user who is being
impersonated (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,dtap" \
-e DRIVERLESS_AI_DTAP_AUTH_TYPE='keytabimpersonation' \
-e DRIVERLESS_AI_DTAP_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
-e DRIVERLESS_AI_DTAP_APP_LOGIN_USER='<<user@kerberosrealm>>' \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure DataTap options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables DataTap with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, dtap"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This
section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to
enable authentication. If you enable the GCS and/or GBQ connectors, those file systems will be available in the UI,
but you will not be able to use those connectors without authentication.
In order to enable the GBQ data connector with authentication, you must:
1. Retrieve a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the /json_auth_file.json in the GCS_PATH_TO_SERVICE_ACCOUNT_JSON environment
variable.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.
Start Driverless AI
This section describes how to enable the Google BigQuery data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the GBQ data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google BigQuery authentications. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gbq" \
-e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/service_account_json.json:/service_account_json.json \
h2oai/dai-centos7-x86_64:TAG
After Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or
Drag and Drop) drop-down menu.
This example shows how to configure the GBQ data connector options in the config.toml file, and then specify that
file when starting Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, gbq"
• gcs_path_to_service_account_json = "/service_account_json.json"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Driverless AI allows you to explore Google Cloud Storage data sources from within the Driverless AI application. This
section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup requires
you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but
you will not be able to use those connectors without authentication.
In order to enable the GCS data connector with authentication, you must:
1. Obtain a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the /json_auth_file.json in the GCS_PATH_TO_SERVICE_ACCOUNT_JSON environment
variable.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.
Start Driverless AI
This section describes how to enable the Google Cloud Storage data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the GCS data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google Cloud Storage authentications. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gcs" \
-e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/service_account_json.json:/service_account_json.json \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure the GCS data connector options in the config.toml file, and then specify that file
when starting Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, gcs"
• gcs_path_to_service_account_json = "/service_account_json.json"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Driverless AI allows you to explore kdb+ data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with kdb+.
Start Driverless AI
The following section describes how to enable the kdb+ data connector when starting Driverless AI in Docker. This
can be done by specifying each environment variable in the nvidia-docker run command or by editing the config-
uration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the kdb+ connector without authentication. The only required flags are the hostname and the
port. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,kdb" \
-e DRIVERLESS_AI_KDB_HOSTNAME="<ip_or_host_of_kdb_server>" \
-e DRIVERLESS_AI_KDB_PORT="<kdb_server_port>" \
-p 12345:12345 \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example provides user credentials for accessing a kdb+ server from Driverless AI. Replace TAG below with the
image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,kdb" \
-e DRIVERLESS_AI_KDB_HOSTNAME="<ip_or_host_of_kdb_server>" \
-e DRIVERLESS_AI_KDB_PORT="<kdb_server_port>" \
-e DRIVERLESS_AI_KDB_USER="<username>" \
-e DRIVERLESS_AI_KDB_PASSWORD="<password>" \
-p 12345:12345 \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
After the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and
Drop) drop-down menu.
This example shows how to configure kdb+ options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables kdb+ with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, upload, kdb"
• kdb_hostname = "<ip_or_host_of_kdb_server>"
• kdb_port = "<kdb_server_port>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
After the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and
Drop) drop-down menu.
This section provides instructions for configuring Driverless AI to work with Minio. Note that unlike S3, authentication
must also be configured when the Minio data connector is specified.
Start Driverless AI
The following section describes how to enable the Minio data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the Minio data connector with authentication by passing an endpoint URL, access key ID, and an
access key. It also configures Docker DNS by passing the name and IP of the name node. This allows users to reference
data stored in Minio directly using the endpoint URL, for example: http://<endpoint_url>/<bucket>/datasets/iris.csv.
Replace TAG below with the image tag.
nvidia-docker run \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,minio" \
-e DRIVERLESS_AI_MINIO_ENDPOINT_URL="<endpoint_url>" \
-e DRIVERLESS_AI_MINIO_ACCESS_KEY_ID="<access_key_id>" \
-e DRIVERLESS_AI_MINIO_SECRET_ACCESS_KEY="<access_key>" \
-p 12345:12345 \
--init -it --rm \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure Minio options in the config.toml file, and then specify that file when starting
Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, upload, minio"
• minio_endpoint_url = "<endpoint_url>"
• minio_access_key_id = "<access_key_id>"
• minio_secret_access_key = "<access_key>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
11.1.9 Snowflake
Driverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section
provides instructions for configuring Driverless AI to work with Snowflake. This setup requires you to enable authen-
tication. If you enable the Snowflake connector, it will be available in the UI, but you will not be able to use it
without authentication.
Start Driverless AI
The following section describes how to enable the Snowflake data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the Snowflake data connector with authentication by passing the account, user, and
password variables. Replace TAG below with the image tag.
nvidia-docker run \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,snow" \
-e DRIVERLESS_AI_SNOWFLAKE_ACCOUNT="<account_id>" \
-e DRIVERLESS_AI_SNOWFLAKE_USER="<username>" \
-e DRIVERLESS_AI_SNOWFLAKE_PASSWORD="<password>" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/service_account_json.json:/service_account_json.json \
h2oai/dai-centos7-x86_64:TAG
After the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or
Drag and Drop) drop-down menu.
This example shows how to configure Snowflake options in the config.toml file, and then specify that file when starting
Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, snow"
• snowflake_account = "<account_id>"
• snowflake_user = "<username>"
• snowflake_password = "<password>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
After the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or
Drag and Drop) drop-down menu.
8. Enter Snowflake Query: Specify the Snowflake query that you want to execute.
9. When you are finished, select the Click to Make Query button to add the dataset.
11.1.10 JDBC
Driverless AI allows you to explore Java Database Connectivity (JDBC) data sources from within the Driverless AI
application. This section provides instructions for configuring Driverless AI to work with JDBC.
Tested Databases
The following databases have been tested for minimal functionality. Note that JDBC drivers that are not included in
this list should still work with Driverless AI. We recommend that you test out your JDBC driver even if you do not see it
on the list of tested databases. See the Adding an Untested JDBC Driver section at the end of this chapter for information
on how to try out an untested JDBC driver.
• Oracle DB
• PostgreSQL
• Amazon Redshift
• Teradata
• jdbc_app_configs: Configuration for the JDBC connector. This is a JSON/Dictionary String with multiple
keys. Note: This requires a JSON key (typically the name of the database being configured) to be associated
with a nested JSON that contains the url, jarpath, and classpath fields. In addition, this should take the
format:
"""{"my_jdbc_database": {"url": "jdbc:my_jdbc_database://hostname:port/database",
"jarpath": "/path/to/my/jdbc/database.jar", "classpath": "com.my.jdbc.Driver"}}"""
For example:
"""{
"postgres": {
"url": "jdbc:postgresql://ip address:port/postgres",
"jarpath": "/path/to/postgres_driver.jar",
"classpath": "org.postgresql.Driver"
},
"mysql": {
"url":"mysql connection string",
"jarpath": "/path/to/mysql_driver.jar",
"classpath": "my.sql.classpath.Driver"
}
}"""
• jdbc_app_jvm_args: Extra jvm args for JDBC connector. For example, “-Xmx4g”.
• jdbc_app_classpath: Optionally specify an alternative classpath for the JDBC connector.
Start Driverless AI
This section describes how to enable JDBC when starting Driverless AI in Docker. This can be done by specifying
each environment variable in the nvidia-docker run command or by editing the configuration options in the
config.toml file and then specifying that file in the nvidia-docker run command.
This example enables the JDBC connector for PostgreSQL. Note that the JDBC connection strings will vary depending
on the database that is used. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,jdbc" \
-e DRIVERLESS_AI_JDBC_APP_CONFIGS="""{"postgres":
{"url": "jdbc:postgres://localhost:5432/my_database",
"jarpath": "/path/to/postgresql/jdbc/driver.jar",
"classpath": "org.postgresql.Driver"}}""" \
-e DRIVERLESS_AI_JDBC_APP_JVM_ARGS="-Xmx2g" \
-p 12345:12345 \
-v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
This example shows how to configure JDBC options in the config.toml file, and then specify that file when starting
Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
enabled_file_systems = "file, upload, jdbc"
jdbc_app_configs = '{JSON string containing configurations}'
jdbc_app_configs = """{"postgres": {"url": "jdbc:postgress://localhost:5432/my_database",
"jarpath": "/path/to/postgresql/jdbc/driver.jar",
"classpath": "org.postgresql.Driver"}}"""
2. Mount the config.toml file and requisite JAR files into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
After the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and
Drop) drop-down menu.
Query Examples
The following are sample configurations and queries for Oracle DB and PostgreSQL:
Oracle DB
1. Configuration:
jdbc_app_configs = '{"oracledb": {"url": "jdbc:oracle:thin:@localhost:1521/oracledatabase", "jarpath": "/home/ubuntu/jdbc-jars/ojdbc8.jar", "classpath": "oracle.jdbc.OracleDriver"}}'
2. Sample Query:
• Select oracledb from the Select JDBC Connection dropdown menu.
• JDBC Username: oracleuser
• JDBC Password: oracleuserpassword
• ID Column Name:
• Query:
SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION
Note: Because this query does not specify an ID Column Name, it will only work for small data. How-
ever, the NEW_ID column can be used as the ID Column if the query is for larger data.
3. Click the Click to Make Query button to execute the query.
PostgreSQL
1. Configuration:
jdbc_app_configs = '{"postgres": {"url": "jdbc:postgresql://localhost:5432/postgresdatabase", "jarpath": "/home/ubuntu/postgres-artifacts/postgres/Driver.jar", "classpath": "org.postgresql.Driver"}}'
2. Sample Query:
• Select postgres from the Select JDBC Connection dropdown menu.
• JDBC Username: postgres_user
• JDBC Password: pguserpassword
• ID Column Name: id
• Query:
SELECT * FROM loan_level WHERE LOAN_TYPE = 5 (selects all columns from table loan_level with column LOAN_TYPE containing value 5)
Adding an Untested JDBC Driver
We encourage you to try out JDBC drivers that are not tested in house.
1. Download the JDBC jar for your database.
2. Move your JDBC jar file to a location that DAI can access.
3. Modify the following config.toml settings. Note that these can also be specified as environment variables when
starting Driverless AI in Docker:
# enable the JDBC file system
enabled_file_systems = "upload, file, hdfs, s3, recipe_file, jdbc"
4. Save the changes when you are done, then stop/restart Driverless AI.
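In addition to enabling the jdbc file system in step 3, the new driver presumably also has to be registered in jdbc_app_configs using the same format shown earlier in this chapter. The entry below is a hypothetical sketch for a MySQL driver; the URL format, jar path, and driver class are illustrative and should be verified against your driver's documentation:
# Hypothetical entry for an untested MySQL driver (illustrative values only)
jdbc_app_configs = """{"mysql": {"url": "jdbc:mysql://<host>:<port>/<database>",
"jarpath": "/path/to/mysql-connector.jar",
"classpath": "com.mysql.cj.jdbc.Driver"}}"""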
The config.toml file is available in the etc/dai folder after the RPM, DEB, or TAR SH is installed. Before enabling a
connector, be sure to export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
The sections that follow show examples that describe how to specify configuration options in the config.toml file to
enable HDFS, S3, Google Cloud Storage, Google Big Query, Minio, Snowflake, kdb+, Azure Blob Store, BlueData
DataTap, and JDBC data sources.
11.2.1 S3 Setup
Driverless AI allows you to explore S3 data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with S3.
S3 with No Authentication
This example enables the S3 data connector and disables authentication. It does not pass any S3 access key or secret;
however it configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference
data stored in S3 directly using name node address, for example: s3://name.node/datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
S3 with Authentication
This example enables the S3 data connector with authentication by passing an S3 access key ID and an access key.
It also configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference data
stored in S3 directly using name node address, for example: s3://name.node/datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Specify the following configuration options in the config.toml file:
# S3 Connector credentials
aws_access_key_id = "<access_key_id>"
aws_secret_access_key = "<access_key>"
3. Save the changes when you are done, then stop/restart Driverless AI.
This section provides instructions for configuring Driverless AI to work with HDFS.
Supported Hadoop platforms:
• CDH 5.4
• CDH 5.5
• CDH 5.6
• CDH 5.7
• CDH 5.8
• CDH 5.9
• CDH 5.10
• CDH 5.13
• CDH 5.14
• CDH 5.15
• CDH 5.16
• CDH 6.0
• CDH 6.1
• CDH 6.2
• CDH 6.3
• HDP 2.2
• HDP 2.3
• HDP 2.4
• HDP 2.5
• HDP 2.6
• HDP 3.0
• HDP 3.1
• hdfs_config_path: The location of the HDFS config folder. This folder can contain multiple config
files.
• hdfs_auth_type: Selects HDFS authentication. Available values are:
– principal: Authenticate with HDFS with a principal user.
– keytab: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos
keytab needs to be owned by the DAI user.
– keytabimpersonation: Login with impersonation using a keytab.
– noauth: No authentication needed.
• key_tab_path: The path of the principal key tab file. For use when hdfs_auth_type=principal.
• hdfs_app_principal_user: The Kerberos application principal user.
• hdfs_app_login_user: The user ID of the current user (for example, user@realm).
• hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.
– -Djava.security.krb5.conf
– -Dsun.security.krb5.debug
– -Dlog4j.configuration
• hdfs_app_classpath: The HDFS classpath.
This example enables the HDFS data connector and disables HDFS authentication in the config.toml file. This allows
users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/
datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Specify the following configuration options in the config.toml file. Note that the procsy port, which defaults to
12347, also has to be changed.
# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080
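The HDFS connector itself also has to be listed among the enabled file systems; presumably the same setting used in the Docker example earlier in this document applies here:
# File System Support
enabled_file_systems = "file, upload, hdfs"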
3. Save the changes when you are done, then stop/restart Driverless AI.
This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the option hdfs_app_principal_user to reference a user for whom the keytab was created
(usually in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Specify the following configuration options in the config.toml file:
# HDFS connector
# Auth type can be noauth/principal/keytab/keytabimpersonation
# Specify HDFS Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with HDFS with a principal user
# keytab : Authenticate with a Key tab (recommended)
# keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"
3. Save the changes when you are done, then stop/restart Driverless AI.
Notes:
• If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
• Logins are case sensitive when keytab-based impersonation is configured.
The example:
• Sets the authentication type to keytabimpersonation.
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was
created (usually in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
Driverless AI allows you to explore Azure Blob Store data sources from within the Driverless AI application. This
section describes how to enable the Azure Blob Store data connector in native install environments.
azure_blob_account_name: The Microsoft Azure Storage account name. This should be the DNS prefix created
when the account was created (for example, “mystorage”).
azure_blob_account_key: Specify the account key that maps to your account name.
azure_connection_string: Optionally specify a new connection string. With this option, you can include an
override for a host, port, and/or account name. For example,
azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=<account_name>;AccountKey=<account_key>;BlobEndpoint=http://<host>:<port>/<account_name>;"
This example enables the Azure Blob Store data connector. This allows users to reference data stored on your Azure
storage account using the account name, for example: https://mystorage.blob.core.windows.net.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
This section provides instructions for configuring Driverless AI to work with BlueData DataTap.
• dtap_key_tab_path: The path of the principal key tab file. For use when
dtap_auth_type=principal.
• dtap_app_principal_user: The Kerberos app principal user (recommended).
• dtap_app_login_user: The user ID of the current user (for example, user@realm).
• dtap_app_jvm_args: JVM args for DTap distributions. Separate each argument with spaces.
• dtap_app_classpath: The DTap classpath.
This example enables the DataTap data connector and disables authentication in the config.toml file. This allows users
to reference data stored in DataTap directly using the name node address, for example: dtap://name.node/
datasets/iris.csv or dtap://name.node/datasets/. (Note: The trailing slash is currently required
for directories.)
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the option dtap_app_principal_user to reference a user for whom the keytab was created
(usually in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Specify the following configuration options in the config.toml file:
# Blue Data DTap connector settings are similar to HDFS connector settings.
#
# Specify DTap Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with DTap with a principal user
# keytab : Authenticate with a Key tab (recommended). If running
# DAI as a service, then the Kerberos keytab needs to
# be owned by the DAI user.
# keytabimpersonation : Login with impersonation using a keytab
dtap_auth_type = "keytab"
3. Save the changes when you are done, then stop/restart Driverless AI.
The example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the dtap_app_principal_user variable, which references a user for whom the keytab was
created (usually in the form of user@realm).
• Configures the dtap_app_login_user variable, which references a user who is being impersonated (usu-
ally in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Specify the following configuration options in the config.toml file:
# Blue Data DTap connector settings are similar to HDFS connector settings.
#
# Specify DTap Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with DTap with a principal user
# keytab : Authenticate with a Key tab (recommended). If running
# DAI as a service, then the Kerberos keytab needs to
# be owned by the DAI user.
# keytabimpersonation : Login with impersonation using a keytab
dtap_auth_type = "keytabimpersonation"
3. Save the changes when you are done, then stop/restart Driverless AI.
Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This
section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to
enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you
will not be able to use those connectors without authentication.
In order to enable the GBQ data connector with authentication, you must:
1. Obtain a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json configuration option.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.
This example enables the GBQ data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google BigQuery authentications.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
After Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or
Drag and Drop) drop-down menu.
This section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup
requires you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in
the UI, but you will not be able to use those connectors without authentication.
In order to enable the GCS data connector with authentication, you must:
1. Obtain a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json configuration option.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.
This example enables the GCS data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google Cloud Storage authentications.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
Driverless AI allows you to explore kdb+ data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with kdb+.
This example enables the kdb+ connector without authentication. The only required flags are the hostname and the
port.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
This example provides user credentials for accessing a kdb+ server from Driverless AI.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
After the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and
Drop) drop-down menu.
This section provides instructions for configuring Driverless AI to work with Minio. Note that unlike S3, authentication
must also be configured when the Minio data connector is specified.
This example enables the Minio data connector with authentication by passing an endpoint URL, access key
ID, and an access key. It also configures Docker DNS by passing the name and IP of the Minio end-
point. This allows users to reference data stored in Minio directly using the endpoint URL, for example: http://<endpoint_url>/<bucket>/datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
11.2.9 Snowflake
Driverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section
provides instructions for configuring Driverless AI to work with Snowflake. This setup requires you to enable authen-
tication. If you enable the Snowflake connector, it will be available in the UI, but you will not be able to use it
without authentication.
This example enables the Snowflake data connector with authentication by passing the account, user, and
password variables.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
After the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or
Drag and Drop) drop-down menu.
8. Enter Snowflake Query: Specify the Snowflake query that you want to execute.
9. When you are finished, select the Click to Make Query button to add the dataset.
11.2.10 JDBC
Driverless AI allows you to explore Java Database Connectivity (JDBC) data sources from within the Driverless AI
application. This section provides instructions for configuring Driverless AI to work with JDBC.
Tested Databases
The following databases have been tested for minimal functionality. Note that JDBC drivers that are not included in
this list should still work with Driverless AI. We recommend that you test out your JDBC driver even if you do not see it
on the list of tested databases. See the Adding an Untested JDBC Driver section at the end of this chapter for information
on how to try out an untested JDBC driver.
• Oracle DB
• PostgreSQL
• Amazon Redshift
• Teradata
Start Driverless AI
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Save the changes when you are done, then stop/restart Driverless AI.
After the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and
Drop) drop-down menu.
• Due to resource sharing within Driverless AI, the JDBC Connector is only allocated a relatively
small amount of memory.
• When making large queries, the ID column is used to partition the data into manageable portions.
This ensures that the maximum memory allocation is not exceeded.
• If a query that is larger than the maximum memory allocation is made without specifying an ID
column, the query will not complete successfully.
5. Write a SQL Query in the format of the database that you want to query. (See the Query Examples section
below.) The format will vary depending on the database that is used.
6. Click the Click to Make Query button to execute the query. The time it takes to complete depends on the size
of the data being queried and the network speeds to the database.
On a successful query, you will be returned to the datasets page, and the queried data will be available as a new dataset.
Query Examples
The following are sample configurations and queries for Oracle DB and PostgreSQL:
Oracle DB
1. Configuration:
jdbc_app_configs = '{"oracledb": {"url": "jdbc:oracle:thin:@localhost:1521/oracledatabase", "jarpath": "/home/ubuntu/jdbc-jars/ojdbc8.jar", "classpath": "oracle.jdbc.OracleDriver"}}'
2. Sample Query:
• Select oracledb from the Select JDBC Connection dropdown menu.
• JDBC Username: oracleuser
• JDBC Password: oracleuserpassword
• ID Column Name:
• Query:
SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION
Note: Because this query does not specify an ID Column Name, it will only work for small data. How-
ever, the NEW_ID column can be used as the ID Column if the query is for larger data.
3. Click the Click to Make Query button to execute the query.
PostgreSQL
1. Configuration:
jdbc_app_configs = '{"postgres": {"url": "jdbc:postgresql://localhost:5432/postgresdatabase", "jarpath": "/home/ubuntu/postgres-artifacts/postgres/Driver.jar", "classpath": "org.postgresql.Driver"}}'
2. Sample Query:
• Select postgres from the Select JDBC Connection dropdown menu.
• JDBC Username: postgres_user
• JDBC Password: pguserpassword
• ID Column Name: id
• Query:
SELECT * FROM loan_level WHERE LOAN_TYPE = 5 (selects all columns from table loan_level with column LOAN_TYPE containing value 5)
Adding an Untested JDBC Driver
We encourage you to try out JDBC drivers that are not tested in house.
1. Download the JDBC jar for your database.
2. Move your JDBC jar file to a location that DAI can access.
3. Modify the following config.toml settings. Note that these can also be specified as environment variables when
starting Driverless AI in Docker:
# enable the JDBC file system
enabled_file_systems = "upload, file, hdfs, s3, recipe_file, jdbc"
4. Save the changes when you are done, then stop/restart Driverless AI.
TWELVE
CONFIGURING AUTHENTICATION
Driverless AI supports Client Certificate, LDAP, Local, mTLS, OpenID, none, and unvalidated (default) authentica-
tion. These can be configured by specifying the environment variables when starting the Driverless AI Docker image
or by specifying the appropriate configuration options in the config.toml file.
Notes:
• Driverless AI is also integrated with IBM Spectrum Conductor and supports authentication from Conductor.
Contact [email protected] for more information about using IBM Spectrum Conductor authentication.
• Driverless AI does not support LDAP client auth. If you have LDAP client auth enabled, then the Driverless AI
LDAP connector will not work.
This section describes how to configure client certificate authentication in Driverless AI.
The following options can be specified when configuring client certificate authentication.
Mutual TLS authentication (mTLS) must be enabled in order to enable Client Certificate Authentication. Use the
following configuration options to configure mTLS. Refer to the mTLS Authentication topic for more information on
how to enable mTLS.
• ssl_client_verify_mode: Sets the client verification mode. Choose from the following verification
modes:
• CERT_NONE: The client will not need to provide a certificate. If it does provide a certificate, any resulting
verification errors are ignored.
• CERT_OPTIONAL: The client does not need to provide a certificate. If it does provide a certificate, it is verified
against the configured CA chains.
• CERT_REQUIRED: The client needs to provide a certificate for verification. Note that you will need to configure
the ssl_client_key_file and ssl_client_crt_file options when this mode is selected in order
for Driverless AI to be able to verify its own callback requests.
• ssl_ca_file: Specifies the path to the certification authority (CA) certificate file. This certificate will be
used to verify the client certificate when client authentication is enabled. If this is not specified, clients are
verified using the default system certificates.
• auth_tls_crl_file: The path to the certificate revocation list (CRL) file that is used to verify the client
certificate.
• auth_tls_subject_field: The subject field that is used as a source for a username or other values that
provide further validation.
• auth_tls_field_parse_regexp: The regular expression that is used to parse the subject field in order
to obtain the username or other values that provide further validation.
• auth_tls_user_lookup: Specifies how a user’s identity is obtained. Choose from the following:
– REGEXP_ONLY: Uses auth_tls_subject_field and auth_tls_field_parse_regexp to
extract the username from the client certificate.
– LDAP_LOOKUP: Uses the LDAP server to obtain the username. (Refer to the LDAP Authentication Ex-
ample section for information about additional LDAP Authentication configuration options.)
• auth_tls_ldap_authorization_lookup_filter: (Optional) Specifies an additional search filter
that is performed after the user is found. For example, this can be used to check whether that user is a member
of a particular group.
• auth_tls_ldap_authorization_search_base: Specifies the base DN to start the authorization
lookup from. Used when the above option is specified.
To enable Client Certificate authentication in Docker images, specify the authentication environment variable that
you want to use. Each variable must be prepended with DRIVERLESS_AI_. The example below enables Client
Certificate authentication and uses LDAP_LOOKUP for the TLS user lookup method. Replace TAG below with the
image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_ENABLE_HTTPS="true" \
-e DRIVERLESS_AI_SSL_KEY_FILE="/etc/pki/dai-server.key" \
-e DRIVERLESS_AI_SSL_CRT_FILE="/etc/pki/dai-server.crt" \
-e DRIVERLESS_AI_SSL_CA_FILE="/etc/pki/ca.crt" \
-e DRIVERLESS_AI_SSL_CLIENT_VERIFY_MODE="CERT_REQUIRED" \
-e DRIVERLESS_AI_SSL_CLIENT_KEY_FILE="/etc/pki/dai-self.key" \
-e DRIVERLESS_AI_SSL_CLIENT_CRT_FILE="/etc/pki/dai-self.cert" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="tls_certificate" \
-e DRIVERLESS_AI_AUTH_TLS_SUBJECT_FIELD="CN" \
-e DRIVERLESS_AI_AUTH_TLS_CRL_FILE="/etc/pki/crl.pem" \
-e DRIVERLESS_AI_AUTH_TLS_FIELD_PARSE_REGEXP="(?P<di>.*)" \
-e DRIVERLESS_AI_AUTH_TLS_USER_LOOKUP="LDAP_LOOKUP" \
-e DRIVERLESS_AI_LDAP_SERVER="ldap.forumsys.com" \
-e DRIVERLESS_AI_LDAP_BIND_DN="cn=read-only-admin,dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_PASSWORD="password" \
-e DRIVERLESS_AI_LDAP_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_USER_NAME_ATTRIBUTE="uid" \
-e DRIVERLESS_AI_LDAP_SEARCH_FILTER="(&(objectClass=inetOrgPerson)(uid={{id}}))" \
-e DRIVERLESS_AI_AUTH_TLS_LDAP_AUTHORIZATION_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_AUTH_TLS_LDAP_AUTHORIZATION_LOOKUP_FILTER="(&(objectClass=groupOfUniqueNames)(uniqueMember=uid={{uid}},dc=example,dc=com)(ou=chemists))" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
12.1.3 Enabling Client Certificate Authentication in the config.toml File for Native
Installs
Native installs include DEBs, RPMs, and TAR SH installs. The example below shows how to edit the config.toml file
to enable Client Certificate authentication and use LDAP_LOOKUP for the TLS user lookup method.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Open the config.toml file and edit the following authentication variables. The config.toml file is available in the
etc/dai folder after Driverless AI is installed.
# https settings
enable_https = true
# https settings
# Path to the SSL key file
#
ssl_key_file = "/etc/pki/dai-server.key"
# https settings
# Path to the SSL certificate file
#
ssl_crt_file = "/etc/pki/dai-server.crt"
# https settings
# Path to the Certification Authority certificate file. This certificate will be
# used to verify the client certificate when client authentication is turned on.
# If this is not set, clients are verified using default system certificates.
#
ssl_ca_file = "/etc/pki/ca.crt"
# https settings
# Sets the client verification mode.
# CERT_NONE: The client does not need to provide a certificate, and if it does, any
# verification errors are ignored.
# CERT_OPTIONAL: The client does not need to provide a certificate, and if it does,
# the certificate is verified against the configured CA chains.
# CERT_REQUIRED: The client needs to provide a certificate, and the certificate is
# verified.
# You'll need to set 'ssl_client_key_file' and 'ssl_client_crt_file'
# when this mode is selected so that Driverless AI is able to verify
# its own callback requests.
#
ssl_client_verify_mode = "CERT_REQUIRED"
# https settings
# Path to the private key that Driverless will use to authenticate itself when
# CERT_REQUIRED mode is set.
#
ssl_client_key_file = "/etc/pki/dai-self.key"
# https settings
# Path to the client certificate that Driverless will use to authenticate itself
# when CERT_REQUIRED mode is set.
#
ssl_client_crt_file = "/etc/pki/dai-self.crt"
# Subject field that is used as a source for a username or other values that provide further validation
auth_tls_subject_field = "CN"
# Path to the CRL file that will be used to verify client certificate.
auth_tls_crl_file = "/etc/pki/crl.pem"
# A string that describes what you are searching for. You can use Python
# substitution to have this constructed dynamically.
# (only {{DAI_USERNAME}} is supported)
ldap_search_filter = "(&(objectClass=inetOrgPerson)(uid={{id}}))"
This section describes how to enable Lightweight Directory Access Protocol in Driverless AI. The available parameters
can be specified as environment variables when starting the Driverless AI Docker image, or they can be set via the
config.toml file for native installs. Upon completion, all the users in the configured LDAP should be able to log in to
Driverless AI and run experiments, visualize datasets, interpret models, etc.
Note: Driverless AI does not support LDAP client auth. If you have LDAP client auth enabled, then the Driverless AI
LDAP connector will not work.
The following examples describe how to enable LDAP without SSL when running Driverless AI in the Docker image
or through native installs.
The following example shows how to configure LDAP without SSL when starting the Driverless AI Docker image.
Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="ldap" \
-e DRIVERLESS_AI_LDAP_USE_SSL="false" \
-e DRIVERLESS_AI_LDAP_SERVER="ldap.forumsys.com" \
-e DRIVERLESS_AI_LDAP_PORT="389" \
-e DRIVERLESS_AI_LDAP_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_DN="cn=read-only-admin,dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_PASSWORD=password \
-e DRIVERLESS_AI_LDAP_SEARCH_FILTER="(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))" \
-e DRIVERLESS_AI_LDAP_USER_NAME_ATTRIBUTE="uid" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
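Before starting Driverless AI, you may want to sanity-check the LDAP settings used above from the host. The sketch below assumes the OpenLDAP ldapsearch client is installed and uses a hypothetical uid "jdoe"; adjust the bind DN, password, search base, and filter to match your directory.
# Query the directory with the same bind DN, search base, and filter that
# Driverless AI will use (uid "jdoe" is a placeholder).
ldapsearch -x -H ldap://ldap.forumsys.com:389 \
    -D "cn=read-only-admin,dc=example,dc=com" -w password \
    -b "dc=example,dc=com" "(&(objectClass=person)(uid=jdoe))"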
The following example shows how to configure LDAP without SSL when starting Driverless AI from a native install.
Native installs include DEBs, RPMs, and TAR SH installs.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
# Disable SSL
ldap_use_ssl="false"
# Specify the location in the DIT where the search will start
ldap_search_base = "dc=example,dc=com"
These examples show how to enable LDAP authentication with SSL. The additional parameters can be specified as
environment variables when starting the Driverless AI Docker image, or they can be set via the config.toml file for
native installs. Upon completion, all the users in the configured LDAP should be able to log in to Driverless AI and
run experiments, visualize datasets, interpret models, etc.
Specify the following LDAP environment variables when starting the Driverless AI Docker image. This example
enables LDAP authentication and shows how to specify additional options that are used when recipe=1. Replace
TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="ldap" \
-e DRIVERLESS_AI_LDAP_SERVER="ldap.forumsys.com" \
-e DRIVERLESS_AI_LDAP_PORT="389" \
-e DRIVERLESS_AI_LDAP_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_SEARCH_FILTER="(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))" \
-e DRIVERLESS_AI_LDAP_USE_SSL="true" \
-e DRIVERLESS_AI_LDAP_TLS_FILE="/tmp/abc-def-root.cer" \
-e DRIVERLESS_AI_LDAP_BIND_DN="cn=read-only-admin,dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_PASSWORD="password" \
-e DRIVERLESS_AI_LDAP_USER_NAME_ATTRIBUTE="uid" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
Upon successful completion, all the users in the configured LDAP should be able to log in to Driverless AI and run
experiments, visualize datasets, interpret models, etc.
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
# Specify the location in the DIT where the search will start
ldap_search_base = "dc=example,dc=com"
3. Start (or restart) Driverless AI. Users can now launch Driverless AI using their LDAP credentials. If authenti-
cation is successful, the user can access Driverless AI and run experiments, visualize datasets, interpret models,
etc.
To enable authentication in Docker images, specify the authentication environment variable that you want to use. Each
variable must be prepended with DRIVERLESS_AI_. Replace TAG below with the image tag. The example below
starts Driverless AI with environment variables that enable the following:
• Local authentication when starting Driverless AI
• S3 and HDFS access (without authentication)
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="local" \
-e DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="<htpasswd_file_location>" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
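The local authentication method validates users against the htpasswd file referenced by DRIVERLESS_AI_LOCAL_HTPASSWD_FILE (or local_htpasswd_file in the config.toml file). The sketch below shows one common way to create such a file; it assumes the htpasswd utility (from apache2-utils or httpd-tools) is installed, and the path and usernames are placeholders.
# Create the htpasswd file with a first user (you will be prompted for a password).
htpasswd -c /path/to/htpasswd_file jsmith
# Add additional users without -c so the existing file is not recreated.
htpasswd /path/to/htpasswd_file another_user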
12.3.2 Enabling Local Auth in the config.toml File for Native Installs
Native installs include DEBs, RPMs, and TAR SH installs. The example below shows the configuration options in the
config.toml file to set when enabling the following:
• Local authentication when starting Driverless AI
• S3 and HDFS access (without authentication)
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Open the config.toml file and edit the authentication variables. The config.toml file is available in the etc/dai
folder after the RPM or DEB is installed.
# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file,s3,hdfs"
# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam : Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server, look
# for additional settings under LDAP settings
# local: Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "local"
3. Start (or restart) Driverless AI. Note that the command used to start Driverless AI varies depending on your
install type.
# Linux RPM or DEB with systemd
sudo systemctl start dai
# Linux TAR SH
./run-dai.sh
Driverless AI supports Mutual TLS authentication (mTLS) by setting a specific verification mode along with a certifi-
cate authority file, an SSL private key, and an SSL certificate file. The diagram below is a visual representation of the
mTLS authentication process.
The following describes user certificate behavior for mTLS authentication based on combinations of the configuration
options described above.
• ssl_client_verify_mode='CERT_NONE'
– User does not have a certificate: user certs are ignored.
– User has a correct and valid certificate: user certs are ignored.
– User has a revoked certificate: user revoked certs are ignored.
• ssl_client_verify_mode='CERT_OPTIONAL'
– User does not have a certificate: user certs are ignored.
– User has a correct and valid certificate: user certs are sent to Driverless AI but are not used for validating the certs.
– User has a revoked certificate: user revoked certs are not validated.
• ssl_client_verify_mode='CERT_REQUIRED'
– User does not have a certificate: not allowed.
– User has a correct and valid certificate: the user provides a valid certificate that is used by Driverless AI but does not authenticate the user.
– User has a revoked certificate: user revoke lists are not validated.
• ssl_client_verify_mode='CERT_REQUIRED' AND authentication_method='tls_authentication'
– User does not have a certificate: not allowed.
– User has a correct and valid certificate: the user provides a valid certificate. The certificate is used for connecting to the Driverless AI server as well as for authentication.
– User has a revoked certificate: user revoked certs are validated, and the revoked file is provided in AUTH_TLS_CRL_FILE.
To enable mTLS authentication in Docker images, specify the authentication environment variable that you want to
use. Each variable must be prepended with DRIVERLESS_AI_. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLE_HTTPS=true \
-e DRIVERLESS_AI_SSL_KEY_FILE=/etc/dai/private_key.pem \
-e DRIVERLESS_AI_SSL_CRT_FILE=/etc/dai/cert.pem \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD=tls_certificate \
-e DRIVERLESS_AI_SSL_CLIENT_VERIFY_MODE=CERT_REQUIRED \
-e DRIVERLESS_AI_SSL_CA_FILE=/etc/dai/rootCA.pem \
-e DRIVERLESS_AI_SSL_CLIENT_KEY_FILE=/etc/dai/client_config_key.key \
-e DRIVERLESS_AI_SSL_CLIENT_CRT_FILE=/etc/dai/client_config_cert.pem \
-v /user/1.8.4_auth/log:/log \
-v /home/anu/1.8.4_auth/tmp:/tmp \
-v /user/certificates/server_config_key.pem:/etc/dai/private_key.pem \
-v /user/certificates/server_config_cert.pem:/etc/dai/cert.pem \
-v /user/certificates/client_config_cert.pem:/etc/dai/client_config_cert.pem \
-v /user/certificates/client_config_key.key:/etc/dai/client_config_key.key \
-v /user/certificates/rootCA.pem:/etc/dai/rootCA.pem \
h2oai/dai-centos7-x86_64:TAG
12.4.4 Enabling mTLS Authentication in the config.toml File for Native Installs
Native installs include DEBs, RPMs, and TAR SH installs. The example below shows how to edit the config.toml file
to enable mTLS authentication.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Open the config.toml file and edit the following authentication variables. The config.toml file is available in the
etc/dai folder after Driverless AI is installed.
# Path to the Certification Authority certificate file. This certificate will be
# used to verify the client certificate when client authentication is turned on.
# If this is not set, clients are verified using default system certificates.
#
ssl_ca_file = "/etc/pki/ca.crt"
# Path to the private key that Driverless will use to authenticate itself when
# CERT_REQUIRED mode is set.
#
ssl_client_key_file = "/etc/pki/dai-self.key"
# Path to the client certificate that Driverless will use to authenticate itself
# when CERT_REQUIRED mode is set.
#
ssl_client_crt_file = "/etc/pki/dai-self.crt"
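If you need certificates for testing this setup, the sketch below shows one way to generate a self-signed CA, a server certificate, and a client certificate with openssl. All file names and subject names are placeholders; in production, use certificates issued by your organization's PKI.
# 1. Create a test root CA.
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
    -subj "/CN=dai-test-ca" -keyout ca.key -out ca.crt
# 2. Create the server key and a CA-signed server certificate.
openssl req -newkey rsa:4096 -nodes -subj "/CN=dai-server" \
    -keyout dai-server.key -out dai-server.csr
openssl x509 -req -in dai-server.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out dai-server.crt
# 3. Create a client key and certificate the same way; the subject CN is the
#    value that auth_tls_subject_field = "CN" reads.
openssl req -newkey rsa:4096 -nodes -subj "/CN=alice" \
    -keyout client.key -out client.csr
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out client.crt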
This section describes how to enable OpenID Connect authentication in Driverless AI.
Note: The Driverless AI Python and R clients are not compatible with the OpenID Connect authentication method.
If you plan to connect to Driverless AI using the R or Python clients, it is recommended that you enable OpenID
Connect for UI access only or use a different authentication method.
In order to begin the process of configuring Driverless AI for OpenID-based authentication, the end user must retrieve
OpenID Connect metadata about their authorization server by requesting information from the well-known endpoint.
This information is subsequently used to configure further interactions with the provider.
The well-known endpoint is typically configured as follows:
https://yourOpenIDProviderHostname/.well-known/openid-configuration
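For example, you can fetch and inspect this metadata from the command line. The hostname below is a placeholder, and piping the response through a JSON pretty-printer is optional.
# Retrieve the OpenID Connect provider metadata (replace the hostname).
curl -s https://yourOpenIDProviderHostname/.well-known/openid-configuration \
    | python3 -m json.tool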
Set the following options in the config.toml file for enabling OpenID-based authentication.
# The OpenID server URL. (Ex: https://oidp.ourdomain.com) Do not end with a "/"
auth_openid_provider_base_uri= "https://yourOpenIDProviderHostname"
# The uri to pull OpenID config data from. (You can extract most of required OpenID config from this URL.)
# Usually located at: /auth/realms/master/.well-known/openid-configuration
# Quote method from urllib.parse used to encode payload dict in Authentication Request
auth_openid_urlencode_quote_via="quote"
# These endpoints are made available by the well-known endpoint of the OpenID provider
# All endpoints should start with a "/"
auth_openid_auth_uri=""
auth_openid_token_uri=""
auth_openid_userinfo_uri=""
auth_openid_logout_uri=""
# In most cases, these values are usually 'code' and 'authorization_code' (as shown below)
# Supported values for response_type and grant_type are listed in the response of well-known endpoint
auth_openid_response_type="code"
auth_openid_grant_type="authorization_code"
# Scope values--supported values are available in the response from the well-known endpoint
# 'openid' is required
# Additional scopes may be necessary if the response to the userinfo request
# does not include enough information to use for authentication
# Separate additional scopes with a blank space.
# See https://openid.net/specs/openid-connect-basic-1_0.html#Scopes for more info
auth_openid_scope="openid"
# The OpenID client details that are available from the provider
# A new client for Driverless AI in your OpenID provider must be created if one does not already exist
auth_openid_client_id=""
auth_openid_client_secret=""
# UserInfo response key configs for all users who log in to Driverless AI
# The userinfo_auth_key and userinfo_auth_value are
# a key value combination in the userinfo response that remain static for everyone
# If this key value pair does not exist in the user_info response,
# then the Authentication is considered failed
auth_openid_userinfo_auth_key=""
auth_openid_userinfo_auth_value=""
# Key that specifies username in user_info json (we will use value of this key as username in Driverless AI)
auth_openid_userinfo_username_key=""
The examples that follow describe how to start Driverless AI in the Docker image and with native installs after OpenID
has been configured.
1. Edit the OpenID configuration options in your config.toml file as described in the Open ID Configuration Op-
tions section.
2. Mount the edited config.toml file into the Docker container. Replace TAG below with your Driverless AI tag.
nvidia-docker run \
--net=openid-network \
--name="dai-with-openid" \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v "`pwd`/DAI_DATA/data":/data \
-v "`pwd`/DAI_DATA/log":/log \
-v "`pwd`/DAI_DATA/license":/license \
-v "`pwd`/DAI_DATA/tmp":/tmp \
-v "`pwd`/DAI_DATA/config":/config \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
h2oai/dai-centos7-x86_64:TAG
The next step is to launch and log in to Driverless AI. Refer to Logging in to Driverless AI.
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Edit the OpenID configuration properties in the config.toml file as described in the Open ID Configuration
Options section.
3. Start (or restart) Driverless AI.
The next step is to launch and log in to Driverless AI. Refer to Logging in to Driverless AI.
Open a browser and launch Driverless AI. Notice that you will be prompted to log in with OpenID.
The following sections describe how to enable Pluggable Authentication Modules (PAM) in Driverless AI. You can
do this by specifying environment variables in the Docker image or by updating the config.toml file.
Note: This assumes that the user has an understanding of how to grant permissions in their own environment in
order for PAM to work. Specifically for Driverless AI, be sure that the owner of the Driverless AI processes has access
to /etc/shadow (without root); otherwise authentication will fail.
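A quick way to verify this is to check read access as the account that runs the Driverless AI processes. The user name dai below is an assumption; substitute the service account used in your environment.
# Confirm that the Driverless AI service account can read /etc/shadow.
sudo -u dai test -r /etc/shadow && echo "readable" || echo "not readable"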
Note: The following instructions are only applicable with a CentOS 7 host.
In this example, the host Linux system has PAM enabled for authentication and Docker running on that Linux system.
The goal is to enable PAM for Driverless AI authentication while the Linux system hosts the user information.
1. Verify that the username (“eric” in this case) is defined in the Linux system.
[root@Linux-Server]# cat /etc/shadow | grep eric
eric:$6$inOv3GsQuRanR1H4$kYgys3oc2dQ3u9it02WTvAYqiGiQgQ/yqOiOs.g4F9DM1UJGpruUVoGl5G6OD3MrX/3uy4gWflYJnbJofaAni/::0:99999:7:::
2. Start Docker on the Linux Server and enable PAM in Driverless AI. Replace TAG below with the image tag.
[root@Linux-Server]# docker run \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd \
-v /etc/shadow:/etc/shadow \
-v /etc/pam.d/:/etc/pam.d/ \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="pam" \
h2oai/dai-centos7-x86_64:TAG
3. Obtain the Driverless AI container ID. This ID is required for the next step and will be different every time
Driverless AI is started.
[root@Linux-Server]# docker ps
CONTAINER ID   IMAGE                    COMMAND      CREATED          STATUS          PORTS                                                                                     NAMES
8e333475ffd8   opsh2oai/h2oai-runtime   "./run.sh"   36 seconds ago   Up 35 seconds   192.168.0.1:9090->9090/tcp, 192.168.0.1:12345->12345/tcp, 192.168.0.1:12348->12348/tcp   clever_swirles
4. From the Linux Server, verify that the Docker Driverless AI instance can see the shadow file. The example
below references 8e333475ffd8, which is the container ID obtained in the previous step.
[root@Linux-Server]# docker exec 8e333475ffd8 cat /etc/shadow|grep eric
eric:$6$inOv3GsQuRanR1H4$kYgys3oc2dQ3u9it02WTvAYqiGiQgQ/yqOiOs.g4F9DM1UJGpruUVoGl5G6OD3MrX/3uy4gWflYJnbJofaAni/::0:99999:7:::
5. Open a Web browser and navigate to port 12345 on the Linux system that is running the Driverless AI Docker
Image. Log in with credentials known to the Linux system. The login information will now be validated using
PAM.
In this example, the host Linux system has PAM enabled for authentication. The goal is to enable PAM for Driverless
AI authentication while the Linux system hosts the user information.
This example shows how to edit the config.toml file to enable PAM. The config.toml file is available in the etc/dai folder
after the RPM or DEB is installed. Edit the authentication_method variable in this file to enable PAM authentication,
and then restart Driverless AI.
1. Verify that the username (“eric” in this case) is defined in the Linux system.
[root@Linux-Server]# cat /etc/shadow | grep eric
eric:$6$inOv3GsQuRanR1H4$kYgys3oc2dQ3u9it02WTvAYqiGiQgQ/yqOiOs.g4F9DM1UJGpruUVoGl5G6OD3MrX/3uy4gWflYJnbJofaAni/::0:99999:7:::
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
3. Edit the authentication_method variable in the config.toml file so that PAM is enabled.
# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam : Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server, look
# for additional settings under LDAP settings
# local: Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "pam"
4. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Linux RPM or DEB with systemd
[root@Linux-Server]# sudo systemctl start dai
# Linux TAR SH
[root@Linux-Server]# ./run-dai.sh
5. Open a Web browser and navigate to port 12345 on the Linux system that is running Driverless AI. Log in with
credentials known to the Linux system (as verified in the first step). The login information will now be validated
using PAM.
THIRTEEN
ENABLING NOTIFICATIONS
Driverless AI can be configured to trigger a user-defined script at the beginning and end of an experiment. This
functionality can be used to send notifications to services like Slack or to trigger a machine shutdown.
The config.toml file exposes the following variables:
• listeners_experiment_start: Registers an absolute location of a script that gets executed at the start
of an experiment.
• listeners_experiment_done: Registers an absolute location of a script that gets executed when an
experiment is finished successfully.
Driverless AI accepts any executable as a script. (For example, a script can be implemented in Bash or Python.) There
are only two requirements:
• The specified script can be executed (i.e., the file has the executable flag set).
• The script should be able to accept command-line parameters.
When Driverless AI executes a script, it passes the following parameters on the script's command line:
• Application ID: A unique identifier of a running Driverless AI instance.
• User ID: The identification of the user who is running the experiment.
• Experiment ID: A unique identifier of the experiment.
• Experiment Path: The location of the experiment results.
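For instance, a start-of-experiment script that posts a message to Slack might look like the sketch below. It assumes you have created a Slack incoming webhook; the webhook URL is a placeholder that you must replace.
#!/usr/bin/env bash
# Post a short message to a Slack incoming webhook when an experiment starts.
app_id="${1}"
user_id="${2}"
experiment_id="${3}"
experiment_path="${4}"
curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"Experiment ${experiment_id} started by ${user_id} (instance ${app_id}): ${experiment_path}\"}" \
    "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"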
13.2 Example
The following example demonstrates how to use notification scripts to shut down an EC2 machine that is running
Driverless AI after all launched experiments are finished. The example shows how to use a notification script in a
Docker container and with native installations. The idea is to maintain a simple counter (i.e., the number of files in a
directory) that tracks the number of running experiments. When the counter reaches 0, the specified action is performed.
In this example, we use the AWS command line utility to shut down the actual machine; however, the same functionality
can be achieved by executing sudo poweroff (if the user has password-less sudo configured) or poweroff (if the
poweroff binary has the setuid bit set along with the executable bit; for more information, see
https://unix.stackexchange.com/questions/85663/poweroff-or-reboot-as-normal-user).
• The on_start Script. This script increases the counter of running experiments.
#!/usr/bin/env bash
# Positional arguments passed by Driverless AI: application ID (1st) and experiment ID (3rd).
app_id="${1}"
experiment_id="${3}"
tmp_dir="${TMPDIR:-/tmp}/${app_id}"
exp_file="${tmp_dir}/${experiment_id}"
# Add this experiment to the set of running experiments (one file per experiment).
mkdir -p "${tmp_dir}"
touch "${exp_file}"
• The on_done Script. This script decreases the counter and executes machine shutdown when the counter
reaches 0-value.
#!/usr/bin/env bash
app_id="${1}"
experiment_id="${3}"
tmp_dir="${TMPDIR:-/tmp}/${app_id}"
exp_file="${tmp_dir}/${experiment_id}"
# Remove this experiment from the set of running experiments.
if [ -f "${exp_file}" ]; then
  rm -f "${exp_file}"
fi
# If no experiments remain, stop this EC2 instance (this sketch assumes the AWS CLI
# is configured with permission to stop the instance).
if [ -z "$(ls -A "${tmp_dir}" 2>/dev/null)" ]; then
  aws ec2 stop-instances --instance-ids \
    "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
fi
1. Copy the config.toml file from inside the Docker image to your local filesystem. (Change nvidia-docker
run to docker run for non-GPU environments.)
# In your Driverless AI folder (for example, dai_1.5.1),
# make config and scripts directories
mkdir config
mkdir scripts
2. Edit the Notification scripts section in the config.toml file and save your changes. Note that in this example,
the scripts are saved to a dai_VERSION/scripts folder.
# Notification scripts
# - the variable points to a location of script which is executed at given event in experiment lifecycle
# - the script should have executable flag enabled
# - use of absolute path is suggested
# The on experiment start notification script location
listeners_experiment_start = "dai_VERSION/scripts/on_start.sh"
# The on experiment finished notification script location
listeners_experiment_done = "dai_VERSION/scripts/on_done.sh"
3. Start Driverless AI with the DRIVERLESS_AI_CONFIG_FILE environment variable. Make sure this points
to the location of the edited config.toml file so that the software finds the configuration file. (Change
nvidia-docker run to docker run for non-GPU environments.)
nvidia-docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/scripts:/scripts \
h2oai/dai-centos7-x86_64:TAG
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Edit the Notification scripts section in the config.toml file to point to the new scripts. Save your changes when
you are done.
# Notification scripts
# - the variable points to a location of script which is executed at given event in experiment lifecycle
# - the script should have executable flag enabled
# - use of absolute path is suggested
# The on experiment start notification script location
listeners_experiment_start = "/opt/h2oai/dai/scripts/on_start.sh"
# The on experiment finished notification script location
listeners_experiment_done = "/opt/h2oai/dai/scripts/on_done.sh"
3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Deb or RPM with systemd (preferred for Deb and RPM):
# Start Driverless AI.
sudo systemctl start dai
# Tar.sh
# Start Driverless AI
./run-dai.sh
FOURTEEN
EXPORT ARTIFACTS
In some cases, you might find that you do not want your users to download artifacts directly to their machines.
Driverless AI provides several configuration options/environment variables that enable exporting of artifacts instead
of downloading.
The following example shows how to start the Driverless AI Docker image with artifact exporting enabled.
docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLE_ARTIFACTS_UPLOAD="true" \
-e DRIVERLESS_AI_ARTIFACTS_STORE="file_system" \
-e DRIVERLESS_AI_ARTIFACTS_FILE_SYSTEM_DIRECTORY="tmp" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
harbor.h2o.ai/h2oai/dai-centos7-x86_64:TAG
The following example shows how to start Driverless AI with artifact exporting enabled on native installs.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"
2. Edit the following configuration option in the config.toml file. Save your changes when you are done.
# Replace all the downloads on the experiment page to exports and allow users to push to the artifact store configured with artifacts_store
enable_artifacts_upload = true
# Artifacts store.
# file_system: stores artifacts on a file system directory denoted by artifacts_file_system_directory.
#
artifacts_store = "file_system"
# File system location where artifacts will be copied in case artifacts_store is set to file_system
artifacts_file_system_directory = "tmp"
3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Deb or RPM with systemd (preferred for Deb and RPM):
# Start Driverless AI.
sudo systemctl start dai
# Tar.sh
# Start Driverless AI
./run-dai.sh
When the export artifacts options are enabled/configured, the menu options on the Completed Experiment page will
change. Specifically, all “Download” options (with the exception of Autoreport) will change to “Export.”
1. Click on an artifact to begin exporting. For example, click on Export Summary and Logs.
2. Specify a file name or use the default file name. This denotes the new name to be given to the exported artifact.
By default, this name matches the selected export artifact name.
3. Now click the Summary and Logs: Export to Data Store button. (Note that this button name changes de-
pending on the artifact that you select.) This begins the export action. Upon completion, the exported artifact
will display in the list of artifacts. The directory structure is: <path_to_export_to>/<user>/<experiment_id>/
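For example, with the file_system store configured earlier (artifacts_file_system_directory = "tmp"), an export by a hypothetical user "jsmith" for a hypothetical experiment ID would land under a path like the following; the file name shown is also a placeholder.
# List the exported artifact for one experiment (user, experiment ID, and file name are examples).
ls tmp/jsmith/f1a2b3c4/
# h2oai_experiment_summary_f1a2b3c4.zip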
FIFTEEN
LAUNCHING DRIVERLESS AI
Driverless AI is tested on Chrome and Firefox but is supported on all major browsers. For the best user experience,
we recommend using Chrome.
1. After Driverless AI is installed and started, open a browser and navigate to <server>:12345.
2. The first time you log in to Driverless AI, you will be prompted to read and accept the Evaluation Agreement.
You must accept the terms before continuing. Review the agreement, then click I agree to these terms to
continue.
3. Log in by entering unique credentials. For example:
Username: h2oai Password: h2oai
Note that these credentials do not restrict access to Driverless AI; they are used to tie experiments to users.
If you log in with different credentials, for example, then you will not see any previously run experiments.
4. As with accepting the Evaluation Agreement, the first time you log in, you will be prompted to enter your
License Key. Click the Enter License button, then paste the License Key into the License Key entry field.
Click Save to continue. This license key will be saved in the host machine’s /license folder.
Note: Contact [email protected] for information on how to purchase a Driverless AI license.
Upon successful completion, you will be ready to add datasets and run experiments.
15.1 Resources
The Resources dropdown menu provides you with links to view System Information and the Driverless AI User Guide.
From this dropdown menu, you can also download the following:
• Python Client (See The Python Client)
• R Client (See The R Client)
• MOJO2 Java Runtime (See Driverless AI MOJO Scoring Pipeline - Java Runtime)
• MOJO2 Python Runtime (See Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrap-
pers)
• MOJO2 R Runtime (See Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrappers)
15.2 Messages
A Messages menu option is available in the top menu when you launch Driverless AI. Click this to view news and
upcoming events regarding Driverless AI.
SIXTEEN
The Datasets Overview page is the Driverless AI Home page. This shows all datasets that have been imported. Note
that the first time you log in, this list will be empty.
• gz
• jay (See note below)
• parquet (See notes below)
• pkl
• tgz
• tsv
• txt
• xls
• xlsx
• xz
• zip
Notes:
• CSV in UTF-16 encoding is only supported when implemented with a byte order mark (BOM). If a BOM is not
present, the dataset is read as UTF-8.
• For Parquet file formats, if you select to import multiple Parquet files, those files will be imported as multi-
ple datasets. If you select a folder of Parquet files, the folder will be imported as a single dataset. Tools like
Spark/Hive export data as multiple Parquet files that are stored in a directory with a user-defined name. For ex-
ample, if you export with Spark dataFrame.write.parquet("/data/big_parquet_dataset"), Spark creates a folder
/data/big_parquet_dataset, which will contain multiple Parquet files (depending on the number of partitions in the
input dataset) + metadata.
• You may receive a “Failed to ingest binary file with Parquet: lists with structs are not supported” error when
ingesting a Parquet file that has a struct as an element of an array. This is because PyArrow cannot handle a
struct that’s an element of an array. In Sparkling Water, we provide a workaround to flatten the Parquet file.
Refer to our Sparkling Water solution for more information.
• You can create new datasets from Python script files (custom recipes) by selecting Data Recipe URL or Upload
Data Recipe from the Add Dataset (or Drag & Drop) dropdown menu. If you select the Data Recipe URL
option, the URL must point to either a raw file, a GitHub repository or tree, or a local file. In addition, you can
create a new dataset by modifying an existing dataset with a custom recipe. Refer to modify_by_recipe for more
information. Datasets created or added from recipes will be saved as .jay files.
• If Driverless AI was started with data connectors enabled for Azure Blob Store, BlueData Datatap, Google
Big Query, Google Cloud Storage, KDB+, Minio, Snowflake, or JDBC, then these options will appear in the
Add Dataset (or Drag & Drop) dropdown menu. Refer to the Enabling Data Connectors section for more
information.
• When specifying to add a dataset using Data Recipe URL, the URL must point to either a raw file, a GitHub
repository or tree, or a local file. When adding or uploading datasets via recipes, the dataset will be saved as a
.jay file.
• Datasets must be in delimited text format.
• Driverless AI can detect the following separators: ,|;\t
• When importing a folder, the entire folder and all of its contents are read into Driverless AI as a single file.
• When importing a folder, all of the files in the folder must have the same columns.
• If you try to import a folder via a data connector on Windows, the import will fail if the folder contains files that
do not have file extensions (the resulting error is usually related to the above note).
Upon completion, the datasets will appear in the Datasets Overview page. Click on a dataset to open a submenu. From
this menu, you can specify to Rename, view Details of, Visualize, Split, Download, or Delete a dataset. Note: You
cannot delete a dataset that was used in an active experiment. You have to delete the experiment first.
In Driverless AI, you can rename datasets from the Datasets Overview page.
To rename a dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to
rename, and then select Rename from the submenu that appears.
Note: If the name of a dataset is changed, every instance of the dataset in Driverless AI will be changed to reflect the
new name.
To view a summary of a dataset or to preview the dataset, click on the dataset or select the [Click for Actions] button
beside the dataset that you want to view, and then click Details from the submenu that appears. This opens the Dataset
Details page.
The Dataset Details page provides a summary of the dataset. This summary lists each of the dataset’s columns and
displays accompanying rows for logical type, format, storage type (see note below), count, number of missing values,
mean, minimum, maximum, standard deviation, frequency, and number of unique values.
Note: Driverless AI recognizes the following storage types: integer, string, real, boolean, and time.
Hover over the top of a column to view a summary of the first 20 rows of that column.
To view information for a specific column, type the column name in the field above the graph.
Driverless AI also allows you to change a column type. If a column’s data type or distribution does not match the
manner in which you want the column to be handled during an experiment, changing the Logical Type can help to
make the column fit better. For example, an integer zip code can be changed into a categorical so that it is only used
with categorical-related feature engineering. For Date and Datetime columns, use the Format option. To change the
Logical Type or Format of a column, click on the group of square icons located to the right of the words Auto-detect.
(The squares light up when you hover over them with your cursor.) Then select the new column type for that column.
To switch the view and preview the dataset, click the Dataset Rows button in the top right portion of the UI. Then
click the Dataset Overview button to return to the original view.
The option to create a new dataset by modifying an existing dataset with custom recipes is also available from this
page. Scoring pipelines can be created on the new dataset by building an experiment. This feature is useful when you
want to make changes to the training data that you would not need to make on the new data you are predicting on. For
example, you can change the target column from regression to classification, add a weight column to mark specific
training rows as being more important, or remove outliers that you do not want to model on. Refer to the Adding a
Data Recipe section for more information.
Click the Modify by Recipe button in the top right portion of the UI and select from the following options:
• Data Recipe URL: Load a custom recipe from a URL to use to modify the dataset. The URL must point to
either a raw file, a GitHub repository or tree, or a local file. Sample custom data recipes are available in the
https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.2/data repository.
• Upload Data Recipe: If you have a custom recipe available on your local system, click this button to upload
that recipe.
• Live Code: Manually enter custom recipe code to use to modify the dataset. Click the Get Preview button to
preview the code’s effect on the dataset, then click Save to create a new dataset.
Notes:
• These options are enabled by default. You can disable them by removing recipe_file and recipe_url
from the enabled_file_systems configuration option.
• Modifying a dataset with a recipe will not overwrite the original dataset. The dataset that is selected for modi-
fication will remain in the list of available datasets in its original form, and the modified dataset will appear in
this list as a new dataset.
• Changes made to the original dataset through this feature will not be applied to new data that is scored.
In Driverless AI, you can download datasets from the Datasets Overview page.
To download a dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to
download, and then select Download from the submenu that appears.
Note: The option to download datasets will not be available if the enable_dataset_downloading option is set
to false when starting Driverless AI. This option can be specified in the config.toml file.
In Driverless AI, you can split a training dataset into test and validation datasets.
Perform the following steps to split a dataset.
1. To split a dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to
split, and then select Split from the submenu that appears.
2. The Dataset Splitter form displays. Specify an Output Name 1 and an Output Name 2 for the first and second
part of the split. (For example, you can name one test and one valid.)
3. Optionally specify a Target column (for stratified sampling), a Fold column (to keep rows belonging to the same
group together), a Time column, and/or a Random Seed (defaults to 1234).
4. Use the slider to select a split ratio, or enter a value in the Train/Valid Split Ratio field.
5. Click Save when you are done.
Upon completion, the split datasets will be available on the Datasets page.
• Click the Autoviz top menu link to go to the Visualizations list page, click the New Visualization button, then
select or import the dataset that you want to visualize.
The Visualization page shows all available graphs for the selected dataset. Note that the graphs on the Visualization
page can vary based on the information in your dataset. You can also view and download logs that were generated
during the visualization.
upper quartiles, and the ends of the “whiskers” denote that range of values. Sometimes outliers occur, in which
case the adjacent whisker is shortened to the next lower or upper value. For variables (features) having only a
few values, the boxes can be compressed, sometimes into a single horizontal line at the median.
• Biplots: A Biplot is an enhanced scatterplot that uses both points and vectors to represent structure simultane-
ously for rows and columns of a data matrix. Rows are represented as points (scores), and columns are repre-
sented as vectors (loadings). The plot is computed from the first two principal components of the correlation
matrix of the variables (features). You should look for unusual (non-elliptical) shapes in the points that might
reveal outliers or non-normal distributions. And you should look for purple vectors that are well-separated.
Overlapping vectors can indicate a high degree of correlation between variables.
• Outliers: Variables with anomalous or outlying values are displayed as red points in a dot plot. Dot plots are
constructed using an algorithm in Wilkinson, L. (1999). “Dot plots.” The American Statistician, 53, 276–281.
Not all anomalous points are outliers. Sometimes the algorithm will flag points that lie in an empty region (i.e.,
they are not near any other points). You should inspect outliers to see if they are miscodings or if they are due
to some other mistake. Outliers should ordinarily be eliminated from models only when there is a reasonable
explanation for their occurrence.
• Correlation Graph: The correlation network graph is constructed from all pairwise squared correlations be-
tween variables (features). For continuous-continuous variable pairs, the statistic used is the squared Pearson
correlation. For continuous-categorical variable pairs, the statistic is based on the squared intraclass correlation
(ICC). This statistic is computed from the mean squares from a one-way analysis of variance (ANOVA). The
formula is (MSbetween - MSwithin)/(MSbetween + (k - 1)MSwithin), where k is the number of categories in
the categorical variable. For categorical-categorical pairs, the statistic is computed from Cramer’s V squared.
If the first variable has k1 categories and the second variable has k2 categories, then a k1 x k2 table is created
from the joint frequencies of values. From this table, we compute a chi-square statistic. Cramer’s V squared
statistic is then (chi-square / n) / min(k1,k2), where n is the total of the joint frequencies in the table. Variables
with large values of these respective statistics appear near each other in the network diagram. The color scale
used for the connecting edges runs from low (blue) to high (red). Variables connected by short red edges tend
to be highly correlated.
• Parallel Coordinates Plot: A Parallel Coordinates Plot is a graph used for comparing multiple variables. Each
variable has its own vertical axis in the plot. Each profile connects the values on the axes for a single observation.
If the data contain clusters, these profiles will be colored by their cluster number.
• Radar Plot: A Radar Plot is a two-dimensional graph that is used for comparing multiple variables. Each
variable has its own axis that starts from the center of the graph. The data are standardized on each variable
between 0 and 1 so that values can be compared across variables. Each profile, which usually appears in the
form of a star, connects the values on the axes for a single observation. Multivariate outliers are represented
by red profiles. The Radar Plot is the polar version of the popular Parallel Coordinates plot. The polar layout
enables us to represent more variables in a single plot.
• Data Heatmap: The heatmap graphic is constructed from the transposed data matrix. Rows of the heatmap
represent variables, and columns represent cases (instances). The data are standardized before display so that
small values are yellow and large values are red. The rows and columns are permuted via a singular value
decomposition (SVD) of the data matrix so that similar rows and similar columns are near each other.
• Missing Values Heatmap: The missing values heatmap graphic is constructed from the transposed data matrix.
Rows of the heatmap represent variables and columns represent cases (instances). The data are coded into the
values 0 (missing) and 1 (nonmissing). Missing values are colored red and nonmissing values are left blank
(white). The rows and columns are permuted via a singular value decomposition (SVD) of the data matrix so
that similar rows and similar columns are near each other.
• Gaps Histogram: The gaps index is computed using an algorithm of Wainer and Schacht based on work by
John Tukey. (Wainer, H. and Schacht, Psychometrika, 43, 2, 203-12.) Histograms with gaps can indicate a
mixture of two or more distributions based on possible subgroups not necessarily characterized in the dataset.
The images on this page are thumbnails. You can click on any of the graphs to view and download a full-scale image.
You can also view an explanation for each graph by clicking the Help button in the lower-left corner of each expanded
graph.
SEVENTEEN
EXPERIMENTS
This section describes how to run an experiment using the Driverless AI UI. Before you begin, it is best that you
understand the available options that you can specify. Note that only a dataset and a target column are required to be
specified, but Driverless AI provides a variety of experiment and expert settings that you can use to build your models.
After you have a comfortable working knowledge of these options, proceed to the New Experiments section.
This section describes the settings that are available when running an experiment.
Optional: Specify a display name for the new experiment. There are no character or length restrictions for naming. If
this field is left blank, Driverless AI will automatically generate a name for the experiment.
Dropped columns are columns that you do not want to be used as predictors in the experiment. Note that Driver-
less AI will automatically drop ID columns and columns that contain a significant number of unique values (above
max_relative_cardinality in the config.toml file or Max. allowed fraction of uniques for integer and
categorical cols in Expert settings).
The validation dataset is used for tuning the modeling pipeline. If provided, the entire training data will be used for
training, and validation of the modeling pipeline is performed with only this validation dataset. When you do not
include a validation dataset, Driverless AI will do K-fold cross validation for I.I.D. experiments and multiple rolling
window validation splits for time series experiments. For this reason it is not generally recommended to include a
validation dataset as you are then validating on only a single dataset. Please note that time series experiments cannot
be used with a validation dataset: including a validation dataset will disable the ability to select a time column and
vice versa.
This dataset must have the same number of columns (and column types) as the training dataset. Also note that if
provided, the validation set is not sampled down, so it can lead to large memory usage, even if accuracy=1 (which
reduces the train size).
The test dataset is used for testing the modeling pipeline and creating test predictions. The test set is never used during
training of the modeling pipeline. (Results are the same whether a test set is provided or not.) If a test dataset is
provided, then test set predictions will be available at the end of the experiment.
Optional: Column that indicates the observation weight (a.k.a. sample or row weight), if applicable. This column must
be numeric with values >= 0. Rows with higher weights have higher importance. The weight affects model training
through a weighted loss function and affects model scoring through weighted metrics. The weight column is not used
when making test set predictions, but a weight column (if specified) is used when computing the test score.
Optional: Rows with the same value in the fold column represent groups that should be kept together in the training,
validation, or cross-validation datasets.
By default, Driverless AI assumes that the dataset is i.i.d. (identically and independently distributed) and creates
validation datasets randomly for regression or with stratification of the target variable for classification.
The fold column is used to create the training and validation datasets so that all rows with the same Fold value will
be in the same dataset. This can prevent data leakage and improve generalization. For example, when viewing data
for a pneumonia dataset, person_id would be a good Fold Column. This is because the data may include multiple
diagnostic snapshots per person, and we want to ensure that the same person’s characteristics show up only in either
the training or validation frames, but not in both to avoid data leakage.
This column must be an integer or categorical variable and cannot be specified if a validation set is used or if a Time
Column is specified.
Optional: Specify a column that provides a time order (time stamps for observations), if applicable. This can improve
model performance and model validation accuracy for problems where the target values are auto-correlated with
respect to the ordering (per time-series group).
The values in this column must be a datetime format understood by pandas.to_datetime(), like “2017-11-29 00:30:35”
or “2017/11/29”, or integer values. If [AUTO] is selected, all string columns are tested for potential date/datetime
content and considered as potential time columns. If a time column is found, feature engineering and model validation
will respect the causality of time. If [OFF] is selected, no time order is used for modeling and data may be shuffled
randomly (any potential temporal causality will be ignored).
When your data has a date column, then in most cases, specifying [AUTO] for the Time Column will be sufficient.
However, if you select a specific date column, then Driverless AI will provide you with an additional side menu. From
this side menu, you can specify Time Group columns or specify [Auto] to let Driverless AI determine the best time
group columns. You can also specify the columns that will be unavailable at prediction time (see Notes below), the
Forecast Horizon in weeks, and the Gap between the train and test periods.
Refer to Time Series in Driverless AI for more information about time series experiments in Driverless AI and to see
a time series example.
Notes:
• Engineered features will be used for MLI when a time series experiment is built. This is because munged time
series features are more useful features for MLI compared to raw time series features.
• A Time Column cannot be specified if a Fold Column is specified. This is because both fold and time columns
are only used to split training datasets into training/validation, so once you split by time, you cannot also split
with the fold column. If a Time Column is specified, then the time group columns play the role of the fold
column for time series.
• A Time Column cannot be specified if a validation dataset is used.
• A column that is specified as being unavailable at prediction time will only have lag-related features created for
(or with) it.
The experiment preview describes what the Accuracy, Time, and Interpretability settings mean for your specific ex-
periment. This preview will automatically update if any of the knob values change. The following is more detailed
information describing how these values affect an experiment.
Accuracy
As accuracy increases (as indicated by the tournament_* toml settings), Driverless AI gradually adjusts the method
for performing the evolution and ensemble. At low accuracy, Driverless AI varies features and models, but they
all compete evenly against each other. At higher accuracy, each independent main model will evolve independently
and be part of the final ensemble as an ensemble over different main models. At higher accuracies, Driverless AI
will evolve+ensemble feature types like Target Encoding on and off that evolve independently. Finally, at highest
accuracies, Driverless AI performs both model and feature tracking and ensembles all those variations.
Changing this value affects the feature evolution and final pipeline.
Note: A check for a shift in the distribution between train and test is done for accuracy >= 5.
Feature evolution: This represents the algorithms used to create the experiment. If a test set is provided without
a validation set, then Driverless AI will perform a 1/3 validation split during the experiment. If a validation set is
provided, then the experiment will perform external validation.
Final Pipeline: This represents the leveling of ensembling done for the final model (if no time column is selected)
along with the cross-validation values.
Time
This specifies the relative time for completing the experiment (i.e., higher settings take longer). Early stopping will
take place if the experiment doesn’t improve the score for the specified amount of iterations.
Interpretability
Specify the relative interpretability for this experiment. Higher values favor more interpretable models. Changing
the interpretability level affects the feature pre-pruning strategy, monotonicity constraints, and the feature engineering
search space.
Feature pre-pruning strategy: This represents the feature selection strategy (to prune-away features that do not
clearly give improvement to model score). Strategy = “FS” if interpretability >= 6; otherwise strategy is None.
Monotonicity constraints: If Monotonicity Constraints are enabled, the model will satisfy knowledge about mono-
tonicity in the data and monotone relationships between the predictors and the target variable. For example, in house
price prediction, the house price should increase with lot size and number of rooms, and should decrease with crime
rate in the area. If enabled, Driverless AI will automatically determine if monotonicity is present and enforce it in its
modeling pipelines. Depending on the correlation, Driverless AI will assign positive, negative, or no monotonicity
constraints. Monotonicity is enforced if the absolute correlation is greater than 0.1. All other predictors will not have
monotonicity enforced.
Note: Monotonicity constraints are used in Decision Trees, XGBoost Dart, XGBoost GBM, LightGBM,
and LightGBM Random Forest models.
Feature engineering search space: This represents the transformers that are used in the experiment. Note that when
mixing GBM and GLM in parameter tuning, the search space is split 50%/50% between GBM and GLM.
• Classification or Regression button. Driverless AI automatically determines the problem type based on the
response column. Though not recommended, you can override this setting by clicking this button.
• Reproducible: This button allows you to build an experiment with a random seed and get reproducible results.
If this is disabled (default), then results will vary between runs.
• Enable GPUs: Specify whether to enable GPUs. (Note that this option is ignored on CPU-only systems.)
This section describes the Expert Settings that are available when starting an experiment. Driverless AI provides a
variety of options in the Expert Settings that allow you to customize your experiment. Use the search bar to refine the
list of settings or locate a specific setting.
The default values for these options are derived from the configuration options in the config.toml file. Refer to the
Sample Config.toml File section for more information about each of these options.
Note about Feature Brain Level: By default, the feature brain pulls in any better model regardless of the features
even if the new model disabled those features. For full control over features pulled in via changes in these Expert
Settings, users should set the Feature Brain Level option to 0.
Driverless AI supports the use of custom recipes (optional). If you have a custom recipe available on your local system,
click this button to upload that recipe. If you do not have a custom recipe, you can select from a number of recipes
available in the https://github.com/h2oai/driverlessai-recipes repository. Clone this repository on your local machine
and upload the desired recipe. Refer to the Custom Recipes appendix for examples.
If you have a custom recipe available on an external system, specify the URL for that recipe here. Note that this
must point to the raw recipe file (for example https://raw.githubusercontent.com/h2oai/driverlessai-recipes/master/
transformers/text_sentiment_transformer.py). Refer to the Custom Recipes appendix for examples.
Specify the maximum runtime in minutes for an experiment. This is equivalent to pushing the Finish button once half
of the specified time value has elapsed. Note that the overall enforced runtime is only an approximation.
This value defaults to 1440, which is the equivalent of a 24 hour approximate overall runtime. The Finish button will
be automatically selected once 12 hours have elapsed, and Driverless AI will subsequently attempt to complete the
overall experiment in the remaining 12 hours. Set this value to 0 to disable this setting.
Specify the maximum runtime in minutes for an experiment before triggering the abort button. This option preserves
experiment artifacts that have been generated for the summary and log zip files while continuing to generate additional
artifacts. This value defaults to 10080.
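As an illustration, both runtime limits can also be set from the Python client through config overrides (the override mechanism is shown later in this section). This is a minimal sketch; the key names max_runtime_minutes and max_runtime_minutes_until_abort are assumed here to be the config.toml options behind these two settings.
model = h2o.start_experiment_sync(
    dataset_key=train.key,
    target_col='target',
    is_classification=True,
    accuracy=7, time=5, interpretability=5,
    config_overrides="""
    max_runtime_minutes=120
    max_runtime_minutes_until_abort=240
    """
)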
Specify the Pipeline Building recipe type (overrides GUI settings). Select from the following:
• AUTO: Specifies that all models and features are automatically determined by experiment settings, config.toml
settings, and the feature engineering effort. (Default)
• COMPLIANT: Similar to AUTO except for the following:
– Interpretability is set to 10.
– Only uses GLM.
– Fixed ensemble level is set to 0.
– Feature brain level is set to 0.
– Max feature interaction depth is set to 1.
– The target transformer is set to ‘identity’ for regression.
– Does not use distribution shift.
• KAGGLE: Similar to AUTO except for the following:
– Any external validation set is concatenated with the train set, with the target marked as missing.
– The test set is concatenated with the train set, with the target marked as missing.
– Transformers that do not use the target are allowed to fit_transform across the entirety of the train,
validation, and test sets.
– Several config.toml expert options have their limits opened up.
Specify whether to automatically build a Python Scoring Pipeline for the experiment. Select ON or AUTO (default) to
make the Python Scoring Pipeline immediately available for download when the experiment is finished. Select OFF
to disable the automatic creation of the Python Scoring Pipeline.
Specify whether to automatically build a MOJO (Java) Scoring Pipeline for the experiment. Select ON to make the
MOJO Scoring Pipeline immediately available for download when the experiment is finished. With this option, any
capabilities that prevent the creation of the pipeline are dropped. Select OFF to disable the automatic creation of the
MOJO Scoring Pipeline. Select AUTO (default) to attempt to create the MOJO Scoring Pipeline without dropping
any capabilities.
Specify whether to measure the MOJO scoring latency at the time of MOJO creation. This is set to AUTO by default.
In this case, MOJO scoring latency will be measured if the pipeline.mojo file size is less than 100 MB.
Specify the amount of time in seconds to wait for MOJO creation at the end of an experiment. If the MOJO creation
process times out, a MOJO can still be made from the GUI or the R and Python clients (the timeout constraint is not
applied to these). This value defaults to 1800 (30 minutes).
Specify the number of parallel workers to use during MOJO creation. Higher values can speed up MOJO creation but
use more memory. Set this value to -1 (default) to use all physical cores.
Specify whether to create a visualization of the scoring pipeline at the end of an experiment. This is set to AUTO by
default. Note that the Visualize Scoring Pipeline feature is experimental and is not available for deprecated models.
Visualizations are available for all newly created experiments.
Make Autoreport
Specify whether to create the experiment Autoreport after the experiment is finished. This is enabled by default.
Specify the minimum number of rows that a dataset must contain in order to run an experiment. This value defaults to
100.
Reproducibility Level
Specify one of the following levels of reproducibility (note that this setting is only active while reproducible mode is
enabled):
• 1 = Same experiment results for same O/S, same CPU(s), and same GPU(s) (Default)
• 2 = Same experiment results for same O/S, same CPU architecture, and same GPU architecture
• 3 = Same experiment results for same O/S, same CPU architecture (excludes GPUs)
• 4 = Same experiment results for same O/S (best approximation)
This value defaults to 1.
Random Seed
Specify a random seed for the experiment. When a seed is defined and the reproducible button is enabled (not by
default), the algorithm will behave deterministically.
(Note: Applicable for multiclass problems only.) Specify whether to enable full cross-validation (multiple folds)
during feature evolution as opposed to a single holdout split. This is enabled by default.
Specify the maximum number of classes to allow for a classification problem. A higher number of classes may make
certain processes more time-consuming. Memory requirements also increase with a higher number of classes. This
value defaults to 200.
Specify whether to use H2O.ai brain, which enables local caching and smart re-use (checkpointing) of prior experi-
ments to generate useful features and models for new experiments. It can also be used to control checkpointing for
experiments that have been paused or interrupted.
When enabled, Driverless AI will use the H2O.ai brain cache if all of the following conditions are met:
• the cache file has any matching column names and types for a similar experiment type
• the cache file has classes that match exactly
• the cache file has class labels that match exactly
• the cache file has basic time series choices that match
• the interpretability of the cache is equal or lower
• the main model (booster) is allowed by the new experiment
The following feature brain levels are available:
• -1: Don’t use any brain cache
• 0: Don’t use any brain cache but still write to cache. Use case: Want to save the model for later use, but we want
the current model to be built without any brain models.
• 1: Smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model.
The match may not be precise, so use with caution.
• 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time
series options identically. Use case: Driverless AI scans through the H2O.ai brain cache for the best models to
restart from.
• 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient
size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete
first iteration.
• 4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient
size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete
first iteration.
• 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations to get the best scored
individuals. Note that this can be slower due to brain cache scanning if the cache is large.
When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the
default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file.
This value defaults to 2.
When performing a restart or re-fit of type feature_brain_level with a resumed ID, specify which iteration to start from
instead of only the last best. A value of -1 uses the last best iteration. To restart or refit from a specific iteration, follow
these steps:
• 1: Run one experiment with feature_brain_iterations_save_every_iteration=1 (or some other number).
• 2: Identify which iteration brain dump you want to restart/refit from.
• 3: Restart/refit from the original experiment, setting which_iteration_brain to that number here in the expert settings.
Note: If restarting from a tuning iteration, this will pull in the entire scored tuning population and use that for feature
evolution. This value defaults to -1.
Specify whether to use the same best individual when performing a refit. Disabling this setting allows the order of best
individuals to be rearranged, leading to a better final result. Enabling this setting allows you to view the exact same
model or feature with only one new feature added. This is disabled by default.
Feature Brain Adds Features with New Columns Even During Retraining of Final Model
Specify whether to add additional features from new columns to the pipeline, even when performing a retrain of the
final model. Use this option if you want to keep the same pipeline regardless of new columns from a new dataset. New
data may lead to new dropped features due to shift or leak detection. Disable this to avoid adding any columns as new
features so that the pipeline is perfectly preserved when changing data. This is enabled by default.
Specify the minimum number of Driverless AI iterations for an experiment. This can be used during restarting, when
you want to continue for longer despite a score not improving. This value defaults to 0.
Specify whether to automatically select target transformation for regression problems. Selecting identity disables any
transformation. This is set to AUTO by default.
Select a method to decide which models are best at each iteration. This is set to AUTO by default. Choose from the
following:
• auto: Choose based on scoring metric
• fullstack: Choose from optimal model and feature types
• feature: Individuals with similar feature types compete
• model: Individuals with same model type compete
• uniform: All individuals in population compete
Specify a fixed number of folds (if >= 2) for cross-validation. Note that the actual number of allowed folds can be
less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This
value defaults to -1 (auto).
Specify a fixed number of folds (if >= 2) for cross-validation. Note that the actual number of allowed folds can be
less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This
value defaults to -1 (auto).
Max Number of Rows Times Number of Columns for Feature Evolution Data Splits
Specify the maximum number of rows allowed for feature evolution data splits (not for the final pipeline). This value
defaults to 100,000,000.
Max Number of Rows Times Number of Columns for Reducing Training Dataset
Specify the upper limit on the number of rows times the number of columns for training the final pipeline. This value
defaults to 500,000,000.
Specify the maximum size of the validation data relative to the training data. Smaller values can make the final pipeline
model training process quicker. Note that final model predictions and scores will always be provided on the full dataset
provided. This value defaults to 2.0.
Perform Stratified Sampling for Binary Classification If the Target Is More Imbalanced Than This
For binary classification experiments, specify a threshold ratio of minority to majority class for the target column
beyond which stratified sampling is performed. If the threshold is not exceeded, random sampling is performed. This
value defaults to 0.1. You can choose to always perform random sampling by setting this value to 0, or to always
perform stratified sampling by setting this value to 1.
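The following minimal sketch (illustrative only, not Driverless AI’s internal logic) shows the ratio check this setting describes, assuming a pandas Series holding a hypothetical binary target:
import pandas as pd

y = pd.Series([0] * 950 + [1] * 50)                        # hypothetical binary target
ratio = y.value_counts().min() / y.value_counts().max()   # minority/majority ~= 0.0526
threshold = 0.1                                            # default for this setting
sampling = "stratified" if ratio < threshold else "random"
print(round(ratio, 4), sampling)                           # 0.0526 stratified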
Specify any additional configuration overrides from the config.toml file that you want to include in the experiment.
(Refer to the Sample Config.toml File section to view options that can be overridden during an experiment.) Setting
this will override all other settings. Separate multiple config overrides with \n. For example, the following enables
Poisson distribution for LightGBM and disables Target Transformer Tuning. Note that in this example double quotes
are escaped (\" \").
params_lightgbm=\"{'objective':'poisson'}\" \n target_transformer=identity
Or you can specify config overrides similar to the following without having to escape double quotes:
""enable_glm="off" \n enable_xgboost_gbm="off" \n enable_lightgbm="off" \n enable_tensorflow="on"""
""max_cores=10 \n data_precision="float32" \n max_rows_feature_evolution=50000000000 \n ensemble_accuracy_switch=11 \n feature_engineering_effort=1 \n
˓→target_transformer="identity" \n tournament_feature_style_accuracy_switch=5 \n params_tensorflow="{'layers': [100, 100, 100, 100, 100, 100]}"""
When running the Python client, config overrides would be set as follows:
model = h2o.start_experiment_sync(
    dataset_key=train.key,
    target_col='target',
    is_classification=True,
    accuracy=7,
    time=5,
    interpretability=1,
    config_overrides="""
                     feature_brain_level=0
                     enable_lightgbm="off"
                     enable_xgboost_gbm="off"
                     enable_ftrl="off"
                     """
)
This option allows you to specify whether to build XGBoost models as part of the experiment (for both the feature
engineering part and the final model). XGBoost is a type of gradient boosting method that has been widely successful
in recent years due to its good regularization techniques and high accuracy. This is set to AUTO by default. In this
case, Driverless AI will use XGBoost unless the number of rows * columns is greater than a threshold. This threshold
is a config setting that is 100M by default for CPU and 30M by default for GPU.
This option specifies whether to use XGBoost’s Dart method when building models for experiment (for both the feature
engineering part and the final model). This is set to AUTO (disabled) by default.
GLM Models
This option allows you to specify whether to build GLM models (generalized linear models) as part of the experiment
(usually only for the final model unless it’s used exclusively). GLMs are very interpretable models with one coefficient
per feature, an intercept term and a link function. This is set to AUTO by default (enabled if accuracy <= 5 and
interpretability >= 6).
This option allows you to specify whether to build Decision Tree models as part of the experiment. This is set to
AUTO by default. In this case, Driverless AI will build Decision Tree models if interpretability is greater than or
equal to the value of decision_tree_interpretability_switch (which defaults to 7) and accuracy is less
than or equal to decision_tree_accuracy_switch (which defaults to 7).
LightGBM Models
This option allows you to specify whether to build LightGBM models as part of the experiment. LightGBM Models
are the default models. This is set to AUTO (enabled) by default.
TensorFlow Models
This option allows you to specify whether to build TensorFlow models as part of the experiment (usually only for text
features engineering and for the final model unless it’s used exclusively). Enable this option for NLP experiments. This
is set to AUTO by default (not used unless the number of classes is greater than 10).
TensorFlow models are not yet supported by MOJOs (only Python scoring pipelines are supported).
FTRL Models
This option allows you to specify whether to build Follow the Regularized Leader (FTRL) models as part of the
experiment. Note that MOJOs are not yet supported (only Python scoring pipelines). FTRL supports binomial and
multinomial classification for categorical targets, as well as regression for continuous targets. This is set to AUTO
(disabled) by default.
RuleFit Models
This option allows you to specify whether to build RuleFit models as part of the experiment. Note that MOJOs are
not yet supported (only Python scoring pipelines). Note that multiclass classification is not yet supported for RuleFit
models. Rules are stored to text files in the experiment directory for now. This is set to AUTO (disabled) by default.
Specify which boosting types to enable for LightGBM. Select one or more of the following:
• gbdt: Boosted trees
• rf_early_stopping: Random Forest with early stopping
• rf: Random Forest
• dart: Dropout boosted trees with no early stopping
gbdt and rf are both enabled by default.
Specify whether to enable LightGBM categorical feature support (currently only available for CPU mode). This is
disabled by default.
Constant Models
Specify whether to enable constant models. This is set to AUTO (enabled) by default.
Specify whether to show constant models in the iteration panel. This is disabled by default.
Specify specific parameters for TensorFlow to override Driverless AI parameters. The following is an example of how
the parameters can be configured:
params_tensorflow = "{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30,
'layers': [100, 100], 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3,
'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}"
More information about TensorFlow parameters can be found in the Keras documentation. Different strategies for
using TensorFlow parameters can be viewed here.
Specify the upper limit on the number of trees (GBM) or iterations (GLM). This defaults to 3000. Depending on
accuracy settings, a fraction of this limit will be used.
n_estimators List to Sample From for Model Mutations for Models That Do Not Use Early Stopping
For LightGBM, the dart and normal random forest modes do not use early stopping. This setting allows you to specify
the n_estimators (number of trees in the forest) list to sample from for model mutations for these types of models.
Specify the minimum learning rate for final ensemble GBM models. This value defaults to 0.01.
Specify the maximum learning rate for final ensemble GBM models. This value defaults to 0.05.
Specify the factor by which max_nestimators is reduced for tuning and feature evolution. This option defaults to 0.2.
So by default, Driverless AI will produce no more than 0.2 * 3000 trees/iterations during feature evolution.
Specify the minimum learning rate for feature engineering GBM models. This value defaults to 0.05.
Specify the maximum learning rate for tree models during feature engineering. Higher values can speed up feature
engineering but can hurt accuracy. This value defaults to 0.5.
When building TensorFlow or FTRL models, specify the maximum number of epochs to train models with (it might
stop earlier). This value defaults to 10. This option is ignored if TensorFlow models and/or FTRL models is disabled.
Specify the maximum tree depth. The corresponding maximum value for max_leaves is double the specified value.
This value defaults to 12.
Specify the maximum max_bin for tree features. This value defaults to 256.
Specify the maximum number of rules to be used for RuleFit models. This defaults to -1, which specifies to use all
rules.
Driverless AI normally produces a single final model for low accuracy settings (typically, less than 5). When the
Cross-validate single final model option is enabled (default for regular experiments), Driverless AI will perform
cross-validation to determine optimal parameters and early stopping before training the final single modeling pipeline
on the entire training data. The final pipeline will build N+1 models, with N-fold cross-validation for the single final
model. This also creates holdout predictions for all non-time-series experiments with a single final model.
Note that the setting for this option is ignored for time-series experiments or when a validation dataset is provided.
Specify the number of models to tune during the pre-evolution phase. Specify a lower value to avoid excessive tuning, or
a higher value to perform enhanced tuning. This option defaults to -1 (auto).
Specify the sampling method for imbalanced binary classification problems. This is set to off by default. Choose from
the following options:
• auto: sample both classes as needed, depending on data
• over_under_sampling: over-sample the minority class and under-sample the majority class, depending on data
• under_sampling: under-sample the majority class to reach class balance
• off: do not perform any sampling
Ratio of Majority to Minority Class for Imbalanced Binary Classification to Trigger Special Sampling
Techniques (if Enabled)
For imbalanced binary classification problems, specify the ratio of majority to minority class. Special imbalanced
models with sampling techniques are enabled when the ratio is equal to or greater than the specified ratio. This value
defaults to 5.
Ratio of Majority to Minority Class for Heavily Imbalanced Binary Classification to Only Enable Spe-
cial Sampling Techniques if Enabled
For heavily imbalanced binary classification, specify the ratio of the majority to minority class equal and above which
to enable only special imbalanced models on the full original data without upfront sampling. This value defaults to 25.
Number of Bags for Sampling Methods for Imbalanced Binary Classification (if Enabled)
Specify the number of bags for sampling methods for imbalanced binary classification. This value defaults to -1.
Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification
Specify the limit on the number of bags for sampling methods for imbalanced binary classification. This value defaults
to 10.
Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification During
Feature Evolution Phase
Specify the limit on the number of bags for sampling methods for imbalanced binary classification. This value defaults
to 3. Note that this setting only applies to shift, leakage, tuning, and feature evolution models. To limit final models,
use the Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification setting.
Specify the maximum size of the data sampled during imbalanced sampling in terms of the dataset’s size. This setting
controls the approximate number of bags and is only active when the “Hard limit on number of bags for sampling
methods for imbalanced binary classification during feature evolution phase” option is set to -1. This value defaults to
1.
Specify the target fraction of a minority class after applying under/over-sampling techniques. A value of 0.5 means
that models/algorithms will be given a balanced target class distribution. When starting from an extremely imbalanced
original target, it can be advantageous to specify a smaller value such as 0.1 or 0.01. This value defaults to -1.
Max Number of Automatic FTRL Interactions Terms for 2nd, 3rd, 4th order interactions terms (Each)
Specify a limit for the number of FTRL interactions terms sampled for each of second, third, and fourth order terms.
This value defaults to 10,000.
Specify whether to dump every scored individual’s model parameters to a csv/tabulated file. If enabled (default),
Driverless AI produces files such as “individual_scored_id%d.iter%d*params*”. This is enabled by default.
Specify whether to enable bootstrap sampling. When enabled, this setting provides error bars to validation and test
scores based on the standard error of the bootstrap mean. This is enabled by default.
Specify the number of classes above which to use TensorFlow when it is enabled. Other models that are set to AUTO
will not be used above this number. (Models set to ON, however, are still used.) This value defaults to 10.
Specify a value from 0 to 10 for the Driverless AI feature engineering effort. Higher values generally lead to more
time (and memory) spent in feature engineering. This value defaults to 5.
• 0: Keep only numeric features. Only model tuning during evolution.
• 1: Keep only numeric features and frequency-encoded categoricals. Only model tuning during evolution.
• 2: Similar to 1, but with no text features. Some feature tuning before evolution.
• 3: Similar to 5 but only tuning during evolution. Mixed tuning of features and model parameters.
• 4: Similar to 5 but slightly more focused on model tuning.
• 5: Balanced feature-model tuning. (Default)
• 6-7: Similar to 5 but slightly more focused on feature engineering.
• 8: Similar to 6-7 but even more focused on feature engineering with high feature generation rate and no feature
dropping even if high interpretability.
• 9-10: Similar to 8 but no model tuning during feature evolution.
Specify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided).
Currently, this information is only presented to the user and not acted upon.
Specify whether to drop high-shift features. This defaults to AUTO. Note that for time series experiments, AUTO turns
this feature off.
Specify the maximum allowed AUC value for a feature before dropping the feature.
When train and test differ (or train/valid or valid/test) in terms of distribution of data, then there can be a model built
that tells you for each row whether the row is in train or test. That model includes an AUC value. If the AUC is above
this specified threshold, then Driverless AI will consider it a strong enough shift to drop features that are shifted.
This value defaults to 0.999.
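The following is a conceptual sketch of this kind of train-versus-test shift check using scikit-learn. It is not Driverless AI’s internal implementation, and the helper name shift_auc is hypothetical:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def shift_auc(train_df, test_df, feature):
    # Label rows by origin (0 = train, 1 = test) and see how well a model separates them.
    X = pd.concat([train_df[[feature]], test_df[[feature]]], ignore_index=True)
    y = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

# A feature whose shift AUC exceeds the threshold (0.999 by default) would be dropped.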
Leakage Detection
Specify whether to check leakage for each feature. Note that this is always disabled if a fold column is specified and
if the experiment is a time series experiment. This is set to AUTO by default.
If Leakage Detection is enabled, specify to drop features for which the AUC (classification)/R2 (regression) is above
this value. This value defaults to 0.999.
Specify the maximum number of rows times the number of columns to trigger sampling for leakage checks. This value
defaults to 10,000,000.
Specify whether Driverless AI reports permutation importance on original features. This is disabled by default.
Specify the maximum number of rows to use when performing permutation feature importance. This value defaults to
1,000,000.
Specify the maximum number of columns to be selected from an existing set of columns using feature selection. This
value defaults to 10,000.
Specify the maximum number of non-numeric columns to be selected. Feature selection is performed on all features
when this value is exceeded. This value defaults to 300.
Specify the maximum number of features you want to be selected in an experiment. Additional columns above the
specified value add a special individual with the original columns reduced. This value defaults to 500.
The maximum number of original numeric columns, above which Driverless AI will do feature selection. Note that this
is applicable only to special individuals with original columns reduced. A separate individual in the genetic algorithm
is created by doing feature selection by permutation importance on original features. This value defaults to 500.
The maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all
features. Note that this is applicable only to special individuals with original columns reduced. A separate individual
in the genetic algorithm is created by doing feature selection by permutation importance on original features. This
value defaults to 200.
Specify the maximum fraction of unique values for integer and categorical columns. If the column has a larger fraction
of unique values than that, it will be considered an ID column and ignored. This value defaults to 0.95.
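A minimal sketch of this unique-fraction check (illustrative only; the column names are hypothetical):
import pandas as pd

def looks_like_id(col: pd.Series, max_fraction: float = 0.95) -> bool:
    # A column is treated as an ID when its fraction of unique values exceeds the limit.
    return col.nunique() / len(col) > max_fraction

df = pd.DataFrame({"customer_id": range(1000), "state": ["CA", "NY"] * 500})
print(looks_like_id(df["customer_id"]))   # True  -> considered an ID column and ignored
print(looks_like_id(df["state"]))         # False -> kept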
Specify whether to allow some numerical features to be treated as categorical features. This is enabled by default.
Specify the number of unique values for integer or real columns to be treated as categoricals. This value defaults to
50.
Specify the maximum number of features to include in the final model’s feature engineering pipeline. If -1 is specified
(default), then Driverless AI will automatically determine the number of features.
Specify the maximum number of genes (transformer instances) kept per model (and per each model within the final
model for ensembles). This controls the number of genes before features are scored, so Driverless AI will just randomly
sample genes if pruning occurs. If the restriction occurs after features are scored, then aggregated gene importances
are used for pruning genes. Instances include all possible transformers, including the original transformer for numeric
features. A value of -1 means no restrictions except internally-determined memory and interpretability restrictions.
Specify the threshold of Pearson product-moment correlation coefficient between numerical and encoded transformed
feature and target. This value defaults to 0.1.
Specify the maximum number of features to use for interaction features like grouping for target encoding, weight of
evidence, and other likelihood estimates.
Exploring feature interactions can be important in gaining better predictive performance. The interaction can take
multiple forms (i.e. feature1 + feature2 or feature1 * feature2 + . . . featureN). Although certain machine learning
algorithms (like tree-based methods) can do well at capturing these interactions as part of their training process,
explicitly generating them may still help those (or other) algorithms yield better performance.
The depth of the interaction level (as in “up to” how many features may be combined at once to create one single
feature) can be specified to control the complexity of the feature engineering process. Higher values might be able to
make more predictive models at the expense of time. This value defaults to 8.
Specify a fixed non-zero number of features to use for interaction features like grouping for target encoding, weight
of evidence, and other likelihood estimates. To use all features for each transformer, set this to be equal to the number
of columns. To do a 50/50 sample and a fixed feature interaction depth of 𝑛 features, set this to -𝑛.
Specify whether to use Target Encoding when building the model. Target encoding refers to several different feature
transformations (primarily focused on categorical data) that aim to represent the feature using information of the actual
target variable. A simple example can be to use the mean of the target to replace each unique category of a categorical
feature. These type of features can be very predictive but are prone to overfitting and require more memory as they
need to store mappings of the unique categories and the target values.
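A minimal sketch of the simple mean-based target encoding described above (illustrative only; a production implementation would typically use out-of-fold estimates to limit the overfitting mentioned here, and the column names are hypothetical):
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B", "B"],
                   "price": [100, 120, 300, 280, 320]})
# Replace each category with the mean of the target for that category.
means = df.groupby("city")["price"].mean()     # A -> 110, B -> 300
df["city_target_enc"] = df["city"].map(means)
print(df)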
Isolation Forest is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by ran-
domly selecting a feature and then randomly selecting a split value between the maximum and minimum values of
that selected feature. This split depends on how long it takes to separate the points. Random partitioning produces
noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for
particular samples, they are highly likely to be anomalies.
This option allows you to specify whether to return the anomaly score of each sample. This is disabled by default.
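A conceptual illustration of Isolation Forest anomaly scoring using scikit-learn follows; this is not Driverless AI’s internal recipe, and the 200 estimators simply mirror the default number of estimators noted just below:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),    # typical points
               rng.normal(8, 1, size=(5, 2))])     # a few anomalies
iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
scores = iso.score_samples(X)                      # lower (more negative) = more anomalous
print(scores[:2], scores[-2:])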
Specify whether one-hot encoding is enabled. The default AUTO setting is only applicable for small datasets and
GLMs.
Specify the number of estimators for Isolation Forest encoding. This value defaults to 200.
Specify whether to drop columns with constant values. This is enabled by default.
Drop ID Columns
Specify whether to drop columns that appear to be an ID. This is enabled by default.
Specify whether to avoid dropping any columns (original or derived). This is disabled by default.
Features to Drop
Specify which features to drop. This setting allows you to select many features at once by copying and pasting a list
of column names (in quotes) separated by commas.
Features to Group By
Specify which features to group columns by. When this field is left empty (default), Driverless AI automatically
searches all columns (either at random or based on which columns have high variable importance).
Specify whether to sample from given features to group by or to always group all features. This is disabled by default.
Specify whether to enable aggregation functions to use for group by operations. Choose from the following (all are
selected by default):
• mean
• sd
• min
• max
• count
Specify the number of folds to obtain aggregation when grouping. Out-of-fold aggregations will result in less overfit-
ting, but they analyze less data in each fold.
Specify which strategy to apply when performing mutations on transformers. Select from the following:
• sample: Sample transformer parameters (Default)
• batched: Perform multiple types of the same transformation together
• full: Perform more types of the same transformation together than the above strategy
Specify whether to dump every scored individual’s variable importance (both derived and original) to a
csv/tabulated/json file. If enabled, Driverless AI produces files such as “individual_scored_id%d.iter%d*features*”.
This is disabled by default.
Specify whether to dump every scored fold’s timing and feature info to a timings.txt file. This is disabled by default.
Specify whether to compute training, validation, and test correlation matrices. When enabled, this setting creates table
and heatmap PDF files that are saved to disk. Note that this setting is currently a single threaded process that may be
slow for experiments with many columns. This is disabled by default.
This recipe specifies whether to include Time Series lag features when training a model with a provided (or autode-
tected) time column. This is enabled by default. Lag features are the primary automatically generated time series
features and represent a variable’s past values. At a given sample with time stamp 𝑡, features at some time difference
𝑇 (lag) in the past are considered. For example, if the sales today are 300, and sales of yesterday are 250, then the lag
of one day for sales is 250. Lags can be created on any feature as well as on the target. Lagging variables are important
in time series because knowing what happened in different time periods in the past can greatly facilitate predictions
for the future. Note: Ensembling is disabled when the lag-based recipe with time columns is activated because it only
supports a single final model. Ensembling is also disabled if a time column is selected or if time column is set to
[AUTO] on the experiment setup screen.
More information about time series lag is available in the Time Series Use Case: Sales Forecasting section.
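A minimal sketch of the one-day lag from the sales example above, using pandas (illustrative only):
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "sales": [250, 300, 280, 310, 295],
})
# Yesterday's sales become a feature for today's row (lag of one day).
sales["sales_lag_1"] = sales["sales"].shift(1)
print(sales)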
Specify date or datetime timestamps (in the same format as the time column) to use for custom training and validation
splits.
Specify the timeout in seconds for time-series properties detection in Driverless AI’s user interface. This value defaults
to 30.
For time-series experiments, specify whether to generate holiday features for the experiment. This is enabled by
default.
Specify the override lags to be used. These can be used to give more importance to the lags that are still considered
after the override is applied. The following examples show the variety of different methods that can be used to specify
override lags:
• “[7, 14, 21]” specifies this exact list
• “21” specifies every value from 1 to 21
• “21:3” specifies every value from 1 to 21 in steps of 3
• “5-21” specifies every value from 5 to 21
• “5-21:3” specifies every value from 5 to 21 in steps of 3
Specify whether to enable feature engineering based on the selected time column, e.g. Date~weekday. This is enabled
by default.
Specify whether to allow an integer time column to be used as a numeric feature. Note that if you are using a time
series recipe, using a time column (numeric time stamps) as an input feature can lead to a model that memorizes the
actual timestamps instead of features that generalize to the future. This is disabled by default.
Specify the date or date-time transformations to allow Driverless AI to use. Choose from the following transformers:
• year
• quarter
• month
• week
• weekday
• day
• dayofyear
• num (direct numeric value representing the floating point value of time, disabled by default)
• hour
• minute
• second
Features in Driverless AI will appear as get_ followed by the name of the transformation. Note that get_num can
lead to overfitting if used on IID problems and is disabled by default.
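For illustration, the date parts listed above can be derived with pandas as follows; the get_ column names are only meant to mirror the feature-naming convention described here, not Driverless AI’s actual transformer code:
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-03-15 14:30:00", "2020-12-31 23:59:59"]))
features = pd.DataFrame({
    "get_year": dates.dt.year,
    "get_quarter": dates.dt.quarter,
    "get_month": dates.dt.month,
    "get_week": dates.dt.isocalendar().week,
    "get_weekday": dates.dt.weekday,
    "get_dayofyear": dates.dt.dayofyear,
    "get_hour": dates.dt.hour,
    "get_minute": dates.dt.minute,
    "get_second": dates.dt.second,
})
print(features)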
Specify whether to consider time groups columns as standalone features. This is disabled by default.
Specify whether to consider time groups columns (TGC) as standalone features. If “Consider time groups columns as
standalone features” is enabled, then specify which TGC feature types to consider as standalone features. Available
types are numeric, categorical, ohe_categorical, datetime, date, and text. All types are selected by default. Note
that “time_column” is treated separately via the “Enable Feature Engineering from Time Column” option. Also note
that if “Time Series Lag-Based Recipe” is disabled, then all time group columns are allowed features.
Specify whether various transformers (clustering, truncated SVD) are enabled, which otherwise would be disabled for
time series experiments due to the potential to overfit by leaking across time within the fit of each fold. This is set to
AUTO by default.
Always Group by All Time Groups Columns for Creating Lag Features
Specify whether to group by all time groups columns for creating lag features. This is enabled by default.
Specify whether to create diagnostic holdout predictions on training data using moving windows. This is enabled by
default. This can be useful for MLI, but it will slow down the experiment considerably when enabled. Note that the
model itself remains unchanged when this setting is enabled.
Specify a fixed number of time-based splits for internal model validation. Note that the actual number of allowed splits
can be less than the specified value, and that the number of allowed splits is determined at the time an experiment is
run. This value defaults to -1 (auto).
Specify the maximum overlap between two time-based splits. The amount of possible splits increases with higher
values. This value defaults to 0.5.
Maximum Number of Splits Used for Creating Final Time-Series Model’s Holdout Predictions
Specify the maximum number of splits used for creating the final time-series Model’s holdout predictions. The default
value (-1) will use the same number of splits that are used during model validation.
Specify whether to speed up time-series holdout predictions for back-testing on training data. This setting is used for
MLI and calculating metrics. Note that predictions can be slightly less accurate when this setting is enabled. This is
disabled by default.
Specify whether to speed up Shapley values for time-series holdout predictions for back-testing on training data. This
setting is used for MLI. Note that predictions can be slightly less accurate when this setting is enabled. This is enabled
by default.
Generate Shapley Values for Time-Series Holdout Predictions at the Time of Experiment
Specify whether to enable the creation of Shapley values for holdout predictions on training data using moving win-
dows at the time of the experiment. This can be useful for MLI, but it can slow down the experiment when enabled. If
this setting is disabled, MLI will generate Shapley values on demand. This is enabled by default.
Specify the lower limit on interpretability setting for time-series experiments. Values of 5 (default) or more can
improve generalization by more aggressively dropping the least important features. To disable this setting, set this
value to 1.
Specify the dropout mode for lag features in order to achieve an equal n.a. ratio between train and validation/tests.
Independent mode performs a simple feature-wise dropout. Dependent mode takes the lag-size dependencies per
sample/row into account. Dependent is enabled by default.
Lags can be created on any feature as well as on the target. Specify a probability value for creating non-target lag
features. This value defaults to 0.1.
Specify the method used to create rolling test set predictions. Choose between test time augmentation (TTA) and a
successive refitting of the final pipeline. TTA is enabled by default.
Specify the probability for new lags or the EWMA gene to use default lags. This is determined independently of the
data by frequency, gap, and horizon. This value defaults to 0.2.
Specify the unnormalized probability of choosing other lag time-series transformers based on interactions. This value
defaults to 0.2.
Specify the unnormalized probability of choosing other lag time-series transformers based on aggregations. This value
defaults to 0.2.
When building TensorFlow NLP features (for text data), specify the maximum number of epochs to train feature
engineering models with (it might stop earlier). The higher the number of epochs, the higher the run time. This value
defaults to 2 and is ignored if TensorFlow models is disabled.
Specify the accuracy threshold. Values equal and above will add all enabled TensorFlow NLP models at the start of
the experiment for text-dominated problems when the following NLP expert settings are set to AUTO:
• Enable word-based CNN TensorFlow models for NLP
• Enable word-based BigRU TensorFlow models for NLP
• Enable character-based CNN TensorFlow models for NLP
If the above transformations are set to ON, this parameter is ignored.
At lower accuracy, TensorFlow NLP transformations will only be created as a mutation. This value defaults to 5.
Specify whether to use Word-based CNN TensorFlow models for NLP. This option is ignored if TensorFlow is dis-
abled. We recommend that you disable this option on systems that do not use GPUs.
Specify whether to use Word-based BiG-RU TensorFlow models for NLP. This option is ignored if TensorFlow is
disabled. We recommend that you disable this option on systems that do not use GPUs.
Specify whether to use Character-level CNN TensorFlow models for NLP. This option is ignored if TensorFlow is
disabled. We recommend that you disable this option on systems that do not use GPUs.
Specify a path to pretrained embeddings that will be used for the TensorFlow NLP models. For example,
/path/on/server/to/file.txt
• You can download the Glove embeddings from here and specify the local path in this box.
• You can download the fasttext embeddings from here and specify the local path in this box.
• You can also train your own custom embeddings. Please refer to this code sample for creating custom embed-
dings that can be passed on to this option.
• If this field is left empty, embeddings will be trained from scratch.
Specify whether to allow training of all weights of the neural network graph, including the pretrained embedding layer
weights. If this is disabled, the embedding layer will be frozen. All other weights, however, will still be fine-tuned.
This is disabled by default.
Specify whether the Python/MOJO scoring runtime will have GPUs (otherwise BiGRU will fail in production if this
is enabled). Enabling this setting can speed up training for BiGRU, but doing so will require GPUs and CuDNN in
production. This is disabled by default.
Specify the fraction of text columns out of all features to be considered as a text-dominated problem. This value
defaults to 0.3.
Specify when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable.
Higher values will favor string columns as categoricals, while lower values will favor string columns as text. This
value defaults to 0.3.
Specify the fraction of text columns out of all features to be considered a text-dominated problem. This value defaults
to 0.3.
Specify the threshold value (from 0 to 1) for string columns to be treated as text (0.0 - text; 1.0 - string). This value
defaults to 0.3.
Select the transformer(s) that you want to use in the experiment. Use the Check All/Uncheck All button to quickly
add or remove all transformers at once. Note: If you uncheck all transformers so that none are selected, Driverless AI
will ignore this and will use the default list of transformers for that experiment. This list of transformers will vary for
each experiment.
Specify the type(s) of models that you want Driverless AI to build in the experiment.
Specify the scorer(s) that you want Driverless AI to include when running the experiment.
Specify the unnormalized probability to add genes or instances of transformers with specific attributes. If no genes
can be added, other mutations are attempted. This value defaults to 0.5.
Specify the unnormalized probability to add genes or instances of transformers with specific attributes that have shown
to be beneficial to other individuals within the population. This value defaults to 0.5.
Specify the unnormalized probability to prune genes or instances of transformers with specific attributes. This value
defaults to 0.5.
Specify the unnormalized probability to change model hyper parameters. This value defaults to 0.25.
Specify the unnormalized probability to prune features that have low variable importance instead of pruning entire
instances of genes/transformers. This value defaults to 0.25.
Specify the number of minutes to wait until a recipe’s acceptance testing is aborted. A recipe is rejected if acceptance
testing is enabled and it times out. This value defaults to 20.0.
Specify whether to avoid failed models. Failures are logged according to the specified level for logging skipped
failures. This is enabled by default.
Specify one of the following levels for the verbosity of log failure messages for skipped transformers or models:
• 0 = Log simple message
• 1 = Log code line plus message (Default)
• 2 = Log detailed stack traces
Specify the number of cores to use for the experiment. Note that if you specify 0, all available cores will be used.
Lower values can reduce memory usage but might slow down the experiment. This value defaults to 0.
Specify the maximum number of cores to use for a model’s fit call. Note that if you specify 0, all available cores will
be used. This value defaults to 10.
Specify the maximum number of cores to use for a model’s predict call. Note that if you specify 0, all available cores
will be used. This value defaults to 0.
Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, Autoreport
Specify the maximum number of cores to use for a model’s transform and predict call when doing operations in the
Driverless AI MLI GUI and the Driverless AI R and Python clients. Note that if you specify 0, all available cores will
be used. This value defaults to 4.
Specify the number of workers used in CPU mode for tuning. A value of 0 uses the socket count, while a value of -1
uses all physical cores greater than or equal to 1 that count. This value defaults to 0.
#GPUs/Experiment
Specify the number of GPUs to use per experiment. A value of -1 (default) specifies to use all available GPUs. Must
be at least as large as the number of GPUs to use per model (or -1).
Num Cores/GPU
Specify the number of CPU cores per GPU. In order to have a sufficient number of cores per GPU, this setting limits
the number of GPUs used. This value defaults to 4.
#GPUs/Model
Specify the number of GPUs to use per model, with -1 meaning all GPUs per model. In all cases, XGBoost tree
and linear models use the number of GPUs specified per model, while LightGBM and TensorFlow revert to using 1
GPU/model and run multiple models on multiple GPUs. This value defaults to 1.
Note: FTRL does not use GPUs. Rulefit uses GPUs for parts involving obtaining the tree using LightGBM.
Specify the number of GPUs to use for predict for models and transform for transformers when running outside
of fit/fit_transform. If predict or transform are called in the same process as fit/fit_transform,
the number of GPUs will match. New processes will use this count for applicable models and transformers. Note that
enabling tensorflow_nlp_have_gpus_in_production will override this setting for relevant TensorFlow
NLP transformers. This value defaults to 0.
Max Number of Threads to Use for datatable and OpenBLAS for Munging and Model Training
Specify the maximum number of threads to use for datatable and OpenBLAS during data munging (applied on a per
process basis):
• 0 = Use all threads
• -1 = Automatically select number of threads (Default)
Max Number of Threads to Use for datatable Read and Write of Files
Specify the maximum number of threads to use for datatable during data reading and writing (applied on a per process
basis):
• 0 = Use all threads
• -1 = Automatically select number of threads (Default)
Specify the maximum number of threads to use for datatable stats and OpenBLAS (applied on a per process basis):
• 0 = Use all threads
• -1 = Automatically select number of threads (Default)
GPU Starting ID
Specify which gpu_id to start with. If using CUDA_VISIBLE_DEVICES=... to control GPUs (preferred method),
gpu_id=0 is the first in that restricted list of devices. For example, if CUDA_VISIBLE_DEVICES='4,5' then
gpu_id_start=0 will refer to device #4.
From expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs, then:
• Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0
• Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1
From expert mode, to run 2 experiments, each on a distinct GPU out of 8 GPUs, then:
• Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0
• Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4
To run 2 experiments, each using all 4 GPUs per model, then:
• Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0
• Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4
If num_gpus_per_model!=1, global GPU locking is disabled. This is because the underlying algorithms do not support
arbitrary gpu ids, only sequential ids, so be sure to set this value correctly to avoid overlap across all experiments by
all users.
More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation. Note
that GPU selection does not wrap, so gpu_id_start + num_gpus_per_model must be less than the number of visible
GPUs.
Specify whether to enable detailed tracing in Driverless AI trace when running an experiment. This is disabled by
default.
If enabled, the log files will also include debug logs. This is disabled by default.
Specify whether to include system information such as CPU, GPU, and disk space at the start of each experiment log.
Note that this information is already included in system logs. This is enabled by default.
17.4 Scorers
• GINI (Gini Coefficient): The Gini index is a well-established method to quantify the inequality among values
of a frequency distribution, and can be used to measure the quality of a binary classifier. A Gini index of zero
expresses perfect equality (or a totally useless classifier), while a Gini index of one expresses maximal inequality
(or a perfect classifier).
The Gini index is based on the Lorenz curve. The Lorenz curve plots the true positive rate (y-axis) as a
function of percentiles of the population (x-axis).
The Lorenz curve represents a collective of models represented by the classifier. The location on the
curve is given by the probability threshold of a particular model. (i.e., Lower probability thresholds for
classification typically lead to more true positives, but also to more false positives.)
The Gini index itself is independent of the model and only depends on the Lorenz curve determined by
the distribution of the scores (or probabilities) obtained from the classifier.
Regression
• R2 (R Squared): The R2 value represents the degree that the predicted value and the actual value move in
unison. The R2 value varies between 0 and 1 where 0 represents no correlation between the predicted and actual
value and 1 represents complete correlation.
Calculating the R2 value for linear models is mathematically equivalent to 1 − 𝑆𝑆𝐸/𝑆𝑆𝑇 (or 1 −
residual sum of squares/total sum of squares). For all other models, this equivalence does not hold, so
the 1 − 𝑆𝑆𝐸/𝑆𝑆𝑇 formula cannot be used. In some cases, this formula can produce negative R2 values,
which is mathematically impossible for a real number. Because Driverless AI does not necessarily use
linear models, the R2 value is calculated using the squared Pearson correlation coefficient.
R2 equation:
R^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}
Where:
• x is the predicted target value
• y is the actual target value
• MSE (Mean Squared Error): The MSE metric measures the average of the squares of the errors or deviations.
MSE takes the distances from the points to the regression line (these distances are the “errors”) and squares
them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor.
MSE also gives more weight to larger differences. The bigger the error, the more it is penalized. For
example, if your correct answers are 2,3,4 and the algorithm guesses 1,4,3, then the absolute error on each
one is exactly 1, so squared error is also 1, and the MSE is 1. But if the algorithm guesses 2,3,6, then the
errors are 0,0,2, the squared errors are 0,0,4, and the MSE is a higher 1.333. The smaller the MSE, the
better the model’s performance. (Tip: MSE is sensitive to outliers. If you want a more robust metric, try
mean absolute error (MAE).)
MSE equation:
MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2
• RMSE (Root Mean Squared Error): The RMSE metric evaluates how well a model can predict a continuous
value. The RMSE units are the same as the predicted target, which is useful for understanding if the size of the
error is of concern or not. The smaller the RMSE, the better the model’s performance. (Tip: RMSE is sensitive
to outliers. If you want a more robust metric, try mean absolute error (MAE).)
RMSE equation:
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}
Where:
• N is the total number of rows (observations) of your corresponding dataframe.
• y is the actual target value.
• 𝑦ˆ is the predicted target value.
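As a quick sanity check, the worked example from the MSE description above can be reproduced directly, along with the corresponding RMSE and MAE (illustrative sketch):
import numpy as np

y_true = np.array([2, 3, 4])
for y_pred in (np.array([1, 4, 3]), np.array([2, 3, 6])):
    mse = np.mean((y_true - y_pred) ** 2)
    print(mse, np.sqrt(mse), np.mean(np.abs(y_true - y_pred)))
# First prediction set: MSE 1.0; second: MSE 1.333..., matching the example above.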
• RMSLE (Root Mean Squared Logarithmic Error): This metric measures the ratio between actual values and
predicted values and takes the log of the predictions and actual values. Use this instead of RMSE if an under-
prediction is worse than an over-prediction. You can also use this when you don’t want to penalize large differ-
ences when both of the values are large numbers.
RMSLE equation:
RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\ln\left(\frac{y_i + 1}{\hat{y}_i + 1}\right)\right)^2}
Where:
• N is the total number of rows (observations) of your corresponding dataframe.
• y is the actual target value.
• 𝑦ˆ is the predicted target value.
• RMSPE (Root Mean Square Percentage Error): This metric is the RMSE expressed as a percentage. The
smaller the RMSPE, the better the model performance.
RMSPE equation:
RMSPE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\frac{(y_i - \hat{y}_i)^2}{(y_i)^2}}
• MAE (Mean Absolute Error): The mean absolute error is an average of the absolute errors. The MAE units are
the same as the predicted target, which is useful for understanding whether the size of the error is of concern or
not. The smaller the MAE the better the model’s performance. (Tip: MAE is robust to outliers. If you want a
metric that is sensitive to outliers, try root mean squared error (RMSE).)
MAE equation:
MAE = \frac{1}{N}\sum_{i=1}^{N}|x_i - x|
Where:
– N is the total number of errors
– |𝑥𝑖 − 𝑥| equals the absolute errors.
• MAPE (Mean Absolute Percentage Error): MAPE measures the size of the error in percentage terms. It is
calculated as the average of the unsigned percentage error.
MAPE equation:
MAPE = \left(\frac{1}{N}\sum\frac{|Actual - Forecast|}{|Actual|}\right) \times 100
Because the MAPE measure is in percentage terms, it gives an indication of how large the error is across
different scales. For example, two records can have the same absolute error of 4, but that error could be
considered “small” or “big” depending on the actual value it is compared against.
• SMAPE (Symmetric Mean Absolute Percentage Error): Unlike the MAPE, which divides the absolute errors
by the absolute actual values, the SMAPE divides by the mean of the absolute actual and the absolute predicted
values. This is important when the actual values can be 0 or near 0. Actual values near 0 cause the MAPE value
to become infinitely high. Because SMAPE includes both the actual and the predicted values, the SMAPE value
can never be greater than 200%.
Consider the following example:
Actual Predicted
0.01 0.05
0.03 0.04
The MAPE for this data is 216.67% but the SMAPE is only 80.95%.
• MER (Median Error Rate or Median Absolute Percentage Error): MER measures the median size of the error
in percentage terms. It is calculated as the median of the unsigned percentage error.
MER equation:

    \mathrm{MER} = \mathrm{median}\left( \frac{|\mathrm{Actual} - \mathrm{Forecast}|}{|\mathrm{Actual}|} \right) \times 100
Because the MER is the median, half the scored population has a lower absolute percentage error than the
MER, and half the population has a larger absolute percentage error than the MER.
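The regression scorers above are straightforward to compute by hand. The following sketch is illustrative only (it is not Driverless AI code) and assumes y_true and y_pred are NumPy arrays of actual and predicted values, with no zeros in y_true for the percentage-based metrics and no values below -1 for RMSLE:

import numpy as np

def regression_metrics(y_true, y_pred):
    """Illustrative implementations of the regression scorers defined above."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)                                     # Mean Squared Error
    rmse = np.sqrt(mse)                                         # Root Mean Squared Error
    rmsle = np.sqrt(np.mean(np.log((y_true + 1) / (y_pred + 1)) ** 2))  # Root Mean Squared Log Error
    rmspe = np.sqrt(np.mean((err / y_true) ** 2))               # Root Mean Square Percentage Error
    mae = np.mean(np.abs(err))                                  # Mean Absolute Error
    mape = np.mean(np.abs(err) / np.abs(y_true)) * 100          # Mean Absolute Percentage Error
    smape = np.mean(np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2)) * 100  # Symmetric MAPE
    mer = np.median(np.abs(err) / np.abs(y_true)) * 100         # Median Error Rate
    return dict(MSE=mse, RMSE=rmse, RMSLE=rmsle, RMSPE=rmspe,
                MAE=mae, MAPE=mape, SMAPE=smape, MER=mer)

# The MSE example from above: guesses 2, 3, 6 against actuals 2, 3, 4 give MSE = 4/3.
print(regression_metrics(np.array([2.0, 3.0, 4.0]), np.array([2.0, 3.0, 6.0]))["MSE"])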
Classification
• MCC (Matthews Correlation Coefficient): The goal of the MCC metric is to represent the confusion matrix of
a model as a single number. The MCC metric combines the true positives, false positives, true negatives, and
false negatives using the equation described below.
A Driverless AI model will return probabilities, not predicted classes. To convert probabilities to predicted
classes, a threshold needs to be defined. Driverless AI iterates over possible thresholds, calculating a
confusion matrix for each, in order to find the threshold that gives the maximum MCC value. Driverless AI’s goal is
to keep increasing this maximum MCC. (A small sketch of this threshold search appears after this list of classification scorers.)
Unlike metrics like Accuracy, MCC is a good scorer to use when the target variable is imbalanced. In the
case of imbalanced data, high Accuracy can be found by simply predicting the majority class. Metrics
like Accuracy and F1 can be misleading, especially in the case of imbalanced data, because they do not
consider the relative size of the four confusion matrix categories. MCC, on the other hand, takes the
proportion of each class into account. The MCC value ranges from -1 to 1 where -1 indicates a classifier
that predicts the opposite class from the actual value, 0 means the classifier does no better than random
guessing, and 1 indicates a perfect classifier.
MCC equation:

    \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
• F05, F1, and F2: A Driverless AI model will return probabilities, not predicted classes. To convert probabilities
to predicted classes, a threshold needs to be defined. Driverless AI iterates over possible thresholds to calculate
a confusion matrix for each threshold. It does this to find the maximum F metric value. Driverless AI’s goal is
to continue increasing this maximum F metric.
The F1 score provides a measure for how well a binary classifier can classify positive cases (given a
threshold value). The F1 score is calculated from the harmonic mean of the precision and recall. An F1
score of 1 means both precision and recall are perfect and the model correctly identified all the positive
cases and didn’t mark a negative case as a positive case. If either precision or recall is very low, it will
be reflected with an F1 score closer to 0.
F1 equation:

    F_1 = 2 \left( \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \right)
Where:
• precision is the positive observations (true positives) the model correctly identified from all the
observations it labeled as positive (the true positives + the false positives).
• recall is the positive observations (true positives) the model correctly identified from all the actual
positive cases (the true positives + the false negatives).
The F0.5 score is the weighted harmonic mean of the precision and recall (given a threshold value).
Unlike the F1 score, which gives equal weight to precision and recall, the F0.5 score gives more weight
to precision than to recall. More weight should be given to precision for cases where False Positives are
considered worse than False Negatives. For example, if your use case is to predict which products you
will run out of, you may consider False Positives worse than False Negatives. In this case, you want your
predictions to be very precise and only capture the products that will definitely run out. If you predict
a product will need to be restocked when it actually doesn’t, you incur cost by having purchased more
inventory than you actually need.
F05 equation:

    F_{0.5} = 1.25 \left( \frac{\mathrm{precision} \times \mathrm{recall}}{0.25\,\mathrm{precision} + \mathrm{recall}} \right)
Where:
• precision is the positive observations (true positives) the model correctly identified from all the
observations it labeled as positive (the true positives + the false positives).
• recall is the positive observations (true positives) the model correctly identified from all the actual
positive cases (the true positives + the false negatives).
The F2 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike
the F1 score, which gives equal weight to precision and recall, the F2 score gives more weight to recall
than to precision. More weight should be given to recall for cases where False Negatives are considered
worse than False Positives. For example, if your use case is to predict which customers will churn, you
may consider False Negatives worse than False Positives. In this case, you want your predictions to
capture all of the customers that will churn. Some of these customers may not be at risk for churning, but
the extra attention they receive is not harmful. More importantly, no customers actually at risk of churning
have been missed.
F2 equation:

    F_2 = 5 \left( \frac{\mathrm{precision} \times \mathrm{recall}}{4\,\mathrm{precision} + \mathrm{recall}} \right)
Where:
• precision is the positive observations (true positives) the model correctly identified from all the
observations it labeled as positive (the true positives + the false positives).
• recall is the positive observations (true positives) the model correctly identified from all the actual
positive cases (the true positives + the false negatives).
• Accuracy: In binary classification, Accuracy is the number of correct predictions made as a ratio of all pre-
dictions made. In multiclass classification, the set of labels predicted for a sample must exactly match the
corresponding set of labels in y_true.
A Driverless AI model will return probabilities, not predicted classes. To convert probabilities to predicted
classes, a threshold needs to be defined. Driverless AI iterates over possible thresholds to calculate a
confusion matrix for each threshold. It does this to find the maximum Accuracy value. Driverless AI’s
goal is to continue increasing this maximum Accuracy.
Accuracy equation:

    \mathrm{Accuracy} = \frac{\text{number of correct predictions}}{N}

• Logloss (Logarithmic Loss): Logloss evaluates the predicted probabilities directly rather than the classes derived
from them, heavily penalizing predictions that are both confident and wrong. The smaller the Logloss, the better
the model’s performance.

Logloss equation (binary classification):

    \mathrm{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} w_i \left( y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right)

Logloss equation (multiclass classification):

    \mathrm{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} w_i \sum_{j=1}^{C} y_{i,j} \ln(p_{i,j})
Where:
• N is the total number of rows (observations) of your corresponding dataframe.
• w is the per row user-defined weight (defaults is 1).
• C is the total number of classes (C=2 for binary classification).
• p is the predicted value (uncalibrated probability) assigned to a given row (observation).
• y is the actual target value.
• AUC (Area Under the Receiver Operating Characteristic Curve): This model metric is used to evaluate how well
a binary classification model is able to distinguish between true positives and false positives. For multi-class
problems, this score is computed by micro-averaging the ROC curves for each class. Use MACROAUC if you
prefer the macro average.
An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a poor classifier whose performance
is no better than random guessing.
• AUCPR (Area Under the Precision-Recall Curve): This model metric is used to evaluate how well a binary
classification model is able to distinguish between precision recall pairs or points. These values are obtained
using different thresholds on a probabilistic or other continuous-output classifier. AUCPR is an average of the
precision-recall weighted by the probability of a given threshold.
The main difference between AUC and AUCPR is that AUC calculates the area under the ROC curve and
AUCPR calculates the area under the Precision Recall curve. The Precision Recall curve does not care
about True Negatives. For imbalanced data, a large quantity of True Negatives usually overshadows the
effects of changes in other metrics like False Positives. The AUCPR will be much more sensitive to True
Positives, False Positives, and False Negatives than AUC. As such, AUCPR is recommended over AUC
for highly imbalanced data.
• MACROAUC (Macro Average of Areas Under the Receiver Operating Characteristic Curves): For multiclass
classification problems, this score is computed by macro-averaging the ROC curves for each class (one per
class). Each class’s ROC curve contributes with equal (constant) weight to the average. A MACROAUC of 1
indicates a perfect classifier, while a MACROAUC of 0.5 indicates a poor classifier whose performance is no
better than random guessing. This option is not available for binary classification problems.
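To make the threshold search described for MCC and the F metrics concrete, the sketch below scans a grid of thresholds and picks the one that maximizes a chosen metric. It is illustrative only and uses scikit-learn for the metric calculations; Driverless AI’s internal search is its own implementation:

import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score

def best_threshold(y_true, p_pred, metric=matthews_corrcoef, n_steps=101):
    """Scan candidate thresholds and return the one that maximizes the metric.

    y_true holds 0/1 labels, p_pred holds predicted probabilities for the
    positive class. Degenerate thresholds (all-positive or all-negative
    predictions) simply score 0 for MCC and F1.
    """
    thresholds = np.linspace(0.0, 1.0, n_steps)
    scores = [metric(y_true, (p_pred >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Example: pick the threshold that maximizes MCC, then repeat for F1.
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9])
print(best_threshold(y, p))                       # threshold maximizing MCC
print(best_threshold(y, p, metric=f1_score))      # threshold maximizing F1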
When deciding which scorer to use in a regression problem, some main questions to ask are:
• Do you want your scorer sensitive to outliers?
• What unit should the scorer be in?
Sensitive to Outliers
Certain scorers are more sensitive to outliers. When a scorer is sensitive to outliers, it means that it is important that
the model predictions are never “very” wrong. For example, let’s say we have an experiment predicting number of
days until an event. The graph below shows the absolute error in our predictions.
Usually our model is very good. We have an absolute error less than 1 day about 70% of the time. There is one
instance, however, where our model did very poorly. We have one prediction that was 30 days off.
Instances like this are penalized much more heavily by scorers that are sensitive to outliers. If we do not care about
these occasional poor predictions, as long as we typically have a very accurate prediction, then we would want to select
a scorer that is robust to outliers. We can see this reflected in the behavior of two scorers: MAE and RMSE.
              MAE    RMSE
Outlier       0.99   2.64
No Outlier    0.80   1.0

Calculating the MAE and RMSE on our error data, the RMSE is more than twice as large as the MAE because RMSE is
sensitive to outliers. If we remove the one outlier record from our calculation, the RMSE drops significantly.
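The same effect can be reproduced in a few lines. The numbers below are hypothetical errors (they do not reproduce the exact table values above), but they show the same pattern: a single large outlier inflates RMSE far more than MAE:

import numpy as np

# Hypothetical absolute errors (in days): mostly small, plus one 30-day outlier.
errors = np.array([0.2, 0.5, 0.8, 0.3, 0.9, 0.6, 0.4, 0.7, 1.1, 30.0])

def mae(e):  return np.mean(np.abs(e))
def rmse(e): return np.sqrt(np.mean(e ** 2))

print("With outlier:    MAE=%.2f  RMSE=%.2f" % (mae(errors), rmse(errors)))
print("Without outlier: MAE=%.2f  RMSE=%.2f" % (mae(errors[:-1]), rmse(errors[:-1])))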
Performance Units
Different scorers will show the performance of the Driverless AI experiment in different units. Let’s continue with our
example where our target is to predict the number of days until an event. Some possible performance units are:
• Same as target: The unit of the scorer is in days
– ex: MAE = 5 means the model predictions are off by 5 days on average
When deciding which scorer to use in a classification problem some main questions to ask are:
• Do you want the scorer to evaluate the predicted probabilities or the classes that those probabilities can be
converted to?
• Is your data imbalanced?
Scorer Evaluates Probabilities or Classes
The final output of a Driverless AI model is a predicted probability that a record is in a particular class. The scorer you
choose will either evaluate how accurate the probability is or how accurate the assigned class is from that probability.
Choosing this depends on the use of the Driverless AI model. Do we want to use the probabilities or do we want to
convert those probabilities into classes? For example, if we are predicting whether a customer will churn, we may take
the predicted probabilities and turn them into classes - customers who will churn vs customers who won’t churn. If
we are predicting the expected loss of revenue, we will instead use the predicted probabilities (predicted probability
of churn * value of customer).
If your use case requires a class assigned to each record, you will want to select a scorer that evaluates the model’s
performance based on how well it classifies the records. If your use case will use the probabilities, you will want to
select a scorer that evaluates the model’s performance based on the predicted probability.
Robust to Imbalanced Data
For certain use cases, positive classes may be very rare. In these instances, some scorers can be misleading. For
example, if you have a use case where 99% of the records have Class = No, then a model that always predicts No
will have 99% accuracy.
For these use cases, it is best to select a metric that does not include True Negatives, or that considers the relative
size of the True Negatives, such as AUCPR or MCC.
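The sketch below illustrates this point with made-up, heavily imbalanced data: a model that always predicts the majority class scores roughly 99% Accuracy, yet carries no real signal according to MCC or the area under the Precision-Recall curve (approximated here with scikit-learn’s average precision):

import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, average_precision_score

# Hypothetical imbalanced labels: about 1% positives.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A useless model that always predicts the majority class
# (probability 0 for the positive class).
p_always_no = np.zeros_like(y_true, dtype=float)
y_class = (p_always_no >= 0.5).astype(int)

print(accuracy_score(y_true, y_class))              # ~0.99, looks great
print(matthews_corrcoef(y_true, y_class))            # 0.0, no better than guessing
print(average_precision_score(y_true, p_always_no))  # ~0.01, close to the positive rate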
1. Run an experiment by selecting [Click for Actions] button beside the dataset that you want to use. Click Predict
to begin an experiment.
2. The Experiment Settings form displays and auto-fills with the selected dataset. Optionally enter a custom name
for this experiment. If you do not add a name, Driverless AI will create one for you.
3. Optionally specify a validation dataset and/or a test dataset.
• The validation set is used to tune parameters (models, features, etc.). If a validation dataset is not
provided, the training data is used (with holdout splits). If a validation dataset is provided, training
data is not used for parameter tuning - only for training. A validation dataset can help to improve
the generalization performance on shifting data distributions.
• The test dataset is used for the final stage scoring and is the dataset for which model metrics will be
computed against. Test set predictions will be available at the end of the experiment. This dataset is
not used during training of the modeling pipeline.
Keep in mind that these datasets must have the same number of columns as the training dataset. Also
note that if provided, the validation set is not sampled down, so it can lead to large memory usage, even if
accuracy=1 (which reduces the train size).
4. Specify the target (response) column. Note that not all explanatory functionality will be available for multiclass
classification scenarios (scenarios with more than two outcomes). When the target column is selected, Driverless
AI automatically provides the target column type and the number of rows. If this is a classification problem,
then the UI shows unique and frequency statistics (Target Freq/Most Freq) for numerical columns. If this is a
regression problem, then the UI shows the dataset mean and standard deviation values.
Notes Regarding Frequency:
• For data imported in versions <= 1.0.19, TARGET FREQ and MOST FREQ both represent the count
of the least frequent class for numeric target columns and the count of the most frequent class for
categorical target columns.
• For data imported in versions 1.0.20-1.0.22, TARGET FREQ and MOST FREQ both represent the
frequency of the target class (second class in lexicographic order) for binomial target columns; the
count of the most frequent class for categorical multinomial target columns; and the count of the
least frequent class for numeric multinomial target columns.
• For data imported in version 1.0.23 (and later), TARGET FREQ is the frequency of the target class
for binomial target columns, and MOST FREQ is the most frequent class for multinomial target
columns.
5. The next step is to set the parameters and settings for the experiment. (Refer to the Experiment Settings section
for more information about these settings.) You can set the parameters individually, or you can let Driverless AI
infer the parameters and then override any that you disagree with. Available parameters and settings include the
following:
• Dropped Columns: The columns we do not want to use as predictors, such as ID columns, columns
with data leakage, etc.
• Weight Column: The column that indicates the per row observation weights. If “None” is specified,
each row will have an observation weight of 1.
• Fold Column: The column that indicates the fold. If “None” is specified, the folds will be determined
by Driverless AI. This is set to “Disabled” if a validation set is used.
• Time Column: The column that provides a time order, if applicable. If “AUTO” is specified, Driver-
less AI will auto-detect a potential time order. If “OFF” is specified, auto-detection is disabled. This
is set to “Disabled” if a validation set is used.
• Specify the Scorer to use for this experiment. The available scorers vary based on whether this is a
classification or regression experiment. Scorers include:
– Regression: GINI, MAE, MAPE, MER, MSE, R2, RMSE (default), RMSLE, RMSPE,
SMAPE, TOPDECILE
– Classification: ACCURACY, AUC (default), AUCPR, F05, F1, F2, GINI, LOGLOSS,
MACROAUC, MCC
• Specify a desired relative Accuracy from 1 to 10
• Specify a desired relative Time from 1 to 10
• Specify a desired relative Interpretability from 1 to 10
Driverless AI will automatically infer the best settings for Accuracy, Time, and Interpretability and
provide you with an experiment preview based on those suggestions. If you adjust these knobs, the
experiment preview will automatically update based on the new settings.
Expert Settings (optional):
• Optionally specify additional expert settings for the experiment. Refer to the Expert Settings section
for more information about these settings. The default values for these options are derived from the
environment variables in the config.toml file. Refer to the Setting Environment Variables section for
more information.
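If you prefer to launch experiments from a script rather than the UI, the same dataset, target, scorer, and Accuracy/Time/Interpretability settings described in the steps above can be supplied through the Driverless AI Python client. The snippet below is a rough sketch only: the server address, credentials, file name, and target column are placeholders, and argument names can differ slightly between client versions, so check the client documentation for your release:

import driverlessai

# Placeholder connection details for a running Driverless AI server.
dai = driverlessai.Client(
    address="http://localhost:12345",
    username="user",
    password="password",
)

# Upload a training dataset, then launch an experiment using the settings
# described above (target column, scorer, and the three knobs).
ds = dai.datasets.create("train.csv", data_source="upload")

experiment = dai.experiments.create(
    train_dataset=ds,
    target_column="target",       # hypothetical response column
    task="classification",
    scorer="AUC",                 # any scorer from the classification list above
    accuracy=7,
    time=5,
    interpretability=5,
)
print(experiment.metrics())       # metrics once the experiment finishes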
In addition to the status, as an experiment is running, the UI also displays the following:
• Details about the dataset.
• The iteration data (internal validation) for each cross validation fold along with the specified scorer value. Click
on a specific iteration or drag to view a range of iterations. Double click in the graph to reset the view. In this
graph, each “column” represents one iteration of the experiment. During each iteration, Driverless AI trains n
models. (These are referred to as individuals in the experiment preview.) So for any column, you may see the score
value for those n models for each iteration on the graph.
• The variable importance values. To view variable importance for a specific iteration, just select that iteration in
the Iteration Data graph. The Variable Importance list will automatically update to show variable importance
information for that iteration. Hover over an entry to view more info. Note: When hovering over an entry,
you may notice the term “Internal[. . . ] specification.” This label is used for features that do not need to be
translated/explained and ensures that all features are uniquely identified.
The values that display are specific to the variable importance of the model class:
– XGBoost and LightGBM: Gains Variable importance. Gain-based importance is calculated from the gains
a specific variable brings to the model. In the case of a decision tree, the gain-based importance will sum
up the gains that occurred whenever the data was split by the given variable. The gain-based importance is
normalized between 0 and 1. If a variable is never used in the model, the gain-based importance will be 0.
– GLM: The variable importance is the absolute value of the coefficient for each predictor. The variable
importance is normalized between 0 and 1. If a variable is never used in the model, the importance will be
0.
– TensorFlow: TensorFlow follows the Gedeon method described in this paper: https://www.ncbi.nlm.nih.gov/pubmed/9327276.
– RuleFit: Sums over a feature’s contribution to each rule. Specifically, Driverless AI:
1. Assigns all features to have zero importance.
2. Scans through all the rules. If a feature is in that rule, Driverless AI adds its contribution (i.e., the
absolute value of the rule’s coefficient) to its overall feature importance.
3. Normalizes the importance.
The calculation for the shift of variable importance is determined by the ensemble level:
– Ensemble Level = 0: The shift is determined between the last best genetic algorithm (GA) and the single
final model.
– Ensemble Level >=1: GA individuals used for the final model have variable importance blended with
the model’s meta learner weights, and the final model itself has variable importance blended with its
final weights. The shift of variable importance is determined between these two final variable importance
blends.
This information is reported in the logs or in the GUI if the shift is beyond the absolute magnitude speci-
fied by the max_num_varimp_shift_to_log configuration option. The Experiment Summary also
includes experiment_features_shift files that contain information about shift.
• CPU/Memory information including Notifications, Logs, and Trace info. (Note that Trace is used for develop-
ment/debugging and to show what the system is doing at that moment.)
• For classification problems, the lower right section includes a toggle between an ROC curve, Precision-Recall
graph, Lift chart, Gains chart, and GPU Usage information (if GPUs are available). For regression problems,
the lower right section includes a toggle between a Residuals chart, an Actual vs. Predicted chart, and GPU
Usage information (if GPUs are available). (Refer to the Experiment Graphs section for more information.)
Upon completion, an Experiment Summary section will populate in the lower right section.
• The bottom portion of the experiment screen will show any warnings that Driverless AI encounters. You can
hide this pane by clicking the x icon.
You can finish and/or abort experiments that are currently running.
• Finish: Click the Finish button to stop a running experiment. Driverless AI will end the experiment and then
complete the ensembling and the deployment package.
• Abort: After clicking Finish, you have the option to click Abort, which terminates the experiment. (You will
be prompted to confirm the abort.) Aborted experiments will display on the Experiments page as Failed. You
can restart aborted experiments by clicking the right side of the experiment, then selecting Restart from Last
Checkpoint. This will start a new experiment based on the aborted one. Alternatively, you can start a new
experiment based on the aborted one by selecting New Model with Same Params. Refer to Checkpointing,
Rerunning, and Retraining for more information.
The final step that Driverless AI performs during an experiment is to complete the experiment report. During this step,
you can click Abort to skip this report.
After an experiment status changes from RUNNING to COMPLETE, the UI provides you with several follow-up options for the completed experiment.
This section describes the dashboard graphs that display for running and completed experiments. These graphs are
interactive. Hover over a point on the graph for more details about the point.
For Binary Classification experiments, Driverless AI shows a ROC Curve, a Precision-Recall graph, a Lift chart, a
Kolmogorov-Smirnov chart, and a Gains chart.
• ROC: This shows Receiver Operating Characteristic curve stats on validation data along with the best Accuracy,
MCC, and F1 values. An ROC curve is a useful tool because it only focuses on how well the model was able to
distinguish between classes. Keep in mind, though, that for models where the prediction happens rarely, a high
AUC could provide a false sense that the model is correctly predicting the results. This is where the notion of
precision and recall become important.
The area under this curve is called AUC. The True Positive Rate (TPR) is the relative fraction of correct
positive predictions, and the False Positive Rate (FPR) is the relative fraction of incorrect positive
predictions. Each point corresponds to a classification threshold (e.g., YES if probability >= 0.3 else NO).
For each threshold, there is a unique confusion matrix that represents the balance between TPR and FPR.
Most useful operating points are in the top left corner in general.
Hover over a point in the ROC curve to see the True Negative, False Positive, False Negative, True
Positive, Threshold, FPR, TPR, Accuracy, F1, and MCC values for that point in the form of a confusion
matrix.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Precision-Recall: This shows the Precision-Recall curve on validation data along with the best Accuracy, MCC,
and F1 values. The area under this curve is called AUCPR. Prec-Recall is a complementary tool to ROC
curves, especially when the dataset has a significant skew. The Prec-Recall curve plots the precision or positive
predictive value (y-axis) versus sensitivity or true positive rate (x-axis) for every possible classification threshold.
At a high level, you can think of precision as a measure of exactness or quality of the results and recall as a
measure of completeness or quantity of the results obtained by the model. Prec-Recall measures the relevance
of the results obtained by the model.
• Precision: correct positive predictions (TP) / all positive predictions (TP + FP).
• Recall: correct positive predictions (TP) / all actual positives (TP + FN).
Each point corresponds to a classification threshold (e.g., YES if probability >= 0.3 else NO). For each
threshold, there is a unique confusion matrix that represents the balance between Recall and Precision.
The Prec-Recall curve can be more insightful than the ROC curve for highly imbalanced datasets.
Hover over a point in this graph to see the True Positive, True Negative, False Positive, False Negative,
Threshold, Recall, Precision, Accuracy, F1, and MCC value for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Lift: This chart shows lift stats on validation data. For example, “How many times more observations of the
positive target class are in the top predicted 1%, 2%, 10%, etc. (cumulative) compared to selecting observations
randomly?” By definition, the Lift at 100% is 1.0. Lift can help answer the question of how much better you
can expect to do with the predictive model compared to a random model (or no model). Lift is a measure of
the effectiveness of a predictive model calculated as the ratio between the results obtained with a model and
with a random model (or no model). In other words, it is the ratio of the gain % to the random expectation % at a given
quantile. The random expectation of the xth quantile is x%.
Hover over a point in the Lift chart to view the quantile percentage and cumulative lift value for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Kolmogorov-Smirnov: This chart measures the degree of separation between positives and negatives for vali-
dation or test data.
Hover over a point in the chart to view the quantile percentage and Kolmogorov-Smirnov value for that
point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Gains: This shows Gains stats on validation data. For example, “What fraction of all observations of the positive
target class are in the top predicted 1%, 2%, 10%, etc. (cumulative)?” By definition, the Gains at 100% are 1.0.
Hover over a point in the Gains chart to view the quantile percentage and cumulative gain value for that
point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
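The cumulative gains and lift definitions above can be expressed directly in code. The sketch below (illustrative only, with made-up data) ranks records by predicted probability and reports the gain and lift at a few quantiles; by construction the values at the 100% quantile are 1.0:

import numpy as np

def cumulative_gains_and_lift(y_true, p_pred, quantiles=(0.01, 0.02, 0.10, 0.50, 1.00)):
    """Illustrative cumulative gains and lift, as described above.

    Records are ranked by predicted probability; gain at quantile q is the
    fraction of all positives captured in the top q of records, and lift is
    that gain divided by q (the random expectation).
    """
    order = np.argsort(-p_pred)                 # highest probabilities first
    y_sorted = np.asarray(y_true)[order]
    total_pos = y_sorted.sum()
    results = {}
    for q in quantiles:
        top = y_sorted[: max(1, int(round(q * len(y_sorted))))]
        gain = top.sum() / total_pos
        results[q] = (gain, gain / q)           # (cumulative gain, cumulative lift)
    return results

# Example with made-up labels and scores, where positives tend to score higher.
rng = np.random.default_rng(1)
p = rng.random(1000)
y = (rng.random(1000) < p).astype(int)
print(cumulative_gains_and_lift(y, p))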
For multiclass classification experiments, a Confusion Matrix is available in addition to the ROC Curve, Precision-
Recall graph, Lift chart, Kolmogorov-Smirnov chart, and Gains chart. Driverless AI generates these graphs by con-
sidering the multiclass problem as multiple one-vs-all problems. These graphs and charts (Confusion Matrix ex-
cepted) are based on a method known as micro-averaging (reference: http://scikit-learn.org/stable/auto_examples/
model_selection/plot_roc.html#multiclass-settings).
For example, you may want to predict the species in the iris data. Driverless AI converts the predictions into three
one-vs-all problems (one per species). The result is 3 vectors of predicted and actual values for binomial problems.
Driverless AI concatenates these 3 vectors together to compute the charts, for example:
predicted = [0.9628, 0.0182, 0.0191, 0.021, 0.3172, 0.9534, 0.0158, 0.6646, 0.0276]
actual = [1, 0, 0, 0, 1, 1, 0, 0, 0]
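Because the one-vs-all vectors are simply concatenated, the micro-averaged chart is just an ordinary binary computation over those vectors. For illustration (this is not Driverless AI code), the example vectors above can be passed to scikit-learn to obtain the micro-averaged AUC:

import numpy as np
from sklearn.metrics import roc_auc_score

# The concatenated one-vs-all "predicted" and "actual" vectors shown above.
predicted = np.array([0.9628, 0.0182, 0.0191, 0.021, 0.3172, 0.9534,
                      0.0158, 0.6646, 0.0276])
actual = np.array([1, 0, 0, 0, 1, 1, 0, 0, 0])

# Micro-averaged AUC: a single ROC computation over the concatenated binary problems.
print(roc_auc_score(actual, predicted))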
A confusion matrix shows experiment performance in terms of false positives, false negatives, true positives, and true
negatives. For each threshold, the confusion matrix represents the balance between TPR and FPR (ROC) or Precision
and Recall (Prec-Recall). In general, most useful operating points are in the top left corner.
In this graph, the actual results display in the columns and the predictions display in the rows; correct predictions are
highlighted. In the example below, Iris-setosa was predicted correctly 30 times, while Iris-virginica was predicted
correctly 32 times, and Iris-versicolor was predicted as Iris-virginica 2 times (against the validation set).
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph to view
these stats on test data.
• Residuals: Residuals are the differences between observed responses and those predicted by a model. Any
pattern in the residuals is evidence of an inadequate model or of irregularities in the data, such as outliers, and
suggests how the model may be improved. This chart shows Residuals (Actual-Predicted) vs Predicted values
on validation or test data. Note that this plot preserves all outliers. For a perfect model, residuals are zero.
Hover over a point on the graph to view the Predicted and Residual values for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Actual vs. Predicted: This chart shows Actual vs Predicted values on validation data. A small sample of values
is displayed. A perfect model has a diagonal line.
Hover over a point on the graph to view the Actual and Predicted values for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
You can view detailed information about model scores after an experiment is complete by clicking on the Scores
option.
The Model Scores page that opens includes the following tables:
• Model and feature tuning leaderboard: This leaderboard shows scoring information based on the scorer that
was selected in the experiment. This information is also available in the tuning_leaderboard.json file of the
Experiment Summary. You can download that file directly from the bottom of this table.
• Final pipeline scores across cross-validation folds and models: This table shows the final pipeline scores
across cross-validation folds and models. Note that if Constant Model was enabled (default), then that model
is added in this table as a baseline (reference) only and will be dropped in most cases. This information is also
included in the ensemble_base_learner_fold_scores.json file of the Experiment Summary. You can download
that file directly from a link at the bottom of this table.
• Pipeline Description: This shows how the final Stacked Ensemble pipeline was calculated. This information
is also included in the ensemble_model_description.json file of the Experiment Summary. You can download
that file directly from a link at the bottom of this table.
• Final Ensemble Scores: This shows the final scores for each scorer in DAI. If a custom scorer was used in the
experiment, that scorer will also appear here. This information is also included in the ensemble_scores.json file
of the Experiment Summary. You can download that file directly from a link at the bottom of this table.
An experiment summary is available for each completed experiment. Click the Download Summary & Logs button
to download the h2oai_experiment_summary_<experiment>.zip file.
The files within the experiment summary zip provide textual explanations of the graphical representations that are
shown on the Driverless AI UI. Details of each artifact are described below.
A report file (AutoDoc) is included in the experiment summary. This report provides insight into the training data and
any detected shifts in distribution, the validation schema selected, model parameter tuning, feature evolution and the
final set of features chosen during the experiment.
• report.docx: the report available in Word format
Click here to download and view a sample experiment report in Word format.
Autoreport only supports resumed experiments for certain Driverless AI versions. See the following table to check
what types of resumed experiments are supported for your version:
Autoreport Support for Resumed Experiments:

                                    Via LTS   1.7.0 and older   1.7.1   1.8.x
New model with same parameters      yes       yes               yes     yes
Restart from last checkpoint        no        no                yes     yes
Retrain final pipeline              no        no                no      yes
Notes:
• Autoreport does not support experiments that were built off of previously aborted or failed experiments.
• Reports for unsupported resumed experiments will still build, but they will only include the following text:
“AutoDoc not yet supported for resumed experiments.”
The Experiment Summary contains artifacts that provide overviews of the experiment.
• preview.txt: Provides a preview of the experiment. (This is the same information that was included on the UI
before starting the experiment.)
• summary: Provides the same summary that appears in the lower-right portion of the UI for the experiment.
(Available in txt or json.)
• config.json: Provides a list of the settings used in the experiment.
• config_overrides_toml_string.txt: Provides any overrides for this experiment that were made to the config.toml
file.
• args_do_auto_dl.json: The internal arguments used in the Driverless AI experiment based on the dataset and
accuracy, time and interpretability settings.
• experiment_column_types.json: Provides the column types for each column included in the experiment.
• experiment_original_column.json: A list of all columns available in the dataset that was used in the experi-
ment.
• experiment_pipeline_original_required_columns.json: For columns used in the experiment, this includes the
column name and type.
• experiment_sampling_description.json: A description of the sampling performed on the dataset.
• timing.json: The timing and number of models generated in each part of the Driverless AI pipeline.
During the Driverless AI experiment, model tuning is performed to determine the optimal algorithm and parameter
settings for the provided dataset. For regression problems, target tuning is also performed to determine the best way
to represent the target column (i.e., does taking the log of the target column improve results). The results from these
tuning steps are available in the Experiment Summary.
• tuning_leaderboard: A table of the model tuning performed along with the score generated from the model
and training time. (Available in txt or json.)
• target_transform_tuning_leaderboard.txt: A table of the transforms applied to the target column along with
the score generated from the model and training time. (This will be empty for binary and multiclass use cases.)
Driverless AI performs feature engineering on the dataset to determine the optimal representation of the data. The
top features used in the final model can be seen in the GUI. The complete list of features used in the final model is
available in the Experiment Summary artifacts.
The Experiment Summary also provides a list of the original features and their estimated feature importance. For
example, given the features in the final Driverless AI model, we can estimate the feature importance of the original
features.
To calculate the feature importance of PAY_3, we can aggregate the feature importance for all variables that used
PAY_3:
• NumToCatWoE:PAY_AMT2: 1 * 0 (PAY_3 not used.)
• PAY_3: 0.92 * 1 (PAY_3 is the only variable used.)
• ClusterDist9:BILL_AMT1:LIMIT_BAL:PAY_3: 0.90 * 1/3 (PAY_3 is one of three variables used.)
Estimated Feature Importance = (1*0) + (0.92*1) + (0.9*(1/3)) = 1.22
Note: The feature importance is converted to relative feature importance. (The feature with the highest estimated
feature importance will have a relative feature importance of 1).
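A rough sketch of this aggregation is shown below. The parsing of engineered feature names is simplified and hypothetical (real transformed feature names can be more complex, and the artifact files listed below are the authoritative source), but it reproduces the PAY_3 arithmetic from the example above:

# Illustrative aggregation of original-feature importance from engineered features.
engineered = {
    # engineered feature name: importance in the final model
    "NumToCatWoE:PAY_AMT2": 1.00,
    "PAY_3": 0.92,
    "ClusterDist9:BILL_AMT1:LIMIT_BAL:PAY_3": 0.90,
}

def original_columns(feature_name):
    """Original columns referenced by an engineered feature (simplified parsing)."""
    parts = feature_name.split(":")
    return parts if len(parts) == 1 else parts[1:]   # drop the transformer prefix

def estimated_importance(original, engineered):
    total = 0.0
    for name, importance in engineered.items():
        cols = original_columns(name)
        if original in cols:
            total += importance / len(cols)          # equal share per original column
    return total

# 0.92 + 0.90/3 = 1.22; Driverless AI then rescales so the largest value is 1.
print(estimated_importance("PAY_3", engineered))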
• ensemble_features: A list of features used in the final model, a description of the feature, and the relative
feature importance. Feature importances for multiple models are linearly blended with same weights as the final
ensemble of models. (Available in txt, table, or json.)
• ensemble_features_orig: A complete list of all original features used in the final model, a description of the
feature, the relative feature importance, and the standard deviation of relative importance. (Available in txt or
json.)
• ensemble_features_orig_shift: A list of original user features used in the final model and the difference in
relative feature importance between the final model and the corresponding feature importance of the final pop-
ulation. (Available in txt or json.)
• ensemble_features_prefit: A list of features used by the best individuals in the final population, each model
blended with same weights as ensemble if ensemble used blending. (Available in txt, table, or json.)
• ensemble_features_shift: A list of features used in the final model and the difference in relative feature impor-
tance between the final model and the corresponding feature importance of the final population. (Available in
txt, table, or json.)
• features: A list of features used by the best individual pipeline (identified by the genetic algorithm) and each
feature’s relative importance. (Available in txt, table, or json.)
• features_orig: A list of original user features used by the best individual pipeline (identified by the genetic
algorithm) and each feature’s estimated relative importance. (Available in txt or json.)
• leaked_features.json: A list of all leaked features provided along with the relative importance and the standard
deviation of relative importance. (Available in txt, table, or json.)
• leakage_features_orig.json: A list of leaked original features provided and an estimate of the relative feature
importance of that leaked original feature in the final model.
• shift_features.json: A list of all features provided along with the relative importance and the shift in standard
deviation of relative importance of that feature.
• shift_features_orig.json: A list of original features provided and an estimate of the shift in relative feature
importance of that original feature in the final model.
The Experiment Summary includes artifacts that describe the final model. This is the model that is used to score new
datasets and create the MOJO scoring pipeline. The final model may be an ensemble of models depending on the
Accuracy setting.
• coefs.json: A list of coefficients and standard deviation of coefficients for features.
• ensemble.txt: A summary of the final model which includes a description of the model(s), gains/lifts table,
confusion matrix, and scores of the final model for our list of scorers.
• ensemble_base_learner_fold_scores: The internal validation scorer metrics for each base learner when the
final model is an ensemble. (Available in txt or json.)
• ensemble_description.txt: A sentence describing the final model. (For example: “Final TensorFlowModel
pipeline with ensemble_level=0 transforming 21 original features -> 54 features in each of 1 models each fit on
full training data (i.e. no hold-out).”)
• ensemble_coefs: The coefficient and standard deviation of the coefficient for each feature in the ensemble. (Available
as txt or json.)
• ensemble_coefs_shift: The coefficient and shift of coefficient for each feature in the ensemble. (Available as txt
or json.)
• ensemble_model_description.json/ensemble_model_extra_description: A json file describing the model(s)
and, for ensembles, how the model predictions are weighted.
• ensemble_model_params.json: A json file describing the parameters of the model(s).
• ensemble_folds_data.json: A json file describing the folds used for the final model(s). This includes the size of
each fold of data and the performance of the final model on each fold. (Available if a fold column was specified.)
• ensemble_features_orig: A list of the original features provided and an estimate of the relative feature impor-
tance of that original feature in the ensemble of models. (Available in txt or json.)
• ensemble_features: A complete list of all features used in the final ensemble of models, a description of the
feature, and the relative feature importance. (Available in txt, table, or json.)
• leakage_coefs.json: A list of coefficients and standard deviation of coefficients for leaked features.
• shift_coefs.json: A list of coefficients and the shift in standard deviation for those coefficients used in the
experiment.
The Experiment Summary also includes artifacts about the final model performance.
• ensemble_scores.json: The scores of the final model for our list of scorers.
• ensemble_confusion_matrix: The confusion matrix for the internal validation and test data if test data is pro-
vided.
• ensemble_confusion_matrix_stats_test.json: Confusion matrix statistics on the test data. (Only available if
test data was provided.)
• ensemble_gains: The lift and gains table for the internal validation and test data if test data is provided. (Visu-
alization of lift and gains can be seen in the UI.)
• ensemble_roc: The ROC and Precision Recall table for the internal validation and test data if test data is
provided. (Visualization of ROC and Precision Recall curve can be seen in the UI.)
• individual_scored.params_base: Detailed information about each iteration run in the experiment. (Available
in csv, table, or json.)
Click this link to open the Experiments page. From this page, you can rename an experiment, view previous experi-
ments, begin a new experiment, rerun an experiment, and delete an experiment.
In Driverless AI, you can retry an experiment from the last checkpoint, you can run a new experiment using an existing
experiment’s settings, and you can retrain an experiment’s final pipeline.
Checkpointing Experiments
In real-world scenarios, data can change. For example, you may have a model currently in production that was built
using 1 million records. At a later date, you may receive several hundred thousand more records. Rather than building
a new model from scratch, Driverless AI includes H2O.ai Brain, which enables caching and smart re-use of prior
models to generate features for new models.
You can configure one of the following Brain levels in the experiment’s Expert Settings.
• -1: Don’t use any brain cache
• 0: Don’t use any brain cache but still write to cache
• 1: Smart checkpoint if an old experiment_id is passed in (for example, via running “resume one like this” in the
GUI)
• 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time
series options identically. (default)
• 3: Smart checkpoint like level #1, but for the entire population. Tune only if the brain population is of insufficient
size.
• 4: Smart checkpoint like level #2, but for the entire population. Tune only if the brain population is of insufficient
size.
• 5: Smart checkpoint like level #4, but will scan over the entire brain cache of populations (starting from resumed
experiment if chosen) in order to get the best scored individuals.
If you choose Level 2 (default), then Level 1 is also done when appropriate.
To make use of smart checkpointing, be sure that the new data has:
• The same data column names as the old experiment
• The same data types for each column as the old experiment. (This won’t match if, e.g., a column was all int and
then had one string row.)
• The same target as the old experiment
• The same target classes (if classification) as the old experiment
• For time series, all choices for intervals and gaps must be the same
When the above conditions are met, then you can:
• Start the same kind of experiment, just rerun for longer.
• Use a smaller or larger data set (i.e. fewer or more rows).
• Effectively do a final ensemble re-fit by varying the data rows and starting an experiment with a new accuracy,
time=1, and interpretability. Check the experiment preview for what the ensemble will be.
• Restart/Resume a cancelled, aborted, or completed experiment
To run smart checkpointing on an existing experiment, click the right side of the experiment that you want to retry,
then select Restart from Last Checkpoint. The experiment settings page opens. Specify the new dataset. If desired,
you can also change experiment settings, though the target column must be the same. Click Launch Experiment to
resume the experiment from the last checkpoint and build a new experiment.
The smart checkpointing continues by adding a prior model as another model used during tuning. If that prior model
is better (which is likely if it was run for more iterations), then that smart checkpoint model will be used during feature
evolution iterations and final ensemble.
Notes:
• Driverless AI does not guarantee exact continuation, only smart continuation from any last point.
• The directory where the H2O.ai Brain meta model files are stored is tmp/H2O.ai_brain. In addition, the default
maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file.
Rerunning Experiments
To run a new experiment using an existing experiment’s settings, click the right side of the experiment that you want
to use as the basis for the new experiment, then select New Model with Same Params. This opens the experiment
settings page. From this page, you can rerun the experiment using the original settings, or you can specify to use
new data and/or specify different experiment settings. Click Launch Experiment to create a new experiment with the
same options.
To retrain an experiment’s final pipeline, click the right side of the experiment that you want to use as the basis for the
new experiment, then select Retrain Final Pipeline. This opens the experiment settings page with the same settings
as the original experiment except that Time is set to 0.
To delete an experiment, click the right side of the experiment that you want to remove, then select Delete. A confir-
mation message will display asking you to confirm the delete. Click OK to delete the experiment or Cancel to return
to the experiments page without deleting.
EIGHTEEN
DIAGNOSING A MODEL
The Diagnose Model on New Dataset option allows you to view model performance for multiple scorers based on an
existing model and dataset.
On the completed experiment page, click the Diagnose Model on New Dataset button.
Note: You can also diagnose a model by selecting Diagnostic from the top menu, then selecting an experiment and
test dataset.
Select a dataset to use when diagnosing this experiment. Note that the dataset must include the target column that is
in the original dataset. At this point, Driverless AI will begin calculating all available scores for the experiment.
When the diagnosis is complete, it will be available on the Model Diagnostics page. Click on the new diagnosis.
From this page, you can download predictions. You can also view scores and metric plots. The plots are interactive.
Click a graph to enlarge. In the enlarged view, you can hover over the graph to view details for a specific point. You
can also download the graph in the enlarged view.
Note: In the Confusion Matrix graph, the threshold value defaults to 0.5. For binary classification experiments, users
can specify a different threshold value. The threshold selector is available after clicking on the Confusion Matrix and
opening the enlarged view. When you specify a value or change the slider value, Driverless AI automatically computes
a diagnostic Confusion Matrix for that given threshold value.
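Outside of the UI, you can reproduce a confusion matrix for any threshold from downloaded predictions. The sketch below is illustrative only, with placeholder arrays standing in for the actual labels and the predicted probabilities of the diagnosed model:

import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels and predicted probabilities for the diagnostic dataset.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
p_pred = np.array([0.2, 0.7, 0.4, 0.1, 0.9, 0.6, 0.3, 0.8])

# Recompute the confusion matrix for several candidate thresholds.
for threshold in (0.3, 0.5, 0.7):
    y_class = (p_pred >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_class).ravel()
    print(threshold, dict(TN=tn, FP=fp, FN=fn, TP=tp))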
NINETEEN
PROJECT WORKSPACE
Driverless AI provides a Project Workspace for managing datasets and experiments related to a specific business
problem or use case. Whether you are trying to detect fraud or predict user retention, datasets and experiments can
be stored and saved in the individual projects. A Leaderboard on the Projects page allows you to easily compare
performance and results and identify the best solution for your problem.
To create a Project Workspace:
1. Click the Projects option on the top menu.
2. Click New Project.
3. Specify a name for the project and provide a description.
4. Click Create Project. This creates an empty Project page.
From the Projects page, you can link datasets and/or experiments, and you can run new experiments. When you link
an existing experiment to a Project, the datasets used for the experiment will automatically be linked to this project (if
not already linked).
Any dataset that has been added to Driverless AI can be linked to a project. In addition, when you link an experiment,
the datasets used for that experiment are also automatically linked to the project.
To link a dataset:
1. Select Training, Validation, or Test from the dropdown menu.
2. Click the Link Dataset button.
3. Select the dataset(s) that you want to link.
The list of available datasets includes those that were added on The Datasets Page, or you can browse to datasets in your
file system. Be sure to select the correct dropdown option before linking a training, validation, or test dataset. This
is because, when you run a new experiment in the project, the training data, validation data, and test data options for
that experiment come from the list of datasets linked here. You will not be able to, for example, select any datasets from
within the Training tab when specifying a test dataset on the experiment.
When datasets are linked, the same menu options are available here as on the Datasets page. Refer to The Datasets
Page section for more information.
Existing experiments can be selected and linked to a Project. Additionally, you can run a new experiment or checkpoint
an existing experiment from this page, and those experiments will automatically be linked to this Project.
Link an existing experiment to the project by clicking Link Experiment and then selecting the experiment(s) to
include. When you link experiments, the datasets used to create the experiments are also automatically linked.
In the Datasets section, you can select a training, validation, or testing dataset. The Experiments section will show
experiments in the Project that use the selected dataset.
When experiments are run from within a Project, only linked datasets or datasets available on the file system can be
used.
1. Click the New Experiment link to begin a new experiment.
2. Select your training data and optionally your validation and/or testing data.
3. Specify your desired experiment settings (refer to Experiment Settings and Expert Settings), and then click
Launch Experiment.
As the experiment is running, it will be listed at the top of the Experiments Leaderboard until it is completed. It will
also be available on the Experiments page.
When experiments are linked to a Project, the same checkpointing options for experiments are available here as on the
Experiments page. Refer to Checkpointing, Rerunning, and Retraining for more information.
When attempting to solve a business problem, a normal workflow will include running multiple experiments, either
with different/new data or with a variety of settings, and the optimal solution can vary for different users and/or
business problems. For some users, the model with the highest accuracy for validation and test data could be the most
optimal one. Other users might be willing to make an acceptable compromise on the accuracy of the model for a
model with greater performance (faster prediction). For some, it could also mean how quickly the model could be
trained with acceptable levels of accuracy. The Experiments list makes it easy for you to find the best solution for your
business problem.
The list is organized based on experiment name. You can change the sorting of experiments by selecting the up/down
arrows beside a column heading in the experiment menu.
Hover over the right menu of an experiment to view additional information about the experiment, including the prob-
lem type, datasets used, and the target column.
Experiments linked to projects do not automatically include a test score. To view Test Scores in the Leaderboard, you
must first complete the scoring step for a particular dataset and experiment combination. Without the scoring step, no
scoring data is available to populate in the Test Score and Score Time columns. Experiments that do not include a test
score or that have an invalid scorer (for example, if the R2 scorer is selected for classification experiments) show N/A
in the Leaderboard. Also, if None is selected for the scorer, then all experiments will show N/A.
To score the experiment:
1. Click Select Scoring Dataset at the top of the Experiments list and select a linked Test Dataset or a test dataset
available on the file system.
2. Select the model or models that you want to score.
3. Click the Select Scorer button at the top of the Experiments list and select a scorer.
4. Click the Score n Items button.
This starts the Model Diagnostic process and scores the selected experiment(s) against the selected scorer and dataset.
(Refer to Diagnosing a Model for more information.) Upon completion, the experiment(s) will be populated with a
test score, and the performance information will also be available on the Model Diagnostics page.
Notes:
• If an experiment has already scored a dataset, Driverless AI will not score it again. The scoring step is deter-
ministic, so for a particular test dataset and experiment combination, the score will be same regardless of how
many times you repeat it.
• The test dataset must have all the columns that are expected by the various experiments you are
scoring it on. However, the columns of the test dataset need not be exactly the same as the input features expected by
the experiment; the test dataset can contain additional columns. If these columns were not used for training,
they will be ignored. This feature gives you the ability to train experiments on different training datasets (i.e.,
having different features), and if you have an “uber test dataset” that includes all these feature columns, then
you can use the same dataset to score these experiments.
• You will notice a Score Time in the Experiments Leaderboard. This value shows the total time (in seconds) that
it took to calculate the experiment scores for all applicable scorers for the experiment type. This is valuable
to users who need to estimate the runtime performance of an experiment.
You can compare two or three experiments and view side-by-side detailed information about each.
1. Click the Select button at the top of the Leaderboard and select either two or three experiments that you want to
compare. You cannot compare more than three experiments.
2. Click the Compare n Items button.
This opens the Compare Experiments page. This page includes the experiment summary and metric plots for each
experiment. The metric plots vary depending on whether this is a classification or regression experiment.
For classification experiments, this page includes:
• Variable Importance list
• Confusion Matrix
• ROC Curve
• Precision Recall Curve
• Lift Chart
• Gains Chart
• Kolmogorov-Smirnov Chart
For regression experiments, this page includes:
• Variable Importance list
• Actual vs. Predicted Graph
Unlinking datasets and/or experiments does not delete that data from Driverless AI. The datasets and experiments will
still be available on the Datasets and Experiments pages.
• Unlink a dataset by clicking on the dataset and selecting Unlink from the menu. Note: You cannot unlink
datasets that are tied to experiments in the same project.
• Unlink an experiment by clicking on the experiment and selecting Unlink from the menu. Note that this will
not automatically unlink datasets that were tied to the experiment.
To delete a project, click the Projects option on the top menu to open the main Projects page. Click the dotted menu
in the right-most column, and then select Delete. You will be prompted to confirm the deletion.
Note that deleting projects does not delete datasets and experiments from Driverless AI. Any datasets and experiments
from deleted projects will still be available on the Datasets and Experiments pages.
TWENTY
MLI OVERVIEW
Driverless AI provides robust interpretability of machine learning models to explain modeling results in a human-
readable format. In the Machine Learning Interpretability (MLI) view, Driverless AI employs a host of different
techniques and methodologies for interpreting and explaining the results of its models. A number of charts are gener-
ated automatically (depending on experiment type), including K-LIME, Shapley, Variable Importance, Decision Tree
Surrogate, Partial Dependence, Individual Conditional Expectation, Sensitivity Analysis, NLP Tokens, NLP LOCO,
and more. Additionally, you can download a CSV of LIME and Shapley reason codes from this view.
This chapter describes Machine Learning Interpretability (MLI) in Driverless AI for both regular and time-series
experiments. Refer to the following sections for more information:
• The Interpreted Models Page
• MLI for Regular (Non-Time-Series) Experiments
• MLI for Time-Series Experiments
Additional Resources
• Click here to download our MLI cheat sheet.
• “An Introduction to Machine Learning Interpretability” book.
• Click here to access the H2O.ai MLI Resources repository. This repo includes materials that illustrate applica-
tions or adaptations of various MLI techniques for practicing data scientists.
• Click here to view our H2O Driverless AI Machine Learning Interpretability walkthrough video.
Limitations
• This release deprecates experiments run in 1.7.0 and earlier. MLI will not be available for experiments from
versions <= 1.7.0.
• MLI is not supported for multiclass Time Series experiments.
• MLI does not require an Internet connection to run on current models.
TWENTYONE
THE INTERPRETED MODELS PAGE
Click the MLI link in the upper-right corner of the UI to view a list of interpreted models.
You can sort this page by Name, Target, Model, Dataset, N-Folds, Feature Set, Cluster Col, LIME Method, Status, or
ETA/Runtime.
Click the right-most column of an interpreted model to view an additional menu. This menu allows you to open,
rename, or delete the interpretation.
Click on an interpreted model to view the MLI page for that interpretation. The MLI page that displays will vary
depending on whether the experiment was a regular experiment or a time series experiment.
TWENTYTWO
MLI FOR REGULAR (NON-TIME-SERIES) EXPERIMENTS
This section describes MLI functionality and features for regular experiments. Refer to MLI for Time-Series Experi-
ments for MLI information with time-series experiments.
There are two methods you can use for interpreting models:
• Using the Interpret this Model button on a completed experiment page to interpret a Driverless AI model on
original and transformed features.
• Using the MLI link in the upper right corner of the UI to interpret either a Driverless AI model or an external
model.
Notes:
• Experiments run in 1.7.0 and earlier are deprecated in this release. MLI will not be available for experiments
from versions <= 1.7.0.
• MLI does not require an Internet connection to run on current models.
Clicking the Interpret this Model button on a completed experiment page launches the Model Interpretation for that
experiment. Python and Java logs can be viewed for non-time-series experiments while the interpretation is running.
For non-time-series experiments, this page provides several visual explanations and reason codes for the trained Driver-
less AI model and its results. More information about this page is available in the Understanding the Model Interpretation
Page section later in this chapter.
This method allows you to run model interpretation on a Driverless AI model. This method is similar to clicking
“Interpret This Model” on an experiment summary page.
1. Click the MLI link in the upper-right corner of the UI to view a list of interpreted models.
5. Select a LIME method of either K-LIME (default) or LIME-SUP.
• K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate
GLMs on samples formed from k-means clusters in the training data. The features used for k-means are selected
from the Random Forest surrogate model’s variable importance. The number of features used for k-means is the
minimum of the top 25% of variables from the Random Forest surrogate model’s variable importance and the
max number of variables that can be used for k-means, which is set by the user in the config.toml setting for
mli_max_number_cluster_vars. (Note, if the number of features in the dataset is less than or equal to
6, then all features are used for k-means clustering.) The previous setting can be turned off to use all features
for k-means by setting use_all_columns_klime_kmeans in the config.toml file to true. All penalized
GLM surrogates are trained to model the predictions of the Driverless AI model. The number of clusters for
local explanations is chosen by a grid search in which the 𝑅2 between the Driverless AI model predictions
and all of the local K-LIME model predictions is maximized. The global and local linear model’s intercepts,
coefficients, 𝑅2 values, accuracy, and predictions can all be used to debug and develop explanations for the
Driverless AI model’s behavior.
• LIME-SUP explains local regions of the trained Driverless AI model in terms of the original variables. Local
regions are defined by each leaf node path of the decision tree surrogate model instead of simulated, perturbed
observation samples - as in the original LIME. For each local region, a local GLM model is trained on the
original inputs and the predictions of the Driverless AI model. Then the parameters of this local GLM can be
used to generate approximate, local explanations of the Driverless AI model.
6. For K-LIME interpretations, specify the depth that you want for your decision tree surrogate model. The tree
depth value can be a value from 2-5 and defaults to 3. For LIME-SUP interpretations, specify the LIME-SUP
tree depth. This can be a value from 2-5 and defaults to 3.
7. Specify whether to use original features or transformed features in the surrogate model for the new interpretation.
Note: If Use Original Features for Surrogate Models is disabled, then the K-LIME clustering column option
will not be available, and quantile binning will not be available.
8. Specify whether to perform the interpretation on a sample of the training data. By default, MLI will sample the
training dataset if it is greater than 100k rows. (Note that this value can be modified in the config.toml setting
for mli_sample_size.) Turn this toggle off to run MLI on the entire dataset.
9. Optionally specify weight and dropped columns.
10. For K-LIME interpretations, optionally specify a clustering column. Note that this column should be categorical.
Also note that this is only available when K-LIME is used as the LIME method and when Use Original Features
for Surrogate Models is enabled. If the LIME method is changed to LIME-SUP, then this option is no longer
available.
11. Optionally specify the number of surrogate cross-validation folds to use (from 0 to 10). When running ex-
periments, Driverless AI automatically splits the training data and uses the validation data to determine the
performance of the model parameter tuning and feature engineering steps. For a new interpretation, Driverless
AI uses 3 cross-validation folds by default for the interpretation.
12. For K-LIME interpretations, optionally specify one or more columns to generate decile bins (uniform distribu-
tion) to help with MLI accuracy. Columns selected are added to top n columns for quantile binning selection.
If a column is not numeric or not in the dataset (transformed features), then the column will be skipped. Note:
This option is only available when Use Original Features for Surrogate Models is enabled.
13. For K-LIME interpretations, optionally specify the number of top variable importance numeric columns to run
decile binning to help with MLI accuracy. (Note that variable importances are generated from a Random Forest
model.) This defaults to 0, and the maximum value is 10. Note: This option is only available when Use Original
Features for Surrogate Models is enabled.
14. Optionally specify the number of top features for which partial dependence and ICE will be computed. This
value defaults to 10. Setting a value greater than 10 can significantly increase the computation time. Setting this
to -1 specifies to use all features.
15. Click the Launch MLI button.
Model Interpretation does not need to be run on a Driverless AI experiment. You can train an external model and run
Model Interpretability on the predictions.
1. Click the MLI link in the upper-right corner of the UI to view a list of interpreted models.
Note: When running interpretations on an external model, leave the Select Model option empty. That option is
for selecting a Driverless AI model.
4. Specify a Target Column (actuals) and the Prediction Column (scores from the model).
5. Select a LIME method of either K-LIME (default) or LIME-SUP.
• K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate
GLMs on samples formed from k-means clusters in the training data. The features used for k-means are selected
from the Random Forest surrogate model’s variable importance. The number of features used for k-means is the
minimum of the top 25% of variables from the Random Forest surrogate model’s variable importance and the
max number of variables that can be used for k-means, which is set by the user in the config.toml setting for
mli_max_number_cluster_vars. (Note, if the number of features in the dataset is less than or equal to
6, then all features are used for k-means clustering.) The previous setting can be turned off to use all features
for k-means by setting use_all_columns_klime_kmeans in the config.toml file to true. All penalized
GLM surrogates are trained to model the predictions of the Driverless AI model. The number of clusters for
local explanations is chosen by a grid search in which the 𝑅2 between the Driverless AI model predictions
and all of the local K-LIME model predictions is maximized. The global and local linear model’s intercepts,
coefficients, 𝑅2 values, accuracy, and predictions can all be used to debug and develop explanations for the
Driverless AI model’s behavior.
• LIME-SUP explains local regions of the trained Driverless AI model in terms of the original variables. Local
regions are defined by each leaf node path of the decision tree surrogate model instead of simulated, perturbed
observation samples - as in the original LIME. For each local region, a local GLM model is trained on the
original inputs and the predictions of the Driverless AI model. Then the parameters of this local GLM can be
used to generate approximate, local explanations of the Driverless AI model.
6. For K-LIME interpretations, specify the depth that you want for your decision tree surrogate model. The tree
depth value can be a value from 2-5 and defaults to 3. For LIME-SUP interpretations, specify the LIME-SUP
tree depth. This can be a value from 2-5 and defaults to 3.
7. Specify whether to perform the interpretation on a sample of the training data. By default, MLI will sample the
training dataset if it is greater than 100k rows. (Note that this value can be modified in the config.toml setting
for mli_sample_size.) Turn this toggle off to run MLI on the entire dataset.
8. Optionally specify weight and dropped columns.
9. For K-LIME interpretations, optionally specify a clustering column. Note that this column should be categorical.
Also note that this is only available when K-LIME is used as the LIME method. If the LIME method is changed
to LIME-SUP, then this option is no longer available.
10. Optionally specify the number of surrogate cross-validation folds to use (from 0 to 10). When running ex-
periments, Driverless AI automatically splits the training data and uses the validation data to determine the
performance of the model parameter tuning and feature engineering steps. For a new interpretation, Driverless
AI uses 3 cross-validation folds by default for the interpretation.
11. For K-LIME interpretations, optionally specify one or more columns to generate decile bins (uniform distribu-
tion) to help with MLI accuracy. Columns selected are added to top n columns for quantile binning selection. If
a column is not numeric or not in the dataset (transformed features), then the column will be skipped.
12. For K-LIME interpretations, optionally specify the number of top variable importance numeric columns to run
decile binning to help with MLI accuracy. (Note that variable importances are generated from a Random Forest
model.) This value is combined with any specific columns selected for quantile binning. This defaults to 0, and
the maximum value is 10.
13. Optionally specify the number of top features for which partial dependence and ICE will be computed. This
value defaults to 10. Setting a value greater than 10 can significantly increase the computation time. Setting this
to -1 specifies to use all features.
14. Click the Launch MLI button.
This section describes the features on the Model Interpretation page for non-time-series experiments.
The Model Interpretation page opens with a Summary of the interpretation and also provides a row search feature on
the top of the page:
• Row Selection: Provides the ability to search for a particular row by Row Number or by Identifier Column. See
the Row Selection section for more information.
This page also provides left-hand navigation for viewing additional plots. This navigation includes:
• Summary: Provides a summary of the MLI experiment. See the Summary Page section for more information.
• DAI Model: See DAI Model Dropdown for more information.
– For binary classification and regression experiments, the DAI Model menu provides the following plots
for Driverless AI models:
– Feature Importance for transformed features
– Shapley plot for transformed features
– Partial Dependence/ICE
– Disparate Impact Analysis
– Sensitivity Analysis
The row selection feature allows a user to search for a particular observation by row number or by an identifier column.
Identifier columns cannot be specified by the user - MLI makes this choice automatically by choosing columns whose
values are unique (dataset row count equals the number of unique values in a column). To find a row by identifier
column, choose Identifier Column from the drop-down menu (if it meets the logic of being an identifier column), and
then specify a value. In addition to identifier columns, the drop-down menu also allows you to find a row using Row
Number.
The Summary page is the first page that opens when you view an interpretation. This page provides an overview of the
interpretation, including the dataset and Driverless AI experiment (if available) that were used for the interpretation
along with the feature space (original or transformed), target column, problem type, and k-Lime information. If the
interpretation was created from a Driverless AI model, then a table with the Driverless AI model summary is also
included along with the top variables for the model.
This menu provides a Feature Importance plot and a Shapley plot (not supported for RuleFit and TensorFlow
models) for transformed features as well as Partial Dependence/ICE, Disparate Impact Analysis (DIA), Sensitiv-
ity Analysis, NLP Tokens and NLP LOCO (for text experiments), and Permutation Feature Importance (if the
autodoc_include_permutation_feature_importance configuration option is enabled) plots for Driver-
less AI models.
Note: On the Feature Importance and Shapley plots, the transformed feature names are encoded as follows:
<transformation/gene_details_id>_<transformation_name>:<orig>:<...>:<orig>.<extra>
So in 32_NumToCatTE:BILL_AMT1:EDUCATION:MARRIAGE:SEX.0, for example:
• 32_ is the transformation index for specific transformation parameters.
• NumToCatTE is the transformation name.
• BILL_AMT1, EDUCATION, MARRIAGE, and SEX are the original features used in the transformation.
• 0 corresponds to the <extra> suffix appended by the transformation.
Feature Importance
This plot is available for all models for binary classification, multiclass classification, and regression experiments.
This plot shows the Driverless AI feature importance. Driverless AI feature importance is a measure of the contribution
of an input variable to the overall predictions of the Driverless AI model. Global feature importance is calculated by
aggregating the improvement in splitting criterion caused by a single variable across all of the decision trees in the
Driverless AI model.
Shapley Plot
This plot is not available for RuleFit or TensorFlow models. For all other models, this plot is available for binary
classification, multiclass classification, and regression experiments.
Shapley explanations are a technique with credible theoretical support that presents consistent global and local variable
contributions. Local numeric Shapley values are calculated by tracing single rows of data through a trained tree en-
semble and aggregating the contribution of each input variable as the row of data moves through the trained ensemble.
For regression tasks, Shapley values sum to the prediction of the Driverless AI model. For classification problems,
Shapley values sum to the prediction of the Driverless AI model before applying the link function. Global Shapley
values are the average of the absolute Shapley values over every row of a dataset.
More information is available at https://arxiv.org/abs/1706.06060.
The Showing n Features dropdown for Feature Importance and Shapley plots allows you to select between original
and transformed features. If there are a significant number of features, they are organized in numbered pages that
can be viewed individually. Note: The provided original values are approximations derived from the accompanying
transformed values. For example, if the transformed feature feature1_feature2 has a value of 0.5, then the value of
the original features (feature1 and feature2) will each be 0.25.
A Partial Dependence and ICE plot is available for both Driverless AI and surrogate models.
Partial dependence is a measure of the average model prediction with respect to an input variable. Partial dependence
plots display how machine-learned response functions change based on the values of an input variable of interest, while
taking nonlinearity into consideration and averaging out the effects of all other input variables. Partial dependence plots
are well-known and described in the Elements of Statistical Learning (Hastie et al, 2001). Partial dependence plots
enable increased transparency in Driverless AI models and the ability to validate and debug Driverless AI models by
comparing a variable’s average predictions across its domain to known standards, domain knowledge, and reasonable
expectations.
Taking the Driverless AI model as F(X), assuming credit scores vary from 500 to 800 in the training data, and that
increments of 30 are used to plot the ICE curve, ICE is calculated as follows:
ICE_{credit_score,500} = F(30, 500, 1000)
ICE_{credit_score,530} = F(30, 530, 1000)
ICE_{credit_score,560} = F(30, 560, 1000)
...
ICE_{credit_score,800} = F(30, 800, 1000)
The one-dimensional partial dependence plots displayed here do not take interactions into account. Large differences
in partial dependence and ICE are an indication that strong variable interactions may be present. In this case partial
dependence plots may be misleading because average model behavior may not accurately reflect local behavior.
The following are formulas for error metrics and parity checks utilized by binary DIA. Note that in the tables below:
• tp = true positive
• fp = false positive
• tn = true negative
• fn = false negative
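For reference, the standard confusion-matrix formulas that such binary metrics are built from are listed below; the exact set of metrics and parity checks reported by DIA may differ from this list.

Accuracy = (tp + tn) / (tp + fp + tn + fn)
True Positive Rate (Recall) = tp / (tp + fn)
Precision = tp / (tp + fp)
Specificity (True Negative Rate) = tn / (tn + fp)
False Positive Rate = fp / (fp + tn)
False Negative Rate = fn / (fn + tp)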
Metrics - Regression
• Users are encouraged to consider the explanation dashboard to understand and augment results from disparate
impact analysis. In addition to its established use as a fairness tool, users may want to consider disparate impact
for broader model debugging purposes. For example, users can analyze the supplied confusion matrices and
group metrics for important, non-demographic features in the Driverless AI model.
Sensitivity Analysis
along with external information and processes to understand whether their institution has enough cash on hand
to be prepared for the simulated crisis.
• Random: Set variables to random values, and then rescore the model. This lets the user uncover model behavior
in scenarios they might not have thought to test.
Additional Resources
Sensitivity Analysis on a Driverless AI Model: This ipynb uses the UCI credit card default data to perform sensitivity
analysis and test model performance.
NLP Tokens
NLP LOCO
The Surrogate Models dropdown includes K-LIME/LIME-SUP and Decision Tree plots as well as a Random Forest
submenu, which includes Global and Local Feature Importance plots for original features and a Partial Dependence
plot.
Note: For multiclass classification experiments, only the Global and Local Feature Importance plot for the Random
Forest surrogate model is available in this dropdown.
The MLI screen includes a K-LIME or LIME-SUP graph. A K-LIME graph is available by default when you interpret
a model from the experiment page. When you create a new interpretation, you can instead choose to use LIME-SUP
as the LIME method. Note that these graphs are essentially the same, but the K-LIME/LIME-SUP distinction provides
insight into the LIME method that was used during model interpretation.
K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate
GLMs on samples formed from k-means clusters in the training data. The features used for k-means are selected from
the Random Forest surrogate model’s variable importance. The number of features used for k-means is the minimum
of the top 25% of variables from the Random Forest surrogate model’s variable importance and the max number of
variables that can be used for k-means, which is set by the user in the config.toml setting for
mli_max_number_cluster_vars. (Note, if the number of features in the dataset is less than or equal to 6,
then all features are used for k-means clustering.) The previous setting can be turned off to use all features for k-means
by setting use_all_columns_klime_kmeans in the config.toml file to true. All penalized GLM surrogates
are trained to model the predictions of the Driverless AI model. The number of clusters for local explanations is
chosen by a grid search in which the 𝑅2 between the Driverless AI model predictions and all of the local K-LIME
model predictions is maximized. The global and local linear model’s intercepts, coefficients, 𝑅2 values, accuracy, and
predictions can all be used to debug and develop explanations for the Driverless AI model’s behavior.
The parameters of the global K-LIME model give an indication of overall linear feature importance and the overall
average direction in which an input variable influences the Driverless AI model predictions. The global model is also
used to generate explanations for very small clusters (N < 20) where fitting a local linear model is inappropriate.
The in-cluster linear model parameters can be used to profile the local region, to give an average description of the
important variables in the local region, and to understand the average direction in which an input variable affects the
Driverless AI model predictions. For a point within a cluster, the sum of the local linear model intercept and the
products of each coefficient with their respective input variable value are the K-LIME prediction. By disaggregating
the K-LIME predictions into individual coefficient and input variable value products, the local linear impact of the
variable can be determined. This product is sometimes referred to as a reason code and is used to create explanations
for the Driverless AI model’s behavior.
In the following example, reason codes are created by evaluating and disaggregating a local linear model.
Given the row of input data with its corresponding Driverless AI and K-LIME predictions:
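As an illustrative sketch only (the intercept and coefficients below are assumed for illustration), suppose the row has debt_to_income_ratio = 30, credit_score = 600, and savings_acct_balance = 1000, the Driverless AI model prediction is 0.85, and the local K-LIME GLM for the row's cluster is:

y_K-LIME = 0.1 + 0.01 * debt_to_income_ratio + 0.0005 * credit_score + 0.0002 * savings_acct_balance

Disaggregating the local prediction into its individual coefficient and input variable value products gives the reason codes:

0.01 * 30 = 0.30 (contribution of debt_to_income_ratio)
0.0005 * 600 = 0.30 (contribution of credit_score)
0.0002 * 1000 = 0.20 (contribution of savings_acct_balance)
y_K-LIME = 0.1 + 0.30 + 0.30 + 0.20 = 0.90

Each coefficient-times-value product is a reason code for the row, and the local K-LIME prediction of 0.90 can be compared with the Driverless AI prediction of 0.85 to gauge how well the local linear model approximates the Driverless AI model at this point.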
When the local K-LIME GLM does not approximate the Driverless AI model well, nonlinear LOCO feature importance values may be a better explanatory tool for local model behavior. As
K-LIME local explanations rely on the creation of k-means clusters, extremely wide input data or strong correlation
between input variables may also degrade the quality of K-LIME local explanations.
Decision Tree
The decision tree surrogate model increases the transparency of the Driverless AI model by displaying an approximate
flow-chart of the complex Driverless AI model’s decision making process. The decision tree surrogate model also
displays the most important variables in the Driverless AI model and the most important interactions in the Driverless
AI model. The decision tree surrogate model can be used for visualizing, validating, and debugging the Driverless
AI model by comparing the displayed decision-process, important variables, and important interactions to known
standards, domain knowledge, and reasonable expectations.
A surrogate model is a data mining and engineering technique in which a generally simpler model is used to explain
another, usually more complex, model or phenomenon. The decision tree surrogate is known to date back at least to
1996 (Craven and Shavlik). The decision tree surrogate model here is trained to predict the predictions of the more
complex Driverless AI model using the original model inputs. The trained surrogate model enables a heuristic un-
derstanding (i.e., not a mathematically precise understanding) of the mechanisms of the highly complex and nonlinear
Driverless AI model.
The Random Forest dropdown provides a submenu that includes a Feature Importance plot, a Partial Dependence plot,
and a LOCO plot. These plots are for original features rather than transformed features.
Feature Importance
LOCO
Local feature importance describes how the combination of the learned model rules or parameters and an individual
row’s attributes affect a model’s prediction for that row while taking nonlinearity and interactions into account. Local
feature importance values reported here are based on a variant of the leave-one-covariate-out (LOCO) method (Lei et
al, 2017).
In the LOCO-variant method, each local feature importance is found by re-scoring the trained Driverless AI model
for each feature in the row of interest, while removing the contribution to the model prediction of splitting rules that
contain that feature throughout the ensemble. The original prediction is then subtracted from this modified prediction
to find the raw, signed importance for the feature. All local feature importance values for the row are then scaled
between 0 and 1 for direct comparison with global feature importance values.
Given the row of input data with its corresponding Driverless AI and K-LIME predictions:
Taking the Driverless AI model as F(X), LOCO-variant feature importance values are calculated as follows.
First, the modified predictions are calculated:
F_debt_to_income_ratio = F(NA, 600, 1000) = 0.99
F_credit_score = F(30, NA, 1000) = 0.73
F_savings_acct_balance = F(30, 600, NA) = 0.82
Second, the original prediction is subtracted from each modified prediction to generate the unscaled local feature
importance values:
LOCO_debt_to_income_ratio = F_debt_to_income_ratio - 0.85 = 0.99 - 0.85 = 0.14
LOCO_credit_score = F_credit_score - 0.85 = 0.73 - 0.85 = -0.12
LOCO_savings_acct_balance = F_savings_acct_balance - 0.85 = 0.82 - 0.85 = -0.03
Finally LOCO values are scaled between 0 and 1 by dividing each value for the row by the maximum value for the
row and taking the absolute magnitude of this quotient.
Scaled(LOCO_debt_to_income_ratio) = Abs(LOCO_debt_to_income_ratio / 0.14) = 1
Scaled(LOCO_credit_score) = Abs(LOCO_credit_score / 0.14) = 0.86
Scaled(LOCO_savings_acct_balance) = Abs(LOCO_savings_acct_balance / 0.14) = 0.21
One drawback to these LOCO-variant feature importance values is that, unlike K-LIME, it is difficult to generate a
mathematical error rate that indicates when LOCO values may be questionable.
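A minimal sketch of the LOCO-variant arithmetic described above is shown below. The modified predictions are taken as given (in practice they come from re-scoring the ensemble with the feature's splitting rules removed), the row values match the worked example, and nothing here is part of the Driverless AI API.

def loco_importance(original_pred, modified_preds):
    """modified_preds maps each feature to the prediction obtained with that
    feature's contribution removed. Returns raw (signed) and scaled LOCO values."""
    raw = {f: pred - original_pred for f, pred in modified_preds.items()}
    max_val = max(raw.values())                              # maximum value for the row
    scaled = {f: abs(v / max_val) for f, v in raw.items()}   # divide by max, take absolute value
    return raw, scaled

# Values from the worked example above (original prediction 0.85):
raw, scaled = loco_importance(
    0.85,
    {"debt_to_income_ratio": 0.99, "credit_score": 0.73, "savings_acct_balance": 0.82},
)
# raw    ~ {debt_to_income_ratio: 0.14, credit_score: -0.12, savings_acct_balance: -0.03}
# scaled ~ {debt_to_income_ratio: 1.0, credit_score: 0.86, savings_acct_balance: 0.21}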
A Partial Dependence and ICE plot is available for both Driverless AI and surrogate models. Refer to the previous
Partial Dependence and Individual Conditional Expectation section for more information about this plot.
These plots are available for natural language processing (NLP) models.
For NLP surrogate models, Driverless AI creates a TF-IDF matrix by tokenizing all text features. The resulting frame
is appended to the numerical or categorical columns from the training dataset, and the original text columns are removed.
This frame is then used to train surrogate models whose predictor columns consist of the tokens and the original
numerical or categorical features.
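A minimal sketch of this frame construction, using scikit-learn's TfidfVectorizer as a stand-in for the internal tokenization (the column names and data are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training frame with one text column and one numeric column.
df = pd.DataFrame({
    "review_text": ["great product", "terrible support", "great support"],
    "age": [34, 51, 28],
})

# Tokenize the text column into a TF-IDF matrix (one column per token in the corpus).
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(df["review_text"])
tfidf_df = pd.DataFrame(tfidf.toarray(),
                        columns=vectorizer.get_feature_names_out())  # scikit-learn >= 1.0

# Append the token columns to the remaining columns and drop the original text column,
# yielding the kind of frame used to train the surrogate models.
surrogate_frame = pd.concat([df.drop(columns=["review_text"]), tfidf_df], axis=1)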
Notes:
• MLI support for NLP is not available for multinomial experiments.
• Each row in the TF-IDF matrix contains N columns, where N is the total number of tokens in the corpus, with
values that are appropriate for that row (0 if a token is absent from that row).
• Driverless AI does not currently generate a K-LIME scoring pipeline for MLI NLP problems.
Note: Not all explanatory functionality is available for multinomial classification scenarios.
Driverless AI provides easy-to-read explanations for a completed model. You can view these by clicking the Expla-
nations button on the Model Interpretation > Dashboard page for an interpreted model.
The UI allows you to view global, cluster-specific, and local reason codes. You can also export the explanations to
CSV.
• Global Reason Codes: To view global reason codes, select Global from the Cluster dropdown.
With Global selected, click the Explanations button beside the Cluster dropdown.
• Cluster Reason Codes: To view reason codes for a specific cluster, select a cluster from the Cluster dropdown.
• Local Reason Codes by Row Number: To view local reason codes for a specific row, select a point on the
graph or type a value in the Row Number field.
• Local Reason Codes by ID: Identifier columns cannot be specified by the user - MLI makes this choice auto-
matically by choosing columns whose values are unique (dataset row count equals the number of unique values
in a column). To find a row by identifier column, choose Identifier Column from the drop-down menu (if it
meets the logic of being an identifier column), and then specify a value.
For years, common sense has deemed the complex, intricate formulas created by training machine learning algorithms
to be uninterpretable. While great advances have been made in recent years to make these often nonlinear, non-
monotonic, and non-continuous machine-learned response functions more understandable (Hall et al, 2017), it is
likely that such functions will never be as directly or universally interpretable as more traditional linear models.
Why consider machine learning approaches for inferential purposes? In general, linear models focus on understanding
and predicting average behavior, whereas machine-learned response functions can often make accurate, but more
difficult to explain, predictions for subtler aspects of the modeled phenomenon. In a sense, linear models create very exact
interpretations for approximate models. The approach here seeks to make approximate explanations for very exact
models. It is quite possible that an approximate explanation of an exact model may have as much, or more, value and
meaning than the exact interpretations of an approximate model. Moreover, the use of machine learning techniques
for inferential or predictive purposes does not preclude using linear models for interpretation (Ribeiro et al, 2016).
It is well understood that for the same set of input variables and prediction targets, complex machine learning al-
gorithms can produce multiple accurate models with very similar, but not exactly the same, internal architectures
(Breiman, 2001). This alone is an obstacle to interpretation, but when using these types of algorithms as interpretation
tools or with interpretation tools it is important to remember that details of explanations will change across multiple
accurate models.
• The decision tree surrogate is a global, nonlinear description of the Driverless AI model behavior. Variables that
appear in the tree should have a direct relationship with variables that appear in the global feature importance
plot. For certain, more linear Driverless AI models, variables that appear in the decision tree surrogate model
may also have large coefficients in the global K-LIME model.
• K-LIME explanations are linear, do not consider interactions, and represent offsets from the local linear model
intercept. LOCO importance values are nonlinear, do consider interactions, and do not explicitly consider a
linear intercept or offset. LIME explanations and LOCO importance values are not expected to have a direct
relationship but can align roughly as both are measures of a variable’s local impact on a model’s predictions,
especially in more linear regions of the Driverless AI model’s learned response function.
• ICE is a type of nonlinear sensitivity analysis which has a complex relationship to LOCO feature importance
values. Comparing ICE to LOCO can only be done at the value of the selected variable that actually appears in
the selected row of the training data. When comparing ICE to LOCO the total value of the prediction for the
row, the value of the variable in the selected row, and the distance of the ICE value from the average prediction
for the selected variable at the value in the selected row must all be considered.
• ICE curves that are outside the standard deviation of partial dependence would be expected to fall into less
populated decision paths of the decision tree surrogate; ICE curves that lie within the standard deviation of
partial dependence would be expected to belong to more common decision paths.
• Partial dependence takes into consideration nonlinear, but average, behavior of the complex Driverless AI model
without considering interactions. Variables with consistently high partial dependence or partial dependence that
swings widely across an input variable’s domain will likely also have high global importance values. Strong
interactions between input variables can cause ICE values to diverge from partial dependence values.
TWENTYTHREE
MLI FOR TIME-SERIES EXPERIMENTS
This section describes how to run MLI for time-series experiments. Refer to MLI for Regular (Non-Time-Series)
Experiments for MLI information with regular experiments.
There are two methods you can use for interpreting time-series models:
• Using the Interpret this Model button on a completed experiment page to interpret a Driverless AI model on
original and transformed features. (See below.)
• Using the MLI link in the upper right corner of the UI to interpret either a Driverless AI model or an external
model. This process is described in the Model Interpretation on Driverless AI Models and Model Interpretation
on External Models sections.
Limitations
• This release deprecates experiments run in 1.7.0 and earlier. MLI will not be available for experiments from
versions <= 1.7.0.
• MLI is not available for NLP experiments or for multiclass Time Series.
• When the test set contains actuals, you will see the time series metric plot and the group metrics table. If there
are no actuals, MLI will run, but you will see only the prediction value time series and a Shapley table.
• MLI does not require an Internet connection to run on current models.
This section describes how to run MLI on time series data for multiple groups.
1. Click the Interpret this Model button on a completed time series experiment to launch Model Interpretation
for that experiment. This page includes the following:
• A Help panel describing how to read and use this page. Click the Hide Help Button to hide this
text.
• If a test set is provided and the test set includes actuals, then a panel will display showing a time
series plot and the top and bottom group metrics tables based on the scorer that was used in the
experiment. The metric plot will show the metric of interest per time point for holdout predictions
and the test set. Likewise, the actual vs. predicted plot will show actuals vs. predicted values per
time point for the holdout set and the test set. Note that this panel can be resized if necessary.
• If a test set is not provided, then internal validation predictions will be used. The metric plot will
only show the metric of interest per time point for holdout predictions. Likewise, the actual vs.
predicted plot will only show actuals vs. predicted values per time point for the holdout set.
• A Download Logs button for retrieving logs that were generated when this interpretation was built.
• A Download Group Metrics button for retrieving the averages of each group’s scorer, as well as
each group’s sample size.
• A Show Summary button that provides details about the experiment settings that were used.
• A Group Search entry field (scroll to bottom) for selecting the groups to view.
• Use the zoom feature to magnify any portion of a graph by clicking the Enable Zoom icon near the
top-right corner of a graph. While this icon is selected, click and drag to draw a box over the portion
of the graph you want to magnify. Click the Disable Zoom icon to return to the default view.
2. Scroll to the bottom of the panel and select a grouping in the Group Search field to view a graph of Actual vs.
Predicted values for the group. The outputted graph can be downloaded to your local machine.
3. Click on a prediction point in the plot (white line) to view Shapley values for that prediction point. The Shapley
values plot can also be downloaded to your local machine.
4. Click Add Panel to add a new MLI Time Series panel. This allows you to compare different groups in the same
model and also provides the flexibility to do a “side-by-side” comparison between different models.
Time Series MLI can also be run when only one group is available.
1. Click the Interpret this Model button on a completed time series experiment to launch Model Interpretation
for that experiment. This page includes the following:
• A Help panel describing how to read and use this page. Click the Hide Help Button to hide this
text.
• If a test set is provided and the test set includes actuals, then a panel will display showing a time
series plot and the top and bottom group metrics tables based on the scorer that was used in the
experiment. The metric plot will show the metric of interest per time point for holdout predictions
and the test set. Likewise, the actual vs. predicted plot will show actuals vs. predicted values per
time point for the holdout set and the test set. Note that this panel can be resized if necessary.
• If a test set is not provided, then internal validation predictions will be used. The metric plot will
only show the metric of interest per time point for holdout predictions. Likewise, the actual vs.
predicted plot will only show actuals vs. predicted values per time point for the holdout set.
• A Download Logs button for retrieving logs that were generated when this interpretation was built.
• A Download Group Metrics button for retrieving the average of the group’s scorer, as well as the
group’s sample size.
• A Show Summary button that provides details about the experiment settings that were used.
• A Group Search entry field for selecting the group to view. Note that for Single Time Series MLI,
there will only be one option in this field.
• Use the zoom feature to magnify any portion of a graph by clicking the leftmost square icon near the
top-right corner of a graph. While this icon is selected, click and drag to draw a box over the portion
of the graph you want to magnify. To return to the default view, click the square-shaped arrow icon
to the right of the zoom icon.
2. Scroll to the bottom of the panel and select an option in the Group Search field to view a graph of Actual vs.
Predicted values for the group. (Note that for Single Time Series MLI, there will only be one option in this
field.) The outputted graph can be downloaded to your local machine.
3. Click on a prediction point in the plot (white line) to view Shapley values for that prediction point. The Shapley
values plot can also be downloaded to your local machine.
4. Click Add Panel to add a new MLI Time Series panel. This allows you to do a “side-by-side” comparison
between different models.
TWENTYFOUR
SCORE ON ANOTHER DATASET
After you generate a model, you can use that model to make predictions on another dataset.
1. Click the Experiments link in the top menu and select the experiment that you want to use.
2. On the Experiment page, click the Score on Another Dataset button.
3. Locate the new dataset (test set) that you want to score on. Note that this new dataset must include the same
columns as the dataset used in the selected experiment.
4. Select the columns from the test set to include in the predictions frame.
5. Click Done to start the scoring process.
6. Click the Download Predictions button after scoring is complete.
Note: This feature runs batch scoring on a new dataset. You may notice slow speeds if you attempt to perform
single-row scoring.
TWENTYFIVE
TRANSFORM ANOTHER DATASET
When a training dataset is used in an experiment, Driverless AI transforms the data into an improved, feature en-
gineered dataset. (Refer to Driverless AI Transformations for more information about the transformations that are
provided in Driverless AI.) But what happens when new rows are added to your dataset? In this case, you can specify
to transform the new dataset after adding it to Driverless AI, and the same transformations that Driverless AI applied
to the original dataset will be applied to these new rows.
Follow these steps to transform another dataset. Note that this assumes the new dataset has been added to Driverless
AI already.
Note: Transform Another Dataset is not available for Time Series experiments.
1. On the completed experiment page for the original dataset, click the Transform Another Dataset button.
2. Select the new training dataset that you want to transform. Note that this must have the same number of columns
as the original dataset.
3. In the Select drop down, specify a validation dataset to use with this dataset, or specify to split the training data.
If you specify to split the data, then you also specify the split value (defaults to 25%) and the seed (defaults to
1234). Note: To ensure the transformed dataset respects the row order, choose a validation dataset instead of
splitting the training data. Splitting the training data will result in a shuffling of the row order.
4. Optionally specify a test dataset. If specified, then the output will also include the transformed test dataset for final scoring.
5. Click Launch Transformation.
The following datasets will be available for download upon successful completion:
• Training dataset (not for cross validation)
• Validation dataset for parameter tuning
• Test dataset for final scoring. This option is available if a test dataset was used.
TWENTYSIX
Driverless AI provides several Scoring Pipelines for experiments and/or interpreted models.
• A standalone Python Scoring Pipeline is available for experiments and interpreted models.
• A low-latency, standalone MOJO Scoring Pipeline is available for experiments, with both Java and C++ back-
ends.
The Python Scoring Pipeline is implemented as a Python whl file. While this allows for a single process scoring
engine, the scoring service is generally implemented as a client/server architecture and supports interfaces for TCP
and HTTP.
The MOJO Scoring Pipeline provides a standalone scoring pipeline that converts experiments to MOJOs, which can
be scored in real time. The MOJO Scoring Pipeline is available as either a Java runtime or a C++ runtime. For the
C++ runtime, both Python and R wrappers are provided.
Examples are included with each scoring package.
Note: These sections describe scoring pipelines and not deployments of scoring pipelines. For information on how to
deploy a MOJO scoring pipeline, refer to the Deploying the MOJO Pipeline section.
TWENTYSEVEN
Click the Visualize Scoring Pipeline (Experimental) button on the completed experiment page to view the visualiza-
tion.
To view a visual representation of a specific model, click on the oval that corresponds with that model.
To change the orientation of the visualization, click the Transpose button in the bottom right corner of the screen.
TWENTYEIGHT
Driverless AI provides a Python Scoring Pipeline, an MLI Standalone Scoring Pipeline, and a MOJO Scoring Pipeline.
Consider the following when determining the scoring pipeline that you want to use.
• For all pipelines, the higher the accuracy, the slower the scoring.
• The Python Scoring Pipeline is slower but easier to use than the MOJO scoring pipeline.
• When running the Python Scoring Pipeline:
– HTTP is easy and is supported by virtually any language. HTTP supports RESTful calls via curl, wget, or
supported packages in various scripting languages.
– TCP is a bit more complex, though faster. TCP also requires Thrift, which currently does not handle NAs.
• The MOJO Scoring Pipeline is flexible and is faster than the Python Scoring Pipeline, but it requires a bit more
coding. The MOJO Scoring Pipeline is available as either a Java runtime or a C++ runtime.
• The MLI Standalone Python Scoring Pipeline can be used to score interpreted models but only supports k-LIME
reason codes.
– For obtaining k-LIME reason codes from an MLI experiment, use the MLI Standalone Python Scoring
Pipeline. k-LIME reason codes are available for all models.
– For obtaining Shapley reason codes from an MLI experiment, use the DAI Standalone Python Scoring
Pipeline. Shapley is only available for XGBoost and LightGBM models. Note that obtaining Shapley
reason codes through the Python Scoring Pipeline can be time consuming.
TWENTYNINE
As indicated earlier, a scoring pipeline is available after a successfully completed experiment. This package contains
an exported model and Python 3.6 source code examples for productionizing models built using H2O Driverless AI.
The files in this package allow you to transform and score on new data in a couple of different ways:
• From Python 3.6, you can import a scoring module, and then use the module to transform and score on new
data.
• From other languages and platforms, you can use the TCP/HTTP scoring service bundled with this package to
call into the scoring pipeline module through remote procedure calls (RPC).
Note about custom recipes and the Python Scoring Pipeline: By default, if a custom recipe has been uploaded
into Driverless AI but then not used in the experiment, the Python Scoring Pipeline still contains the H2O recipe
server. If this pipeline is then deployed in a container, the H2O recipe server causes the size of the pipeline to be much
larger. In addition, Java has to be installed in the container, which further increases the runtime storage and memory
requirements. A workaround is to set the following environment variable before running the Python Scoring Pipeline:
export dai_enable_custom_recipes=0
There are two methods for starting the Python Scoring Pipeline.
This is the recommended method for running the Python Scoring Pipeline. Use this method if:
• You have an air gapped environment with no access to the Internet.
• You are running Power.
• You want an easy quick start approach.
Prerequisites
1. Download the TAR SH version of Driverless AI from https://www.h2o.ai/download/ (for either Linux or IBM
Power).
2. Use bash to execute the download. This creates a new dai-<dai_version> folder, where <dai_version> represents
your version of Driverless AI (for example, 1.7.1-linux-x86_64).
3. Change directories into the new Driverless AI folder. (Replace <dai_version> below with the version
that was created in Step 2.)
cd dai-<dai_version>
5. Run the following to install the Python Scoring Pipeline for your completed Driverless AI experiment:
./dai-env.sh pip install /path/to/your/scoring_experiment.whl
6. Run the following command to run the included scoring pipeline example:
DRIVERLESS_AI_LICENSE_KEY="pastekeyhere" SCORING_PIPELINE_INSTALL_DEPENDENCIES=0 ./dai-env.sh /path/to/your/run_example.sh
Prerequisites
• The scoring module and scoring service are supported only on Linux with Python 3.6 and OpenBLAS.
• The scoring module and scoring service download additional packages at install time and require Internet access.
Depending on your network environment, you might need to set up internet access via a proxy.
• Valid Driverless AI license. Driverless AI requires a license to be specified in order to run the Python Scoring
Pipeline.
• Apache Thrift (to run the scoring service in TCP mode)
• Linux environment
• Python 3.6
• libopenblas-dev (required for H2O4GPU)
• OpenCL
Examples of how to install these prerequisites are below.
Installing Python 3.6
Installing Python 3.6 and OpenBLAS on Ubuntu 16.10+
sudo apt install python3.6 python3.6-dev python3-pip python3-dev \
python-virtualenv python3-virtualenv libopenblas-dev
License Specification
Driverless AI requires a license to be specified in order to run the Python Scoring Pipeline. The license can be specified
via an environment variable in Python:
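For example (a sketch; set whichever variable matches how your license is stored, before importing the scoring module):

import os

# Point at a license file on disk ...
os.environ["DRIVERLESS_AI_LICENSE_FILE"] = "/path/to/license.sig"
# ... or supply the license key string directly.
os.environ["DRIVERLESS_AI_LICENSE_KEY"] = "pastekeyhere"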
Run the following to refresh the runtime shared libraries after installing Thrift:
sudo ldconfig /usr/local/lib
1. On the completed Experiment page, click on the Download Python Scoring Pipeline button to download the
scorer.zip file for this experiment onto your local machine.
Note: By default, the run_*.sh scripts mentioned above create a virtual environment using virtualenv and pip,
within which the Python code is executed. The scripts can also leverage Conda (Anaconda/Miniconda) to create a
Conda virtual environment and install required package dependencies. The package manager to use is provided as an
argument to the script.
# to use conda package manager
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh --pm conda
If you experience errors while running any of the above scripts, please check to make sure your system has a properly
installed and configured Python 3.6 installation. Refer to the Troubleshooting Python Environment Issues section that
follows to see how to set up and test the scoring module using a cleanroom Ubuntu 16.04 virtual machine.
The scoring module is a Python module bundled into a standalone wheel file (named scoring_*.whl). All the prereq-
uisites for the scoring module to work correctly are listed in the requirements.txt file. To use the scoring module, all
you have to do is create a Python virtualenv, install the prerequisites, and then import and use the scoring module as
follows:
# See 'example.py' for complete example.
from scoring_487931_20170921174120_b4066 import Scorer
scorer = Scorer() # Create instance.
score = scorer.score([ # Call score()
7.416, # sepal_len
3.562, # sepal_wid
1.049, # petal_len
2.388, # petal_wid
])
The process of importing and using the scoring module is demonstrated by the bash script run_example.sh, which
effectively performs the following steps:
# See 'run_example.sh' for complete example.
virtualenv -p python3.6 env
source env/bin/activate
pip install -r requirements.txt
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python example.py
The scoring service hosts the scoring module as an HTTP or TCP service. Doing this exposes all the functions of
the scoring module through remote procedure calls (RPC). In effect, this mechanism allows you to invoke scoring
functions from languages other than Python on the same computer or from another computer on a shared network or
on the Internet.
The scoring service can be started in two ways:
• In TCP mode, the scoring service provides high-performance RPC calls via Apache Thrift (https://thrift.apache.
org/) using a binary wire protocol.
• In HTTP mode, the scoring service provides JSON-RPC 2.0 calls served by Tornado (http://www.tornadoweb.
org).
Scoring operations can be performed on individual rows (row-by-row) or in batch mode (multiple rows at a time).
The TCP mode allows you to use the scoring service from any language supported by Thrift, including C, C++, C#,
Cocoa, D, Dart, Delphi, Go, Haxe, Java, Node.js, Lua, Perl, PHP, Python, Ruby and Smalltalk.
To start the scoring service in TCP mode, you will need to generate the Thrift bindings once, then run the server:
# See 'run_tcp_server.sh' for complete example.
thrift --gen py scoring.thrift
python tcp_server.py --port=9090
Note that the Thrift compiler is only required at build time. It is not a runtime dependency, i.e., once the scoring
services are built and tested, you do not need to repeat this installation process on the machines where the scoring
services are intended to be deployed.
To call the scoring service, simply generate the Thrift bindings for your language of choice, then make RPC calls via
TCP sockets using Thrift’s buffered transport in conjunction with its binary protocol.
# See 'run_tcp_client.sh' for complete example.
thrift --gen py scoring.thrift
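A sketch of the resulting Python client call is shown below. The generated module, service, and struct names (scoring, ScoringService, Row) and the Row field layout are assumptions based on typical generated bindings; check the files referenced by run_tcp_client.sh for the exact names.

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scoring import ScoringService      # generated by `thrift --gen py scoring.thrift` (name assumed)
from scoring.ttypes import Row           # generated row struct (name assumed)

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ScoringService.Client(protocol)

transport.open()
row = Row()                               # populate the Row fields as defined in scoring.thrift
score = client.score(row)                 # RPC call over the buffered binary transport
transport.close()
print(score)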
You can reproduce the exact same result from other languages, e.g. Java:
thrift --gen java scoring.thrift
// Dependencies:
// commons-codec-1.9.jar
// commons-logging-1.2.jar
import ai.h2o.scoring.Row;
import ai.h2o.scoring.ScoringService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import java.util.List;
transport.close();
} catch (TException ex) {
ex.printStackTrace();
}
}
}
The HTTP mode allows you to use the scoring service using plaintext JSON-RPC calls. This is usually less performant
compared to Thrift, but has the advantage of being usable from any HTTP client library in your language of choice,
without any dependency on Thrift.
For JSON-RPC documentation, see http://www.jsonrpc.org/specification.
To start the scoring service in HTTP mode:
# See 'run_http_server.sh' for complete example.
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python http_server.py --port=9090
To invoke scoring methods, compose a JSON-RPC message and make an HTTP POST request to http://host:port/rpc as
follows:
# See 'run_http_client.sh' for complete example.
curl http://localhost:9090/rpc \
--header "Content-Type: application/json" \
--data @- <<EOF
{
"id": 1,
"method": "score",
"params": {
"row": [ 7.486, 3.277, 4.755, 2.354 ]
}
}
EOF
Similarly, you can use any HTTP client library to reproduce the above result. For example, from Python, you can use
the requests module as follows:
import requests
row = [7.486, 3.277, 4.755, 2.354]
req = dict(id=1, method='score', params=dict(row=row))
res = requests.post('http://localhost:9090/rpc', json=req)  # send the JSON-RPC message as JSON, matching the curl example above
print(res.json()['result'])
Why am I getting a “TensorFlow is disabled” message when I run the Python Scoring Pipeline?
If you ran an experiment when TensorFlow was enabled and then attempt to run the Python Scoring Pipeline, you may
receive a message similar to the following:
TensorFlow is disabled. To enable, export DRIVERLESS_AI_ENABLE_TENSORFLOW=1 or set enable_tensorflow=true in config.toml.
To successfully run the Python Scoring Pipeline, you must enable the DRIVERLESS_AI_ENABLE_TENSORFLOW
flag. For example:
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
DRIVERLESS_AI_ENABLE_TENSORFLOW=1 bash run_example.sh
The following instructions describe how to set up a cleanroom Ubuntu 16.04 virtual machine to test that this scoring
pipeline works correctly.
Prerequisites:
• Install Virtualbox: sudo apt-get install virtualbox
• Install Vagrant: https://www.vagrantup.com/downloads.html
1. Create configuration files for Vagrant.
• bootstrap.sh: contains commands to set up Python 3.6 and OpenBLAS.
• Vagrantfile: contains virtual machine configuration instructions for Vagrant and VirtualBox.
----- bootstrap.sh -----
#!/usr/bin/env bash
# Provisioning sketch (the original script contents are assumed): install Python 3.6
# and OpenBLAS inside the VM, as described in the prerequisites above. A PPA that
# provides python3.6 packages for Ubuntu 16.04 is assumed to be available.
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install -y python3.6 python3.6-dev python3-pip python3-dev \
  python-virtualenv python3-virtualenv libopenblas-dev
# end of bootstrap.sh

----- Vagrantfile -----
Vagrant.configure(2) do |config|
config.vm.box = "ubuntu/xenial64"
config.vm.provision :shell, path: "bootstrap.sh", privileged: false
config.vm.hostname = "h2o"
config.vm.provider "virtualbox" do |vb|
vb.memory = "4096"
end
end
# end of Vagrantfile
2. Launch the VM and SSH into it. Note that we’re also placing the scoring pipeline in the same directory so that
we can access it later inside the VM.
cp /path/to/scorer.zip .
vagrant up
vagrant ssh
At this point, you should see scores printed out on the terminal. If not, contact us at [email protected].
THIRTY
This package contains an exported model and Python 3.6 source code examples for productionizing models built using
the H2O Driverless AI Machine Learning Interpretability (MLI) tool. This is only available for interpreted models and
can be downloaded by clicking the Scoring Pipeline button on the Interpreted Models page.
The files in this package allow you to obtain reason codes for a given row of data in a couple of different ways:
• From Python 3.6, you can import a scoring module, and then use the module to transform and score on new
data.
• From other languages and platforms, you can use the TCP/HTTP scoring service bundled with this package to
call into the scoring pipeline module through remote procedure calls (RPC).
There are two methods for starting the MLI Standalone Scoring Pipeline.
This is the recommended method for running the MLI Scoring Pipeline. Use this method if:
• You have an air gapped environment with no access to the Internet.
• You are running Power.
• You want an easy quick start approach.
Prerequisites
1. Download the TAR SH version of Driverless AI from https://www.h2o.ai/download/ (for either Linux or IBM
Power).
2. Use bash to execute the download. This creates a new dai-nnn folder.
3. Change directories into the new Driverless AI folder.
cd dai-nnn
4. Run the following to install the Python Scoring Pipeline for your completed Driverless AI experiment:
./dai-env.sh pip install /path/to/your/scoring_experiment.whl
5. Run the following command to run the included scoring pipeline example:
DRIVERLESS_AI_LICENSE_KEY="pastekeyhere" SCORING_PIPELINE_INSTALL_DEPENDENCIES=0 ./dai-env.sh /path/to/your/run_example.sh
This section describes an alternative method for running the MLI Standalone Scoring Pipeline. This version requires
Internet access. It is also not supported on Power machines.
Prerequisites
Run the following to refresh the runtime shared libraries after installing Thrift.
sudo ldconfig /usr/local/lib
2. Unzip the scoring pipeline, and run the following examples in the scoring-pipeline-mli folder.
Run the scoring module example. (This requires Linux and Python 3.6.)
bash run_example.sh
Run the TCP scoring server example. Use two terminal windows. (This requires Linux, Python 3.6 and
Thrift.)
bash run_tcp_server.sh
bash run_tcp_client.sh
Run the HTTP scoring server example. Use two terminal windows. (This requires Linux, Python 3.6 and
Thrift.)
bash run_http_server.sh
bash run_http_client.sh
Note: By default, the run_*.sh scripts mentioned above create a virtual environment using virtualenv
and pip, within which the Python code is executed. The scripts can also leverage Conda (Anaconda/Miniconda)
to create a Conda virtual environment and install the required package dependencies. The
package manager to use is provided as an argument to the script.
# to use conda package manager
bash run_example.sh --pm conda
The MLI scoring module is a Python module bundled into a standalone wheel file (named scoring_*.whl). All the
prerequisites for the scoring module to work correctly are listed in the ‘requirements.txt’ file. To use the scoring
module, all you have to do is create a Python virtualenv, install the prerequisites, and then import and use the scoring
module as follows:
----- See 'example.py' for complete example. -----
from scoring_487931_20170921174120_b4066 import KLimeScorer
scorer = KLimeScorer() # Create instance.
score = scorer.score_reason_codes([ # Call score_reason_codes()
7.416, # sepal_len
3.562, # sepal_wid
1.049, # petal_len
2.388, # petal_wid
])
There are times when the K-LIME model score is not close to the Driverless AI model score. In this case it may be
better to use reason codes using the Shapley method on the Driverless AI model. Please note: the reason codes from
Shapley will be in the transformed feature space.
To see an example of using both K-LIME and Driverless AI Shapley reason codes in the same Python session, run:
bash run_example_shapley.sh
For this batch script to succeed, MLI must be run on a Driverless AI model. If you have run MLI in standalone
(external model) mode, there will not be a Driverless AI scoring pipeline.
If MLI was run with transformed features, the Shapley example scripts will not be exported. You can generate exact
reason codes directly from the Driverless AI model scoring pipeline.
The MLI scoring service hosts the scoring module as an HTTP or TCP service. Doing this exposes all the functions of
the scoring module through remote procedure calls (RPC).
In effect, this mechanism allows you to invoke scoring functions from languages other than Python on the same
computer, or from another computer on a shared network or the internet.
The scoring service can be started in two ways:
• In TCP mode, the scoring service provides high-performance RPC calls via Apache Thrift (https://thrift.apache.
org/) using a binary wire protocol.
• In HTTP mode, the scoring service provides JSON-RPC 2.0 calls served by Tornado (http://www.tornadoweb.
org).
Scoring operations can be performed on individual rows (row-by-row) using score or in batch mode (multiple rows
at a time) using score_batch. Both functions allow you to specify pred_contribs=[True|False] to get
MLI predictions (KLime/Shapley) on a new dataset. See the example_shapley.py file for more information.
The TCP mode allows you to use the scoring service from any language supported by Thrift, including C, C++, C#,
Cocoa, D, Dart, Delphi, Go, Haxe, Java, Node.js, Lua, perl, PHP, Python, Ruby and Smalltalk.
To start the scoring service in TCP mode, you will need to generate the Thrift bindings once, then run the server:
----- See 'run_tcp_server.sh' for complete example. -----
thrift --gen py scoring.thrift
python tcp_server.py --port=9090
Note that the Thrift compiler is only required at build time. It is not a runtime dependency; i.e., once the scoring
services are built and tested, you do not need to repeat this installation process on the machines where the scoring
services are intended to be deployed.
To call the scoring service, simply generate the Thrift bindings for your language of choice, then make RPC calls via
TCP sockets using Thrift’s buffered transport in conjunction with its binary protocol.
----- See 'run_tcp_client.sh' for complete example. -----
thrift --gen py scoring.thrift
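For instance, a minimal Python client built on the generated bindings might look like the sketch below. The generated module layout, the Row struct's field name, and the sys.path entry are assumptions about how Thrift typically emits Python code; adjust them to match the files produced from scoring.thrift and see tcp_client.py in the package for the authoritative version.

# Minimal sketch of a Thrift TCP client (module/struct names are assumptions).
import sys
sys.path.append('gen-py')  # location of the bindings created by `thrift --gen py`

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scoring import ScoringService   # assumed generated module name
from scoring.ttypes import Row       # assumed struct name from scoring.thrift

socket = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ScoringService.Client(protocol)

transport.open()
# Field name 'data' is assumed; check the generated Row definition.
reason_codes = client.score_reason_codes(Row(data=[7.416, 3.562, 1.049, 2.388]))
print(reason_codes)
transport.close()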
You can reproduce the exact same result from other languages, e.g. Java:
thrift --gen java scoring.thrift
// Dependencies:
// commons-codec-1.9.jar
// commons-logging-1.2.jar
// httpclient-4.4.1.jar
// httpcore-4.4.1.jar
// libthrift-0.10.0.jar
// slf4j-api-1.7.12.jar
import ai.h2o.scoring.Row;
import ai.h2o.scoring.ScoringService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import java.util.Arrays;
import java.util.List;

public class Main {
  public static void main(String[] args) {
    try {
      // Connect to the TCP scoring service started by run_tcp_server.sh.
      TTransport transport = new TSocket("localhost", 9090);
      transport.open();
      ScoringService.Client client =
          new ScoringService.Client(new TBinaryProtocol(transport));

      // Build a row of inputs. The exact Row fields are defined in
      // scoring.thrift, so this constructor call is illustrative only.
      Row row = new Row(Arrays.asList(7.416, 3.562, 1.049, 2.388));

      // Request the K-LIME reason codes for the row.
      List<Double> reasonCodes = client.score_reason_codes(row);
      System.out.println(reasonCodes);

      transport.close();
    } catch (TException ex) {
      ex.printStackTrace();
    }
  }
}
The HTTP mode allows you to use the scoring service using plaintext JSON-RPC calls. This is usually less performant
compared to Thrift, but has the advantage of being usable from any HTTP client library in your language of choice,
without any dependency on Thrift.
For JSON-RPC documentation, see http://www.jsonrpc.org/specification .
To start the scoring service in HTTP mode:
----- See 'run_http_server.sh' for complete example. -----
python http_server.py --port=9090
To invoke scoring methods, compose a JSON-RPC message and make an HTTP POST request to http://host:port/rpc as
follows:
----- See 'run_http_client.sh' for complete example. -----
curl http://localhost:9090/rpc \
--header "Content-Type: application/json" \
--data @- <<EOF
{
"id": 1,
"method": "score_reason_codes",
"params": {
"row": [ 7.486, 3.277, 4.755, 2.354 ]
}
}
EOF
Similarly, you can use any HTTP client library to reproduce the above result. For example, from Python, you can use
the requests module as follows:
import requests
row = [7.486, 3.277, 4.755, 2.354]
req = dict(id=1, method='score_reason_codes', params=dict(row=row))
res = requests.post('http://localhost:9090/rpc', json=req)  # send the request body as JSON
print(res.json()['result'])
As indicated previously, the MOJO Scoring Pipeline provides a standalone scoring pipeline that converts experiments
to MOJOs, which can be scored in real time. The MOJO Scoring Pipeline is available as either a Java runtime or a
C++ runtime (with Python and R wrappers).
For completed experiments, Driverless AI automatically converts models to MOJOs (Model Objects, Optimized). The
MOJO Scoring Pipeline is a scoring engine that can be deployed in any Java environment for scoring in real time.
(Refer to Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrappers for information about
the C++ scoring runtime with Python and R wrappers.)
Keep in mind that, similar to H2O-3, MOJOs are tied to experiments. Experiments and MOJOs are not automatically
upgraded when Driverless AI is upgraded.
Notes:
• This scoring pipeline is not currently available for TensorFlow, RuleFit, or FTRL models.
• To disable the automatic creation of this scoring pipeline, set the Make MOJO Scoring Pipeline expert setting
to Off.
31.1.1 Prerequisites
The following are required in order to run the MOJO scoring pipeline.
• Java 7 runtime (JDK 1.7) or newer. NOTE: We recommend using Java 11+ due to a bug in Java. (See https://bugs.openjdk.java.net/browse/JDK-8186464.)
• Valid Driverless AI license. You can download the license.sig file from the machine hosting Driverless AI
(usually in the license folder). Copy the license file into the downloaded mojo-pipeline folder.
• mojo2-runtime.jar file. This is available from the top navigation menu in the Driverless AI UI and in the
downloaded mojo-pipeline.zip file for an experiment.
License Specification
Driverless AI requires a license to be specified in order to run the MOJO Scoring Pipeline. The license can be specified
in one of the following ways:
• Via an environment variable:
– DRIVERLESS_AI_LICENSE_FILE: Path to the Driverless AI license file, or
– DRIVERLESS_AI_LICENSE_KEY: The Driverless AI license key (Base64 encoded string)
• Via a system property of JVM (-D option):
– ai.h2o.mojos.runtime.license.file: Path to the Driverless AI license file, or
– ai.h2o.mojos.runtime.license.key: The Driverless AI license key (Base64 encoded
string)
• Via an application classpath:
– The license is loaded from a resource called /license.sig.
– The default resource name can be changed via the JVM system property ai.h2o.mojos.runtime.license.filename.
For example:
java -Dai.h2o.mojos.runtime.license.file=/etc/dai/license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv
31.1.3 Quickstart
Before running the quickstart examples, be sure that the MOJO scoring pipeline is already downloaded and unzipped:
1. On the completed Experiment page, click on the Download MOJO Scoring Pipeline button.
2. In the pop-up menu that appears, click on the Download MOJO Scoring Pipeline button once again to download
the mojo.zip file for this experiment onto your local machine. Refer to the provided instructions for Java,
Python, or R.
3. To score all rows in the sample test set (example.csv) with the MOJO pipeline (pipeline.mojo) and
license stored in the environment variable DRIVERLESS_AI_LICENSE_KEY:
bash run_example.sh
4. To score a specific test set (example.csv) with MOJO pipeline (pipeline.mojo) and the license file
(license.sig):
bash run_example.sh pipeline.mojo example.csv license.sig
Note: For very large models, it may be necessary to increase the memory limit when running the Java
application for data transformation. This can be done by specifying -Xmx25g when running the above
command.
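One way to apply the memory flag is to invoke the runtime directly with the same command shown in the License Specification section, adding the -Xmx option:

java -Xmx25g -Dai.h2o.mojos.runtime.license.file=/etc/dai/license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv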
1. Open a new terminal window and change directories to the experiment folder:
cd experiment
2. Create your main program in the experiment folder by creating a new file called Main.java (for example, using
vim Main.java). Include the following contents.
import java.io.IOException;
import ai.h2o.mojos.runtime.MojoPipeline;
import ai.h2o.mojos.runtime.frame.MojoFrame;
import ai.h2o.mojos.runtime.frame.MojoFrameBuilder;
import ai.h2o.mojos.runtime.frame.MojoRowBuilder;
import ai.h2o.mojos.runtime.utils.SimpleCSV;
import ai.h2o.mojos.runtime.lic.LicenseException;
Note: The Driverless AI 1.5 release will be the last release with TOML-based MOJO2. Releases after 1.5 will include
protobuf-based MOJO2.
MOJO scoring pipeline artifacts can be used in Spark to deploy predictions in parallel using the Sparkling Water API.
This section shows how to load and run predictions on the MOJO scoring pipeline in Spark using Scala and the Python
API.
In the event that you upgrade H2O Driverless AI, there is good news: Sparkling Water is backward compatible
with MOJO versions produced by older Driverless AI versions.
Requirements
• You must have a Spark cluster with the Sparkling Water JAR file passed to Spark.
• To run with PySparkling, you must have the PySparkling zip file.
The H2OContext does not have to be created if you only want to run predictions on MOJOs using Spark. This is
because the scoring is independent of the H2O runtime.
In order to use the MOJO scoring pipeline, the Driverless AI license has to be passed to Spark. This can be achieved via
the --jars argument of the Spark launcher scripts.
Note: In Local Spark mode, please use --driver-class-path to specify path to the license file.
PySparkling
First, start PySpark with PySparkling Python package and Driverless AI license.
./bin/pyspark --jars license.sig --py-files pysparkling.zip
Alternatively, you can download the official Sparkling Water distribution from the H2O Download page. Follow the steps on the
Sparkling Water download page. Once you are in the Sparkling Water directory, you can call:
./bin/pysparkling --jars license.sig
At this point, you should have a PySpark interactive terminal available where you can try out predictions. If you
would like to productionize the scoring process, you can use the same configuration, except instead of using
./bin/pyspark, you would use ./bin/spark-submit to submit your job to a cluster.
# First, specify the dependencies
from pysparkling.ml import H2OMOJOPipelineModel, H2OMOJOSettings

# The 'namedMojoOutputColumns' option ensures that the output columns are named properly.
# If you want the old behavior, where all output columns were stored inside an array,
# set it to False. However, we strongly encourage users to keep the default value of True.
settings = H2OMOJOSettings(namedMojoOutputColumns = True)
# Load the pipeline. 'settings' is an optional argument. If it's not specified, the default values are used.
mojo = H2OMOJOPipelineModel.createFromMojo("file:///path/to/the/pipeline.mojo", settings)
# Run the predictions. The predictions contain all the original columns plus the predictions
# added as new columns
predictions = mojo.transform(dataFrame)
# You can easily get the predictions for a desired column using the helper function as follows:
predictions.select(mojo.selectPredictionUDF("AGE")).collect()
Sparkling Water
First, start Spark with Sparkling Water Scala assembly and Driverless AI license.
./bin/spark-shell --jars license.sig,sparkling-water-assembly.jar
Alternatively, you can download the official Sparkling Water distribution from the H2O Download page. Follow the steps on the
Sparkling Water download page. Once you are in the Sparkling Water directory, you can call:
./bin/sparkling-shell --jars license.sig
At this point, you should have a Sparkling Water interactive terminal available where you can carry out predictions.
If you would like to productionize the scoring process, you can use the same configuration, except instead of using
./bin/spark-shell, you would use ./bin/spark-submit to submit your job to a cluster.
// First, specify the dependencies
import ai.h2o.sparkling.ml.models.{H2OMOJOPipelineModel, H2OMOJOSettings}

// The 'namedMojoOutputColumns' option ensures that the output columns are named properly.
// If you want the old behavior, where all output columns were stored inside an array,
// set it to false. However, we strongly encourage users to keep the default value of true.
val settings = H2OMOJOSettings(namedMojoOutputColumns = true)
// Load the pipeline. 'settings' is an optional argument. If it's not specified, the default values are used.
val mojo = H2OMOJOPipelineModel.createFromMojo("file:///path/to/the/pipeline.mojo", settings)
// Run the predictions. The predictions contain all the original columns plus the predictions
// added as new columns
val predictions = mojo.transform(dataFrame)
// You can easily get the predictions for a desired column using the helper function as follows:
predictions.select(mojo.selectPredictionUDF("AGE"))
The C++ Scoring Pipeline is provided as R and Python packages for the protobuf-based MOJO2 protocol. The pack-
ages are self contained, so no additional software is required. Simply build the MOJO Scoring Pipeline and begin
using your preferred method.
Notes:
• These scoring pipelines are currently not available for RuleFit models.
• The Download MOJO Scoring Pipeline button appears as Build MOJO Scoring Pipeline if the MOJO Scor-
ing Pipeline is disabled.
The R and Python packages can be downloaded from within the Driverless AI application. To do this, click Resources,
then click MOJO2 R Runtime and MOJO2 Py Runtime from the drop-down menu. In the pop-up menu that appears,
click the button that corresponds to the OS you are using. Choose from Linux, Mac OS X, and IBM PowerPC.
31.2.2 Examples
The following examples show how to use the R and Python APIs of the C++ MOJO runtime.
R Example
Prerequisites
library(daimojo)  # assumed setup lines; the original prerequisites listing is not shown above
m <- load.mojo("./mojo-pipeline/pipeline.mojo")
col_class <- setNames(feature.types(m), feature.names(m))
library(data.table)
d <- fread("./mojo-pipeline/example.csv", colClasses=col_class, header=TRUE, sep=",")
predict(m, d)
## label.B label.M
## 1 0.08287659 0.91712341
## 2 0.77655075 0.22344925
## 3 0.58438434 0.41561566
## 4 0.10570505 0.89429495
## 5 0.01685609 0.98314391
## 6 0.23656610 0.76343390
## 7 0.17410333 0.82589667
## 8 0.10157948 0.89842052
## 9 0.13546191 0.86453809
## 10 0.94778244 0.05221756
Python Example
Prerequisites
• Python MOJO runtime. Run one of the following commands after downloading from the GUI:
# Install the MOJO runtime on Linux PPC
pip install daimojo-2.2.0-cp36-cp36m-linux_ppc64le.whl
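Once the runtime is installed, scoring from Python follows the same pattern as the R example. The following is a minimal sketch assuming the daimojo and datatable package APIs (daimojo.model, m.missing_values, and m.predict are the calls assumed here):

# Load the MOJO with the daimojo runtime
import daimojo.model
import datatable as dt

m = daimojo.model("./mojo-pipeline/pipeline.mojo")

# Read the example data with datatable, then score it
pydt = dt.fread("./mojo-pipeline/example.csv", na_strings=m.missing_values, header=True)
res = m.predict(pydt)
print(res)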
The downloaded mojo.zip file contains the entire scoring pipeline. This pipeline also includes a MOJO2 Javadoc,
which can be opened by running the following in the mojo-pipeline folder:
jar -xf mojo2-runtime-javadoc.jar
Driverless AI can deploy the MOJO scoring pipeline for you to test and/or to integrate into a final product.
Notes:
• This section describes how to deploy a MOJO scoring pipeline and assumes that a MOJO scoring pipeline exists.
Refer to the MOJO Scoring Pipelines section for information on how to build a MOJO scoring pipeline.
• This is an early feature that will continue to support additional deployments.
All of the existing MOJO scoring pipeline deployments are available in the Deployments Overview page, which is
available from the top menu. This page lists all active deployments and the information needed to access the respective
endpoints. In addition, it allows you to stop any deployments that are no longer needed.
Driverless AI can deploy the trained MOJO scoring pipeline as an AWS Lambda Function, i.e., a server-less scorer
running in Amazon Cloud and charged by the actual usage.
Refer to the aws-lambda-scorer folder in the dai-deployment-templates repository to see different deployment tem-
plates for AWS Lambda scorer.
• Driverless AI MOJO Scoring Pipeline: To deploy a MOJO scoring pipeline as an AWS Lambda function, the
MOJO pipeline archive has to be created first by choosing the Build MOJO Scoring Pipeline option on the
completed experiment page. Refer to the MOJO Scoring Pipelines section for information on how to build a
MOJO scoring pipeline.
• Terraform v0.11.x (specifically v0.11.10 or greater): In addition, the Terraform tool (https://www.terraform.
io/) has to be installed on the system running Driverless AI. The tool is included in the Driverless AI Docker
images but not in native install packages. To install Terraform, follow the steps on Terraform installation page.
Notes:
• Terraform is not available on every platform. In particular, there is no Power build, so AWS Lambda
Deployment is currently not supported on Power installations of Driverless AI.
• Terraform v0.12 is not supported. If you have v0.12 installed, you will need to downgrade to v0.11.x
(specifically v0.11.10 or greater) in order to deploy a MOJO scoring pipeline as an AWS Lambda
function.
Usage Plans
Usage plans must be enabled in the target AWS region in order for API keys to work when accessing the AWS Lambda
via its REST API. Refer to https://aws.amazon.com/blogs/aws/new-usage-plans-for-amazon-api-gateway/ for more
information.
Access Permissions
The following AWS access permissions need to be provided to the role in order for Driverless AI Lambda deployment
to succeed.
• AWSLambdaFullAccess
• IAMFullAccess
• AmazonAPIGatewayAdministrator
The policy can be further stripped down to restrict Lambda and S3 rights using the JSON policy definition as follows:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
Once the MOJO pipeline archive is ready, Driverless AI provides a Deploy (Local & Cloud) option on the completed
experiment page.
Note: This button is only available after the MOJO Scoring Pipeline has been built.
This option opens a new dialog for setting the AWS account credentials (or use those supplied in the Driverless AI
configuration file or environment variables), AWS region, and the desired deployment name (which must be unique
per Driverless AI user and AWS account used).
• Deployment Name: The name of the deployment. It is initialized from a combination of the name of the
experiment and the deployment type. This has to be unique both for the Driverless AI user and the
AWS account used.
• Region: The AWS region to deploy the MOJO scoring pipeline to. It makes sense to choose a region geograph-
ically close to any client code calling the endpoint in order to minimize request latency. (See also AWS Regions
and Availability Zones.)
• Use AWS environment variables: If enabled, the AWS creden-
tials are taken from the Driverless AI configuration file (see records
deployment_aws_access_key_id and deployment_aws_secret_access_key)
or environment variables (DRIVERLESS_AI_DEPLOYMENT_AWS_ACCESS_KEY_ID and
DRIVERLESS_AI_DEPLOYMENT_AWS_SECRET_ACCESS_KEY). This would usually be entered by
the Driverless AI installation administrator.
• AWS Access Key ID and AWS Secret Access Key: Credentials to access the AWS account. This pair of secrets
identifies the AWS user and the account and can be obtained from the AWS account console.
On a successful deployment, all the information needed to access the new endpoint (URL and an API Key) is printed,
and the same information is available in the Deployments Overview Page after clicking on the deployment row.
Note that the actual scoring endpoint is located at the path /score. In addition, to prevent DDoS and other malicious
activities, the resulting AWS lambda is protected by an API Key, i.e., a secret that has to be passed in as a part of the
request using the x-api-key HTTP header.
The request is a JSON object containing attributes:
• fields: A list of input column names that should correspond to the training data columns.
• rows: A list of rows that are in turn lists of cell values to predict the target values for.
• optional includeFieldsInOutput: A list of input columns that should be included in the output.
An example request providing 2 columns on the input and asking to get one column copied to the output looks as
follows:
{
"fields": [
"age", "salary"
],
"includeFieldsInOutput": [
"salary"
],
"rows": [
[
"48.0", "15000.0"
],
[
"35.0", "35000.0"
],
[
"18.0", "22000.0"
]
]
}
Assuming the request is stored locally in a file named test.json, the request to the endpoint can be sent, e.g., using
the curl utility, as follows:
URL={place the endpoint URL here}
API_KEY={place the endpoint API key here}
curl \
-d @test.json \
-X POST \
-H "x-api-key: ${API_KEY}" \
${URL}/score
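The same request can also be sent from Python with the requests library. The URL and API key below are placeholders for the values shown on the Deployments Overview page:

import json
import requests

url = "{place the endpoint URL here}"
api_key = "{place the endpoint API key here}"

with open("test.json") as f:
    payload = json.load(f)

# The scoring endpoint lives at the /score path and is protected by the API key.
res = requests.post(url + "/score", json=payload, headers={"x-api-key": api_key})
print(res.json()["score"])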
The response is a JSON object with a single attribute score, which contains the list of rows with the optional copied
input values and the predictions.
For the example above with a two class target field, the result is likely to look something like the following snippet.
The particular values would of course depend on the scoring pipeline:
{
  "score": [
    [
      "48.0",
      "0.6240277982943945",
      "0.045458571508101536"
    ],
    [
      "35.0",
      "0.7209441819603676",
      "0.06299909138586585"
    ],
    [
      "18.0",
      "0.7209441819603676",
      "0.06299909138586585"
    ]
  ]
}
We create a new S3 bucket per AWS Lambda deployment. The bucket names have to be unique throughout AWS
S3, and one user can create a maximum of 100 buckets. Therefore, we recommend setting the bucket name used for
deployment with the deployment_aws_bucket_name config option.
This section describes how to deploy the trained MOJO scoring pipeline as a local Representational State Transfer
(REST) Server.
The REST server deployment supports API endpoints such as model metadata, file/CSV scoring, etc. It uses SpringFox
for both programmatic and manual inspection of the API. Refer to the local-rest-scorer folder in the dai-deployment-
templates repository to see different deployment templates for Local REST scorers.
32.3.2 Prerequisites
• Driverless AI MOJO Scoring Pipeline: To deploy a MOJO scoring pipeline as a Local REST Scorer, the MOJO
pipeline archive has to be created first by choosing the Build MOJO Scoring Pipeline option on the completed
experiment page. Refer to the MOJO Scoring Pipelines section for information on how to build a MOJO scoring
pipeline.
• When using a firewall or a virtual private cloud (VPC), the ports that are used by the REST server must be
exposed.
• Ensure that you have enough memory and CPUs to run the REST scorer. Typically, a good estimation for the
amount of required memory is 12 times the size of the pipeline.mojo file. For example, a 100MB pipeline.mojo
file will require approximately 1200MB of RAM. (Note: To conveniently view in-depth information about your
system in Driverless AI, click on Resources at the top of the screen, then click System Info.)
• When running Driverless AI in a Docker container, you must expose ports on Docker for the REST service
deployment within the Driverless AI Docker container. For example, the following exposes the Driverless AI
Docker container to listen to port 8094 for requests arriving at the host port at 18094.
docker run \
-d \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12181:12345 \
-p 18094:8094 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/<dai-image-name>:TAG
Once the MOJO pipeline archive is ready, Driverless AI provides a Deploy (Local & Cloud) option on the completed
experiment page.
Notes:
• This button is only available after the MOJO Scoring Pipeline has been built.
• This button is not available on PPC64LE environments.
This option opens a new dialog for setting the REST Server deployment name, port number, and maximum heap size
(optional).
1. Specify a name for the REST scorer in order to help track the deployed REST scorers.
2. Provide a port number on which the REST scorer will run. For example, if port number 8081 is selected, the
scorer will be available at http://my-ip-address:8081/models
3. Optionally specify the maximum heap size for the Java Virtual Machine (JVM) running the REST scorer. This
can help constrain the REST scorer from overconsuming memory of the machine. Because the REST scorer is
running on the same machine as Driverless AI, it may be helpful to limit the amount of memory that is allocated
to the REST scorer. This option will limit the amount of memory the REST scorer can use, but it will also
produce an error if the memory allocated is not enough to run the scorer. (The amount of memory required is
mostly dependent on the size of MOJO. See Prerequisites for more information.)
Note that the actual scoring endpoint is located at the path /score.
The request is a JSON object containing attributes:
• fields: A list of input column names that should correspond to the training data columns.
• rows: A list of rows that are in turn lists of cell values to predict the target values for.
• optional includeFieldsInOutput: A list of input columns that should be included in the output.
An example request providing 2 columns on the input and asking to get one column copied to the output looks as
follows:
{
"fields": [
"age", "salary"
],
"includeFieldsInOutput": [
"salary"
],
"rows": [
[
"48.0", "15000.0"
],
[
"35.0", "35000.0"
],
[
"18.0", "22000.0"
]
]
}
Assuming the request is stored locally in a file named test.json, the request to the endpoint can be sent, e.g., using
the curl utility, as follows:
URL={place the endpoint URL here}
curl \
-X POST \
-d @test.json \
-H "Content-Type: application/json" \
${URL}/score
The response is a JSON object with a single attribute score, which contains the list of rows with the optional copied
input values and the predictions.
For the example above with a two class target field, the result is likely to look something like the following snippet.
The particular values would of course depend on the scoring pipeline:
{
  "score": [
    [
      "48.0",
      "0.6240277982943945",
      "0.045458571508101536"
    ],
    [
      "35.0",
      "0.7209441819603676",
      "0.06299909138586585"
    ],
    [
      "18.0",
      "0.7209441819603676",
      "0.06299909138586585"
    ]
  ]
}
When using Docker, local REST scorers are deployed within the same container as Driverless AI. As a result, all
REST scorers will be turned off if the Driverless AI container is closed. When using native installs (rpm/deb/tar.sh),
the REST scorers will continue to run even if Driverless AI is shut down.
H2O Driverless AI is an automatic machine learning platform that uses feature engineering recipes from some of
the world’s best data scientists to deliver highly accurate machine learning models. As part of the automatic feature
engineering process, the system uses a variety of transformers to enhance the available data. This section describes
what’s happening underneath the hood, including details about the feature engineering transformations and time series
and natural language processing functionality.
Refer to one of the following topics:
• Data Sampling
• Driverless AI Transformations
• Internal Validation Technique
• Missing and Unseen Levels Handling
• Imputation in Driverless AI
• Time Series in Driverless AI
• NLP in Driverless AI
DATA SAMPLING
Driverless AI does not perform any type of data sampling unless the dataset is big or highly imbal-
anced (for improved accuracy). What is considered big is dependent on your accuracy setting and the
statistical_threshold_data_size_large parameter in the config.toml or in the Expert Settings.
You can see whether the data will be sampled by viewing the Experiment Preview when you set up the experiment. For
example, the experiment preview may report that the data will be sampled down to 5 million rows.
If Driverless AI decides to sample the data based on these settings and the data size, then Driverless AI will perform
the following types of sampling at the start of the experiment:
DRIVERLESS AI TRANSFORMATIONS
Transformations in Driverless AI are applied to columns in the data. The transformers create the engineered features
in experiments.
Driverless AI provides a number of transformers. The downloaded experiment logs include the transformations that
were applied to your experiment. Note that you can exclude transformations in the config.toml file, and that list of
excluded transformers will also be available in the experiment log.
The following transformers are available for classification (multiclass and binary) and regression experiments.
• ClusterDist Transformer
The Cluster Distance Transformer clusters selected numeric columns and uses the distance to a spe-
cific cluster as a new feature.
• ClusterTE Transformer
The Cluster Target Encoding Transformer clusters selected numeric columns and calculates the mean
of the response column for each cluster. The mean of the response is used as a new feature. Cross
Validation is used to calculate mean response to prevent overfitting.
• Interactions Transformer
The Interactions Transformer adds, divides, multiplies, and subtracts two numeric columns in the
data to create a new feature. This transformation uses a smart search to identify which feature pairs
to transform. Only interactions that improve the baseline model score are kept.
• InteractionsSimple Transformer
The InteractionsSimple Transformer adds, divides, multiplies, and subtracts two numeric columns in
the data to create a new feature. This transformation randomly selects pairs of features to transform.
• NumCatTE Transformer
The Numeric Categorical Target Encoding Transformer calculates the mean of the response column
for several selected columns. If one of the selected columns is numeric, it is first converted to cate-
gorical by binning. The mean of the response column is used as a new feature. Cross Validation is
used to calculate mean response to prevent overfitting.
• NumToCatTE Transformer
The Numeric to Categorical Target Encoding Transformer converts numeric columns to categoricals
by binning and then calculates the mean of the response column for each group. The mean of the
response for the bin is used as a new feature. Cross Validation is used to calculate mean response to
prevent overfitting.
• NumToCatWoEMonotonic Transformer
The Numeric to Categorical Weight of Evidence Monotonic Transformer converts a numeric col-
umn to categorical by binning and then calculates Weight of Evidence for each bin. The monotonic
constraint ensures the bins of values are monotonically related to the Weight of Evidence value. The
Weight of Evidence is used as a new feature. Weight of Evidence measures the “strength” of a group-
ing for separating good and bad risk and is calculated by taking the log of the ratio of distributions
for a binary response column.
• NumToCatWoE Transformer
The Numeric to Categorical Weight of Evidence Transformer converts a numeric column to categor-
ical by binning and then calculates Weight of Evidence for each bin. The Weight of Evidence is used
as a new feature. Weight of Evidence measures the “strength” of a grouping for separating good and
bad risk and is calculated by taking the log of the ratio of distributions for a binary response column.
• Original Transformer
The Original Transformer applies an identity transformation to a numeric column.
• TruncSVDNum Transformer
Truncated SVD Transformer trains a Truncated SVD model on selected numeric columns and uses
the components of the truncated SVD matrix as new features.
• DateOriginal Transformer
The Date Original Transformer retrieves date values such as year, quarter, month, day, day of the
year, week, and weekday values.
• DateTimeOriginal Transformer
The Date Time Original Transformer retrieves date and time values such as year, quarter, month, day,
day of the year, week, weekday, hour, minute, and second values.
• EwmaLags Transformer
The Exponentially Weighted Moving Average (EWMA) Transformer calculates the exponentially
weighted moving average of target or feature lags.
• LagsAggregates Transformer
The Lags Aggregates Transformer calculates aggregations of target/feature lags like mean(lag7,
lag14, lag21) with support for mean, min, max, median, sum, skew, kurtosis, std. The aggregation is
used as a new feature.
• LagsInteraction Transformer
The Lags Interaction Transformer creates target/feature lags and calculates interactions between the
lags (lag2 - lag1, for instance). The interaction is used as a new feature.
• Lags Transformer
The Lags Transformer creates target/feature lags, possibly over groups. Each lag is used as a new fea-
ture. Lag transformers may apply to categorical (strings) features or binary/multiclass string valued
targets after they have been internally numerically encoded.
• LinearLagsRegression Transformer
The Linear Lags Regression transformer trains a linear model on the target or feature lags to predict
the current target or feature value. The linear model prediction is used as a new feature.
• Cat Transformer
The Cat Transformer sorts a categorical column in lexicographical order and uses the order index
created as a new feature. This transformer works with models that can handle categorical features.
• CatOriginal Transformer
The Categorical Original Transformer applies an identity transformation that leaves categorical fea-
tures as they are. This transformer works with models that can handle non-numeric feature values.
• CVCatNumEncode Transformer
The Cross Validation Categorical to Numeric Encoding Transformer calculates an aggregation of a
numeric column for each value in a categorical column (ex: calculate the mean Temperature for each
City) and uses this aggregation as a new feature.
• CVTargetEncode Transformer
The Cross Validation Target Encoding Transformer calculates the mean of the response column for
each value in a categorical column and uses this as a new feature. Cross Validation is used to calculate
mean response to prevent overfitting.
• Frequent Transformer
The Frequent Transformer calculates the frequency for each value in categorical column(s) and uses
this as a new feature. This count can be either the raw count or the normalized count.
• LexiLabelEncoder Transformer
The Lexi Label Encoder sorts a categorical column in lexicographical order and uses the order index
created as a new feature.
• NumCatTE Transformer
The Numeric Categorical Target Encoding Transformer calculates the mean of the response column
for several selected columns. If one of the selected columns is numeric, it is first converted to cate-
gorical by binning. The mean of the response column is used as a new feature. Cross Validation is
used to calculate mean response to prevent overfitting.
• OneHotEncoding Transformer
The One-hot Encoding transformer converts a categorical column to a series of boolean features by
performing one-hot encoding. The boolean features are used as new features.
• SortedLE Transformer
The Sorted Label Encoding Transformer sorts a categorical column by the response column and uses
the order index created as a new feature.
• WeightOfEvidence Transformer
The Weight of Evidence Transformer calculates Weight of Evidence for each value in categorical
column(s). The Weight of Evidence is used as a new feature. Weight of Evidence measures the
“strength” of a grouping for separating good and bad risk and is calculated by taking the log of the
ratio of distributions for a binary response column.
This only works with a binary target variable. The likelihood needs to be created within a stratified
kfold if a fit_transform method is used. More information can be found here:
http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/.
• TextBiGRU Transformer
The Text Bidirectional GRU Transformer trains a bi-directional GRU TensorFlow model on word
embeddings created from a text feature to predict the response column. The GRU prediction is used
as a new feature. Cross Validation is used when training the GRU model to prevent overfitting.
• TextCharCNN Transformer
The Text Character CNN Transformer trains a CNN TensorFlow model on character embeddings
created from a text feature to predict the response column. The CNN prediction is used as a new
feature. Cross Validation is used when training the CNN model to prevent overfitting.
• TextCNN Transformer
The Text CNN Transformer trains a CNN TensorFlow model on word embeddings created from a
text feature to predict the response column. The CNN prediction is used as a new feature. Cross
Validation is used when training the CNN model to prevent overfitting.
• TextLinModel Transformer
The Text Linear Model Transformer trains a linear model on a TF-IDF matrix created from a text
feature to predict the response column. The linear model prediction is used as a new feature. Cross
Validation is used when training the linear model to prevent overfitting.
• Text Transformer
The Text Transformer tokenizes a text column and creates a TFIDF matrix (term frequency-inverse
document frequency) or count (count of the word) matrix. This may be followed by dimensionality
reduction using truncated SVD. Selected components of the TF-IDF/Count matrix are used as new
features.
• Dates Transformer
The Dates Transformer retrieves any date values, including:
– Year
– Quarter
– Month
– Day
– Day of year
– Week
– Week day
– Hour
– Minute
– Second
• IsHoliday Transformer
The Is Holiday Transformer determines if a date column is a holiday. A boolean column indicating
if the date is a holiday is added as a new feature. Creates a separate feature for holidays in the
United States, United Kingdom, Germany, Mexico, and the European Central Bank. Other countries
available in the Python holidays package can be added via the configuration file.
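Several of the transformers above (for example CVTargetEncode, NumCatTE, NumToCatTE, and ClusterTE) rely on cross-validated, out-of-fold target encoding to prevent overfitting. The sketch below illustrates the general idea on a toy pandas DataFrame; it is only an illustration of the technique, not Driverless AI's implementation, and the column names are invented:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "State": ["NY", "NY", "CA", "CA", "NY", "CA"],
    "Price": [700_000, 650_000, 450_000, 500_000, 620_000, 480_000],
})

# Out-of-fold target encoding: each row's encoding is the mean target value of
# its category computed on the *other* folds, so the row's own target never
# leaks into its feature.
df["CV_TE_State"] = 0.0
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby("State")["Price"].mean()
    global_mean = df.iloc[train_idx]["Price"].mean()
    df.loc[df.index[valid_idx], "CV_TE_State"] = (
        df.iloc[valid_idx]["State"].map(fold_means).fillna(global_mean)
    )

print(df)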
In this section, we will describe some of the available transformations using the example of predicting house prices on
the example dataset.
Date Built Square Footage Num Beds Num Baths State Price
01/01/1920 1700 3 2 NY $700K
Date Built Square Footage Num Beds Num Baths State Price Freq_State
01/01/1920 1700 3 2 NY 700,000 4,500
There is one more bedroom than there are bathrooms for this property.
The first component of the truncated SVD of the columns Price, Number of Beds, Number of Baths.
• get year, get quarter, get month, get day, get day of year, get week, get week day, get hour, get minute, get
second
Date Built Square Footage Num Beds Num Baths State Price DateBuilt_Month
01/01/1920 1700 3 2 NY 700,000 1
• transform text column using methods: TFIDF or count (count of the word)
• this may be followed by dimensionality reduction using truncated SVD
Date Built Square Footage Num Beds Num Baths State Price CV_TE_State
01/01/1920 1700 3 2 NY 700,000 550,000
Date Built Square Footage Num Beds Num Baths State Price CV_TE_SquareFootage
01/01/1920 1700 3 2 NY 700,000 345,000
The column Square Footage has been bucketed into 10 equally populated bins. This property lies in the Square
Footage bucket 1,572 to 1,749. The average price of properties with this range of square footage is $345,000*.
*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.
The columns: Num Beds, Num Baths, Square Footage have been segmented into 4 clusters. The average
price of properties in the same cluster as the selected property is $450,000*.
*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.
The columns: Num Beds, Num Baths, Square Footage have been segmented into 4 clusters. The difference
from this record to Cluster 1 is 0.83.
This section describes the technique behind internal validation in Driverless AI.
For the experiment, Driverless AI will either:
(1) split the data into a training set and internal validation set
or
(2) use cross validation to split the data into n folds
Driverless AI chooses the method based on the size of the data and the Accuracy setting. For method 1, part of the
data is removed to be used for internal validation. (Note: This train and internal validation split may be repeated if the
data is small so that more data can be used for training.)
For method 2, however, no data is wasted for internal validation. Driverless AI randomly splits the data into the
specified number of folds (for example, 5) for cross validation. With cross validation, the whole dataset is utilized,
and each model is trained on a different subset of the training data.
Driverless AI will not automatically create the internal validation data randomly if a user provides a Fold Column or a
Validation Dataset. If a Fold Column or a Validation Dataset is provided, Driverless AI will use that data to calculate
the performance of the Driverless AI models and to calculate all performance graphs and statistics.
If the experiment is a Time Series use case, and a Time Column is selected, Driverless AI will change the way the
internal validation data is created. In the case of temporal data, it is important to train on historical data and validate
on more recent data. Driverless AI does not perform random splits, but instead respects the temporal nature of the data
to prevent any data leakage. In addition, the train/validation split is a function of the time gap between train and test
as well as the forecast horizon (amount of time periods to predict). If test data is provided, Driverless AI will suggest
values for these parameters that lead to a validation set that resembles the test set as much as possible. But users can
control the creation of the validation split in order to adjust it to the actual application.
This section describes how missing and unseen levels are handled by each algorithm during training and scoring.
37.1 How Does the Algorithm Handle Missing Values During Training?
Driverless AI treats missing values natively. (I.e., a missing value is treated as a special value.) Experiments rarely
benefit from imputation techniques, unless the user has a strong understanding of the data.
37.1.2 GLM
Driverless AI automatically performs mean value imputation (equivalent to setting the value to zero after standardiza-
tion).
37.1.3 TensorFlow
Driverless AI provides an imputation setting for TensorFlow in the config.toml file: tf_nan_impute_value (post-
normalization). If you set this option to 0, then missing values will be imputed by the mean. Setting it to (for
example) +5 will specify 5 standard deviations above the mean of the distribution. The default value in Driverless AI
is -5, which specifies that TensorFlow will treat missing values as outliers on the negative end of the spectrum. Specify
0 if you prefer mean imputation.
37.1.4 FTRL
In FTRL, missing values have their own representation for each datatable column type. These representations are used
to hash the missing value, with their column’s name, to an integer. This means FTRL replaces missing values with
special constants that are the same for each column type, and then treats these special constants like a normal data
value.
37.2 How Does the Algorithm Handle Missing Values During Scoring (Production)?
If missing data is present during training, these tree-based algorithms learn the optimal direction for missing data for
each split (left or right). This optimal direction is then used for missing values during scoring. If no missing data was
present during training (for a particular feature), then the majority path is followed if the value is missing during scoring.
37.2.2 GLM
Missing values are replaced by the mean value (from training), same as in training.
37.2.3 TensorFlow
Missing values are replaced by the same value as specified during training (parameterized by tf_nan_impute_value).
37.2.4 FTRL
To ensure consistency, FTRL treats missing values during scoring in exactly the same way as during training.
Missing values are replaced with the mean along each column. This is used only on numeric columns.
Isolation Forest uses out-of-range imputation that fills missing values with the values beyond the maximum.
Driverless AI’s feature engineering pipeline will compute a numeric value for every categorical level present in the
data, whether it’s a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For
target encoding, the global mean of the target value will be used.
37.3.2 FTRL
FTRL models don’t distinguish between categorical and numeric values. Whether or not FTRL saw a particular value
during training, it will hash all the data, row by row, to numeric and then make predictions. Because you can think of
FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions
for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable “overlap” in terms of
unique values with the ones used to make predictions.
All algorithms will skip an observation (aka record) if the response value is missing.
IMPUTATION IN DRIVERLESS AI
The impute feature allows you to fill in missing values with substituted values. Missing values can be imputed based on
the column’s mean, median, minimum, maximum, or mode value. You can also impute based on a specific percentile
or by a constant value.
The imputation is precomputed on all data or inside the pipeline (based on what’s in the train split).
The following guidelines should be followed when performing imputation:
• For constant imputation on numeric columns, constant must be numeric.
• For constant imputation on string columns, constant must be a string.
• For percentile imputation, the percentage value must be between 0 and 100.
Notes:
• This feature is experimental.
• Time columns cannot be imputed.
Imputation is disabled by default. It can be enabled by setting enable_imputation=true in the config.toml (for
native installs) or via the DRIVERLESS_AI_ENABLE_IMPUTATION=true environment variable (Docker image
installs). This enables imputation functionality in transformers.
Once imputation is enabled, you will have the option when running an experiment to add imputation columns.
1. Click on Columns Imputation in the Experiment Setup page.
7. At this point, you can add additional imputations, delete the imputation you just created, or close this form and
return to the experiment. Note that each column can have only a single imputation.
Time-series forecasting is one of the most common and important tasks in business analytics. There are many real-
world applications like sales, weather, stock market, and energy demand, just to name a few. At H2O, we believe
that automation can help our users deliver business value in a timely manner. Therefore, we combined advanced time
series analysis and our Kaggle Grand Masters’ time-series recipes into Driverless AI.
The key features/recipes that make automation possible are:
• Automatic handling of time groups (e.g., different stores and departments)
• Robust time-series validation
– Accounts for gaps and forecast horizon
– Uses past information only (i.e., no data leakage)
• Time-series-specific feature engineering recipes
– Date features like day of week, day of month, etc.
– AutoRegressive features, like optimal lag and lag-features interaction
– Different types of exponentially weighted moving averages
– Aggregation of past information (different time groups and time intervals)
– Target transformations and differentiation
• Integration with existing feature engineering functions (recipes and optimization)
• Rolling-window based predictions for time series experiments with test-time augmentation or re-fit
• Automatic pipeline generation (See “From Kaggle Grand Masters’ Recipes to Production Ready in a Few
Clicks” blog post.)
Driverless AI uses GBMs, GLMs and neural networks with a focus on time-series-specific feature engineering. The
feature engineering includes:
• Autoregressive elements: creating lag variables
• Aggregated features on lagged variables: moving averages, exponential smoothing descriptive statistics, corre-
lations
• Date-specific features: week number, day of week, month, year
Gap
The guiding principle for properly modeling a time series forecasting problem is to use the historical data in the model
training dataset such that it mimics the data/information environment at scoring time (i.e. deployed predictions).
Specifically, you want to partition the training set to account for: 1) the information available to the model when
making predictions and 2) the number of units out that the model should be optimized to predict.
Given a training dataset, the gap and forecast horizon are parameters that determine how to split the training dataset
into training samples and validation samples.
Gap is the number of missing time bins between the end of the training set and the start of the test set (with regard to time).
For example:
• Assume there are daily data with days 1/1/2019, 2/1/2019, 3/1/2019, 4/1/2019 in train. There are 4 days in total
for training.
• In addition, the test data will start from 6/1/2019. There is only 1 day in the test data.
• The previous day (5/1/2019) does not belong to the train data. It is a day that cannot be used for training (i.e.,
because information from that day may not be available at scoring time). This day cannot be used to derive
information (such as historical lags) for the test data either.
• Here the time bin (or time unit) is 1 day. This is the time interval that separates the different samples/rows in the
data.
• In summary, there are 4 time bins/units for the train data and 1 time bin/unit for the test data plus the Gap.
• In order to estimate the Gap between the end of the train data and the beginning of the test data, the following
formula is applied.
• Gap = min(time bin test) - max(time bin train) - 1.
• In this case min(time bin test) is 6 (or 6/1/2019). This is the earliest (and only) day in the test data.
• max(time bin train) is 4 (or 4/1/2019). This is the latest (or the most recent) day in the train data.
• Therefore the Gap is 1 time bin (or 1 day in this case), because Gap = 6 - 4 - 1 = 1.
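The same arithmetic can be checked with a few lines of Python; the dates below are the ones from the example (day-first format), and the time bin is one day:

from datetime import date

time_bin_days = 1                 # one row per day in this example

max_train = date(2019, 1, 4)      # 4/1/2019: latest day in the train data
min_test = date(2019, 1, 6)       # 6/1/2019: earliest day in the test data

# Gap = min(time bin test) - max(time bin train) - 1, measured in time bins
gap = (min_test - max_train).days // time_bin_days - 1
print(gap)                        # 1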
Forecast Horizon
Quite often, it is not possible to have the most recent data available when applying a model (or it is costly to update
the data table too often); hence models need to be built accounting for a “future gap”. For example, if it takes a week
to update a certain data table, ideally we would like to predict “7 days ahead” with the data as it is “today”; hence a
gap of 7 days would be sensible. Not specifying a gap and predicting 7 days ahead with the data as it is today is unrealistic
(and cannot happen, as we update the data on a weekly basis in this example). Similarly, the gap can be used by those
who want to forecast further in advance. For example, if users want to know what will happen 7 days in the future, they
will set the gap to 7 days.
Forecast Horizon (or prediction length) is the period that the test data spans for (for example, one day, one week,
etc.). In other words it is the future period that the model can make predictions for (or the number of units out that
the model should be optimized to predict). Forecast horizon is used in feature selection and engineering and in model
selection. Note that forecast horizon might not equal the number of predictions. The actual predictions are determined
by the test dataset.
The periodicity of updating the data may require model predictions to account for significant time in the future. In
an ideal world where data can be updated very quickly, predictions can always be made having the most recent data
available. In this scenario there is no need for a model to be able to predict cases that are well into the future, but rather
focus on maximizing its ability to predict short term. However this is not always the case, and a model needs to be
able to make predictions that span deep into the future because it may be too costly to make predictions every single
day after the data gets updated.
In addition, each future data point is not the same. For example, predicting tomorrow with today’s data is easier than
predicting 2 days ahead with today’s data. Hence specifying the forecast horizon can facilitate building models that
optimize prediction accuracy for these future time intervals.
time_period_in_seconds
Note: This is only available in the Python and R clients. Time period in seconds cannot be specified in the UI.
In Driverless AI, the forecast horizon (a.k.a., num_prediction_periods) needs to be in periods, and the size
is unknown. To overcome this, you can use the optional time_period_in_seconds parameter when running
start_experiment_sync (in Python) or train (in R). This is used to specify the forecast horizon in real time
units (as well as for the gap). If this parameter is not specified, then Driverless AI will automatically detect the period size
in the experiment, and the forecast horizon value will respect this period. I.e., if you are sure that your data has a 1
week period, you can say num_prediction_periods=14; otherwise it is possible that the model will not work
correctly.
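As a hedged illustration only, a time series experiment launched from the Python client might look roughly like the following. Only num_prediction_periods and time_period_in_seconds are taken from the text above; the Client import path and the other argument names are assumptions, so check your client version's documentation for the exact names:

# Hypothetical sketch of a Python client call; argument names other than
# num_prediction_periods and time_period_in_seconds are assumptions.
from h2oai_client import Client

h2oai = Client(address='http://localhost:12345', username='user', password='pass')

experiment = h2oai.start_experiment_sync(
    dataset_key='my_train_dataset_key',   # assumed: key of an already-uploaded training dataset
    target_col='sales',
    is_classification=False,
    time_col='date',
    num_prediction_periods=7,             # forecast horizon, in periods
    time_period_in_seconds=86400,         # one day per period (and per gap unit)
    accuracy=5, time=5, interpretability=5)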
Groups
Groups are categorical columns in the data that can significantly help predict the target variable in time series problems.
For example, one may need to predict sales given information about stores and products. Being able to identify that
the combination of store and products can lead to very different sales is key for predicting the target variable, as a big
store or a popular product will have higher sales than a small store and/or with unpopular products.
For example, if we don’t know that the store is available in the data, and we try to see the distribution of sales along
time (with all stores mixed together), it may look like that:
The same graph grouped by store gives a much clearer view of what the sales look like for different stores.
Lag
The primary generated time series features are lag features, which are a variable’s past values. At a given sample with
time stamp 𝑡, features at some time difference 𝑇 (lag) in the past are considered. For example, if the sales today are
300, and sales of yesterday are 250, then the lag of one day for sales is 250. Lags can be created on any feature as well
as on the target.
As previously noted, the training dataset is split so that the number of validation samples equals the number of samples in the testing dataset. To determine valid lags, we must consider what happens when the model is evaluated on the testing dataset: essentially, the minimum lag size must be greater than the gap size.
Aside from the minimum usable lag, Driverless AI attempts to discover predictive lag sizes based on auto-correlation.
“Lagging” variables are important in time series because knowing what happened in different time periods in the past
can greatly facilitate predictions for the future. Consider the following example to see the lag of 1 and 2 days:
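For instance, a minimal pandas sketch (illustrative only; Driverless AI generates lag features automatically) of 1-day and 2-day lags on a small daily sales series:

import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "sales": [200, 250, 300, 280, 310],
})
sales["sales_lag1"] = sales["sales"].shift(1)   # yesterday's sales
sales["sales_lag2"] = sales["sales"].shift(2)   # sales two days ago
print(sales)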
Window/Moving Average
Using the lag table above, a moving average of 2 is the average of Lag1 and Lag2:
Aggregating multiple lags together (instead of using just a single lag) can add stability to the features used to model the target variable. The aggregation may include various lag values, for example lags [1-30], lags [20-40], or lags [7-70 by 7].
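Continuing the small lag sketch above (again purely illustrative), a 2-lag moving average is simply the mean of Lag1 and Lag2:

# Average of the two most recent past values (excludes today's value).
sales["lag_ma2"] = sales[["sales_lag1", "sales_lag2"]].mean(axis=1)
# Equivalent rolling form:
sales["rolling_ma2"] = sales["sales"].shift(1).rolling(window=2).mean()
print(sales[["date", "sales", "lag_ma2", "rolling_ma2"]])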
Exponential Weighting
Exponential weighting is a form of weighted moving average in which more recent values carry higher weight than less recent values. The weight decays exponentially over time according to an alpha (a) hyperparameter in (0, 1), which is typically within the range [0.9, 0.99]. For example:
• Exponential weight = a**(time)
• If sales 1 day ago = 3.0, sales 2 days ago = 4.5, and a = 0.95:
• Exponentially weighted average = (3.0*(0.95**1) + 4.5*(0.95**2)) / ((0.95**1) + (0.95**2)) ≈ 3.73
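The worked example above can be reproduced with a few lines of Python (illustrative only):

a = 0.95
sales_by_lag = {1: 3.0, 2: 4.5}                      # sales 1 day ago and 2 days ago
weights = {t: a ** t for t in sales_by_lag}          # exponential weight a**(time)
exp_weighted = sum(v * weights[t] for t, v in sales_by_lag.items()) / sum(weights.values())
print(round(exp_weighted, 2))                        # 3.73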
Driverless AI supports rolling-window-based predictions for time-series experiments with two options: Test Time
Augmentation (TTA) or re-fit.
This process is automated when the test set spans a longer period than the forecast horizon. If the user does not provide a test set but scores one after the experiment is finished, rolling predictions are still applied as long as the selected horizon is shorter than the test set.
When using Rolling Windows and TTA, Driverless AI takes into account the Prediction Duration and the Rolling
Duration.
• Prediction Duration (PD): This is the duration configured as the forecast horizon while training the Driverless AI experiment. If you don't want to predict beyond the horizon configured during training when using the experiment's scoring pipeline, then PD may be the same as the test data duration/horizon, as shown in the previous Horizon image (above).
Note: When using TTA, the prediction duration represents the forecast horizon during experiment train-
ing. During scoring, the prediction duration will be the duration of data passed to score for each invocation
of the score_batch method of the scoring module.
• Rolling Duration (RD): This is the duration by which we move ahead (roll) in time before scoring the next prediction duration of data.
Usually, the forecast horizon (prediction length) H equals the number of time periods in the testing data N_TEST (i.e., N_TEST = H). You want enough training data time periods N_TRAIN to score well on the testing dataset. At a minimum, the training dataset should contain at least three times as many time periods as the testing dataset (i.e., N_TRAIN >= 3 * N_TEST). This allows the training dataset to be split into a validation set with the same number of time periods as the testing dataset while maintaining enough historical data for feature engineering.
Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a
machine learning problem, we formulate the historical sales data and additional attributes as shown below:
Raw data
The additional attributes are attributes that we will know at time of scoring. In this example, we want to forecast the
next week of sales. Therefore, all of the attributes included in our data must be known at least one week in advance.
In this case, we assume that we will know whether or not a Store and Department will be running a promotional
markdown. We will not use features like the temperature of the Week since we will not have that information at the
time of scoring.
Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine
learning and sort out the rest. If this is your very first session, the Driverless AI assistant will guide you through the
journey.
Similar to previous Driverless AI examples, you need to select the dataset for training/test and define the target. For
time-series, you need to define the time column (by choosing AUTO or selecting the date column manually). If
weighted scoring is required (like the Walmart Kaggle competition), you can select the column with specific weights
for different samples.
If you prefer to use automatic handling of time groups, you can leave the setting for time groups columns as AUTO,
or you can define specific time groups. You can also specify the columns that will be unavailable at prediction time,
the forecast horizon (in weeks), and the gap (in weeks) between the train and test periods.
Once the experiment is finished, you can make new predictions and download the scoring pipeline just like with any other Driverless AI experiment.
The user may further configure the time series experiments with a dedicated set of options available through the
EXPERT SETTINGS. The EXPERT SETTINGS panel is available from within the experiment page right above the
Scorer knob.
Refer to Time Series Settings for information about the available Time Series Settings options.
When you set the experiment’s forecast horizon, you are telling the Driverless AI experiment the dates this model will
be asked to forecast for. In the Walmart Sales example, we set the Driverless AI forecast horizon to 1 (1 week in the
future). This means that Driverless AI expects this model to be used to forecast 1 week after training ends. Since the training data ends on 2012-10-26, this model should be used to score the week of 2012-11-02.
What should the user do once the 2012-11-02 week has passed?
There are two options:
Option 1: Trigger a Driverless AI experiment to be trained once the forecast horizon ends. A Driverless AI experiment
will need to be re-trained every week.
Option 2: Use Test Time Augmentation to update historical features so that we can use the same model to forecast
outside of the forecast horizon.
Test Time Augmentation refers to the process where the model stays the same but the features are refreshed using the
latest data. In our Walmart Sales Forecasting example, a feature that may be very important is the Weekly Sales from
the previous week. Once we move outside of the forecast horizon, our model no longer knows the Weekly Sales from
the previous week. By performing Test Time Augmentation, Driverless AI will automatically generate these historical
features if new data is provided.
In Option 1, we would launch a new Driverless AI experiment every week with the latest data and use the resulting
model to forecast the next week. In Option 2, we would continue using the same Driverless AI experiment outside of
the forecast horizon by using Test Time Augmentation.
Both options have their advantages and disadvantages. By re-training an experiment with the latest data, Driverless
AI has the ability to possibly improve the model by changing the features used, choosing a different algorithm, and/or
selecting different parameters. As the data changes over time, for example, Driverless AI may find that the best
algorithm for this use case has changed.
Refer to this example to see how to use the scoring pipeline to predict future data instead of using the prediction endpoint on the Driverless AI server.
Using Test Time Augmentation to keep using the same experiment over a longer period of time means there is no need to continually repeat the model review process. However, the model may become out of date, and the MOJO scoring pipeline is not supported with Test Time Augmentation.
Depending on the use case, there may be clear advantages to retraining an experiment after each forecast horizon or to using Test Time Augmentation. The Time Series Model Rolling Window notebook shows how to perform both and compare their performance.
How to trigger Test Time Augmentation?
To tell Driverless AI to perform Test Time Augmentation, simply create your forecast data so that it includes any data that occurred after the training data ended, up to the date you want a forecast for. The dates that you want Driverless AI to forecast should have NA in the target column. Here is an example of forecasting 2012-11-09.
If we do not include an NA in the target column for the date we are interested in forecasting, then Test Time Augmentation will not be triggered.
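As a hedged sketch (the column names follow the Walmart example in this chapter, and the numeric value for the already-observed week is only a placeholder), the forecast file might be assembled like this:

import numpy as np
import pandas as pd

forecast_data = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2012-11-02", "2012-11-09"],
    "IsHoliday": [0, 0],
    "Weekly_Sales": [24000.0, np.nan],   # placeholder actual for the week that has passed; NA marks the week to forecast
})
forecast_data.to_csv("./walmart_tta_forecast.csv", index=False)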
Refer to the following for examples showing how to run Time Series examples in Driverless AI:
• Training a Time Series Model
• Time Series Recipes with Rolling Window
• Time Series Pipeline with Time Test Augmentation
NLP IN DRIVERLESS AI
Driverless AI version 1.3 introduced support for TensorFlow Natural Language Processing (NLP) experiments for text
classification and regression problems. The Driverless AI platform has the ability to support both standalone text and
text with other numerical values as predictive features.
The following is the set of features created by the NLP recipe for a given text column:
• N-gram frequency / TFIDF followed by Truncated SVD
• N-gram frequency / TFIDF followed by Linear / Logistic regression
• Word embeddings followed by CNN model (TensorFlow)
• Word embeddings followed by BiGRU model (TensorFlow)
• Character embeddings followed by CNN model (TensorFlow)
In addition to these techniques, Driverless AI supports custom NLP recipes using, for example, PyTorch or Flair.
40.1 n-gram
Frequency-based features represent the count of each word from a given text in the form of vectors. These are created
for different n-gram values. For example, a one-gram is equivalent to a single word, a two-gram is equivalent to two
consecutive words paired together, and so on.
Words and n-grams that occur more often receive a higher weight; rare ones receive a lower weight.
Frequency-based features can be multiplied with the inverse document frequency to get term frequency–inverse doc-
ument frequency (TFIDF) vectors. Doing so also gives importance to the rare terms that occur in the corpus, which
may be helpful in certain classification tasks.
TFIDF and the frequency of n-grams both result in higher dimensions of the representational vectors. To counteract
this, Truncated SVD is commonly used to decompose the vectorized arrays into lower dimensions.
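As an illustrative sketch of the technique itself (not the Driverless AI implementation), n-gram TF-IDF features can be built and reduced with Truncated SVD using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the flight was great", "the flight was delayed", "great service"]

tfidf = TfidfVectorizer(ngram_range=(1, 2))           # unigrams and bigrams
X = tfidf.fit_transform(docs)                         # sparse, high-dimensional matrix
svd = TruncatedSVD(n_components=2, random_state=0)    # decompose into lower dimensions
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)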
Linear models are also available in the Driverless AI NLP recipe. These capture linear dependencies that are crucial
to the process of achieving high accuracy rates.
Word embedding is the collective name for a set of feature engineering techniques for text in which words or phrases from the vocabulary are mapped to vectors of real numbers. Representations are made so that words with similar
meanings are placed close to or equidistant from one another. For example, the word “king” is closely associated with
the word “queen” in this kind of vector representation.
TFIDF and frequency-based models represent counts and significant word information, but they lack the semantic
context for these words. Word embedding techniques are used to make up for this lack of semantic information.
Although Convolutional Neural Network (CNN) models are primarily used for image-level machine learning tasks, they have also proven to be efficient at representing text, and they are faster to train than RNN models. In Driverless AI, we pass word embeddings as input to CNN models, which return cross-validated predictions that can be used as a new set of features.
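As a sketch of this idea in TensorFlow/Keras (illustrative only; it does not reproduce the Driverless AI feature pipeline), word embeddings can feed a 1-D CNN text classifier:

import tensorflow as tf

vocab_size, embed_dim, max_len = 20000, 100, 100       # assumed sizes for illustration
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # word embeddings
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # e.g., binary sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])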
Recurrent neural networks like long short-term memory units (LSTM) and gated recurrent units (GRU) are state-
of-the-art algorithms for NLP problems. In Driverless AI, we implement bi-directional GRU features for previous
word steps and for later steps to predict the current state. For example, in the sentence “John is walking on the golf course,” a unidirectional model would represent the state for “golf” based only on “John is walking on,” and would not take “course” into account. With a bi-directional model, the representation also accounts for the later words, giving the model more predictive power.
In simple terms, a bi-directional GRU model combines two independent RNN models into a single model. A GRU
architecture provides high speeds and accuracy rates similar to a LSTM architecture. As with CNN models, we pass
word embeddings as input to these models, which return cross validated predictions that can be used as a new set of
features.
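Similarly, a minimal bi-directional GRU sketch (illustrative only, not the Driverless AI implementation):

import tensorflow as tf

vocab_size, embed_dim, max_len = 20000, 100, 100       # assumed sizes for illustration
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),  # reads the sequence in both directions
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])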
For languages like Japanese and Mandarin Chinese, where characters play a major role, character level embedding is
available in the NLP recipe.
In character embedding, each character is represented in the form of vectors rather than words. Driverless AI uses
character level embedding as the input to CNN models and later extracts class probabilities to feed as features for
downstream models.
The image below represents the overall set of features created by our NLP recipes:
The naming conventions of the NLP features help to understand the type of feature that has been created.
The syntax for the feature names is as follows:
[FEAT TYPE]:[COL].[TARGET_CLASS]
• [FEAT TYPE] represents one of the following:
• Txt – Frequency / TFIDF of N-grams followed by SVD
• TxtTE - Frequency / TFIDF of N-grams followed by linear model
• TextCNN_TE – Word embeddings followed by CNN model
• TextBiGRU_TE – Word embeddings followed by Bidirectional GRU model
• TextCharCNN_TE – Character embeddings followed by CNN model
• [COL] represents the name of the text column.
• [TARGET_CLASS] represents the target class for which the model predictions are made.
For example, TxtTE:text.0 equates to class 0 predictions for the text column “text” using Frequency / TFIDF of n-
grams followed by a linear model.
A number of configurable settings are available for NLP in Driverless AI. Refer to NLP Settings in the Expert Settings
topic for more information.
The following section provides an NLP example. This information is based on the Automatic Feature Engineering for
Text Analytics blog post. A similar example using the Python Client is available in The Python Client.
This example uses the classic task of sentiment analysis on tweets using the US Airline Sentiment dataset from Figure Eight's Data for Everyone library. We can split the dataset into training and test sets with a simple script. For this demo we will just use the tweets in the 'text' column and the sentiment (positive, negative, or neutral) in the 'airline_sentiment' column. Here are some samples from the dataset:
Once we have our dataset ready in tabular format, we are all set to use Driverless AI. As with other problems in the Driverless AI setup, we need to choose the dataset and then specify the target column ('airline_sentiment'). Because we don't want to use any other columns in the dataset, we click on Dropped Cols and then exclude everything but text, as shown below:
Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to Expert Settings and enable
TensorFlow Models.
At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated
during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We
recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.
Once the experiment is done, users can make new predictions and download the scoring pipeline just like with any other Driverless AI experiment.
Resources:
• fastText: https://fasttext.cc/
• GloVe: https://nlp.stanford.edu/projects/glove/
THE PYTHON CLIENT
This section describes how to install the Driverless AI Python Client. It also provides some end-to-end examples
showing how to use the Driverless AI Python client. Additional examples are available in the https://github.com/
h2oai/driverlessai-tutorials repository.
Note: This section and the Python API are in a pre-release state and are a WIP. The documentation and API will both
continue to change, and functions described in these examples may not work in other versions of Driverless AI.
The Python Client is available on the Driverless AI UI and published on the h2oai channel at https://anaconda.org/
h2oai/repo.
Requirements
Download from UI
On the Driverless AI top menu, select the RESOURCES > PYTHON CLIENT link. This downloads the
h2oai_client wheel.
The Driverless AI Python client is exposed as the /clients/py HTTP end point. This can be accessed via the
command line:
wget --trust-server-names http://<Driverless AI address>/clients/py
Wheel Installation
Install this wheel to your local Python via the pip command. Once installed, you can launch a Jupyter notebook and begin using the Driverless AI Python Client.
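For example (the exact wheel filename depends on the Driverless AI version you downloaded):
pip install h2oai_client-*.whl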
Note: Conda installs of the Python client are not supported on Windows.
Requirements
• Conda Package Manager. You can install the Conda Package Manager either through Anaconda or Miniconda.
Note that the Driverless AI Python client requires Python 3.6, so ensure that you install the Python 3 version of
Anaconda or Miniconda.
– Anaconda Install Instructions
– Miniconda Install Instructions
Installation Procedure
After Conda is installed and the Conda executable is available in $PATH, create a new Anaconda environment for
h2oai_client:
conda create -n h2oaiclientenv -c h2oai -c conda-forge h2oai_client
The above command installs the latest version of the Python client. Include the version number to install a specific
version. For example, the following command installs the 1.6.3 Python client:
conda create -n h2oaiclientenv -c h2oai -c conda-forge h2oai_client=1.6.3
This notebook provides an H2OAI Client workflow of model building and scoring that parallels the Driverless AI workflow.
Notes:
• This is an early release of the Driverless AI Python client.
• This notebook was tested in Driverless AI version 1.8.2.
• Python 3.6 is the only supported version.
• You must install the h2oai_client wheel to your local Python. This is available from the RESOURCES
link in the top menu of the UI.
1. Sign In
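A sign-in step typically looks like the following, mirroring the connection examples later in this chapter; replace the address and credentials with your own:

from h2oai_client import Client

address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address=address, username=username, password=password)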
2. Upload Datasets
Upload training and testing datasets from the Driverless AI /data folder.
You can provide a training, validation, and testing dataset for an experiment. The validation and testing dataset are
optional. In this example, we will provide only training and testing.
[3]: train_path = '/data/Kaggle/CreditCard/CreditCard-train.csv'
test_path = '/data/Kaggle/CreditCard/CreditCard-test.csv'
train = h2oai.create_dataset_sync(train_path)
test = h2oai.create_dataset_sync(test_path)
We will now set the parameters of our experiment. Some of the parameters include:
• Target Column: The column we are trying to predict.
• Dropped Columns: The columns we do not want to use as predictors such as ID columns, columns with data
leakage, etc.
• Weight Column: The column that indicates the per row observation weights. If None, each row will have an
observation weight of 1.
• Fold Column: The column that indicates the fold. If None, the folds will be determined by Driverless AI.
• Is Time Series: Whether or not the experiment is a time-series use case.
For information on the experiment settings, refer to the Experiment Settings.
For this example, we will be predicting ``default payment next month``. The parameters that con-
trol the experiment process are: accuracy, time, and interpretability. We can use the
get_experiment_preview_sync function to get a sense of what will happen during the experiment.
We will start out by seeing what the experiment will look like with accuracy, time, and interpretability
all set to 5.
[4]: target="default payment next month"
exp_preview = h2oai.get_experiment_preview_sync(dataset_key= train.key
, validset_key=''
, classification=True
, dropped_cols=[]
, target_col=target
, is_time_series=False
, enable_gpus=False
, accuracy=5, time=5, interpretability=5
, reproducible=True
, resumed_experiment_id=''
, time_col=''
, config_overrides=None)
exp_preview
With these settings, the Driverless AI experiment will train about 124 models:
• 16 for model and feature tuning
• 104 for feature evolution
• 4 for the final pipeline
When we start the experiment, we can either:
• specify parameters
• use Driverless AI to suggest parameters
Driverless AI can suggest the parameters based on the dataset and target column. Below we will use the
``get_experiment_tuning_suggestion`` to see what settings Driverless AI suggests.
[5]: # let Driverless suggest parameters for experiment
params = h2oai.get_experiment_tuning_suggestion(dataset_key = train.key, target_col = target,
is_classification = True, is_time_series = False,
config_overrides = None, cols_to_drop=[])
params.dump()
Driverless AI has found that the best parameters are to set ``accuracy = 5``, ``time = 4``, ``interpretability = 6``. It
has selected ``AUC`` as the scorer (this is the default scorer for binomial problems).
Launch the experiment using the parameters that Driverless AI suggested along with the testset, scorer, and seed that
were added. We can launch the experiment with the suggested parameters or create our own.
[6]: experiment = h2oai.start_experiment_sync(dataset_key=train.key,
testset_key = test.key,
target_col=target,
is_classification=True,
accuracy=5,
time=4,
interpretability=6,
scorer="AUC",
enable_gpus=True,
seed=1234,
cols_to_drop=['ID'])
5. Examine Experiment
View the final model score for the validation and test datasets. When feature engineering is complete, an ensemble
model can be built depending on the accuracy setting. The experiment object also contains the score on the validation
and test data for this ensemble model. In this case, the validation score is the score on the training cross-validation
predictions.
[7]: print("Final Model Score on Validation Data: " + str(round(experiment.valid_score, 3)))
print("Final Model Score on Test Data: " + str(round(experiment.test_score, 3)))
The experiment object also contains the scores calculated for each iteration on bootstrapped samples on the validation
data. In the iteration graph in the UI, we can see the mean performance for the best model (yellow dot) and +/- 1
standard deviation of the best model performance (yellow bar).
This information is saved in the experiment object.
[8]: # Add scores from experiment iterations
iteration_data = h2oai.list_model_iteration_data(experiment.key, 0, len(experiment.iteration_data))
iterations = list(map(lambda iteration: iteration.iteration, iteration_data))
scores_mean = list(map(lambda iteration: iteration.score_mean, iteration_data))
scores_sd = list(map(lambda iteration: iteration.score_sd, iteration_data))
plt.figure()
plt.errorbar(iterations, scores_mean, yerr=scores_sd, color = "y",
ecolor='yellow', fmt = '--o', elinewidth = 4, alpha = 0.5)
plt.xlabel("Iteration")
plt.ylabel("AUC")
plt.ylim([0.65, 0.82])
plt.show();
6. Download Results
Once an experiment is complete, we can see that the UI presents us options of downloading the:
• predictions
– on the (holdout) train data
– on the test data
• experiment summary - summary of the experiment including feature importance
We will show an example of downloading the test predictions below. Note that equivalent commands can also be run
for downloading the train (holdout) predictions.
[9]: h2oai.download(src_path=experiment.test_predictions_path, dest_dir=".")
[9]: './test_preds.csv'
It is also possible to use the Python API to examine an experiment that was started through the Web UI using the
experiment key.
You can get a pointer to the experiment by referencing the experiment key in the Web UI.
[11]: # Get list of experiments
experiment_list = list(map(lambda x: x.key, h2oai.list_models(offset=0, limit=100).models))
experiment_list
[11]: ['7f7b429e-33dc-11ea-ba27-0242ac110002',
'0be7d94a-33d8-11ea-ba27-0242ac110002',
'2e6bbcfa-30a1-11ea-83f9-0242ac110002',
'3c06c58c-27fd-11ea-9e09-0242ac110002',
'a3c6dfda-2353-11ea-9f4a-0242ac110002',
'15fe0c0a-203d-11ea-97b6-0242ac110002']
You can use the Python API to score on new data. This is equivalent to the SCORE ON ANOTHER DATASET
button in the Web UI. The example below scores on the test data and then downloads the predictions.
Pass in any dataset that has the same columns as the original training set. If you passed a test set during the H2OAI
model building step, the predictions already exist.
The following shows the predicted probability of default for each record in the test set.
[13]: prediction = h2oai.make_prediction_sync(experiment.key, test.key, output_margin = False, pred_contribs = False)
pred_path = h2oai.download(prediction.predictions_csv_path, '.')
pred_table = pd.read_csv(pred_path)
pred_table.head()
We can also get the contribution each feature had to the final prediction by setting pred_contribs = True. This gives us an idea of how each feature affects the predictions.
[14]: prediction_contributions = h2oai.make_prediction_sync(experiment.key, test.key,
output_margin = False, pred_contribs = True)
pred_contributions_path = h2oai.download(prediction_contributions.predictions_csv_path, '.')
pred_contributions_table = pd.read_csv(pred_contributions_path)
pred_contributions_table.head()
[5 rows x 24 columns]
We will examine the contributions for our first record more closely.
[15]: contrib = pd.DataFrame(pred_contributions_table.iloc[0][1:])
contrib.columns = ["contribution"]
contrib["abs_contribution"] = contrib.contribution.abs()
contrib.sort_values(by="abs_contribution", ascending=False)[["contribution"]].head()
[15]: contribution
contrib_10_PAY_0 1.559683
contrib_bias -1.507163
contrib_11_PAY_2 0.336257
contrib_8_LIMIT_BAL 0.163888
contrib_1_BILL_AMT1 -0.100378
The clusters derived from this customer's PAY_0, PAY_2, and LIMIT_BAL values had the greatest impact on their prediction. Since the contribution is positive, we know that it increases the probability that they will default.
You can use the Python API to also perform model diagnostics on new data. This is equivalent to the Model Diagnos-
tics tab in the Web UI.
The example below performs model diagnostics on the test dataset but any data with the same columns can be selected.
[16]: test_diagnostics = h2oai.make_model_diagnostic_sync(experiment.key, test.key)
Once we have completed an experiment, we can interpret our H2OAI model. Model Interpretability is used to provide
model transparency and explanations. More information on Model Interpretability can be found here: http://docs.h2o.
ai/driverless-ai/latest-stable/docs/userguide/interpreting.html.
We can run the model interpretation in the Python client as shown below. By setting the parameter use_raw_features to True, we interpret the model using only the raw features in the data. This will not use the engineered features from our final model to explain the data.
[18]: mli_experiment = h2oai.run_interpretation_sync(dai_model_key = experiment.key,
dataset_key = train.key,
target_col = target,
use_raw_feature = True)
This is equivalent to clicking Interpret this Model on Original Features in the UI once the experiment has completed.
Once our interpretation is finished, we can navigate to the MLI tab in the UI to see our interpreted model.
We can also see the list of interpretations using the Python Client:
[19]: # Get list of interpretations
mli_list = list(map(lambda x: x.key, h2oai.list_interpretations(offset=0, limit=100)))
mli_list
[19]: ['2ff44e94-33de-11ea-ba27-0242ac110002',
'6ce9e9f6-33db-11ea-ba27-0242ac110002']
Model Interpretation does not need to be run on a Driverless AI experiment. We can also train an external model
and run Model Interpretability on the predictions. In this next section, we will walk through the steps to interpret an
external model.
We will begin by training a model with scikit-learn. Our end goal is to use Driverless AI to interpret the predictions
made by our scikit-learn model.
[25]: # Dataset must be located where Python client is running - you may need to download it locally
train_pd = pd.read_csv(train_path)
gbm_model = GradientBoostingClassifier(random_state=10)
gbm_model.fit(train_pd[predictors], train_pd[target])
Now that we have the predictions from our scikit-learn GBM model, we can call Driverless AI's ``h2oai.run_interpretation_sync`` to create the interpretation screen.
[28]: train_gbm_path = "./CreditCard-train-gbm_pred.csv"
predictions = pd.concat([train_pd, pd.DataFrame(predictions[:, 1], columns = ["p1"])], axis = 1)
predictions.to_csv(path_or_buf=train_gbm_path, index = False)
We can also run Model Interpretability on an external model in the UI as shown below:
In our last section, we will build the scoring pipelines from our experiment. There are two scoring pipeline options:
• Python Scoring Pipeline: requires Python runtime
• MOJO Scoring Pipeline: requires Java runtime
Documentation on the scoring pipelines is provided here: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/
python-mojo-pipelines.html.
The experiment screen shows two scoring pipeline buttons: Download Python Scoring Pipeline or Build MOJO
Scoring Pipeline. Driverless AI determines if any scoring pipeline should be automatically built based on the con-
fig.toml file. In this example, we have run Driverless AI with the settings:
# Whether to create the Python scoring pipeline at the end of each experiment
make_python_scoring_pipeline = true
# Whether to create the MOJO scoring pipeline at the end of each experiment
# Note: Not all transformers or main models are available for MOJO (e.g. no gblinear main model)
make_mojo_scoring_pipeline = false
The Python Scoring Pipeline has been built by default based on our config.toml settings. We can get the path to the
Python Scoring Pipeline in our experiment object.
[32]: experiment.scoring_pipeline_path
[32]: 'h2oai_experiment_daguwofe/scoring_pipeline/scorer.zip'
We can also build the Python Scoring Pipeline - this is useful if the ``make_python_scoring_pipeline`` option was
set to false.
[58]: python_scoring_pipeline = h2oai.build_scoring_pipeline_sync(experiment.key)
[59]: python_scoring_pipeline.file_path
[59]: 'h2oai_experiment_adbb4dca-c460-11e9-b1a0-0242ac110002/scoring_pipeline/scorer.zip'
[60]: './scorer.zip'
The MOJO Scoring Pipeline has not been built by default because of our config.toml settings. We can build the MOJO
Scoring Pipeline using the Python client. This is equivalent to selecting the Build MOJO Scoring Pipeline on the
experiment screen.
[61]: mojo_scoring_pipeline = h2oai.build_mojo_pipeline_sync(experiment.key)
[62]: mojo_scoring_pipeline.file_path
[62]: 'h2oai_experiment_adbb4dca-c460-11e9-b1a0-0242ac110002/mojo_pipeline/mojo.zip'
[63]: './mojo.zip'
Once the MOJO Scoring Pipeline is built, the Build MOJO Scoring Pipeline changes to Download MOJO Scoring
Pipeline.
The purpose of this notebook is to show an example of using Driverless AI to train a time series model. Our goal will
be to forecast the Weekly Sales for a particular Store and Department for the next week. The data used in this notebook
is from the: Walmart Kaggle Competition where features.csv and train.csv have been joined together.
Note: This notebook was tested and run on Driverless AI 1.8.1.
41.3.1 Workflow
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
We will begin by importing our data using pandas. We are going to first work with the data in Python to correctly
format it for a Driverless AI time series use case.
[2]: sales_data = pd.read_csv("./walmart_train.csv")
sales_data.head()
(Truncated output: the first five rows of sales_data; only the rightmost columns, IsHoliday and sample_weight, are visible here.)
The data has one record per Store, Department, and Week. Our goal for this use case will be to forecast the total sales
for the next week.
The only features we should use as predictors are those that will be available at the time of scoring. Features like Temperature, Fuel Price, and Unemployment will not be known in advance. Therefore, before we start our Driverless AI experiments, we will choose to use the previous week's Temperature, Fuel Price, Unemployment, and CPI attributes, since we will know this information at the time of scoring.
[4]: lag_variables = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]
dai_data = sales_data.set_index(["Date", "Store", "Dept"])
lagged_data = dai_data.loc[:, lag_variables].groupby(level=["Store", "Dept"]).shift(1)
[6]: # Drop the original (current-week) predictor variables - we do not want to use these in the model
dai_data = dai_data.drop(lag_variables, axis=1)
dai_data = dai_data.reset_index()
[7]: dai_data.head()
Now that our training data is correctly formatted, we can run a Driverless AI experiment to forecast the next week’s
sales.
We will split our data into two pieces: training and test (which consists of the last week of data).
[8]: train_data = dai_data.loc[dai_data["Date"] < "2012-10-26"]
test_data = dai_data.loc[dai_data["Date"] == "2012-10-26"]
We will now launch the Driverless AI experiment. To do that we will need to specify the parameters for our experiment.
Some of the parameters include:
• Target Column: The column we are trying to predict.
• Dropped Columns: The columns we do not want to use as predictors such as ID columns, columns with data
leakage, etc.
• Is Time Series: Whether or not the experiment is a time-series use case.
• Time Column: The column that contains the date/date-time information.
• Time Group Columns: The categorical columns that indicate how to group the data so that there is one time
series per group. In our example, our Time Groups Columns are Store and Dept. Each Store and Dept,
corresponds to a single time series.
• Number of Prediction Periods: How far in the future do we want to predict?
• Number of Gap Periods: After how many periods can we start predicting? If we assume that we can start
forecasting right after the training data ends, then the Number of Gap Periods will be 0.
For this experiment, we want to forecast next week’s sales for each Store and Dept. Therefore, we will use the
following time series parameters:
• Time Group Columns: [Store, Dept]
• Number of Prediction Periods: 1 (a.k.a., horizon)
• Number of Gap Periods: 0
Note that the period size is unknown to the Python client. To overcome this, you can also specify the optional time_period_in_seconds parameter, which lets you specify the horizon in real time units. If this parameter is omitted, Driverless AI will automatically detect the period size in the experiment, and the horizon value will be interpreted in that period. In other words, if you are sure your data has a 1-week period, you can set num_prediction_periods=14 to forecast 14 weeks ahead; otherwise the model may not be set up correctly.
[12]: experiment = h2oai.start_experiment_sync(dataset_key=train_dai.key,
testset_key = test_dai.key,
target_col="Weekly_Sales",
is_classification=False,
cols_to_drop = ["sample_weight"],
accuracy=5,
time=3,
interpretability=1,
scorer="RMSE",
enable_gpus=True,
seed=1234,
time_col = "Date",
time_groups_columns = ["Store", "Dept"],
num_prediction_periods = 1,
num_gap_periods = 0)
Now that our experiment is complete, we can view the model performance metrics within the experiment object.
[13]: print("Validation RMSE: ${:,.0f}".format(experiment.valid_score))
print("Test RMSE: ${:,.0f}".format(experiment.test_score))
We can also plot the actual versus predicted values from the test data.
[14]: plt.scatter(experiment.test_act_vs_pred.x_values, experiment.test_act_vs_pred.y_values)
plt.plot([0, max(experiment.test_act_vs_pred.x_values)],[0, max(experiment.test_act_vs_pred.y_values)], 'b--',)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Lastly, we can download the test predictions from Driverless AI and examine the forecasted sales vs actual for a
selected store and department.
[15]: preds_path = h2oai.download(src_path=experiment.test_predictions_path, dest_dir=".")
forecast_predictions = pd.read_csv(preds_path)
forecast_predictions.columns = ["predicted_Weekly_Sales"]
fig, ax = plt.subplots()
ax.plot(selected_ts["Date"], selected_ts["Weekly_Sales"], label = "Actual")
ax.plot(selected_ts_forecast["Date"], selected_ts_forecast["predicted_Weekly_Sales"], marker='o', label = "Predicted")
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
plt.legend(loc='upper left')
plt.show()
The purpose of this notebook is to show an example of using Driverless AI to train experiments on different subsets of
data. This would result in a collection of forecasted values that can be evaluated. The data used in this notebook is a
public dataset: S+P 500 Stock Data. In this example, we are using the all_stocks_5yr.csv dataset.
41.4.1 Workflow
stock_data = pd.read_csv("./all_stocks_5yr.csv")
stock_data.head()
We will add a new column that is the index. We will use this later to do a rolling window of training and testing. We use this index instead of the actual date because this data only occurs on weekdays (when the stock market is open). When you use Driverless AI to perform a forecast, it will forecast the next n days; in this particular case, we never want to forecast Saturdays and Sundays. We will instead treat our time column as the index of the record.
[3]: dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")
stock_data.head()
Now we will create a function that can split our data by time to create multiple experiments.
We will start by first logging into Driverless AI.
[4]: import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters
Our function will split the data into training and testing based on the training length and testing length specified by the
user. It will then run an experiment in Driverless AI and download the test predictions.
[11]: def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
                        accuracy, time, interpretability):
          # Calculate windows for the training and testing data based on the train_len and test_len arguments
          num_dates = max(dataset[index_col])
          num_windows = (num_dates - train_len) // test_len
          windows = []
          for i in range(num_windows):
              train_start_id = i*test_len
              train_end_id = train_start_id + (train_len - 1)
              test_start_id = train_end_id + 1
              test_end_id = test_start_id + (test_len - 1)
              windows.append((train_start_id, train_end_id, test_start_id, test_end_id))

          forecast_predictions = pd.DataFrame()
          for train_start_id, train_end_id, test_start_id, test_end_id in windows:
              # Slice the data by index for this window (reconstructed step; the original cell is abridged here)
              train_data = dataset[dataset[index_col].between(train_start_id, train_end_id)]
              test_data = dataset[dataset[index_col].between(test_start_id, test_end_id)]

              # Save dataset
              train_path = "./train_data.csv"
              test_path = "./test_data.csv"
              keep_cols = predictors + [target, index_col] + time_group_cols
              train_data[keep_cols].to_csv(train_path)
              test_data[keep_cols].to_csv(test_path)

              # Upload both files to Driverless AI, launch a time series experiment for this window,
              # and download its test predictions into test_predictions (these steps are not shown here).
              # forecast_predictions = forecast_predictions.append(test_predictions)

          return forecast_predictions
[ ]: # We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols,
accuracy = 1, time = 1, interpretability = 1)
[25]: forecast_predictions.head()
This example describes how to run the Python Scoring Pipeline on a time series model. This example has been tested
on a Linux machine.
After successfully completing an experiment in DAI, click the DOWNLOAD PYTHON SCORING PIPELINE
button.
Following the setup instructions included with the downloaded scoring pipeline will create a conda environment with the necessary requirements to run the scoring pipeline and will score the example test.csv files, which proves that the installation is successful.
Run the following to check the list of conda environments:
conda env list
An environment with a name of the format scoring_h2oai_experiment_xxx should be available, where xxx is the name of your experiment.
At this point, you can run the example below.
[ ]: import os
import pandas as pd
from sklearn.model_selection import train_test_split
from dateutil.parser import parse
import datetime
from datetime import timedelta
import numpy as np
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
time1_pd = pd.read_csv(time1_path,parse_dates=['Date'])
time2_pd = pd.read_csv(time2_path,parse_dates=['Date'])
[ ]: # Import the scorer for the experiment. For example, the line below imports
# the scorer for experiment "hawolewo". Be sure to replace "hawolewo"
# with your experiment name.
from scoring_h2oai_experiment_hawolewo import Scorer
[ ]: %%capture
#Create a singleton Scorer instance.
#For optimal performance, create a Scorer instance once, and call score() or score_batch() multiple times.
scorer = Scorer()
Here we look at the overall model performance in test and train. We also show the model horizon window in red to
illustrate the performance when the model is generating predictions beyond the horizon. We prefer to use R-squared
as the performance metric since the groups of Store and Department weekly sales are on vastly different scales.
[ ]: from sklearn.metrics import r2_score, mean_squared_error
def r2_rmse( g ):
r2 = r2_score( g['Weekly_Sales'], g['predict'] )
rmse = np.sqrt( mean_squared_error( g['Weekly_Sales'], g['predict'] ) )
return pd.Series( dict( r2 = r2, rmse = rmse ) )
This is a useful plot for comparing R2 over time between different DAI time series models, each with a different prediction horizon.
[ ]: # Note: horizon_in_weeks is how many weeks the model can predict out to.
# In this example 34 had been picked
horizon_in_weeks = 34
[ ]: %matplotlib inline
import matplotlib.pyplot as plt
metrics_ts['r2'].plot(figsize=(20,10), title=("Training dataset in Green. Test data with model prediction horizon in Red"))
Here we generate the best and worst groups by R2. We filter out groups that have some missing data. We only calculate
R2 within the valid test horizon window.
[ ]: avg_count = train_and_test.groupby(["Store","Dept"]).size().mean()
print("average count: " + str(avg_count))
train_and_test_filtered = train_and_test.groupby(["Store","Dept"]).filter(lambda x: len(x) > 0.8 * avg_count)
train_and_test_filtered = train_and_test_filtered.loc[(train_and_test_filtered.Date < test_window_end) &
(train_and_test_filtered.Date > test_window_start)]
Here we generate the best and worst groups by R2. We filter out groups that have some missing data. We only calculate
R2 within the train horizon window.
[ ]: avg_count = train_and_test.groupby(["Store","Dept"]).size().mean()
print("average count: " + str(avg_count))
train_and_test_filtered = train_and_test.groupby(["Store","Dept"]).filter(lambda x: len(x) > 0.8 * avg_count)
train_and_test_filtered = train_and_test_filtered.loc[(train_and_test_filtered.Date < test_window_start) &
(train_and_test_filtered.Date >= '2010-01-10')]
[ ]: plot_df.plot(figsize=(20,10), title=("Training dataset in Green. Test data with model prediction horizon in Red. Date selection in Yellow"))
deploy=pd.read_csv(deploy_path,parse_dates=['Date'])
deploy=deploy[pd.to_datetime(deploy['Date'])<=((time2_pd.Date.max()) + timedelta(days=7*horizon_in_weeks) )].reset_index(drop=True).copy()
#deploy.loc[deploy['Weekly_Sales'].isna(),'Weekly_Sales']=0
deploy=train_and_test.append(deploy,ignore_index=True).reset_index(drop=True)
deploy=deploy[['Store','Dept','Weekly_Sales','Date','IsHoliday','Hl', 'Size', 'ThanksG', 'Type', 'Unemployment', 'Xmas']].copy()
[ ]: deploy["predict"] = scorer.score_batch(deploy)
#deploy2["predict"] = scorer.score_batch(deploy2)
[ ]: #deploy= deploy.append(test,ignore_index=True)
#deploy = deploy.reset_index(drop=True)
plot_df2 = deploy[(deploy.Store == store_num) &
(deploy.Dept == dept_num)][["Date","Weekly_Sales","predict"]]
plot_df2["Date"] = plot_df2["Date"].apply(lambda x: (x))
plot_df2 = plot_df2.set_index("Date")
[ ]: plot_df2.plot(figsize=(20,10), title=("Training dataset in Green. Test data with model prediction horizon in Red. Date selection in Yellow"))
Shapley values show the contribution of engineered features to the predicted weekly sales generated by the model. With Shapley values you can break down the components of a prediction and attribute precise values to specific features. Please note that in some cases the model has a "link function" that has yet to be applied to make the sum of the Shapley contributions equal to the prediction value.
[ ]: shapley = scorer.score_batch(train_and_test, pred_contribs=True, fast_approx=True)
shapley.columns = [x.replace('contrib_','',1) for x in shapley.columns]
This is a global vs local Shapley plot, with global being the average Shapley values for all of the predictions in the
selected group and local being the Shapley value for that specific prediction. Looking at this plot can give clues as to
which features contributed to the error in the prediction.
[ ]: shap_vals_group = shapley.loc[(train_and_test.Store==store_num) & (train_and_test.Dept==dept_num),:]
shap_vals_timestamp = shapley.loc[(train_and_test.Store==store_num)
& (train_and_test.Dept==dept_num)
& (train_and_test.Date==date_selection),:]
shap_vals = shap_vals_group.mean()
shap_vals = pd.concat([pd.DataFrame(shap_vals), shap_vals_timestamp.transpose()], axis=1, ignore_index=True)
shap_vals = shap_vals.sort_values(by=0)
bias = shap_vals.loc["bias",0]
shap_vals = shap_vals.drop("bias",axis=0)
shap_vals.columns = ["Global (Group)", "Local (Timestamp)"]
shap_vals_timestamp.transpose()
41.5.13 Summary
This notebook should get you started with all you need to diagnose and debug time series models from DAI. Try
different horizons during training and compare the model’s R2 over time to pick the best horizon for your use case.
Use the actual vs prediction plots to do detailed debugging. Find some interesting dates to examine and use the Shapley
plots to see how the features impacted the final prediction.
In this notebook, we will see how to use the Driverless AI Python client to build text classification models using the Airline Sentiment Twitter dataset.
Import the necessary Python modules to get started, including the Driverless AI client. If it is not already installed, please download and install the Python client from the Driverless AI GUI.
This notebook was tested in Driverless AI version 1.8.2.
[1]: import pandas as pd
from sklearn import model_selection
from h2oai_client import Client
The code below downloads the Twitter airline sentiment dataset and saves it in the current folder.
[2]: ! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
We can now split the data into training and testing datasets.
[3]: al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding='ISO-8859-1')
train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)
The first step is to establish a connection to Driverless AI using Client. Please key in your credentials and the url
address.
[4]: address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# # make sure to use the same user name and password when signing in through the GUI
Read the train and test files into Driverless AI using the upload_dataset_sync command.
[5]: train_path = './train_airline_sentiment.csv'
test_path = './test_airline_sentiment.csv'
train = h2oai.upload_dataset_sync(train_path)
test = h2oai.upload_dataset_sync(test_path)
[6]: ['_unit_id',
'_golden',
'_unit_state',
'_trusted_judgments',
'_last_judgment_at',
'airline_sentiment',
'airline_sentiment:confidence',
'negativereason',
'negativereason:confidence',
'airline',
'airline_sentiment_gold',
'name',
'negativereason_gold',
'retweet_count',
'text',
'tweet_coord',
'tweet_created',
'tweet_id',
'tweet_location',
'user_timezone']
We need just two columns for our experiment: text, which contains the text of the tweet, and airline_sentiment, which contains the sentiment of the tweet (the target column). We can drop the remaining columns for this experiment.
We will enable TensorFlow models and transformations to take advantage of CNN-based text features.
[7]: exp_preview = h2oai.get_experiment_preview_sync(
dataset_key=train.key
, validset_key=''
, target_col='airline_sentiment'
, classification=True
, dropped_cols=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
"airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
"airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
"tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"]
, accuracy=6
, time=4
, interpretability=5
, is_time_series=False
, time_col=''
, enable_gpus=True
, reproducible=False
, resumed_experiment_id=''
, config_overrides="""
enable_tensorflow='on'
enable_tensorflow_charcnn='on'
enable_tensorflow_textcnn='on'
enable_tensorflow_textbigru='on'
"""
)
exp_preview
Please note that the Text and TextCNN features are enabled for this experiment.
Now we can start the experiment.
[8]: model = h2oai.start_experiment_sync(
dataset_key=train.key,
testset_key=test.key,
target_col='airline_sentiment',
scorer='F1',
is_classification=True,
cols_to_drop=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
"airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
"airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
"tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
accuracy=6,
time=2,
interpretability=5,
enable_gpus=True,
config_overrides="""
enable_tensorflow='on'
enable_tensorflow_charcnn='on'
enable_tensorflow_textcnn='on'
enable_tensorflow_textbigru='on'
"""
)
This example shows how to use the Autoviz Python client in Driverless AI to retrieve graphs in Vega Lite format. (See https://vega.github.io/vega-lite/.)
When running the Autoviz Python client in a Jupyter Notebook, you can use https://github.com/vega/ipyvega (installed through pip) and render the graph directly in the notebook. You can also copy and paste the result into, for example, https://vega.github.io/vega-editor/?mode=vega-lite. The final graph can then be downloaded as png/svg/json.
The end of this document includes the available API methods.
41.7.1 Prerequisites
Using the Driverless AI Autoviz Python client doesn't require any additional packages other than the Driverless AI client. However, if you are using Jupyter notebooks or JupyterLab, installing the Vega package improves the user experience, as it allows you to render the produced graphs directly inside the Jupyter environment. In addition, it provides options to download the generated files in SVG, PNG, or JSON formats.
41.7.2 Initialization
To initialize the Autoviz Python client, follow the same steps as when initializing the client for new experiment. You
need to import the Client and initialize it, providing the Driverless AI host address and login credentials.
[1]: from h2oai_client import Client
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
cli = Client(address=address, username=username, password=password)
All of the methods provided in Client.autoviz return graphics in Vega Lite (v3) format. To visualize them, you can either paste the returned graph (e.g., hist) into the online Vega editor, or use the Python Vega package, which can render the charts directly in the Jupyter environment.
[32]: from vega import VegaLite
[33]: VegaLite(hist)
[34]: VegaLite(barchart)
---------------------------
Optional Keyword arguments:
---------------------------
number_of_bars -- int, number of bars
transformation -- str, default value is "none"
(otherwise, "log" or "square_root")
mark -- str, default value is "bar" (use "area" to get a density polygon)
"""
get_scatterplot(
dataset_key: str,
x_variable_name: str,
y_variable_name: str,
mark: str = "point",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)
y_variable_name -- str, name of y variable
---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square")
"""
get_bar_chart(
dataset_key: str,
x_variable_name: str,
y_variable_name: str = "",
transpose: bool = False,
mark: str = "bar",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)
---------------------------
Optional Keyword arguments:
---------------------------
y_variable_name -- str, name of y variable
transpose -- Boolean, default value is false
mark -- str, default value is "bar"
"""
get_parallel_coordinates_plot(
dataset_key: str,
variable_names: list = [],
permute: bool = False,
transpose: bool = False,
cluster: bool = False,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
---------------------------
Optional Keyword arguments:
---------------------------
variable_names -- str, name of variables
(if no variables specified, all in dataset will be used)
permute -- Boolean, default value is false
(if true, use SVD to permute variables)
transpose -- Boolean, default value is false
cluster -- Boolean, k-means cluster variables and color plot by cluster IDs,
default value is false
"""
get_heatmap(
dataset_key: str,
variable_names: list = [],
permute: bool = False,
transpose: bool = False,
matrix_type: str = "rectangular",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
---------------------------
Optional Keyword arguments:
---------------------------
variable_names -- str, name of variables
(if no variables specified, all in dataset will be used)
permute -- Boolean, default value is false
(if true, use SVD to permute rows and columns)
transpose -- Boolean, default value is false
matrix_type -- str, default value is "rectangular" (alternative is "symmetric")
"""
get_boxplot(
dataset_key: str,
variable_name: str,
group_variable_name: str = "",
transpose: bool = False,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
variable_name -- str, name of variable for box
---------------------------
Optional Keyword arguments:
---------------------------
group_variable_name -- str, name of grouping variable
transpose -- Boolean, default value is false
"""
get_linear_regression(
dataset_key: str,
x_variable_name: str,
y_variable_name: str,
mark: str = "point",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)
y_variable_name -- str, name of y variable
---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square")
"""
get_loess_regression(
dataset_key: str,
x_variable_name: str,
y_variable_name: str,
mark: str = "point",
bandwidth: float = 0.5,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
y_variable_name -- str, name of y variable
---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square")
bandwidth -- float, number in the (0,1)
interval denoting proportion of cases in smoothing window (default is 0.5)
"""
get_dotplot(
dataset_key: str,
variable_name: str,
mark: str = "point",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
variable_name -- str, name of variable on which dots are calculated
---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square" or "bar")
"""
get_distribution_plot(
dataset_key: str,
x_variable_name: str,
y_variable_name: str = "",
subtype: str = "probability_plot",
distribution: str = "normal",
mark: str = "point",
transpose: bool = False,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
---------------------------
Optional Keyword arguments:
---------------------------
y_variable_name -- str, name of y variable for quantile plot
subtype -- str "probability_plot" or "quantile_plot"
(default is "probability_plot" done on x variable)
distribution -- str, type of distribution, "normal" or "uniform"
("normal" is default)
mark -- str, default value is "point" (alternative is "square")
transpose -- Boolean, default value is false
"""
FORTYTWO
THE R CLIENT
This section describes how to install the Driverless AI R Client. It also provides an example tutorial showing how to
use the Driverless AI R client.
The R Client is available on the Driverless AI UI and from the command line. The installation process includes
downloading the R client and then installing the source package.
42.1.1 Prerequisites
The R client requires R version 3.3 or greater. In addition, the following R packages must be installed:
• RCurl
• jsonlite
• rlang
• methods
The R Client can be downloaded from within Driverless AI or from the command line.
Download from UI
On the Driverless AI top menu, select the RESOURCES > R CLIENT link. This downloads the
dai_<version>.tar.gz file.
The Driverless AI R client is exposed at the /clients/ HTTP endpoint. This can be accessed via the command
line:
wget --trust-server-names http://<Driverless AI address>/clients/r
After you have downloaded the R client, the next step is to install the source package in R. This can be done by running
the following command in R.
install.packages('~/Downloads/dai_VERSION.tar.gz', type = 'source', repos = NULL)
After the package is installed, you can run the available dai-tutorial vignette to see an example of how to use the client:
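For example, once the package is installed, the vignette can be opened directly from an R session (a minimal sketch; the vignette name follows the text above, and the package name dai is an assumption):
# Open the bundled tutorial vignette
vignette('dai-tutorial', package = 'dai')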
This tutorial describes how to use the Driverless AI R client package to use and control the Driverless AI platform. It
covers the main predictive data-science workflow, including:
1. Data load
2. Automated feature engineering and model tuning
3. Model inspection
4. Predicting on new data
5. Managing the datasets and models
Note: These steps assume that you have entered your license key in the Driverless AI UI.
Before we can start working with the Driverless AI platform (DAI), we have to import the package and initialize the
connection:
library(dai)
dai.connect(uri = 'http://localhost:12345', username = 'h2oai', password = 'h2oai')
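The credit-card dataset used throughout the rest of this tutorial can then be loaded from the server's file system; the call below is an illustrative sketch (the exact path is an assumption):
# Load the CSV that already resides on the Driverless AI server
creditcard <- dai.create_dataset('/data/creditcard_train_cat.csv')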
The function dai.create_dataset() loads data located on the machine that hosts DAI. The above command
assumes that creditcard_train_cat.csv is in the /data folder on the machine running Driverless AI. This file is
available at https://s3.amazonaws.com/h2o-public-test-data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv.
If you want to upload data located on your workstation, use dai.upload_dataset() instead.
If you already have the data loaded into an R data.frame, you can simply convert it into a DAIFrame. For example:
iris.dai <- as.DAIFrame(iris)
#>
|
| | 0%
|
|=================================================================| 100%
print(iris.dai)
#> DAI frame '7c38cb84-5baa-11e9-a50b-b938de969cdb': 150 obs. of 5 variables
#> File path: ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin
You can switch off the progress bar whenever it is displayed by setting progress = FALSE.
Upon creation of the dataset, you can display the basic information and summary statistics by calling generics print
and summary:
print(creditcard)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
summary(creditcard)
#> variable num_classes is_numeric count
#> 1 ID 0 TRUE 23999
#> 2 LIMIT_BAL 79 TRUE 23999
#> 3 SEX 2 FALSE 23999
#> 4 EDUCATION 4 FALSE 23999
#> 5 MARRIAGE 4 FALSE 23999
A couple of other generics work as usual on a DAIFrame: dim, head, and format.
dim(creditcard)
#> [1] 23999 25
head(creditcard, 10)
#> ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1 1 20000 female university married 24 2 2 -1 -1
#> 2 2 120000 female university single 26 -1 2 0 0
#> 3 3 90000 female university single 34 0 0 0 0
#> 4 4 50000 female university married 37 0 0 0 0
#> 5 5 50000 male university married 57 -1 0 -1 0
#> 6 6 50000 male graduate single 37 0 0 0 0
#> 7 7 500000 male graduate single 29 0 0 0 0
#> 8 8 100000 female university single 23 0 -1 -1 0
#> 9 9 140000 female highschool married 28 0 0 2 0
#> 10 10 20000 male highschool single 35 -2 -2 -2 -2
#> PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1 -2 -2 3913 3102 689 0 0 0
#> 2 0 2 2682 1725 2682 3272 3455 3261
#> 3 0 0 29239 14027 13559 14331 14948 15549
#> 4 0 0 46990 48233 49291 28314 28959 29547
#> 5 0 0 8617 5670 35835 20940 19146 19131
#> 6 0 0 64400 57069 57608 19394 19619 20024
#> 7 0 0 367965 412023 445007 542653 483003 473944
#> 8 0 -1 11876 380 601 221 -159 567
#> 9 0 0 11285 14096 12108 12211 11793 3719
#> 10 -1 -1 0 0 0 0 13007 13912
#> PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1 0 689 0 0 0 0
#> 2 0 1000 1000 1000 0 2000
#> 3 1518 1500 1000 1000 1000 5000
#> 4 2000 2019 1200 1100 1069 1000
#> 5 2000 36681 10000 9000 689 679
#> 6 2500 1815 657 1000 1000 800
#> 7 55000 40000 38000 20239 13750 13770
#> 8 380 601 0 581 1687 1542
#> 9 3329 0 432 1000 1000 1000
#> 10 0 0 0 13007 1122 0
#> DEFAULT_PAYMENT_NEXT_MONTH
#> 1 TRUE
#> 2 TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
You cannot, however, use a DAIFrame to access all of its data, nor can you use it to modify the data. It only represents
the dataset loaded into the DAI platform. The head function (and the example_data field) gives access only to example data:
creditcard$example_data[1:10, ]
#> ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1 1 20000 female university married 24 2 2 -1 -1
#> 2 2 120000 female university single 26 -1 2 0 0
#> 3 3 90000 female university single 34 0 0 0 0
#> 4 4 50000 female university married 37 0 0 0 0
#> 5 5 50000 male university married 57 -1 0 -1 0
#> 6 6 50000 male graduate single 37 0 0 0 0
#> 7 7 500000 male graduate single 29 0 0 0 0
#> 8 8 100000 female university single 23 0 -1 -1 0
#> 9 9 140000 female highschool married 28 0 0 2 0
#> 10 10 20000 male highschool single 35 -2 -2 -2 -2
#> PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1 -2 -2 3913 3102 689 0 0 0
#> 2 0 2 2682 1725 2682 3272 3455 3261
#> 3 0 0 29239 14027 13559 14331 14948 15549
#> 4 0 0 46990 48233 49291 28314 28959 29547
#> 5 0 0 8617 5670 35835 20940 19146 19131
#> 6 0 0 64400 57069 57608 19394 19619 20024
#> 7 0 0 367965 412023 445007 542653 483003 473944
#> 8 0 -1 11876 380 601 221 -159 567
#> 9 0 0 11285 14096 12108 12211 11793 3719
#> 10 -1 -1 0 0 0 0 13007 13912
#> PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1 0 689 0 0 0 0
#> 2 0 1000 1000 1000 0 2000
#> 3 1518 1500 1000 1000 1000 5000
#> 4 2000 2019 1200 1100 1069 1000
#> 5 2000 36681 10000 9000 689 679
#> 6 2500 1815 657 1000 1000 800
#> 7 55000 40000 38000 20239 13750 13770
#> 8 380 601 0 581 1687 1542
#> 9 3329 0 432 1000 1000 1000
#> 10 0 0 0 13007 1122 0
#> DEFAULT_PAYMENT_NEXT_MONTH
#> 1 TRUE
#> 2 TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
A dataset can be split, e.g. into training and test sets, directly from R:
creditcard.splits <- dai.split_dataset(creditcard,
output_name1 = 'train',
output_name2 = 'test',
ratio = .8,
seed = 25,
progress = FALSE)
In this case, creditcard.splits is a list with two elements named “train” and “test”, where 80% of the data went
into train and 20% into test.
creditcard.splits$train
#> DAI frame '7cf3024c-5baa-11e9-a50b-b938de969cdb': 19199 obs. of 25 variables
#> File path: ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin
creditcard.splits$test
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
By default it yields a simple random sample, but you can do stratified or time-based splits as well. See the function’s
documentation for more details.
One of the main strengths of Driverless AI is the fully automated feature engineering along with hyperparameter
tuning, model selection and ensembling. The function dai.train() executes the experiment that results in a
DAIModel instance that represents the model.
model <- dai.train(training_frame = creditcard.splits$train,
testing_frame = creditcard.splits$test,
target_col = 'DEFAULT_PAYMENT_NEXT_MONTH',
is_classification = T,
is_timeseries = F,
accuracy = 1, time = 1, interpretability = 10,
seed = 25)
#>
|
| | 0%
|
|========================== | 40%
|
If you do not specify the accuracy, time, or interpretability, they will be suggested by the DAI platform. (See
dai.suggest_model_params.)
As with DAIFrame, generic methods such as print, format, summary, and predict work with DAIModel:
print(model)
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#> Settings: 1/1/10, seed=25, GPUs enabled
#> Train data: train (19199, 25)
#> Validation data: N/A
#> Test data: test (4800, 24)
#> Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#> Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#> Validation scheme: stratified, 1 internal holdout
#> Feature engineering: 33 features scored (18 selected)
#> Timing:
#> Data preparation: 4.94 secs
#> Model and feature tuning: 10.13 secs (3 models trained)
#> Feature evolution: 5.54 secs (1 of 3 model trained)
#> Final pipeline training: 7.85 secs (1 model trained)
#> Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score: AUC = 0.7861 +/- 0.0064711 (final pipeline)
summary(model)$score
#> [1] 0.7780229
Predicting in R
The generic predict() either returns an R data.frame with the results directly (the default) or returns a URL pointing
to a CSV file with the results (return_df=FALSE). The latter option may be useful when you predict on a large dataset.
predictions <- predict(model, newdata = creditcard.splits$test)
#>
|
| | 0%
|
|=================================================================| 100%
#> Loading required package: bitops
head(predictions)
#> DEFAULT_PAYMENT_NEXT_MONTH.0 DEFAULT_PAYMENT_NEXT_MONTH.1
#> 1 0.8879988 0.11200116
#> 2 0.9289870 0.07101299
#> 3 0.9550328 0.04496716
#> 4 0.3513577 0.64864230
#> 5 0.9183724 0.08162758
#> 6 0.9154425 0.08455751
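For large prediction sets, the URL-returning variant described above can be used instead; a minimal sketch (only the return_df argument differs from the call shown earlier):
# Return a URL/path to a CSV of predictions instead of an in-memory data.frame
preds.csv.url <- predict(model, newdata = creditcard.splits$test, return_df = FALSE)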
For productizing your model in Python or Java, you can download the full Python or MOJO scoring pipelines, respectively.
For more information about how to use the pipelines, please see the documentation.
dai.download_mojo(model, path = tempdir(), force = TRUE)
#>
|
| | 0%
|
|=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/mojo-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"
After some time, you may have multiple datasets and models on your DAI server. The dai package offers a few utility
functions to find, reuse, and remove the existing datasets and models.
If you already have the dataset loaded into DAI, you can get the DAIFrame object by either dai.get_frame (if
you know the frame’s key) or dai.find_dataset (if you know the original path or at least a part of it):
dai.get_frame(creditcard$key)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
dai.find_dataset('creditcard')
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
The latter directly returns the frame if there is only one match. Otherwise, it lets you select which frame to return
from all the matching candidates.
Furthermore, you can get a list of datasets or models:
datasets <- dai.list_datasets()
head(datasets)
#> key name
#> 1 7cf613a6-5baa-11e9-a50b-b938de969cdb test
#> 2 7cf3024c-5baa-11e9-a50b-b938de969cdb train
#> 3 7c38cb84-5baa-11e9-a50b-b938de969cdb iris9e1f15d2df00.csv
#> 4 7abe28b2-5baa-11e9-a50b-b938de969cdb creditcard_train_cat.csv
#> file_path
#> 1 ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
#> 2 ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin
#> 3 ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin
#> 4 tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
#> file_size data_source row_count column_count import_status import_error
#> 1 567584 upload 4800 25 0
#> 2 2265952 upload 19199 25 0
#> 3 7064 upload 150 5 0
#> 4 2832040 file 23999 25 0
#> aggregation_status aggregation_error aggregated_frame mapping_frame
#> 1 -1
#> 2 -1
#> 3 -1
#> 4 -1
#> uploaded
#> 1 TRUE
#> 2 TRUE
#> 3 TRUE
#> 4 FALSE
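The models list used below can be retrieved analogously; the function name dai.list_models is assumed here by analogy with dai.list_datasets:
# List all models on the server (function name assumed by analogy with dai.list_datasets)
models <- dai.list_models()
head(models)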
If you know the key of the dataset or model, you can obtain the instance of DAIFrame or DAIModel by
dai.get_model and dai.get_frame:
dai.get_model(models$key[1])
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#> Settings: 1/1/10, seed=25, GPUs enabled
#> Train data: train (19199, 25)
#> Validation data: N/A
#> Test data: test (4800, 24)
#> Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#> Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#> Validation scheme: stratified, 1 internal holdout
#> Feature engineering: 33 features scored (18 selected)
#> Timing:
#> Data preparation: 4.94 secs
#> Model and feature tuning: 10.13 secs (3 models trained)
#> Feature evolution: 5.54 secs (1 of 3 model trained)
#> Final pipeline training: 7.85 secs (1 model trained)
#> Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score: AUC = 0.7861 +/- 0.0064711 (final pipeline)
dai.get_frame(datasets$key[1])
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
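For example, the finished experiment and the derived datasets can be removed with a call such as the following (an illustrative invocation; the exact argument list is an assumption):
# Delete the model and datasets from the server and from the R session
dai.rm(model, creditcard, creditcard.splits$train, creditcard.splits$test)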
The function dai.rm deletes objects, by default both from the server and from the R session. If you wish to remove
them only from the server, you can set from_session=FALSE. Please note that only objects can be removed from the
session; i.e., in the example above, the creditcard.splits$train and creditcard.splits$test objects
will not be removed from the R session because they are actually function calls (recall that $ is a function).
FORTYTHREE
DRIVERLESS AI LOGS
This section describes how to access Driverless AI logs and includes information on which logs to send in the event
of a failure.
Driverless AI provides a number of logs that can be retrieved while visualizing datasets, while an experiment is
running, and after an experiment is completed.
When running Autovisualization, you can access the Autoviz logs by clicking the Display Logs button on the Visualize
Datasets page.
This page presents logs created while the dataset visualization was being performed. You can download the
vis-data-server.log file by clicking the Download Logs button on this page. This file can be used to troubleshoot any issues
encountered during dataset visualization.
While the experiment is running, you can access the logs by clicking on the Log button on the experiment screen. The
Log button can be found in the CPU/Memory section.
Clicking on the Log button will present the experiment logs in real time. You can download these logs by clicking on
the Download Logs button in the upper right corner.
Only the h2oai_experiment.log can be downloaded while the experiment is running (for example:
h2oai_experiment_tobosoru.log). It will have the same information as the logs being presented in real time on
the screen.
For troubleshooting purposes, it is best to view the complete h2oai_experiment.log (or
h2oai_experiment_anonymized.log). This will be available after the experiment finishes, as described in the
next section.
If the experiment has finished, you can download the logs by clicking on the Download Experiment & Logs button
on the completed experiment screen.
This will download a zip file that includes the following logs along with a summary of the experiment:
• h2oai_experiment.log: This is the log corresponding to the experiment.
• h2oai_experiment_anonymized.log: This is the log corresponding to the experiment where all data in the log
is anonymized.
Driverless AI allows you to view and download Python and/or Java logs while MLI is running. Note that these logs
are not available for time-series experiments.
• The Display MLI Python Logs button allows you to view or download the Python log for the model
interpretation. The downloaded file is named h2oai_experiment_{mli_key}.log.
• The Display MLI Java Logs button allows you to view or download the Java log for the model interpretation.
The downloaded file is named mli_experiment_{mli_key}.log.
You can view an MLI log for completed model interpretations by selecting the Download MLI Logs link on the MLI
page.
This will download a zip file which includes the following logs:
• h2oai_experiment_{mli_key}.log: This is the log corresponding to the model interpretation.
• h2oai_experiment_{mli_key}_anonymized.log: This is the log corresponding to the model interpretation
where all data in the log is anonymized.
• mli_experiment_{mli_key}.log: This is the Java log corresponding to the model interpretation.
This file can be used to view logging information for successful interpretations. If MLI fails, then those
logs are in ./tmp/h2oai_experiment_{mli_key}.log, ./tmp/h2oai_experiment_{mli_key}_anonymized.log, and
./tmp/mli_experiment_{mli_key}.log.
This section describes the logs to send in the event of failures when running Driverless AI.
• Adding Datasets: If a dataset fails to import, a message on the screen should provide the reason for the failure.
The logs to send are available in the Driverless AI ./tmp folder.
• Dataset Details: If a failure occurs when attempting to view Dataset Details, the logs to send are available in
the Driverless AI ./tmp folder.
• Autovisualization: If a failure occurs when attempting to Visualize Datasets, a message on the screen should
provide a reason for the failure. The logs to send are available in the Driverless AI ./tmp folder.
43.2.2 Experiments
• While Running an Experiment: As indicated previously, a Log button is available on the Experiment page.
Clicking on the Log button will present the experiment logs in real time. You can download these logs by
clicking on the Download Logs button in the upper right corner. You can also retrieve the h2oai_experiment.log
for the corresponding experiment in the Driverless AI ./tmp folder.
43.2.3 MLI
• During Model Interpretation: If a failure occurs during model interpretation, then the logs to send are
./tmp/h2oai_experiment_{mli_key}.log and ./tmp/h2oai_experiment_{mli_key}_anonymized.log.
• After Running an Experiment: If a Custom Recipe is producing errors, the entire zip file obtained by clicking
on the Download Experiments & Logs button can be sent for troubleshooting. Note that these files may contain
information that is not anonymized.
FORTYFOUR
DRIVERLESS AI SECURITY
44.1 Objective
The goal of this document is to describe different aspects of Driverless AI security and to provide guidelines for securing
the system by reducing its attack surface.
This section covers the following areas of the product:
• User access
• Authentication
• Authorization
• Data security
• Data import
• Data export
• Logs
• User data isolation
• Transfer security
• Custom recipes security
• Web UI security
44.4.3 Logs
The response headers that are passed between the Driverless AI server and clients (browser, Python/R clients) are
controlled via the following options:
• Public-Key-Pins: https://developer.mozilla.org/en-US/docs/Web/HTTP/Public_Key_Pinning
• CORS-related headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS
Note: The Driverless AI UI is designed to be user-friendly, and by default all features such as auto-complete are enabled.
Disabling these user-friendly features increases the security of the application but impacts its usability.
Note: By default, Driverless AI enables custom recipes as the main route for data-science teams to extend the
application's capabilities. In enterprise environments, it is recommended to follow software engineering best practices
for the development of custom recipes (i.e., code reviews, testing, staged releases, etc.) and to bundle only a pre-defined
and approved set of custom Driverless AI extensions.
The following Driverless AI configuration is an example of a secure configuration. Please make sure that you fill in all
necessary config options.
#
# Auth
#
#
# Data
#
# Restrict Downloads
enable_dataset_downloading=false
#
# Logs
#
audit_log_retention_period=0
collect_server_logs_in_experiment_logs=false
#
# User data isolation
#
file_hide_data_directory=true
#file_path_filter_include=true
#file_path_filter_include=[]
#
# Client-Server Communication
enable_https=true
ssl_key_file="<<FILL ME>>"
ssl_crt_file="<<FILL ME>>"
# Disable support of TLSv1.2 on server side only if your environment supports TLSv1.3
#ssl_no_tlsv1_2=true
#
# Web UI security
#
allow_form_autocomplete=false
allow_localstorage=false
extra_http_headers={ "Strict-Transport-Security" = "max-age=63072000", "Content-Security-Policy" = "default-src https: ; font-src 'self'; script-src 'self' 'unsafe-eval' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; object-src 'none'", "X-Frame-Options" = "deny", "X-Content-Type-Options" = "nosniff", "X-XSS-Protection" = "1; mode=block" }
#
# Custom Recipes
#
enable_custom_recipes=false
enable_custom_recipes_upload=false
include_custom_recipes_by_default=false
FORTYFIVE
FAQ
H2O Driverless AI is an artificial intelligence (AI) platform for automatic machine learning. Driverless AI automates
some of the most difficult data science and machine learning workflows such as feature engineering, model validation,
model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy, comparable to
expert data scientists, but in a much shorter time thanks to end-to-end automation. Driverless AI also offers automatic
visualizations and machine learning interpretability (MLI). Especially in regulated industries, model transparency and
explanation are just as important as predictive performance. Modeling pipelines (feature engineering and models) are
exported (in full fidelity, without approximations) both as Python modules and as Java standalone scoring artifacts.
This section provides answers to frequently asked questions. If you have additional questions about using Driverless
AI, post them on Stack Overflow using the driverless-ai tag at http://stackoverflow.com/questions/tagged/driverless-ai.
General
• How is Driverless AI different than any other black box ML algorithm?
• How often do new versions come out?
Installation/Upgrade/Authentication
• How can I change my username and password?
• Can Driverless AI run on CPU-only machines?
• How can I upgrade to a newer version of Driverless AI?
• What kind of authentication is supported in Driverless AI?
• How can I automatically turn on persistence each time the GPU system reboots?
• How can I start Driverless AI on a different port than 12345?
• Can I set up TLS/SSL on Driverless AI?
• Can I set up TLS/SSL on Driverless AI in AWS?
• Why do I receive a “package dai-<version>.x86_64 does not verify: no digest” error during the installation?
• I received a “Must have exactly one OpenCL platform ‘NVIDIA CUDA’” error. How can I fix that?
• Is it possible for multiple users to share a single Driverless AI instance?
• Can multiple Driverless AI users share a GPU server?
• How can I retrieve a list of Driverless AI users?
• Start of Driverless AI fails on the message “Segmentation fault (core dumped)” on Ubuntu 18/RHEL 7.6. How
can I fix this?
• Why do I receive a “shared object file” error when running on RHEL 8?
Data
• How does the time series recipe deal with missing values?
• Can the time information be distributed across multiple columns in the input data (such as [year, day, month])?
• What type of modeling approach does Driverless AI use for time series?
• What’s the idea behind exponential weighting of moving averages?
Logging
• How can I reduce the size of the Audit Logger?
45.1 General
45.2 Installation/Upgrade/Authentication
docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 443:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
Native Installs: To run on a port other than 12345, update the port value in the config.toml file (for
example, set it to 443 to run Driverless AI on port 443), and then point Driverless AI at your edited file:
# Export the Driverless AI config.toml file (or add it to ~/.bashrc)
export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"
You can make a self-signed certificate for testing with the following commands:
umask 077
openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 20 -nodes -subj '/O=Driverless AI'
sudo chown dai:dai cert.pem private_key.pem
sudo mv cert.pem private_key.pem /etc/dai
To configure specific versions of TLS/SSL, enable or disable the following settings in the config.toml file:
ssl_no_sslv2 = true
ssl_no_sslv3 = true
ssl_no_tlsv1 = true
ssl_no_tlsv1_1 = true
ssl_no_tlsv1_2 = false
ssl_no_tlsv1_3 = false
server {
listen 443;
ssl_certificate /etc/nginx/cert.crt;
ssl_certificate_key /etc/nginx/cert.key;
ssl on;
ssl_session_cache builtin:1000 shared:SSL:10m;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
ssl_prefer_server_ciphers on;
location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Fix the "It appears that your reverse proxy set up is broken" error.
proxy_pass http://localhost:12345;
proxy_read_timeout 90;
More information about SSL for Nginx in Ubuntu 16.04 can be found here:
https://www.digitalocean.com/community/tutorials/how-to-create-a-self-signed-ssl-certificate-for-nginx-in-ubuntu-16-04.
I received a “package dai-<version>.x86_64 does not verify: no digest” error during the installation. How can
I fix this?
You will receive a “package dai-<version>.x86_64 does not verify: no digest” error when installing the
rpm using an RPM version newer than 4.11.3. You can run the following as a workaround, replacing
<version> with your DAI version:
rpm --nodigest -i dai-<version>.x86_64.rpm
I received a “Must have exactly one OpenCL platform ‘NVIDIA CUDA’” error. How can I fix that?
If you encounter OpenCL problems when the server starts, you may see the following message:
2018-11-08 14:26:15,341 C: D:452.2GB M:246.0GB 21603 ERROR : Must have exactly one OpenCL platform 'NVIDIA CUDA', but got:
Platform #0: Clover
Platform #1: NVIDIA CUDA
+-- Device #0: GeForce GTX 1080 Ti
+-- Device #1: GeForce GTX 1080 Ti
+-- Device #2: GeForce GTX 1080 Ti
#Team 1
NV_GPU='0,1' nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p port-to-team:12345 \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v /data:/data \
-v /log:/log \
-v /license:/license \
-v /tmp:/tmp \
-v /config:/config \
h2oai/dai-centos7-x86_64:TAG
#Team 2
NV_GPU='0,1' nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p port-to-team:12345 \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v /data:/data \
-v /log:/log \
-v /license:/license \
-v /tmp:/tmp \
-v /config:/config \
h2oai/dai-centos7-x86_64:TAG
Note, however, that a Driverless AI instance expects to fully utilize and not share the GPUs that are
assigned to it. Sharing a GPU with other Driverless AI instances or other running programs can result in
out-of-memory issues.
How can I retrieve a list of Driverless AI users?
A list of users can be retrieved using the Python client.
# Connect with the Driverless AI Python client and list users
from h2oai_client import Client

h2o = Client(address='http://<client_url>:12345', username='<username>', password='<password>')
h2o.get_users()
Start of Driverless AI fails on the message “Segmentation fault (core dumped)” on Ubuntu 18/RHEL 7.6. How
can I fix this?
This problem is caused by the font NotoColorEmoji.ttf, which cannot be processed by the Python
matplotlib library. A workaround is to disable the font by renaming it. (Do not use fontconfig because it
is ignored by matplotlib.) The following will print out the command that should be executed.
sudo find / -name "NotoColorEmoji.ttf" 2>/dev/null | xargs -I{} echo sudo mv {} {}.backup
45.3 Data
45.4 Connectors
Why can’t I import a folder as a file when using a data connector on Windows?
If you try to use the Import Folder as File option via a data connector on Windows, the import will fail if
the folder contains files that do not have file extensions. For example, if a folder contains the files file1.csv,
file2.csv, file3.csv, and _SUCCESS, the function will fail due to the presence of the _SUCCESS file.
Note that this only occurs if the data is sourced from a volume that is mounted from the Windows filesystem
onto the Docker container via -v /path/to/windows/filesystem:/path/in/docker/container
flags. This error occurs because of the difference in how files without file extensions are
treated in Windows and in the Docker container (CentOS Linux).
I get a ClassNotFoundException error when I try to select a JDBC connection. How can I fix that?
The folder storing the JDBC jar file must be visible/readable by the dai process user.
If you downloaded the JDBC jar file from Oracle, they may provide you with a tar.gz file that you can
unpackage with the following command:
tar --no-same-permissions --no-same-owner -xzvf <my-jdbc-driver>.tar.gz
Alternatively, you can ensure that the permissions on the file are correct in general by running the following:
chmod -R o+rx /path/to/folder_containing_jar_file
Finally, if you just want to check the permissions, use the command ls -altr and check the final 3
values in the permissions output.
45.5 Recipes
45.6 Experiments
features until it can fit in memory. This may lead to a worse model, but Driverless AI shouldn’t crash
because the data is wide.
How should I use Driverless AI if I have large data?
Driverless AI can handle large datasets out of the box. For very large datasets (more than 10 billion rows
x columns), we recommend sampling your data for Driverless AI. Keep in mind that the goal of Driverless
AI is to go through many features and models to find the best modeling pipeline, and not to just train a
few models on the raw data (H2O-3 is ideally suited for that case).
For large datasets, the recommended steps are:
1. Run with the recommended accuracy/time/interpretability settings first, especially accuracy <= 7
2. Gradually increase accuracy settings to 7 and choose accuracy 9 or 10 only after observing runs with
<= 7.
How does Driverless AI detect the ID column?
The ID column logic is one of the following:
• The column is named ‘id’, ‘Id’, ‘ID’ or ‘iD’ exactly
• The column contains a significant number of unique values (above
max_relative_cardinality in the config.toml file or Max. allowed fraction of uniques
for integer and categorical cols in Expert settings)
Can Driverless AI handle data with missing values/nulls?
Yes, data that is imported into Driverless AI can include missing values. Feature engineering is fully
aware of missing values, and missing values are treated as information - either as a special categorical
level or as a special number. So for target encoding, for example, rows with a certain missing feature
will belong to the same group. For Categorical Encoding where aggregations of a numeric columns are
calculated for a grouped categorical column, missing values are kept. The formula for calculating the
mean is the sum of non-missing values divided by the count of all non-missing values. For clustering,
we impute missing values. And for frequency encoding, we count the number of rows that have a certain
missing feature.
The imputation strategy is as follows:
• XGBoost/LightGBM do not need missing value imputation and may, in fact, perform worse with
any specific other strategy unless the user has a strong understanding of the data.
• Driverless AI automatically imputes missing values using the mean for GLM.
• Driverless AI provides an imputation setting for TensorFlow in the config.toml file:
tf_nan_impute_value post-normalization. If you set this option to 0, then missing
values will be imputed. Setting it to (for example) +5 will specify 5 standard deviations outside the
distribution. The default for TensorFlow is -5, which specifies that TensorFlow will treat NAs like a
missing value. We recommend that you specify 0 if the mean is better.
More information is available in the Missing Values Handling section.
How does Driverless AI deal with categorical variables? What if an integer column should really be treated as
categorical?
If a column has string values, then Driverless AI will treat it as a categorical feature. There are multiple
methods for how Driverless AI converts the categorical variables to numeric. These include:
• One Hot Encoding: creating dummy variables for each value
• Frequency Encoding: replace category with how frequently it is seen in the data
• Target Encoding: replace category with the average target value (additional steps included to prevent
overfitting)
• Weight of Evidence: calculate weight of evidence for each category
(http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/)
Driverless AI will try multiple methods for representing the column and determine which representation(s)
are best.
If the column has integers, Driverless AI will try treating the column as a categorical column and numeric
column. It will treat any integer column as both categorical and numeric if the number of unique values
is less than 50.
This is configurable in the config.toml file:
# Whether to treat some numerical features as categorical
# For instance, sometimes an integer column may not represent a numerical feature but
# represents different numerical codes instead.
num_as_cat = true
# Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques = 50
(Note: Driverless AI will also check if the distribution of any numeric column differs significantly from
the distribution of typical numerical data using Benford’s Law. If the column distribution does not obey
Benford’s Law, we will also try to treat it as categorical even if there are more than 50 unique values.)
How are outliers handled?
Outliers are not removed from the data. Instead Driverless AI finds the best way to represent data with
outliers. For example, Driverless AI may find that binning a variable with outliers improves performance.
For target columns, Driverless AI first determines the best representation of the column. It may find that
for a target column with outliers, it is best to predict the log of the column.
If I drop several columns from the Train dataset, will Driverless AI understand that it needs to drop the same
columns from the Test dataset?
If you drop columns from the training dataset, Driverless AI will do the same for the validation and test
datasets (if the columns are present). There is no need for these columns because no features will be
created from them.
Does Driverless AI treat numeric variables as categorical variables?
In certain cases, yes. You can prevent this behavior by setting the num_as_cat variable in your
installation's config.toml file to false. You can have finer-grained control over this behavior by excluding
the Numeric to Categorical Target Encoding Transformer and the Numeric To
Categorical Weight of Evidence Transformer and their corresponding genes in your
installation's config.toml file. To learn more about the config.toml file, see the Using the config.toml File
section.
Which algorithms are used in Driverless AI?
Features are engineered with a proprietary stack of Kaggle-winning statistical approaches including some
of the most sophisticated target encoding and likelihood estimates based on groupings, aggregations and
joins, but we also employ linear models, neural nets, clustering and dimensionality reduction models and
many traditional approaches such as one-hot encoding etc.
On top of the engineered features, sophisticated models are fitted, including, but not limited to: XGBoost
(both original XGBoost and ‘lossguide’ (LightGBM) mode), Decision Trees, GLM, TensorFlow
(including a TensorFlow NLP recipe based on CNN deep learning models), RuleFit, FTRL (Follow the
Regularized Leader), Isolation Forest, and Constant Models. (Refer to Supported Algorithms for more
information.) Additional algorithms can be added via recipes.
In general, GBMs are the best single-shot algorithms. Since 2006, boosting methods have proven to
be the most accurate for noisy predictive modeling tasks outside of pattern recognition in images and
sound (https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf). The advent of XGBoost
and Kaggle only cemented this position.
Why do my selected algorithms not show up in the Experiment Preview?
When changing the algorithms used via Expert Settings > Model and Expert Settings > Recipes, you may notice
in the Experiment Preview that those changes are not applied. Driverless AI determines whether to include models
and/or recipes based on a hierarchy of those expert settings.
• Setting an Algorithm to “OFF” in Expert Settings: If an algorithm is turned OFF in Expert Settings (for example,
GLM Models) when running, then that algorithm will not be included in the experiment.
• Algorithms Not Included from Recipes (BYOR): If an algorithm from a custom recipe is not selected for the
experiment in the Include specific models option, then that algorithm will not be included in the experiment,
regardless of whether that same algorithm is set to AUTO or ON on the Expert Settings > Model page.
• Algorithms Not Specified as “OFF” and Included from Recipes: If a Driverless AI algorithm is specified as
either “AUTO” or “ON” and additional models are selected for the experiment in the Include specific models
option, then those algorithms may or may not be included in the experiment. Driverless AI will determine the
algorithms to use based on the data and experiment type.
How can we turn on TensorFlow Neural Networks so they are evaluated?
Neural networks are considered by Driverless AI, although they may not be evaluated by default. To
ensure that neural networks are tried, you can turn on TensorFlow in the Expert Settings:
Once you have set TensorFlow to ON, you should see the Experiment Preview on the left-hand side
change and mention that it will evaluate TensorFlow models:
We recommend using TensorFlow neural networks if you have a multinomial use case with more than 5
unique values.
Does Driverless AI standardize the data?
Driverless AI will automatically do variable standardization for certain algorithms. For example, with
Linear Models and Neural Networks, the data is automatically standardized. For decision tree algorithms,
however, we do not perform standardization since these algorithms do not benefit from standardization.
What objective function is used in XGBoost?
The objective function used in XGBoost is:
• reg:linear and custom absolute error objective function for regression
• binary:logistic or multi:softprob for classification
The objective function does not change depending on the scorer chosen. The scorer influences parameter
tuning only.
For regression, Tweedie/Gamma/Poisson/etc. regression is not yet supported, but Driverless AI handles
various target transforms so many target distributions can be handled very effectively already. Driverless
AI handles quantile regression for alpha=0.5 (median), and general quantiles are on the roadmap.
Further details for the XGBoost instantiations can be found in the logs and in the model summary, both
of which can be downloaded from the GUI or are found in the /tmp/h2oai_experiment_<name>/ folder
on the server.
Does Driverless AI perform internal or external validation?
Driverless AI does internal validation when only training data is provided. It does external validation
when training and validation data are provided. In either scenario, the validation data is used for all
parameter tuning (models and features), not just for feature selection. Parameter tuning includes target
transformation, model selection, feature engineering, feature selection, stacking, etc.
Specifically:
• Internal validation (only training data given):
– Ideal when data is either close to i.i.d., or for time-series problems
– Internal holdouts are used for parameter tuning, with temporal causality for time-series problems
– Will do the full spectrum from single holdout split to 5-fold CV, depending on accuracy settings
– No need to split training data manually
– Final models are trained using CV on the training data
• External validation (training + validation data given):
– Ideal when there’s some amount of drift in the data, and the validation set mimics the test set
data better than the training data
– No training data wasted during training because training data not used for parameter tuning
– Validation data is used only for parameter tuning, and is not part of training data
– No CV possible because we explicitly do not want to overfit on the training data
– Not allowed for time-series problems (see Time Series FAQ section that follows)
Tip: If you want both training and validation data to be used for parameter tuning (the training process),
just concatenate the datasets together and turn them both into training data for the “internal validation”
method.
How does Driverless AI prevent overfitting?
Driverless AI performs a number of checks to prevent overfitting. For example, during certain
transformations, Driverless AI calculates the average on out-of-fold data using cross validation. Driverless AI
also performs early stopping for every model built, ensuring that the model build will stop when it ceases
to improve on holdout data. And additional steps to prevent overfitting include checking for i.i.d. and
avoiding leakage during feature engineering.
A blog post describing Driverless AI overfitting protection in greater detail is available here:
https://www.h2o.ai/blog/driverless-ai-prevents-overfitting-leakage/.
How does Driverless AI avoid the multiple hypothesis (MH) problem?
Or more specifically: like many brute force methods for tuning hyperparameters and selecting models,
Driverless AI runs up against the multiple hypothesis (MH) problem. For example, if I randomly generated a
gazillion models, the odds that a few will do awesome on the test data that they are all measured against is
pretty large, simply by sheer luck. How does Driverless AI address this?
Driverless AI uses a variant of the reusable holdout technique to address the multiple hypothesis problem.
Refer to https://pdfs.semanticscholar.org/25fe/96591144f4af3d8f8f79c95b37f415e5bb75.pdf for more
information.
How does Driverless AI suggest the experiment settings?
When you run an experiment on a dataset, the experiment settings (Accuracy, Time, and Interpretability)
are automatically suggested by Driverless AI. For example, Driverless AI may suggest the parameters
Accuracy = 7, Time = 3, Interpretability = 6, based on your data.
Driverless AI will automatically suggest experiment settings based on the number of columns and number
of rows in your dataset. The settings are suggested to ensure best handling when the data is small. If the
data is small, Driverless AI will suggest the settings that prevent overfitting and ensure the full dataset is
utilized.
If the number of rows and number of columns are each below a certain threshold, then:
• Accuracy will be increased up to 8.
– The accuracy is increased so that cross validation is done. (We don’t want to “throw away” any
data for internal validation purposes.)
• Interpretability will be increased up to 8.
– The higher the interpretability setting, the smaller the number of features in the final model.
– More complex features are not allowed.
– This prevents overfitting.
• Time will be decreased down to 2.
– There will be fewer feature engineering iterations to prevent overfitting.
What happens when I set Interpretability and Accuracy to the same number?
The answer is currently that interpretability controls which features are created and what features are
kept. (Also above interpretability = 6, monotonicity constraints are used in XGBoost GBM, XGBoost
Dart, LightGBM, and LightGBM Random Forest models.) The accuracy refers to how hard Driverless AI
then tries to make those features into the most accurate model.
Can I specify the number of GPUs to use when running Driverless AI?
When running an experiment, the Expert Settings allow you to specify the starting GPU ID for Driverless
AI to use. You can also specify the maximum number of GPUs to use per model and per experiment.
Refer to the Expert Settings section for more information.
How can I create the simplest model in Driverless AI?
To create the simplest model in Driverless AI, set the following Experiment Settings:
• Set Accuracy to 1. Note that this can hurt performance as a sample will be used. If necessary, adjust
the knob until the preview shows no sampling.
• Set Time to 1.
• Set Interpretability to 10.
Next, configure the following Expert Settings:
• Turn OFF all algorithms except GLM.
• Set GLM models to ON.
• Set Ensemble level to 0.
• Set Select target transformation of the target for regression problems to Identity.
• Disable Data distribution shift detection.
• Disable Target Encoding.
Alternatively, you can set Pipeline Building Recipe to Compliant. Compliant automatically configures
the following experiment and expert settings:
• interpretability=10 (To avoid complexity. This overrides GUI or Python client settings for Interpretability.)
• enable_glm=’on’ (Remaining algos are ‘off’, to avoid complexity and to be compatible with algorithms
supported by MLI.)
• num_as_cat=true: Treat some numerical features as categorical. For instance, sometimes an integer
column may not represent a numerical feature but represent different numerical codes instead.
• fixed_ensemble_level=0: Don’t use any ensemble (to avoid complexity).
• feature_brain_level=0: No feature brain used (to ensure every restart is identical).
• max_feature_interaction_depth=1: Interaction depth is set to 1 (no multi-feature interactions to
avoid complexity).
• target_transformer=”identity”: For regression (to avoid complexity).
• check_distribution_shift=”off”: Don’t use distribution shift between train, valid, and test to drop
features (bit risky without fine-tuning).
Why is my experiment suddenly slow?
It is possible that your experiment has gone from using GPUs to using CPUs due to a change of the host
system outside of Driverless AI’s control. You can verify this using any of the following methods:
• Check GPU usage by going to your Driverless AI experiment page and clicking on the GPU USAGE
tab in the lower-right quadrant of the experiment.
• Run nvidia-smi in a terminal to see if any processes are using GPU resources in an unexpected
way (such as those using a large amount of memory).
• Check if System/GPU memory is being consumed by prior jobs or other tasks or if older jobs are
still running some tasks.
• Check and disable automatic NVIDIA driver updates on your system (as they can interfere with
running experiments).
The general solution to these kinds of sudden slowdown problems is to restart:
• Restart Docker if using Docker
• pkill --signal 9 h2oai if using the native installation method
• Restart the system if nvidia-smi does not work as expected (e.g., after a driver update)
More ML-related issues that can lead to a slow experiment are:
• Choosing high accuracy settings on a system with insufficient memory
• Choosing low interpretability settings (can lead to more feature engineering which can increase
memory usage)
• Using a dataset with a lot of columns (> 500)
• Doing multi-class classification with a GBM model when there are many target classes (> 5)
When I run multiple experiments with different seeds, why do I see different scores, runtimes, and sizes on disk
in the Experiments listing page?
When running multiple experiments with all of the same settings except the seed, understand that a feature
brain level > 0 can lead to variations in models, features, timing, and sizes on disk. (The default value is
2.) These variations can be disabled by setting the Feature Brain Level to 0 in the Expert Settings or in
the config.toml file.
In addition, if you use a different seed for each experiment, then each experiment can be different due to
the randomness in the genetic algorithm that searches for the best features and model parameters. Only
if Reproducible is set with the same seed and with a feature brain level of 0 should users expect the
same outcome. Once a different seed is set, the models, features, timing, and sizes on disk can all vary
within the constraints set by the choices made for the experiment. (I.e., accuracy, time, interpretability,
expert settings, etc., all constrain the outcome, and then a different seed can change things within those
constraints.)
Why does the final model performance appear to be worse than previous iterations?
There are a few things to remember:
• Driverless AI creates a best effort estimate of the generalization performance of the best modeling
pipeline found so far
• The performance estimation is always based on holdout data (data unseen by the model).
• If no validation dataset is provided, the training data is split internally to create internal validation
holdout data (once or multiple times or cross-validation, depending on the accuracy settings).
• If no validation dataset is provided, for accuracy <= 7, a single holdout split is used, and a “lucky”
or “unlucky” split can bias estimates for small datasets or datasets with high variance.
• If a validation dataset is provided, then all performance estimates are solely based on the entire
validation dataset (independent of accuracy settings).
• All scores reported are based on bootstrapped-based statistical methods and come with error bars
that represent a range of estimate uncertainty.
After the final iteration, a best final model is trained on a final set of engineered features. Depending on
accuracy settings, a more accurate estimation of generalization performance may be done using
cross-validation. Also, the final model may be a stacked ensemble consisting of multiple base models, which
generally leads to better performance. Consequently, in rare cases, the difference in performance
estimation method can lead to the final model's estimated performance seeming poorer than those from previous
iterations. (i.e., The final model’s estimated score is significantly worse than the last iteration score and
error bars don’t overlap.) In that case, it is very likely that the final model performance estimation is more
accurate, and the prior estimates were biased due to a “lucky” split. To confirm this, you can re-run the
experiment multiple times (without setting the reproducible flag).
If you would like to minimize the likelihood of the final model performance appearing worse than previous
iterations, here are some recommendations:
• Increase accuracy settings
• Provide a validation dataset
• Provide more data
How can I find features that may be causing data leakages in my Driverless AI model?
To find original features that are causing leakage, have a look at features_orig.txt in the experiment
summary download. Features causing leakage will have high importance there. To get a hint at derived
features that might be causing leakage, create a new experiment with dials set to 2/2/8, and run the new
experiment on your data with all your features and response. Then analyze the top 1-2 features in the
model variable importance. They are likely the main contributors to data leakage if it is occurring.
How can I see the performance metrics on the test data?
As long as you provide a target column in the test set, Driverless AI will show the best estimate of the
final model’s performance on the test set at the end of the experiment. The test set is never used to tune
parameters (unlike what Kagglers often do), so this is purely a convenience. Of course, you can still
make test set predictions and compute your own metrics using a method of your choice.
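As a minimal sketch using the R client from the earlier chapter (assuming the true test labels are available locally in a data.frame named test_df, in the same row order as the scored test split; the 0.5 cutoff is illustrative), a custom accuracy metric could be computed as follows:
# Score the test split with the R client, then compute accuracy at a 0.5 cutoff
preds <- predict(model, newdata = creditcard.splits$test)
acc <- mean((preds$DEFAULT_PAYMENT_NEXT_MONTH.1 > 0.5) == test_df$DEFAULT_PAYMENT_NEXT_MONTH)
print(acc)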
How can I see all the performance metrics possible for my experiment?
At the end of the experiment, the model’s estimated performance on all provided datasets with a target
column is printed in the experiment logs. For example, for the test set:
Final scores on test (external holdout) +/- stddev:
GINI = 0.87794 +/- 0.035305 (more is better)
MCC = 0.71124 +/- 0.043232 (more is better)
F05 = 0.79175 +/- 0.04209 (more is better)
F1 = 0.75823 +/- 0.038675 (more is better)
F2 = 0.82752 +/- 0.03604 (more is better)
ACCURACY = 0.91513 +/- 0.011975 (more is better)
LOGLOSS = 0.28429 +/- 0.016682 (less is better)
AUCPR = 0.79074 +/- 0.046223 (more is better)
optimized: AUC = 0.93386 +/- 0.018856 (more is better)
What if my training/validation and testing data sets come from different distributions?
In general, Driverless AI uses training data to engineer features and train models and validation data to
tune all parameters. If no external validation data is given, the training data is used to create internal
holdouts. The way holdouts are created internally depends on whether there is a strong time dependence;
see the point below. If the data has no obvious time dependency (e.g., if there is no time column, either
implicit or explicit), or if the data can be sorted arbitrarily and it won't affect the outcome (e.g., Iris data,
predicting flower species from measurements), and if the test dataset is different (e.g., new flowers or only
large flowers), then the model performance on validation (either internal or external) as measured during
training won’t be achieved during final testing due to the obvious inability of the model to generalize.
Does Driverless AI handle weighted data?
Yes. You can optionally provide an extra weight column in your training (and validation) data with non-
negative observation weights. This can be useful to implement domain-specific effects such as exponential
weighting in time or class weights. All of our algorithms and metrics in Driverless AI support observation
weights, but note that estimated likelihoods can be skewed as a consequence.
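As an illustration only (ordinary pandas code, not a Driverless AI API), an observation-weight column implementing exponential decay in time could be added before uploading the data; the date column name and the half-life are placeholders:
# Hedged sketch: add a non-negative weight column with exponential decay in time.
# "date" is a placeholder column name; the half-life is an arbitrary choice.
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["date"])
half_life_days = 90.0
age_days = (df["date"].max() - df["date"]).dt.days
df["weight"] = np.exp(-np.log(2.0) * age_days / half_life_days)
df.to_csv("train_weighted.csv", index=False)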
How does Driverless AI handle fold assignments for weighted data?
Currently, Driverless AI does not take the weights into account during fold creation, but you can provide
a fold column to enforce your own grouping, i.e., to keep rows that belong to the same group together
(either in train or valid). The fold column has to be a categorical column (integers ok) that assigns a group
ID to each row. (It needs to have at least 5 groups because we do up to 5-fold CV.)
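For example, a fold column can simply reuse an existing group identifier; a minimal sketch, where customer_id is a placeholder column name:
# Hedged sketch: derive a fold column from a group identifier so that all rows
# of the same group receive the same fold ID (here 5 groups, i.e. 0..4).
import pandas as pd

df = pd.read_csv("train.csv")
df["fold_column"] = df["customer_id"].astype("category").cat.codes % 5
df.to_csv("train_with_folds.csv", index=False)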
Why do I see that adding new features to a dataset deteriorates the performance of the model?
You may notice that adding one or more new features to a dataset deteriorates the performance of the Driverless AI model. This can happen because the feature engineering sequence is fairly random and may not do the same things with the original features if you restart entirely fresh with new columns.
Beginning in Driverless AI v1.4.0, you now have the option to Restart from Last Checkpoint. This
allows you to pull in a new dataset with more columns, and Driverless AI will iteratively take advantage of the new columns.
How does Driverless AI handle imbalanced data for binary classification experiments?
If you have data that is imbalanced, a binary imbalanced model can help to improve scoring with a variety
of imbalanced sampling methods. An imbalanced model is able to take advantage of most (or even all)
of the imbalanced dataset’s positive values during sampling, while a regular model significantly limits
the population of positive values. Imbalanced models, however, take more time to make predictions,
and they are not always more accurate than regular models. We still recommend that you try using an
imbalanced model if your data is imbalanced to see if scoring is improved over a regular model. Note that
this information only applies to binary models.
45.8 Predictions
How can I download the predictions onto the machine where Driverless AI is running?
When you select Score on Another Dataset, the predictions will automatically be stored on the machine
where Driverless AI is running. They will be saved in the following locations (and can be opened again
by Driverless AI, both for .csv and .bin):
• Training Data Predictions: tmp/h2oai_experiment_<name>/train_preds.csv (also saved as .bin)
• Testing Data Predictions: tmp/h2oai_experiment_<name>/test_preds.csv (also saved as .bin)
• New Data Predictions: tmp/h2oai_experiment_<name>/automatically_generated_name.csv. Note
that the automatically generated name will match the name of the file downloaded to your local
computer.
Why are predicted probabilities not available when I run an experiment without ensembling?
When Driverless AI provides pre-computed predictions after completing an experiment, it uses only those
parts of the modeling pipeline that were not trained on the particular rows for which the predictions are
made. This means that Driverless AI needs holdout data in order to create predictions, such as validation
or test sets, where the model is trained on training data only. In the case of ensembles, Driverless AI
uses cross-validation to generate holdout folds on the training data, so we are able to provide out-of-fold
estimates for every row in the training data and, hence, can also provide training holdout predictions (that
will provide a good estimate of generalization performance). A single model, however, is trained on 100% of the training data, so there is no way to create unbiased estimates for any row in the training data. While DAI uses an internal validation dataset, this is a reusable holdout and therefore does not contain holdout predictions for the full training dataset. You need cross-validation in order to get out-of-fold estimates, and then it is no longer a single model. If you still want predictions for the training data for a single model, then you have to use the scoring API to create predictions on the
training set. From the GUI, this can be done using the Score on Another Dataset button for a completed
experiment. Note, though, that the results will likely be overly optimistic (too good to be true) and virtually useless.
45.9 Deployment
Assume that in my Walmart dataset, all stores provided data at the week level, but one store provided data at
the day level. What would Driverless AI do?
Driverless AI would still assume “weekly data” in this case because the majority of stores exhibit this property. The “daily” store would be resampled to the detected overall frequency.
Assume that in my Walmart dataset, all stores and departments provided data at the weekly level, but one
department in a specific store provided weekly sales on a bi-weekly basis (every two weeks). What would
Driverless AI do?
That’s similar to having missing data. Due to proper resampling, Driverless AI can handle this without
any issues.
Why does the number of weeks that you want to start predicting matter?
That option lets you provide a train-test gap if no test data is available. That is to say, “I don’t have my test data yet, but I know it will have a gap to train of x.”
Are the scoring components of time series sensitive to the order in which new pieces of data arrive? I.e., is each
row independent at scoring time, or is there a real-time windowing effect in the scoring pieces?
Each row is independent at scoring time.
What happens if the user, at predict time, gives a row with a time value that is too small or too large?
Internally, “out-of-bounds” time values are encoded with special values. The samples will still be scored,
but the predictions won’t be trustworthy.
What’s the minimum data size for a time series recipe?
We recommend that you have around 10,000 validation samples in order to get a reliable estimate of
the true error. The time series recipe can still be applied for smaller data, but the validation error might be
inaccurate.
How long must the training data be compared to the test data?
At a minimum, the training data has to be at least twice as long as the test data along the time axis.
However, we recommend that the training data be at least three times as long as the test data.
How does the time series recipe deal with missing values?
Missing values will be converted to a special value, which is different from any non-missing feature value.
Explicit imputation techniques won’t be applied.
Can the time information be distributed across multiple columns in the input data (such as [year, day, month])?
Currently, Driverless AI requires the data to have the time stamps given in a single column. Driverless AI
will create additional time features like [year, day, month] on its own, if they turn out to be useful.
What type of modeling approach does Driverless AI use for time series?
Driverless AI combines the creation of history-based features (such as lags and moving averages) with the modeling techniques that are also applied to i.i.d. data. The primary model of choice is XGBoost.
What’s the idea behind exponential weighting of moving averages?
Exponential weighting accounts for the possibility that more recent observations are better suited to explain
the present than older observations.
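For intuition, an exponentially weighted moving average down-weights older observations geometrically; a toy pandas illustration (not Driverless AI internals):
# Hedged sketch: a plain moving average versus an exponentially weighted one.
import pandas as pd

s = pd.Series([10, 12, 11, 15, 30])         # toy history, most recent value last
print(s.rolling(window=3).mean().iloc[-1])  # simple moving average: (11+15+30)/3
print(s.ewm(alpha=0.5).mean().iloc[-1])     # recent observations dominate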
45.11 Logging
FORTYSIX
TIPS ‘N TRICKS
Given training data and a target column to predict, H2O Driverless AI produces an end-to-end pipeline tuned for high
predictive performance (and/or high interpretability) for general classification and regression tasks. The pipeline has
only one purpose: to take a test set, row by row, and turn its feature values into predictions.
A typical pipeline creates dozens or even hundreds of derived features from the user-given dataset. Those transforma-
tions are often based on precomputed lookup tables and parameterized mathematical operations that were selected and
optimized during training. It then feeds all these derived features to one or several machine learning algorithms such
as linear models, deep learning models, or gradient boosting models (and several more derived models). If there are
multiple models, then their output is post-processed to form the final prediction (either probabilities or target values).
The pipeline is a directed acyclic graph.
It is important to note that the training dataset is processed as a whole for better results (e.g., aggregate statistics).
For scoring, however, every row of the test dataset must be processed independently to mimic the actual production
scenario.
To facilitate deployment to various production environments, there are multiple ways to obtain predictions from a
completed Driverless AI experiment, either from the GUI, from the R or Python client API, or from a standalone
pipeline.
GUI
• Score on Another Dataset - Convenient, parallelized, ideal for imported data
• Download Predictions - Available if a test set was provided during training
• Deploy - Creates an Amazon Lambda endpoint (more endpoints coming soon)
• Diagnostics - Useful if the test set includes a target column
Client APIs
• Python client - Use the make_prediction_sync() method. An optional argument can be used to get per-row and per-feature ‘Shapley’ prediction contributions. (Pass pred_contribs=True; see the sketch after this list.)
• R client - Use the predict() method. An optional argument can be used to get per-row and per-feature
‘Shapley’ prediction contributions. (Pass pred_contribs=True.)
Standalone Pipelines
• Python - Supports all models and transformers, and supports ‘Shapley’ prediction contributions and MLI reason
codes
• Java - Most portable, low latency, supports all models and transformers that are enabled by default (except
TensorFlow NLP transformers), can be used in Spark/H2O-3/SparklingWater for scale
• C++ - Highly portable, low latency, standalone runtime with a convenient Python and R wrapper
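The following sketch shows how the Python client call mentioned above might look. Apart from make_prediction_sync() and pred_contribs=True, which are named in this section, the constructor and argument names are assumptions and may differ between client versions:
# Hedged sketch of the Python client workflow; except for make_prediction_sync()
# and pred_contribs=True, the names used here are assumptions and may differ
# between client versions.
from h2oai_client import Client

h2o = Client(address="http://localhost:12345", username="user", password="pass")
preds = h2o.make_prediction_sync(model_key="<experiment key>",
                                 dataset_key="<dataset key>",
                                 output_margin=False,
                                 pred_contribs=True)  # per-row Shapley contributions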
A core capability of H2O Driverless AI is the creation of automatic machine learning modeling pipelines for supervised
problems. In addition to the data and the target column to be predicted, the user can pick a scorer. A scorer is a function
that takes actual and predicted values for a dataset and returns a number. Looking at this single number is the most
common way to estimate the generalization performance of a predictive model on unseen data by comparing the
model’s predictions on the dataset with its actual values. There are more detailed ways to estimate the performance
of a machine learning model such as residual plots (available on the Diagnostics page in Driverless AI), but we will
focus on scorers here.
For a given scorer, Driverless AI optimizes the pipeline to end up with the best possible score for this scorer. The
default scorer for regression problems is RMSE (root mean squared error), where 0 is the best possible value. For
example, for a dataset containing 4 rows, if actual target values are [1, 1, 10, 0], but predictions are [2, 3, 4, -1], then
the RMSE is sqrt((1+4+36+1)/4) and the largest misprediction dominates the overall score (quadratically). Driverless
AI will focus on improving the predictions for the third data point, which can be very difficult when hard-to-predict
outliers are present in the data. If outliers are not that important to get right, a metric like the MAE (mean absolute
error) can lead to better results. For this case, the MAE is (1+2+6+1)/4 and the optimization process will consider
all errors equally (linearly). Another scorer that is robust to outliers is RMSLE (root mean square logarithmic error),
which is like RMSE but after taking the logarithm of actual and predicted values - however, it is restricted to positive
values. For price predictions, scorers such as MAPE (mean absolute percentage error) or MER (median absolute
percentage error) are useful, but have problems with zero or small positive values. SMAPE (symmetric mean absolute
percentage error) is designed to improve upon that.
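To make the RMSE/MAE comparison above concrete, a few lines of Python reproduce the numbers:
# Worked example: RMSE is dominated by the single large error, MAE is not.
import numpy as np

actual = np.array([1, 1, 10, 0])
pred = np.array([2, 3, 4, -1])
rmse = np.sqrt(np.mean((actual - pred) ** 2))  # sqrt((1+4+36+1)/4), about 3.24
mae = np.mean(np.abs(actual - pred))           # (1+2+6+1)/4 = 2.5
print(rmse, mae)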
For classification problems, the default scorer is either the AUC (area under the receiver operating characteristic
curve) or LOGLOSS (logarithmic loss) for imbalanced problems. LOGLOSS focuses on getting the probabilities
right (strongly penalizes wrong probabilities), while AUC is designed for ranking problems. Gini is similar to the
AUC, but measures the quality of ranking (inequality) for regression problems. For general imbalanced classification
problems, AUCPR and MCC are good choices, while F05, F1 and F2 are designed to balance recall against precision.
We highly suggest experimenting with different scorers and studying their impact on the resulting models. Using the
Diagnostics page in Driverless AI, all applicable scores can be computed for any given model, no matter which scorer
was used during training.
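If you want to recompute such classification scores outside of Driverless AI (for example on downloaded predictions), standard implementations are available in scikit-learn; a minimal sketch with toy values:
# Hedged sketch: common classification scorers computed with scikit-learn.
from sklearn.metrics import roc_auc_score, log_loss, f1_score

y_true = [0, 0, 1, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8, 0.9]                  # probabilities for class 1
print("AUC    :", roc_auc_score(y_true, p_hat))
print("LOGLOSS:", log_loss(y_true, p_hat))
print("F1     :", f1_score(y_true, [int(p > 0.5) for p in p_hat]))  # needs a threshold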
H2O Driverless AI allows you to customize every experiment in great detail via the expert settings. The most important controls, however, are the three knobs for accuracy, time and interpretability. A higher accuracy setting results in a better estimate of the model's generalization performance, usually by using more data, more holdout sets, more parameter tuning rounds and other advanced techniques. A higher time setting means the experiment is given more time to converge to an optimal solution. A higher interpretability setting reduces the model's complexity through less feature engineering and simpler models. In general, a setting of 1/1/10 will lead to the simplest and usually least accurate modeling pipeline, while a setting of 10/10/1 will lead to the most complex and most time-consuming experiment possible. Generally, it is sufficient to use settings of 7/5/5 or similar, and we recommend starting with the default settings. We highly recommend studying the experiment preview on the left-hand side of the GUI before each
experiment - it can help you fine-tune the settings and save time overall.
Note that you can always finish an experiment early, either by clicking ‘Finish’ to get the deployable final pipeline out,
or by clicking ‘Abort’ to instantly terminate the experiment. In either case, the experiment can be continued seamlessly
at a later time with ‘Restart from last Checkpoint’ or ‘Retrain Final Pipeline’, and you can always turn the knobs (or
modify the expert settings) to adapt to your requirements.
H2O Driverless AI is an automatic machine learning platform designed to create highly accurate modeling pipelines
from tabular training data. The predictive performance of the pipeline is a function of both the training data and
the parameters of the pipeline (details of feature engineering and modeling). During an experiment, Driverless AI
automatically tunes these parameters by scoring candidate pipelines on held out (“validation”) data. This important
validation data is either provided by the user (for experts) or automatically created (random, time-based or fold-based)
by Driverless AI. Once a final pipeline has been created, it should be scored on yet another held out dataset (“test data”)
to estimate its generalization performance. Understanding the origin of the training, validation and test datasets (“the
validation scheme”) is critical for success with machine learning, and we welcome your feedback and suggestions to
help us create the right validation schemes for your use cases.
H2O Driverless AI offers a range of ‘Expert Settings’ that allow you to customize each experiment. For example, you
can limit the amount of feature engineering by reducing the value for ‘Feature engineering effort’ or ‘Max. feature
interaction depth’ or by disabling ‘Target Encoding’. You can also select the model types to be used for training
on the engineered features (such as XGBoost, LightGBM, GLM, TensorFlow, FTRL, or RuleFit). For time-series
problems where the selected time_column leads to an error message (this can currently happen if the time structure
is not regular enough - we are working on an improved version), you can disable the ‘Time-series lag-based recipe’
and Driverless AI will create train/validation splits based on the time order instead, which can increase the model’s
performance if the time column is important.
Driverless AI provides the option to checkpoint experiments to speed up feature engineering and model tuning when
running multiple experiments on the same dataset. By default, H2O Driverless AI automatically scans all prior exper-
iments (including aborted ones) for an optimal checkpoint to restart from. You can select a specific prior experiment
to restart a new experiment from with “Restart from Last Checkpoint” in the experiment listing page (click on the 3
yellow bars on the right). You can disable checkpointing by setting ‘Feature Brain Level’ in the expert settings (or
feature_brain_level in the configuration file) to 0 to force the experiment to start from scratch.
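For example, checkpointing can be disabled for all experiments in config.toml; only the feature_brain_level key below is taken from this section, and the comment is illustrative:
# Hedged sketch of a config.toml entry: a value of 0 disables Feature Brain
# checkpointing so every experiment starts from scratch.
feature_brain_level = 0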
For datasets that contain text (string) columns - where each value can be a few words, a paragraph or an entire document
- Driverless AI automatically creates NLP features based on bag of words, tf-idf, singular value decomposition and
out-of-fold likelihood estimates. In versions 1.3 and above, you can enable TensorFlow in the expert settings to see
how CNN (convolutional neural net) based learned word embeddings can improve predictive accuracy even more. Try
this for sentiment analysis, document classification, and generic text-enriched datasets.
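For intuition only, the scikit-learn sketch below builds TF-IDF features followed by a truncated SVD, which is conceptually similar to (but not the same as) the text features Driverless AI creates:
# Hedged sketch: TF-IDF followed by truncated SVD, a conceptual stand-in for the
# text features described above (not the actual Driverless AI transformers).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["great product, works as advertised",
        "terrible support, would not recommend",
        "average quality but good value"]
tfidf = TfidfVectorizer().fit_transform(docs)                     # sparse bag-of-words weights
svd_features = TruncatedSVD(n_components=2).fit_transform(tfidf)  # dense numeric features
print(svd_features.shape)                                         # (3, 2)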
FORTYSEVEN
This appendix describes how to use custom recipes in Driverless AI. You’re welcome to create your own recipes, or
you can select from a number of recipes available in the https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.4
repository.
Notes:
• Recipes only need to be added once. After a recipe is added to an experiment, that recipe will then be available
for all future experiments.
• In most cases (especially for complex recipes), MOJOs won't be available out of the box. However, it is possible to get the MOJO. Contact [email protected] for more information about creating MOJOs for custom recipes. (Note
that the Python Scoring Pipeline features full support for custom recipes.)
• Custom Recipes FAQ: For answers to common questions about custom recipes.
• How to Write a Recipe: A guide for writing your own recipes.
• Data Template: A template for creating your own Data recipe.
• Model Template: A template for creating your own Model recipe.
• Scorer Template: A template for creating your own Scorer recipe.
• Transformer Template: A template for creating your own Transformer recipe.
47.2 Examples
Driverless AI allows you to create a new dataset by modifying an existing dataset with a data recipe. (Refer to the
modify_by_recipe section for more information.) This example shows you how to use the Live Code option to create
a new dataset by adding a data recipe.
1. Navigate to the Datasets page, then click on the dataset you want to modify.
2. Click Details from the submenu that appears to open the Dataset Details page.
3. Click the Modify by Recipe button in the top right portion of the UI, then click Live Code from the submenu
that appears.
4. Enter the code for the data recipe you want to use to modify the dataset. Click the Get Preview button to see a preview of how the data recipe will modify the dataset. In this simple example, the data recipe modifies the number of rows and columns in the dataset. (A sketch of such a recipe is shown after these steps.)
5. Click the Save button to confirm the changes and create a new dataset. (The original dataset will still be available
on the Datasets page.)
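As an illustration, a Live Code recipe of the kind used in step 4 might look like the sketch below. It assumes the Live Code editor exposes the dataset as a datatable Frame named X and runs the snippet inside a function so the modified Frame can be returned; check the template shown in the Live Code editor for the exact contract:
# Hedged sketch of a Live Code data recipe. It assumes the editor provides the
# dataset as a datatable Frame named X and that the snippet runs inside a
# function, so the modified Frame is handed back with "return".
import datatable as dt

X = X[:100, :]                 # keep only the first 100 rows
del X[:, "unused_column"]      # drop a placeholder column
return X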
Driverless AI already supports a variety of algorithms. This example shows how you can use our h2o-3-models-py
recipe to include H2O-3 supervised learning algorithms in your experiment. The available H2O-3 algorithms in the
recipe include:
• Naive Bayes
• GBM
• Random Forest
• Deep Learning
• GLM
• AutoML
Caution: Because AutoML is treated as a regular ML algorithm here, the runtime requirements can be large.
We recommend that you adjust the max_runtime_secs parameters as suggested here: https://github.com/h2oai/driverlessai-recipes/blob/rel-1.8.4/models/algorithms/h2o-3-models.py#L39
1. Start an experiment in Driverless AI by selecting your training dataset along with (optionally) validation and
testing datasets and then specifying a Target Column. Notice the list of algorithms that will be used in the Feature evolution section of the experiment summary. In the example below, the experiment will use LightGBM
and XGBoostGBM.
Driverless AI will begin uploading and verifying the new custom recipe.
4. In the Expert Settings page, specify any additional settings and then click Save. This returns you to the experi-
ment summary.
5. To include each of the new models in your experiment, return to the Expert Settings option. Click the Recipes
> Include Specific Models option. Select the algorithm(s) that you want to include. Click Done to return to the
experiment summary.
6. Edit any additional experiment settings, and then click Launch Experiment.
Upon completion, you can download the Experiment Summary and review the Model Tuning section of the report.docx file to see how the algorithms compare.
3. Specify the custom scorer recipe using one of the following methods (a sketch of such a scorer recipe is shown after these steps):
• On your local machine, clone the https://github.com/h2oai/driverlessai-recipes repository for this release branch. Then use the Upload Custom Recipe button to upload the driverlessai-recipes/scorers/explained_variance.py file.
• Click the Load Custom Recipe from URL button, then enter the URL for the raw explained_variance.py file (for example, https://raw.githubusercontent.com/h2oai/driverlessai-recipes/rel-1.8.4/scorers/regression/explained_variance.py).
Note: Click the Official Recipes (External) button to browse the driverlessai-recipes repository.
Driverless AI will begin uploading and verifying the new custom recipe.
4. In the Experiment Summary page, select the new Explained Variance (EXPVAR) scorer. (Note: If you do not
see the EXPVAR option, return to the Expert Settings, select Recipes > Include Specific Scorers, then click the
Enable Custom button in the top right corner. Click Done and then Save to return to the Experiment Summary.)
5. Edit any additional experiment settings, and then click Launch Experiment. The experiment will run using the
custom Explained Variance scorer.
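For reference, custom scorer recipes in the driverlessai-recipes repository follow a pattern similar to the sketch below; the base class, attributes, and score() signature are modeled on that repository and may differ between releases:
# Hedged sketch of a custom scorer recipe, modeled on the pattern used in the
# driverlessai-recipes repository (names and signatures may differ per release).
import numpy as np
from h2oaicore.metrics import CustomScorer


class MyExplainedVariance(CustomScorer):
    _description = "Explained variance for regression"
    _regression = True
    _maximize = True          # larger scores are better
    _perfect_score = 1.0

    def score(self, actual, predicted, sample_weight=None, labels=None, **kwargs):
        # explained variance = 1 - Var(actual - predicted) / Var(actual)
        return 1.0 - np.var(actual - predicted) / np.var(actual)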
Driverless AI supports a number of feature transformers as described in Driverless AI Transformations. This example shows how you can include a custom transformer in your experiment. Specifically, this example will show how to add the ExpandingMean transformer. (A simplified sketch of such a transformer recipe appears after the steps below.)
1. Start an experiment in Driverless AI by selecting your training dataset along with (optionally) validation and
testing datasets and then specifying a Target Column. Notice the list of transformers that will be used in the
Feature engineering search space (where applicable) section of the experiment summary. Driverless AI
determines this list based on the dataset and experiment.
Driverless AI will begin uploading and verifying the new custom recipe.
4. Navigate to the Expert Settings > Recipes tab and click the Include Specific Transformers button. Notice
that all transformers are selected by default, including the new ExpandingMean transformer (bottom of page).
5. Select the transformers that you want to include in the experiment. Use the Check All/Uncheck All button to quickly add or remove all transformers at once. This example removes all transformers except for OriginalTransformer and ExpandingMean.
Note: If you uncheck all transformers so that none is selected, Driverless AI will ignore this and will use
the default list of transformers for that experiment. (See the image in Step 1.) This list of transformers
will vary for each experiment.
6. Edit any additional experiment settings, and then click Launch Experiment. The experiment will run using the
custom ExpandingMean transformer.
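For reference, a transformer recipe follows the CustomTransformer pattern used in the driverlessai-recipes repository; the sketch below is a simplified running-mean illustration (not the shipped ExpandingMean recipe), and the base-class details are modeled on that repository and may differ between releases:
# Hedged sketch of a custom transformer recipe, modeled on the CustomTransformer
# pattern in the driverlessai-recipes repository (simplified illustration, not
# the shipped ExpandingMean recipe).
import datatable as dt
import numpy as np
from h2oaicore.transformer_utils import CustomTransformer


class MyExpandingMeanTransformer(CustomTransformer):
    @staticmethod
    def get_default_properties():
        return dict(col_type="numeric", min_cols=1, max_cols=1, relative_importance=1)

    def fit_transform(self, X: dt.Frame, y: np.array = None):
        return self.transform(X)

    def transform(self, X: dt.Frame):
        values = X.to_pandas().iloc[:, 0]
        return values.expanding().mean().values  # running mean up to each row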
FORTYEIGHT
H2O Driverless AI integrates with a (continuously growing) number of third-party products. Please contact
[email protected] to schedule a discussion with one of our Solution Engineers for more information.
If you are interested in a product not yet listed here, please ask us about it!
The following products are able to manage (start and stop) Driverless AI instances themselves:
Name                     Notes
BlueData                 DAI runs in a BlueData container
Domino                   DAI runs in a Domino container
IBM Spectrum Conductor   DAI runs in user mode via TAR SH distribution
IBM Cloud Private (ICP)  Uses Kubernetes underneath; DAI runs in a docker container; requires HELM chart
Kubernetes               DAI runs as a long-running service via a Docker container
Kubeflow                 Abstraction of Kubernetes; allows additional monitoring and management of Kubernetes deployments. Click here for more information.
Puddle (from H2O.ai)     Multi-tenant orchestration platform for DAI instances (not a third party, but listed here for completeness)
SageMaker                Bring your own algorithm docker container
Name            Notes
Alteryx         Allows users to interact with a remote DAI server from Alteryx Designer
Cinchy          Data collaboration for the Enterprise; use MOJOs to enrich data and use the Cinchy data network to train models
Jupyter/Python  DAI Python API client library can be downloaded from the Web UI of a running instance
KDB             Use KDB as a data source in Driverless AI for training
RStudio/R       Under development; please ask for the DAI R API client library
48.3 Scoring
Name       Notes
KDB        Call a MOJO to score streaming data from KDB Ticker Service
ParallelM  Deploy and monitor MOJO models
Qlik       Call a MOJO from a Qlik dashboard
SageMaker  Host scoring-only docker image that uses a MOJO
Trifacta   Call a MOJO as a UDF
UiPath     Call a MOJO from within an RPA workflow
48.4 Storage
Name               Notes
Network Appliance  A mounted expandable volume is convenient for the Driverless AI working (tmp) directory
Please visit the section on Enabling Data Connectors for information about data sources supported by Driverless AI.