
Using Driverless AI

Release 1.8.4.1

H2O.ai

Feb 05, 2020


RELEASE NOTES

1 H2O Driverless AI Release Notes
1.1 Architecture
1.2 Roadmap
1.3 Change Log

2 Why Driverless AI?

3 Key Features
3.1 Flexibility of Data and Deployment
3.2 NVIDIA GPU Acceleration
3.3 Automatic Data Visualization (Autovis)
3.4 Automatic Feature Engineering
3.5 Automatic Model Documentation
3.6 Time Series Forecasting
3.7 NLP with TensorFlow
3.8 Automatic Scoring Pipelines
3.9 Machine Learning Interpretability (MLI)
3.10 Automatic Reason Codes
3.11 Bring Your Own Recipe (BYOR) Support

4 Supported Algorithms
4.1 Constant Model
4.2 Decision Tree
4.3 FTRL
4.4 GLM
4.5 Isolation Forest
4.6 LightGBM
4.7 RuleFit
4.8 TensorFlow
4.9 XGBoost
4.10 References

5 Driverless AI Workflow

6 Driverless AI Licenses
6.1 About Licenses
6.2 Adding Licenses for the First Time
6.3 Updating Licenses

7 Before You Begin Installing or Upgrading
7.1 Supported Browsers
7.2 To sudo or Not to sudo
7.3 Note about nvidia-docker 1.0
7.4 Deprecation of nvidia-smi
7.5 New nvidia-container-runtime-hook Requirement for PowerPC Users
7.6 Note About CUDA Versions
7.7 Note About Authentication
7.8 Note About Shared File Systems
7.9 Backup Strategy
7.10 Upgrade Strategy

8 Installing and Upgrading Driverless AI
8.1 Sizing Requirements
8.2 Linux X86_64 Installs
8.3 IBM Power Installs
8.4 Mac OS X
8.5 Windows 10 Pro

9 Using the config.toml File
9.1 Configuration Override Chain
9.2 Docker Image Users
9.3 Native Install Users
9.4 Sample Config.toml File

10 Environment Variables and Configuration Options
10.1 Setting Environment Variables in Docker Images
10.2 Setting Configuration Options in Native Installs

11 Enabling Data Connectors
11.1 Using Data Connectors with the Docker Image
11.2 Using Data Connectors with Native Installs

12 Configuring Authentication
12.1 Client Certificate Authentication Example
12.2 LDAP Authentication Example
12.3 Local Authentication Example
12.4 mTLS Authentication Example
12.5 OpenID Connect Authentication Example
12.6 PAM Authentication Example

13 Enabling Notifications
13.1 Script Interfaces
13.2 Example

14 Export Artifacts
14.1 Enabling Artifact Exports
14.2 Exporting an Artifact

15 Launching Driverless AI
15.1 Resources
15.2 Messages

16 The Datasets Page
16.1 Supported File Types
16.2 Adding Datasets
16.3 Renaming Datasets
16.4 Dataset Details
16.5 Downloading Datasets
16.6 Splitting Datasets
16.7 Visualizing Datasets

17 Experiments
17.1 Before You Begin
17.2 Experiment Settings
17.3 Expert Settings
17.4 Scorers
17.5 New Experiments
17.6 Completed Experiment
17.7 Experiment Graphs
17.8 Model Scores
17.9 Experiment Summary
17.10 Viewing Experiments

18 Diagnosing a Model
18.1 Classification Metric Plots
18.2 Regression Metric Plots

19 Project Workspace
19.1 Linking Datasets
19.2 Linking Experiments
19.3 Experiments List
19.4 Unlinking Data on a Projects Page
19.5 Deleting Projects

20 MLI Overview

21 The Interpreted Models Page

22 MLI for Regular (Non-Time-Series) Experiments
22.1 Interpreting a Model
22.2 Understanding the Model Interpretation Page
22.3 Viewing Explanations
22.4 General Considerations

23 MLI for Time-Series Experiments
23.1 Multi-Group Time Series MLI
23.2 Single Time Series MLI

24 Score on Another Dataset

25 Transform Another Dataset

26 Scoring Pipelines Overview

27 Visualizing the Scoring Pipeline

28 Which Pipeline Should I Use?

29 Driverless AI Standalone Python Scoring Pipeline
29.1 Python Scoring Pipeline Files
29.2 Quick Start
29.3 The Python Scoring Module
29.4 The Scoring Service
29.5 Python Scoring Pipeline FAQ
29.6 Troubleshooting Python Environment Issues

30 Driverless AI MLI Standalone Python Scoring Package
30.1 MLI Python Scoring Package Files
30.2 Quick Start
30.3 MLI Python Scoring Module
30.4 K-LIME vs Shapley Reason Codes
30.5 MLI Scoring Service Overview

31 MOJO Scoring Pipelines
31.1 Driverless AI MOJO Scoring Pipeline - Java Runtime
31.2 Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrappers
31.3 MOJO2 Javadoc

32 Deploying the MOJO Pipeline
32.1 Deployments Overview Page
32.2 Amazon Lambda Deployment
32.3 REST Server Deployment

33 What’s Happening in Driverless AI?

34 Data Sampling

35 Driverless AI Transformations
35.1 Available Transformers
35.2 Example Transformations

36 Internal Validation Technique

37 Missing and Unseen Levels Handling
37.1 How Does the Algorithm Handle Missing Values During Training?
37.2 How Does the Algorithm Handle Missing Values During Scoring (Production)?
37.3 What Happens When You Try to Predict on a Categorical Level Not Seen During Training?
37.4 What Happens if the Response Has Missing Values?

38 Imputation in Driverless AI
38.1 Enabling Imputation
38.2 Running an Experiment with Imputation

39 Time Series in Driverless AI
39.1 Understanding Time Series
39.2 Rolling-Window-Based Predictions
39.3 Time Series Constraints
39.4 Time Series Use Case: Sales Forecasting
39.5 Time Series Expert Settings
39.6 Using a Driverless AI Time Series Model to Forecast
39.7 Additional Resources

40 NLP in Driverless AI
40.1 n-gram
40.2 Truncated SVD Features
40.3 Linear Models for TFIDF Vectors
40.4 Word Embeddings
40.5 NLP Naming Conventions
40.6 NLP Expert Settings
40.7 A Typical NLP Example: Sentiment Analysis

41 The Python Client
41.1 Installing the Python Client
41.2 Driverless AI: Credit Card Demo
41.3 Driverless AI - Training Time Series Model
41.4 Driverless AI - Time Series Recipes with Rolling Window
41.5 Time Series Analysis on a Driverless AI Model Scoring Pipeline
41.6 Driverless AI NLP Demo - Airline Sentiment Dataset
41.7 Driverless AI Autoviz Python Client Example

42 The R Client
42.1 Installing the R Client
42.2 R Client Tutorial

43 Driverless AI Logs
43.1 Accessing Driverless AI Logs
43.2 Sending Logs to H2O

44 Driverless AI Security
44.1 Objective
44.2 Important things to know
44.3 User Access
44.4 Data Security
44.5 Client-Server Communication Security
44.6 Web UI Security
44.7 Custom Recipe Security
44.8 Baseline Secure Configuration

45 FAQ
45.1 General
45.2 Installation/Upgrade/Authentication
45.3 Data
45.4 Connectors
45.5 Recipes
45.6 Experiments
45.7 Feature Transformations
45.8 Predictions
45.9 Deployment
45.10 Time Series
45.11 Logging

46 Tips 'n Tricks
46.1 Pipeline Tips
46.2 Time Series Tips
46.3 Scorer Tips
46.4 Knob Settings Tips
46.5 Tips for Running an Experiment
46.6 Expert Settings Tips
46.7 Checkpointing Tips
46.8 Text Data Tips

47 Appendix A: Custom Recipes
47.1 Additional Resources
47.2 Examples

48 Appendix B: Third-Party Integrations
48.1 Instance Life-Cycle Management
48.2 API Clients
48.3 Scoring
48.4 Storage
48.5 Data Sources

49 References

H2O Driverless AI is an artificial intelligence (AI) platform for automatic machine learning. Driverless AI automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy, comparable to expert data scientists, but in a much shorter time thanks to end-to-end automation. Driverless AI also offers automatic visualizations and machine learning interpretability (MLI). Especially in regulated industries, model transparency and explanation are just as important as predictive performance. Modeling pipelines (feature engineering and models) are exported (in full fidelity, without approximations) both as Python modules and as Java standalone scoring artifacts.
Driverless AI runs on commodity hardware. It was also specifically designed to take advantage of graphical processing units (GPUs), including multi-GPU workstations and servers such as IBM’s Power9-GPU AC922 server and the NVIDIA DGX-1, for order-of-magnitude faster training.
This document describes how to install and use Driverless AI. For more information about Driverless AI, see https://www.h2o.ai/products/h2o-driverless-ai/.
For a third-party review, see https://www.infoworld.com/article/3236048/machine-learning/review-h2oai-automates-machine-learning.html.
Have Questions?
If you have questions about using Driverless AI, post them on Stack Overflow using the driverless-ai tag at http://stackoverflow.com/questions/tagged/driverless-ai.
You can also post questions on the H2O.ai Community Slack workspace in the #driverlessai channel. If you have not signed up for the H2O.ai Community Slack workspace, you can do so here: https://www.h2o.ai/community/.

CHAPTER ONE

H2O DRIVERLESS AI RELEASE NOTES

H2O Driverless AI is a high-performance, GPU-enabled, client-server application for the rapid development and deployment of state-of-the-art predictive analytics models. It reads tabular data from various sources and automates data
visualization, grand-master level automatic feature engineering, model validation (overfitting and leakage prevention),
model parameter tuning, model interpretability and model deployment. H2O Driverless AI is currently targeting
common regression, binomial classification, and multinomial classification applications including loss-given-default,
probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, and predictive
asset maintenance models. It also handles time-series problems for individual or grouped time-series such as weekly
sales predictions per store and department, with time-causal feature engineering and validation schemes. The ability
to model unstructured data is coming soon.
High-level capabilities:
• Client/server application for rapid experimentation and deployment of state-of-the-art supervised machine learning models
• User-friendly GUI
• Python and R client API (see the sketch following this list)
• Automatically creates machine learning modeling pipelines for highest predictive accuracy
• Automates data cleaning, feature selection, feature engineering, model selection, model tuning, ensembling
• Automatically creates stand-alone batch scoring pipeline for in-process scoring or client/server scoring via
HTTP or TCP protocols in Python
• Automatically creates stand-alone (MOJO) low latency scoring pipeline for in-process scoring or client/server
scoring via HTTP or TCP protocols, in C++ (with R and Python runtimes) and Java (runs anywhere)
• Multi-GPU and multi-CPU support for powerful workstations and NVidia DGX supercomputers
• Machine Learning model interpretation module with global and local model interpretation
• Automatic Visualization module
• Multi-user support
• Backward compatibility
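The Python and R client APIs listed above expose the same workflow programmatically. Below is a minimal sketch using the legacy h2oai_client Python package shipped with the 1.8.x releases; the server address, credentials, dataset path, and target column are placeholders, and exact method signatures may vary slightly between releases.

```python
# Minimal sketch of the programmatic workflow (placeholders: address, credentials,
# dataset path, target column; based on the legacy h2oai_client package).
from h2oai_client import Client

# Connect to a running Driverless AI server
h2oai = Client(address="http://localhost:12345", username="user", password="password")

# Import a dataset that is visible to the server's file system
train = h2oai.create_dataset_sync("/data/CreditCard-train.csv")

# Launch an experiment; accuracy/time/interpretability are the 1-10 experiment knobs
experiment = h2oai.start_experiment_sync(
    dataset_key=train.key,
    target_col="default payment next month",
    is_classification=True,
    accuracy=5,
    time=3,
    interpretability=7,
    scorer="AUC",
)

# Experiment ID; artifacts (MOJO, Autoreport, predictions) can then be downloaded
print(experiment.key)
```

Chapter 41 (The Python Client) walks through this workflow end to end with the credit card demo.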
Problem types supported:
• Regression (continuous target variable like age, income, price or loss prediction, time-series forecasting)
• Binary classification (0/1 or “N”/”Y”, for fraud prediction, churn prediction, failure prediction, etc.)
• Multinomial classification (“negative”/”neutral”/”positive” or 0/1/2/3 or 0.5/1.0/2.0 for categorical target variables, for prediction of membership type, next-action, product recommendation, sentiment analysis, etc.)
Data types supported:


• Tabular structured data, rows are observations, columns are fields/features/variables


• Numeric, categorical and textual fields
• Missing values are allowed
• i.i.d. (independent and identically distributed) data
• Time-series data with a single time-series (time flows across the entire dataset, not per block of data)
• Grouped time-series (e.g., sales per store per department per week, all in one file, with 3 columns for store, dept,
week)
• Time-series problems with a gap between training and testing (i.e., the time to deploy), and a known forecast
horizon (after which model has to be retrained)
Data types supported via custom recipes:
• Image
• Video
• Audio
• Graphs
Data sources supported:
• Local file system or NFS
• File upload from browser or Python client
• S3 (Amazon)
• Hadoop (HDFS)
• Azure Blob storage
• Blue Data Tap
• Google BigQuery
• Google Cloud storage
• kdb+
• Minio
• Snowflake
• JDBC
• Custom Data Recipe BYOR (Python, bring your own recipe)
File formats supported:
• Plain text formats of columnar data (.csv, .tsv, .txt)
• Compressed archives (.zip, .gz, .bz2)
• Excel
• Parquet
• Feather
• Python datatable (.jay) (see the conversion sketch following this list)
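For the .jay format above, a binary datatable file can be produced from a CSV with the open-source Python datatable package, as in this small sketch (file paths are placeholders):

```python
# Sketch: create a binary .jay file that Driverless AI can ingest directly.
# Requires the open-source `datatable` package; file paths are placeholders.
import datatable as dt

frame = dt.fread("train.csv")  # fast, multi-threaded CSV reader
frame.to_jay("train.jay")      # write the frame in binary .jay format
```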


1.1 Architecture

Fig. 1: DAI architecture


1.2 Roadmap

Fig. 2: DAI roadmap

1.3 Change Log

1.3.1 Version 1.8.4.1 LTS (Feb 4, 2020)

Available here
• Add option for dynamic port allocation
• Documentation for AWS community AMI
• Various bug fixes (MLI UI)

1.3.2 Version 1.8.4 LTS (Jan 31, 2020)

Available here
• New Features:
– Added ‘Scores’ tab in experiment page to show detailed tuning tables and scores for models and folds
– Added Constant Model (constant predictions) and use it as reference model by default
– Show score of global constant predictions in experiment summary as reference
– Added support for setting up mutual TLS for Driverless AI
– Added option to use client/personal certificate as an authentication method


• Documentation Updates:
– Added sections for enabling mTLS and Client Certificate authentication
– Constant Model is now included in the list of Supported Algorithms
– Added a section describing the Model Scores page
– Improved the C++ Scoring Pipeline documentation describing the process for importing datatable
– Improved documentation for the Java Scoring Pipeline
• Bug fixes:
– Fix refitting of final pipeline when new features are added
– Various bug fixes

1.3.3 Version 1.8.3 LTS (Jan 22, 2020)

Available here
• Added option to upload experiment artifacts to a configured disk location
• Various bug fixes (correct feature engineering from time column, migration for brain restart)

1.3.4 Version 1.8.2 LTS (Jan 17, 2020)

Available here
• New Features:
– Decision Tree model
* Automatically enabled for accuracy <= 7 and interpretability >= 7
* Supports all problem types: regression/binary/multiclass
* Using LightGBM GPU/CPU backend with MOJO
* Visualization of tree splits and leaf node decisions as part of pipeline visualization
– Per-Column Imputation Scheme (experimental)
* Select one of [const, mean, median, min, max, quantile] imputation scheme at start of experiment
* Select method of calculation of imputation value: either on entire dataset or inside each pipeline’s training data split
* Disabled by default and must be enabled at startup time to be effective
– Show MOJO size and scoring latency (for C++/R/Python runtime) in experiment summary
– Automatically prune low weight base models in final ensemble (based on interpretability setting) to reduce
final model complexity
– Automatically convert non-raw github URLs for custom recipes to raw source code URLs
• Improvements:
– Speed up feature evolution for time-series and low-accuracy experiments
– Improved accuracy of feature evolution algorithm
– Feature transformer interpretability, total count, and importance accounted for in genetic algorithm’s model
and feature selection


– Binary confusion matrix in ROC curve of experiment page is made consistent with Diagnostics (flipped
positions of TP/TN)
– Only include custom recipes in Python scoring pipeline if the experiment uses any custom recipes
– Additional documentation (New OpenID config options, JDBC data connector syntax)
– Improved AutoReport’s transformer descriptions
– Improved progress reporting during Autoreport creation
– Improved speed of automatic interaction search for imbalanced multiclass problems
– Improved accuracy of single final model for GLM and FTRL
– Allow config_overrides to be a list/vector of parameters for R client API
– Disable early stopping for Random Forest models by default, and expose new ‘rf_early_stopping’ mode
(optional)
– Create identical example data (again, as in 1.8.0 and before) for all scoring pipelines
– Upgraded versions of datatable and Java
– Installed graphviz in Docker image, so a .png file of the pipeline visualization is now included in the MOJO package and Autoreport. Note: For RPM/DEB/TAR SH installs, users can install graphviz to get this optional functionality
• Documentation Updates:
– Added a simple example for modifying a dataset by recipe using live code
– Added a section describing how to impute datasets (experimental)
– Added Decision Trees to list of supported algorithms
– Fixed examples for enabling JDBC connectors
– Added information describing how to use a JDBC driver that is not tested in house
– Updated the Missing Values Handling topic to include sections for “Clustering in Transformers” and “Isolation Forest Anomaly Score Transformer”
– Improved the “Fold Column” description
• Bug Fixes:
– Fix various reasons why final model score was too far off from best feature evolution score
– Delete temporary files created during test set scoring
– Fixed target transformer tuning (was potentially mixing up target transformers between feature evolution
and final model)
– Fixed tensorflow_nlp_have_gpus_in_production=true mode
– Fixed partial dependence plots for missing datetime values and no longer show them for text columns
– Fixed time-series GUI for quarterly data
– Feature transformer exploration limited to no more than 1000 new features (Small data on 10/10/1 would
try too many features)
– Fixed Kaggle pipeline building recipe to try more input features than 8
– Fixed cursor placement in live code editor for custom data recipe
– Show correct number of cross-validation splits in pipeline visualization if have more than 10 splits
– Fixed parsing of datetime in MOJO for some datetime formats without ‘%d’ (day)


– Various bug fixes


• Backward/Forward compatibility:
– Models built in 1.8.2 LTS will remain supported in upcoming versions 1.8.x LTS
– Models built in 1.7.1/1.8.0/1.8.1 are not deprecated and should continue to work (best effort is made to
preserve MOJO and Autoreport creation, MLI, scoring, etc.)
– Models built in 1.7.0 or earlier will be deprecated

1.3.5 Version 1.8.1.1 (Dec 21, 2019)

Available here
• Bugfix for time series experiments with quarterly data when launched from GUI

1.3.6 Version 1.8.1 (Dec 10, 2019)

Available here
• New Features:
– Full set of scoring metrics and corresponding downloadable holdout predictions for experiments with
single final models (time-series or i.i.d)
– MLI Updates:

* What-If (sensitivity) analysis


* Interpretation of experiments on text data (NLP)
– Custom Data Recipe BYOR:

* BYOR (bring your own recipe) in Python: pandas, numpy, datatable, third-party libraries for fast
prototyping of connectors and data preprocessing inside DAI

* data connectors, cleaning, filtering, aggregation, augmentation, feature engineering, splits, etc.
* can create one or multiple datasets from scratch or from existing datasets
* interactive code editor with live preview
* example code at https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.1/data
– Visualization of final scoring pipeline (Experimental)

* In-GUI display of graph of feature engineering, modeling and ensembling steps of entire machine
learning pipeline

* Addition to Autodoc
– Time-Series:

* Ability to specify which features will be unavailable at test time for time-series experiments
* Custom user-provided train/validation splits (by start/end datetime for each split) for time-series experiments

* Back-testing metrics for time-series experiments (regression and classification, with and without lags)
based on rolling windows (configurable number of windows)
– MOJO:

* Java MOJO for FTRL


* PyTorch MOJO (C++/Py/R) for custom recipes based on BERT/DistilBERT NLP models (available
upon request)
• Improvements:
– Accuracy:

* Automatic pairwise interaction search (+,-,*,/) for numeric features (“magic feature” finder)
* Improved accuracy for time series experiments with low interpretability
* Improved leakage detection logic
* Improved genetic algorithm heuristics for feature evolution (more exploration)
– Time-Series Recipes:

* Re-enable Test-time augmentation in Python scoring pipeline for time-series experiments


* Reduce default number of time-series rolling holdout predictions to same number as validation splits
(but configurable)
– Computation:

* Faster feature evolution part for non-time-series experiments with single final model
* Faster binary imbalanced models for very high class imbalance by limiting internal number of resampling bags

* Faster feature selection


* Enable GPU support for ImbalancedXGBoostGBMModel
* Improved speed for importing multiple files at once
* Faster automatic determination of time series properties
* Enable use of XGBoost models on large datasets if low enough accuracy settings, expose dataset size
limits in expert settings

* Reduced memory usage for all experiments


* Faster creation of holdout predictions for time-series experiments (Shapley values are now computed
by MLI on demand by default)
– UX Improvements:

* Added ability to rename datasets


* Added search bar for expert settings
* Show traces for long-running experiments
* All experiments create a MOJO (if possible, set to ‘auto’)
* All experiments create a pipeline visualization
* By default, all experiments (iid and time series) have holdout predictions on training data and a full
set of metrics for final model
• Documentation Updates:
– Updated steps for enabling GPU persistence mode
– Added information about deprecated NVIDIA functions
– Improved documentation for enabling LDAP authentication
– Added information about changing the column type in datasets


– Updated list of experiment artifacts available in an experiment summary


– Added steps describing how to expose ports on Docker for the REST service deployment within the Driverless AI Docker container
– Added an example showing how to run an experiment with a custom transform recipe
– Improved the FAQ for setting up TLS/SSL
– Added FAQ describing issues that can occur when attempting Import Folder as File with a data connector
on Windows
• Bug Fixes:
– Allow brain restart/refit to accept unscored previous pipelines
– Fix actual vs predicted labeling for diagnostics of regression model
– Fix MOJO for TensorFlow for non target transformers other than identity
– Fix column type detection for Excel files
– Allow experiments with default expert settings to have a MOJO
– Various bug fixes

1.3.7 Version 1.8.0 (Oct 3, 2019)

Available here
• Improve speed and memory usage for feature engineering
• Improve speed of leakage and shift detection, and improve accuracy
• Improve speed of AutoVis under high system load
• Improve speed for experiments with large user-given validation data
• Improve accuracy of ensembles for regression problems
• Improve creation of Autoreport (only one background job per experiment)
• Improve sampling techniques for ImbalancedXGBoost and ImbalancedLightGBM models, and disable them by default since they can be slower
• Add Python/R/C++ MOJO support for FTRL and RandomForest
• Add native categorical handling for LightGBM in CPU mode
• Add monotonicity constraints support for LightGBM
• Add Isolation Forest Anomaly Score transformer (outlier detection)
• Re-enable One-Hot-Encoding for GLM models
• Add lexicographical label encoding (disabled by default)
• Add ability to further train user-provided pretrained embeddings for TensorFlow NLP transformers, in addition
to fine-tuning the rest of the neural network graph
• Add timeout for BYOR acceptance tests
• Add log and notifications for large shifts in final model variable importances compared to tuning model
• Add more expert control over time series feature engineering
• Add ability for recipes to be uploaded in bulk as an entire (or partial) GitHub repository or as links to Python files on the page


• Allow missing values in fold column


• Add support for feature brain when starting “New Model With Same Parameters” of a model that was previously
restarted
• Add support for toggling whether additional features are to be included in pipeline during “Retrain Final
Pipeline”
• Limit experiment runtime to one day by default (approximately enforced; can be configured in Expert Settings -> Experiment or via the config.toml ‘max_runtime_minutes’ setting; see the config override sketch at the end of this list)
• Add support for importing pickled Pandas frames (.pkl)
• MLI updates:
– Show holdout predictions and test set predictions (if applicable) in MLI TS for both metric and actual vs.
predicted charts
– Add ability to download group metrics in MLI TS
– Add ability to zoom into charts in MLI TS
– Add ability to use column not used in DAI model as a k-LIME cluster column in MLI
– Add ability to view original and transformed DAI model-based feature importance in MLI
– Add ability to view Shapley importance for original features
– Add ability to view permutation importance for a DAI model when the config option
autodoc_include_permutation_feature_importance is set to on
– Fixed bug in binary Disparate Impact Analysis, which caused incorrect calculations amongst several metrics (ones using false positives and true negatives in the numerator)
• Disable NLP TensorFlow transformers by default (enable in NLP expert settings by switching to “on”)
• Reorganize expert settings, add tab for feature engineering
• Experiment now informs if aborted by user, system or server restart
• Reduce load of all tasks launched by server, giving priority to experiments to use cores
• Add experiment summary files to aborted experiment logs
• Add warning when ensemble has models that reach the limit of max iterations despite early stopping, with learning rate controls in the expert panel
• Improve progress reporting
• Allow disabling of H2O recipe server for scoring if not using custom recipes (to avoid Java dependency)
• Fix RMSPE scorer
• Fix recipes error handling when uploading via URL
• Fix Autoreport being spawned anytime GUI was on experiment page, overloading the system with forks from
the server
• Fix time-out for Autoreport PDP calculations, so completes more quickly
• Fix certain config settings to be honored from GUI expert settings (woe_bin_list, ohe_bin_list, text_gene_max_ngram, text_gene_dim_reduction_choice, tensorflow_max_epochs_nlp, tensorflow_nlp_pretrained_embeddings_file_path, holiday_country); previously these were only honored when provided at startup time
• Fix column type for additional columns during scored test set download
• Fix GUI incorrectly converting time for forecast horizon in TS experiments


• Fix calculation of correlation for string columns in AutoVis


• Fix download for R MOJO runtime
• Fix parameters for LightGBM RF mode
• Fix dart parameters for LightGBM and XGBoost
• Documentation updates:
– Included more information in the Before You Begin Installing or Upgrading topic to help make installations and upgrades go more smoothly
– Added topic describing how to choose between the AWS Community and AWS Marketplace AMIs
– Added information describing how to retrieve the MOJO2 Javadoc
– Updated Python client examples to work with Driverless AI 1.7.x releases
– Updated documentation for new features, expert settings, MLI plots, etc.
• Backward/Forward compatibility:
– Models built in 1.8.0 will remain supported in versions 1.8.x
– Models built in 1.7.1 are not deprecated and should continue to work (best effort is made to preserve MOJO
and Autoreport creation, MLI, scoring, etc.)
– 1.8.0 upgraded to scipy version 1.3.1 to support newer custom recipes. This might deprecate custom
recipes that depend on scipy version 1.2.2 (and experiments using them) and might require re-import of
those custom recipes. Previously built Python scoring pipelines will continue to work.
– Models built in 1.7.0 or earlier will be deprecated
• Various bug fixes
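For the runtime-limit item above, per-experiment overrides of config.toml settings such as max_runtime_minutes can be passed through the client's config_overrides argument. A hypothetical sketch with the legacy h2oai_client package (connection details, dataset key, and target column are placeholders; config_overrides takes TOML-formatted text):

```python
# Sketch: per-experiment config override (assumption: legacy h2oai_client API;
# placeholders for connection details and dataset/target names).
from h2oai_client import Client

h2oai = Client(address="http://localhost:12345", username="user", password="password")

experiment = h2oai.start_experiment_sync(
    dataset_key="train_dataset_key",
    target_col="target",
    is_classification=True,
    accuracy=7,
    time=5,
    interpretability=5,
    scorer="AUC",
    # TOML-formatted overrides; here the default one-day runtime cap is tightened to 2 hours
    config_overrides="max_runtime_minutes = 120",
)
```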

1.3.8 Version 1.7.1 (Aug 19, 2019)

Available here
• Added two new models with internal sampling techniques for imbalanced binary classification problems: ImbalancedXGBoost and ImbalancedLightGBM
• Added support for rolling-window based predictions for time-series experiments (2 options: test-time augmentation or re-fit)
• Added support for setting logical column types for a dataset (to override type detection during experiments)
• Added ability to set experiment name at start of experiment
• Added leakage detection for time-series problems
• Added JDBC connector
• MOJO updates:
– Added Python/R/C++ MOJO support for TensorFlow model
– Added Python/R/C++ MOJO support for TensorFlow NLP transformers: TextCNN, CharCNN, BiGRU,
including any pretrained embeddings if provided
– Reduced memory usage for MOJO creation
– Increased speed of MOJO creation
– Configuration options for MOJO and Python scoring pipelines now have 3-way toggle: “on”/”off”/”auto”


• MLI updates:
– Added disparate impact analysis (DIA) for MLI
– Allow MLI scoring pipeline to be built for datasets with column names that need to be sanitized
– Date-aware binning for partial dependence and ICE in MLI
• Improved generalization performance for time-series modeling with regularization techniques for lag-based features
• Improved “predicted vs actual” plots for regression problems (using adaptive point sizes)
• Fix bug in datatable for manipulations of string columns larger than 2GB
• Fixed download of predictions on user-provided validation data
• Fix bug in time-series test-time augmentation (work-around was to include entire training data in test set)
• Honor the expert settings flag to enable detailed traces (disabled again by default)
• Various bug fixes

1.3.9 Version 1.6.4 LTS (Aug 19, 2019)

Available here
• ML Core updates:
– Speed up schema detection
– DAI now drops rows with missing values when diagnosing regression problems
– Speed up column type detection
– Fixed growth of individuals
– Fixed n_jobs for predict
– Target column is no longer included in predictors for skewed datasets
– Added an option to prevent users from downloading data files locally
– Improved UI split functionality
– A new “max_listing_items” config option to limit the number of items fetched in listing pages
• Model Ops updates:
– MOJO runtime upgraded to version 2.1.3 which supports perpetual MOJO pipeline
– Upgraded deployment templates to version matching MOJO runtime version
• MLI updates:
– Fix to MLI schema builder
– Fix parsing of categorical reason codes
– Added ability to handle integer time column
• Various bug fixes


1.3.10 Version 1.7.0 (Jul 7, 2019)

Available here
• Support for Bring Your Own Recipe (BYOR) for transformers, models (algorithms) and scorers
• Added protobuf-based MOJO scoring runtime libraries for Python, R and Java (standalone, low-latency)
• Added local REST server as one-click deployment option for MOJO scoring pipeline, in addition to AWS
Lambda endpoint
• Added R client package, in addition to Python client
• Added Project workspace to group datasets and experiments and to visually compare experiments and create
leaderboards
• Added download of imported datasets as .csv
• Recommendations for columnar transformations in AutoViz
• Improved scalability and performance
• Ability to provide max. runtime for experiments
• Create MOJO scoring pipeline by default if the experiment configuration allows (for convenience, enables local/cloud deployment options without user input)
• Support for user provided pre-trained embeddings for TensorFlow NLP models
• Support for holdout splits lacking some target classes (can happen when a fold column is provided)
• MLI updates:
– Added residual plot for regression problems (keeping all outliers intact)
– Added confusion matrix as default metric display for multinomial problems
– Added Partial Dependence (PD) and Individual Conditional Expectation (ICE) plots for Driverless AI models in MLI GUI
– Added ability to search by ID column in MLI GUI
– Added ability to run MLI PD/ICE on all features
– Added ability to handle multiple observations for a single time column in MLI TS by taking the mean of
the target and prediction where applicable
– Added ability to handle integer time column in MLI TS
– MLI TS will use train holdout predictions if there is no test set provided
• Faster import of files with “%Y%m%d” and “%Y%m%d%H%M” time format strings, and files with lots of text
strings
• Fix units for RMSPE scorer to be a percentage (multiply by 100)
• Allow non-positive outcomes for MAPE and SMAPE scorers
• Improved listing in GUI
• Allow zooming in GUI
• Upgrade to TensorFlow 1.13.1 and CUDA 10 (and CUDA is part of the distribution now, to simplify installation)
• Add CPU-support for TensorFlow on PPC
• Documentation updates:
– Added documentation for new features including


* Projects
* Custom Recipes
* C++ MOJO Scoring Pipelines
* R Client API
* REST Server Deployment
– Added information about variable importance values on the experiments page
– Updated documentation for Expert Settings
– Updated “Tips n Tricks” with new Scoring Pipeline tips
• Various bug fixes

1.3.11 Version 1.6.3 LTS (June 14, 2019)

Available here
• Included an Audit log feature
• Fixed support for decimal types for parquet files in MOJO
• Autodoc can order PDP/ICE by feature importance
• Session Management updates
• Upgraded datatable
• Improved reproducibility
• Model diagnostics now uses a weight column
• MLI can build surrogate models on all the original features or on all the transformed features that DAI uses
• Internal server cache now respects usernames
• Fixed an issue with time series settings
• Fixed an out of memory error when loading a MOJO
• Fixed Python scoring package for TensorFlow
• Added OpenID configurations
• Documentation updates:
– Updated the list of artifacts available in the Experiment Summary
– Clarified language in the documentation for unsupported (but available) features
– For the Terraform requirement in deployments, clarified that only Terraform versions in the 0.11.x release
are supported, and specifically 0.11.10 or greater
– Fixed link to the Miniconda installation instructions
• Various bug fixes


1.3.12 Version 1.6.2 LTS (May 10, 2019)

Available here
• This version provides PPC64le artifacts
• Improved stability of datatable
• Improved path filtering in the file browser
• Fixed units for RMSPE scorer to be a percentage (multiply by 100)
• Fixed segmentation fault on Ubuntu 18 with installed font package
• Fixed IBM Spectrum Conductor authentication
• Fixed handling of EC2 machine credentials
• Fixed Lag transformer configuration
• Fixed KDB and Snowflake Error Reporting
• Gradually reduce number of used workers for column statistics computation in case of failure.
• Hide default Tornado header exposing used version of Tornado
• Documentation updates:
– Added instructions for installing via AWS Marketplace
– Improved documentation for installing via Google Cloud
– Improved FAQ documentation
– Added Data Sampling documentation topic
• Various bug fixes

1.3.13 Version 1.6.1.1 LTS (Apr 24, 2019)

Available here
• Fix in AWS role handling.

1.3.14 Version 1.6.1 LTS (Apr 18, 2019)

Available here
• Several fixes for MLI (partial dependence plots, Shapley values)
• Improved documentation for model deployment, time-series scoring, AutoVis and FAQs


1.3.15 Version 1.6.0 LTS (Apr 5, 2019)

Private build only.


• Fixed import of string columns larger than 2GB
• Fixed AutoViz crashes on Windows
• Fixed quantile binning in MLI
• Plot global absolute mean Shapley values instead of global mean Shapley values in MLI
• Improvements to PDP/ICE plots in MLI
• Validated Terraform version in AWS Lambda deployment
• Added support for NULL variable importance in AutoDoc
• Made Variable Importance table size configurable in AutoDoc
• Improved support for various combinations of data import options being enabled/disabled
• CUDA is now part of distribution for easier installation
• Security updates:
– Enforced SSL settings to be honored for all h2oai_client calls
– Added config option to prevent using LocalStorage in the browser to cache information
– Upgraded Tornado server version to 5.1.1
– Improved session expiration and autologout functionality
– Disabled access to Driverless AI data folder in file browser
– Provided an option to filter content that is shown in the file browser
– Use login name for HDFS impersonation instead of predefined name
– Disabled autocomplete in login form
• Various bug fixes

1.3.16 Version 1.5.4 (Feb 24, 2019)

Available here
• Speed up calculation of column statistics for date/datetime columns using certain formats (now uses
‘max_rows_col_stats’ parameter)
• Added computation of standard deviation for variable importances in experiment summary files
• Added computation of shift of variable importances between feature evolution and final pipeline
• Fix link to MLI Time-Series experiment
• Fix display bug for iteration scores for long experiments
• Fix display bug for early finish of experiment for GLM models
• Fix display bug for k-LIME when target is skewed
• Fix display bug for forecast horizon in MLI for Time-Series
• Fix MLI for Time-Series for single time group column
• Fix in-server scoring of time-series experiments created in 1.5.0 and 1.5.1


• Fix OpenBLAS dependency


• Detect disabled GPU persistence mode in Docker
• Reduce disk usage during TensorFlow NLP experiments
• Reduce disk usage of aborted experiments
• Refresh reported size of experiments during start of application
• Disable TensorFlow NLP transformers by default to speed up experiments (can enable in expert settings)
• Improved progress percentage shown during experiment
• Improved documentation (upgrade on Windows, how to create the simplest model, DTap connectors, etc.)
• Various bug fixes

1.3.17 Version 1.5.3 (Feb 8, 2019)

Available here
• Added support for splitting datasets by time via time column containing date, datetime or integer values
• Added option to disable file upload
• Require authentication to download experiment artifacts
• Automatically drop predictor columns from training frame if not found in validation or test frame and warn
• Improved performance by using physical CPU cores only (configurable in config.toml)
• Added option to not show inactive data connectors
• Various bug fixes

1.3.18 Version 1.5.2 (Feb 2, 2019)

Available here
• Added word-level bidirectional GRU TensorFlow models for NLP features
• Added character-level CNN Tensorflow models for NLP features
• Added support to import multiple individual datasets at once
• Added support for holdout predictions for time-series experiments
• Added support for regression and multinomial classification for FTRL (in addition to binomial classification)
• Improved scoring for time-series when test data contains actual target values (missing target values will be
predicted)
• Reduced memory usage for LightGBM models
• Improved performance for feature engineering
• Improved speed for TensorFlow models
• Improved MLI GUI for time-series problems
• Fix final model fold splits when fold_column is provided
• Various bug fixes


1.3.19 Version 1.5.1 (Jan 22, 2019)

Available here
• Fix MOJO for GLM
• Add back .csv file of experiment summary
• Improve collection of pipeline timing artifacts
• Clean up Docker tag

1.3.20 Version 1.5.0 (Jan 18, 2019)

Available here
• Added model diagnostics (interactive model metrics on new test data incl. residual analysis for regression)
• Added FTRL model (Follow The Regularized Leader)
• Added Kolmogorov-Smirnov metric (degree of separation between positives and negatives)
• Added ability to retrain (only) the final model on new data
• Added one-hot encoding for low-cardinality categorical features, for GLM
• Added choice between 32-bit (now default) and 64-bit precision
• Added system information (CPU, GPU, disk, memory, experiments)
• Added support for time-series data with many more time gaps, and with weekday-only data
• Added one-click deployment to Amazon Lambda
• Added ability to split datasets randomly, with option to stratify by target column or group by fold column
• Added support for OpenID authentication
• Added connector for BlueData
• Improved responsiveness of the GUI under heavy load situations
• Improved speed and reduced memory footprint of feature engineering
• Improved performance for RuleFit models and enabled GPU and multinomial support
• Improved auto-detection of temporal frequency for time-series problems
• Improved accuracy of final single model if external validation provided
• Improved final pipeline if external validation data is provided (add ensembling)
• Improved k-LIME in MLI by using original features deemed important by DAI instead of all original features
• Improved MLI by using 3-fold CV by default for all surrogate models
• Improved GUI for MLI time series (integrated help, better integration)
• Added ability to view MLI time series logs while MLI time series experiment is running
• PDF version of the Automatic Report (AutoDoc) is now replaced by a Word version
• Various bug fixes (GLM accuracy, UI slowness, MLI UI, AutoVis)

1.3.21 Version 1.4.2 (Dec 3, 2018)

Available here
• Support for IBM Power architecture
• Speed up training and reduce size of final pipeline
• Reduced resource utilization during training of final pipeline
• Display test set metrics (ROC, ROCPR, Gains, Lift) in GUI in addition to validation metrics (if test set provided)
• Show location of best threshold for Accuracy, MCC and F1 in ROC curves
• Add relative point sizing for scatter plots in AutoVis
• Fix file upload and add model checkpointing in python client API
• Various bug fixes

1.3.22 Version 1.4.1 (Nov 11, 2018)

Available here
• Improved integration of MLI for time-series
• Reduced disk and memory usage during final ensemble
• Allow scoring and transformations on previously imported datasets
• Enable checkpoint restart for unfinished models
• Add startup checks for OpenCL platforms for LightGBM on GPUs
• Improved feature importances for ensembles
• Faster dataset statistics for date/datetime columns
• Faster MOJO batch scoring
• Fix potential hangs
• Fix ‘not in list’ error in MOJO
• Fix NullPointerException in MLI
• Fix outlier detection in AutoVis
• Various bug fixes

1.3.23 Version 1.4.0 (Oct 27, 2018)

Available here
• Enable LightGBM by default (now with MOJO)
• LightGBM tuned for GBM decision trees, Random Forest (rf), and Dropouts meet Multiple Additive Regression
Trees (dart)
• Add ‘isHoliday’ feature for time columns
• Add ‘time’ column type for date/datetime columns in data preview
• Add support for binary datatable file ingest in .jay format
• Improved final ensemble (each model has its own feature pipeline)

• Automatic smart checkpointing (feature brain) from prior experiments


• Add kdb+ connector
• Feature selection of original columns for data with many columns to handle >>100 columns
• Improved time-series recipe (multiple validation splits, better logic)
• Improved performance of AutoVis
• Improved date detection logic (now detects %Y%m%d and %Y-%m date formats)
• Automatic fallback to CPU mode if GPU runs out of memory (for XGBoost, GLM and LightGBM)
• No longer require header for validation and testing datasets if data types match
• No longer include text columns for data shift detection
• Add support for time-series models in MLI (including ability to select time-series groups)
• Add ability to download MLI logs from MLI experiment page (includes both Python and Java logs)
• Add ability to view MLI logs while MLI experiment is running (Python and Java logs)
• Add ability to download LIME and Shapley reason codes from MLI page
• Add ability to run MLI on transformed features
• Display all variables for MLI variable importance for both DAI and surrogate models in MLI summary
• Include variable definitions for DAI variable importance list in MLI summary
• Fix Gains/Lift charts when observations weights are given
• Various bug fixes

1.3.24 Version 1.3.1 (Sep 12, 2018)

Available here
• Fix ‘Broken pipe’ failures for TensorFlow models
• Fix time-series problems with categorical features and interpretability >= 8
• Various bug fixes

1.3.25 Version 1.3.0 (Sep 4, 2018)

Available here
• Added LightGBM models - now have [XGBoost, LightGBM, GLM, TensorFlow, RuleFit]
• Added TensorFlow NLP recipe based on CNN Deeplearning models (sentiment analysis, document classifica-
tion, etc.)
• Added MOJO for GLM
• Added detailed confusion matrix statistics
• Added more expert settings
• Improved data exploration (columnar statistics and row-based data preview)
• Improved speed of feature evolution stage
• Improved speed of GLM

• Report single-pass score on external validation and test data (instead of bootstrap mean)
• Reduced memory overhead for data processing
• Reduced number of open files - fixes ‘Bad file descriptor’ error on Mac/Docker
• Simplified Python client API
• Query any data point in the MLI UI from the original dataset due to “on-demand” reason code generation
• Enhanced k-means clustering in k-LIME by only using a subset of features. See the k-LIME technique documentation for more information.
• Report k-means centers for k-LIME in MLI summary for better cluster interpretation
• Improved MLI experiment listing details
• Various bug fixes

1.3.26 Version 1.2.2 (July 5, 2018)

Available here
• MOJO Java scoring pipeline for time-series problems
• Multi-class confusion matrices
• AUCMACRO Scorer: Multi-class AUC via macro-averaging (in addition to the default micro-averaging)
• Expert settings (configuration override) for each experiment from GUI and client APIs.
• Support for HTTPS
• Improved downsampling logic for time-series problems (if enabled through accuracy knob settings)
• LDAP readonly access to Active Directory
• Snowflake data connector
• Various bug fixes

1.3.27 Version 1.2.1 (June 26, 2018)

• Added LIME-SUP (alpha) to MLI as alternative to k-LIME (local regions are defined by decision tree instead
of k-means)
• Added RuleFit model (alpha), now have [GBM, GLM, TensorFlow, RuleFit] - TensorFlow and RuleFit are
disabled by default
• Added Minio (private cloud storage) connector
• Added support for importing folders from S3
• Added ‘Upload File’ option to ‘Add Dataset’ (in addition to drag & drop)
• Predictions for binary classification problems now have 2 columns (probabilities per class), for consistency with
multi-class
• Improved model parameter tuning
• Improved feature engineering for time-series problems
• Improved speed of MOJO generation and loading
• Improved speed of time-series related automatic calculations in the GUI

• Fixed potential rare hangs at end of experiment


• No longer require internet to run MLI
• Various bug fixes

1.3.28 Version 1.2.0 (June 11, 2018)

• Time-Series recipe
• Low-latency standalone MOJO Java scoring pipelines (now beta)
• Enable Elastic Net Generalized Linear Modeling (GLM) with lambda search (and GPU support), for inter-
pretability>=6 and accuracy<=5 by default (alpha)
• Enable TensorFlow (TF) Deep Learning models (with GPU support) for interpretability=1 and/or multi-class
models (alpha, enable via config.toml)
• Support for pre-tuning of [GBM, GLM, TF] models for picking best feature evolution model parameters
• Support for final ensemble consisting of mix of [GBM, GLM, TF] models
• Automatic Report (AutoDoc) in PDF and Markdown format as part of summary zip file
• Interactive tour (assistant) for first-time users
• MLI now runs on experiments from previous releases
• Surrogate models in MLI now use 3 folds by default
• Improved small data recipe with up to 10 cross-validation folds
• Improved accuracy for binary classification with imbalanced data
• Additional time-series transformers for interactions and aggregations between lags and lagging of non-target
columns
• Faster creation of MOJOs
• Progress report during data ingest
• Normalize binarized multi-class confusion matrices by class count (global scaling factor)
• Improved parsing of boolean environment variables for configuration
• Various bug fixes

1.3.29 Version 1.1.6 (May 29, 2018)

• Improved performance for large datasets


• Improved speed and user interface for MLI
• Improved accuracy for binary classification with imbalanced data
• Improved generalization estimate for experiments with given validation data
• Reduced size of experiment directories
• Support for Parquet files
• Support for bzip2 compressed files
• Added Data preview in UI: ‘Describe’
• No longer add ID column to holdout and test set predictions for simplicity

• Various bug fixes

1.3.30 Version 1.1.4 (May 17, 2018)

• Native builds (RPM/DEB) for 1.1.3

1.3.31 Version 1.1.3 (May 16, 2018)

• Faster speed for systems with large CPU core counts


• Faster and more robust handling of user-specified missing values for training and scoring
• Same validation scheme for feature engineering and final ensemble for high enough accuracy
• MOJO scoring pipeline for text transformers
• Fixed single-row scoring in Python scoring pipeline (broken in 1.1.2)
• Fixed default scorer when experiment is started too quickly
• Improved responsiveness for time-series GUI
• Improved responsiveness after experiment abort
• Improved load balancing of memory usage for multi-GPU XGBoost
• Improved UI for selection of columns to drop
• Various bug fixes

1.3.32 Version 1.1.2 (May 8, 2018)

• Support for automatic time-series recipe (alpha)


• Now using Generalized Linear Model (GLM) instead of XGBoost (GBM) for interpretability 10
• Added experiment preview with runtime and memory usage estimation
• Added MER scorer (Median Error Rate, Median Abs. Percentage Error)
• Added ability to use integer column as time column
• Speed up type enforcement during scoring
• Support for reading ARFF file format (alpha)
• Quantile Binning for MLI
• Various bug fixes

1.3.33 Version 1.1.1 (April 23, 2018)

• Support string columns larger than 2GB

1.3.34 Version 1.1.0 (April 19, 2018)

• AWS/Azure integration (hourly cloud usage)


• Bug fixes for MOJO pipeline scoring (now beta)
• Google Cloud storage and BigQuery (alpha)
• Speed up categorical column stats computation during data import
• Further improved memory management on GPUs
• Improved accuracy for MAE scorer
• Ability to build scoring pipelines on demand (if not enabled by default)
• Additional target transformer for regression problems sqrt(sqrt(x))
• Add GLM models as candidates for interpretability=10 (alpha, disabled by default)
• Improved performance of native builds (RPM/DEB)
• Improved estimation of error bars
• Various bug fixes

1.3.35 Version 1.0.30 (April 5, 2018)

• Speed up MOJO pipeline creation and disable MOJO by default (still alpha)
• Improved memory management on GPUs
• Support for optional 32-bit floating-point precision for reduced memory footprint
• Added logging of test set scoring and data transformations
• Various bug fixes

1.3.36 Version 1.0.29 (April 4, 2018)

• If MOJO fails to build, no MOJO will be available, but experiment can still succeed

1.3.37 Version 1.0.28 (April 3, 2018)

• (Non-docker) RPM installers for RHEL7/CentOS7/SLES 12 with systemd support

1.3.38 Version 1.0.27 (March 31, 2018)

• MOJO scoring pipeline for Java standalone cross-platform low-latency scoring (alpha)
• Various bug fixes

1.3.39 Version 1.0.26 (March 28, 2018)

• Improved performance and reduced memory usage for large datasets


• Improved performance for F0.5, F2 and accuracy
• Improved performance of MLI
• Distribution shift detection now also between validation and test data
• Batch scoring example using datatable
• Various enhancements for AutoVis (outliers, parallel coordinates, log file)
• Various bug fixes

1.3.40 Version 1.0.25 (March 22, 2018)

• New scorers for binary/multinomial classification: F0.5, F2 and accuracy


• Precision-recall curve for binary/multinomial classification models
• Plot of actual vs predicted values for regression problems
• Support for excluding feature transformations by operation type
• Support for reading binary file formats: datatable and Feather
• Improved multi-GPU memory load balancing
• Improved display of initial tuning results
• Reduced memory usage during creation of final model
• Fixed several bugs in creation of final scoring pipeline
• Various UI improvements (e.g., zooming on iteration scoreboard)
• Various bug fixes

1.3.41 Version 1.0.24 (March 8, 2018)

• Fix test set scoring bug for data with an ID column (introduced in 1.0.23)
• Allow renaming of MLI experiments
• Ability to limit maximum number of cores used for datatable
• Print validation scores and error bars across final ensemble model CV folds in logs
• Various UI improvements
• Various bug fixes

1.3.42 Version 1.0.23 (March 7, 2018)

• Support for Gains and Lift curves for binomial and multinomial classification
• Support for multi-GPU single-model training for large datasets
• Improved recipes for large datasets (faster and less memory/disk usage)
• Improved recipes for text features
• Increased sensitivity of interpretability setting for feature engineering complexity
• Disable automatic time column detection by default to avoid confusion
• Automatic column type conversion for test and validation data, and during scoring
• Improved speed of MLI
• Improved feature importances for MLI on transformed features
• Added ability to download each MLI plot as a PNG file
• Added support for dropped columns and weight column to MLI stand-alone page
• Fix serialization of bytes objects larger than 4 GiB
• Fix failure to build scoring pipeline with ‘command not found’ error
• Various UI improvements
• Various bug fixes

1.3.43 Version 1.0.22 (Feb 23, 2018)

• Fix CPU-only mode


• Improved robustness of datatable CSV parser

1.3.44 Version 1.0.21 (Feb 21, 2018)

• Fix MLI GUI scaling issue on Mac


• Work-around segfault in truncated SVD scipy backend
• Various bug fixes

1.3.45 Version 1.0.20 (Feb 17, 2018)

• HDFS/S3/Excel data connectors


• LDAP/PAM/Kerberos authentication
• Automatic setting of default values for accuracy / time / interpretability
• Interpretability: per-observation and per-feature (signed) contributions to predicted values in scoring pipeline
• Interpretability setting now affects feature engineering complexity and final model complexity
• Standalone MLI scoring pipeline for Python
• Time setting of 1 now runs for only 1 iteration
• Early stopping of experiments if convergence is detected

• ROC curve display for binomial and multinomial classification, with confusion matrices and threshold/F1/MCC
display
• Training/Validation/Test data shift detectors
• Added AUCPR scorer for multinomial classification
• Improved handling of imbalanced binary classification problems
• Configuration file for runtime limits such as cores/memory/harddrive (for admins)
• Various GUI improvements (ability to rename experiments, re-run experiments, logs)
• Various bug fixes

1.3.46 Version 1.0.19 (Jan 28, 2018)

• Fix hang during final ensemble (accuracy >= 5) for larger datasets
• Allow scoring of all models built in older versions (>= 1.0.13) in GUI
• More detailed progress messages in the GUI during experiments
• Fix scoring pipeline to only use relative paths
• Error bars in model summary are now +/- 1*stddev (instead of 2*stddev)
• Added RMSPE scorer (RMS Percentage Error)
• Added SMAPE scorer (Symmetric Mean Abs. Percentage Error)
• Added AUCPR scorer (Area under Precision-Recall Curve)
• Gracefully handle inf/-inf in data
• Various UI improvements
• Various bug fixes

1.3.47 Version 1.0.18 (Jan 24, 2018)

• Fix migration from version 1.0.15 and earlier


• Confirmation dialog for experiment abort and data/experiment deletion
• Various UI improvements
• Various AutoVis improvements
• Various bug fixes

1.3.48 Version 1.0.17 (Jan 23, 2018)

• Fix migration from version 1.0.15 and earlier (partial, for experiments only)
• Added model summary download from GUI
• Restructured and renamed logs archive, and add model summary to it
• Fix regression in AutoVis in 1.0.16 that led to slowdown
• Various bug fixes

1.3.49 Version 1.0.16 (Jan 22, 2018)

• Added support for validation dataset (optional, instead of internal validation on training data)
• Standard deviation estimates for model scores (+/- 1 std.dev.)
• Computation of all applicable scores for final models (in logs only for now)
• Standard deviation estimates for MLI reason codes (+/- 1 std.dev.) when running in stand-alone mode
• Added ability to abort MLI job
• Improved final ensemble performance
• Improved outlier visualization
• Updated H2O-3 to version 3.16.0.4
• More readable experiment names
• Various speedups
• Various bug fixes

1.3.50 Version 1.0.15 (Jan 11, 2018)

• Fix truncated per-experiment log file


• Various bug fixes

1.3.51 Version 1.0.14 (Jan 11, 2018)

• Improved performance

1.3.52 Version 1.0.13 (Jan 10, 2018)

• Improved estimate of generalization performance for final ensemble by removing leakage from target encoding
• Added API for re-fitting and applying feature engineering on new (potentially larger) data
• Remove access to pre-transformed datasets to avoid unintended leakage issues downstream
• Added mean absolute percentage error (MAPE) scorer
• Enforce monotonicity constraints for binary classification and regression models if interpretability >= 6
• Use squared Pearson correlation for R^2 metric (instead of coefficient of determination) to avoid negative values
• Separated HTTP and TCP scoring pipeline examples
• Reduced size of h2oai_client wheel
• No longer require weight column for test data if it was provided for training data
• Improved accuracy of final modeling pipeline
• Include H2O-3 logs in downloadable logs.zip
• Updated H2O-3 to version 3.16.0.2
• Various bug fixes

1.3.53 Version 1.0.11 (Dec 12, 2017)

• Faster multi-GPU training, especially for small data


• Increase default amount of exploration of genetic algorithm for systems with fewer than 4 GPUs
• Improved accuracy of generalization performance estimate for models on small data (< 100k rows)
• Faster abort of experiment
• Improved final ensemble meta-learner
• More robust date parsing
• Various bug fixes

1.3.54 Version 1.0.10 (Dec 4, 2017)

• Tool tips and link to documentation in parameter settings screen


• Faster training for multi-class problems with > 5 classes
• Experiment summary displayed in GUI after experiment finishes
• Python Client Library downloadable from the GUI
• Speedup for Maxwell-based GPUs
• Support for multinomial AUC and Gini scorers
• Add MCC and F1 scorers for binomial and multinomial problems
• Faster abort of experiment
• Various bug fixes

1.3.55 Version 1.0.9 (Nov 29, 2017)

• Support for time column for causal train/validation splits in time-series datasets
• Automatic detection of the time column from temporal correlations in data
• MLI improvements, dedicated page, selection of datasets and models
• Improved final ensemble meta-learner
• Test set score now displayed in experiment listing
• Original response is preserved in exported datasets
• Various bug fixes

1.3.56 Version 1.0.8 (Nov 21, 2017)

• Various bug fixes

1.3.57 Version 1.0.7 (Nov 17, 2017)

• Sharing of GPUs between experiments - can run multiple experiments at the same time while sharing GPU
resources
• Persistence of experiments and data - can stop and restart the application without loss of data
• Support for weight column for optional user-specified per-row observation weights
• Support for fold column for user-specified grouping of rows in train/validation splits
• Higher accuracy through model tuning
• Faster training - overall improvements and optimization in model training speed
• Separate log file for each experiment
• Ability to delete experiments and datasets from the GUI
• Improved accuracy for regression tasks with very large response values
• Faster test set scoring - Significant improvements in test set scoring in the GUI
• Various bug fixes

1.3.58 Version 1.0.5 (Oct 24, 2017)

• Only display scorers that are allowed


• Various bug fixes

1.3.59 Version 1.0.4 (Oct 19, 2017)

• Improved automatic type detection logic


• Improved final ensemble accuracy
• Various bug fixes

1.3.60 Version 1.0.3 (Oct 9, 2017)

• Various speedups
• Results are now reproducible
• Various bug fixes

1.3.61 Version 1.0.2 (Oct 5, 2017)

• Improved final ensemble accuracy


• Weight of Evidence features added
• Various bug fixes

1.3.62 Version 1.0.1 (Oct 4, 2017)

• Improved speed of final ensemble


• Various bug fixes

1.3.63 Version 1.0.0 (Sep 24, 2017)

• Initial stable release

CHAPTER TWO: WHY DRIVERLESS AI?

Over the last several years, machine learning has become an integral part of many organizations’ decision-making
processes at various levels. With not enough data scientists to fill the increasing demand for data-driven business
processes, H2O.ai offers Driverless AI, which automates several time-consuming aspects of a typical data science
workflow, including data visualization, feature engineering, predictive modeling, and model explanation.
H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid
deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources and from a
variety of external data sources, and it automates data visualization and the construction of predictive models.
Driverless AI also includes robust Machine Learning Interpretability (MLI), which incorporates a number of contem-
porary approaches to increase the transparency and accountability of complex models by providing model results in a
human-readable format.
Driverless AI targets business applications such as loss-given-default, probability of default, customer churn, campaign
response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or
in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)
Visit https://www.h2o.ai/driverless-ai/ to download your free 21-day evaluation copy.
How do you frame business problems in a data set for Driverless AI?
The data that is read into Driverless AI must contain one entity per row, like a customer, patient, piece of equipment,
or financial transaction. That row must also contain information about what you will be trying to predict using similar
data in the future, like whether that customer in the row of data used a promotion, whether that patient was readmitted
to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether
that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI
runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon
described by the provided dataset.
How do you use Driverless AI results to create commercial value?
Commercial value is generated by Driverless AI in a few ways.
• Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using
automation and state-of-the-art computing power to accomplish tasks in just minutes or hours instead of the
weeks or months that it can take humans.
• Like in many other industries, automation leads to standardization of business processes, enforces best practices,
and eventually drives down the cost of delivering the final product – in this case a predictive model.
• Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process.
In large organizations, value from predictive modeling is typically realized when a predictive model is moved
from a data analyst’s or data scientist’s development environment into a production deployment setting. In this
setting, the model is running on live data and making quick and automatic decisions that make or save money.
Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.


Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a
Driverless AI model can be explained to business users, so the system is viable even for regulated industries.

CHAPTER THREE: KEY FEATURES

Below are some of the key features available in Driverless AI.

3.1 Flexibility of Data and Deployment

Driverless AI works across a variety of data sources including Hadoop HDFS, Amazon S3, and more. Driverless AI
can be deployed everywhere including all clouds (Microsoft Azure, AWS, Google Cloud) and on premises on any
system, but it is ideally suited for systems with GPUs, including IBM Power 9 with GPUs built in.

3.2 NVIDIA GPU Acceleration

Driverless AI is optimized to take advantage of GPU acceleration to achieve up to 40X speedups for automatic machine
learning. It includes multi-GPU algorithms for XGBoost, GLM, K-Means, and more. GPUs allow for thousands of
iterations of model features and optimizations.

3.3 Automatic Data Visualization (Autovis)

For each dataset, Driverless AI automatically selects and generates the data plots that are most relevant from a
statistical perspective. These visualizations help users get a quick understanding of their data prior to starting the model building
process. They are also useful for understanding the composition of very large datasets and for seeing trends or even
possible issues, such as large numbers of missing values or significant outliers that could impact modeling results. See
Visualizing Datasets for more information.

3.4 Automatic Feature Engineering

Feature engineering is the secret weapon that advanced data scientists use to extract the most accurate results from
algorithms. H2O Driverless AI employs a library of algorithms and feature transformations to automatically engineer
new, high value features for a given dataset. (See Driverless AI Transformations for more information.) Included in
the interface is an easy-to-read variable importance chart that shows the significance of original and newly engineered
features.


3.5 Automatic Model Documentation

To explain models to business users and regulators, data scientists and data engineers must document the data, algo-
rithms, and processes used to create machine learning models. Driverless AI provides an Autoreport (Autodoc) for
each experiment, relieving the user from the time-consuming task of documenting and summarizing their workflow
used when building machine learning models. The Autoreport includes details about the data used, the validation
schema selected, model and feature tuning, and the final model created. With this capability in Driverless AI, practi-
tioners can focus more on drawing actionable insights from the models and save weeks or even months in the
development, validation, and deployment process.
Driverless AI also provides a number of autodoc_ configuration options, giving users even more control over output
of the Autoreport. (Refer to the Sample Config.toml File topic for information about these configuration options.)
Click here to download and view a sample experiment report in Word format.

3.6 Time Series Forecasting

Time series forecasting is one of the biggest challenges for data scientists. These models address key use cases,
including demand forecasting, infrastructure monitoring, and predictive maintenance. Driverless AI delivers superior
time series capabilities to optimize for almost any prediction time window. Driverless AI incorporates data from
numerous predictors, handles structured character data and high-cardinality categorical variables, and handles gaps in
time series data and other missing values. See Time Series in Driverless AI for more information.

3.7 NLP with TensorFlow

Text data can contain critical information to inform better predictions. Driverless AI automatically converts short text
strings into features using powerful techniques like TFIDF. With TensorFlow, Driverless AI can also process larger
text blocks and build models using all available data to solve business problems like sentiment analysis, document
classification, and content tagging. See NLP in Driverless AI for more information.

3.8 Automatic Scoring Pipelines

For completed experiments, Driverless AI automatically generates both Python scoring pipelines and new ultra-low
latency automatic scoring pipelines. The new automatic scoring pipeline is a unique technology that deploys all feature
engineering and the winning machine learning model in a highly optimized, low-latency, production-ready Java code
that can be deployed anywhere. See Scoring Pipelines Overview for more information.

3.9 Machine Learning Interpretability (MLI)

Driverless AI provides robust interpretability of machine learning models to explain modeling results in a human-
readable format. In the MLI view, Driverless AI employs a host of different techniques and methodologies for in-
terpreting and explaining the results of its models. A number of charts are generated automatically (depending on
experiment type), including K-LIME, Shapley, Variable Importance, Decision Tree Surrogate, Partial Dependence,
Individual Conditional Expectation, Sensitivity Analysis, NLP Tokens, NLP LOCO, and more. Additionally, you can
download a CSV of LIME and Shapley reason codes from this view. See MLI Overview for more information.

3.10 Automatic Reason Codes

In regulated industries, an explanation is often required for significant decisions relating to customers (for example,
credit denial). Reason codes show the key positive and negative factors in a model’s scoring decision in a simple
language. Reason codes are also useful in other industries, such as healthcare, because they can provide insights into
model decisions that can drive additional testing or investigation.

3.11 Bring Your Own Recipe (BYOR) Support

Driverless AI allows you to import custom recipes (BYOR) for MLI algorithms, feature engineering (transformers),
scorers, data, and configuration. You can use your custom recipes in combination with or instead of all built-in recipes.
This allows you to have greater influence over the Driverless AI Automatic ML pipeline and gives you control over
the optimization choices that Driverless AI makes. See Appendix A: Custom Recipes for more information.

CHAPTER FOUR: SUPPORTED ALGORITHMS

4.1 Constant Model

A Constant Model predicts the same constant value for any input data. The constant value is computed by optimizing
the given scorer. For example, for MSE/RMSE, the constant is the (weighted) mean of the target column. For MAE, it
is the (weighted) median. For other scorers like MAPE or custom scorers, the constant is found with an optimization
process. For classification problems, the constant probabilities are the observed priors.
A constant model is meant as a baseline reference model. If it ends up being used in the final pipeline, a warning will
be issued because that would indicate a problem in the dataset or target column (e.g., when trying to predict a random
outcome).

4.2 Decision Tree

A Decision Tree is a single (binary) tree model that splits the training data population into sub-groups (leaf nodes) with
similar outcomes. No row or column sampling is performed, and the tree depth and method of growth (depth-wise or
loss-guided) is controlled by hyper-parameters.

4.3 FTRL

Follow the Regularized Leader (FTRL) is a DataTable implementation [1] of the FTRL-Proximal online learning
algorithm proposed in [4]. This implementation uses a hashing trick and Hogwild approach [3] for parallelization.
FTRL supports binomial and multinomial classification for categorical targets, as well as regression for continuous
targets.

4.4 GLM

Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions.
GLMs are an extension of traditional linear models. They have gained popularity in statistical data analysis due
to:
• the flexibility of the model structure unifying the typical regression methods (such as linear regression and
logistic regression for binary classification)
• the recent availability of model-fitting software
• the ability to scale well with large datasets


4.5 Isolation Forest

Isolation Forest is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by ran-
domly selecting a feature and then randomly selecting a split value between the maximum and minimum values of
that selected feature. The number of splits required to isolate an observation corresponds to its path length in the tree. Random partitioning produces
noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for
particular samples, they are highly likely to be anomalies.

4.6 LightGBM

LightGBM is a gradient boosting framework developed by Microsoft that uses tree based learning algorithms. It was
specifically designed for lower memory usage, faster training speed, and higher efficiency. Similar to XGBoost,
it is one of the best gradient boosting implementations available. It is also used for fitting Random Forest, DART
(experimental), and Decision Tree models inside of Driverless AI.
Note: LightGBM with GPUs is not currently supported on Power.

4.7 RuleFit

The RuleFit [2] algorithm creates an optimal set of decision rules by first fitting a tree model, and then fitting a Lasso
(L1-regularized) GLM model to create a linear model consisting of the most important tree leaves (rules).
Note: MOJOs are not currently available for RuleFit models.

4.8 TensorFlow

TensorFlow is an open source software library for performing high performance numerical computation. Driverless
AI includes a TensorFlow NLP recipe based on CNN Deeplearning models.
Note: MOJOs are not currently available for TensorFlow models.

4.9 XGBoost

XGBoost is a supervised learning algorithm that implements a process called boosting to yield accurate models. Boost-
ing refers to the ensemble learning technique of building many models sequentially, with each new model attempting
to correct for the deficiencies in the previous model. In tree boosting, each new model that is added to the ensemble is
a decision tree. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science
problems in a fast and accurate way. For many problems, XGBoost is one of the best gradient boosting machine
(GBM) frameworks today. Driverless AI supports XGBoost GBM and XGBoost DART (experimental) models.

4.10 References

[1] DataTable for Python, https://github.com/h2oai/datatable
[2] J. Friedman, B. Popescu. "Predictive Learning via Rule Ensembles". 2005. http://statweb.stanford.edu/~jhf/ftp/RuleFit.pdf
[3] Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems. 2011. https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
[4] McMahan, H. Brendan, et al. "Ad click prediction: a view from the trenches." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013. https://research.google.com/pubs/archive/41159.pdf

CHAPTER FIVE: DRIVERLESS AI WORKFLOW

A typical Driverless AI workflow is to:


1. Load data
2. Visualize data
3. Run an experiment
4. Interpret the model
5. Deploy the scoring pipeline
In addition, you can diagnose a model, transform another dataset, score the model against another dataset, and manage
your data in Projects.
The image below describes a typical workflow.


CHAPTER SIX: DRIVERLESS AI LICENSES

A valid license is required for running Driverless AI and for running the scoring pipelines.

6.1 About Licenses

Driverless AI is licensed per named user. Therefore, in order to have different users run experiments simultaneously,
they would each need a license. Driverless AI manages the GPU(s) that it is given and ensures that different
experiments from different users can run safely simultaneously and don’t interfere with each other. So when two
licensed users log in with different credentials, neither of them will see the other’s experiment. Similarly, if a licensed
user logs in using a different set of credentials, that user will not see any previously run experiments.

6.2 Adding Licenses for the First Time

6.2.1 Specifying a License File for the Driverless AI Application

A license file to run Driverless AI can be added in one of three ways when starting Driverless AI.
• Specifying the license.sig file during launch in native installs
• Using the DRIVERLESS_AI_LICENSE_FILE and DRIVERLESS_AI_LICENSE_KEY environment variables
when starting the Driverless AI Docker image
• Uploading your license in the Web UI

Specifying the license.sig File During Launch

By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are installing
Driverless AI programmatically, you can copy a license key file to that location. If no license key is found, the
application will prompt you to add one via the Web UI.
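
For example, on a native install you can copy a downloaded license file into place before starting the application. This is a minimal sketch only; the source path (~/Downloads/license.sig) is an assumption, and the destination is the default location mentioned above.

# Copy a downloaded license key to the default location checked at startup
# (the source path is an example; adjust it to wherever you saved the file)
sudo mkdir -p /opt/h2oai/dai/home/.driverlessai
sudo cp ~/Downloads/license.sig /opt/h2oai/dai/home/.driverlessai/license.sig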


Specifying Environment Variables

You can use the DRIVERLESS_AI_LICENSE_FILE or DRIVERLESS_AI_LICENSE_KEY environment variable when starting the Driverless AI Docker image. For example:

nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_LICENSE_FILE="/license/license.sig" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

or

nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_LICENSE_KEY="Y0uRl1cens3KeyH3re" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

Uploading Your License in the Web UI

If Driverless AI does not locate a license.sig file during launch, then the UI will prompt you to enter your license key
after you log in the first time.

Click the Enter License button, and then paste the entire license into the provided License Key entry field. Click
Save when you are done. Upon successful completion, you will be able to begin using Driverless AI.

6.2.2 Specifying a License for Scoring Pipelines

Driverless AI requires a license to be specified in order to run the Scoring Pipelines.

Python Scoring Pipeline

The license can be specified via an environment variable in Python:

# Set DRIVERLESS_AI_LICENSE_FILE, the path to the Driverless AI license file
%env DRIVERLESS_AI_LICENSE_FILE="/home/ubuntu/license/license.sig"

# Set DRIVERLESS_AI_LICENSE_KEY, the Driverless AI license key (Base64 encoded string)
%env DRIVERLESS_AI_LICENSE_KEY="oLqLZXMI0y..."

You can also export the license file when running the scoring pipeline:

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh

MOJO Scoring Pipeline

Driverless AI requires a license to be specified in order to run the MOJO Scoring Pipeline. The license can be specified
in one of the following ways:
• Via an environment variable:
– DRIVERLESS_AI_LICENSE_FILE: Path to the Driverless AI license file, or
– DRIVERLESS_AI_LICENSE_KEY: The Driverless AI license key (Base64 encoded string)
• Via a system property of JVM (-D option):

– ai.h2o.mojos.runtime.license.file: Path to the Driverless AI license file, or
– ai.h2o.mojos.runtime.license.key: The Driverless AI license key (Base64 encoded string)
• Via an application classpath:
– The license is loaded from a resource called /license.sig.
– The default resource name can be changed via the JVM system property ai.h2o.mojos.
runtime.license.filename.
For example:

java -Dai.h2o.mojos.runtime.license.file=/etc/dai/license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

6.2.3 Driverless AI Licenses in Production via AWS Lambdas

Driverless AI deployment pipelines to AWS Lambdas automatically set the license key as an environment variable
based on the license key that was used in Driverless AI.

6.3 Updating Licenses

If your current Driverless AI license has expired, you will be required to update it in order to continue running
Driverless AI, run the scoring pipelines, access pipelines deployed to AWS Lambdas, and so on.

6.3.1 Updating the License for Driverless AI

Similar to adding a license for the first time, you can update your license for running Driverless AI either by replacing
your current license.sig file or via the Web UI.

Updating the license.sig File

Update the license key in your /opt/h2oai/dai/home/.driverlessai/license.sig file by replacing the existing license with
your new one.

Updating the License in the Web UI

If your license is expired, the Web UI will prompt you to enter a new one. The steps are the same as adding a license
for the first time via the Driverless AI Web UI.

6.3.2 Updating the License for Scoring Pipelines

For the Python Scoring Pipeline, simply include the updated license file when setting the environment variable in
Python. Refer to the above Python Scoring Pipeline section for adding licenses.
For the MOJO Scoring Pipeline, the updated license file can be specified using an environment variable, using a system
property of JVM, or via an application classpath. This is the same as adding a license for the first time. Refer to the
above MOJO Scoring Pipeline section for adding licenses.

6.3.3 Updating Driverless AI Licenses in Production

The Driverless AI deployment pipeline to AWS Lambdas explicitly sets the license key as an environment variable.
Replace the expired license key with your updated one.
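If you manage the deployed Lambda function directly, the environment variable can be updated with the AWS CLI. The following is a sketch only: the function name is a placeholder, and the exact variable name used by your deployment should be checked in the Lambda configuration before changing it.

# Update the license environment variable on the deployed Lambda function
# (the function name and variable name below are placeholders/assumptions)
aws lambda update-function-configuration \
  --function-name your-driverless-ai-scorer \
  --environment "Variables={DRIVERLESS_AI_LICENSE_KEY=NewBase64LicenseKeyHere}"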

CHAPTER SEVEN: BEFORE YOU BEGIN INSTALLING OR UPGRADING

Please review the following information before you begin installing Driverless AI. Be sure to also review the Sizing
Requirements in the next section before beginning the installation.

7.1 Supported Browsers

Driverless AI is tested most extensively on Chrome and Firefox. For the best user experience, we recommend using
the latest version of Chrome. You may encounter issues if you use other browsers or earlier versions of Chrome and/or
Firefox.

7.2 To sudo or Not to sudo

Many of the installation steps show sudo prepending different commands. Note that sudo may not always be
required, but the steps that are documented here are the steps that we followed in house.

7.3 Note about nvidia-docker 1.0

If you have nvidia-docker 1.0 installed, you need to remove it and all existing GPU containers. Refer to
https://github.com/NVIDIA/nvidia-docker/blob/master/README.md for more information.

7.4 Deprecation of nvidia-smi

The nvidia-smi command has been deprecated by NVIDIA. Refer to https://github.com/nvidia/nvidia-docker#upgrading-with-nvidia-docker2-deprecated
for more information. The installation steps have been updated for enabling persistence mode for GPUs.


7.5 New nvidia-container-runtime-hook Requirement for PowerPC Users

PowerPC users are now required to install the nvidia-container-runtime-hook when running in Docker.
Refer to https://github.com/nvidia/nvidia-docker#rhel-docker for more information. The IBM Docker installation
steps have been updated to reflect this information.

7.6 Note About CUDA Versions

Your host environment must have CUDA 10.0 or later with NVIDIA drivers >= 410 installed (GPU only). Driverless
AI ships with its own CUDA libraries, but the driver must exist in the host environment. Go to
https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series driver.

7.7 Note About Authentication

The default authentication setting in Driverless AI is “unvalidated.” In this case, Driverless AI will accept any login
and password combination, it will not validate whether the password is correct for the specified login ID, and it will
connect to the system as the user specified in the login ID. This is true for all instances, including Cloud, Docker, and
native instances.
We recommend that you configure authentication. Driverless AI provides a number of authentication options, includ-
ing LDAP, PAM, Local, and None. Refer to Configuring Authentication for information on how to enable a different
authentication method.
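
As an illustration only, a non-default method can typically be enabled by passing the corresponding option to the Docker image as an environment variable (config.toml options map to DRIVERLESS_AI_* variables). The option name below (authentication_method) and its accepted values should be verified against the Configuring Authentication section for your release; this is a sketch, not an authoritative reference.

# Sketch: start the Docker image with PAM authentication enabled via an environment variable
nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  -p 12345:12345 \
  -e DRIVERLESS_AI_AUTHENTICATION_METHOD="pam" \
  -v `pwd`/data:/data \
  -v `pwd`/log:/log \
  -v `pwd`/license:/license \
  -v `pwd`/tmp:/tmp \
  h2oai/dai-centos7-x86_64:TAG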
Note: Driverless AI is also integrated with IBM Spectrum Conductor and supports authentication from Conductor.
Contact [email protected] for more information about using IBM Spectrum Conductor authentication.

7.8 Note About Shared File Systems

If your environment uses a shared file system, then you must set the following configuration option:

datatable_strategy='write'

The above can be specified in the config.toml file (for native installs) or specified as an environment variable (Docker
image installs).
This configuration is required because, in some cases, Driverless AI can fail to read files during an experiment. The
write option will allow Driverless AI to properly read and write data from shared file systems to disk.
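A minimal sketch of both approaches follows; the config.toml path (/etc/dai/config.toml) is the usual location for native package installs and is an assumption here, so adjust it for your environment.

# Native install: append the option to config.toml
echo "datatable_strategy='write'" | sudo tee -a /etc/dai/config.toml

# Docker install: pass the same option as an environment variable when starting the image, e.g.
#   -e DRIVERLESS_AI_DATATABLE_STRATEGY="write"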

7.9 Backup Strategy

We recommend that you periodically stop Driverless AI and back up your Driverless AI tmp directory, even if you are
not upgrading.
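A backup can be as simple as stopping the service and copying the tmp directory to another location. The commands below are a sketch for a native (RPM/DEB) install; the service name and tmp path are assumptions, and Docker users should instead copy the tmp directory they mounted into the container.

# Stop Driverless AI, copy the tmp directory, then restart
sudo systemctl stop dai
sudo cp -rp /opt/h2oai/dai/tmp /backups/dai-tmp-$(date +%Y%m%d)
sudo systemctl start dai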

7.10 Upgrade Strategy

Keep in mind the following when upgrading Driverless AI:


• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and then make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

CHAPTER EIGHT: INSTALLING AND UPGRADING DRIVERLESS AI

For the best (and intended-as-designed) experience, install Driverless AI on modern data center hardware with GPUs
and CUDA support. Use Pascal or Volta GPUs with maximum GPU memory for best results. (Note that the older K80
and M60 GPUs available in EC2 are supported and very convenient, but not as fast.)
Driverless AI supports local, LDAP, and PAM authentication. Authentication can be configured by setting environment
variables or via a config.toml file. Refer to the Configuring Authentication section for more information. Note that the
default authentication method is “unvalidated.”
Driverless AI also supports HDFS, S3, Google Cloud Storage, Google Big Query, KDB, Minio, and Snowflake access.
Support for these data sources can be configured by setting environment variables for the data connectors or via a
config.toml file. Refer to the Enabling Data Connectors section for more information.

8.1 Sizing Requirements

8.1.1 Sizing Requirements for Native Installs

Driverless AI requires a minimum of 5 GB of system memory in order to start experiments and a minimum of 5
GB of disk space in order to run a small experiment. Note that these limits can be changed in the config.toml file. We
recommend that you have lots of system CPU memory (64 GB or more) and 1 TB of free disk space available.

8.1.2 Sizing Requirements for Docker Installs

For Docker installs, we recommend 1 TB of free disk space. Driverless AI uses approximately 38 GB. In addition,
the unpacking/temp files require space on the same Linux mount /var during installation. Once DAI runs, the mounts
from the Docker container can point to other file system mount points.

8.1.3 GPU Sizing Requirements

If you are running Driverless AI with GPUs, be sure that your GPU has compute capability >=3.5 and at least 4GB of
RAM. If these requirements are not met, then Driverless AI will switch to CPU-only mode.
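To check what Driverless AI will see, you can list each GPU's name, driver version, and total memory with nvidia-smi (a sketch only; compute capability is not reported by all driver versions, so verify it against NVIDIA's published list for your card).

# List GPU name, driver version, and total memory
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv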


8.1.4 Sizing Requirements for Storing Experiments

We recommend that your tmp directory has at least 500 GB to 1 TB of space. The tmp directory holds all experiments
and all datasets. We also recommend that you use SSDs (preferably NVMe).

8.1.5 Virtual Memory Settings in Linux

If you are running Driverless AI on a Linux machine, we recommend setting the overcommit memory to 0. The setting
can be changed by the following command:

echo 0 | sudo tee /proc/sys/vm/overcommit_memory

This is the default value, and it indicates that the Linux kernel is free to overcommit memory. If this value is set to
2, then the Linux kernel will not overcommit memory. In this case, the memory requirements of Driverless AI may
surpass the memory allocation limit, which would prevent the experiment from completing.
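If the value has been changed on your system, the following sketch shows one common way to set it back and make the setting persist across reboots (assuming a standard /etc/sysctl.conf setup; check your distribution's conventions).

# Set overcommit_memory to 0 now and persist it across reboots
sudo sysctl -w vm.overcommit_memory=0
echo "vm.overcommit_memory = 0" | sudo tee -a /etc/sysctl.conf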

8.2 Linux X86_64 Installs

This section provides installation steps for Linux 86_64 environments. This includes information for Docker image
installs, RPMs, Deb, and Tar installs as well as Cloud installations.

8.2.1 Linux Docker Images

To simplify local installation, Driverless AI is provided as a Docker image for the following system combinations:

Host OS                       Docker Version   Host Architecture   Min Mem
Ubuntu 16.04 or later         Docker CE        x86_64              64 GB
RHEL or CentOS 7.4 or later   Docker CE        x86_64              64 GB
NVIDIA DGX                    Registry         x86_64

Note: CUDA 10.0 or later with NVIDIA drivers >= 410 is required (GPU only).
For the best performance, including GPU support, use nvidia-docker. For a lower-performance experience without
GPUs, use regular docker (with the same docker image).
These installation steps assume that you have a license key for Driverless AI. For information on how to obtain a
license key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained, you will be prompted to paste the
license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license
folder that you will create during the installation process.

Install on Ubuntu

This section describes how to install the Driverless AI Docker image on Ubuntu. The installation steps vary depending
on whether your system has GPUs or if it is CPU only.

Environment

Operating System    GPUs?   Min Mem
Ubuntu with GPUs    Yes     64 GB
Ubuntu with CPUs    No      64 GB

Install on Ubuntu with GPUs

Note: Driverless AI is supported on Ubuntu 16.04 or later.


Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. (Note that the contents of this
Docker image include a CentOS kernel and CentOS packages.)
2. Install and run Docker on Ubuntu (if not already installed):

# Install and run Docker on Ubuntu


curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

sudo apt-get update


sudo apt-get install docker-ce
sudo systemctl start docker

3. Install nvidia-docker2 (if not already installed). More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list


sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration


sudo apt-get install -y nvidia-docker2

4. Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to
https://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver:

nvidia-smi

5. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:

# Set up directory with the version name


mkdir dai_rel_VERSION

6. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.

# cd into the new directory


cd dai_rel_VERSION

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

7. Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

8. Set up the data, log, and license directories on the host machine:

# Set up the data, log, license, and tmp directories on the host machine (within the new directory)

mkdir data
mkdir log
mkdir license
mkdir tmp

9. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
10. Run docker images to find the image tag.
11. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:

# Start the Driverless AI Docker image


nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-x86_64:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

12. Connect to Driverless AI with your browser:

http://Your-Driverless-AI-Host-Machine:12345


Install on Ubuntu with CPUs

Note: Driverless AI is supported on Ubuntu 16.04 or later.


This section describes how to install and start the Driverless AI Docker image on Ubuntu. Note that this uses Docker
EE and not NVIDIA Docker. GPU support will not be available.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.
2. Install and run Docker on Ubuntu (if not already installed):

# Install and run Docker on Ubuntu


curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

sudo apt-get update


sudo apt-get install docker-ce
sudo systemctl start docker

3. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:

# Set up directory with the version name


mkdir dai_rel_VERSION

4. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.

# cd into the new directory


cd dai_rel_VERSION

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

5. Set up the data, log, license, and tmp directories on the host machine (within the new directory):

# Set up the data, log, license, and tmp directories


mkdir data
mkdir log
mkdir license
mkdir tmp

6. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
7. Run docker images to find the new image tag.
8. Start the Driverless AI Docker image and replace TAG below with the image tag. Note that GPU support will
not be available.

# Start the Driverless AI Docker image


docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-x86_64:TAG

Driverless AI will begin running:


--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

9. Connect to Driverless AI with your browser:


http://Your-Driverless-AI-Host-Machine:12345

Stopping the Docker Image

To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.

Upgrading the Docker Image

This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.


Note: Stop Driverless AI if it is still running.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.
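One quick way to confirm that the host driver meets this requirement is to query it with nvidia-smi (GPU installs only):

# Print the installed NVIDIA driver version; it should report 410 or later
nvidia-smi --query-gpu=driver_version --format=csv,noheader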

Upgrade Steps

1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:

# Set up directory with the version name


mkdir dai_rel_VERSION

# cd into the new directory


cd dai_rel_VERSION

3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load the Driverless AI image. If necessary, replace VERSION with your image version.

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:

# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp

At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
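As a sketch, the start command has the same shape as the one used during installation. Replace TAG with the new image tag reported by docker images, and use docker run instead of nvidia-docker run on CPU-only hosts:

# Start the upgraded Driverless AI Docker image
nvidia-docker run \
    --pid=host \
    --init \
    --rm \
    --shm-size=256m \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    h2oai/dai-centos7-x86_64:TAG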
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.


Install on RHEL

This section describes how to install the Driverless AI Docker image on RHEL. The installation steps vary depending
on whether your system has GPUs or if it is CPU only.

Environment

Operating System GPUs? Min Mem


RHEL with GPUs Yes 64 GB
RHEL with CPUs No 64 GB

Install on RHEL with GPUs

Note: Refer to the following links for more information about using RHEL with GPUs. These links describe how to
disable automatic updates and specific package updates. This is necessary in order to prevent a mismatch between the
NVIDIA driver and the kernel, which can lead to GPU failures.
• https://access.redhat.com/solutions/2372971
• https://www.rootusers.com/how-to-disable-specific-package-updates-in-rhel-centos/
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Note: As of this writing, Driverless AI has only been tested on RHEL version 7.4.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.
2. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/.
Alternatively, you can run on Docker CE.

sudo yum install -y yum-utils


sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

sudo yum makecache fast


sudo yum -y install docker-ce
sudo systemctl start docker

3. Install nvidia-docker2 (if not already installed). More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
   sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
   sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2


Note: If you would like the nvidia-docker service to automatically start when the server is rebooted then
run the following command. If you do not run this command, you will have to remember to start the
nvidia-docker service manually; otherwise the GPUs will not appear as available.
sudo systemctl enable nvidia-docker

Alternatively, if you have installed Docker CE above you can install nvidia-docker with:
curl -s -L https://nvidia.github.io/nvidia-docker/centos7/x86_64/nvidia-docker.repo | \
   sudo tee /etc/yum.repos.d/nvidia-docker.repo


sudo yum install nvidia-docker2

4. Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver.
nvidia-docker run --rm nvidia/cuda nvidia-smi

5. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:
# Set up directory with the version name
mkdir dai_rel_VERSION

6. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.
# cd into the new directory
cd dai_rel_VERSION

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

7. Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
sudo nvidia-persistenced --persistence-mode

8. Set up the data, log, license, and tmp directories on the host machine (within the new directory):
# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp

9. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
10. Run docker images to find the image tag.
11. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:
# Start the Driverless AI Docker image
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-x86_64:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

12. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

Install on RHEL with CPUs

This section describes how to install and start the Driverless AI Docker image on RHEL. Note that this uses Docker
EE and not NVIDIA Docker. GPU support will not be available.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Note: As of this writing, Driverless AI has only been tested on RHEL version 7.4.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/.
Alternatively, you can run on Docker CE.

sudo yum install -y yum-utils


sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

sudo yum makecache fast


sudo yum -y install docker-ce
sudo systemctl start docker

2. On the machine that is running Docker EE, retrieve the Driverless AI Docker image from https://www.h2o.ai/
download/.
3. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:

# Set up directory with the version name


mkdir dai_rel_VERSION


4. Load the Driverless AI Docker image inside the new directory. The following example shows how to load the Driverless AI image. Replace VERSION with your image version.

# Load the Driverless AI Docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

5. Set up the data, log, license, and tmp directories (within the new directory):

# cd into the directory associated with the selected version of Driverless AI


cd dai_rel_VERSION

# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp

6. Copy data into the data directory on the host. The data will be visible inside the Docker container at /<user-home>/data.
7. Run docker images to find the image tag.
8. Start the Driverless AI Docker image and replace TAG below with the image tag. Note that GPU support will
not be available.

$ docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-x86_64:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

9. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.


Stopping the Docker Image

To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.

Upgrading the Docker Image

This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:

# Set up directory with the version name


mkdir dai_rel_VERSION

# cd into the new directory


cd dai_rel_VERSION

3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load the Driverless AI image. If necessary, replace VERSION with your image version.

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz


5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:

# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp

At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

Install on NVIDIA GPU Cloud/NGC Registry

Driverless AI is supported on the following NVIDIA DGX products, and the installation steps for each platform are
the same.
• NVIDIA GPU Cloud
• NVIDIA DGX-1
• NVIDIA DGX-2
• NVIDIA DGX Station

Environment

Provider GPUs Min Memory Suitable for


NVIDIA GPU Cloud Yes Serious use
NVIDIA DGX-1/DGX-2 Yes 128 GB Serious use
NVIDIA DGX Station Yes 64 GB Serious Use

Installing the NVIDIA NGC Registry

Note: These installation instructions assume that you are running on an NVIDIA DGX machine. Driverless AI is only
available in the NGC registry for DGX machines.
1. Log in to your NVIDIA GPU Cloud account at https://ngc.nvidia.com/registry. (Note that NVIDIA Compute is
no longer supported by NVIDIA.)
2. In the Registry > Partners menu, select h2oai-driverless.


3. At the bottom of the screen, select one of the H2O Driverless AI tags to retrieve the pull command.

4. On your NVIDIA DGX machine, open a command prompt and use the specified pull command to retrieve the
Driverless AI image. For example:

docker pull nvcr.io/nvidia_partners/h2o-driverless-ai:latest

5. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:

# Set up directory with the version name


mkdir dai_rel_VERSION

6. Set up the data, log, license, and tmp directories on the host machine:


# cd into the directory associated with the selected version of Driverless AI


cd dai_rel_VERSION

# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp

7. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
8. Enable persistence of the GPU. Note that this only needs to be run once. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

9. Run docker images to find the new image tag.


10. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:

nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
nvcr.io/h2oai/h2oai-driverless-ai:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

11. Connect to Driverless AI with your browser:

http://Your-Driverless-AI-Host-Machine:12345


Stopping Driverless AI

Use Ctrl+C to stop Driverless AI.

Upgrading Driverless AI

The steps for upgrading Driverless AI on an NVIDIA DGX system are similar to the installation steps.
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Note: Use Ctrl+C to stop Driverless AI if it is still running.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

1. On your NVIDIA DGX machine, create a directory for the new Driverless AI version.
2. Copy the data, log, license, and tmp directories from the previous Driverless AI directory into the new Driverless
AI directory.
3. Run docker pull nvcr.io/h2oai/h2oai-driverless-ai:latest to retrieve the latest Driverless AI version.
4. Start the Driverless AI Docker image. (A condensed command sketch of steps 1-4 follows this list.)
5. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.
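A condensed sketch of steps 1-4. The directory names are placeholders; adjust them to match your previous installation, and start the image with the same nvidia-docker run command used at install time:

# Create the new version directory and carry over the working state
mkdir dai_rel_NEWVERSION
cp -a dai_rel_OLDVERSION/data dai_rel_OLDVERSION/log dai_rel_OLDVERSION/license dai_rel_OLDVERSION/tmp dai_rel_NEWVERSION/

# Pull the latest Driverless AI image from the NGC registry
docker pull nvcr.io/h2oai/h2oai-driverless-ai:latest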


8.2.2 Linux RPMs

For Linux machines that will not use the Docker image or DEB, an RPM installation is available for the following
environments:
• x86_64 RHEL 7, CentOS 7, or SLES 12
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the
license folder that you will create during the installation process.

Environment

Operating System Min Mem


RHEL with GPUs 64 GB
RHEL with CPUs 64 GB
CentOS 7 with GPUS 64 GB
CentOS 7 with CPUs 64 GB
SLES 12 with GPUs 64 GB
SLES 12 with CPUs 64 GB

Requirements

• RedHat 7/CentOS 7/SLES 12


• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• cuDNN >= 7.4.1 (Required only if using TensorFlow.)
• OpenCL (Required for LightGBM support on GPUs.)
• Driverless AI RPM, available from https://www.h2o.ai/download/

About the Install

• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
– /opt/h2oai/dai/tmp: Experiments and imported data are stored here


– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, then use the standard journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are installing Driverless AI programmatically, you can copy a license key file to that location (see the sketch after this list). If no license key is found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.
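A minimal sketch of installing a license key programmatically, assuming the key has already been saved on the host as /path/to/license.sig (a hypothetical path). Adjust the owner if you overrode DAI_USER or DAI_GROUP:

# Copy the license key to the location Driverless AI checks by default
sudo mkdir -p /opt/h2oai/dai/home/.driverlessai
sudo cp /path/to/license.sig /opt/h2oai/dai/home/.driverlessai/license.sig
sudo chown -R dai:dai /opt/h2oai/dai/home/.driverlessai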

Installing Driverless AI

Run the following commands to install the Driverless AI RPM. Replace VERSION with your specific version.

# Install Driverless AI.


sudo rpm -i dai-VERSION.rpm

Note: For RHEL 7.5, it is necessary to upgrade library glib2:

sudo yum upgrade glib2

By default, the Driverless AI processes are owned by the ‘dai’ user and ‘dai’ group. You can optionally specify a
different service user and group as shown below. Replace <myuser> and <mygroup> as appropriate.

# Temporarily specify service user and group when installing Driverless AI.
# rpm saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files.
sudo DAI_USER=myuser DAI_GROUP=mygroup rpm -i dai-VERSION.rpm

You may now optionally make changes to /etc/dai/config.toml.

Starting Driverless AI

If you have systemd (preferred):

# Start Driverless AI.


sudo systemctl start dai

If you do not have systemd:

# Start Driverless AI.


sudo -H -u dai /opt/h2oai/dai/run-dai.sh


Starting NVIDIA Persistence Mode

If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

Install OpenCL

OpenCL is required in order to run LightGBM on GPUs. Run the following for CentOS 7/RHEL 7-based systems using yum and x86_64.

yum -y clean all


yum -y makecache
yum -y update
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm

rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm


rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm
clinfo

mkdir -p /etc/OpenCL/vendors && \


echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Looking at Driverless AI log files

If you have systemd (preferred):

sudo systemctl status dai-dai


sudo systemctl status dai-h2o
sudo systemctl status dai-procsy
sudo systemctl status dai-vis-server
sudo journalctl -u dai-dai
sudo journalctl -u dai-h2o
sudo journalctl -u dai-procsy
sudo journalctl -u dai-vis-server

If you do not have systemd:

sudo less /opt/h2oai/dai/log/dai.log


sudo less /opt/h2oai/dai/log/h2o.log
sudo less /opt/h2oai/dai/log/procsy.log
sudo less /opt/h2oai/dai/log/vis-server.log


Stopping Driverless AI

If you have systemd (preferred):

# Stop Driverless AI.


sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

If you do not have systemd:

# Stop Driverless AI.


sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

Upgrading Driverless AI

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.


Upgrade Steps

If you have systemd (preferred):


# Stop Driverless AI.
sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Upgrade and restart.


sudo rpm -U dai-NEWVERSION.rpm
sudo systemctl daemon-reload
sudo systemctl start dai

If you do not have systemd:


# Stop Driverless AI.
sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Upgrade and restart.


sudo rpm -U dai-NEWVERSION.rpm
sudo -H -u dai /opt/h2oai/dai/run-dai.sh

Uninstalling Driverless AI

If you have systemd (preferred):


# Stop Driverless AI.
sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Uninstall.
sudo rpm -e dai

If you do not have systemd:


# Stop Driverless AI.
sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Uninstall.
sudo rpm -e dai

CAUTION! At this point you can optionally completely remove all remaining files, including the database. (This
cannot be undone.)


sudo rm -rf /opt/h2oai/dai


sudo rm -rf /etc/dai

8.2.3 Linux DEBs

For Linux machines that will not use the Docker image or RPM, a DEB installation is available for x86_64 Ubuntu
16.04/18.04.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the
license folder that you will create during the installation process.

Environment

Operating System Min Mem


Ubuntu with GPUs 64 GB
Ubuntu with CPUs 64 GB

Requirements

• Ubuntu 16.04/Ubuntu 18.04


• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• cuDNN >= 7.4.1 (Required only if using TensorFlow.)
• OpenCL (Required for LightGBM support on GPUs.)
• Driverless AI DEB, available from https://www.h2o.ai/download/

About the Install

• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
– /opt/h2oai/dai/tmp: Experiments and imported data are stored here


– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, then use the standard journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are
installing Driverless AI programmatically, you can copy a license key file to that location. If no license key is
found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.

Starting NVIDIA Persistence Mode (GPU only)

If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.
sudo nvidia-persistenced --persistence-mode

Install OpenCL

OpenCL is required in order to run LightGBM on GPUs. Run the following for Ubuntu-based systems.
sudo apt-get install opencl-headers clinfo ocl-icd-opencl-dev

mkdir -p /etc/OpenCL/vendors && \


echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Installing the Driverless AI Linux DEB

Run the following commands to install the Driverless AI DEB. Replace VERSION with your specific version.
# Install Driverless AI.
sudo dpkg -i dai_VERSION.deb

By default, the Driverless AI processes are owned by the ‘dai’ user and ‘dai’ group. You can optionally specify a
different service user and group as shown below. Replace <myuser> and <mygroup> as appropriate.
# Temporarily specify service user and group when installing Driverless AI.
# dpkg saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files.

sudo DAI_USER=myuser DAI_GROUP=mygroup dpkg -i dai_VERSION.deb

You may now optionally make changes to /etc/dai/config.toml.


Starting Driverless AI

If you have systemd (preferred):

# Start Driverless AI.


sudo systemctl start dai

If you do not have systemd:

# Start Driverless AI.


sudo -H -u dai /opt/h2oai/dai/run-dai.sh

Looking at Driverless AI log files

If you have systemd (preferred):

sudo systemctl status dai-dai


sudo systemctl status dai-h2o
sudo systemctl status dai-procsy
sudo systemctl status dai-vis-server
sudo journalctl -u dai-dai
sudo journalctl -u dai-h2o
sudo journalctl -u dai-procsy
sudo journalctl -u dai-vis-server

If you do not have systemd:

sudo less /opt/h2oai/dai/log/dai.log


sudo less /opt/h2oai/dai/log/h2o.log
sudo less /opt/h2oai/dai/log/procsy.log
sudo less /opt/h2oai/dai/log/vis-server.log

Stopping Driverless AI

If you have systemd (preferred):

# Stop Driverless AI.


sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

If you do not have systemd:

# Stop Driverless AI.


sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai


Upgrading Driverless AI

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

If you have systemd (preferred):


# Stop Driverless AI.
sudo systemctl stop dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Upgrade Driverless AI.


sudo dpkg -i dai_NEWVERSION.deb
sudo systemctl daemon-reload
sudo systemctl start dai

If you do not have systemd:


# Stop Driverless AI.
sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time. If you do not, all previous data will be lost.
# Upgrade and restart.


sudo dpkg -i dai_NEWVERSION.deb
sudo -H -u dai /opt/h2oai/dai/run-dai.sh

Uninstalling Driverless AI

If you have systemd (preferred):

# Stop Driverless AI.


sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Uninstall Driverless AI.


sudo dpkg -r dai

# Purge Driverless AI.


sudo dpkg -P dai

If you do not have systemd:

# Stop Driverless AI.


sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Uninstall Driverless AI.


sudo dpkg -r dai

# Purge Driverless AI.


sudo dpkg -P dai

CAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot
be undone):

sudo rm -rf /opt/h2oai/dai


sudo rm -rf /etc/dai

Common Problems

Start of Driverless AI fails on the message ``Segmentation fault (core dumped)`` on Ubuntu 18.
This problem is caused by the font NotoColorEmoji.ttf, which cannot be processed by the Python matplotlib
library. A workaround is to disable the font by renaming it. (Do not use fontconfig because it is ignored by matplotlib.)
The following will print out the command that should be executed.

sudo find / -name "NotoColorEmoji.ttf" 2>/dev/null | xargs -I{} echo sudo mv {} {}.backup
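If the printed command looks right, it can be run directly by dropping the echo, for example:

# Rename the font so matplotlib ignores it (restore later by removing the .backup suffix)
sudo find / -name "NotoColorEmoji.ttf" 2>/dev/null | xargs -I{} sudo mv {} {}.backup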


8.2.4 Linux TAR SH

The Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive.
This form of installation does not require a privileged user to install or to run.
This artifact has the same compatibility matrix as the RPM and DEB packages (combined); it just comes packaged slightly differently. See those sections for a full list of supported environments.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in.

Requirements

• RedHat 7 or Ubuntu 16.04


• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• cuDNN >= 7.4.1 (Required only if using TensorFlow.)
• OpenCL (Required for LightGBM support on GPUs.)
• Driverless AI TAR SH, available from https://www.h2o.ai/download/

Installing Driverless AI

Run the following commands to install the Driverless AI TAR SH. Replace VERSION with your specific version.

# Install Driverless AI.


chmod 755 dai-VERSION.sh
./dai-VERSION.sh

You may now cd to the unpacked directory and optionally make changes to config.toml.

Starting Driverless AI

# Start Driverless AI.


./run-dai.sh

Starting NVIDIA Persistence Mode

If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode


Install OpenCL

OpenCL is required in order to run LightGBM on GPUs. Run the following for CentOS 7/RHEL 7-based systems using yum and x86_64.

yum -y clean all


yum -y makecache
yum -y update
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm

rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm


rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm
clinfo

mkdir -p /etc/OpenCL/vendors && \


echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Looking at Driverless AI log files

less log/dai.log
less log/h2o.log
less log/procsy.log
less log/vis-server.log

Stopping Driverless AI

# Stop Driverless AI.


./kill-dai.sh

Uninstalling Driverless AI

To uninstall Driverless AI, just remove the directory created by the unpacking process. By default, all files for Driverless AI are contained within this directory.
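For example, assuming the archive unpacked into a directory named dai-VERSION (the actual name depends on the release):

# Stop Driverless AI first, then remove the unpacked directory (this cannot be undone)
cd dai-VERSION && ./kill-dai.sh && cd ..
rm -rf dai-VERSION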

Upgrading Driverless AI

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want


to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

1. Stop your previous version of Driverless AI. (A condensed command sketch of all five steps follows this list.)
2. Run the self-extracting archive for the new version of Driverless AI.
3. Port any previous changes you made to your config.toml file to the newly unpacked directory.
4. Copy the tmp directory (which contains all the Driverless AI working state) from your previous Driverless AI
installation into the newly unpacked directory.
5. Start your newly extracted version of Driverless AI.
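A condensed command sketch of these steps. The directory and file names are placeholders; the actual unpacked directory name depends on the release:

# 1. Stop the old version from inside its directory
cd dai-OLDVERSION && ./kill-dai.sh && cd ..

# 2. Unpack the new version
chmod 755 dai-NEWVERSION.sh
./dai-NEWVERSION.sh

# 3. Port any config.toml changes into the new directory by hand

# 4. Carry over the working state
cp -a dai-OLDVERSION/tmp dai-NEWVERSION/tmp

# 5. Start the new version from inside its directory
cd dai-NEWVERSION && ./run-dai.sh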

8.2.5 Linux in the Cloud

To simplify cloud installation, Driverless AI is provided as an AMI for the following cloud platforms:
• AWS AMI
• Azure Image
• Google Cloud
The installation steps for AWS, Azure, and Google Cloud assume that you have a license key for Driverless AI. For
information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained,
you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig
file and place it in the license folder that you will create during the installation process.

Install on AWS

Driverless AI can be installed on Amazon AWS using the AWS Marketplace AMI or the AWS Community AMI.


Choosing an Install Method

Consider the following when choosing between the AWS Marketplace and AWS Community AMIs:

Driverless AI AWS Marketplace AMI

• Native (Debian) install based


• Certified by AWS
• Will typically lag behind our standard releases, and may require updates to work with the latest versions of
Driverless AI
• Features several default configurations like default password and HTTPS configuration, which are required by
AWS

Driverless AI AWS Community AMI

• Docker based
• Not certified by AWS
• Will typically have an up-to-date version of Driverless AI for both LTS and latest stable releases
• Base Driverless AI installation on Docker does not feature preset configurations

Install the Driverless AI AWS Marketplace AMI

A Driverless AI AMI is available in the AWS Marketplace beginning with Driverless AI version 1.5.2. This section
describes how to install and run Driverless AI through the AWS Marketplace.

Environment

Provider Instance Type Num GPUs Suitable for


AWS p2.xlarge 1 Experimentation
p2.8xlarge 8 Serious use
p2.16xlarge 16 Serious use
p3.2xlarge 1 Experimentation
p3.8xlarge 4 Serious use
p3.16xlarge 8 Serious use
g3.4xlarge 1 Experimentation
g3.8xlarge 2 Experimentation
g3.16xlarge 4 Serious use


Installation Procedure

1. Log in to the AWS Marketplace.


2. Search for Driverless AI.

3. Select the version of Driverless AI that you want to install.

4. Scroll down to review/edit your region and the selected infrastructure and pricing.


5. Return to the top and select Continue to Subscribe.

6. Review the subscription, then click Continue to Configure.


7. If desired, change the Fulfillment Option, Software Version, and Region. Note that this page also includes the
AMI ID for the selected software version. Click Continue to Launch when you are done.


8. Review the configuration and choose a method for launching Driverless AI. Be sure to also review the Usage
Instructions. This button provides you with the login and password for launching Driverless AI. Scroll
down to the bottom of the page and click Launch when you are done.


You will receive a “Success” message when the image launches successfully.

Starting Driverless AI

This section describes how to start Driverless AI after the Marketplace AMI has been successfully launched.
1. Navigate to the EC2 Console.
2. Select your instance.
3. Open another browser and launch Driverless AI by navigating to https://<public IP of the instance>:12345.
4. Sign in to Driverless AI with the username and password provided in the Usage Instructions. You will be
prompted to enter your Driverless AI license key the first time that you log in.

Stopping the EC2 Instance

The EC2 instance will continue to run even when you close the aws.amazon.com portal. To stop the instance:
1. On the EC2 Dashboard, click the Running Instances link under the Resources section.
2. Select the instance that you want to stop.
3. In the Actions drop down menu, select Instance State > Stop.
4. A confirmation page will display. Click Yes, Stop to stop the instance.


Upgrading the Driverless AI Marketplace Image

Note that the first offering of the Driverless AI Marketplace image was 1.5.2. As such, it is only possible to upgrade
to versions greater than that.
Perform the following steps if you are upgrading to a Driverless AI Marketplace image version greater than 1.5.2. Replace dai_NEWVERSION.deb below with the new Driverless AI version (for example, dai_1.5.4_amd64.deb). Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and
/etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP environment variables
during an upgrade.

# Stop Driverless AI.


sudo systemctl stop dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Upgrade Driverless AI.


sudo dpkg -i dai_NEWVERSION.deb
sudo systemctl daemon-reload
sudo systemctl start dai

Install the Driverless AI AWS Community AMI

Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.

Environment

Provider Instance Type Num GPUs Suitable for


AWS p2.xlarge 1 Experimentation
p2.8xlarge 8 Serious use
p2.16xlarge 16 Serious use
p3.2xlarge 1 Experimentation
p3.8xlarge 4 Serious use
p3.16xlarge 8 Serious use
g3.4xlarge 1 Experimentation
g3.8xlarge 2 Experimentation
g3.16xlarge 4 Serious use

Installing the EC2 Instance

1. Log in to your AWS account at https://aws.amazon.com.


2. In the upper right corner of the Amazon Web Services page, set the location drop-down. (Note: We recommend
selecting the US East region because H2O’s resources are stored there. It also offers more instance types than
other regions.)


3. Select the EC2 option under the Compute section to open the EC2 Dashboard.

4. Click the Launch Instance button under the Create Instance section.


5. Under Community AMIs, search for h2oai, and then select the version that you want to launch.

6. On the Choose an Instance Type page, select GPU compute in the Filter by dropdown. This will ensure that
your Driverless AI instance will run on GPUs. Select a GPU compute instance from the available options. (We
recommend at least 32 vCPUs.) Click the Next: Configure Instance Details button.


7. Specify the Instance Details that you want to configure. Create a VPC or use an existing one, and ensure that
“Auto-Assign Public IP” is enabled and associated to your subnet. Click Next: Add Storage.

8. Specify the Storage Device settings. Note again that Driverless AI requires 10 GB to run and will stop working
if less than 10 GB is available. The machine should have a minimum of 30 GB of disk space. Click Next: Add
Tags.


9. If desired, add a unique Tag name to identify your instance. Click Next: Configure Security Group.
10. Add the following security rules to enable SSH access to Driverless AI, then click Review and Launch.

Type Protocol Port Range Source Description


SSH TCP 22 Anywhere 0.0.0.0/0
Custom TCP Rule TCP 12345 Anywhere 0.0.0.0/0 Launch DAI

11. Review the configuration, and then click Launch.


12. A popup will appear prompting you to select a key pair. This is required in order to SSH into the instance.
You can select your existing key pair or create a new one. Be sure to accept the acknowledgement, then click
Launch Instances to start the new instance.


13. Upon successful completion, a message will display informing you that your instance is launching. Click the
View Instances button to see information about the instance including the IP address. The Connect button on
this page provides information on how to SSH into your instance.
14. Open a Terminal window and SSH into the IP address of the AWS instance. Replace the DNS name below with
your instance DNS.
ssh -i "mykeypair.pem" [email protected]

15. If you selected a GPU-compute instance, then you must enable persistence and optimizations of the GPU. The
commands vary depending on the instance type. Note also that these commands need to be run once every
reboot. Refer to the following for more information:
• http://docs.nvidia.com/deploy/driver-persistence/index.html
• https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/optimize_gpu.html
• https://www.migenius.com/articles/realityserver-on-aws
# g3:
sudo nvidia-persistenced --persistence-mode
sudo nvidia-smi -acp 0
sudo nvidia-smi --auto-boost-permission=0
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac "2505,1177"

# p2:
sudo nvidia-persistenced --persistence-mode
sudo nvidia-smi -acp 0
sudo nvidia-smi --auto-boost-permission=0
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac "2505,875"

# p3:
sudo nvidia-persistenced --persistence-mode
sudo nvidia-smi -acp 0
sudo nvidia-smi -ac "877,1530"

16. At this point, you can copy data into the data directory on the host machine using scp. (Note that the data folder
already exists.) For example:

scp <data_file>.csv ubuntu@<your-instance-DNS>:/home/data

The data will be visible inside the Docker container.


17. Connect to Driverless AI with your browser. You will be prompted to enter your Driverless AI license key the
first time that you log in.

http://Your-Driverless-AI-Host-Machine:12345

Stopping the EC2 Instance

The EC2 instance will continue to run even when you close the aws.amazon.com portal. To stop the instance:
1. On the EC2 Dashboard, click the Running Instances link under the Resources section.
2. Select the instance that you want to stop.
3. In the Actions drop down menu, select Instance State > Stop.
4. A confirmation page will display. Click Yes, Stop to stop the instance.

Upgrading the Driverless AI Community Image

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.


Upgrading from Version 1.2.2 or Earlier

The following example shows how to upgrade from 1.2.2 or earlier to the current version. Upgrading from these earlier
versions requires an edit to the start and h2oai scripts.
1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:

# Set up a directory of the previous version name


mkdir dai_rel_1.2.2

# Copy the data, log, license, and tmp directories as backup


cp -a ./data dai_rel_1.2.2/data
cp -a ./log dai_rel_1.2.2/log
cp -a ./license dai_rel_1.2.2/license
cp -a ./tmp dai_rel_1.2.2/tmp

2. wget the newer image. The command below retrieves version 1.2.2:

wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.2.2-6/x86_64-centos7/dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz

3. In the /home/ubuntu/scripts/ folder, edit both the start.sh and h2oai.sh scripts to use the newer image.
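
The following sketch shows one way to locate that reference; the exact variable names inside start.sh and h2oai.sh
may differ, so treat this as an assumption and adjust to what you find.

# Sketch only: find where the scripts reference the Driverless AI image.
cd /home/ubuntu/scripts
grep -n "h2oai/dai" start.sh h2oai.sh
# Edit both files (for example with vi) so that they reference the image tag you load in the next step.
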
4. Use the docker load command to load the image:

docker load < ami-0c50db5e1999408a7

5. Optionally run docker images to ensure that the new image is in the registry.
6. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

Upgrading from Version 1.3.0 or Later

The following example shows how to upgrade from version 1.3.0.


1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:

# Set up a directory of the previous version name


mkdir dai_rel_1.3.0

# Copy the data, log, license, and tmp directories as backup


cp -a ./data dai_rel_1.3.0/data
cp -a ./log dai_rel_1.3.0/log
cp -a ./license dai_rel_1.3.0/license
cp -a ./tmp dai_rel_1.3.0/tmp

2. wget the newer image. Replace VERSION and BUILD below with the Driverless AI version.

wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/VERSION-BUILD/x86_64-centos7/dai-docker-centos7-x86_64-VERSION.tar.gz

3. Use the docker load command to load the image:

docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

4. In the new AMI, locate the DAI_RELEASE file, and edit that file to match the new image tag.
5. Stop and then start Driverless AI.


h2oai stop
h2oai start

When installing via AWS, you can also enable role-based authentication.

AWS Role-Based Authentication

In Driverless AI, it is possible to enable role-based authentication via the IAM role. This is a two-step process that
involves setting up AWS IAM and then starting Driverless AI by specifying the role in the config.toml file or by setting
the AWS_USE_EC2_ROLE_CREDENTIALS environment variable to True.

AWS IAM Setup

1. Create an IAM role. This IAM role should have a Trust Relationship with Principal Trust Entity set to your
Account ID. For example, the trust relationship for Account ID 524466471676 would look like:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::524466471676:root"
},
"Action": "sts:AssumeRole"
}
]
}


2. Create a new policy that allows users to assume the role:

3. Assign the policy to the user.
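
If you prefer to script steps 2 and 3 with the AWS CLI, a minimal sketch follows. The role name, policy name, and
user name are placeholders; the Account ID matches the example above.

# Sketch: create a policy that allows assuming the example role, then attach it to a user.
aws iam create-policy --policy-name AllowAssumeDAIRole --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::524466471676:role/MyDriverlessAIRole"
    }
  ]
}'
aws iam attach-user-policy --user-name MyDAIUser \
    --policy-arn arn:aws:iam::524466471676:policy/AllowAssumeDAIRole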

4. Test role switching here: https://signin.aws.amazon.com/switchrole. (Refer to
https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_roles.html#troubleshoot_roles_cant-assume-role.)


Driverless AI Setup

Update the aws_use_ec2_role_credentials config variable in the config.toml file or start Driverless AI using
the AWS_USE_EC2_ROLE_CREDENTIALS environment variable.
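
For example, either of the following approaches should work for a package (deb/rpm) install; the exact value
expected by the environment variable is an assumption, so check the comments in your config.toml.

# Option 1: enable the config variable in /etc/dai/config.toml:
#   aws_use_ec2_role_credentials = true
#
# Option 2: set the environment variable (for systemd installs it can go in
# /etc/dai/EnvironmentFile.conf):
#   AWS_USE_EC2_ROLE_CREDENTIALS=true

# Restart Driverless AI so the change takes effect.
sudo systemctl stop dai
sudo systemctl start dai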

Resources

1. Granting a User Permissions to Switch Roles: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_permissions-to-switch.html
2. Creating a Role to Delegate Permissions to an IAM User: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html
3. Assuming an IAM Role in the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-role.html

Install on Azure

This section describes how to install the Driverless AI image from Azure.
Note: Prior versions of the Driverless AI installation and upgrade on Azure were done via Docker. This is no longer
the case as of version 1.5.2.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.

Environment

Provider  Instance Type   Num GPUs  Suitable for

Azure     Standard_NV6    1         Experimentation
          Standard_NV12   2         Experimentation
          Standard_NV24   4         Serious use
          Standard_NC6    1         Experimentation
          Standard_NC12   2         Experimentation
          Standard_NC24   4         Serious use

About the Install

• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides


• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)
– /opt/h2oai/dai/tmp: Experiments and imported data are stored here
– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, use the standard
journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are
installing Driverless AI programmatically, you can copy a license key file to that location. If no license key is
found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.

Installing the Azure Instance

1. Log in to your Azure portal at https://portal.azure.com, and click the Create a Resource button.
2. Search for and select H2O DriverlessAI in the Marketplace.

3. Click Create. This launches the H2O DriverlessAI Virtual Machine creation process.


4. On the Basics tab:


a. Enter a name for the VM.
b. Select the Disk Type for the VM. Use HDD for GPU instances.
c. Enter the name that you will use when connecting to the machine through SSH.
d. Enter and confirm a password that will be used when connecting to the machine through
SSH.
e. Specify the Subscription option. (This should be Pay-As-You-Go.)
f. Enter a unique name for the resource group.
g. Specify the VM region.
Click OK when you are done.


5. On the Size tab, select your virtual machine size. Specify the HDD disk type and select a configuration. We
recommend using an N-Series type, which comes with a GPU. Also note that Driverless AI requires 10 GB of
free space in order to run and will stop working if less than 10 GB is available. We recommend a minimum of
30 GB of disk space. Click OK when you are done.


6. On the Settings tab, select or create the Virtual Network and Subnet where the VM is going to be located and
then click OK.


7. The Summary tab performs a validation on the specified settings and will report back any errors. When the
validation passes successfully, click Create to create the VM.

8. After the VM is created, it will be available under the list of Virtual Machines. Select this Driverless AI VM to
view the IP address of your newly created machine.
9. Connect to Driverless AI with your browser using the IP address retrieved in the previous step.

http://Your-Driverless-AI-Host-Machine:12345

Stopping the Azure Instance

The Azure instance will continue to run even when you close the Azure portal. To stop the instance:
1. Click the Virtual Machines left menu item.
2. Select the checkbox beside your DriverlessAI virtual machine.
3. On the right side of the row, click the ... button, then select Stop. (Note that you can then restart this by
selecting Start.)


Upgrading the Driverless AI Image

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
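
For a deb-based install like this one, the backup step can be as simple as the following sketch; the backup
destination is an example.

# Stop Driverless AI, then snapshot the tmp directory before upgrading.
sudo systemctl stop dai
sudo cp -a /opt/h2oai/dai/tmp /opt/h2oai/dai/tmp.backup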


Upgrading from Version 1.2.2 or Earlier

It is not possible to upgrade from version 1.2.2 or earlier to the latest version. You have to manually remove the 1.2.2
container and then reinstall the latest Driverless AI version. Be sure to back up your data before doing this.

Upgrading from Version 1.3.0 to 1.5.1

1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:

# Set up a directory of the previous version name


mkdir dai_rel_1.3.0

# Copy the data, log, license, and tmp directories as backup


cp -a ./data dai_rel_1.3.0/data
cp -a ./log dai_rel_1.3.0/log
cp -a ./license dai_rel_1.3.0/license
cp -a ./tmp dai_rel_1.3.0/tmp

2. wget the newer image. Replace VERSION and BUILD below with the Driverless AI version.

wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/VERSION-BUILD/x86_64-centos7/dai-docker-centos7-x86_64-VERSION.tar.gz

3. Use the docker load command to load the image:

docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

4. Run docker images to find the new image tag.


5. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:

# Start the Driverless AI Docker image


nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

Upgrading from Version 1.5.2 or Later

Upgrading to versions 1.5.2 and later is no longer done via Docker. Instead, perform the following steps if you are
upgrading to version 1.5.2 or later. Replace dai_NEWVERSION.deb below with the new Driverless AI version
(for example, dai_1.6.1_amd64.deb). Note that this upgrade process inherits the service user and group from
/etc/dai/User.conf and /etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP
environment variables during an upgrade.
Note about upgrading to 1.7.x: As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have
CUDA 10.0 or later with NVIDIA drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA
libraries, but the driver must exist in the host environment. Go to https://www.nvidia.com/Download/index.aspx to get
the latest NVIDIA Tesla V/P/K series driver.
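
A quick way to confirm the host driver version before upgrading (GPU hosts only) is to query it with nvidia-smi:

# The reported driver version should be >= 410 for CUDA 10 support.
nvidia-smi --query-gpu=driver_version --format=csv,noheader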

# Stop Driverless AI.


sudo systemctl stop dai

# Backup your /opt/h2oai/dai/tmp directory at this time.

# Upgrade Driverless AI.


sudo dpkg -i dai_NEWVERSION.deb
sudo systemctl daemon-reload
sudo systemctl start dai

Install on Google Compute

Driverless AI can be installed on Google Compute using one of two methods:


• Install the Google Cloud Platform offering. This installs Driverless AI via the available GCP Marketplace
offering.
• Install and Run in a Docker Container on Google Compute Engine. This installs and runs Driverless AI from
scratch in a Docker container on Google Compute Engine.
Select your desired installation procedure below:

Install the Google Cloud Platform Offering

This section describes how to install and start Driverless AI in a Google Compute environment using the GCP Mar-
ketplace. This assumes that you already have a Google Cloud Platform account. If you don’t have an account, go to
https://console.cloud.google.com/getting-started to create one.

Before You Begin

If you are trying GCP for the first time and have just created an account, please check your Google Compute En-
gine (GCE) resource quota limits. By default, GCP allocates a maximum of 8 CPUs and no GPUs. Our de-
fault recommendation for launching Driverless AI is 32 CPUs, 120 GB RAM, and 2 P100 NVIDIA GPUs. You
can change these settings to match your quota limit, or you can request more resources from GCP. Refer to
https://cloud.google.com/compute/quotas for more information, including information on how to check your quota
and request additional quota.

Installation Procedure

1. In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/.


2. In the left navigation panel, select Marketplace.


3. On the Marketplace page, search for Driverless and select the H2O.ai Driverless AI offering. The following
page will display.

4. Click Launch on Compute Engine. (If necessary, refer to Google Compute Instance Types for information
about machine and GPU types.)
• Select a zone that has p100s or k80s (such as us-east1-)
• Optionally change the number of cores and amount of memory. (This defaults to 32 CPUs and 120
GB RAM.)


• Specify a GPU type. (This defaults to a p100 GPU.)


• Optionally change the number of GPUs. (Default is 2.)
• Specify the boot disk type and size.
• Optionally change the network name and subnetwork names. Be sure that whichever network you
specify has port 12345 exposed.
• Click Deploy when you are done. Driverless AI will begin deploying. Note that this can take several
minutes.

5. A summary page displays when the compute engine is successfully deployed. This page includes the instance
ID and the username (always h2oai) and password that will be required when starting Driverless AI. Click on
the Instance link to retrieve the external IP address for starting Driverless AI.


6. In your browser, go to https://[External_IP]:12345 to start Driverless AI.


7. Agree to the Terms and Conditions.
8. Log in to Driverless AI using your user name and password.
9. Optionally enable GCS and Big Query access.
a. In order to enable GCS and Google BigQuery access, you must pass the running instance a service
account json file configured with GCS and GBQ access. The Driverless AI image comes with a
blank file called service_account.json. Obtain a functioning service account json file from GCP,
rename it to “service_account.json”, and copy it to the Ubuntu user on the running instance.

gcloud compute scp /path/to/service_account.json ubuntu@<running_instance_name>:service_account.json

b. SSH into the machine running Driverless AI, and verify that the service_account.json file is in the
/etc/dai/ folder.
c. Restart the machine for the changes to take effect.

sudo systemctl stop dai

# Wait for the system to stop

# Verify that the system is no longer running


sudo systemctl status dai

# Restart the system


sudo systemctl start dai


Upgrading the Google Cloud Platform Offering

Perform the following steps to upgrade the Driverless AI Google Platform offering. Replace dai_NEWVERSION.
deb below with the new Driverless AI version (for example, dai_1.6.1_amd64.deb). Note that this upgrade
process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not need to man-
ually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

# Stop Driverless AI.


sudo systemctl stop dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Upgrade Driverless AI.


sudo dpkg -i dai_NEWVERSION.deb
sudo systemctl daemon-reload
sudo systemctl start dai

Install and Run in a Docker Container on Google Compute Engine

This section describes how to install and start Driverless AI from scratch using a Docker container in a Google
Compute environment.
This installation assumes that you already have a Google Cloud Platform account. If you don’t have an account,
go to https://console.cloud.google.com/getting-started to create one. In addition, refer to Google’s Machine Types
documentation for information on Google Compute machine types.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.

Before You Begin

If you are trying GCP for the first time and have just created an account, please check your Google Compute Engine
(GCE) resource quota limits. By default, GCP allocates a maximum of 8 CPUs and no GPUs. You can change these
settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/
compute/quotas for more information, including information on how to check your quota and request additional quota.

Installation Procedure

1. In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/.


2. In the left navigation panel, select Compute Engine > VM Instances.


3. Click Create Instance.

4. Specify the following at a minimum:


• A unique name for this instance.
• The desired zone. Note that not all zones and user accounts can select zones with GPU instances.
Refer to the following for information on how to add GPUs: https://cloud.google.com/compute/
docs/gpus/.
• A supported OS, for example Ubuntu 16.04. Be sure to also increase the disk size of the OS image
to be 64 GB.
Click Create at the bottom of the form when you are done. This creates the new VM instance.


5. Create a Firewall rule for Driverless AI. On the Google Cloud Platform left navigation panel, select VPC
network > Firewall rules. Specify the following settings:
• Specify a unique name and Description for this instance.
• Change the Targets dropdown to All instances in the network.
• Specify the Source IP ranges to be 0.0.0.0/0.
• Under Protocols and Ports, select Specified protocols and ports and enter the following: tcp:
12345.
Click Create at the bottom of the form when you are done.
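
The same rule can also be created from the gcloud CLI. A sketch, assuming the default network; the rule name is an
example.

# Allow inbound TCP 12345 from anywhere.
gcloud compute firewall-rules create dai-allow-12345 \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:12345 \
    --source-ranges=0.0.0.0/0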


6. On the VM Instances page, SSH to the new VM Instance by selecting Open in Browser Window from the SSH
dropdown.

7. H2O provides a script for you to run in your VM instance. Open an editor in the VM instance (for example,
vi). Copy one of the scripts below (depending on whether you are running GPUs or CPUs). Save the script as
install.sh.
# SCRIPT FOR GPUs ONLY
apt-get -y update
apt-get -y --no-install-recommends install \
curl \
apt-utils \
python-software-properties \
software-properties-common

add-apt-repository -y ppa:graphics-drivers/ppa
add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

apt-get update
apt-get install -y \
nvidia-384 \
nvidia-modprobe \
docker-ce

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list


sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration


sudo apt-get install -y nvidia-docker2

# SCRIPT FOR CPUs ONLY


apt-get -y update
apt-get -y --no-install-recommends install \
curl \
apt-utils \
python-software-properties \
software-properties-common

add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

apt-get update
apt-get install -y docker-ce

8. Type the following commands to run the install script.

chmod +x install.sh
sudo ./install.sh

9. In your user folder, create the following directories as your user.

mkdir ~/tmp
mkdir ~/log
mkdir ~/data
mkdir ~/scripts
mkdir ~/license
mkdir ~/demo
mkdir -p ~/jupyter/notebooks

10. Add your Google Compute user name to the docker group.


sudo usermod -aG docker <username>

11. Reboot the system to enable NVIDIA drivers.

sudo reboot

12. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.


13. Load the Driverless AI Docker image. The following example shows how to load Driverless AI. Replace VER-
SION with your image.

sudo docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

14. If you are running CPUs, you can skip this step. Otherwise, you must enable persistence of the GPU. Note that
this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/
deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

15. Start the Driverless AI Docker image with nvidia-docker run (GPUs) or docker run (CPUs). Note
that you must have write privileges for the folders that are created below. You can replace ‘pwd’ with the path to
/home/<username> or start with sudo nvidia-docker run. Replace TAG with the Docker image tag (run
docker images if necessary.) Also, refer to Using Data Connectors with the Docker Image for information
on how to add the GCS and GBQ data connectors to your Driverless AI instance.

# Start the Driverless AI Docker image


nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

16. Connect to Driverless AI with your browser:

http://Your-Driverless-AI-Host-Machine:12345


Stopping the GCE Instance

The Google Compute Engine instance will continue to run even when you close the portal. You can stop the instance
using one of the following methods:
Stopping in the browser
1. On the VM Instances page, click on the VM instance that you want to stop.
2. Click Stop at the top of the page.
3. A confirmation page will display. Click Stop to stop the instance.
Stopping in Terminal
SSH into the machine that is running Driverless AI, and then run the following:

h2oai stop

Upgrading Driverless AI

This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.


Upgrade Steps

1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:

# Set up directory with the version name


mkdir dai_rel_VERSION

# cd into the new directory


cd dai_rel_VERSION

3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load the Driverless
AI image. If necessary, replace VERSION with your image version.

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:

# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp

At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

8.3 IBM Power Installs

This section provides installation steps for IBM Power environments. This includes information for Docker image
installs, RPMs, Deb, and Tar installs.
Notes:
• Ubuntu is not fully tested on Power.
• OpenCL and LightGBM with GPUs are not supported on Power currently.


8.3.1 IBM Docker Images

To simplify local installation, Driverless AI is provided as a Docker image for the following system combination:

Host OS                      Docker Version  Host Architecture  Min Mem

RHEL or CentOS 7.4 or later  Docker CE       ppc64le            64 GB

Notes:
• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• OpenCL and LightGBM with GPUs are not supported on Power currently.
For the best performance, including GPU support, use nvidia-docker2. For a lower-performance experience without
GPUs, use regular docker (with the same docker image).
These installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the
license folder that you will create during the installation process.

Install on IBM with GPUs

This section describes how to install and start the Driverless AI Docker image on RHEL for IBM Power LE systems
with GPUs. Note that nvidia-docker has limited support for ppc64le machines. More information about nvidia-docker
support for ppc64le machines is available here.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.
2. Add the following to cuda-rhel7.repo in /etc/yum.repos.d/:

[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/ppc64le
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/ppc64le/
˓→7fa2af80.pub

3. Add the following to nvidia-container-runtime.repo in /etc/yum.repos.d/:

[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

4. Install the latest drivers and the latest version of CUDA:

yum -y install nvidia-driver-latest-dkms cuda --nogpgcheck

5. Install Docker on RedHat:

yum -y install docker

6. Install NVIDIA hook. (See https://github.com/NVIDIA/nvidia-docker#rhel-docker for more information.) This


automatically switches Docker’s runtime to nvidia-runtime:

yum -y install nvidia-container-runtime-hook

7. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:

# Set up directory with the version name


mkdir dai_rel_VERSION

8. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory. This
example shows how to load Driverless AI. Replace VERSION with your image.

# cd into the new directory


cd dai_rel_VERSION

# Load the Driverless AI docker image


docker load < dai-docker-centos7-ppc64le-VERSION.tar.gz

9. Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more
information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

10. Set up the data, log, and license directories on the host machine (within the new directory):

# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp

11. At this point, you can copy data into the data directory on the host machine. The data will be visible inside the
Docker container.
12. Run docker images to find the image tag.
13. Start the Driverless AI Docker image with nvidia-docker and replace TAG below with the image tag:


# Start the Driverless AI Docker image


nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-ppc64le:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

14. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

Install on IBM with CPUs

This section describes how to install and start the Driverless AI Docker image on RHEL for IBM Power LE systems
with CPUs. Note that this uses Docker EE and not NVIDIA Docker. GPU support will not be available.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.
Note: As of this writing, Driverless AI has only been tested on RHEL version 7.4.
Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following
steps.
1. Install and start Docker CE.

sudo yum install -y yum-utils


sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

sudo yum makecache fast


sudo yum -y install docker-ce
sudo systemctl start docker

2. On the machine that is running Docker EE, retrieve the Driverless AI Docker image from https://www.h2o.ai/
driverless-ai-download/.
3. Set up a directory for the version of Driverless AI on the host machine, replacing VERSION below with your
Driverless AI Docker image version:


# Set up directory with the version name


mkdir dai_rel_VERSION

4. Load the Driverless AI Docker image inside the new directory. The following example shows how to load
Driverless AI. Replace VERSION with your image.

# Load the Driverless AI Docker image


docker load < dai-docker-centos7-ppc64le-VERSION.tar.gz

5. Set up the data, log, license, and tmp directories (within the new directory):

# cd into the directory associated with the selected version of Driverless AI


cd dai_rel_VERSION

# Set up the data, log, license, and tmp directories on the host machine
mkdir data
mkdir log
mkdir license
mkdir tmp

6. Copy data into the data directory on the host. The data will be visible inside the Docker container at /<user-
home>/data.
7. Run docker images to find the image tag.
8. Start the Driverless AI Docker image and replace TAG below with the image tag. Note that GPU support will
not be available.

$ docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
h2oai/dai-centos7-ppc64le:TAG

Driverless AI will begin running:

--------------------------------
Welcome to H2O.ai's Driverless AI
---------------------------------

- Put data in the volume mounted at /data


- Logs are written to the volume mounted at /log/20180606-044258
- Connect to Driverless AI on port 12345 inside the container
- Connect to Jupyter notebook on port 8888 inside the container

9. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.


Stopping the Docker Image

To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.

Upgrading the Docker Image

This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:

# Set up directory with the version name


mkdir dai_rel_VERSION

# cd into the new directory


cd dai_rel_VERSION

3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load the Driverless
AI image. If necessary, replace VERSION with your image version.

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz


5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:

# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp

At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

8.3.2 IBM RPMs

For IBM machines that will not use the Docker image or DEB, an RPM installation is available for ppc64le RHEL 7.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the
license folder that you will create during the installation process.
Note: OpenCL and LightGBM with GPUs are not supported on Power currently.

Requirements

• RedHat 7
• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• cuDNN >= 7.4.1 (Required only if using TensorFlow.)
• Driverless AI RPM, available from https://www.h2o.ai/download/

About the Install

• The ‘dai’ service user is created locally (in /etc/passwd) if it is not found by ‘getent passwd’. You can override
the user by providing the DAI_USER environment variable during rpm or dpkg installation.
• The ‘dai’ service group is created locally (in /etc/group) if it is not found by ‘getent group’. You can override
the group by providing the DAI_GROUP environment variable during rpm or dpkg installation.
• Configuration files are put in /etc/dai and owned by the ‘root’ user:
– /etc/dai/config.toml: Driverless AI config file (See Using the config.toml File section for details)
– /etc/dai/User.conf: Systemd config file specifying the service user
– /etc/dai/Group.conf: Systemd config file specifying the service group
– /etc/dai/EnvironmentFile.conf: Systemd config file specifying (optional) environment variable overrides
• Software files are put in /opt/h2oai/dai and owned by the ‘root’ user
• The following directories are owned by the service user so they can be updated by the running software:
– /opt/h2oai/dai/home: The application’s home directory (license key files are stored here)


– /opt/h2oai/dai/tmp: Experiments and imported data are stored here


– /opt/h2oai/dai/log: Log files go here if you are not using systemd (if you are using systemd, use the standard
journalctl tool)
• By default, Driverless AI looks for a license key in /opt/h2oai/dai/home/.driverlessai/license.sig. If you are
installing Driverless AI programmatically, you can copy a license key file to that location. If no license key is
found, the application will interactively guide you to add one from the Web UI.
• systemd unit files are put in /usr/lib/systemd/system
• Symbolic links to the configuration files in /etc/dai files are put in /etc/systemd/system
If your environment is running an operational systemd, that is the preferred way to manage Driverless AI. The package
installs the following systemd services and a wrapper service:
• dai: Wrapper service that starts/stops the other services
• dai-dai: Main Driverless AI process
• dai-h2o: H2O-3 helper process used by Driverless AI
• dai-procsy: Procsy helper process used by Driverless AI
• dai-vis-server: Visualization server helper process used by Driverless AI
If you don’t have systemd, you can also use the provided run script to start Driverless AI.

Installing Driverless AI

Run the following commands to install the Driverless AI RPM. Replace VERSION with your specific version.

# Install Driverless AI.


sudo rpm -i dai-VERSION.rpm

By default, the Driverless AI processes are owned by the ‘dai’ user and ‘dai’ group. You can optionally specify a
different service user and group as shown below. Replace <myuser> and <mygroup> as appropriate.

# Temporarily specify service user and group when installing Driverless AI.
# rpm saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files.
sudo DAI_USER=myuser DAI_GROUP=mygroup rpm -i dai-VERSION.rpm

You may now optionally make changes to /etc/dai/config.toml.
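
If you are installing programmatically, you can also place a license key in the location noted under About the
Install. A minimal sketch, assuming the default ‘dai’ service user and group:

# Place an existing license.sig where Driverless AI looks for it.
sudo mkdir -p /opt/h2oai/dai/home/.driverlessai
sudo cp /path/to/license.sig /opt/h2oai/dai/home/.driverlessai/license.sig
sudo chown -R dai:dai /opt/h2oai/dai/home/.driverlessai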

Starting Driverless AI

If you have systemd (preferred):

# Start Driverless AI.


sudo systemctl start dai

If you do not have systemd:

# Start Driverless AI.


sudo -H -u dai /opt/h2oai/dai/run-dai.sh


Starting NVIDIA Persistence Mode

If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

Looking at Driverless AI log files

If you have systemd (preferred):

sudo systemctl status dai-dai


sudo systemctl status dai-h2o
sudo systemctl status dai-procsy
sudo systemctl status dai-vis-server
sudo journalctl -u dai-dai
sudo journalctl -u dai-h2o
sudo journalctl -u dai-procsy
sudo journalctl -u dai-vis-server

If you do not have systemd:

sudo less /opt/h2oai/dai/log/dai.log


sudo less /opt/h2oai/dai/log/h2o.log
sudo less /opt/h2oai/dai/log/procsy.log
sudo less /opt/h2oai/dai/log/vis-server.log

Stopping Driverless AI

If you have systemd (preferred):

# Stop Driverless AI.


sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

If you do not have systemd:

# Stop Driverless AI.


sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai


Upgrading Driverless AI

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

If you have systemd (preferred):


# Stop Driverless AI.
sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Upgrade and restart.


sudo rpm -U dai-NEWVERSION.rpm
sudo systemctl daemon-reload
sudo systemctl start dai

If you do not have systemd:


# Stop Driverless AI.
sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time. If you do not, all


˓→previous data will be lost.

# Upgrade and restart.


sudo rpm -U dai-NEWVERSION.rpm
sudo -H -u dai /opt/h2oai/dai/run-dai.sh

Uninstalling Driverless AI

If you have systemd (preferred):

# Stop Driverless AI.


sudo systemctl stop dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Uninstall.
sudo rpm -e dai

If you do not have systemd:

# Stop Driverless AI.


sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Uninstall.
sudo rpm -e dai

CAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot
be undone):

sudo rm -rf /opt/h2oai/dai


sudo rm -rf /etc/dai

8.3.3 IBM TAR SH

The Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive.
This form of installation does not require a privileged user to install or to run.
This artifact has the same compatibility matrix as the RPM and DEB packages (combined); it just comes packaged
slightly differently. See those sections for a full list of supported environments.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to
paste the license key into the Driverless AI UI when you first log in.
Note: OpenCL and LightGBM with GPUs are not supported on Power currently.


Requirements

• RedHat 7 or Ubuntu 16.04 (not fully tested)


• CUDA 10 or later with NVIDIA drivers >= 410 (GPU only)
• cuDNN >=7.2.1 (Required only if using TensorFlow.)
• Driverless AI TAR SH, available from https://www.h2o.ai/download/

Installing Driverless AI

Run the following commands to install the Driverless AI TAR SH. Replace VERSION with your specific version.

# Install Driverless AI.


chmod 755 dai-VERSION.sh
./dai-VERSION.sh

You may now cd to the unpacked directory and optionally make changes to config.toml.

Starting Driverless AI

# Start Driverless AI.


./run-dai.sh

Starting NVIDIA Persistence Mode

If you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every
reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html.

sudo nvidia-persistenced --persistence-mode

Looking at Driverless AI log files

less log/dai.log
less log/h2o.log
less log/procsy.log
less log/vis-server.log

Stopping Driverless AI

# Stop Driverless AI.


./kill-dai.sh


Uninstalling Driverless AI

To uninstall Driverless AI, just remove the directory created by the unpacking process. By default, all files for Driver-
less AI are contained within this directory.

Upgrading Driverless AI

WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

1. Stop your previous version of Driverless AI.


2. Run the self-extracting archive for the new version of Driverless AI.
3. Port any previous changes you made to your config.toml file to the newly unpacked directory.
4. Copy the tmp directory (which contains all the Driverless AI working state) from your previous Driverless AI
installation into the newly unpacked directory.
5. Start your newly extracted version of Driverless AI.
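
Taken together, a minimal sketch of these upgrade steps might look like the following (the OLD and NEW directory names are placeholders for your actual unpacked directories):

# 1. Stop the old version.
cd dai-OLDVERSION
./kill-dai.sh
cd ..

# 2. Unpack the new version.
chmod 755 dai-NEWVERSION.sh
./dai-NEWVERSION.sh

# 3. Port your config.toml changes to the new directory
#    (re-apply edits by hand rather than overwriting wholesale if the new
#    release added options).
cp dai-OLDVERSION/config.toml dai-NEWVERSION/config.toml

# 4. Copy the working state (tmp directory) into the new directory.
mkdir -p dai-NEWVERSION/tmp
cp -a dai-OLDVERSION/tmp/. dai-NEWVERSION/tmp/

# 5. Start the new version.
cd dai-NEWVERSION
./run-dai.sh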


8.3.4 Troubleshooting IBM Installations

Opening Port 12345

For default IBM Power9 systems with RHEL 7 installed, be sure to open port 12345 in the firewall. For example:

firewall-cmd --zone=public --add-port=12345/tcp --permanent
firewall-cmd --reload
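
To confirm the rule was applied, you can list the ports currently open in the public zone (an optional sanity check):

firewall-cmd --zone=public --list-ports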

Growing the Disk

Some users may find it necessary to grow their disk. An example describing how to add disk space to a virtual
machine is available at https://www.geoffstratton.com/expand-hard-disk-ubuntu-lvm. The steps for an IBM Power9
system with RHEL 7 would be similar.

8.4 Mac OS X

This section describes how to install, start, stop, and upgrade the Driverless AI Docker image on Mac OS X. Note that
this uses regular Docker and not NVIDIA Docker.
Notes:
• GPU support is not available on Mac OS X.
• Scoring is not available on Mac OS X.
The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license
key for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained, you will be prompted to paste the license
key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that
you will create during the installation process.
Caution:
• This is an extremely memory-constrained environment for experimental purposes only. Stick to small datasets!
For serious use, please use Linux.
• Be aware that there are known performance issues with Docker for Mac. More information is available here:
https://docs.docker.com/docker-for-mac/osxfs/#technology.

8.4.1 Environment

Operating System    GPU Support?    Min Mem    Suitable for
Mac OS X            No              16 GB      Experimentation


8.4.2 Installing Driverless AI

1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.


2. Download and run Docker for Mac from https://docs.docker.com/docker-for-mac/install.
3. Adjust the amount of memory given to Docker to be at least 10 GB. Driverless AI won’t run at all with less than
10 GB of memory. You can optionally adjust the number of CPUs given to Docker. You will find the controls
by clicking on (Docker Whale)->Preferences->Advanced as shown in the following screenshots. (Don’t forget
to Apply the changes after setting the desired memory value.)


4. On the File Sharing tab, verify that your macOS directories (and their subdirectories) can be bind mounted
into Docker containers. More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#namespaces.


5. Set up a directory for the version of Driverless AI within the Terminal, replacing VERSION below with your
Driverless AI Docker image version:
mkdir dai_rel_VERSION

6. With Docker running, open a Terminal and move the downloaded Driverless AI image to your new directory.
7. Change directories to the new directory, then load the image using the following command. This example shows
how to load Driverless AI. Replace VERSION with your image. Note that this process may take some time to
complete.
cd dai_rel_VERSION
docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

8. Set up the data, log, license, and tmp directories (within the new Driverless AI directory):
mkdir data
mkdir log
mkdir license
mkdir tmp

9. Optionally copy data into the data directory on the host. The data will be visible inside the Docker container at
/data. You can also upload data after starting Driverless AI.
10. Run docker images to find the image tag.
11. Start the Driverless AI Docker image (still within the new Driverless AI directory). Replace TAG below with
the image tag. Note that GPU support will not be available.
docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

12. Connect to Driverless AI with your browser at http://localhost:12345.
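
Once the container is up, you can optionally watch the server logs through the log directory mounted from the host (dai.log is the same file referenced in the TAR SH install section above):

tail -f log/dai.log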

8.4.3 Stopping the Docker Image

To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.

8.4.4 Upgrading the Docker Image

This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.
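
Once it is stopped, a minimal sketch of backing up the tmp directory (the destination name is arbitrary; dai_rel_VERSION matches the host directory layout used in the steps below):

# Snapshot the working state before touching the new version.
cp -a dai_rel_VERSION/tmp dai_rel_VERSION_tmp_backup_$(date +%Y%m%d)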

Upgrade Steps

1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:

# Set up directory with the version name


mkdir dai_rel_VERSION

# cd into the new directory


cd dai_rel_VERSION

3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load the Driverless AI
image; replace VERSION with your image version.


# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz

5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:

# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp

At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

8.5 Windows 10 Pro

This section describes how to install, start, stop, and upgrade Driverless AI on a Windows 10 Pro machine. The
installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key
for Driverless AI, visit https://www.h2o.ai/driverless-ai/. Once obtained, you will be prompted to paste the license key
into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that you
will create during the installation process.

8.5.1 Overview of Installation on Windows

The recommended way of installing Driverless AI on Windows is via WSL Ubuntu. Running a Driverless AI Docker
image on Windows is also possible but not preferred.
Notes:
• GPU support is not available on Windows.
• Scoring is not available on Windows.
Caution: This should be used only for experimental purposes and only on small data. For serious use, please use
Linux.

8.5.2 Environment

Operating System    GPU Support?    Min Mem    Suitable for
Windows 10 Pro      No              16 GB      Experimentation


8.5.3 DEB Installs

This section describes how to install the Driverless AI DEB on Windows 10 using Windows Subsystem for Linux
(WSL).

Requirements

• Windows Subsystem for Linux (WSL) must be enabled as specified at https://docs.microsoft.com/en-us/windows/wsl/install-win10
• Ubuntu 18.04 from the Windows Store. (Note that Ubuntu 16.04 for WSL is no longer supported.)
• Driverless AI DEB, available from https://www.h2o.ai/download/.

Installation Procedure

(Note that systemd is not supported for Linux on Windows.)


Run the following commands to install and run the Driverless AI DEB. Replace VERSION with your specific
version.

# Install Driverless AI. Expect installation of the .deb file to take several minutes on WSL.
sudo dpkg -i dai_VERSION.deb

# Run Driverless AI.


sudo -H -u dai /opt/h2oai/dai/run-dai.sh
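
Because systemd is not available under WSL, you can verify and stop the server with plain process tools (the curl check is an optional extra that is not part of the official steps; the ps and pkill commands also appear in the uninstall steps below):

# Verify that the dai processes are running.
sudo ps -u dai

# Optionally confirm the web server answers locally.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:12345

# Stop Driverless AI when you are done.
sudo pkill -U dai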

Upgrading the DEB

The Driverless AI Windows DEB cannot be upgraded. In order to run a newer version, you must first uninstall the
prior version and then install the newer one.
WARNINGS:
• This release deprecates experiments and MLI models from 1.7.0 and earlier.
• Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded
when Driverless AI is upgraded. We recommend you take the following steps before upgrading.
– Build MLI models before upgrading.
– Build MOJO pipelines before upgrading.
– Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not
need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.
Run the following commands to uninstall a prior version.


# Stop Driverless AI.


sudo pkill -U dai

# The processes should now be stopped. Verify.


sudo ps -u dai

# Make a backup of /opt/h2oai/dai/tmp directory at this time.

# Uninstall Driverless AI.


sudo dpkg -r dai

# If the above uninstall command results in a message


# "failed to lookup unit file state: invalid argument,"
# then try the below command to force uninstall.
sudo dpkg --purge --force-all dai

At this point, follow the previous installation procedure to install a newer version of Driverless AI.
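
Putting the pieces together, a minimal sketch of a full upgrade on WSL might look like this (version numbers and the backup location are placeholders; the optional restore step is commented out):

# Stop Driverless AI and verify that it has stopped.
sudo pkill -U dai
sudo ps -u dai

# Back up the working state before removing the old package.
sudo cp -a /opt/h2oai/dai/tmp /opt/h2oai/dai_tmp_backup

# Uninstall the prior version.
sudo dpkg -r dai

# Install the new version.
sudo dpkg -i dai_NEWVERSION.deb

# Optionally restore the backed-up tmp directory.
# sudo cp -a /opt/h2oai/dai_tmp_backup/. /opt/h2oai/dai/tmp/

# Run the new version.
sudo -H -u dai /opt/h2oai/dai/run-dai.sh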

8.5.4 Docker Image Installs

Notes:
• Installing the Driverless AI Docker image on Windows is not the recommended method for running Driverless
AI. RPM and DEB installs are preferred.
• Be aware that there are known issues with Docker for Windows. More information is available here:
https://github.com/docker/for-win/issues/188.
• Consult with your Windows System Admin if
– Your corporate environment does not allow third-party software installs
– You are running Windows Defender
– Your machine is not running with Enable-WindowsOptionalFeature -Online
-FeatureName Microsoft-Windows-Subsystem-Linux.
Watch the installation video here. Note that some of the images in this video may change between releases, but the
installation steps remain the same.

Requirements

• Windows 10 Pro

Installation Procedure

1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/.


2. Download, install, and run Docker for Windows from https://docs.docker.com/docker-for-windows/install/. You
can verify that Docker is running by typing docker version in a terminal (such as Windows PowerShell).
Note that you may have to reboot after installation.
3. Before running Driverless AI, you must:
• Enable shared access to the C drive. Driverless AI will not be able to see your local data if this is
not set.


• Adjust the amount of memory given to Docker to be at least 10 GB. Driverless AI won’t run at all
with less than 10 GB of memory.
• Optionally adjust the number of CPUs given to Docker.
You can adjust these settings by clicking on the Docker whale in your taskbar (look for hidden tasks, if
necessary), then selecting Settings > Shared Drive and Settings > Advanced as shown in the following
screenshots. Don’t forget to Apply the changes after setting the desired memory value. (Docker will
restart.) Note that if you cannot make changes, stop Docker and then start Docker again by right clicking
on the Docker icon on your desktop and selecting Run as Administrator.


4. Open a PowerShell terminal and set up a directory for the version of Driverless AI on the host machine, replacing
VERSION below with your Driverless AI Docker image version:

md dai_rel_VERSION

5. With Docker running, navigate to the location of your downloaded Driverless AI image. Move the downloaded
Driverless AI image to your new directory.
6. Change directories to the new directory, then load the image using the following command. This example shows
how to load Driverless AI. Replace VERSION with your image.

cd dai_rel_VERSION
docker load -i .\dai-docker-centos7-x86_64-VERSION.tar.gz

7. Set up the data, log, license, and tmp directories (within the new directory).

md data
md log
md license
md tmp

8. Copy data into the /data directory. The data will be visible inside the Docker container at /data.
9. Run docker images to find the image tag.
10. Start the Driverless AI Docker image. Be sure to replace path_to_ below with the entire path to the location
of the folders that you created (for example, “c:/Users/user-name/driverlessai_folder/data”), and replace TAG
with the Docker image tag. Note that this is regular Docker, not NVIDIA Docker. GPU support will not be
available.

docker run --pid=host --init --rm --shm-size=256m -p 12345:12345 -v c:/path_to_data:/data -v c:/path_to_log:/log -v c:/path_to_license:/license -v c:/path_to_tmp:/tmp h2oai/dai-centos7-x86_64:TAG

11. Connect to Driverless AI with your browser at http://localhost:12345.


Stopping the Docker Image

To stop the Driverless AI Docker image, type Ctrl + C in the Terminal (Mac OS X) or PowerShell (Windows 10)
window that is running the Driverless AI Docker image.

Upgrading the Docker Image

This section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These
steps ensure that existing experiments are saved.
WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically
upgraded when Driverless AI is upgraded.
• Build MLI models before upgrading.
• Build MOJO pipelines before upgrading.
• Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading.
If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view
MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want
to continue to interpret in future releases. If that MLI job appears in the list of Interpreted Models in your
current version, then it will be retained after upgrading.
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be
able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO
pipelines on all desired models and then back up your Driverless AI tmp directory.
Note: Stop Driverless AI if it is still running.

Requirements

As of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA
drivers >= 410 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the
host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series
driver.

Upgrade Steps

1. SSH into the IP address of the machine that is running Driverless AI.
2. Set up a directory for the version of Driverless AI on the host machine:

# Set up directory with the version name


mkdir dai_rel_VERSION

# cd into the new directory


cd dai_rel_VERSION

3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory.
4. Load the Driverless AI Docker image inside the new directory. This example shows how to load the Driverless AI
image; replace VERSION with your image version.

# Load the Driverless AI docker image


docker load < dai-docker-centos7-x86_64-VERSION.tar.gz


5. Copy the data, log, license, and tmp directories from the previous Driverless AI directory to the new Driverless
AI directory:

# Copy the data, log, license, and tmp directories on the host machine
cp -a dai_rel_1.4.2/data dai_rel_VERSION/data
cp -a dai_rel_1.4.2/log dai_rel_VERSION/log
cp -a dai_rel_1.4.2/license dai_rel_VERSION/license
cp -a dai_rel_1.4.2/tmp dai_rel_VERSION/tmp

At this point, your experiments from the previous versions will be visible inside the Docker container.
6. Use docker images to find the new image tag.
7. Start the Driverless AI Docker image.
8. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

CHAPTER NINE: USING THE CONFIG.TOML FILE

Admins can edit the config.toml file when starting the Driverless AI Docker image. The config.toml file includes all
possible configuration options that would otherwise be specified in the nvidia-docker run command. This file is located
in a folder on the container, and you can make updates to environment variables directly in it. Driverless AI uses the
updated config.toml file when starting from native installs, and Docker users can specify the updated config.toml file
when starting the Driverless AI Docker image.

9.1 Configuration Override Chain

The configuration engine reads and overrides variables in the following order:
1. h2oai/config/config.toml - This is an internal file that is not visible.
2. config.toml - Place this file in a folder or mount it in a Docker container and specify the path in the “DRIVER-
LESS_AI_CONFIG_FILE” environment variable.
3. Environment variable - Configuration variables can also be provided as environment variables. They must have
the prefix DRIVERLESS_AI_ followed by the variable name in all caps. For example, “authentication_method”
can be provided as “DRIVERLESS_AI_AUTHENTICATION_METHOD”.
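
As a concrete illustration of this ordering (a sketch only; the values are arbitrary, and the example assumes that sources later in the list override earlier ones, as the list implies):

# In the config.toml pointed to by DRIVERLESS_AI_CONFIG_FILE:
#   authentication_method = "unvalidated"

# Environment variables are read last, so this value takes effect at startup:
export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"
export DRIVERLESS_AI_AUTHENTICATION_METHOD="ldap"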

9.2 Docker Image Users

1. Copy the config.toml file from inside the Docker image to your local filesystem.

# Make a config directory


mkdir config

# Copy the config.toml file to the new config directory.


nvidia-docker run \
--pid=host \
--rm \
--init \
-u `id -u`:`id -g` \
-v `pwd`/config:/config \
--entrypoint bash \
h2oai/dai-centos7-x86_64:TAG \
-c "cp /etc/dai/config.toml /config"

2. Edit the desired variables in the config.toml file. Save your changes when you are done.
3. Start Driverless AI with the DRIVERLESS_AI_CONFIG_FILE environment variable. Make sure this points to
the location of the edited config.toml file so that the software finds the configuration file.


nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

9.3 Native Install Users

Native installs include DEBs, RPMs, and TAR SH installs.


1. Export the DRIVERLESS_AI_CONFIG_FILE environment variable (or add it to ~/.bashrc). For example:

export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"

2. Edit the desired variables in the config.toml file. Save your changes when you are done.
3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
For reference, below is a copy of the standard config.toml file included with this version of Driverless AI. The sections
that follow describe some examples showing how to set different environment variables, data connectors, authentication,
and notifications.

9.4 Sample Config.toml File


1
2 ##############################################################################
3 # DRIVERLESS AI CONFIGURATION FILE
4 #
5 # Comments:
6 # This file is authored in TOML (see https://github.com/toml-lang/toml)
7 #
8 # Config Override Chain
9 # Configuration variables for Driverless AI can be provided in several ways,
10 # the config engine reads and overrides variables in the following order
11 #
12 # 1. h2oai/config/config.toml
13 # [internal not visible to users]
14 #
15 # 2. config.toml
16 # [place file in a folder/mount file in docker container and provide path
17 # in "DRIVERLESS_AI_CONFIG_FILE" environment variable]
18 #
19 # 3. Environment variable
20 # [configuration variables can also be provided as environment variables
21 # they must have the prefix "DRIVERLESS_AI_" followed by
22 # variable name in caps e.g "authentication_method" can be provided as
23 # "DRIVERLESS_AI_AUTHENTICATION_METHOD"]
24 ##############################################################################
25
26 # Whether to allow user to change non-server toml parameters per experiment in expert page.
27 #allow_config_overrides_in_expert_page = true
28
29 # Every *.toml file is read from this directory and process the same way as main config file.
30 #user_config_directory = ""
31
32 # IP address and port of autoviz process.
33 #vis_server_ip = "127.0.0.1"
34
35 # IP and port of autoviz process.
36 #vis_server_port = 12346
37



38 # IP address and port of procsy process.
39 #procsy_ip = "127.0.0.1"
40
41 # IP address and port of procsy process.
42 #procsy_port = 12347
43
44 # IP address and port of H2O instance.
45 #h2o_ip = "127.0.0.1"
46
47 # IP address and port of H2O instance for use by MLI.
48 #h2o_port = 12348
49
50 # Enable h2o recipes server.
51 #enable_h2o_recipes = true
52
53 # URL of H2O instance for use by transformers, models, or scorers.
54 #h2o_recipes_url = "None"
55
56 # IP of H2O instance for use by transformers, models, or scorers.
57 #h2o_recipes_ip = "None"
58
59 # Port of H2O instance for use by transformers, models, or scorers. No other instances must be on that port or on next port.
60 #h2o_recipes_port = 50341
61
62 # Name of H2O instance for use by transformers, models, or scorers.
63 #h2o_recipes_name = "None"
64
65 # Number of threads for H2O instance for use by transformers, models, or scorers.
66 #h2o_recipes_nthreads = 4
67
68 # Log Level of H2O instance for use by transformers, models, or scorers.
69 #h2o_recipes_log_level = "None"
70
71 # Maximum memory size of H2O instance for use by transformers, models, or scorers.
72 #h2o_recipes_max_mem_size = "None"
73
74 # Minimum memory size of H2O instance for use by transformers, models, or scorers.
75 #h2o_recipes_min_mem_size = "None"
76
77 # General user overrides of kwargs dict to pass to h2o.init() for recipe server
78 #h2o_recipes_kwargs = "{}"
79
80 # IP address and port for Driverless AI HTTP server.
81 #ip = "127.0.0.1"
82
83 # IP address and port for Driverless AI HTTP server.
84 #port = 12345
85
86 # A list of two integers indicating the port range to search over, and dynamically find an open port to bind to (e.g., [11111,20000]).
87 #port_range = "[]"
88
89 # File upload limit (default 100GB)
90 #max_file_upload_size = 104857600000
91
92 # Verbosity of logging
93 # 0: quiet (CRITICAL, ERROR, WARNING)
94 # 1: default (CRITICAL, ERROR, WARNING, INFO, DATA)
95 # 2: verbose (CRITICAL, ERROR, WARNING, INFO, DATA, DEBUG)
96 # Affects server and all experiments
97 #log_level = 1
98
99 # Whether to collect relevant server logs (h2oai_server.log, dai.log from systemctl or docker, and h2o log)
100 # Useful for when sending logs to H2O.ai
101 #collect_server_logs_in_experiment_logs = false
102
103 # Redis settings
104 #redis_ip = "127.0.0.1"
105
106 # Redis settings
107 #redis_port = 6379
108
109 # Redis settings
110 #master_redis_password = ""
111
112 # https settings
113 # You can make a self-signed certificate for testing with the following commands:
114 # sudo openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 3650 -nodes -subj '/O=Driverless AI'
115 # sudo chown dai:dai cert.pem private_key.pem
116 # sudo chmod 600 cert.pem private_key.pem
117 # sudo mv cert.pem private_key.pem /etc/dai
118 #enable_https = false
119
120 # https settings
121 # You can make a self-signed certificate for testing with the following commands:
122 # sudo openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 3650 -nodes -subj '/O=Driverless AI'
123 # sudo chown dai:dai cert.pem private_key.pem
124 # sudo chmod 600 cert.pem private_key.pem
125 # sudo mv cert.pem private_key.pem /etc/dai
126 #ssl_key_file = "/etc/dai/private_key.pem"
127
128 # https settings
129 # You can make a self-signed certificate for testing with the following commands:
130 # sudo openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 3650 -nodes -subj '/O=Driverless AI'
131 # sudo chown dai:dai cert.pem private_key.pem
132 # sudo chmod 600 cert.pem private_key.pem
133 # sudo mv cert.pem private_key.pem /etc/dai
134 #ssl_crt_file = "/etc/dai/cert.pem"
135
136 # SSL TLS
137 #ssl_no_sslv2 = true
138
139 # SSL TLS
140 #ssl_no_sslv3 = true
141



142 # SSL TLS
143 #ssl_no_tlsv1 = true
144
145 # SSL TLS
146 #ssl_no_tlsv1_1 = true
147
148 # SSL TLS
149 #ssl_no_tlsv1_2 = false
150
151 # SSL TLS
152 #ssl_no_tlsv1_3 = false
153
154 # https settings
155 # Sets the client verification mode.
156 # CERT_NONE: Client does not need to provide the certificate and if it does any
157 # verification errors are ignored.
158 # CERT_OPTIONAL: Client does not need to provide the certificate and if it does
159 # certificate is verified agains set up CA chains.
160 # CERT_REQUIRED: Client needs to provide a certificate and certificate is
161 # verified.
162 # You'll need to set 'ssl_client_key_file' and 'ssl_client_crt_file'
163 # When this mode is selected for Driverless to be able to verify
164 # it's own callback requests.
165 #
166 #ssl_client_verify_mode = "CERT_NONE"
167
168 # https settings
169 # Path to the Certification Authority certificate file. This certificate will be
170 # used when to verify client certificate when client authentication is turned on.
171 # If this is not set, clients are verified using default system certificates.
172 #
173 #ssl_ca_file = ""
174
175 # https settings
176 # path to the private key that Driverless will use to authenticate itself when
177 # CERT_REQUIRED mode is set.
178 #
179 #ssl_client_key_file = ""
180
181 # https settings
182 # path to the client certificate that Driverless will use to authenticate itself
183 # when CERT_REQUIRED mode is set.
184 #
185 #ssl_client_crt_file = ""
186
187 # Data directory. All application data and files related datasets and
188 # experiments are stored in this directory.
189 #data_directory = "./tmp"
190
191 # Whether to run quick performance benchmark at start of application
192 #enable_quick_benchmark = true
193
194 # Whether to run extended performance benchmark at start of application
195 #enable_extended_benchmark = false
196
197 # Scaling factor for number of rows for extended performance benchmark. For rigorous performance benchmarking,
198 # values of 1 or larger are recommended.
199 #extended_benchmark_scale_num_rows = 0.1
200
201 # Whether to run quick startup checks at start of application
202 #enable_startup_checks = true
203
204 # Whether to opt in to usage statistics and bug reporting
205 #usage_stats_opt_in = false
206
207 # authentication_method
208 # unvalidated : Accepts user id and password. Does not validate password.
209 # none: Does not ask for user id or password. Authenticated as admin.
210 # openid: Users OpenID Connect provider for authentication. See additional OpenID settings below.
211 # pam: Accepts user id and password. Validates user with operating system.
212 # ldap: Accepts user id and password. Validates against an ldap server. Look
213 # for additional settings under LDAP settings.
214 # local: Accepts a user id and password. Validated against an htpasswd file provided in local_htpasswd_file.
215 # ibm_spectrum_conductor: Authenticate with IBM conductor auth api.
216 # tls_certificate: Authenticate with Driverless by providing a TLS certificate.
217 #
218 #authentication_method = "unvalidated"
219
220 # default amount of time in hours before we force user to login again (if not provided by authentication_method)
221 #authentication_default_timeout_hours = 72
222
223 # OpenID Connect Settings:
224 # Refer to OpenID Connect Basic Client Implementation Guide for details on how OpenID authentication flow works
225 # https://openid.net/specs/openid-connect-basic-1_0.html
226 # base server uri to the OpenID Provider server (ex: https://oidp.ourdomain.com
227 #auth_openid_provider_base_uri = ""
228
229 # uri to pull OpenID config data from (you can extract most of required OpenID config from this url)
230 # usually located at: /auth/realms/master/.well-known/openid-configuration
231 #auth_openid_configuration_uri = ""
232
233 # uri to start authentication flow
234 #auth_openid_auth_uri = ""
235
236 # uri to make request for token after callback from OpenID server was received
237 #auth_openid_token_uri = ""
238
239 # uri to get user information once access_token has been acquired (ex: list of groups user belongs to will be provided here)
240 #auth_openid_userinfo_uri = ""
241
242 # uri to logout user
243 #auth_openid_logout_uri = ""
244
245 # callback uri that OpenID provide will use to send 'authentication_code'



246 # This is OpenID callback endpoint in Driverless AI. Most OpenID providers need this to be HTTPs.
247 # (ex. https://driverless.ourdomin.com/openid/callback)
248 #auth_openid_redirect_uri = ""
249
250 # OAuth2 grant type (usually authorization_code for OpenID, can be access_token also)
251 #auth_openid_grant_type = ""
252
253 # OAuth2 response type (usually code)
254 #auth_openid_response_type = ""
255
256 # Client ID registered with OpenID provider
257 #auth_openid_client_id = ""
258
259 # Client secret provided by OpenID provider when registering Client ID
260 #auth_openid_client_secret = ""
261
262 # Scope of info (usually openid). Can be list of more than one, space delimited, possible
263 # values listed at https://openid.net/specs/openid-connect-basic-1_0.html#Scopes
264 #auth_openid_scope = ""
265
266 # What key in user_info json should we check to authorize user
267 #auth_openid_userinfo_auth_key = ""
268
269 # What value should the key have in user_info json in order to authorize user
270 #auth_openid_userinfo_auth_value = ""
271
272 # Key that specifies username in user_info json (we will use the value of this key as username in Driverless AI)
273 #auth_openid_userinfo_username_key = ""
274
275 # Quote method from urllib.parse used to encode payload dict in Authentication Request
276 #auth_openid_urlencode_quote_via = "quote"
277
278 # Key in Token Response JSON that holds the value for access token expiry
279 #auth_openid_access_token_expiry_key = "expires_in"
280
281 # Key in Token Response JSON that holds the value for access token expiry
282 #auth_openid_refresh_token_expiry_key = "refresh_expires_in"
283
284 # Expiration time in seconds for access token
285 #auth_openid_token_expiration_secs = 3600
286
287 # Enables advanced matching for OpenID Connect authentication.
288 # When enabled ObjectPath (<http://objectpath.org/>) expression is used to
289 # evaluate the user identity.
290 #
291 #auth_openid_use_objectpath_match = false
292
293 # ObjectPath (<http://objectpath.org/>) expression that will be used
294 # to evaluate whether user is allowed to login into Driverless.
295 # Any expression that evaluates to True means user is allowed to log in.
296 # Examples:
297 # Simple claim equality: `$.our_claim is "our_value"`
298 # List of claims contains required value: `"expected_role" in @.roles`
299 #
300 #auth_openid_use_objectpath_expression = ""
301
302 # ldap server domain or ip
303 #ldap_server = ""
304
305 # ldap server port
306 #ldap_port = ""
307
308 # Complete DN of the LDAP bind user
309 #ldap_bind_dn = ""
310
311 # Password for the LDAP bind
312 #ldap_bind_password = ""
313
314 # Provide Cert file location
315 #ldap_tls_file = ""
316
317 # use true to use ssl or false
318 #ldap_use_ssl = false
319
320 # the location in the DIT where the search will start
321 #ldap_search_base = ""
322
323 # A string that describes what you are searching for. You can use Pythonsubstitution to have this constructed dynamically.(only {{DAI_USERNAME}} is
˓→supported)
324 #ldap_search_filter = ""
325
326 # ldap attributes to return from search
327 #ldap_search_attributes = ""
328
329 # specify key to find user name
330 #ldap_user_name_attribute = ""
331
332 # When using this recipe, needs to be set to "1"
333 #ldap_recipe = "0"
334
335 # Deprecated do not use
336 #ldap_user_prefix = ""
337
338 # Deprecated, Use ldap_bind_dn
339 #ldap_search_user_id = ""
340
341 # Deprecated, ldap_bind_password
342 #ldap_search_password = ""
343
344 # Deprecated, use ldap_search_base instead
345 #ldap_ou_dn = ""
346
347 # Deprecated, use ldap_base_dn
348 #ldap_dc = ""



349
350 # Deprecated, use ldap_search_base
351 #ldap_base_dn = ""
352
353 # Deprecated, use ldap_search_filter
354 #ldap_base_filter = ""
355
356 # Path to the CRL file that will be used to verify client certificate.
357 #auth_tls_crl_file = ""
358
359 # What field of the subject would used as source for username or other values used for further validation.
360 #auth_tls_subject_field = "CN"
361
362 # Regular expression that will be used to parse subject field to obtain the username or other values used for further validation.
363 #auth_tls_field_parse_regexp = "(?P<username>.*)"
364
365 # Sets up the way how user identity would be obtained
366 # REGEXP_ONLY: Will use 'auth_tls_subject_field' and 'auth_tls_field_parse_regexp'
367 # to extract the username from the client certificate.
368 # LDAP_LOOKUP: Will use LDAP server to lookup for the username.
369 # 'ldap_server', 'ldap_use_ssl', 'ldap_tls_file', 'ldap_bind_dn',
370 # 'ldap_bind_password' options are used to establish
371 # the connection with the LDAP server.
372 # 'auth_tls_subject_field' and 'auth_tls_field_parse_regexp'
373 # options are used to parse the certificate.
374 # 'ldap_search_base', 'ldap_search_filter', and
375 # 'ldap_username_attribute' options are used to do the lookup.
376 # 'ldap_search_filter' can be built dynamically using the named
377 # capturing groups from the 'auth_tls_field_parse_regexp' for
378 # substitution.
379 # Example:
380 # auth_tls_field_parse_regexp = "\w+ (?P<id>\d+)"
381 # ldap_search_filter = "(&(objectClass=person)(id={{id}}))"
382 #
383 #auth_tls_user_lookup = "REGEXP_ONLY"
384
385 # Sets optional additional lookup filter that is performed after the
386 # user is found. This can be used for example to check whether the is member of
387 # particular group.
388 # Filter can be built dynamically from the attributes returned by the lookup.
389 # Authorization fails when search does not return any entry. If one ore more
390 # entries are returned authorization succeeds.
391 # Example:
392 # auth_tls_field_parse_regexp = "\w+ (?P<id>\d+)"
393 # ldap_search_filter = "(&(objectClass=person)(id={{id}}))"
394 # auth_tls_ldap_authorization_lookup_filter = "(&(objectClass=group)(member=uid={{uid}},dc=example,dc=com))"
395 # If this option is empty no additional lookup is done and just a successful user
396 # lookup is enough to authorize the user.
397 #
398 #auth_tls_ldap_authorization_lookup_filter = ""
399
400 # Base DN where to start the Authorization lookup. Used when 'auth_tls_ldap_authorization_lookup_filter' is set.
401 #auth_tls_ldap_authorization_search_base = ""
402
403 # Local password file
404 # Generating a htpasswd file: see syntax below
405 # htpasswd -B '<location_to_place_htpasswd_file>' '<username>'
406 # note: -B forces use of brcypt, a secure encryption method
407 #local_htpasswd_file = ""
408
409 # Supported file formats (file name endings must match for files to show up in file browser)
410 #supported_file_types = "csv, tsv, txt, dat, tgz, gz, bz2, zip, xz, xls, xlsx, jay, feather, bin, arff, parquet, pkl"
411
412 # Supported file formats of data recipe files (file name endings must match for files to show up in file browser)
413 #recipe_supported_file_types = "py, pyc"
414
415 # By default, only supported file types (based on the file extensions listed above) will be listed for import into DAI
416 # Some data pipelines generate parquet files without any extensions. Enabling the below option will cause files
417 # without an extension to be listed in the file import dialog.
418 # DAI will import files without extensions as parquet files; if cannot be imported, an error is generated
419 #
420 #list_files_without_extensions = false
421
422 # File System Support
423 # upload : standard upload feature
424 # file : local file system/server file system
425 # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
426 # dtap : Blue Data Tap file system, remember to configure the DTap section below
427 # s3 : Amazon S3, optionally configure secret and access key below
428 # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
429 # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
430 # minio : Minio Cloud Storage, remember to configure secret and access key below
431 # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
432 # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
433 # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
434 # jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
435 # recipe_file: Custom recipe file upload
436 # recipe_url: Custom recipe upload via url
437 #
438 #enabled_file_systems = "upload, file, hdfs, s3, recipe_file, recipe_url"
439
440 #max_files_listed = 100
441
442 # do_not_log_list : add configurations that you do not wish to be recorded in logs here
443 #do_not_log_list = "local_htpasswd_file, aws_access_key_id, aws_secret_access_key, snowflake_password, snowflake_url, snowflake_user, snowflake_account,
˓→minio_endpoint_url, minio_access_key_id, minio_secret_access_key, kdb_user, kdb_password, ldap_bind_password, gcs_path_to_service_account_json, azure_
˓→blob_account_name, azure_blob_account_key, deployment_aws_access_key_id, deployment_aws_secret_access_key, master_minio_access_key_id, master_minio_
˓→secret_access_key, master_redis_password, auth_openid_client_id, auth_openid_client_secret, auth_openid_userinfo_auth_key, auth_openid_userinfo_auth_
˓→value, auth_openid_userinfo_username_key"
444
445 # Minio is used for file distribution on multinode architecture.
446 # These settings are used to specify the local Minio connection to master nodes.
447 #master_minio_address = "<URL>:<PORT>"
448



449 # Minio is used for file distribution on multinode architecture.
450 # These settings are used to specify the local Minio connection to master nodes.
451 #master_minio_access_key_id = ""
452
453 # Minio is used for file distribution on multinode architecture.
454 # These settings are used to specify the local Minio connection to master nodes.
455 #master_minio_secret_access_key = ""
456
457 # Allow using browser localstorage, to improve UX.
458 #allow_localstorage = true
459
460 # Allow original dataset columns to be present in downloaded predictions CSV
461 #allow_orig_cols_in_predictions = true
462
463 # If the experiment is not done after this many minutes, stop feature engineering and model tuning as soon as possible and proceed with building the final
˓→modeling pipeline and deployment artifacts, independent of model score convergence or pre-determined number of iterations. Only active is not in
˓→reproducible mode. Depending on the data and experiment settings, overall experiment runtime can differ significantly from this setting.
464 #max_runtime_minutes = 1440
465
466 # If the experiment is not done after this many minutes, push the abort button. Preserves experiment artifacts made so far for summary and log zip files,
˓→but further artifacts are made.
467 #max_runtime_minutes_until_abort = 10080
468
469 # Recipe type
470 # Recipes override any GUI settings
471 # 'auto' : all models and features automatically determined by experiment settings, toml settings, and feature_engineering_effort
472 # 'compliant' : like 'auto' except:
473 # * interpretability=10 (to avoid complexity, overrides GUI or python client chose for interpretability)
474 # * enable_glm='on' (rest 'off', to avoid complexity and be compatible with algorithms supported by MLI)
475 # * fixed_ensemble_level=0: Don't use any ensemble (to avoid complexity)
476 # * feature_brain_level=0: No feature brain used (to ensure every restart is identical)
477 # * max_feature_interaction_depth=1: interaction depth is set to 1 (no multi-feature interactions to avoid complexity)
478 # * target_transformer='identity': for regression (to avoid complexity)
479 # * check_distribution_shift_drop='off': Don't use distribution shift between train, valid, and test to drop features (bit risky without fine-tuning
480 # 'kaggle' : like 'auto' except:
481 # * external validation set is concatenated with train set, with target marked as missing
482 # * test set is concatenated with train set, with target marked as missing
483 # * transformers that do not use the target are allowed to fit_transform across entire train + validation + test
484 # * several config toml expert options open-up limits (e.g. more numerics are treated as categoricals)
485 # Note: If plentiful memory, can:
486 # * choose kaggle mode and then change fixed_feature_interaction_depth to large negative number,
487 # otherwise default number of features given to transformer is limited to 50 by default
488 # * choose mutation_mode = "full", so even more types are transformations are done at once per transformer
489 #
490 #recipe = "auto"
491
492 # How much effort to spend on feature engineering (0...10)
493 # Heuristic combination of various developer-level toml parameters
494 # 0 : keep only numeric features, only model tuning during evolution
495 # 1 : keep only numeric features and frequency-encoded categoricals, only model tuning during evolution
496 # 2 : Like #1 but instead just no Text features. Some feature tuning before evolution.
497 # 3 : Like #5 but only tuning during evolution. Mixed tuning of features and model parameters.
498 # 4 : Like #5, but slightly more focused on model tuning
499 # 5 : Default. Balanced feature-model tuning
500 # 6-7 : Like #5, but slightly more focused on feature engineering
501 # 8 : Like #6-7, but even more focused on feature engineering with high feature generation rate, no feature dropping even if high interpretability
502 # 9-10: Like #8, but no model tuning during feature evolution
503 #feature_engineering_effort = 5
504
505 # Whether to enable train/valid and train/test distribution shift detection ('auto'/'on'/'off')
506 #check_distribution_shift = "auto"
507
508 # Whether to drop high-shift features ('auto'/'on'/'off'). Auto disables for time series.
509 #check_distribution_shift_drop = "auto"
510
511 # If distribution shift detection is enabled, drop features (except ID, text, date/datetime, time, weight) for
512 # which shift AUC is above this value (AUC of a binary classifier that predicts whether given feature value
513 # belongs to train or test data)
514 #drop_features_distribution_shift_threshold_auc = 0.999
515
516 # Whether to check leakage for each feature (True/False).
517 # If fold column, this checks leakage without fold column used.
518 #check_leakage = "auto"
519
520 # If leakage detection is enabled,
521 # drop features for which AUC (R2 for regression) is above this value.
522 # If fold column present, features are not dropped,
523 # because leakage test applies without fold column used.
524 #
525 #drop_features_leakage_threshold_auc = 0.999
526
527 # Max number of rows x number of columns to trigger (stratified) sampling for leakage checks
528 #leakage_max_data_size = 10000000
529
530 # Whether to create the Python scoring pipeline at the end of each experiment.
531 #make_python_scoring_pipeline = "auto"
532
533 # Whether to create the MOJO scoring pipeline at the end of each experiment. If set to "auto", will attempt to
534 # create it if possible (without dropping capabilities). If set to "on", might need to drop some models,
535 # transformers or custom recipes.
536 #make_mojo_scoring_pipeline = "auto"
537
538 # Whether to measure the MOJO scoring latency at the time of MOJO creation.
539 #benchmark_mojo_latency = "auto"
540
541 # Max size of pipeline.mojo file (in MB) for automatic mode of MOJO scoring latency measurement
542 #benchmark_mojo_latency_auto_size_limit = 100
543
544 # If MOJO creation times out at end of experiment, can still make MOJO from the GUI or from the R/Py clients (timeout doesn't apply there).
545 #mojo_building_timeout = 1800.0
546
547 # If MOJO creation is too slow, increase this value. Higher values can finish faster, but use more memory.
548 # If MOJO creation fails due to an out-of-memory error, reduce this value to 1.
549 # Set to -1 for all physical cores.



550 #
551 #mojo_building_parallelism = -1
552
553 # Whether to create the pipeline visualization at the end of each experiment.
554 #make_pipeline_visualization = "auto"
555
556 # Whether to create the experiment Autoreport after end of experiment.
557 #
558 #make_autoreport = true
559
560 # Max number of CPU cores to use per experiment. Set to <= 0 to use all cores.
561 # One can also set environment variable 'OMP_NUM_THREADS' to number of cores to use for OpenMP
562 # (e.g., in bash: 'export OMP_NUM_THREADS=32' and 'export OPENBLAS_NUM_THREADS=32').
563 #max_cores = 0
564
565 # Max number of CPU cores to use across all of DAI experiments and tasks.
566 # -1 is all available, with stall_subprocess_submission_dai_fork_threshold_count=0 means restricted to core count.
567 #
568 #max_cores_dai = -1
569
570 # Stall submission of tasks if total DAI fork count exceeds count (-1 to disable, 0 for automatic of max_cores_dai)
571 #stall_subprocess_submission_dai_fork_threshold_count = 0
572
573 # Stall submission of tasks if system memory available is less than this threshold in percent (set to 0 to disable).
574 # Above this threshold, the number of workers in any pool of workers is linearly reduced down to 1 once hitting this threshold.
575 #
576 #stall_subprocess_submission_mem_threshold_pct = 2
577
578 # Whether to set automatic number of cores by physical (True) or logical (False) count.
579 # Using all logical cores can lead to poor performance due to cache thrashing.
580 #max_cores_by_physical = true
581
582 # Absolute limit to core count
583 #max_cores_limit = 100
584
585 # Control maximum number of cores to use for a model's fit call (0 = all physical cores >= 1 that count)
586 #max_fit_cores = 10
587
588 # Control maximum number of cores to use for a model's predict call (0 = all physical cores >= 1 that count)
589 #max_predict_cores = 0
590
591 # Control maximum number of cores to use for a model's transform and predict call when doing operations inside DAI-MLI GUI and R/Py client (0 = all
˓→physical cores >= 1 that count)
592 #max_predict_cores_in_dai = 4
593
594 # Control number of workers used in CPU mode for tuning (0 = socket count -1 = all physical cores >= 1 that count). More workers will be more parallel
˓→but models learn less from each other.
595 #batch_cpu_tuning_max_workers = 0
596
597 # Control number of workers used in CPU mode for training (0 = socket count -1 = all physical cores >= 1 that count)
598 #cpu_max_workers = 0
599
600 # Number of GPUs to use per experiment for training task. Set to -1 for all GPUs.
601 # An experiment will generate many different models.
602 # Currently num_gpus_per_experiment!=-1 disables GPU locking, so is only recommended for
603 # single experiments and single users.
604 # Ignored if GPUs disabled or no GPUs on system.
605 # More info at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation
606 #num_gpus_per_experiment = -1
607
608 # Number of CPU cores per GPU. Limits number of GPUs in order to have sufficient cores per GPU.
609 # Set to -1 to disable.
610 #min_num_cores_per_gpu = 2
611
612 # Number of GPUs to use per model training task. Set to -1 for all GPUs.
613 # For example, when this is set to -1 and there are 4 GPUs available, all of them can be used for the training of a single model.
614 # Currently num_gpus_per_model!=1 disables GPU locking, so is only recommended for single
615 # experiments and single users.
616 # Ignored if GPUs disabled or no GPUs on system.
617 # More info at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation
618 #num_gpus_per_model = 1
619
620 # Number of GPUs to use for predict for models and transform for transformers when running outside of fit/fit_transform.
621 # If predict/transform are called in same process as fit/fit_transform, number of GPUs will match,
622 # while new processes will use this count for number of GPUs for applicable models/transformers.
623 # If tensorflow_nlp_have_gpus_in_production=true, then that overrides this setting for relevant
624 # TensorFflow NLP transformers.
625 #
626 #num_gpus_for_prediction = 0
627
628 # Minimum number of threads for datatable (and OpenMP) during data munging (per process).
629 # datatable is the main data munging tool used within Driverless ai (source :
630 # https://github.com/h2oai/datatable)
631 #min_dt_threads_munging = 1
632
633 # Like min_datatable (and OpenMP)_threads_munging but for final pipeline munging
634 #min_dt_threads_final_munging = 1
635
636 # Maximum number of threads for datatable during data munging (per process) (0 = all, -1 = auto).
637 #max_dt_threads_munging = -1
638
639 # Maximum number of threads for datatable during data reading and writing (per process) (0 = all, -1 = auto).
640 #max_dt_threads_readwrite = -1
641
642 # Maximum number of threads for datatable stats and openblas (per process) (0 = all, -1 = auto).
643 #max_dt_threads_stats_openblas = -1
644
645 # Maximum number of threads for datatable during TS properties preview panel computations).
646 #max_dt_threads_do_timeseries_split_suggestion = 1
647
648 # Which gpu_id to start with
649 # If using CUDA_VISIBLE_DEVICES=... to control GPUs (preferred method), gpu_id=0 is the
650 # first in that restricted list of devices.
651 # E.g. if CUDA_VISIBLE_DEVICES='4,5' then gpu_id_start=0 will refer to the



652 # device #4.
653 # E.g. from expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs:
654 # Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0
655 # Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1
656 # E.g. from expert mode, to run 2 experiments, each on a distinct GPU out of 8 GPUs:
657 # Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0
658 # Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4
659 # E.g. Like just above, but now run on all 4 GPUs/model
660 # Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0
661 # Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4
662 # If num_gpus_per_model!=1, global GPU locking is disabled
663 # (because underlying algorithms don't support arbitrary gpu ids, only sequential ids),
664 # so must setup above correctly to avoid overlap across all experiments by all users
665 # More info at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation
666 # Note that gpu selection does not wrap, so gpu_id_start + num_gpus_per_model must be less than number of visibile gpus
667 #gpu_id_start = 0
668
669 # Maximum number of workers for Driverless AI server pool (only 1 needed currently)
670 #max_workers = 1
671
672 # Period (in seconds) of ping by Driverless AI server to each experiment
673 # (in order to get logger info like disk space and memory usage).
674 # 0 means don't print anything.
675 #ping_period = 60
676
677 # Period between checking DAI status.
678 #ping_sleep_period = 1
679
680 # Minimum amount of disk space in GB needed to run experiments.
681 # Experiments will fail if this limit is crossed.
682 # This limit exists because Driverless AI needs to generate data for model training
683 # feature engineering, documentation and other such processes.
684 #disk_limit_gb = 5
685
686 # Minimum amount of disk space in GB needed before forking of new processes is stalled during an experiment.
687 #stall_disk_limit_gb = 1
688
689 # Minimum amount of system memory in GB needed to start experiments.
690 # As with disk space, a certain amount of system memory is needed to run some basic
691 # operations.
692 #memory_limit_gb = 5
693
694 # Minimum number of rows needed to run experiments (values lower than 100 might not work).
695 # A minimum threshold is set to ensure there is enough data to create a statistically
696 # reliable model and avoid other small-data related failures.
697 #min_num_rows = 100
698
699 # Minimum required number of rows (in the training data) for each class label for classification problems.
700 #min_rows_per_class = 5
701
702 # Minimum required number of rows for each split when generating validation samples.
703 #min_rows_per_split = 5
704
705 # Precision of how data is stored
706 # 'float32' best for speed, 'float64' best for accuracy or very large input values
707 # 'float32' allows numbers up to about +-3E38 with relative error of about 1E-7
708 # 'float64' allows numbers up to about +-1E308 with relative error of about 1E-16
709 # Some calculations, like the GLM standardization, can only handle up to the square root of these maximums for data values,
710 # so GLM with 32-bit precision can only handle up to about a value of 1E19 before standardization generates inf values.
711 # If you see "Best individual has invalid score" you may require higher precision.
712 #data_precision = "float32"
713
714 # Precision of most data transformers (same options and notes as data_precision).
715 # Useful for higher precision in transformers with numerous operations that can accumulate error.
716 # Also useful if want faster performance for transformers but otherwise want data stored in high precision.
717 #transformer_precision = "float32"
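# Example (illustrative, not a shipped default): if input values are very large (e.g. beyond about 1E19)
# and 'float32' causes overflow during GLM standardization, both precisions can be raised at some cost in speed:
#   data_precision = "float64"
#   transformer_precision = "float64"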
718
719 # Whether to change ulimit soft limits up to hard limits (for DAI server app, which is not a generic user app).
720 # Prevents resource limit problems in some cases.
721 # Restricted to no more than limit_nofile and limit_nproc for those resources.
722 #ulimit_up_to_hard_limit = true
723
724 # number of file limit
725 # Below should be consistent with start-dai.sh
726 #limit_nofile = 65535
727
728 # number of threads limit
729 # Below should be consistent with start-dai.sh
730 #limit_nproc = 16384
731
732 # Level of reproducibility desired (for same data and same inputs).
733 # Only active if 'reproducible' mode is enabled (GUI button enabled or a seed is set from the client API).
734 # Supported levels are:
735 # reproducibility_level = 1 for same experiment results as long as same O/S, same CPU(s) and same GPU(s)
736 # reproducibility_level = 2 for same experiment results as long as same O/S, same CPU architecture and same GPU architecture
737 # reproducibility_level = 3 for same experiment results as long as same O/S, same CPU architecture, not using GPUs
738 # reproducibility_level = 4 for same experiment results as long as same O/S, (best effort)
739 #reproducibility_level = 1
740
741 # Seed for random number generator to make experiments reproducible, to a certain reproducibility level (see above).
742 # Only active if 'reproducible' mode is enabled (GUI button enabled or a seed is set from the client API).
743 #seed = 1234
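# Example (illustrative, the seed value here is arbitrary): with reproducible mode enabled in the GUI or
# client API, a fixed seed plus the strictest level aims for identical results on the same O/S, CPU(s) and GPU(s):
#   reproducibility_level = 1
#   seed = 42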
744
745 # The list of values that should be interpreted as missing values during data import.
746 # This applies to both numeric and string columns. Note that the dataset must be reloaded after applying changes to this config via the expert settings.
747 # Also note that 'nan' is always interpreted as a missing value for numeric columns.
748 #missing_values = "['', '?', 'None', 'nan', 'NA', 'N/A', 'unknown', 'inf', '-inf', '1.7976931348623157e+308', '-1.7976931348623157e+308']"
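# Example (illustrative, the extra sentinel values are hypothetical): add domain-specific codes to the
# missing value tokens, then reload the dataset for the change to take effect:
#   missing_values = "['', '?', 'None', 'nan', 'NA', 'N/A', 'unknown', '-999', 'MISSING']"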
749
750 # For tensorflow, what numerical value to give to missing values, where numeric values are standardized.
751 # So 0 is center of distribution, and if Normal distribution then +-5 is 5 standard deviations away from the center.
752 # In many cases, an out of bounds value is a good way to represent missings, but in some cases the mean (0) may be better.
753 #tf_nan_impute_value = -5
754
755 # Internal threshold for number of rows x number of columns to trigger certain statistical
756 # techniques (small data recipe like including one hot encoding for all model types, and smaller learning rate)
757 # to increase model accuracy
758 #statistical_threshold_data_size_small = 100000
759
760 # Internal threshold for number of rows x number of columns to trigger certain statistical
761 # techniques (fewer genes created, removal of high max_depth for tree models, etc.) that can speed up modeling.
762 # Also controls maximum rows used in training final model,
763 # by sampling statistical_threshold_data_size_large / columns number of rows
764 #statistical_threshold_data_size_large = 500000000
765
766 # Internal threshold for number of rows x number of columns to trigger sampling for auxiliary data uses,
767 # like imbalanced data set detection and bootstrap scoring sample size and iterations
768 #aux_threshold_data_size_large = 10000000
769
770 # Internal threshold for number of rows x number of columns to trigger certain changes in performance
771 # (fewer threads if beyond large value) to help avoid OOM or unnecessary slowdowns
772 # (fewer threads if lower than small value) to avoid excess forking of tasks
773 #performance_threshold_data_size_small = 100000
774
775 # Internal threshold for number of rows x number of columns to trigger certain changes in performance
776 # (fewer threads if beyond large value) to help avoid OOM or unnecessary slowdowns
777 # (fewer threads if lower than small value) to avoid excess forking of tasks
778 #performance_threshold_data_size_large = 100000000
779
780 # Maximum number of columns to start an experiment. This threshold exists to constrain the complexity and the length of Driverless AI's processes.
781 #max_cols = 10000
782
783 # Largest number of rows to use for column stats, otherwise sample randomly
784 #max_rows_col_stats = 1000000
785
786 # Whether to obtain permutation feature importance on original features for reporting in logs and file.
787 #
788 #orig_features_fs_report = false
789
790 # Maximum number of rows when doing permutation feature importance, reduced by (stratified) random sampling.
791 #
792 #max_rows_fs = 1000000
793
794 # How many workers to use for feature selection by permutation for predict phase
795 # (0 = auto, > 0: min of DAI value and this value, < 0: exactly negative of this value)
796 #
797 #max_workers_fs = 0
798
799 # Maximum number of columns selected out of original set of original columns, using feature selection
800 # The selection is based upon how well target encoding (or frequency encoding if not available) performs on categoricals and numerics treated as categoricals
801 # This is useful to reduce the final model complexity. First the best
802 # [max_orig_cols_selected] are found through feature selection methods and then
803 # these features are used in feature evolution (to derive other features) and in modelling.
804 #max_orig_cols_selected = 10000
805
806 # Maximum number of numeric columns selected, above which will do feature selection
807 # same as above (max_orig_cols_selected) but for numeric columns.
808 #max_orig_numeric_cols_selected = 10000
809
810 # Maximum number of non-numeric columns selected, above which will do feature selection on all features and avoid treating numerical as categorical
811 # same as above (max_orig_numeric_cols_selected) but for categorical columns.
812 #max_orig_nonnumeric_cols_selected = 300
813
814 # The factor times max_orig_cols_selected, by which column selection is based upon no target encoding and no treating numerical as categorical
815 # in order to limit performance cost of feature engineering
816 #max_orig_cols_selected_simple_factor = 2
817
818 # Like max_orig_cols_selected, but columns above which add special individual with original columns reduced.
819 #
820 #fs_orig_cols_selected = 500
821
822 # Like max_orig_numeric_cols_selected, but applicable to special individual with original columns reduced.
823 # A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features.
824 #
825 #fs_orig_numeric_cols_selected = 500
826
827 # Like max_orig_nonnumeric_cols_selected, but applicable to special individual with original columns reduced.
828 # A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features.
829 #
830 #fs_orig_nonnumeric_cols_selected = 200
831
832 # Like max_orig_cols_selected_simple_factor, but applicable to special individual with original columns reduced.
833 #fs_orig_cols_selected_simple_factor = 2
834
835 # Maximum allowed fraction of unique values for integer and categorical columns (otherwise will treat column as ID and drop)
836 #max_relative_cardinality = 0.95
837
838 # Maximum allowed number of unique values for integer and categorical columns (otherwise will treat column as ID and drop)
839 #max_absolute_cardinality = 1000000
840
841 # Whether to treat some numerical features as categorical.
842 # For instance, sometimes an integer column may not represent a numerical feature but
843 # represent different numerical codes instead.
844 #num_as_cat = true
845
846 # Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
847 #max_int_as_cat_uniques = 50
848
849 # Number of folds for models used during the feature engineering process.
850 # Increasing this will put a lower fraction of data into validation and more into training
851 # (e.g., num_folds=3 means 67%/33% training/validation splits).
852 # Actual value will vary for small or big data cases.
853 #num_folds = 3
854
855 # For multiclass problems only. Whether to allow different sets of target classes across (cross-)validation
856 # fold splits. Especially important when passing a fold column that isn't balanced w.r.t class distribution.
857 #
858 #allow_different_classes_across_fold_splits = true
859
860 # Accuracy setting equal and above which enables full cross-validation (multiple folds) during feature evolution
861 # as opposed to only a single holdout split (e.g. 2/3 train and 1/3 validation holdout)
862 #full_cv_accuracy_switch = 8
863
864 # Accuracy setting equal and above which enables stacked ensemble as final model.
865 # Stacking commences at the end of the feature evolution process.
866 # It quite often leads to better model performance, but it does increase the complexity
867 # and execution time of the final model.
868 #ensemble_accuracy_switch = 5
869
870 # Number of fold splits to use for ensemble_level >= 2.
871 # The ensemble modelling may require predictions to be made on out-of-fold samples
872 # hence the data needs to be split on different folds to generate these predictions.
873 # Fewer folds (like 2 or 3) normally create more stable models, but may be less accurate.
874 # More folds can get to higher accuracy at the expense of more time, but the performance
875 # may be less stable when there is not enough training data (i.e. higher chance of overfitting).
876 # Actual value will vary for small or big data cases.
877 #num_ensemble_folds = 5
878
879 # Number of repeats for each fold for all validation
880 # (modified slightly for small or big data cases)
881 #fold_reps = 1
882
883 # Maximum number of classes to allow for a classification problem.
884 # High number of classes may make certain processes of Driverless AI time-consuming.
885 # Memory requirements also increase with higher number of classes
886 #max_num_classes = 200
887
888 # Number of actuals vs. predicted data points to use in order to generate the relevant
889 # plot/graph which is shown at the right part of the screen within an experiment.
890 #num_actuals_vs_predicted = 100
891
892 # Whether to use H2O.ai brain: the local caching and smart re-use of prior experiments,
893 # in order to generate more useful features and models for new experiments.
894 # It can also be used to control checkpointing for experiments that have been paused or interrupted.
895 # DAI will use H2O.ai brain cache if cache file has
896 # a) any matching column names and types for a similar experiment type
897 # b) exactly matches classes
898 # c) exactly matches class labels
899 # d) matches basic time series choices
900 # e) interpretability of cache is equal or lower
901 # f) main model (booster) is allowed by new experiment.
902 # Level of brain to use (for chosen level, where higher levels will also do all lower level operations automatically)
903 # -1 = Don't use any brain cache and don't write any cache
904 # 0 = Don't use any brain cache but still write cache
905 # Use case: Want to save model for later use, but want current model to be built without any brain models
906 # 1 = smart checkpoint from latest best individual model
907 # Use case: Want to use latest matching model, but match can be loose, so needs caution
908 # 2 = smart checkpoint from H2O.ai brain cache of individual best models
909 # Use case: DAI scans through H2O.ai brain cache for best models to restart from
910 # 3 = smart checkpoint like level #1, but for entire population. Tune only if brain population insufficient size
911 # (will re-score entire population in single iteration, so appears to take longer to complete first iteration)
912 # 4 = smart checkpoint like level #2, but for entire population. Tune only if brain population insufficient size
913 # (will re-score entire population in single iteration, so appears to take longer to complete first iteration)
914 # 5 = like #4, but will scan over entire brain cache of populations to get best scored individuals
915 # (can be slower due to brain cache scanning if big cache)
916 # 1000 + feature_brain_level (above positive values) = use resumed_experiment_id and actual feature_brain_level,
917 # to use other specific experiment as base for individuals or population,
918 # instead of sampling from any old experiments
919 # GUI has 4 options and corresponding settings:
920 # 1) New Experiment: Uses feature brain level default of 2
921 # 2) New Model With Same Parameters: Re-uses the same feature brain level as parent experiment
922 # 3) Restart From Last Checkpoint: Resets feature brain level to 1003 and sets experiment ID to resume from
923 # (continued genetic algorithm iterations)
924 # 4) Retrain Final Pipeline: Like Restart but also time=0 so skips any tuning and heads straight to final model
925 # (assumes had at least one tuning iteration in parent experiment)
926 # Other use cases:
927 # a) Restart on different data: Use same column names and fewer or more rows (applicable to 1 - 5)
928 # b) Re-fit only final pipeline: Like (a), but choose time=1 and feature_brain_level=3 - 5
929 # c) Restart with more columns: Add columns, so model builds upon old model built from old column names (1 - 5)
930 # d) Restart with focus on model tuning: Restart, then select feature_engineering_effort = 3 in expert settings
931 # e) can retrain final model but ignore any original features except those in final pipeline (normal retrain but set brain_add_features_for_new_columns=false)
932 # Notes:
933 # 1) In all cases, we first check the resumed experiment id if given, and then the brain cache
934 # 2) For Restart cases, may want to set min_dai_iterations to non-zero to force delayed early stopping, else may not be enough iterations to find better model.
935 # 3) A "New model with Same Params" of a Restart will use feature_brain_level=1003 for default Restart mode (revert to 2, or even 0 if want to start a fresh experiment otherwise)
936 #feature_brain_level = 2
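# Example (illustrative): write the brain cache for later reuse but build the current experiment
# without restoring anything from prior experiments (level 0 as described above):
#   feature_brain_level = 0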
937
938 # Relative number of columns that must match between current reference individual and brain individual.
939 # 0.0: perfect match
940 # 1.0: All columns are different, worst match
941 # e.g. 0.1 implies no more than 10% of columns mismatch between reference set of columns and brain individual.
942 #
943 #brain_maximum_diff_score = 0.1
944
945 # Maximum number of brain individuals pulled from H2O.ai brain cache for feature_brain_level=1, 2
946 #max_num_brain_indivs = 3
947
948 # Save feature brain iterations every iter_num % feature_brain_iterations_save_every_iteration == 0, to be able to restart/refit with which_iteration_brain >= 0
949 # 0 means disable
950 #feature_brain_save_every_iteration = 0
951
952 # When doing restart or re-fit type feature_brain_level with resumed_experiment_id, choose which iteration to start from, instead of only last best
953 # -1 means just use last best
954 # Usage:
955 # 1) Run one experiment with feature_brain_iterations_save_every_iteration=1 or some other number
956 # 2) Identify which iteration brain dump one wants to restart/refit from
957 # 3) Restart/Refit from original experiment, setting which_iteration_brain to that number in expert settings
958 # Note: If restart from a tuning iteration, this will pull in entire scored tuning population and use that for feature evolution
959 #which_iteration_brain = -1
960
961 # When doing re-fit, if change columns or features, population of individuals used to refit from may change order of which was best,
962 # leading to better result chosen (False case). But sometimes want to see exact same model/features with only one feature added,
963 # and then would need to set this to True case.
964 # E.g. if refit with just 1 extra column and have interpretability=1, then final model will be same features,
965 # with one more engineered feature applied to that new original feature.
966 #
967 #refit_same_best_individual = false
968
969 # Directory, relative to data_directory, to store H2O.ai brain meta model files
970 #brain_rel_dir = "H2O.ai_brain"
971
972 # Maximum size in bytes the brain will store
973 # We reserve this memory to save data in order to ensure we can retrieve an experiment if
974 # for any reason it gets interrupted.
975 # -1: unlimited
976 # >=0 number of GB to limit brain to
977 #brain_max_size_GB = 20
978
979 # Whether to take any new columns and add additional features to pipeline, even if doing retrain final model.
980 # In some cases, one might have a new dataset but only want to keep same pipeline regardless of new columns,
981 # in which case one sets this to False. For example, new data might lead to new dropped features,
982 # due to shift or leak detection. To avoid change of feature set, one can disable all dropping of columns,
983 # but set this to False to avoid adding any columns as new features,
984 # so pipeline is perfectly preserved when changing data.
985 #brain_add_features_for_new_columns = true
986
987 # Whether to enable early stopping
988 # Early stopping refers to stopping the feature evolution/engineering process
989 # when there is no performance uplift after a certain number of iterations.
990 # After early stopping has been triggered, Driverless AI will initiate the ensemble
991 # process if selected.
992 #early_stopping = true
993
994 # Whether to enable early stopping per individual
995 # Each individual in the genetic algorithm will stop early if no improvement,
996 # and it will no longer be mutated.
997 # Instead, the best individual will be additionally mutated.
998 #early_stopping_per_individual = true
999
1000 # Minimum number of Driverless AI iterations to stop the feature evolution/engineering
1001 # process even if score is not improving. Driverless AI needs to run for at least that many
1002 # iterations before deciding to stop. It can be seen as a safeguard against suboptimal (early)
1003 # convergence.
1004 #min_dai_iterations = 0
1005
1006 # Maximum features per model (and each model within the final model if ensemble) kept just after scoring them
1007 # Keeps top variable importance features, prunes rest away, after each scoring.
1008 # Final ensemble will exclude any pruned-away features and only train on kept features,
1009 # but may contain a few new features due to fitting on different data view (e.g. new clusters)
1010 # Final scoring pipeline will exclude any pruned-away features,
1011 # but may contain a few new features due to fitting on different data view (e.g. new clusters)
1012 # -1 means no restrictions except internally-determined memory and interpretability restrictions
1013 #nfeatures_max = -1
1014
1015 # Maximum genes (transformer instances) per model (and each model within the final model if ensemble) kept.
1016 # Controls number of genes before features are scored, so just randomly samples genes if pruning occurs.
1017 # If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes.
1018 # Instances includes all possible transformers, including original transformer for numeric features.
1019 # -1 means no restrictions except internally-determined memory and interpretability restrictions
1020 #ngenes_max = -1
1021
1022 # Whether to limit feature counts by interpretability setting via features_allowed_by_interpretability
1023 #limit_features_by_interpretability = true
1024
1025 # Max. number of epochs for TensorFlow models for making NLP features
1026 #tensorflow_max_epochs_nlp = 2
1027
1028 # Accuracy setting equal and above which will add all enabled TensorFlow NLP models below at start of experiment for text dominated problems
1029 # when TensorFlow NLP transformers are set to auto. If set to on, this parameter is ignored.
1030 # Otherwise, at lower accuracy, TensorFlow NLP transformations will only be created as a mutation.
1031 #enable_tensorflow_nlp_accuracy_switch = 5
1032
1033 # Whether to use Word-based CNN TensorFlow models for NLP if TensorFlow enabled
1034 #enable_tensorflow_textcnn = "auto"
1035
1036 # Whether to use Word-based Bi-GRU TensorFlow models for NLP if TensorFlow enabled
1037 #enable_tensorflow_textbigru = "auto"
1038
1039 # Whether to use Character-level CNN TensorFlow models for NLP if TensorFlow enabled
1040 #enable_tensorflow_charcnn = "auto"
1041
1042 # Path to pretrained embeddings for TensorFlow NLP models
1043 # For example, download and unzip https://nlp.stanford.edu/data/glove.6B.zip
1044 # tensorflow_nlp_pretrained_embeddings_file_path = /path/on/server/to/glove.6B.300d.txt
1045 #tensorflow_nlp_pretrained_embeddings_file_path = ""
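# Example (illustrative, the path prefix is hypothetical): point the TensorFlow NLP transformers at
# unzipped GloVe embeddings stored on the Driverless AI server:
#   tensorflow_nlp_pretrained_embeddings_file_path = "/opt/embeddings/glove.6B.300d.txt"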
1046
1047 # Allow training of all weights of the neural network graph, including the pretrained embedding layer weights. If disabled, then the embedding layer is frozen, but all other weights are still fine-tuned.
1048 #tensorflow_nlp_pretrained_embeddings_trainable = false
1049
1050 # Whether Python/MOJO scoring runtime will have GPUs (otherwise BiGRU will fail in production if this is enabled).
1051 # Enabling this can speed up training for BiGRU, but will require GPUs and CuDNN in production.
1052 #tensorflow_nlp_have_gpus_in_production = false
1053
1054 # Fraction of text columns out of all features to be considered a text-dominated problem
1055 #text_fraction_for_text_dominated_problem = 0.3
1056
1057 # Fraction of text transformers to all transformers above which to trigger that text dominated problem
1058 #text_transformer_fraction_for_text_dominated_problem = 0.3
1059
1060 # Threshold for average string-is-text score as determined by internal heuristics
1061 # It decides when a string column will be treated as text (for an NLP problem) or just as
1062 # a standard categorical variable.
1063 # Higher values will favor string columns as categoricals, lower values will favor string columns as text
1064 #string_col_as_text_threshold = 0.3
1065
1066 # Minimum fraction of unique values for string columns to be considered as possible text (otherwise categorical)
1067 #string_col_as_text_min_relative_cardinality = 0.1
1068
1069 # Minimum number of uniques for string columns to be considered as possible text (otherwise categorical)
1070 #string_col_as_text_min_absolute_cardinality = 100
1071
1072 # Interpretability setting equal and above which will use monotonicity constraints in GBM
1073 # You may read the following source to understand what these constraints connote and why
1074 # they may be important, especially when the end goal is a very interpretable machine
1075 # learning model: https://blog.datadive.net/monotonicity-constraints-in-machine-learning/
1076 #monotonicity_constraints_interpretability_switch = 7
1077
1078 # Threshold, of Pearson product-moment correlation coefficient between numerical or encoded transformed
1079 # feature and target, above (below negative for) which will enforce positive (negative) monotonicity
1080 # for XGBoost and LightGBM.
1081 # Enabled when interpretability >= monotonicity_constraints_interpretability_switch config toml value
1082 #monotonicity_constraints_correlation_threshold = 0.1
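# Example (illustrative): enforce monotonicity constraints regardless of the interpretability dial by
# lowering the switch to 1, and only constrain features whose correlation with the target exceeds 0.3:
#   monotonicity_constraints_interpretability_switch = 1
#   monotonicity_constraints_correlation_threshold = 0.3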
1083
1084 # Exploring feature interactions can be important in gaining better predictive performance.
1085 # The interaction can take multiple forms (i.e. feature1 + feature2 or feature1 * feature2 + ... featureN)
1086 # Although certain machine learning algorithms (like tree-based methods) can do well in
1087 # capturing these interactions as part of their training process, still generating them may
1088 # help them (or other algorithms) yield better performance.
1089 # The depth of the interaction level (as in "up to" how many features may be combined at
1090 # once to create one single feature) can be specified to control the complexity of the
1091 # feature engineering process. For transformers that use both numeric and categorical features, this constrains
1092 # the number of each type, not the total number. Higher values might be able to make more predictive models
1093 # at the expense of time (-1 means automatic).
1094 #max_feature_interaction_depth = -1
1095
1096 # Instead of sampling from min to max (up to max_feature_interaction_depth unless all specified)
1097 # columns allowed for each transformer (0), choose fixed non-zero number of columns to use.
1098 # Can make same as number of columns to use all columns for each transformers if allowed by each transformer.
1099 # -n can be chosen to do 50/50 sample and fixed of n features.
1100 #
1101 #fixed_feature_interaction_depth = 0
1102
1103 # Like fixed_feature_interaction_depth but for categoricals if doing numcat (i.e. numeric and categoricals separated)
1104 # transformers. Only applies if also fixed_feature_interaction_depth > 0, and then will use sampling (0) unless
1105 # override since doesn't make sense to group by all columns usually.
1106 #fixed_feature_interaction_depth_numcat_cat = 0
1107
1108 # Accuracy setting equal and above which enables tuning of model parameters
1109 # Only applicable if parameter_tuning_num_models=-1 (auto)
1110 #tune_parameters_accuracy_switch = 3
1111
1112 # Accuracy setting equal and above which enables tuning of target transform for regression.
1113 # This is useful for time series when instead of predicting the actual target value, it
1114 # might be better to predict a transformed target variable like sqrt(target) or log(target)
1115 # as a means to control for outliers.
1116 #tune_target_transform_accuracy_switch = 3
1117
1118 # Select a target transformation for regression problems. Must be one of: ['auto',
1119 # 'identity', 'unit_box', 'log', 'square', 'sqrt', 'double_sqrt', 'inverse', 'anscombe', 'logit', 'sigmoid'].
1120 # If set to 'auto', will automatically pick the best target transformer (if accuracy is set to
1121 # tune_target_transform_accuracy_switch or larger).
1122 #
1123 #target_transformer = "auto"
1124
1125 # Tournament style (method to decide which models are best at each iteration)
1126 # 'auto' : Choose based upon accuracy, etc.
1127 # 'fullstack' : Choose among optimal model and feature types
1128 # 'model' : individuals with same model type compete
1129 # 'feature' : individuals with similar feature types compete
1130 # 'uniform' : all individuals in population compete to win as best
1131 # 'model' and 'feature' styles preserve at least one winner for each type (and so 2 total indivs of each type after mutation)
1132 # For each case, a round robin approach is used to choose best scores among type of models to choose from
1133 #tournament_style = "auto"
1134
1135 # Interpretability above which will use 'uniform' tournament style
1136 #tournament_uniform_style_interpretability_switch = 6
1137
1138 # Accuracy below which will use uniform style if tournament_style = 'auto' (regardless of other accuracy tournament style switch values)
1139 #tournament_uniform_style_accuracy_switch = 6
1140
1141 # Accuracy equal and above which uses model style if tournament_style = 'auto'
1142 #tournament_model_style_accuracy_switch = 6
1143
1144 # Accuracy equal and above which uses feature style if tournament_style = 'auto'
1145 #tournament_feature_style_accuracy_switch = 7
1146
1147 # Accuracy equal and above which uses fullstack style if tournament_style = 'auto'
1148 #tournament_fullstack_style_accuracy_switch = 8
1149
1150 # Whether to use penalized score for GA tournament or actual score
1151 #tournament_use_feature_penalized_score = true
1152
1153 # Driverless AI uses a genetic algorithm (GA) to find the best features, best models and
1154 # best hyper parameters for these models. The GA facilitates getting good results while not
1155 # requiring to run/try every possible model/feature/parameter. This version of GA has
1156 # reinforcement learning elements - it uses a form of exploration-exploitation to reach
1157 # optimum solutions. This means it will capitalise on models/features/parameters that seem to be working well and continue to exploit them even more, while allowing some room for
1158 # trying new (and semi-random) models/features/parameters to avoid settling on a local
1159 # minimum.
1160 # These models/features/parameters tried are what-we-call individuals of a population. More individuals connote more models/features/parameters to be tried and compete to find the best ones.
1161 #num_individuals = 2
1162
1163 # set fixed number of individuals (if > 0) - useful to compare different hardware configurations
1164 #fixed_num_individuals = 0
1165
1166 # set fixed number of fold reps (if > 0) - useful for quick runs regardless of data
1167 #fixed_fold_reps = 0
1168
1169 # set True to force only first fold for models - useful for quick runs regardless of data
1170 #fixed_only_first_fold_model = false
1171
1172 # number of unique targets or fold counts after which to switch to faster/simpler non-natural sorting and printouts
1173 #sanitize_natural_sort_limit = 1000
1174
1175 # Whether target encoding could be enabled
1176 # Target encoding refers to several different feature transformations (primarily focused on
1177 # categorical data) that aim to represent the feature using information of the actual
1178 # target variable. A simple example can be to use the mean of the target to replace each
1179 # unique category of a categorical feature. This type of features can be very predictive,
1180 # but are prone to overfitting and require more memory as they need to store mappings of
1181 # the unique categories and the target values.
1182 #enable_target_encoding = "auto"
1183
1184 #enable_lexilabel_encoding = "off"
1185
1186 #enable_isolation_forest = "off"
1187
1188 # Whether one hot encoding could be enabled. If auto, then only applied for small data and GLM.
1189 #enable_one_hot_encoding = "auto"
1190
1191 #isolation_forest_nestimators = 200
1192
1193 # Driverless AI categorises all data (feature engineering) transformers
1194 # More information for these transformers can be viewed here:
1195 # http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/transformations.html
1196 # This section allows including/excluding these transformations and may be useful when
1197 # simpler (more interpretable) models are sought at the expense of accuracy.
1198 # (i.e. the list of transformers to use, independent of the interpretability setting)
1199 # for multi-class: '['NumCatTETransformer', 'TextLinModelTransformer',
1200 # 'FrequentTransformer', 'CVTargetEncodeTransformer', 'ClusterDistTransformer',
1201 # 'WeightOfEvidenceTransformer', 'TruncSVDNumTransformer', 'CVCatNumEncodeTransformer',
1202 # 'DatesTransformer', 'TextTransformer', 'OriginalTransformer',
1203 # 'NumToCatWoETransformer', 'NumToCatTETransformer', 'ClusterTETransformer',
1204 # 'InteractionsTransformer']'
1205 # for regression/binary: '['TextTransformer', 'ClusterDistTransformer',
1206 # 'OriginalTransformer', 'TextLinModelTransformer', 'NumToCatTETransformer',
1207 # 'DatesTransformer', 'WeightOfEvidenceTransformer', 'InteractionsTransformer',
1208 # 'FrequentTransformer', 'CVTargetEncodeTransformer', 'NumCatTETransformer',
1209 # 'NumToCatWoETransformer', 'TruncSVDNumTransformer', 'ClusterTETransformer',
1210 # 'CVCatNumEncodeTransformer']'
1211 # This list appears in the experiment logs (search for 'Transformers used')
1212 #
1213 #included_transformers = "[]"
1214
1215 # Auxiliary to included_transformers
1216 # e.g. to disable all Target Encoding: excluded_transformers =
1217 # '['NumCatTETransformer', 'CVTargetEncodeF', 'NumToCatTETransformer',
1218 # 'ClusterTETransformer']'
1219 #
1220 #excluded_transformers = "[]"
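# Example (illustrative): disable all target-encoding transformers listed above while leaving the rest
# of the feature engineering search untouched:
#   excluded_transformers = "['NumCatTETransformer', 'CVTargetEncodeTransformer', 'NumToCatTETransformer', 'ClusterTETransformer']"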
1221
1222 # Include list of genes (i.e. genes (built on top of transformers) to use,
1223 # independent of the interpretability setting)
1224 # Some transformers are used by multiple genes, so this allows different control over feature engineering
1225 # for multi-class: '['InteractionsGene', 'WeightOfEvidenceGene',
1226 # 'NumToCatTargetEncodeSingleGene', 'OriginalGene', 'TextGene', 'FrequentGene',
1227 # 'NumToCatWeightOfEvidenceGene', 'NumToCatWeightOfEvidenceMonotonicGene', '
1228 # CvTargetEncodeSingleGene', 'DateGene', 'NumToCatTargetEncodeMultiGene', '
1229 # DateTimeGene', 'TextLinRegressorGene', 'ClusterIDTargetEncodeSingleGene',
1230 # 'CvCatNumEncodeGene', 'TruncSvdNumGene', 'ClusterIDTargetEncodeMultiGene',
1231 # 'NumCatTargetEncodeMultiGene', 'CvTargetEncodeMultiGene', 'TextLinClassifierGene',
1232 # 'NumCatTargetEncodeSingleGene', 'ClusterDistGene']'
1233 # for regression/binary: '['CvTargetEncodeSingleGene', 'NumToCatTargetEncodeSingleGene',
1234 # 'CvCatNumEncodeGene', 'ClusterIDTargetEncodeSingleGene', 'TextLinRegressorGene',
1235 # 'CvTargetEncodeMultiGene', 'ClusterDistGene', 'OriginalGene', 'DateGene',
1236 # 'ClusterIDTargetEncodeMultiGene', 'NumToCatTargetEncodeMultiGene',
1237 # 'NumCatTargetEncodeMultiGene', 'TextLinClassifierGene', 'WeightOfEvidenceGene',
1238 # 'FrequentGene', 'TruncSvdNumGene', 'InteractionsGene', 'TextGene',
1239 # 'DateTimeGene', 'NumToCatWeightOfEvidenceGene',
1240 # 'NumToCatWeightOfEvidenceMonotonicGene', ''NumCatTargetEncodeSingleGene']'
1241 # This list appears in the experiment logs (search for 'Genes used')
1242 # e.g. to only enable interaction gene, use: included_genes =
1243 # '['InteractionsGene']'
1244 #included_genes = "[]"
1245
1246 # Exclude list of genes (i.e. genes (built on top of transformers) to not use,
1247 # independent of the interpretability setting)
1248 # Some transformers are used by multiple genes, so this allows different control over feature engineering
1249 # for multi-class: '['InteractionsGene', 'WeightOfEvidenceGene',
1250 # 'NumToCatTargetEncodeSingleGene', 'OriginalGene', 'TextGene', 'FrequentGene',
1251 # 'NumToCatWeightOfEvidenceGene', 'NumToCatWeightOfEvidenceMonotonicGene', '
1252 # CvTargetEncodeSingleGene', 'DateGene', 'NumToCatTargetEncodeMultiGene', '
1253 # DateTimeGene', 'TextLinRegressorGene', 'ClusterIDTargetEncodeSingleGene',
1254 # 'CvCatNumEncodeGene', 'TruncSvdNumGene', 'ClusterIDTargetEncodeMultiGene',
1255 # 'NumCatTargetEncodeMultiGene', 'CvTargetEncodeMultiGene', 'TextLinClassifierGene',
1256 # 'NumCatTargetEncodeSingleGene', 'ClusterDistGene']'
1257 # for regression/binary: '['CvTargetEncodeSingleGene', 'NumToCatTargetEncodeSingleGene',
1258 # 'CvCatNumEncodeGene', 'ClusterIDTargetEncodeSingleGene', 'TextLinRegressorGene',
1259 # 'CvTargetEncodeMultiGene', 'ClusterDistGene', 'OriginalGene', 'DateGene',
1260 # 'ClusterIDTargetEncodeMultiGene', 'NumToCatTargetEncodeMultiGene',
1261 # 'NumCatTargetEncodeMultiGene', 'TextLinClassifierGene', 'WeightOfEvidenceGene',
1262 # 'FrequentGene', 'TruncSvdNumGene', 'InteractionsGene', 'TextGene',
1263 # 'DateTimeGene', 'NumToCatWeightOfEvidenceGene',
1264 # 'NumToCatWeightOfEvidenceMonotonicGene', ''NumCatTargetEncodeSingleGene']'
1265 # This list appears in the experiment logs (search for 'Genes used')
1266 # e.g. to disable interaction gene, use: excluded_genes =
1267 # '['InteractionsGene']'
1268 #excluded_genes = "[]"
1269
1270 #included_models = "[]"
1271
1272 # Auxiliary to included_models
1273 #excluded_models = "[]"
1274
1275 #included_scorers = "[]"
1276
1277 # Auxiliary to included_scorers
1278 #excluded_scorers = "[]"
1279
1280 # Whether to enable XGBoost GBM models ('auto'/'on'/'off')
1281 #enable_xgboost_gbm = "auto"
1282
1283 # Whether to enable XGBoost Dart models ('auto'/'on'/'off')
1284 #enable_xgboost_dart = "auto"
1285
1286 # Internal threshold for number of rows x number of columns to trigger no xgboost models due to high memory use
1287 # Overridden if enable_xgboost_gbm = "on" or enable_xgboost_dart = "on", in which case always allow each model type to be used
1288 #xgboost_threshold_data_size_large = 100000000
1289
1290 # Internal threshold for number of rows x number of columns to trigger no xgboost models due to limits on GPU memory capability
1291 # Overridden if enable_xgboost_gbm = "on" or enable_xgboost_dart = "on", in which case always allow each model type to be used
1292 #xgboost_gpu_threshold_data_size_large = 30000000
1293
1294 # Whether to enable GLM models ('auto'/'on'/'off')
1295 #enable_glm = "auto"
1296
1297 # Whether to enable Decision Tree models ('auto'/'on'/'off')
1298 #enable_decision_tree = "auto"
1299
1300 # Whether to enable LightGBM models ('auto'/'on'/'off')
1301 #enable_lightgbm = "auto"
1302
1303 # Whether to enable TensorFlow models ('auto'/'on'/'off')
1304 #enable_tensorflow = "auto"
1305
1306 # Whether to enable FTRL support (beta version, no mojo) (follow the regularized leader) model ('auto'/'on'/'off')
1307 #enable_ftrl = "auto"
1308
1309 # Whether to enable RuleFit support (beta version, no mojo) ('auto'/'on'/'off')
1310 #enable_rulefit = "auto"
1311
1312 # Which boosting types to enable for LightGBM (gbdt = boosted trees, rf_early_stopping = random forest with early stopping, rf = random forest (no early stopping), dart = drop-out boosted trees with no early stopping)
1313 #enable_lightgbm_boosting_types = "['gbdt']"
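# Example (illustrative): also allow LightGBM random forest and DART boosters to compete with gbdt:
#   enable_lightgbm_boosting_types = "['gbdt', 'rf', 'dart']"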
1314
1315 # Whether to enable LightGBM categorical feature support (only CPU mode currently)
1316 #enable_lightgbm_cat_support = false
1317
1318 # Whether to enable constant models ('auto'/'on'/'off')
1319 #enable_constant_model = "auto"
1320
1321 # Whether to show constant models in iteration panel
1322 #show_constant_model = false
1323
1324 #drop_constant_model_final_ensemble = true
1325
1326 # Parameters for LightGBM to override DAI parameters
1327 # parameters should be given as XGBoost equivalent unless unique LightGBM parameter
1328 # e.g. 'eval_metric' instead of 'metric' should be used
1329 # e.g. params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64, 'random_state': 1234}"
1330 # e.g. params_lightgbm = "{'n_estimators': 600, 'learning_rate': 0.1, 'reg_alpha': 0.0, 'reg_lambda': 0.5, 'gamma': 0, 'max_depth': 0, 'max_bin': 128, 'max_leaves': 256, 'scale_pos_weight': 1.0, 'max_delta_step': 3.469919910597877, 'min_child_weight': 1, 'subsample': 0.9, 'colsample_bytree': 0.3, 'tree_method': 'gpu_hist', 'grow_policy': 'lossguide', 'min_data_in_bin': 3, 'min_child_samples': 5, 'early_stopping_rounds': 20, 'num_classes': 2, 'objective': 'binary:logistic', 'eval_metric': 'logloss', 'random_state': 987654, 'early_stopping_threshold': 0.01, 'monotonicity_constraints': False, 'silent': True, 'debug_verbose': 0, 'subsample_freq': 1}"
1331 # avoid including "system"-level parameters like 'n_gpus': 1, 'gpu_id': 0, 'n_jobs': 1, 'booster': 'lightgbm'
1332 # also likely should avoid parameters like: 'objective': 'binary:logistic', unless one really knows what one is doing (e.g. alternative objectives)
1333 # See: https://xgboost.readthedocs.io/en/latest/parameter.html
1334 # And see: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
1335 # Can also pass objective parameters if choose (or in case automatically chosen) certain objectives
1336 # https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric-parameters
1337 #params_lightgbm = "{}"
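# Example (illustrative values): a small override using the XGBoost-style parameter names described above:
#   params_lightgbm = "{'n_estimators': 500, 'learning_rate': 0.05, 'max_leaves': 64}"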
1338
1339 # Parameters for XGBoost to override DAI parameters
1340 # similar parameters as lightgbm since lightgbm parameters are transcribed from xgboost equivalent versions
1341 # e.g. params_xgboost = '{'n_estimators': 100, 'max_leaves': 64, 'max_depth': 0, 'random_state': 1234}'
1342 # See: https://xgboost.readthedocs.io/en/latest/parameter.html
1343 #params_xgboost = "{}"
1344
1345 # Like params_xgboost but for XGBoost's dart method
1346 #params_dart = "{}"
1347
1348 # Parameters for TensorFlow to override DAI parameters
1349 # e.g. params_tensorflow = '{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30, 'layers': [100, 100], 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3, 'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}'
1350 # See: https://keras.io/ , e.g. for activations: https://keras.io/activations/
1351 # Example layers: [500, 500, 500], [100, 100, 100], [100, 100], [50, 50]
1352 # Strategies: '1cycle' or 'one_shot', See: https://github.com/fastai/fastai
1353 # normalize_type: 'streaming' or 'global' (using sklearn StandardScaler)
1354 #params_tensorflow = "{}"
1355
1356 # Parameters for XGBoost's gblinear to override DAI parameters
1357 # e.g. params_gblinear = '{'n_estimators': 100}'
1358 # See: https://xgboost.readthedocs.io/en/latest/parameter.html
1359 #params_gblinear = "{}"
1360
1361 # Parameters for Decision Tree to override DAI parameters
1362 # parameters should be given as XGBoost equivalent unless unique LightGBM parameter
1363 # e.g. 'eval_metric' instead of 'metric' should be used
1364 # e.g. params_decision_tree = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64, 'random_state': 1234}"
1365 # e.g. params_decision_tree = "{'n_estimators': 1, 'learning_rate': 1, 'reg_alpha': 0.0, 'reg_lambda': 0.5, 'gamma': 0, 'max_depth': 0, 'max_bin': 128, 'max_leaves': 256, 'scale_pos_weight': 1.0, 'max_delta_step': 3.469919910597877, 'min_child_weight': 1, 'subsample': 0.9, 'colsample_bytree': 0.3, 'tree_method': 'gpu_hist', 'grow_policy': 'lossguide', 'min_data_in_bin': 3, 'min_child_samples': 5, 'early_stopping_rounds': 20, 'num_classes': 2, 'objective': 'binary:logistic', 'eval_metric': 'logloss', 'random_state': 987654, 'early_stopping_threshold': 0.01, 'monotonicity_constraints': False, 'silent': True, 'debug_verbose': 0, 'subsample_freq': 1}"
1366 # avoid including "system"-level parameters like 'n_gpus': 1, 'gpu_id': 0, 'n_jobs': 1, 'booster': 'lightgbm'
1367 # also likely should avoid parameters like: 'objective': 'binary:logistic', unless one really knows what one is doing (e.g. alternative objectives)
1368 # See: https://xgboost.readthedocs.io/en/latest/parameter.html
1369 # And see: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
1370 # Can also pass objective parameters if choose (or in case automatically chosen) certain objectives
1371 # https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric-parameters
1372 #params_decision_tree = "{}"
1373
1374 # Parameters for Rulefit to override DAI parameters
1375 # e.g. params_rulefit = '{'max_leaves': 64}'
1376 # See: https://xgboost.readthedocs.io/en/latest/parameter.html
1377 #params_rulefit = "{}"
1378
1379 # Parameters for FTRL to override DAI parameters
1380 #params_ftrl = "{}"
1381
1382 # Dictionary of key:lists of values to use for LightGBM tuning, overrides DAI's choice per key
1383 # e.g. params_tune_lightgbm = '{'min_child_samples': [1,2,5,100,1000], 'min_data_in_bin': [1,2,3,10,100,1000]}'
1384 #params_tune_lightgbm = "{}"
1385
1386 # Like params_tune_lightgbm but for XGBoost
1387 # e.g. params_tune_xgboost = '{'max_leaves': [8, 16, 32, 64]}'
1388 #params_tune_xgboost = "{}"
1389
1390 # Like params_tune_lightgbm but for XGBoost's Dart
1391 # e.g. params_tune_dart = '{'max_leaves': [8, 16, 32, 64]}'
1392 #params_tune_dart = "{}"
1393
1394 # Like params_tune_lightgbm but for TensorFlow
1395 # e.g. params_tune_tensorflow = '{'layers': [[10,10,10], [10, 10, 10, 10]]}'
1396 #params_tune_tensorflow = "{}"
1397
1398 # Like params_tune_lightgbm but for gblinear
1399 # e.g. params_tune_gblinear = '{'reg_lambda': [.01, .001, .0001, .0002]}'
1400 #params_tune_gblinear = "{}"
1401
1402 # Like params_tune_lightgbm but for rulefit
1403 # e.g. params_tune_rulefit = '{'max_depth': [4, 5, 6]}'
1404 #params_tune_rulefit = "{}"
1405
1406 # Like params_tune_lightgbm but for ftrl
1407 #params_tune_ftrl = "{}"
1408
1409 # Whether to force max_leaves and max_depth to be 0 if grow_policy is depthwise and lossguide, respectively.
1410 #params_tune_grow_policy_simple_trees = true
1411
1412 # Maximum number of GBM trees or GLM iterations
1413 # Early-stopping usually chooses less
1414 #max_nestimators = 3000
1415
1416 # LightGBM dart mode and normal rf mode do not use early stopping and will sample from these values for n_estimators.
1417 #n_estimators_list_no_early_stopping = [50, 100, 200, 300]
1418
1419 # Lower limit on learning rate for final ensemble GBM models.
1420 # In some cases, the maximum number of trees/iterations is insufficient for the final learning rate,
1421 # which can lead to no early stopping triggered and poor final model performance.
1422 # Then, one can try increasing the learning rate by raising this minimum,
1423 # or one can try increasing the maximum number of trees/iterations.
1424 #
1425 #min_learning_rate_final = 0.01
1426
1427 # Upper limit on learning rate for final ensemble GBM models
1428 #max_learning_rate_final = 0.05
1429
1430 # factor by which max_nestimators is reduced for tuning and feature evolution
1431 #max_nestimators_feature_evolution_factor = 0.2
1432
1433 # Lower limit on learning rate for feature engineering GBM models
1434 #min_learning_rate = 0.05
1435
1436 # Upper limit on learning rate for GBM models
1437 # If want to override min_learning_rate and min_learning_rate_final, set this to smaller value
1438 #max_learning_rate = 0.5
1439
1440 # Max. number of epochs for TensorFlow and FTRL models
1441 #max_epochs = 10
1442
1443 # Maximum tree depth (and corresponding max max_leaves as 2**max_max_depth)
1444 #max_max_depth = 12
1445
1446 # Default max_bin for tree methods
1447 #default_max_bin = 256
1448
1449 # Default max_bin for lightgbm (recommended for GPU lightgbm)
1450 #default_lightgbm_max_bin = 64
1451
1452 # Maximum max_bin for tree features
1453 #max_max_bin = 256
1454
1455 # Minimum max_bin for any tree
1456 #min_max_bin = 32
1457
1458 # Amount of memory at which max_bin = 256 can handle 125 columns and max_bin = 32 can handle 1000 columns
1459 # As available memory on the system goes higher than this scale, proportionally more columns can be handled at higher max_bin
1460 # Currently set to 10GB
1461 #scale_mem_for_max_bin = 10737418240
1462
1463 # Factor by which rf gets more depth than gbdt
1464 #factor_rf = 1.25
1465
1466 # Whether TensorFlow will use all CPU cores, or if it will split among all transformers
1467 #tensorflow_use_all_cores = true
1468
1469 # Whether TensorFlow will use all CPU cores if reproducible is set, or if it will split among all transformers
1470 #tensorflow_use_all_cores_even_if_reproducible_true = false
1471
1472 # How many cores to use for each TensorFlow model, regardless if GPU or CPU based (0 = auto mode)
1473 #tensorflow_cores = 0
1474
1475 # Max number of rules to be used for RuleFit models (-1 for all)
1476 #rulefit_max_num_rules = -1
1477
1478 # Max tree depth for RuleFit models
1479 #rulefit_max_tree_depth = 6
1480
1481 # Max number of trees for RuleFit models
1482 #rulefit_max_num_trees = 100
1483
1484 # Internal threshold for number of rows x number of columns to trigger no rulefit models due to being too slow currently
1485 #rulefit_threshold_data_size_large = 100000000
1486
1487 # Enable One-Hot-Encoding (which does binning to limit the number of bins to no more than 100 anyway) for categorical columns with fewer than this many unique values
1488 # Set to 0 to disable
1489 #one_hot_encoding_cardinality_threshold = 50
1490
1491 # Fixed ensemble_level
1492 # -1 = auto, based upon ensemble_accuracy_switch, accuracy, size of data, etc.
1493 # 0 = No ensemble, only final single model on validated iteration/tree count
1494 # 1 = 1 model, multiple ensemble folds (cross-validation)
1495 # 2 = 2 models, multiple ensemble folds (cross-validation)
1496 # 3 = 3 models, multiple ensemble folds (cross-validation)
1497 # 4 = 4 models, multiple ensemble folds (cross-validation)
1498 #fixed_ensemble_level = -1
1499
1500 # If enabled, use cross-validation to determine optimal parameters for single final model,
1501 # and to be able to create training holdout predictions.
1502 #cross_validate_single_final_model = true
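# Example (illustrative): force a single final model (no stacked ensemble) while still using
# cross-validation to pick its parameters and iteration count:
#   fixed_ensemble_level = 0
#   cross_validate_single_final_model = true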
1503
1504 # Number of models to tune during pre-evolution phase
1505 # Can make this lower to avoid excessive tuning, or make higher to do enhanced tuning.
1506 # -1 : auto
1507 #
1508 #parameter_tuning_num_models = -1
1509
1510 #validate_meta_learner = false
1511
1512 # Set fixed number of folds (if >= 2) for cross-validation (actual number of splits allowed can be less and is determined at experiment run-time).
1513 #fixed_num_folds_evolution = -1
1514
1515 # Set fixed number of folds (if >= 2) for cross-validation (actual number of splits allowed can be less and is determined at experiment run-time).
1516 #fixed_num_folds = -1
1517
1518 #num_fold_ids_show = 10
1519
1520 # Upper limit on the number of rows x number of columns for feature evolution (applies to both training and validation/holdout splits)
1521 # feature evolution is the process that determines which features will be derived.
1522 # Depending on accuracy settings, a fraction of this value will be used
1523 #feature_evolution_data_size = 100000000
1524
1525 # Upper limit on the number of rows x number of columns for training final pipeline.
1526 #
1527 #final_pipeline_data_size = 500000000
1528
1529 # Smaller values can speed up final pipeline model training, as validation data is only used for early stopping.
1530 # Note that final model predictions and scores will always be provided on the full dataset provided.
1531 #
1532 #max_validation_to_training_size_ratio_for_final_ensemble = 2.0
1533
1534 # Ratio of minority to majority class of the target column beyond which stratified sampling is done for binary classification. Otherwise perform random sampling. Set to 0 to always do random sampling. Set to 1 to always do stratified sampling.
1535 #force_stratified_splits_for_imbalanced_threshold_binary = 0.01
1536
1537 # Sampling method for imbalanced binary classification problems. Choices are:
1538 # "auto": sample both classes as needed, depending on data
1539 # "over_under_sampling": over-sample the minority class and under-sample the majority class, depending on data
1540 # "under_sampling": under-sample the majority class to reach class balance
1541 # "off": do not perform any sampling
1542 #
1543 #imbalance_sampling_method = "off"
1544
1545 # For imbalanced binary classification: ratio of majority to minority class equal and above which to enable
1546 # special imbalanced models with sampling techniques (specified by imbalance_sampling_method) to attempt to improve model performance.
1547 #imbalance_ratio_sampling_threshold = 5
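# Example (illustrative): let Driverless AI choose a sampling technique for imbalanced binary targets
# and start applying it at a 3:1 majority-to-minority ratio instead of the default 5:1:
#   imbalance_sampling_method = "auto"
#   imbalance_ratio_sampling_threshold = 3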
1548
1549 # For heavily imbalanced binary classification: ratio of majority to minority class equal and above which to enable only
1550 # special imbalanced models on full original data, without upfront sampling.
1551 #heavy_imbalance_ratio_sampling_threshold = 25
1552
1553 # -1: automatic
1554 #imbalance_sampling_number_of_bags = -1
1555
1556 # -1: automatic
1557 #imbalance_sampling_max_number_of_bags = 10
1558
1559 # Only for shift/leakage/tuning/feature evolution models. Not used for final models. Final models can
1560 # be limited by imbalance_sampling_max_number_of_bags.
1561 #imbalance_sampling_max_number_of_bags_feature_evolution = 3
1562
1563 # Max. size of data sampled during imbalanced sampling (in terms of dataset size),
1564 # controls number of bags (approximately). Only for imbalance_sampling_number_of_bags == -1.
1565 #imbalance_sampling_max_multiple_data_size = 1.0
1566
1567 # Rank averaging can be helpful when ensembling diverse models when ranking metrics like AUC/Gini
1568 # metrics are optimized. No MOJO support yet.
1569 #imbalance_sampling_rank_averaging = "auto"
1570
1571 # A value of 0.5 means that models/algorithms will be presented a balanced target class distribution
1572 # after applying under/over-sampling techniques on the training data. Sometimes it makes sense to
1573 # choose a smaller value like 0.1 or 0.01 when starting from an extremely imbalanced original target
1574 # distribution. -1.0: automatic
1575 #imbalance_sampling_target_minority_fraction = -1.0
1576
1577 # For binary classification: ratio of majority to minority class equal and above which to notify
1578 # of imbalance in GUI to say slightly imbalanced.
1579 # More than imbalance_ratio_sampling_threshold will say problem is imbalanced.
1580 #
1581 #imbalance_ratio_notification_threshold = 2.0
1582
1583 # list of possible bins for FTRL (largest is default best value)
1584 #nbins_ftrl_list = "[1000000, 10000000, 100000000]"
1585
1586 # Samples the number of automatic FTRL interactions terms to no more than this value (for each of 2nd, 3rd, 4th order terms)
1587 #ftrl_max_interaction_terms_per_degree = 10000
1588
1589 # list of possible bins for target encoding (first is default value)
1590 #te_bin_list = "[25, 10, 100, 250]"
1591
1592 # list of possible bins for weight of evidence encoding (first is default value)
1593 # If only want one value: woe_bin_list = [2]
1594 #woe_bin_list = "[25, 10, 100, 250]"
1595
1596 # list of possible bins for one-hot encoding (first is default value)
1597 #ohe_bin_list = "[10, 25, 50, 75, 100]"
1598
1599 # Whether to drop columns with constant values
1600 #drop_constant_columns = true
1601
1602 # Whether to drop columns that appear to be an ID
1603 #drop_id_columns = true
1604
1605 # Whether to avoid dropping any columns (original or derived)
1606 #no_drop_features = false
1607
1608 # Direct control over columns to drop in bulk so can copy-paste large lists instead of selecting each one separately in GUI
1609 #cols_to_drop = ""
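# Example (illustrative, column names are hypothetical, assuming the same list-as-string format used by
# other list-valued options): drop known identifier columns in bulk:
#   cols_to_drop = "['customer_id', 'row_id', 'account_number']"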
1610
1611 # Control over columns to group by, default is empty list that means DAI automatically searches all columns,
1612 # selected randomly or by which have top variable importance.
1613 #cols_to_group_by = ""
1614
1615 # Whether to sample from given features to group by (True) or to always group by all features (False).
1616 #sample_cols_to_group_by = false
1617
1618 # Aggregation functions to use for groupby operations.
1619 #agg_funcs_for_group_by = "['mean', 'sd', 'min', 'max', 'count']"
1620
1621 # Out of fold aggregations ensure less overfitting, but see less data in each fold.
1622 #folds_for_group_by = 5
1623
1624 # Strategy to apply when doing mutations on transformers.
1625 # Sample mode is default, with tendency to sample transformer parameters.
1626 # Batched mode tends to do multiple types of the same transformation together.
1627 # Full mode does even more types of the same transformation together.
1628 #
1629 #mutation_mode = "sample"
1630
1631 # Whether to enable checking text for shift, currently only via label encoding.
1632 #shift_check_text = false
1633
1634 # Whether to use LightGBM random forest mode without early stopping for shift detection.
1635 #use_rf_for_shift_if_have_lgbm = true
1636
1637 # Normalized training variable importance above which to check the feature for shift
1638 # Useful to avoid checking likely unimportant features
1639 #shift_key_features_varimp = 0.01
1640
1641 # Whether to only check certain features based upon the value of shift_key_features_varimp
1642 #shift_check_reduced_features = true
1643
1644 # Number of trees to use to train model to check shift in distribution
1645 # No larger than max_nestimators
1646 #shift_trees = 100
1647
1648 # The value of max_bin to use for trees to use to train model to check shift in distribution
1649 #shift_max_bin = 256
1650
1651 # The min. value of max_depth to use for trees to use to train model to check shift in distribution
1652 #shift_min_max_depth = 4
1653
1654 # The max. value of max_depth to use for trees to use to train model to check shift in distribution
1655 #shift_max_max_depth = 8
1656
1657 # If distribution shift detection is enabled, show features for which shift AUC is above this value
1658 # (AUC of a binary classifier that predicts whether given feature value belongs to train or test data)
1659 #detect_features_distribution_shift_threshold_auc = 0.55
1660
1661 # Minimum number of features to keep, keeping least shifted feature at least if 1
1662 #drop_features_distribution_shift_min_features = 1
1663
1664 # Whether to enable checking text for leakage, currently only via label encoding.
1665 #leakage_check_text = true
1666
1667 # Normalized training variable importance (per 1 minus AUC/R2 to control for leaky varimp dominance) above which to check the feature for leakage
1668 # Useful to avoid checking likely unimportant features
1669 #leakage_key_features_varimp = 0.001
1670
1671 # Like leakage_key_features_varimp, but applies if early stopping disabled when can trust multiple leaks to get uniform varimp.
1672 #leakage_key_features_varimp_if_no_early_stopping = 0.05
1673
1674 # Whether to only check certain features based upon the value of leakage_key_features_varimp. If any feature has AUC near 1, will consume all variable importance, even if another feature is also leaky. So False is safest option, but True generally good if many columns.



1675 #leakage_check_reduced_features = true
1676
1677 # Whether to use LightGBM random forest mode without early stopping for leakage detection.
1678 #use_rf_for_leakage_if_have_lgbm = true
1679
1680 # Number of trees to use to train model to check for leakage
1681 # No larger than max_nestimators
1682 #leakage_trees = 100
1683
1684 # The value of max_bin to use for trees to use to train model to check for leakage
1685 #leakage_max_bin = 256
1686
1687 # The min. value of max_depth to use for trees to use to train model to check for leakage
1688 #leakage_min_max_depth = 4
1689
1690 # The max. value of max_depth to use for trees to use to train model to check for leakage
1691 #leakage_max_max_depth = 8
1692
1693 # When leakage detection is enabled, if AUC (R2 for regression) on original data (label-encoded)
1694 # is above or equal to this value, then trigger per-feature leakage detection
1695 #detect_features_leakage_threshold_auc = 0.95
1696
1697 # When leakage detection is enabled, show features for which AUC (R2 for regression,
1698 # for whether that predictor/feature alone predicts the target) is above or equal to this value.
1699 # Feature is dropped if AUC/R2 is above or equal to drop_features_leakage_threshold_auc
1700 #detect_features_per_feature_leakage_threshold_auc = 0.8
1701
1702 # Minimum number of features to keep, keeping least leakage feature at least if 1
1703 #drop_features_leakage_min_features = 1
1704
1705 # Ratio of train to validation holdout when testing for leakage
1706 #leakage_train_test_split = 0.25
1707
1708 # Whether to enable detailed traces (in GUI Trace)
1709 #detailed_traces = false
1710
1711 # Whether to enable debug log level (in log files)
1712 #debug_log = false
1713
1714 # Whether to add logging of system information such as CPU, GPU, disk space at the start of each experiment log. Same information is already logged in system logs.
1715 #log_system_info_per_experiment = true
1716
1717 # How close to the optimal value (usually 1 or 0) does the validation score need to be to be considered perfect (to stop the experiment)?
1718 #abs_tol_for_perfect_score = 0.0001
1719
1720 # Timeout in seconds to wait for data ingestion.
1721 #data_ingest_timeout = 86400.0
1722
1723 # Enable time series recipe
1724 #time_series_recipe = true
1725
1726 # Provide date or datetime timestamps (in same format as the time column) for custom training and validation splits like this: "tr_start1, tr_end1, va_start1, va_end1, ..., tr_startN, tr_endN, va_startN, va_endN"
1727 #time_series_validation_fold_split_datetime_boundaries = ""
1728
1729 # Timeout in seconds for time-series properties detection in UI.
1730 #timeseries_split_suggestion_timeout = 30.0
1731
1732 # Whether to use lag transformers when using causal-split for validation (as occurs when not using time-based lag recipe).
1733 # If no time groups columns, lag transformers will still use time-column as sole time group column.
1734 #
1735 #use_lags_if_not_time_series_recipe = false
1736
1737 # earliest datetime for automatic conversion of integers in %Y%m%d format to a time column during parsing
1738 #min_ymd_timestamp = 19000101
1739
1740 # latest datetime for automatic conversion of integers in %Y%m%d format to a time column during parsing
1741 #max_ymd_timestamp = 21000101
1742
1743 # maximum number of data samples (randomly selected rows) for date/datetime format detection
1744 #max_rows_datetime_format_detection = 100000
1745
1746 # Automatically generate is-holiday features from date columns
1747 #holiday_features = true
1748
1749 # Country code to use to look up holiday calendar (Python package 'holiday')
1750 #holiday_country = "US"
1751
1752 # Max. sample size for automatic determination of time series train/valid split properties, only if time column is selected
1753 #max_time_series_properties_sample_size = 250000
1754
1755 # Maximum number of lag sizes, which are sampled from if sample_lag_sizes==True, else all are taken (-1 == automatic)
1756 #max_lag_sizes = -1
1757
1758 # Minimum required autocorrelation threshold for a lag to be considered for feature engineering
1759 #min_lag_autocorrelation = 0.1
1760
1761 # How many samples of lag sizes to use for a single time group (single time series signal)
1762 #max_signal_lag_sizes = 100
1763
1764 # Whether to sample lag sizes
1765 #sample_lag_sizes = false
1766
1767 # How many samples of lag sizes to use, chosen randomly out of original set of lag sizes
1768 #max_sampled_lag_sizes = 10
1769
1770 # Override lags to be used
1771 # e.g. [7, 14, 21] # this exact list
1772 # e.g. 21 # produce from 1 to 21
1773 # e.g. 21:3 produce from 1 to 21 in step of 3
1774 # e.g. 5-21 produce from 5 to 21
1775 # e.g. 5-21:3 produce from 5 to 21 in step of 3
1776 #override_lag_sizes = []



1777
1778 # Smallest considered lag size
1779 #min_lag_size = -1
1780
1781 # Whether to enable feature engineering based on selected time column, e.g. Date~weekday.
1782 #allow_time_column_as_feature = true
1783
1784 # Whether to enable integer time column to be used as a numeric feature.
1785 # If using time series recipe, using time column (numeric time stamps) as input features can lead to model that
1786 # memorizes the actual time stamps instead of features that generalize to the future.
1787 #
1788 #allow_time_column_as_numeric_feature = false
1789
1790 # Allowed date or date-time transformations.
1791 # Date transformers include: year, quarter, month, week, weekday, day, dayofyear, num.
1792 # Date transformers also include: hour, minute, second.
1793 # Features in DAI will show up as get_ + transformation name.
1794 # E.g. num is a direct numeric value representing the floating point value of time,
1795 # which can lead to over-fitting if used on IID problems. So this is turned off by default.
1796 #datetime_funcs = "['year', 'quarter', 'month', 'week', 'weekday', 'day', 'dayofyear', 'hour', 'minute', 'second']"
1797
1798 # Whether to consider time groups columns (tgc) as standalone features.
1799 # Note that 'time_column' is treated separately via 'Allow to engineer features from time column'.
1800 # Use allowed_coltypes_for_tgc_as_features for control per feature type.
1801 #
1802 #allow_tgc_as_features = false
1803
1804 # Which time groups columns (tgc) feature types to consider as standalone features,
1805 # if the corresponding flag "Consider time groups columns as standalone features" is set to true.
1806 # E.g. all column types would be ["numeric", "categorical", "ohe_categorical", "datetime", "date", "text"]
1807 # Note that 'time_column' is treated separately via 'Allow to engineer features from time column'.
1808 # Note that if lag-based time series recipe is disabled, then all tgc are allowed features.
1809 #
1810 #allowed_coltypes_for_tgc_as_features = "['numeric', 'categorical', 'ohe_categorical', 'datetime', 'date', 'text']"
1811
1812 # Whether various transformers (clustering, truncated SVD) are enabled,
1813 # that otherwise would be disabled for time series due to
1814 # potential to overfit by leaking across time within the fit of each fold.
1815 #enable_time_unaware_transformers = "auto"
1816
1817 # Whether to group by all time groups columns for creating lag features, instead of sampling from them
1818 #tgc_only_use_all_groups = true
1819
1820 # Enable creation of holdout predictions on training data
1821 # using moving windows (useful for MLI, but can be slow)
1822 #time_series_holdout_preds = true
1823
1824 # Set fixed number of time-based splits for internal model validation (actual number of splits allowed can be less and is determined at experiment run-time).
1825 #time_series_validation_splits = -1
1826
1827 # Maximum overlap between two time-based splits. Higher values increase the amount of possible splits.
1828 #time_series_splits_max_overlap = 0.5
1829
1830 # Max number of splits used for creating final time-series model's holdout/backtesting predictions. With the default value '-1' the same amount of splits as during model validation will be used. Use 'time_series_validation_splits' to control amount of time-based splits used for model validation.
1831 #time_series_max_holdout_splits = -1
1832
1833 # Whether to speed up time-series holdout predictions for back-testing on training data (used for MLI and metrics calculation). Can be slightly less accurate.
1834 #mli_ts_fast_approx = false
1835
1836 # Whether to speed up Shapley values for time-series holdout predictions for back-testing on training data (used for MLI). Can be slightly less accurate.
1837 #mli_ts_fast_approx_contribs = true
1838
1839 # Enable creation of Shapley values for holdout predictions on training data
1840 # using moving windows (useful for MLI, but can be slow), at the time of the experiment. If disabled, MLI will
1841 # generate Shapley values on demand.
1842 #mli_ts_holdout_contribs = true
1843
1844 # Values of 5 or more can improve generalization by more aggressive dropping of least important features. Set to 1 to disable.
1845 #time_series_min_interpretability = 5
1846
1847 # Dropout mode for lag features in order to achieve an equal n.a.-ratio between train and validation/test. The independent mode performs a simple feature-wise dropout, whereas the dependent one takes lag-size dependencies per sample/row into account.
1848 #lags_dropout = "dependent"
1849
1850 # Normalized probability of choosing to lag non-targets relative to targets (-1.0 = auto)
1851 #prob_lag_non_targets = -1.0
1852
1853 # Method to create rolling test set predictions, if the forecast horizon is shorter than the time span of the test set. One can choose between test time augmentation (TTA) and a successive refitting of the final pipeline.
1854 #rolling_test_method = "tta"
1855
1856 # Probability for new Lags/EWMA gene to use default lags (determined by frequency/gap/horizon, independent of data) (-1.0 = auto)
1857 #prob_default_lags = -1.0
1858
1859 # Unnormalized probability of choosing other lag time-series transformers based on interactions (-1.0 = auto)
1860 #prob_lagsinteraction = -1.0
1861
1862 # Unnormalized probability of choosing other lag time-series transformers based on aggregations (-1.0 = auto)
1863 #prob_lagsaggregates = -1.0
1864
1865 # Maximum number of columns sent from the UI to the backend in order to auto-detect TGC
1866 #tgc_via_ui_max_ncols = 10
1867
1868 # Maximum frequency of duplicated timestamps for TGC detection
1869 #tgc_dup_tolerance = 0.01
1870
1871 # Build (if missing but available) the MOJO pipeline to be used for predictions in MLI. For certain models this can improve performance.
1872 #mli_use_mojo_pipeline = false
1873
1874 # When the number of rows is above this limit, sample for MLI when scoring UI data
1875 #mli_sample_above_for_scoring = 1000000



1876
1877 # When the number of rows is above this limit, sample for MLI when training surrogate models
1878 #mli_sample_above_for_training = 100000
1879
1880 # When sampling for MLI, how many rows to sample
1881 #mli_sample_size = 100000
1882
1883 # how many bins to do quantile binning
1884 #mli_num_quantiles = 10
1885
1886 # mli random forest number of trees
1887 #mli_drf_num_trees = 100
1888
1889 # Whether to speed up predictions used inside MLI with a fast approximation
1890 #mli_fast_approx = true
1891
1892 # mli number of trees for fast_approx during predict for Shapley
1893 #fast_approx_num_trees = 50
1894
1895 # whether to do only 1 fold and 1 model of all folds and models if ensemble
1896 #fast_approx_do_one_fold_one_model = true
1897
1898 # mli random forest max depth
1899 #mli_drf_max_depth = 20
1900
1901 # not only sample training, but also sample scoring
1902 #mli_sample_training = true
1903
1904 # regularization strength for k-LIME GLM's
1905 #klime_lambda = "[1e-06, 1e-08]"
1906
1907 # regularization strength for k-LIME GLM's
1908 #klime_alpha = 0.0
1909
1910 # mli converts numeric columns to enum when cardinality is <= this value
1911 #mli_max_numeric_enum_cardinality = 25
1912
1913 # Maximum number of features allowed for k-LIME k-means clustering
1914 #mli_max_number_cluster_vars = 6
1915
1916 # Use all columns for k-LIME k-means clustering (this will override `mli_max_number_cluster_vars` if set to `True`)
1917 #use_all_columns_klime_kmeans = false
1918
1919 # Strict version check for MLI
1920 #mli_strict_version_check = true
1921
1922 # MLI cloud name
1923 #mli_cloud_name = ""
1924
1925 # Compute original model ICE using per feature's bin predictions (true) or use "one frame" strategy (false)
1926 #mli_ice_per_bin_strategy = false
1927
1928 # By default DIA will run for categorical columns with cardinality <= mli_dia_default_max_cardinality
1929 #mli_dia_default_max_cardinality = 10
1930
1931 # Enable MLI Sensitivity Analysis
1932 #enable_mli_sa = true
1933
1934 # When the number of rows is above this limit, then sample for DIA
1935 #mli_dia_sample_size = 100000
1936
1937 # When the number of rows is above this limit, then sample for DAI PD/ICE
1938 #mli_pd_sample_size = 100000
1939
1940 # Enable async/await-based non-blocking MLI API
1941 #enable_mli_async_api = true
1942
1943 # Enable main chart aggregator in Sensitivity Analysis
1944 #enable_mli_sa_main_chart_aggregator = true
1945
1946 # When to sample for Sensitivity Analysis (number of rows after sampling)
1947 #mli_sa_sampling_limit = 500000
1948
1949 # Run main chart aggregator in Sensitivity Analysis when the number of dataset instances is bigger than given limit
1950 #mli_sa_main_chart_aggregator_limit = 1000
1951
1952 # Use predict_safe() (true) or predict_base() (false) in MLI (PD, ICE, SA, ...)
1953 #mli_predict_safe = false
1954
1955 # Allow predict method with fast approximation in MLI (PD, ICE, SA, ...)
1956 #enable_mli_predict_fast_approx = false
1957
1958 # Number of max retries should the surrogate model fail to build.
1959 #mli_max_surrogate_retries = 5
1960
1961 # Number of rows per batch when scoring using MOJO.
1962 #mli_mojo_batch_size = 50
1963
1964 # Tokenizer used to extract tokens from text columns for MLI.
1965 #mli_nlp_tokenizer = "tfidf"
1966
1967 # Number of tokens used for MLI NLP explanations. -1 means all.
1968 #mli_nlp_top_n = 20
1969
1970 # Maximum number of records on which we'll perform MLI NLP
1971 #mli_nlp_sample_limit = 10000
1972
1973 # Number of parallel workers when scoring using MOJO in MLI NLP.
1974 #mli_nlp_workers = 4
1975
1976 # Minimum number of documents in which a token has to appear. Integer means absolute count, float means percentage.
1977 #mli_nlp_min_df = 3
1978
1979 # Maximum number of documents in which a token has to appear. Integer means absolute count, float means percentage.



1980 #mli_nlp_max_df = 0.9
1981
1982 # The minimum value in the ngram range. The tokenizer will generate all possible tokens in the (mli_nlp_min_ngram, mli_nlp_max_ngram) range.
1983 #mli_nlp_min_ngram = 1
1984
1985 # The maximum value in the ngram range. The tokenizer will generate all possible tokens in the (mli_nlp_min_ngram, mli_nlp_max_ngram) range.
1986 #mli_nlp_max_ngram = 1
1987
1988 # Mode used to choose N tokens for MLI NLP.
1989 # "top" chooses N top tokens.
1990 # "bottom" chooses N bottom tokens.
1991 # "top-bottom" chooses math.floor(N/2) top and math.ceil(N/2) bottom tokens.
1992 # "linspace" chooses N evenly spaced out tokens.
1993 #mli_nlp_min_token_mode = "top"
1994
1995 # The number of top tokens to be used as features when building token based feature importance.
1996 #mli_nlp_tokenizer_max_features = -1
1997
1998 # The number of top tokens to be used as features when computing text LOCO.
1999 #mli_nlp_loco_max_features = -1
2000
2001 # The number of top tokens to be used as features when building surrogate models.
2002 #mli_nlp_surrogate_tokens = 100
2003
2004 # Whether to dump every scored individual's variable importance (both derived and original) to csv/tabulated/json file
2005 # produces files like: individual_scored_id%d.iter%d*features*
2006 #dump_varimp_every_scored_indiv = false
2007
2008 # Whether to dump every scored individual's model parameters to csv/tabulated/json file
2009 # produces files like: individual_scored.params.[txt, csv, json]
2010 #dump_modelparams_every_scored_indiv = true
2011
2012 # Number of features to show in model dump every scored individual
2013 #dump_modelparams_every_scored_indiv_feature_count = 3
2014
2015 # Whether to append (false) or have separate files, files like: individual_scored_id%d.iter%d*params*, (true) for modelparams every scored indiv
2016 #dump_modelparams_separate_files = false
2017
2018 # Whether to dump every scored fold's timing and feature info to a *timings*.txt file
2019 #dump_trans_timings = false
2020
2021 # whether to delete preview timings if wrote transformer timings
2022 #delete_preview_trans_timings = true
2023
2024 # Location of the AutoDoc template
2025 #autodoc_template = "report_template.docx"
2026
2027 # Specify the output of the report.
2028 # Options are docx or md.
2029 #autodoc_output_type = "docx"
2030
2031 # Specify the name of the report.
2032 #autodoc_report_name = "report"
2033
2034 # Specify the maximum number of classes in the confusion
2035 # matrix.
2036 #autodoc_max_cm_size = 10
2037
2038 # Set the number of models for which a glm coefficients
2039 # table is shown in the Autoreport. coef_table_num_models must
2040 # be -1 or an integer >= 1 (-1 shows all models).
2041 #autodoc_coef_table_num_models = 1
2042
2043 # Set the number of folds per model for which a glm
2044 # coefficients table is shown in the Autoreport. coef_table_num_folds
2045 # must be -1 or an integer >= 1 (-1 shows all folds per model).
2046 #autodoc_coef_table_num_folds = -1
2047
2048 # Set the number of coefficients to show within a glm
2049 # coefficients table in the Autoreport. coef_table_num_coef, controls
2050 # the number of rows shown in a glm table and must be -1 or
2051 # an integer >= 1 (-1 shows all coefficients).
2052 #autodoc_coef_table_num_coef = 50
2053
2054 # Set the number of classes to show within a glm
2055 # coefficients table in the Autoreport. coef_table_num_classes controls
2056 # the number of class-columns shown in a glm table and must be -1 or
2057 # an integer >= 4 (-1 shows all classes).
2058 #autodoc_coef_table_num_classes = 9
2059
2060 # Specify whether to show the full glm coefficient
2061 # table(s) in the appendix. coef_table_appendix_results_table must be
2062 # a boolean: True to show tables in appendix, False to not show them.
2063 #autodoc_coef_table_appendix_results_table = false
2064
2065 # Specify the minimum relative importance in order
2066 # for a feature to be displayed. autodoc_min_relative_importance
2067 # must be a float >= 0 and < 1.
2068 #autodoc_min_relative_importance = 0.003
2069
2070 # Specify the number of top features to display in
2071 # the document. Setting to -1 disables this restriction.
2072 #autodoc_num_features = 50
2073
2074 # Specify the number of rows to include in PDP and ICE plot
2075 # if individual rows are not specified.
2076 #autodoc_num_rows = 0
2077
2078 # Maximum number of seconds Partial Dependency computation
2079 # can take when generating report. Set to -1 for no time limit.
2080 #autodoc_pd_max_runtime = 20
2081
2082 # Number of standard deviations outside of the range of
2083 # a column to include in partial dependence plots. This shows how the



2084 # model will react to data it has not seen before.
2085 #autodoc_out_of_range = 3
2086
2087 # Number of columns to be shown in data summary. Value
2088 # must be an integer. Values lower than 1, e.g. 0 or -1, indicate that
2089 # all columns should be shown.
2090 #autodoc_data_summary_col_num = -1
2091
2092 # Whether to include prediction statistics information if
2093 # experiment is binary classification/regression.
2094 #autodoc_prediction_stats = false
2095
2096 # Number of quantiles to use for prediction statistics
2097 # computation.
2098 #autodoc_prediction_stats_n_quantiles = 20
2099
2100 # Whether to include population stability index if
2101 # experiment is binary classification/regression.
2102 #autodoc_population_stability_index = false
2103
2104 # Number of quantiles to use for population stability index
2105 # computation.
2106 #autodoc_population_stability_index_n_quantiles = 10
2107
2108 # Whether to compute permutation based feature importance.
2109 #autodoc_include_permutation_feature_importance = false
2110
2111 # Name of the scorer to be used to calculate feature
2112 # importance. Leave blank to use experiments default scorer
2113 #autodoc_feature_importance_scorer = ""
2114
2115 # Number of permutations to make per feature when computing
2116 # feature importance.
2117 #autodoc_feature_importance_num_perm = 1
2118
2119 # Whether to include response rates information if
2120 # experiment is binary classification.
2121 #autodoc_response_rate = false
2122
2123 # Number of quantiles to use for response rates information
2124 # computation.
2125 #autodoc_response_rate_n_quantiles = 10
2126
2127 # The number of features in a KLIME global GLM coefficients
2128 # table. Must be an integer greater than 0 or -1. To
2129 # show all features set to -1.
2130 #autodoc_global_klime_num_features = 10
2131
2132 # Set the number of KLIME global GLM coefficients tables. Set
2133 # to 1 to show one table with coefficients sorted by absolute
2134 # value. Set to 2 to show two tables: one with the top positive
2135 # coefficients and one with the top negative coefficients. Must
2136 # be set to the integer 1 or 2.
2137 #autodoc_global_klime_num_tables = 1
2138
2139 # Whether to show the Gini Plot.
2140 #autodoc_gini_plot = false
2141
2142 # Whether to show all config settings. If False, only
2143 # the changed settings (config overrides) are listed, otherwise all
2144 # settings are listed.
2145 #autodoc_list_all_config_settings = false
2146
2147 #
2148 # Whether to compute training, validation, and test correlation matrix (table and heatmap pdf) and save to disk
2149 # alpha: currently single threaded and slow for many columns
2150 #compute_correlation = false
2151
2152 # Whether to dump to disk a correlation heatmap
2153 #produce_correlation_heatmap = false
2154
2155 # Value to report high correlation between original features
2156 #high_correlation_value_to_report = 0.95
2157
2158 # Whether to delete preview cache on server exit
2159 #preview_cache_upon_server_exit = true
2160
2161 # Configurations for a HDFS data source
2162 # Path of hdfs coresite.xml
2163 # core_site_xml_path is deprecated, please use hdfs_config_path
2164 #core_site_xml_path = ""
2165
2166 # HDFS config folder path, can contain multiple config files
2167 #hdfs_config_path = ""
2168
2169 # Path of the principal key tab file
2170 #key_tab_path = ""
2171
2172 # This option disables access to the DAI data_directory from the file browser
2173 #file_hide_data_directory = true
2174
2175 # Enable usage of path filters
2176 #file_path_filtering_enabled = false
2177
2178 # List of absolute path prefixes to restrict access to in file browser.
2179 # For example:
2180 # file_path_filter_include = "['/data','/home/michal/']"
2181 #file_path_filter_include = []
2182
2183 # HDFS connector
2184 # Auth type can be Principal/keytab/keytabPrincipal
2185 # Specify HDFS Auth Type, allowed options are:
2186 # noauth : No authentication needed
2187 # principal : Authenticate with HDFS with a principal user



2188 # keytab : Authenticate with a Key tab (recommended). If running
2189 # DAI as a service, then the Kerberos keytab needs to
2190 # be owned by the DAI user.
2191 # keytabimpersonation : Login with impersonation using a keytab
2192 #hdfs_auth_type = "noauth"
2193
2194 # Kerberos app principal user (recommended)
2195 #hdfs_app_principal_user = ""
2196
2197 # Deprecated - Do Not Use, login user is taken from the user name used at login
2198 #hdfs_app_login_user = ""
2199
2200 # JVM args for HDFS distributions, provide args separated by spaces
2201 # -Djava.security.krb5.conf=<path>/krb5.conf
2202 # -Dsun.security.krb5.debug=True
2203 # -Dlog4j.configuration=file:///<path>log4j.properties
2204 #hdfs_app_jvm_args = ""
2205
2206 # hdfs class path
2207 #hdfs_app_classpath = ""
2208
2209 # List of supported DFS schemes. Ex. "['hdfs://', 'maprfs://', 'swift://']"
2210 # Supported schemes list is used as an initial check to ensure valid input to connector
2211 #
2212 #hdfs_app_supported_schemes = "['hdfs://', 'maprfs://', 'swift://']"
2213
2214 # Maximum number of files viewable in connector UI. Set to a larger number to view more files
2215 #hdfs_max_files_listed = 100
2216
2217 # Blue Data DTap connector settings are similar to HDFS connector settings.
2218 # Specify DTap Auth Type, allowed options are:
2219 # noauth : No authentication needed
2220 # principal : Authenticate with DTap with a principal user
2221 # keytab : Authenticate with a Key tab (recommended). If running
2222 # DAI as a service, then the Kerberos keytab needs to
2223 # be owned by the DAI user.
2224 # keytabimpersonation : Login with impersonation using a keytab
2225 # NOTE: "hdfs_app_classpath" and "core_site_xml_path" are both required to be set for DTap connector
2226 #dtap_auth_type = "noauth"
2227
2228 # DTap (HDFS) config folder path, can contain multiple config files
2229 #dtap_config_path = ""
2230
2231 # Path of the principal key tab file
2232 #dtap_key_tab_path = ""
2233
2234 # Kerberos app principal user (recommended)
2235 #dtap_app_principal_user = ""
2236
2237 # Specify the user id of the current user here as user@realm
2238 #dtap_app_login_user = ""
2239
2240 # JVM args for DTap distributions, provide args separated by spaces
2241 #dtap_app_jvm_args = ""
2242
2243 # DTap (HDFS) class path. NOTE: set 'hdfs_app_classpath' also
2244 #dtap_app_classpath = ""
2245
2246 # S3 Connector credentials
2247 #aws_access_key_id = ""
2248
2249 # S3 Connector credentials
2250 #aws_secret_access_key = ""
2251
2252 # S3 Connector credentials
2253 #aws_role_arn = ""
2254
2255 # What region to use when none is specified in the s3 url.
2256 # Ignored when aws_s3_endpoint_url is set.
2257 #
2258 #aws_default_region = ""
2259
2260 # Sets the endpoint URL that will be used to access S3.
2261 #aws_s3_endpoint_url = ""
2262
2263 # If set to true, the S3 Connector will try to obtain credentials associated with
2264 # the role attached to the EC2 instance.
2265 #aws_use_ec2_role_credentials = false
2266
2267 # Starting S3 path displayed in UI S3 browser
2268 #s3_init_path = "s3://"
2269
2270 # GCS Connector credentials
2271 # example (suggested) -- '/licenses/my_service_account_json.json'
2272 #gcs_path_to_service_account_json = ""
2273
2274 # Minio Connector credentials
2275 #minio_endpoint_url = ""
2276
2277 # Minio Connector credentials
2278 #minio_access_key_id = ""
2279
2280 # Minio Connector credentials
2281 #minio_secret_access_key = ""
2282
2283 # Recommended to provide: url, user, password
2284 # Optionally provide: account, user, password
2285 # Example URL: https://<snowflake_account>.<region>.snowflakecomputing.com
2286 # Snowflake Connector credentials
2287 #snowflake_url = ""
2288
2289 # Snowflake Connector credentials
2290 #snowflake_user = ""
2291



2292 # Snowflake Connector credentials
2293 #snowflake_password = ""
2294
2295 # Snowflake Connector credentials
2296 #snowflake_account = ""
2297
2298 # KDB Connector credentials
2299 #kdb_user = ""
2300
2301 # KDB Connector credentials
2302 #kdb_password = ""
2303
2304 # KDB Connector credentials
2305 #kdb_hostname = ""
2306
2307 # KDB Connector credentials
2308 #kdb_port = ""
2309
2310 # KDB Connector credentials
2311 #kdb_app_classpath = ""
2312
2313 # KDB Connector credentials
2314 #kdb_app_jvm_args = ""
2315
2316 # Azure Blob Store Connector credentials
2317 #azure_blob_account_name = ""
2318
2319 # Azure Blob Store Connector credentials
2320 #azure_blob_account_key = ""
2321
2322 # Azure Blob Store Connector credentials
2323 #azure_connection_string = ""
2324
2325 # Configuration for JDBC Connector.
2326 # JSON/Dictionary String with multiple keys.
2327 # Format as a single line without using carriage returns (the following example is formatted for readability).
2328 # Use triple quotations to ensure that the text is read as a single string.
2329 # Example:
2330 # '{
2331 # "postgres": {
2332 # "url": "jdbc:postgresql://ip address:port/postgres",
2333 # "jarpath": "/path/to/postgres_driver.jar",
2334 # "classpath": "org.postgresql.Driver"
2335 # },
2336 # "mysql": {
2337 # "url":"mysql connection string",
2338 # "jarpath": "/path/to/mysql_driver.jar",
2339 # "classpath": "my.sql.classpath.Driver"
2340 # }
2341 # }'
2342 #
2343 #jdbc_app_configs = "{}"
2344
2345 # extra jvm args for jdbc connector
2346 #jdbc_app_jvm_args = "-Xmx4g"
2347
2348 # alternative classpath for jdbc connector
2349 #jdbc_app_classpath = ""
2350
2351 # Notification scripts
2352 # - the variable points to the location of a script which is executed at a given event in the experiment lifecycle
2353 # - the script should have executable flag enabled
2354 # - use of absolute path is suggested
2355 # The on experiment start notification script location
2356 #listeners_experiment_start = ""
2357
2358 # The on experiment finished notification script location
2359 #listeners_experiment_done = ""
2360
2361 # Address of the H2O Storage endpoint. Keep empty to use the local storage only.
2362 #h2o_storage_address = ""
2363
2364 # Whether the channel to the storage should be encrypted.
2365 #h2o_storage_tls_enabled = true
2366
2367 # Path to the certification authority certificate that H2O Storage server identity will be checked against.
2368 #h2o_storage_tls_ca_path = ""
2369
2370 # Path to the client certificate to authenticate with H2O Storage server
2371 #h2o_storage_tls_cert_path = ""
2372
2373 # Path to the client key to authenticate with H2O Storage server
2374 #h2o_storage_tls_key_path = ""
2375
2376 # UUID of a Storage project to use instead of the remote HOME folder.
2377 #h2o_storage_internal_default_project_id = ""
2378
2379 # Default AWS credentials to be used for scorer deployments.
2380 #deployment_aws_access_key_id = ""
2381
2382 # Default AWS credentials to be used for scorer deployments.
2383 #deployment_aws_secret_access_key = ""
2384
2385 # AWS S3 bucket to be used for scorer deployments.
2386 #deployment_aws_bucket_name = ""
2387
2388 # Allow the browser to store e.g. login credentials in login form (set to false for higher security)
2389 #allow_form_autocomplete = true
2390
2391 # Enable Projects workspace (alpha version, for evaluation)
2392 #enable_projects = true
2393
2394 # Enable custom recipes.
2395 #enable_custom_recipes = true



2396
2397 # Enable uploading of custom recipes.
2398 #enable_custom_recipes_upload = true
2399
2400 # Include custom recipes in default inclusion lists (warning: enables all custom recipes)
2401 #include_custom_recipes_by_default = false
2402
2403 # Default application language - options are 'en', 'ja'
2404 #app_language = "en"
2405
2406 #enable_funnel = true
2407
2408 #clean_funnel = true
2409
2410 #quiet_funnel = false
2411
2412 #debug_daimodel_level = 0
2413
2414 # Whether to enable bootstrap sampling. Provides error bars to validation and test scores based on the standard error of the bootstrap mean.
2415 #enable_bootstrap = true
2416
2417 # Minimum number of bootstrap samples to use for estimating score and its standard deviation
2418 # Actual number of bootstrap samples will vary between the min and max,
2419 # depending upon row count (more rows, fewer samples) and accuracy settings (higher accuracy, more samples)
2420 #
2421 #min_bootstrap_samples = 1
2422
2423 # Maximum number of bootstrap samples to use for estimating score and its standard deviation
2424 # Actual number of bootstrap samples will vary between the min and max,
2425 # depending upon row count (more rows, fewer samples) and accuracy settings (higher accuracy, more samples)
2426 #
2427 #max_bootstrap_samples = 100
2428
2429 # Minimum fraction of row size to take as sample size for bootstrap estimator
2430 # Actual sample size used for bootstrap estimate will vary between the min and max,
2431 # depending upon row count (more rows, smaller sample size) and accuracy settings (higher accuracy, larger sample size)
2432 #
2433 #min_bootstrap_sample_size_factor = 1.0
2434
2435 # Maximum fraction of row size to take as sample size for bootstrap estimator
2436 # Actual sample size used for bootstrap estimate will vary between the min and max,
2437 # depending upon row count (more rows, smaller sample size) and accuracy settings (higher accuracy, larger sample size)
2438 #
2439 #max_bootstrap_sample_size_factor = 10.0
2440
2441 # Seed to use for final model bootstrap sampling, -1 means use experiment-derived seed.
2442 # E.g. one can retrain final model with different seed to get different final model error bars for scores.
2443 #
2444 #bootstrap_final_seed = -1
2445
2446 #varimp_threshold_at_interpretability_10 = 0.05
2447
2448 #features_allowed_by_interpretability = "{1: 10000000, 2: 10000, 3: 1000, 4: 500, 5: 300, 6: 200, 7: 150, 8: 100, 9: 80, 10: 50, 11: 50, 12: 50, 13: 50}"
2449
2450 #nfeatures_max_threshold = 200
2451
2452 #rdelta_percent_score_penalty_per_feature_by_interpretability = "{1: 0.0, 2: 0.1, 3: 1.0, 4: 2.0, 5: 5.0, 6: 10.0, 7: 20.0, 8: 30.0, 9: 50.0, 10: 100.0, 11: 100.0, 12: 100.0, 13: 100.0}"
2453
2454 #drop_low_meta_weights = true
2455
2456 #meta_weight_allowed_by_interpretability = "{1: 1E-7, 2: 1E-5, 3: 1E-4, 4: 1E-3, 5: 1E-2, 6: 0.03, 7: 0.05, 8: 0.08, 9: 0.10, 10: 0.15, 11: 0.15, 12: 0.15, 13: 0.15}"
2457
2458 #feature_cost_mean_interp_for_penalty = 5
2459
2460 #features_cost_per_interp = 0.25
2461
2462 #varimp_threshold_shift_report = 0.3
2463
2464 #apply_featuregene_limits_after_tuning = true
2465
2466 #remove_scored_0gain_genes_in_postprocessing_above_interpretability = 13
2467
2468 #remove_scored_0gain_genes_in_postprocessing_above_interpretability_final_population = 2
2469
2470 #remove_scored_by_threshold_genes_in_postprocessing_above_interpretability_final_population = 7
2471
2472 # Unnormalized probability to add genes or instances of transformers with specific attributes.
2473 # If no genes can be added, other mutations
2474 # (mutating model hyperparameters, pruning genes, pruning features, etc.) are attempted.
2475 #
2476 #prob_add_genes = 0.5
2477
2478 # Unnormalized probability, conditioned on prob_add_genes,
2479 # to add genes or instances of transformers with specific attributes
2480 # that have shown to be beneficial to other individuals within the population.
2481 #
2482 #prob_addbest_genes = 0.5
2483
2484 # Unnormalized probability to prune genes or instances of transformers with specific attributes.
2485 # If a variety of transformers with many attributes exists, default value is reasonable.
2486 # However, if one has fixed set of transformers that should not change or no new transformer attributes
2487 # can be added, then setting this to 0.0 is reasonable to avoid undesired loss of transformations.
2488 #
2489 #prob_prune_genes = 0.5
2490
2491 # Unnormalized probability to change model hyperparameters.
2492 #
2493 #prob_perturb_xgb = 0.25
2494
2495 # Unnormalized probability to prune features that have low variable importance,
2496 # as opposed to pruning entire instances of genes/transformers.
2497 #



2498 #prob_prune_by_features = 0.25
2499
2500 #max_absolute_feature_expansion = 1000
2501
2502 # Number of classes above which to always use TensorFlow (if TensorFlow is enabled),
2503 # instead of other models set to 'auto' (models set to 'on' are still used).
2504 #tensorflow_num_classes_switch = 10
2505
2506 # Class count above which do not use TextLin Transformer.
2507 #textlin_num_classes_switch = 5
2508
2509 #text_gene_dim_reduction_choices = "[50]"
2510
2511 #text_gene_max_ngram = "[1, 2, 3]"
2512
2513 # Max. number of top variable importances to show in logs during feature evolution
2514 #max_num_varimp_to_log = 10
2515
2516 # Max. number of top variable importance shifts to show in logs and GUI after final model built
2517 #max_num_varimp_shift_to_log = 10
2518
2519 # Dictionary to control recipes for each experiment and particular custom recipes.
2520 # E.g. if inserting into the GUI as any toml string, can use:
2521 # ""recipe_dict="{'key1': 2, 'key2': 'value2'}"""
2522 # E.g. if putting into config.toml as a dict, can use:
2523 # recipe_dict="{'key1': 2, 'key2': 'value2'}"
2524 #
2525 #recipe_dict = "{}"
2526
2527 # location of custom recipes packages installed (relative to data_directory)
2528 # We will try to install packages dynamically, but can also do (before or after server started):
2529 # (inside the running Docker instance if using Docker, or as the user the server runs as (e.g. the dai user) for deb/tar native installations):
2530 # PYTHONPATH=<full tmp dir>/<contrib_env_relative_directory>/lib/python3.6/site-packages/ <path to dai>dai-env.sh python -m pip install --prefix=<full tmp dir>/<contrib_env_relative_directory> <packagename> --upgrade --upgrade-strategy only-if-needed --log-file pip_log_file.log
2531 # where <path to dai> is /opt/h2oai/dai/ for native rpm/deb installation
2532 # Note: can also install wheel files if <packagename> is the name of a wheel file or archive.
2533 #
2534 #contrib_env_relative_directory = "contrib/env"
2535
2536 # pip install retry for call to pip. Sometimes need to try twice
2537 #pip_install_overall_retries = 2
2538
2539 # pip install verbosity level (number of -v's given to pip, up to 3)
2540 #pip_install_verbosity = 2
2541
2542 # pip install timeout in seconds; sometimes internet issues mean it is better to fail faster
2543 #pip_install_timeout = 15
2544
2545 # pip install retry count
2546 #pip_install_retries = 5
2547
2548 # pip install options: string of list of other options, e.g. "['--proxy', 'http://user:password@proxyserver:port']"
2549 #pip_install_options = ""
2550
2551 # Whether to enable basic acceptance testing. Tests if can pickle the state, etc.
2552 #enable_basic_acceptance_tests = true
2553
2554 # Whether acceptance tests should run for custom genes / models / scorers / etc.
2555 #enable_acceptance_tests = true
2556
2557 # Minutes to wait until a recipe's acceptance testing is aborted. A recipe is rejected if acceptance
2558 # testing is enabled and times out.
2559 # One may also set timeout for a specific recipe by setting the class's staticmethod function called
2560 # acceptance_test_timeout to return number of minutes to wait until timeout doing acceptance testing.
2561 # This timeout does not include the time to install required packages.
2562 #
2563 #acceptance_test_timeout = 20.0
2564
2565 # Skipping just avoids the failed transformer.
2566 # Sometimes python multiprocessing swallows exceptions,
2567 # so skipping and logging exceptions is also more reliable way to handle them.
2568 # Recipe can raise h2oaicore.systemutils.IgnoreError to ignore error and avoid logging error.
2569 #
2570 #skip_transformer_failures = true
2571
2572 # Skipping just avoids the failed model. Failures are logged depending upon detailed_skip_failure_messages_level.
2573 # Recipe can raise h2oaicore.systemutils.IgnoreError to ignore error and avoid logging error.
2574 #
2575 #skip_model_failures = true
2576
2577 # How much verbosity to log failure messages for failed and then skipped transformers or models.
2578 # Full failures always go to disk as *.stack files,
2579 # which upon completion of experiment goes into details folder within experiment log zip file.
2580 #
2581 #detailed_skip_failure_messages_level = 1
2582
2583 # Instructions for 'Add to config.toml via toml string' in GUI expert page
2584 # Self-referential toml parameter, for setting any other toml parameters as string of tomls separated by \n
2585 # (spaces around \n
2586 # are ok).
2587 # Useful when toml parameter is not in expert mode but want per-experiment control.
2588 # Setting this will override all other choices.
2589 # In expert page, each time expert options saved, the new state is set without memory of any prior settings.
2590 # The entered item is a fully compliant toml string that would be processed directly by toml.load().
2591 # One should include 2 double quotes around the entire setting, or double quotes need to be escaped.
2592 # One enters into the expert page text as follows:
2593 # e.g. enable_glm="off"
2594 # enable_xgboost_gbm="off"
2595 # enable_lightgbm="on"
2596 # e.g. ""enable_glm="off"
2597 # enable_xgboost_gbm="off"
2598 # enable_lightgbm="off"
2599 # enable_tensorflow="on"""
2600 # e.g. fixed_num_individuals=4



2601 # e.g. params_lightgbm="{'objective':'poisson'}"
2602 # e.g. ""params_lightgbm="{'objective':'poisson'}"""
2603 # e.g. max_cores=10
2604 # data_precision="float32"
2605 # max_rows_feature_evolution=50000000000
2606 # ensemble_accuracy_switch=11
2607 # feature_engineering_effort=1
2608 # target_transformer="identity"
2609 # tournament_feature_style_accuracy_switch=5
2610 # params_tensorflow="{'layers': [100, 100, 100, 100, 100, 100]}"
2611 # e.g. ""max_cores=10
2612 # data_precision="float32"
2613 # max_rows_feature_evolution=50000000000
2614 # ensemble_accuracy_switch=11
2615 # feature_engineering_effort=1
2616 # target_transformer="identity"
2617 # tournament_feature_style_accuracy_switch=5
2618 # params_tensorflow="{'layers': [100, 100, 100, 100, 100, 100]}"""
2619 # If you see: "toml.TomlDecodeError" then ensure toml is set correctly.
2620 # When set in the expert page of an experiment, these changes only affect experiments and not the server
2621 # Usually should keep this as empty string in this toml file.
2622 #config_overrides = ""
2623
2624 # Whether user can download dataset as csv file
2625 #enable_dataset_downloading = true
2626
2627 # Extra HTTP headers.
2628 #extra_http_headers = "{}"
2629
2630 # After how many days the audit log records are removed.
2631 # Set equal to 0 to disable removal of old records.
2632 #
2633 #audit_log_retention_period = 5
2634
2635 # Replace all the downloads on the experiment page with exports and allow users to push to the artifact store configured with artifacts_store
2636 #enable_artifacts_upload = false
2637
2638 # Artifacts store.
2639 # file_system: stores artifacts on a file system directory denoted by artifacts_file_system_directory.
2640 #
2641 #artifacts_store = "file_system"
2642
2643 # File system location where artifacts will be copied in case artifacts_store is set to file_system
2644 #artifacts_file_system_directory = "tmp"
2645



CHAPTER

TEN

ENVIRONMENT VARIABLES AND CONFIGURATION OPTIONS

Driverless AI provides a number of environment variables that can be passed when starting Driverless AI or specified
in a config.toml file. The complete list of variables is in the Using the config.toml File section. The steps for specifying
variables vary depending on whether you installed a Driverless AI RPM, DEB, or TAR SH or whether you are running
a Docker image.

10.1 Setting Environment Variables in Docker Images

Each property must be prepended with DRIVERLESS_AI_. The example below starts Driverless AI with environment
variables that enable S3 and HDFS access (without authentication).
nvidia-docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="local" \
-e DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="<htpasswd_file_location>" \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG
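In general, any option from the Sample Config.toml File section can be passed this way: the environment variable
name is the option name in upper case with the DRIVERLESS_AI_ prefix. The lines below are only a sketch of that
mapping with placeholder values; confirm each option name against the config.toml reference before relying on it.

# Sketch: a few config.toml options expressed as environment variables
# (placeholder values; option names taken from the sample config.toml)
export DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs"        # enabled_file_systems
export DRIVERLESS_AI_AUTHENTICATION_METHOD="local"              # authentication_method
export DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="/path/to/htpasswd"    # local_htpasswd_file (hypothetical path)

Variables exported this way can be passed into the container with -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS (and so on)
instead of specifying the values inline.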

10.2 Setting Configuration Options in Native Installs

The config.toml file is available in the etc/dai folder after the RPM, DEB, or TAR SH is installed. Edit the desired
variables in this file, and then restart Driverless AI.
The example below shows the configuration options to set in the config.toml file when enabling S3 and HDFS access
(without authentication).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the desired configuration options to enable S3 and HDFS access (without authentication).
# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google BigQuery, remember to configure gcs_path_to_service_account_json below



# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file,s3,hdfs"

# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam : Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server, look
# for additional settings under LDAP settings
# local: Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "local"

# Local password file


# Generating a htpasswd file: see syntax below
# htpasswd -B "<location_to_place_htpasswd_file>" "<username>"
# note: -B forces use of bcrypt, a secure password hashing method
local_htpasswd_file = "<htpasswd_file_location>"
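If you do not already have an htpasswd file, the command below is one way to create it. This is only a sketch: the
file location and user name are placeholders, and the htpasswd utility comes from the Apache tools package
(httpd-tools on RPM-based systems, apache2-utils on DEB-based systems).

# Sketch: create a bcrypt-hashed htpasswd file for the "local" authentication method
# (-c creates the file; you will be prompted for the password)
htpasswd -B -c /path/to/htpasswd dai_user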

3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Linux RPM or DEB with systemd
sudo systemctl start dai

# Linux RPM or DEB without systemd


sudo -H -u dai /opt/h2oai/dai/run-dai.sh

# Linux TAR SH
./run-dai.sh
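Optionally, verify that the server started and that the web UI responds. The commands below are only a sketch; they
assume a systemd-based install and the default port 12345, so adjust them for your environment.

# Sketch: basic post-start checks (the first command applies to systemd installs only)
sudo systemctl status dai --no-pager
curl -sSf http://localhost:12345 > /dev/null && echo "Driverless AI web UI is reachable"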



CHAPTER

ELEVEN

ENABLING DATA CONNECTORS

Driverless AI provides various data connectors for external data sources. Data sources are exposed in the form of
file systems. Each file system is prefixed by a unique prefix. For example:
• To reference data on S3, use s3://.
• To reference data on HDFS, use the prefix hdfs://.
• To reference data on Azure Blob Store, use https://<storage_name>.blob.core.windows.net.
• To reference data on BlueData Datatap, use dtap://.
• To reference data on Google BigQuery, make sure you know the Google BigQuery dataset and the table that
you want to query. Use a standard SQL query to ingest data.
• To reference data on Google Cloud Storage, use gs://
• To reference data on kdb+, use the hostname and the port http://<kdb_server>:<port>
• To reference data on Minio, use http://<endpoint_url>.
• To reference data on Snowflake, use a standard SQL query to ingest data.
• To access a SQL database via JDBC, use a SQL query with the syntax associated with your database.
Refer to the following sections for more information:

11.1 Using Data Connectors with the Docker Image

Available file systems can be configured via the enabled_file_systems property. Note that each property must
be prepended with DRIVERLESS_AI_. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs,gcs,gbq,kdb,minio,snow,dtap,azrbs" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-ppc64le:TAG

The sections that follow show examples describing how to use environment variables to enable HDFS, S3, Google
Cloud Storage, Google Big Query, Minio, Snowflake, kdb+, Azure Blob Store, BlueData DataTap, and JDBC data
sources.


11.1.1 S3 Setup

Driverless AI allows you to explore S3 data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with S3.

Description of Configuration Attributes

• aws_access_key_id: The S3 access key ID


• aws_secret_access_key: The S3 access key
• aws_role_arn: The Amazon Resource Name
• aws_default_region: The region to use when the aws_s3_endpoint_url option is not set. This is ignored
when aws_s3_endpoint_url is set.
• aws_s3_endpoint_url: The endpoint URL that will be used to access S3.
• aws_use_ec2_role_credentials: If set to true, the S3 Connector will try to obtain credentials associated
with the role attached to the EC2 instance.
• s3_init_path: The starting S3 path that will be displayed in UI S3 browser.
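As a quick illustration, these attributes can also be supplied as environment variables when launching Driverless AI.
The lines below are only a sketch with placeholder values and assume the usual DRIVERLESS_AI_ prefix mapping described
in the previous chapter; the Docker examples that follow show the equivalent -e flags.

# Sketch: S3 connector attributes as environment variables (placeholder values)
export DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3"
export DRIVERLESS_AI_AWS_ACCESS_KEY_ID="<access_key_id>"
export DRIVERLESS_AI_AWS_SECRET_ACCESS_KEY="<access_key>"
export DRIVERLESS_AI_S3_INIT_PATH="s3://my-bucket/datasets/"    # hypothetical bucket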

Start Driverless AI

The following sections describe how to enable the S3 data connector when starting Driverless AI in Docker. This can
be done by specifying each environment variable in the nvidia-docker run command or by editing the configuration
options in the config.toml file and then specifying that file in the nvidia-docker run command.

Enable S3 with No Authentication

This example enables the S3 data connector and disables authentication. It does not pass any S3 access key or secret;
however, it configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference
data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv. Replace TAG
below with the image tag.
nvidia-docker run \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3" \
-p 12345:12345 \
--init -it --rm \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Enable S3 with Authentication

This example enables the S3 data connector with authentication by passing an S3 access key ID and an access key.
It also configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference data
stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv. Replace TAG below
with the image tag.


nvidia-docker run \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3" \
-e DRIVERLESS_AI_AWS_ACCESS_KEY_ID="<access_key_id>" \
-e DRIVERLESS_AI_AWS_SECRET_ACCESS_KEY="<access_key>" \
-p 12345:12345 \
--init -it --rm \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure S3 options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables S3 with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, upload, s3"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

11.1.2 HDFS Setup

Driverless AI allows you to explore HDFS data sources from within the Driverless AI application. This section
provides instructions for configuring Driverless AI to work with HDFS.

Supported Hadoop Platforms

• CDH 5.4
• CDH 5.5
• CDH 5.6
• CDH 5.7
• CDH 5.8
• CDH 5.9
• CDH 5.10
• CDH 5.13
• CDH 5.14
• CDH 5.15
• CDH 5.16


• CDH 6.0
• CDH 6.1
• CDH 6.2
• CDH 6.3
• HDP 2.2
• HDP 2.3
• HDP 2.4
• HDP 2.5
• HDP 2.6
• HDP 3.0
• HDP 3.1

Description of Configuration Attributes

• hdfs_config_path: The location of the HDFS config folder. This folder can contain multiple config
files.
• hdfs_auth_type: Selects HDFS authentication. Available values are:
– principal: Authenticate with HDFS with a principal user.
– keytab: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos
keytab needs to be owned by the DAI user.
– keytabimpersonation: Login with impersonation using a keytab.
– noauth: No authentication needed.
• key_tab_path: The path of the principal key tab file. For use when hdfs_auth_type=principal.
• hdfs_app_principal_user: The Kerberos application principal user.
• hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.
– -Djava.security.krb5.conf
– -Dsun.security.krb5.debug
– -Dlog4j.configuration
• hdfs_app_classpath: The HDFS classpath.
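For reference, a config.toml sketch combining these attributes with keytab authentication might look like the following (placeholder values; the procsy settings are taken from the config.toml example later in this section):

procsy_ip = "127.0.0.1"
procsy_port = 8080
enabled_file_systems = "file, upload, hdfs"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_auth_type = "keytab"
key_tab_path = "/tmp/<keytabname>"
hdfs_app_principal_user = "<user@kerberosrealm>"
# Optional JVM arguments and classpath
hdfs_app_jvm_args = "-Djava.security.krb5.conf=/path/to/krb5.conf"
hdfs_app_classpath = "/path/to/hdfs/classpath"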

Start Driverless AI

This section describes how to enable the HDFS data connector when starting Driverless AI in Docker. This can be done
by specifying each environment variable in the nvidia-docker run command or by editing the configuration
options in the config.toml file and then specifying that file in the nvidia-docker run command.


Enable HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS
configuration file; however, it configures Docker DNS by passing the name and IP of the HDFS name node. This allows
users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/
datasets/iris.csv. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Enable HDFS with Keytab-Based Authentication

Notes:
• If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If
the time difference between clients and DCs is 5 minutes or more, Kerberos authentication will fail.
• If running Driverless AI as a service, the Kerberos keytab needs to be owned by the Driverless AI user; otherwise
Driverless AI will not be able to read the keytab, will fall back to simple authentication, and will fail. (A
keytab-ownership sketch follows these notes.)
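As a minimal sketch of setting the keytab ownership (assuming a hypothetical Driverless AI service user named dai and a keytab stored under /tmp/dtmp; adjust the user, group, and path for your installation):

# Hypothetical user and path; make the keytab owned and readable only by the DAI service user
sudo chown dai:dai /tmp/dtmp/<<keytabname>>
sudo chmod 600 /tmp/dtmp/<<keytabname>>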
This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the environment variable DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER to reference a user
for whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab' \
-e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG


Enable HDFS with Keytab-Based Impersonation

Notes:
• If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
• Logins are case sensitive when keytab-based impersonation is configured.
The example:
• Sets the authentication type to keytabimpersonation.
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER variable, which references a user for
whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation' \
-e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables HDFS with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port,
which defaults to 12347, also has to be changed.
• enabled_file_systems = "file, upload, hdfs"
• procsy_ip = "127.0.0.1"
• procsy_port = 8080
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG


Specifying a Hadoop Platform

The following example shows how to build an H2O-3 Hadoop image and run Driverless AI on that image. This
example uses CDH 6.0. Change the H2O_TARGET to specify a different platform.
1. Clone and then build H2O-3 for CDH 6.0.
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew clean build -x test
export H2O_TARGET=cdh6.0
export BUILD_HADOOP=true
./gradlew clean build -x test

2. Start Driverless AI.


docker run -it --rm \
-v `pwd`:`pwd` \
-w `pwd` \
--entrypoint bash \
--network=host \
-p 8020:8020 \
docker.h2o.ai/cdh-6-w-hive \
-c 'sudo -E startup.sh && \
source /envs/h2o_env_python3.6/bin/activate && \
hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars "$(cat /opt/hive-jars/hive-libjars)" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \
export CLOUD_IP=localhost && \
export CLOUD_PORT=54445 && \
make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \
bash'

3. Run the Driverless AI HDFS connector.


java -cp h2oai-dai-connectors.jar ai.h2o.dai.connectors.HdfsConnector

4. Verify the commands for ls and cp, for example.


{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", authType: "noauth", "srcPath": "hdfs://localhost/user/jenkins/", "dstPath": "/
˓→tmp/xxx", "command": "cp", "user": "", "appUser": ""}
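By analogy, an ls request might look like the following; treat the exact field usage for ls as an assumption and verify it against your connector output:

{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", "authType": "noauth", "srcPath": "hdfs://localhost/user/jenkins/", "dstPath": "", "command": "ls", "user": "", "appUser": ""}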

11.1.3 Azure Blob Store Setup

Driverless AI allows you to explore Azure Blob Store data sources from within the Driverless AI application. This
section describes how to enable the Azure Blob Store data connector in Docker environments.

Description of Configuration Attributes

azure_blob_account_name: The Microsoft Azure Storage account name. This should be the DNS prefix created
when the account was created (for example, “mystorage”).
azure_blob_account_key: Specify the account key that maps to your account name.
azure_connection_string: Optionally specify a new connection string. With this option, you can include an
override for a host, port, and/or account name. For example,
azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=<account_name>;AccountKey=<account_key>;BlobEndpoint=http://<host>:
˓→<port>/<account_name>;"


Start Driverless AI

This section describes how to enable the Azure Blob Store data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Start DAI Using Environment Variables

This example enables the Azure Blob Store data connector. This allows users to reference data stored on your Azure
storage account using the account name, for example: https://mystorage.blob.core.windows.net. Re-
place TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,azrbs" \
-e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_NAME="mystorage" \
-e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_KEY="<access_key>" \
-p 12345:12345 \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure Azure Blob Store options in the config.toml file, and then specify that file when
starting Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, azrbs"
• azure_blob_account_name = "mystorage"
• azure_blob_account_key = "<account_key>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG


11.1.4 BlueData DataTap Setup

This section provides instructions for configuring Driverless AI to work with BlueData DataTap.

Description of Configuration Attributes

• dtap_auth_type: Selects DTAP authentication. Available values are:


– noauth: No authentication needed
– principal: Authenticate with DataTap with a principal user
– keytab: Authenticate with a Key tab (recommended). If running Driverless AI as a service, then the
Kerberos keytab needs to be owned by the Driverless AI user.
– keytabimpersonation: Login with impersonation using a keytab
• dtap_config_path: The location of the DTap (HDFS) config folder. This folder can contain multiple
config files. Note: The DTap config file core-site.xml needs to contain the DTap FS configuration, for example:
<configuration>
<property>
<name>fs.dtap.impl</name>
<value>com.bluedata.hadoop.bdfs.Bdfs</value>
<description>The FileSystem for BlueData dtap: URIs.</description>
</property>
</configuration>

• dtap_key_tab_path: The path of the principal key tab file. For use when
dtap_auth_type=principal.
• dtap_app_principal_user: The Kerberos app principal user (recommended).
• dtap_app_login_user: The user ID of the current user (for example, user@realm).
• dtap_app_jvm_args: JVM args for DTap distributions. Separate each argument with spaces.
• dtap_app_classpath: The DTap classpath.
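For reference, a config.toml sketch using keytab authentication with these attributes (placeholder values; adjust paths and principals for your cluster):

enabled_file_systems = "file, upload, dtap"
dtap_config_path = "/path/to/dtap/conf"
dtap_auth_type = "keytab"
dtap_key_tab_path = "/tmp/<keytabname>"
dtap_app_principal_user = "<user@kerberosrealm>"
# For keytabimpersonation, also set the impersonated login user:
# dtap_app_login_user = "<user@realm>"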

Start Driverless AI

This section describes how to enable the BlueData DataTap data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Enable DataTap with No Authentication

This example enables the DataTap data connector and disables authentication. It does not pass any configuration file;
however, it configures Docker DNS by passing the name and IP of the DTap name node. This allows users to reference
data stored in DTap directly using the name node address, for example: dtap://name.node/datasets/iris.csv
or dtap://name.node/datasets/. (Note: The trailing slash is currently required for directories.) Replace
TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,dtap" \
-e DRIVERLESS_AI_DTAP_AUTH_TYPE='noauth' \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \

-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Enable DataTap with Keytab-Based Authentication

Notes:
• If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If
the time difference between clients and DCs is 5 minutes or more, Kerberos authentication will fail.
• If running Driverless AI as a service, the Kerberos keytab needs to be owned by the Driverless AI user; otherwise
Driverless AI will not be able to read the keytab, will fall back to simple authentication, and will fail.
This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the environment variable DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER to reference a user
for whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,dtap" \
-e DRIVERLESS_AI_DTAP_AUTH_TYPE='keytab' \
-e DRIVERLESS_AI_DTAP_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Enable DataTap with Keytab-Based Impersonation

Notes:
• If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
The example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER variable, which references a user for
whom the keytab was created (usually in the form of user@realm).
• Configures the DRIVERLESS_AI_DTAP_APP_LOGIN_USER variable, which references a user who is being
impersonated (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \

-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,dtap" \
-e DRIVERLESS_AI_DTAP_AUTH_TYPE='keytabimpersonation' \
-e DRIVERLESS_AI_DTAP_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
-e DRIVERLESS_AI_DTAP_APP_LOGIN_USER='<<thisuser@kerberosrealm>>' \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure DataTap options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables DataTap with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, dtap"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

11.1.5 Google BigQuery

Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This
section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to
enable authentication. If you enable the GCS and/or GBQ connectors, those file systems will be available in the UI,
but you will not be able to use those connectors without authentication.
In order to enable the GBQ data connector with authentication, you must:
1. Retrieve a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the JSON file (for example, /json_auth_file.json) in the GCS_PATH_TO_SERVICE_ACCOUNT_JSON
environment variable.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.


Start Driverless AI

This section describes how to enable the Google BigQuery data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Enable GBQ with Authentication

This example enables the GBQ data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google BigQuery authentications. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gbq" \
-e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/service_account_json.json:/service_account_json.json \
h2oai/dai-centos7-x86_64:TAG

After Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or
Drag and Drop) drop-down menu.

Start DAI by Updating the config.toml File

This example shows how to configure the GBQ data connector options in the config.toml file, and then specify that
file when starting Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, gbq"
• gcs_path_to_service_account_json = "/service_account_json.json"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG


Specify the following information to add your dataset.


1. Enter BQ Dataset ID with write access to create temporary table: Enter a dataset ID in Google BigQuery
that this user has read/write access to. BigQuery uses this dataset as the location for the new table generated by
the query.
Note: Driverless AI’s connection to GBQ will inherit the top-level directory from the service JSON file.
So if a dataset named “my-dataset” is in a top-level directory named “dai-gbq”, then the value for the
dataset ID input field would be “my-dataset” and not “dai-gbq:my-dataset”.
2. Enter Google Storage destination bucket: Specify the name of Google Cloud Storage destination bucket.
Note that the user must have write access to this bucket.
3. Enter Name for Dataset to be saved as: Specify a name for the dataset, for example, my_file.
4. Enter BigQuery Query (Use StandardSQL): Enter a StandardSQL query that you want BigQuery to execute.
For example: SELECT * FROM <my_dataset>.<my_table>. (A fuller example follows this list.)
5. When you are finished, select the Click to Make Query button to add the dataset.
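As a slightly fuller illustration of the query in step 4, a StandardSQL statement with hypothetical dataset, table, and column names might look like:

SELECT state, COUNT(*) AS loan_count
FROM my_dataset.loan_level
WHERE loan_type = 5
GROUP BY state
ORDER BY loan_count DESC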


11.1.6 Google Cloud Storage Setup

Driverless AI allows you to explore Google Cloud Storage data sources from within the Driverless AI application. This
section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup requires
you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but
you will not be able to use those connectors without authentication.
In order to enable the GCS data connector with authentication, you must:
1. Obtain a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the JSON file (for example, /json_auth_file.json) in the GCS_PATH_TO_SERVICE_ACCOUNT_JSON
environment variable.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.

Start Driverless AI

This section describes how to enable the Google Cloud Storage data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Start GCS with Authentication

This example enables the GCS data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google Cloud Storage authentications. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gcs" \
-e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/service_account_json.json:/service_account_json.json \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure the GCS data connector options in the config.toml file, and then specify that file
when starting Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
• enabled_file_systems = "file, upload, gcs"
• gcs_path_to_service_account_json = "/service_account_json.json"
2. Mount the config.toml file into the Docker container.


nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

11.1.7 kdb+ Setup

Driverless AI allows you to explore kdb+ data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with kdb+.

Description of Configuration Attributes

• kdb_user: (Optional) User name
• kdb_password: (Optional) User’s password
• kdb_hostname: IP address or host of the kdb+ server
• kdb_port: Port on which the kdb+ server is listening
• kdb_app_jvm_args: (Optional) JVM args for kdb+ distributions (for example, -Dlog4j.configuration).
Separate each argument with spaces.
• kdb_app_classpath: (Optional) The kdb+ classpath (or other location if the jar file is stored elsewhere).
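For reference, a config.toml sketch combining these attributes (placeholder values; user and password are needed only if your kdb+ server requires them):

enabled_file_systems = "file, upload, kdb"
kdb_hostname = "<ip_or_host_of_kdb_server>"
kdb_port = "<kdb_server_port>"
# Optional credentials and JVM settings
kdb_user = "<username>"
kdb_password = "<password>"
kdb_app_jvm_args = "-Dlog4j.configuration=file:/path/to/log4j.properties"
kdb_app_classpath = ""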

Start Driverless AI

The following sections describe how to enable the kdb+ data connector when starting Driverless AI in Docker. This
can be done by specifying each environment variable in the nvidia-docker run command or by editing the config-
uration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Enable kdb+ with No Authentication

This example enables the kdb+ connector without authentication. The only required flags are the hostname and the
port. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,kdb" \
-e DRIVERLESS_AI_KDB_HOSTNAME="<ip_or_host_of_kdb_server>" \
-e DRIVERLESS_AI_KDB_PORT="<kdb_server_port>" \
-p 12345:12345 \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG


Enable kdb+ with Authentication

This example provides users credentials for accessing a kdb+ server from Driverless AI. Replace TAG below with the
image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,kdb" \
-e DRIVERLESS_AI_KDB_HOSTNAME="<ip_or_host_of_kdb_server>" \
-e DRIVERLESS_AI_KDB_PORT="<kdb_server_port>" \
-e DRIVERLESS_AI_KDB_USER="<username>" \
-e DRIVERLESS_AI_KDB_PASSWORD="<password>" \
-p 12345:12345 \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

After the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and
Drop) drop-down menu.

Start DAI by Updating the config.toml File

This example shows how to configure kdb+ options in the config.toml file, and then specify that file when starting
Driverless AI in Docker. Note that this example enables kdb+ with no authentication.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, upload, kdb"
• kdb_hostname = "<ip_or_host_of_kdb_server>"
• kdb_port = "<kdb_server_port>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

After the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and
Drop) drop-down menu.


Specify the following information to add your dataset.


1. Enter filepath to save query: Enter the local file path for storing your dataset. For example,
/home/<user>/myfile.csv. Note that this can only be a CSV file.
2. Enter KDB Query: Enter a kdb+ query that you want to execute. Note that the connector will accept any q
queries. For example: select from <mytable> or <mytable> lj <myothertable>
3. When you are finished, select the Click to Make Query button to add the dataset.


11.1.8 Minio Setup

This section provides instructions for configuring Driverless AI to work with Minio. Note that unlike S3, authentication
must also be configured when the Minio data connector is specified.

Start Driverless AI

The following sections describe how to enable the Minio data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Enable Minio with Authentication

This example enables the Minio data connector with authentication by passing an endpoint URL, an access key ID, and a
secret access key. It also configures Docker DNS by passing the name and IP of the name node. This allows users to reference
data stored in Minio directly using the endpoint URL, for example: http://<endpoint_url>/<bucket>/datasets/iris.csv.
Replace TAG below with the image tag.
nvidia-docker run \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,minio" \
-e DRIVERLESS_AI_MINIO_ENDPOINT_URL="<endpoint_url>" \
-e DRIVERLESS_AI_MINIO_ACCESS_KEY_ID="<access_key_id>" \
-e DRIVERLESS_AI_MINIO_SECRET_ACCESS_KEY="<access_key>" \
-p 12345:12345 \
--init -it --rm \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure Minio options in the config.toml file, and then specify that file when starting
Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, upload, minio"
• minio_endpoint_url = "<endpoint_url>"
• minio_access_key_id = "<access_key_id>"
• minio_secret_access_key = "<access_key>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG


11.1.9 Snowflake

Driverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section
provides instructions for configuring Driverless AI to work with Snowflake. This setup requires you to enable
authentication. If you enable the Snowflake connector, it will be available in the UI, but you will not be able to use it
without authentication.

Start Driverless AI

The following sections describe how to enable the Snowflake data connector when starting Driverless AI in Docker.
This can be done by specifying each environment variable in the nvidia-docker run command or by editing the
configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Enable Snowflake with Authentication

This example enables the Snowflake data connector with authentication by passing the account, user, and
password variables. Replace TAG below with the image tag.
nvidia-docker run \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,snow" \
-e DRIVERLESS_AI_SNOWFLAKE_ACCOUNT="<account_id>" \
-e DRIVERLESS_AI_SNOWFLAKE_USER="<username>" \
-e DRIVERLESS_AI_SNOWFLAKE_PASSWORD="<password>" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/service_account_json.json:/service_account_json.json \
h2oai/dai-centos7-x86_64:TAG

After the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or
Drag and Drop) drop-down menu.

Start DAI by Updating the config.toml File

This example shows how to configure Snowflake options in the config.toml file, and then specify that file when starting
Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options.
• enabled_file_systems = "file, snow"
• snowflake_account = "<account_id>"
• snowflake_user = "<username>"
• snowflake_password = "<password>"
2. Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \

-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

After the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or
Drag and Drop) drop-down menu.

Specify the following information to add your dataset.


1. Enter Output Filename: Specify the name of the file on your local system that you want to add to Driverless
AI. Note that this can only be a CSV file (for example, myfile.csv).
2. Enter Database: Specify the name of the Snowflake database that you are querying.
3. Enter Warehouse: Specify the name of the Snowflake warehouse that you are querying.
4. Enter Schema: Specify the schema of the dataset that you are querying.
5. Enter Region: (Optional) Specify the region of the warehouse that you are querying. This can be found in
the Snowflake-provided URL to access your database (as in <optional-deployment-name>.<region>.<cloud-
provider>.snowflakecomputing.com).
6. Enter Role: (Optional) Specify your role as designated within Snowflake. See https://docs.snowflake.net/
manuals/user-guide/security-access-control-overview.html for more information.
7. Enter File Formatting Params: (Optional) Specify any additional parameters for formatting your datasets.
Available parameters are listed in https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.
html#optional-parameters. (Note: Use only parameters for TYPE = CSV.) For example, if your
dataset includes a text column that contains commas, you can specify a different delimiter using
FIELD_DELIMITER='character'. Separate multiple parameters with spaces only. For example:
FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY=""

8. Enter Snowflake Query: Specify the Snowflake query that you want to execute. (An example follows this list.)
9. When you are finished, select the Click to Make Query button to add the dataset.
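As an illustration of step 8, a query against a hypothetical table might look like the following (database, schema, table, and column names are placeholders):

SELECT * FROM my_database.my_schema.loan_level WHERE loan_type = 5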


11.1.10 JDBC

Driverless AI allows you to explore Java Database Connectivity (JDBC) data sources from within the Driverless AI
application. This section provides instructions for configuring Driverless AI to work with JDBC.

Tested Databases

The following databases have been tested for minimal functionality. Note that JDBC drivers that are not included in
this list should work with Driverless AI. We recommend that you test out your JDBC driver even if you do not see it
on the list of tested databases. See the Adding an Untested JDBC Driver section at the end of this chapter for information
on how to try out an untested JDBC driver.
• Oracle DB
• PostgreSQL
• Amazon Redshift
• Teradata


Description of Configuration Attributes

• jdbc_app_configs: Configuration for the JDBC connector. This is a JSON/Dictionary String with multiple
keys. Note: This requires a JSON key (typically the name of the database being configured) to be associated
with a nested JSON that contains the url, jarpath, and classpath fields. In addition, this should take the
format:
"""{"my_jdbc_database": {"url": "jdbc:my_jdbc_database://hostname:port/database",
"jarpath": "/path/to/my/jdbc/database.jar", "classpath": "com.my.jdbc.Driver"}}"""

For example:
"""{
"postgres": {
"url": "jdbc:postgresql://ip address:port/postgres",
"jarpath": "/path/to/postgres_driver.jar",
"classpath": "org.postgresql.Driver"
},
"mysql": {
"url":"mysql connection string",
"jarpath": "/path/to/mysql_driver.jar",
"classpath": "my.sql.classpath.Driver"
}
}"""

• jdbc_app_jvm_args: Extra jvm args for JDBC connector. For example, “-Xmx4g”.
• jdbc_app_classpath: Optionally specify an alternative classpath for the JDBC connector.

Retrieve the JDBC Driver

1. Download JDBC Driver JAR files:


• Oracle DB
• PostgreSQL
• Amazon Redshift
• Teradata
Note: Remember to take note of the driver classpath, as it is needed for the configuration steps (for
example, org.postgresql.Driver).
2. Copy the driver JAR to a location that can be mounted into the Docker container.
Note: The folder storing the JDBC jar file must be visible/readable by the dai process user.
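A minimal sketch of step 2 on the host, assuming a hypothetical directory /opt/dai/jdbc-jars and a hypothetical PostgreSQL driver JAR name; adjust paths and file names for your environment:

# Hypothetical paths; any directory that can be mounted into the Docker container will do
sudo mkdir -p /opt/dai/jdbc-jars
sudo cp ~/Downloads/postgresql-42.2.5.jar /opt/dai/jdbc-jars/
sudo chmod a+r /opt/dai/jdbc-jars/postgresql-42.2.5.jar   # must be readable by the dai process user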

Start Driverless AI

This section describes how to enable JDBC when starting Driverless AI in Docker. This can be done by specifying
each environment variable in the nvidia-docker run command or by editing the configuration options in the
config.toml file and then specifying that file in the nvidia-docker run command.


Start DAI Using Environment Variables

This example enables the JDBC connector for PostgreSQL. Note that the JDBC connection strings will vary depending
on the database that is used. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,jdbc" \
-e DRIVERLESS_AI_JDBC_APP_CONFIGS="""{"postgres":
{"url": "jdbc:postgres://localhost:5432/my_database",
"jarpath": "/path/to/postgresql/jdbc/driver.jar",
"classpath": "org.postgresql.Driver"}}""" \
-e DRIVERLESS_AI_JDBC_APP_JVM_ARGS="-Xmx2g" \
-p 12345:12345 \
-v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure JDBC options in the config.toml file, and then specify that file when starting
Driverless AI in Docker.
1. Configure the Driverless AI config.toml file. Set the following configuration options:
enabled_file_systems = "file, upload, jdbc"
jdbc_app_configs = '{JSON string containing configurations}'
jdbc_app_configs = """{"postgres": {"url": "jdbc:postgress://localhost:5432/my_database",
"jarpath": "/path/to/postgresql/jdbc/driver.jar",
"classpath": "org.postgresql.Driver"}}"""

2. Mount the config.toml file and requisite JAR files into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG

Adding Datasets Using JDBC

After the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and
Drop) drop-down menu.


1. Click on the Add Dataset button on the Datasets page.


2. Select JDBC from the list that appears.
3. Click on the Select JDBC Connection button to select a JDBC configuration.
4. The form will populate with the JDBC Database, URL, Driver, and Jar information. Complete the following
remaining fields:
• JDBC Username: Enter your JDBC username.
• JDBC Password: Enter your JDBC password.
• Destination Name: Enter a name for the new dataset.
• (Optional) ID Column Name: Enter a name for the ID column. Specify this field when making
large data queries.
Notes:
• Due to resource sharing within Driverless AI, the JDBC Connector is only allocated a relatively
small amount of memory.
• When making large queries, the ID column is used to partition the data into manageable portions.
This ensures that the maximum memory allocation is not exceeded.
• If a query that is larger than the maximum memory allocation is made without specifying an ID
column, the query will not complete successfully.
5. Write a SQL Query in the format of the database that you want to query. (See the Query Examples section
below.) The format will vary depending on the database that is used.
6. Click the Click to Make Query button to execute the query. The time it takes to complete depends on the size
of the data being queried and the network speeds to the database.
On a successful query, you will be returned to the datasets page, and the queried data will be available as a new dataset.


Query Examples

The following are sample configurations and queries for Oracle DB and PostgreSQL:

Oracle DB

1. Configuration:
jdbc_app_configs = '{"oracledb": {"url": "jdbc:oracle:thin:@localhost:1521/oracledatabase", "jarpath": "/home/ubuntu/jdbc-jars/ojdbc8.jar",
˓→ "classpath": "oracle.jdbc.OracleDriver"}}'

2. Sample Query:
• Select oracledb from the Select JDBC Connection dropdown menu.
• JDBC Username: oracleuser
• JDBC Password: oracleuserpassword
• ID Column Name:
• Query:
SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION

Note: Because this query does not specify an ID Column Name, it will only work for small data. However,
the NEW_ID column can be used as the ID Column if the query is for larger data.
3. Click the Click to Make Query button to execute the query.

PostgreSQL

1. Configuration:
jdbc_app_configs = '{"postgres": {"url": "jdbc:postgresql://localhost:5432/postgresdatabase", "jarpath": "/home/ubuntu/postgres-artifacts/
˓→postgres/Driver.jar", "classpath": "org.postgresql.Driver"}}'

2. Sample Query:
• Select postgres from the Select JDBC Connection dropdown menu.
• JDBC Username: postgres_user
• JDBC Password: pguserpassword
• ID Column Name: id
• Query:
SELECT * FROM loan_level WHERE LOAN_TYPE = 5 (selects all columns from table loan_level with column LOAN_TYPE containing value 5)

3. Click the Click to Make Query button to execute the query.


Adding an Untested JDBC Driver

We encourage you to try out JDBC drivers that are not tested in house.
1. Download the JDBC jar for your database.
2. Move your JDBC jar file to a location that DAI can access.
3. Modify the following config.toml settings. Note that these can also be specified as environment variables when
starting Driverless AI in Docker (an environment-variable sketch follows this list):
# enable the JDBC file system
enabled_file_systems = "upload, file, hdfs, s3, recipe_file, jdbc"

# Configure the JDBC Connector.


# JSON/Dictionary String with multiple keys.
# Format as a single line without using carriage returns (the following example is formatted for readability).
# Use triple quotations to ensure that the text is read as a single string.
# Example:
jdbc_app_configs = """{"my_jdbc_database": {"url": "jdbc:my_jdbc_database://hostname:port/database",
"jarpath": "/path/to/my/jdbc/database.jar",
"classpath": "com.my.jdbc.Driver"}}"""

# optional extra jvm args for jdbc connector


jdbc_app_jvm_args = ""

# optional alternative classpath for jdbc connector


jdbc_app_classpath = ""

4. Save the changes when you are done, then stop/restart Driverless AI.
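If you prefer environment variables over editing config.toml when running in Docker, the equivalent settings follow the DRIVERLESS_AI_ prefix convention used throughout this chapter. A sketch with placeholder values, passed to nvidia-docker run:

-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="upload,file,hdfs,s3,recipe_file,jdbc" \
-e DRIVERLESS_AI_JDBC_APP_CONFIGS='{"my_jdbc_database": {"url": "jdbc:my_jdbc_database://hostname:port/database", "jarpath": "/path/to/my/jdbc/database.jar", "classpath": "com.my.jdbc.Driver"}}' \
-e DRIVERLESS_AI_JDBC_APP_JVM_ARGS="-Xmx4g" \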

11.2 Using Data Connectors with Native Installs

The config.toml file is available in the etc/dai folder after the RPM, DEB, or TAR SH is installed. Before enabling a
connector, be sure to export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

The sections that follow show examples that describe how to specify configuration options in the config.toml file to
enable HDFS, S3, Google Cloud Storage, Google Big Query, Minio, Snowflake, kdb+, Azure Blob Store, BlueData
DataTap, and JDBC data sources.

11.2.1 S3 Setup

Driverless AI allows you to explore S3 data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with S3.

Description of Configuration Attributes

• aws_access_key_id: The S3 access key ID
• aws_secret_access_key: The S3 secret access key
• aws_role_arn: The Amazon Resource Name (ARN) of the role to assume
• aws_default_region: The region to use when the aws_s3_endpoint_url option is not set. This is ignored
when aws_s3_endpoint_url is set.
• aws_s3_endpoint_url: The endpoint URL that will be used to access S3.
• aws_use_ec2_role_credentials: If set to true, the S3 Connector will try to obtain credentials associated
with the role attached to the EC2 instance.


• s3_init_path: The starting S3 path that will be displayed in the UI S3 browser.

S3 with No Authentication

This example enables the S3 data connector and disables authentication. It does not pass any S3 access key or secret;
however, it configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference
data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, s3"

3. Save the changes when you are done, then stop/restart Driverless AI.

S3 with Authentication

This example enables the S3 data connector with authentication by passing an S3 access key ID and an access key.
It also configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference data
stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, s3"

# S3 Connector credentials
aws_access_key_id = "<access_key_id>"
aws_secret_access_key = "<access_key>"

3. Save the changes when you are done, then stop/restart Driverless AI.


11.2.2 HDFS Setup

This section provides instructions for configuring Driverless AI to work with HDFS.

Supported Hadoop Platforms

• CDH 5.4
• CDH 5.5
• CDH 5.6
• CDH 5.7
• CDH 5.8
• CDH 5.9
• CDH 5.10
• CDH 5.13
• CDH 5.14
• CDH 5.15
• CDH 5.16
• CDH 6.0
• CDH 6.1
• CDH 6.2
• CDH 6.3
• HDP 2.2
• HDP 2.3
• HDP 2.4
• HDP 2.5
• HDP 2.6
• HDP 3.0
• HDP 3.1

Description of Configuration Attributes

• hdfs_config_path: The location of the HDFS config folder. This folder can contain multiple config
files.
• hdfs_auth_type: Selects HDFS authentication. Available values are:
– principal: Authenticate with HDFS with a principal user.
– keytab: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos
keytab needs to be owned by the DAI user.
– keytabimpersonation: Login with impersonation using a keytab.
– noauth: No authentication needed.


• key_tab_path: The path of the principal key tab file. For use when hdfs_auth_type=principal.
• hdfs_app_principal_user: The Kerberos application principal user.
• hdfs_app_login_user: The user ID of the current user (for example, user@realm).
• hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.
– -Djava.security.krb5.conf
– -Dsun.security.krb5.debug
– -Dlog4j.configuration
• hdfs_app_classpath: The HDFS classpath.

HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication in the config.toml file. This allows
users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/
datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file. Note that the procsy port, which defaults to
12347, also has to be changed.
# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support


# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, hdfs"

3. Save the changes when you are done, then stop/restart Driverless AI.

HDFS with Keytab-Based Authentication

This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the option hdfs_app_principal_user to reference a user for whom the keytab was created
(usually in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# IP address and port of procsy process.


procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support


# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, hdfs"

# HDFS connector
# Auth type can be Principal/keytab/keytabPrincipal
# Specify HDFS Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with HDFS with a principal user
# keytab : Authenticate with a Key tab (recommended)
# keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"

# Path of the principal key tab file


key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)


hdfs_app_principal_user = "<user@kerberosrealm>"

3. Save the changes when you are done, then stop/restart Driverless AI.

HDFS with Keytab-Based Impersonation

Notes:
• If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
• If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
• Logins are case sensitive when keytab-based impersonation is configured.
The example:
• Sets the authentication type to keytabimpersonation.
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was
created (usually in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support


# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, hdfs"



# HDFS connector
# Auth type can be Principal/keytab/keytabPrincipal
# Specify HDFS Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with HDFS with a principal user
# keytab : Authenticate with a Key tab (recommended)
# keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytabimpersonation"

# Path of the principal key tab file


key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)


hdfs_app_principal_user = "<user@kerberosrealm>"

3. Save the changes when you are done, then stop/restart Driverless AI.

11.2.3 Azure Blob Store Setup

Driverless AI allows you to explore Azure Blob Store data sources from within the Driverless AI application. This
section describes how to enable the Azure Blob Store data connector in native install environments.

Description of Configuration Attributes

azure_blob_account_name: The Microsoft Azure Storage account name. This should be the dns prefix created
when the account was created (for example, “mystorage”).
azure_blob_account_key: Specify the account key that maps to your account name.
azure_connection_string: Optionally specify a new connection string. With this option, you can include an
override for a host, port, and/or account name. For example,
azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=<account_name>;AccountKey=<account_key>;BlobEndpoint=http://<host>:<port>/<account_name>;"

Azure Blob Store Example

This example enables the Azure Blob Store data connector. This allows users to reference data stored on your Azure
storage account using the account name, for example: https://mystorage.blob.core.windows.net.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, azrbs"

# Azure Blob Store Connector credentials


azure_blob_account_name = "mystorage"
azure_blob_account_key = "<account_key>"

3. Save the changes when you are done, then stop/restart Driverless AI.


11.2.4 BlueData DataTap Setup

This section provides instructions for configuring Driverless AI to work with BlueData DataTap.

Description of Configuration Attributes

• dtap_auth_type: Selects DTAP authentication. Available values are:


– noauth: No authentication needed
– principal: Authenticate with DataTap with a principal user
– keytab: Authenticate with a Key tab (recommended). If running Driverless AI as a service, then the
Kerberos keytab needs to be owned by the Driverless AI user.
– keytabimpersonation: Login with impersonation using a keytab
• dtap_config_path: The location of the DTAP (HDFS) config folder path. This folder can contain multiple
config files. Note: The DTAP config file core-site.xml needs to contain DTap FS configuration, for example:
<configuration>
<property>
<name>fs.dtap.impl</name>
<value>com.bluedata.hadoop.bdfs.Bdfs</value>
<description>The FileSystem for BlueData dtap: URIs.</description>
</property>
</configuration>

• dtap_key_tab_path: The path of the principal key tab file. For use when
dtap_auth_type=principal.
• dtap_app_principal_user: The Kerberos app principal user (recommended).
• dtap_app_login_user: The user ID of the current user (for example, user@realm).
• dtap_app_jvm_args: JVM args for DTap distributions. Separate each argument with spaces.
• dtap_app_classpath: The DTap classpath.

DataTap with No Authentication

This example enables the DataTap data connector and disables authentication in the config.toml file. This allows users
to reference data stored in DataTap directly using the name node address, for example: dtap://name.node/
datasets/iris.csv or dtap://name.node/datasets/. (Note: The trailing slash is currently required
for directories.)
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, dtap"


3. Save the changes when you are done, then stop/restart Driverless AI.

DataTap with Keytab-Based Authentication

This example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the option dtap_app_principal_user to reference a user for whom the keytab was created
(usually in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, dtap"

# Blue Data DTap connector settings are similar to HDFS connector settings.
#
# Specify DTap Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with DTap with a principal user
# keytab : Authenticate with a Key tab (recommended). If running
# DAI as a service, then the Kerberos keytab needs to
# be owned by the DAI user.
# keytabimpersonation : Login with impersonation using a keytab
dtap_auth_type = "keytab"

# Path of the principal key tab file


dtap_key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)


dtap_app_principal_user = "<user@kerberosrealm>"

3. Save the changes when you are done, then stop/restart Driverless AI.

DataTap with Keytab-Based Impersonation

The example:
• Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
• Configures the dtap_app_principal_user variable, which references a user for whom the keytab was
created (usually in the form of user@realm).
• Configures the dtap_app_login_user variable, which references a user who is being impersonated (usu-
ally in the form of user@realm).
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support


# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, dtap"

# Blue Data DTap connector settings are similar to HDFS connector settings.
#
# Specify DTap Auth Type, allowed options are:
# noauth : No authentication needed
# principal : Authenticate with DTap with a principal user
# keytab : Authenticate with a Key tab (recommended). If running
# DAI as a service, then the Kerberos keytab needs to
# be owned by the DAI user.
# keytabimpersonation : Login with impersonation using a keytab
dtap_auth_type = "keytabimpersonation"

# Path of the principal key tab file


dtap_key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)


dtap_app_principal_user = "<user@kerberosrealm>"

# Specify the user id of the current user here as user@realm


dtap_app_login_user = "<user@realm>"

3. Save the changes when you are done, then stop/restart Driverless AI.

11.2.5 Google Big Query

Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This
section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to
enable authentication. If you enable GCS or GBQ connectors, those file systems will be available in the UI, but you
will not be able to use those connectors without authentication.
In order to enable the GBQ data connector with authentication, you must:
1. Obtain a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json configuration option.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.

Google BigQuery with Authentication

This example enables the GBQ data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google BigQuery authentications.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below



# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, gbq"

# GCS Connector credentials


# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"

3. Save the changes when you are done, then stop/restart Driverless AI.
After Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or
Drag and Drop) drop-down menu.

Specify the following information to add your dataset.


1. Enter BQ Dataset ID with write access to create temporary table: Enter a dataset ID in Google BigQuery
that this user has read/write access to. BigQuery uses this dataset as the location for the new table generated by
the query.
Note: Driverless AI’s connection to GBQ will inherit the top-level directory from the service JSON file.
So if a dataset named “my-dataset” is in a top-level directory named “dai-gbq”, then the value for the
dataset ID input field would be “my-dataset” and not “dai-gbq:my-dataset”.
2. Enter Google Storage destination bucket: Specify the name of Google Cloud Storage destination bucket.
Note that the user must have write access to this bucket.
3. Enter Name for Dataset to be saved as: Specify a name for the dataset, for example, my_file.
4. Enter BigQuery Query (Use StandardSQL): Enter a StandardSQL query that you want BigQuery to execute.
For example: SELECT * FROM <my_dataset>.<my_table>.
5. When you are finished, select the Click to Make Query button to add the dataset.
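For example, assuming a BigQuery dataset named my_dataset that contains a table named customers (both names are illustrative), the query field might contain:

SELECT name, total_purchases
FROM my_dataset.customers
WHERE total_purchases > 10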


11.2.6 Google Cloud Storage Setup

This section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup
requires you to enable authentication. If you enable GCS or GBQ connectors, those file systems will be available in
the UI, but you will not be able to use those connectors without authentication.
In order to enable the GCS data connector with authentication, you must:
1. Obtain a JSON authentication file from GCP.
2. Mount the JSON file to the Docker instance.
3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json configuration option.
Note: The account JSON includes authentications as provided by the system administrator. You can be provided a
JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or
none at all.

GCS with Authentication

This example enables the GCS data connector with authentication by passing the JSON authentication file. This
assumes that the JSON file contains Google Cloud Storage authentications.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, gcs"


# GCS Connector credentials


# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"

3. Save the changes when you are done, then stop/restart Driverless AI.

11.2.7 kdb+ Setup

Driverless AI allows you to explore kdb+ data sources from within the Driverless AI application. This section provides
instructions for configuring Driverless AI to work with kdb+.

kdb+ with No Authentication

This example enables the kdb+ connector without authentication. The only required flags are the hostname and the
port.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, kdb"

# KDB Connector credentials


kdb_hostname = "<ip_or_host_of_kdb_server>"
kdb_port = "<kdb_server_port>"

3. Save the changes when you are done, then stop/restart Driverless AI.

kdb+ with Authentication Example

This example provides users credentials for accessing a kdb+ server from Driverless AI.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)



# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, kdb"

# kdb+ Connector credentials


kdb_user = "<username>"
kdb_password = "<password>"
kdb_hostname = "<ip_or_host_of_kdb_server>"
kdb_port = "<kdb_server_port>"
kdb_app_classpath = ""
kdb_app_jvm_args = ""

3. Save the changes when you are done, then stop/restart Driverless AI.
After the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and
Drop) drop-down menu.

Specify the following information to add your dataset.


1. Enter filepath to save query: Enter the local file path for storing your dataset. For example,
/home/<user>/myfile.csv. Note that this can only be a CSV file.
2. Enter KDB Query: Enter a kdb+ query that you want to execute. Note that the connector will accept any q
queries. For example: select from <mytable> or <mytable> lj <myothertable>
3. When you are finished, select the Click to Make Query button to add the dataset.
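For example, assuming a kdb+ table named trades with sym and price columns (names are illustrative), you might save the result to /home/<user>/trades.csv and enter a query such as:

select avg price by sym from trades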


11.2.8 Minio Setup

This section provides instructions for configuring Driverless AI to work with Minio. Note that unlike S3, authentication
must also be configured when the Minio data connector is specified.

Minio with Authentication

This example enables the Minio data connector with authentication by passing an endpoint URL, access key
ID, and an access key. It also configures Docker DNS by passing the name and IP of the Minio endpoint. This allows
users to reference data stored in Minio directly using the endpoint URL, for example:
http://<endpoint_url>/<bucket>/datasets/iris.csv.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, minio"

# Minio Connector credentials


minio_endpoint_url = "<endpoint_url>"
minio_access_key_id = "<access_key_id>"
minio_secret_access_key = "<access_key>"

3. Save the changes when you are done, then stop/restart Driverless AI.


11.2.9 Snowflake

Driverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section
provides instructions for configuring Driverless AI to work with Snowflake. This setup requires you to enable
authentication. If you enable the Snowflake connector, it will be available in the UI, but you will not be able to use it
without authentication.

Snowflake with Authentication

This example enables the Snowflake data connector with authentication by passing the account, user, and
password variables.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Specify the following configuration options in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "file, snow"

# Snowflake Connector credentials


snowflake_account = "<account_id>"
snowflake_user = "<username>"
snowflake_password = "<password>"

3. Save the changes when you are done, then stop/restart Driverless AI.
After the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or
Drag and Drop) drop-down menu.


Specify the following information to add your dataset.


1. Enter Output Filename: Specify the name of the file on your local system that you want to add to Driverless
AI. Note that this can only be a CSV file (for example, myfile.csv).
2. Enter Database: Specify the name of the Snowflake database that you are querying.
3. Enter Warehouse: Specify the name of the Snowflake warehouse that you are querying.
4. Enter Schema: Specify the schema of the dataset that you are querying.
5. Enter Region: (Optional) Specify the region of the warehouse that you are querying. This can be found in
the Snowflake-provided URL to access your database (as in <optional-deployment-name>.<region>.<cloud-provider>.snowflakecomputing.com).
6. Enter Role: (Optional) Specify your role as designated within Snowflake. See https://docs.snowflake.net/manuals/user-guide/security-access-control-overview.html for more information.
7. Enter File Formatting Params: (Optional) Specify any additional parameters for formatting your datasets.
Available parameters are listed in https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#optional-parameters. (Note: Use only parameters for TYPE = CSV.) For example, if your
dataset includes a text column that contains commas, you can specify a different delimiter using
FIELD_DELIMITER='character'. Separate multiple parameters with spaces only. For example:
FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY=""

8. Enter Snowflake Query: Specify the Snowflake query that you want to execute.
9. When you are finished, select the Click to Make Query button to add the dataset.
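For example, assuming a database MY_DB, a warehouse MY_WH, a schema PUBLIC, and a table named TRANSACTIONS (all names are illustrative), the query field might contain:

SELECT * FROM TRANSACTIONS WHERE AMOUNT > 0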


11.2.10 JDBC

Driverless AI allows you to explore Java Database Connectivity (JDBC) data sources from within the Driverless AI
application. This section provides instructions for configuring Driverless AI to work with JDBC.

Tested Databases

The following databases have been tested for minimal functionality. Note that JDBC drivers that are not included in
this list should still work with Driverless AI. We recommend that you test your JDBC driver even if you do not see it
on the list of tested databases. See the Adding an Untested JDBC Driver section at the end of this chapter for
information on how to try out an untested JDBC driver.
• Oracle DB
• PostgreSQL
• Amazon Redshift
• Teradata


Retrieve the JDBC Driver

1. Download JDBC Driver JAR files:


• Oracle DB
• PostgreSQL
• Amazon Redshift
• Teradata
Note: Remember to take note of the driver classpath, as it is needed for the configuration steps (for
example, org.postgresql.Driver). One way to find it is shown after this list.
2. Copy the driver JAR to a location that is visible to Driverless AI.
Note: The folder storing the JDBC jar file must be visible/readable by the dai process user.
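If you are unsure of the driver class name, one quick way to check is to list the contents of the downloaded JAR and look for the Driver class (the JAR path below is illustrative):

jar tf /path/to/postgresql-jdbc-driver.jar | grep Driver
# prints entries such as org/postgresql/Driver.class, so the classpath is org.postgresql.Driver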

Start Driverless AI

This example enables the JDBC connector for PostgreSQL.


Notes:
• The JDBC connection strings will vary depending on the database that is used.
• The configuration requires a JSON key (typically the name of the database being configured) to
be associated with a nested JSON that contains the url, jarpath, and classpath fields. In
addition, this should take the format:
"""{"my_jdbc_database": {"url": "jdbc:my_jdbc_database://hostname:port/database",
"jarpath": "/path/to/my/jdbc/database.jar", "classpath": "com.my.jdbc.Driver"}}"""

1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:


# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Edit the following values in the config.toml file.


# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
enabled_file_systems = "upload, file, hdfs, jdbc"

# Configuration for JDBC Connector.


# JSON/Dictionary String with multiple keys.
# Format as a single line without using carriage returns (the following example is formatted for readability).
# Use triple quotations to ensure that the text is read as a single string.
# Example:
# """{
# "postgres": {
# "url": "jdbc:postgresql://ip address:port/postgres",
# "jarpath": "/path/to/postgres_driver.jar",
# "classpath": "org.postgresql.Driver"
# },
# "mysql": {
# "url":"mysql connection string",
# "jarpath": "/path/to/mysql_driver.jar",
# "classpath": "my.sql.classpath.Driver"
# }
# }"""
jdbc_app_configs = """{"postgres": {"url": "jdbc:postgresql://localhost:5432/my_database",
"jarpath": "/path/to/postgresql/jdbc/driver.jar",



"classpath": "org.postgresql.Driver"}}"""

# extra jvm args for jdbc connector


jdbc_app_jvm_args = ""

# alternative classpath for jdbc connector


jdbc_app_classpath = ""

3. Save the changes when you are done, then stop/restart Driverless AI.

Adding Datasets Using JDBC

After the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and
Drop) drop-down menu.

1. Click on the Add Dataset button on the Datasets page.


2. Select JDBC from the list that appears.
3. Click on the Select JDBC Connection button to select a JDBC configuration.
4. The form will populate with the JDBC Database, URL, Driver, and Jar information. Complete the following
remaining fields:
• JDBC Username: Enter your JDBC username.
• JDBC Password: Enter your JDBC password.
• Destination Name: Enter a name for the new dataset.
• (Optional) ID Column Name: Enter a name for the ID column. Specify this field when making
large data queries.
Notes:


• Due to resource sharing within Driverless AI, the JDBC Connector is only allocated a relatively
small amount of memory.
• When making large queries, the ID column is used to partition the data into manageable portions.
This ensures that the maximum memory allocation is not exceeded.
• If a query that is larger than the maximum memory allocation is made without specifying an ID
column, the query will not complete successfully.
5. Write a SQL Query in the format of the database that you want to query. (See the Query Examples section
below.) The format will vary depending on the database that is used.
6. Click the Click to Make Query button to execute the query. The time it takes to complete depends on the size
of the data being queried and the network speeds to the database.
On a successful query, you will be returned to the datasets page, and the queried data will be available as a new dataset.

Query Examples

The following are sample configurations and queries for Oracle DB and PostgreSQL:

Oracle DB

1. Configuration:
jdbc_app_configs = '{"oracledb": {"url": "jdbc:oracle:thin:@localhost:1521/oracledatabase", "jarpath": "/home/ubuntu/jdbc-jars/ojdbc8.jar", "classpath": "oracle.jdbc.OracleDriver"}}'

2. Sample Query:
• Select oracledb from the Select JDBC Connection dropdown menu.
• JDBC Username: oracleuser
• JDBC Password: oracleuserpassword
• ID Column Name:
• Query:
SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION

Note: Because this query does not specify an ID Column Name, it will only work for small data. However,
the NEW_ID column can be used as the ID Column if the query is for larger data.
3. Click the Click to Make Query button to execute the query.

PostgreSQL

1. Configuration:
jdbc_app_configs = '{"postgres": {"url": "jdbc:postgresql://localhost:5432/postgresdatabase", "jarpath": "/home/ubuntu/postgres-artifacts/postgres/Driver.jar", "classpath": "org.postgresql.Driver"}}'

2. Sample Query:
• Select postgres from the Select JDBC Connection dropdown menu.
• JDBC Username: postgres_user
• JDBC Password: pguserpassword


• ID Column Name: id
• Query:
SELECT * FROM loan_level WHERE LOAN_TYPE = 5 (selects all columns from table loan_level with column LOAN_TYPE containing value 5)

3. Click the Click to Make Query button to execute the query.

Adding an Untested JDBC Driver

We encourage you to try out JDBC drivers that are not tested in house.
1. Download the JDBC jar for your database.
2. Move your JDBC jar file to a location that DAI can access.
3. Modify the following config.toml settings. Note that these can also be specified as environment variables when
starting Driverless AI in Docker:
# enable the JDBC file system
enabled_file_systems = "upload, file, hdfs, s3, recipe_file, jdbc"

# Configure the JDBC Connector.


# JSON/Dictionary String with multiple keys.
# Format as a single line without using carriage returns (the following example is formatted for readability).
# Use triple quotations to ensure that the text is read as a single string.
# Example:
jdbc_app_configs = """{"my_jdbc_database": {"url": "jdbc:my_jdbc_database://hostname:port/database",
"jarpath": "/path/to/my/jdbc/database.jar",
"classpath": "com.my.jdbc.Driver"}}"""

# optional extra jvm args for jdbc connector


jdbc_app_jvm_args = ""

# optional alternative classpath for jdbc connector


jdbc_app_classpath = ""

4. Save the changes when you are done, then stop/restart Driverless AI.



CHAPTER TWELVE

CONFIGURING AUTHENTICATION

Driverless AI supports Client Certificate, LDAP, Local, mTLS, OpenID, none, and unvalidated (default) authentication.
These can be configured by specifying the environment variables when starting the Driverless AI Docker image
or by specifying the appropriate configuration options in the config.toml file.
Notes:
• Driverless AI is also integrated with IBM Spectrum Conductor and supports authentication from Conductor.
Contact [email protected] for more information about using IBM Spectrum Conductor authentication.
• Driverless AI does not support LDAP client auth. If you have LDAP client auth enabled, then the Driverless AI
LDAP connector will not work.

12.1 Client Certificate Authentication Example

This section describes how to configure client certificate authentication in Driverless AI.

12.1.1 Client Certificate and SSL Configuration Options

The following options can be specified when configuring client certificate authentication.

SSL Configuration Options

Mutual TLS authentication (mTLS) must be enabled in order to enable Client Certificate Authentication. Use the
following configuration options to configure mTLS. Refer to the mTLS Authentication topic for more information on
how to enable mTLS.
• ssl_client_verify_mode: Sets the client verification mode. Choose from the following verification
modes:
• CERT_NONE: The client will not need to provide a certificate. If it does provide a certificate, any resulting
verification errors are ignored.
• CERT_OPTIONAL: The client does not need to provide a certificate. If it does provide a certificate, it is verified
against the configured CA chains.
• CERT_REQUIRED: The client needs to provide a certificate for verification. Note that you will need to configure
the ssl_client_key_file and ssl_client_crt_file options when this mode is selected in order
for Driverless AI to be able to verify its own callback requests.
• ssl_ca_file: Specifies the path to the certification authority (CA) certificate file. This certificate will be
used to verify the client certificate when client authentication is enabled. If this is not specified, clients are
verified using the default system certificates.


• ssl_client_key_file: Required if ssl_client_verify_mode = "CERT_REQUIRED". Specifies the HTTPS settings path to the private key that Driverless AI uses to authenticate itself.
• ssl_client_crt_file: Required if ssl_client_verify_mode = "CERT_REQUIRED". Specifies the HTTPS settings path to the client certificate that Driverless AI will use to authenticate itself.

Client Certificate Options

• auth_tls_crl_file: The path to the certificate revocation list (CRL) file that is used to verify the client
certificate.
• auth_tls_subject_field: The subject field that is used as a source for a username or other values that
provide further validation.
• auth_tls_field_parse_regexp: The regular expression that is used to parse the subject field in order
to obtain the username or other values that provide further validation.
• auth_tls_user_lookup: Specifies how a user’s identity is obtained. Choose from the following:
– REGEXP_ONLY: Uses auth_tls_subject_field and auth_tls_field_parse_regexp to
extract the username from the client certificate (see the sketch after this list).
– LDAP_LOOKUP: Uses the LDAP server to obtain the username. (Refer to the LDAP Authentication Ex-
ample section for information about additional LDAP Authentication configuration options.)
• auth_tls_ldap_authorization_lookup_filter: (Optional) Specifies an additional search filter
that is performed after the user is found. For example, this can be used to check whether that user is a member
of a particular group.
• auth_tls_ldap_authorization_search_base: Specifies the base DN to start the authorization
lookup from. Used when the above option is specified.
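
For reference, a minimal config.toml sketch of the REGEXP_ONLY lookup variant might look like the following (the regular expression and CRL path are illustrative; the LDAP_LOOKUP variant is shown in full in the native install example later in this section):

# Enable client certificate authentication and extract the username
# directly from the certificate subject (no LDAP lookup)
authentication_method = "tls_certificate"
auth_tls_subject_field = "CN"
auth_tls_field_parse_regexp = "(?P<username>.*)"
auth_tls_user_lookup = "REGEXP_ONLY"
auth_tls_crl_file = "/etc/pki/crl.pem"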

12.1.2 Enabling Client Certificate Authentication in Docker Images

To enable Client Certificate authentication in Docker images, specify the authentication environment variable that
you want to use. Each variable must be prepended with DRIVERLESS_AI_. The example below enables Client
Certification authentication and uses LDAP_LOOKUP for the TLS user lookup method. Replace TAG below with the
image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_ENABLE_HTTPS="true" \
-e DRIVERLESS_AI_SSL_KEY_FILE="/etc/pki/dai-server.key" \
-e DRIVERLESS_AI_SSL_CRT_FILE="/etc/pki/dai-server.crt" \
-e DRIVERLESS_AI_SSL_CA_FILE="/etc/pki/ca.crt" \
-e DRIVERLESS_AI_SSL_CLIENT_VERIFY_MODE="CERT_REQUIRED" \
-e DRIVERLESS_AI_SSL_CLIENT_KEY_FILE="/etc/pki/dai-self.key" \
-e DRIVERLESS_AI_SSL_CLIENT_CRT_FILE="/etc/pki/dai-self.cert" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="tls_certificate" \
-e DRIVERLESS_AI_AUTH_TLS_SUBJECT_FIELD="CN" \
-e DRIVERLESS_AI_AUTH_TLS_CRL_FILE="/etc/pki/crl.pem" \
-e DRIVERLESS_AI_AUTH_TLS_FIELD_PARSE_REGEXP="(?P<di>.*)" \
-e DRIVERLESS_AI_AUTH_TLS_USER_LOOKUP="LDAP_LOOKUP" \
-e DRIVERLESS_AI_LDAP_SERVER="ldap.forumsys.com" \
-e DRIVERLESS_AI_LDAP_BIND_DN="cn=read-only-admin,dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_PASSWORD="password" \
-e DRIVERLESS_AI_LDAP_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_USER_NAME_ATTRIBUTE="uid" \
-e DRIVERLESS_AI_LDAP_SEARCH_FILTER="(&(objectClass=inetOrgPerson)(uid={{id}}))" \
-e DRIVERLESS_AI_AUTH_TLS_LDAP_AUTHORIZATION_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_AUTH_TLS_LDAP_AUTHORIZATION_LOOKUP_FILTER="(&(objectClass=groupOfUniqueNames)(uniqueMember=uid={{uid}},dc=example,dc=com)(ou=chemists))" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG


12.1.3 Enabling Client Certificate Authentication in the config.toml File for Native
Installs

Native installs include DEBs, RPMs, and TAR SH installs. The example below shows how to edit the config.toml file
to enable Client Certification authentication and uses the LDAP_LOOKUP for the TLS user lookup method.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Open the config.toml file and edit the following authentication variables. The config.toml file is available in the
etc/dai folder after Driverless AI is installed.
# https settings
enable_https = true

# https settings
# Path to the SSL key file
#
ssl_key_file = "/etc/pki/dai-server.key"

# https settings
# Path to the SSL certificate file
#
ssl_crt_file = "/etc/pki/dai-server.crt"

# https settings
# Path to the Certification Authority certificate file. This certificate will be
# used to verify the client certificate when client authentication is turned on.
# If this is not set, clients are verified using default system certificates.
#
ssl_ca_file = "/etc/pki/ca.crt"

# https settings
# Sets the client verification mode.
# CERT_NONE: Client does not need to provide the certificate and if it does any
# verification errors are ignored.
# CERT_OPTIONAL: Client does not need to provide the certificate and if it does
# certificate is verified against the configured CA chains.
# CERT_REQUIRED: Client needs to provide a certificate and certificate is
# verified.
# You'll need to set 'ssl_client_key_file' and 'ssl_client_crt_file'
# When this mode is selected for Driverless to be able to verify
# its own callback requests.
#
ssl_client_verify_mode = "CERT_REQUIRED"

# https settings
# Path to the private key that Driverless will use to authenticate itself when
# CERT_REQUIRED mode is set.
#
ssl_client_key_file = "/etc/pki/dai-self.key"

# https settings
# Path to the client certificate that Driverless will use to authenticate itself
# when CERT_REQUIRED mode is set.
#
ssl_client_crt_file = "/etc/pki/dai-self.crt"

# Enable client certificate authentication


authentication_method = "tls_certificate"

# Subject field that is used as a source for a username or other values that provide further validation
auth_tls_subject_field = "CN"

# Path to the CRL file that will be used to verify client certificate.
auth_tls_crl_file = "/etc/pki/crl.pem"

# Sets up the way the user identity is obtained


# REGEXP_ONLY: Will use 'auth_tls_subject_field' and 'auth_tls_field_parse_regexp'
# to extract the username from the client certificate.
# LDAP_LOOKUP: Will use LDAP server to lookup for the username.
# 'ldap_server', 'ldap_use_ssl', 'ldap_tls_file', 'ldap_bind_dn',
# 'ldap_bind_password' options are used to establish
# the connection with the LDAP server.
# 'auth_tls_subject_field' and 'auth_tls_field_parse_regexp'
# options are used to parse the certificate.
# 'ldap_search_base', 'ldap_search_filter', and
# 'ldap_username_attribute' options are used to do the lookup.
# 'ldap_search_filter' can be built dynamically using the named
# capturing groups from the 'auth_tls_field_parse_regexp' for
# substitution.
# Example:
# auth_tls_field_parse_regexp = "\w+ (?P<id>\d+)"
# ldap_search_filter = "(&(objectClass=person)(id={{id}}))"
auth_tls_user_lookup = "LDAP_LOOKUP"

# Regular expression that is used to parse the subject field in order to


# obtain the username or other values that provide further validation
auth_tls_field_parse_regexp = "\w+ (?P<id>\d+)"

# ldap server domain or ip



ldap_server = "ldap.forumsys.com"

# Complete DN of the LDAP bind user


ldap_bind_dn = "cn=read-only-admin,dc=example,dc=com"

# Password for the LDAP bind


ldap_bind_password = "password"

# the location in the DIT where the search will start


ldap_search_base = "dc=example,dc=com"

# specify key to find user name


ldap_user_name_attribute = "uid"

# A string that describes what you are searching for. You can use Python
# substitution to have this constructed dynamically.
# (only {{DAI_USERNAME}} is supported)
ldap_search_filter = "(&(objectClass=inetOrgPerson)(uid={{id}}))"

# Base DN where to start the Authorization lookup. Used when


# 'auth_tls_ldap_authorization_lookup_filter' is set.
auth_tls_ldap_authorization_search_base="dc=example,dc=com"

# Sets optional additional lookup filter that is performed after the


# user is found. This can be used, for example, to check whether the user is a member of
# a particular group.
# Filter can be built dynamically from the attributes returned by the lookup.
# Authorization fails when the search does not return any entry. If one or more
# entries are returned, authorization succeeds.
# Example:
# auth_tls_field_parse_regexp = "\w+ (?P<id>\d+)"
# ldap_search_filter = "(&(objectClass=person)(id={{id}}))"
# auth_tls_ldap_authorization_lookup_filter = "(&(objectClass=group)(member=uid={{uid}},dc=example,dc=com))"
# If this option is empty no additional lookup is done and just a successful user
# lookup is enough to authorize the user.
#
auth_tls_ldap_authorization_lookup_filter = "(&(objectClass=groupOfUniqueNames)(uniqueMember=uid={{uid}},dc=example,dc=com)(ou=chemists))"

3. Start (or restart) Driverless AI.

12.2 LDAP Authentication Example

This section describes how to enable Lightweight Directory Access Protocol (LDAP) authentication in Driverless AI. The available parameters
can be specified as environment variables when starting the Driverless AI Docker image, or they can be set via the
config.toml file for native installs. Upon completion, all the users in the configured LDAP should be able to log in to
Driverless AI and run experiments, visualize datasets, interpret models, etc.
Note: Driverless AI does not support LDAP client auth. If you have LDAP client auth enabled, then the Driverless AI
LDAP connector will not work.

12.2.1 Description of Configuration Attributes

The following options can be specified when enabling LDAP authentication.


• ldap_server: The LDAP server domain or IP
• ldap_port: The LDAP server port
• ldap_bind_dn: The complete DN of the LDAP bind user
• ldap_bind_password: The password for the LDAP bind
• ldap_tls_file: The Transport Layer Security (TLS) certificate file location
• ldap_use_ssl: Whether to enable (TRUE) or disable (FALSE) SSL
• ldap_search_base: The location in the Directory Information Tree (DIT) where the search will start
• ldap_search_filter: A string that describes what you are searching for. You can use Python substitution
to have this constructed dynamically. (Only {{DAI_USERNAME}} is supported. For example, "(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))".)
• ldap_search_attributes: LDAP attributes to return from search


• ldap_user_name_attribute: Specify the key to find the user name (for example, "uid")

12.2.2 LDAP without SSL

The following examples describe how to enable LDAP without SSL when running Driverless AI in the Docker image
or through native installs.

Setting Environment Variables in Docker Images

The following example shows how to configure LDAP without SSL when starting the Driverless AI Docker image.
Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="ldap" \
-e DRIVERLESS_AI_LDAP_USE_SSL="false" \
-e DRIVERLESS_AI_LDAP_SERVER="ldap.forumsys.com" \
-e DRIVERLESS_AI_LDAP_PORT="389" \
-e DRIVERLESS_AI_LDAP_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_DN="cn=read-only-admin,dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_PASSWORD=password \
-e DRIVERLESS_AI_LDAP_SEARCH_FILTER="(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))" \
-e DRIVERLESS_AI_LDAP_USER_NAME_ATTRIBUTE="uid" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

Using the config.toml file with Native Installs

The following example shows how to configure LDAP without SSL when starting Driverless AI from a native install.
Native installs include DEBs, RPMs, and TAR SH installs.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Enable LDAP authentication without SSL.


# Enable LDAP authentication
authentication_method = "ldap"

# Specify the LDAP server domain or IP to connect to


ldap_server = "ldap.forumsys.com"

# Specify the LDAP port to connect to


ldap_port = "389"

# Disable SSL
ldap_use_ssl="false"

# Specify the location in the DIT where the search will start
ldap_search_base = "dc=example,dc=com"

# Specify the LDAP search filter


# This is a string that describes what you are searching for. You
# can use Python substitution to have this constructed dynamically.
# (Only {{DAI_USERNAME}} is supported. For example, "(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))".)
ldap_search_filter = "(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))"

# Specify the complete DN of the LDAP bind user


ldap_bind_dn = "cn=read-only-admin,dc=example,dc=com"

# Specify the LDAP password for the above user


ldap_bind_password = "password"

# Specify a key to find the user name


ldap_user_name_attribute = "uid"


3. Start (or restart) Driverless AI.


Users can now launch Driverless AI using their LDAP credentials. If authentication is successful, the user can access
Driverless AI and run experiments, visualize datasets, interpret models, etc.

12.2.3 LDAP with SSL

These examples show how to enable LDAP authentication with SSL and additional parameters that can be specified
as environment variables when starting the Driverless AI Docker image, or they can be set via the config.toml file for
native installs. Upon completion, all the users in the configured LDAP should be able to log in to Driverless AI and
run experiments, visualize datasets, interpret models, etc.

Setting Environment Variables in Docker Images

Specify the following LDAP environment variables when starting the Driverless AI Docker image. This example
enables LDAP authentication with SSL and shows how to specify the additional SSL-related options. Replace
TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="ldap" \
-e DRIVERLESS_AI_LDAP_SERVER="ldap.forumsys.com" \
-e DRIVERLESS_AI_LDAP_PORT="389" \
-e DRIVERLESS_AI_LDAP_SEARCH_BASE="dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_SEARCH_FILTER="(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))" \
-e DRIVERLESS_AI_LDAP_USE_SSL="true" \
-e DRIVERLESS_AI_LDAP_TLS_FILE="/tmp/abc-def-root.cer" \
-e DRIVERLESS_AI_LDAP_BIND_DN="cn=read-only-admin,dc=example,dc=com" \
-e DRIVERLESS_AI_LDAP_BIND_PASSWORD="password" \
-e DRIVERLESS_AI_LDAP_USER_NAME_ATTRIBUTE="uid" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

Upon successful completion, all the users in the configured LDAP should be able to log in to Driverless AI and run
experiments, visualize datasets, interpret models, etc.

Using the config.toml file with Native Installs

Native installs include DEBs, RPMs, and TAR SH installs.


1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Enable LDAP authentication with SSL.

# Enable LDAP authentication
authentication_method = "ldap"

# Specify the LDAP server domain or IP to connect to
ldap_server = "ldap.forumsys.com"

# Specify the LDAP port to connect to
ldap_port = "389"

# Specify the location in the DIT where the search will start
ldap_search_base = "dc=example,dc=com"

# Specify the LDAP search filter. This is a string that describes what you are
# searching for. You can use Python substitution to have this constructed dynamically.
# (Only {{DAI_USERNAME}} is supported.)
ldap_search_filter = "(&(objectClass=person)(cn:dn:={{DAI_USERNAME}}))"

# If the connection to the LDAP server needs an SSL certificate,
# then this needs to be specified
ldap_use_ssl = "true"

# Specify the LDAP TLS file location if SSL is set to true
ldap_tls_file = "/tmp/abc-def-root.cer"

# Specify the complete DN of the LDAP bind user
ldap_bind_dn = "cn=read-only-admin,dc=example,dc=com"

# Specify the LDAP password for the above user
ldap_bind_password = "password"

# Specify a key to find the user name
ldap_user_name_attribute = "uid"

3. Start (or restart) Driverless AI. Users can now launch Driverless AI using their LDAP credentials. If authentication is successful, the user can access Driverless AI and run experiments, visualize datasets, interpret models, etc.
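
If the LDAP server's certificate is signed by a private certificate authority, one way to obtain the certificate chain for use as ldap_tls_file is with the openssl s_client utility. This is a sketch only; it assumes your server also accepts LDAPS connections on port 636, so adjust the hostname and port for your environment.

# Print the certificate chain presented by the LDAP server
openssl s_client -connect your-ldap-server.example.com:636 -showcerts </dev/null

# Save the presented certificates to a file that can be referenced by ldap_tls_file
openssl s_client -connect your-ldap-server.example.com:636 -showcerts </dev/null \
  | sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' > /tmp/abc-def-root.cer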

12.3 Local Authentication Example

This section describes how to enable local authentication in Driverless AI.

12.3.1 Enabling Local Auth in Docker Images

To enable authentication in Docker images, specify the authentication environment variable that you want to use. Each
variable must be prepended with DRIVERLESS_AI_. Replace TAG below with the image tag. The example below
starts Driverless AI with environment variables that enable the following:
• Local authentication when starting Driverless AI
• S3 and HDFS access (without authentication)
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="local" \
-e DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="<htpasswd_file_location>" \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

12.3.2 Enabling Local Auth in the config.toml File for Native Installs

Native installs include DEBs, RPMs, and TAR SH installs. The example below shows the configuration options in the
config.toml file to set when enabling the following:
• Local authentication when starting Driverless AI
• S3 and HDFS access (without authentication)
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"


2. Open the config.toml file and edit the authentication variables. The config.toml file is available in the etc/dai
folder after the RPM or DEB is installed.
# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file,s3,hdfs"

# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam : Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server; look
#        for additional settings under LDAP settings
# local : Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "local"

# Local password file
# Generating a htpasswd file: see syntax below
# htpasswd -B "<location_to_place_htpasswd_file>" "<username>"
# note: -B forces use of bcrypt, a secure encryption method
local_htpasswd_file = "<htpasswd_file_location>"
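
The htpasswd utility referenced in the comments above ships with the Apache tools package (httpd-tools on CentOS/RHEL, apache2-utils on Debian/Ubuntu). The commands below are a sketch only; the file path and usernames are placeholders, so substitute your own.

# Create a new htpasswd file containing the first user (-c creates the file, -B uses bcrypt)
htpasswd -B -c /opt/dai-users/htpasswd jsmith

# Add or update additional users in the same file (omit -c so the file is not recreated)
htpasswd -B /opt/dai-users/htpasswd aparker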

3. Start (or restart) Driverless AI. Note that the command used to start Driverless AI varies depending on your
install type.
# Linux RPM or DEB with systemd
sudo systemctl start dai

# Linux RPM or DEB without systemd


sudo -H -u dai /opt/h2oai/dai/run-dai.sh

# Linux TAR SH
./run-dai.sh

12.4 mTLS Authentication Example

Driverless AI supports Mutual TLS authentication (mTLS) by setting a specific verification mode along with a certificate authority file, an SSL private key, and an SSL certificate file. The diagram below is a visual representation of the mTLS authentication process.


12.4.1 Description of Configuration Attributes

Use the following configuration options to configure mTLS.


• ssl_client_verify_mode: Sets the client verification mode. Choose from the following verification modes:
  – CERT_NONE: The client will not need to provide a certificate. If it does provide a certificate, any resulting verification errors are ignored.
  – CERT_OPTIONAL: The client does not need to provide a certificate. If it does provide a certificate, it is verified against the configured CA chains.
  – CERT_REQUIRED: The client needs to provide a certificate for verification. Note that you will need to configure the ssl_client_key_file and ssl_client_crt_file options when this mode is selected in order for Driverless AI to be able to verify its own callback requests.
• ssl_ca_file: Specifies the path to the certification authority (CA) certificate file, provided by your organization. This certificate is used to verify the client certificate when client authentication is enabled. If this is not specified, clients are verified using the default system certificates.
• ssl_key_file: Specifies your web server private key file. This is normally created by your organization's sys admin.
• ssl_crt_file: Specifies your web server public certificate file. This is normally created by your organization's sys admin.
• ssl_client_key_file: Required if ssl_client_verify_mode = "CERT_REQUIRED". Specifies the private key file that Driverless AI uses to authenticate itself. This is normally created by your organization's sys admin.
• ssl_client_crt_file: Required if ssl_client_verify_mode = "CERT_REQUIRED". Specifies the client certificate file that Driverless AI uses to authenticate itself. This is normally created by your organization's sys admin.
• auth_tls_crl_file: Specifies the path to the certificate revocation list file that will be used to verify the client certificate. This file contains a list of revoked user IDs.
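
The key and certificate files above are normally issued by your organization. For a self-contained test setup, a set of throwaway files can be generated with openssl. This is only a sketch under the assumption that a private test CA is acceptable in your environment; the file names simply mirror the paths used in the examples below.

# Create a private test CA (maps to ssl_ca_file)
openssl genrsa -out rootCA.key 4096
openssl req -x509 -new -nodes -key rootCA.key -sha256 -days 365 \
  -subj "/CN=Test Root CA" -out rootCA.pem

# Create the web server key and certificate (maps to ssl_key_file / ssl_crt_file)
openssl genrsa -out private_key.pem 2048
openssl req -new -key private_key.pem -subj "/CN=dai.example.com" -out server.csr
openssl x509 -req -in server.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial \
  -days 365 -sha256 -out cert.pem

# Create the client key and certificate that Driverless AI uses for its own
# callback requests (maps to ssl_client_key_file / ssl_client_crt_file)
openssl genrsa -out client_config_key.key 2048
openssl req -new -key client_config_key.key -subj "/CN=dai-client" -out client.csr
openssl x509 -req -in client.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial \
  -days 365 -sha256 -out client_config_cert.pem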

12.4.2 Configuration Scenarios

The following describes user certificate behavior for mTLS authentication based on combinations of the configuration options described above.

• ssl_client_verify_mode='CERT_NONE'
  – User does not have a certificate: user certificates are ignored.
  – User has a correct and valid certificate: user certificates are ignored.
  – User has a revoked certificate: revoked user certificates are ignored.
• ssl_client_verify_mode='CERT_OPTIONAL'
  – User does not have a certificate: user certificates are ignored.
  – User has a correct and valid certificate: user certificates are sent to Driverless AI but are not used for validating the certs.
  – User has a revoked certificate: revoked user certificates are not validated.
• ssl_client_verify_mode='CERT_REQUIRED'
  – User does not have a certificate: not allowed.
  – User has a correct and valid certificate: the user provides a valid certificate that is used by Driverless AI, but it does not authenticate the user.
  – User has a revoked certificate: revocation lists are not validated.
• ssl_client_verify_mode='CERT_REQUIRED' AND authentication_method='tls_authentication'
  – User does not have a certificate: not allowed.
  – User has a correct and valid certificate: the certificate is used for connecting to the Driverless AI server as well as for authentication.
  – User has a revoked certificate: revoked user certificates are validated, and the revocation file is provided in auth_tls_crl_file.

12.4.3 Enabling mTLS Authentication in Docker Images

To enable mTLS authentication in Docker images, specify the authentication environment variable that you want to
use. Each variable must be prepended with DRIVERLESS_AI_. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-p 12345:12345 \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_ENABLE_HTTPS=true \
-e DRIVERLESS_AI_SSL_KEY_FILE=/etc/dai/private_key.pem \
-e DRIVERLESS_AI_SSL_CRT_FILE=/etc/dai/cert.pem \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD=tls_certificate \
-e DRIVERLESS_AI_SSL_CLIENT_VERIFY_MODE=CERT_REQUIRED \
-e DRIVERLESS_AI_SSL_CA_FILE=/etc/dai/rootCA.pem \
-e DRIVERLESS_AI_SSL_CLIENT_KEY_FILE=/etc/dai/client_config_key.key \
-e DRIVERLESS_AI_SSL_CLIENT_CRT_FILE=/etc/dai/client_config_cert.pem \
-v /user/1.8.4_auth/log:/log \
-v /user/1.8.4_auth/tmp:/tmp \
-v /user/certificates/server_config_key.pem:/etc/dai/private_key.pem \
-v /user/certificates/server_config_cert.pem:/etc/dai/cert.pem \
-v /user/certificates/client_config_cert.pem:/etc/dai/client_config_cert.pem \
-v /user/certificates/client_config_key.key:/etc/dai/client_config_key.key \
-v /user/certificates/rootCA.pem:/etc/dai/rootCA.pem \
h2oai/dai-centos7-x86_64:TAG


12.4.4 Enabling mTLS Authentication in the config.toml File for Native Installs

Native installs include DEBs, RPMs, and TAR SH installs. The example below shows how to edit the config.toml file
to enable mTLS authentication.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Open the config.toml file and edit the following authentication variables. The config.toml file is available in the
etc/dai folder after Driverless AI is installed.
# Path to the Certification Authority certificate file. This certificate is
# used to verify the client certificate when client authentication is turned on.
# If this is not set, clients are verified using default system certificates.
#
ssl_ca_file = "/etc/pki/ca.crt"

# Sets the client verification mode.
# CERT_NONE: Client does not need to provide a certificate, and if it does, any
# verification errors are ignored.
# CERT_OPTIONAL: Client does not need to provide a certificate, and if it does,
# the certificate is verified against the configured CA chains.
# CERT_REQUIRED: Client needs to provide a certificate, and the certificate is
# verified.
# You'll need to set 'ssl_client_key_file' and 'ssl_client_crt_file'
# when this mode is selected for Driverless AI to be able to verify
# its own callback requests.
#
ssl_client_verify_mode = "CERT_REQUIRED"

# Path to the private key that Driverless will use to authenticate itself when
# CERT_REQUIRED mode is set.
#
ssl_client_key_file = "/etc/pki/dai-self.key"

# Path to the client certificate that Driverless will use to authenticate itself
# when CERT_REQUIRED mode is set.
#
ssl_client_crt_file = "/etc/pki/dai-self.crt"

# Enable client certificate authentication


authentication_method = "tls_certificate"

3. Start (or restart) Driverless AI.

12.5 OpenID Connect Authentication Example

This section describes how to enable OpenID Connect authentication in Driverless AI.
Note: The Driverless AI Python and R clients are not compatible with the OpenID Connect authentication method.
If you plan to connect to Driverless AI using the R or Python clients, it is recommended that you either enable OpenID
Connect for UI access only or use a different authentication method.

12.5.1 The OpenID Connect Protocol

OpenID Connect follows a distinct protocol during the authentication process:


1. A request is sent from the client (RP) to the OpenID provider (OP).
2. The OP authenticates the end user and obtains authorization.
3. The OP responds with an ID Token. (An Access Token is usually provided as well.)
4. The Relying Party (RP) can send a request with the Access Token to the UserInfo Endpoint.
5. The UserInfo Endpoint returns Claims about the End User.
Refer to the OpenID Connect Basic Client Implementer’s Guide for more information: https://openid.net/specs/
openid-connect-basic-1_0.html


12.5.2 Understanding the Well-Known Endpoint

In order to begin the process of configuring Driverless AI for OpenID-based authentication, the end user must retrieve
OpenID Connect metadata about their authorization server by requesting information from the well-known endpoint.
This information is subsequently used to configure further interactions with the provider.
The well-known endpoint is typically configured as follows:
https://yourOpenIDProviderHostname/.well-known/openid-configuration
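
For example, the metadata can be retrieved and pretty-printed from the command line. The hostname below is a placeholder; the JSON response lists the authorization, token, userinfo, and logout endpoints that are used in the configuration options described in the next section.

curl -s https://yourOpenIDProviderHostname/.well-known/openid-configuration | python -m json.tool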

12.5.3 Open ID Configuration Options

Set the following options in the config.toml file for enabling OpenID-based authentication.
# The OpenID server URL. (Ex: https://oidp.ourdomain.com) Do not end with a "/"
auth_openid_provider_base_uri= "https://yourOpenIDProviderHostname"

# The uri to pull OpenID config data from. (You can extract most of required OpenID config from this URL.)
# Usually located at: /auth/realms/master/.well-known/openid-configuration

# Quote method from urllib.parse used to encode payload dict in Authentication Request
auth_openid_urlencode_quote_via="quote"

# These endpoints are made available by the well-known endpoint of the OpenID provider
# All endpoints should start with a "/"
auth_openid_auth_uri=""
auth_openid_token_uri=""
auth_openid_userinfo_uri=""
auth_openid_logout_uri=""

# In most cases, these values are usually 'code' and 'authorization_code' (as shown below)
# Supported values for response_type and grant_type are listed in the response of well-known endpoint
auth_openid_response_type="code"
auth_openid_grant_type="authorization_code"

# Scope values--supported values are available in the response from the well-known endpoint
# 'openid' is required
# Additional scopes may be necessary if the response to the userinfo request
# does not include enough information to use for authentication
# Separate additional scopes with a blank space.
# See https://openid.net/specs/openid-connect-basic-1_0.html#Scopes for more info
auth_openid_scope="openid"

# The OpenID client details that are available from the provider
# A new client for Driverless AI in your OpenID provider must be created if one does not already exist
auth_openid_client_id=""
auth_openid_client_secret=""

# Sample redirect value: http[s]://driverlessai-server-address:port/openid/callback


# Ensure that the client configuration in the OpenID provider (see previous step) includes
# this exact URL as one of the possible redirect URLs for the client
# If these do not match, the OpenID connection will fail
auth_openid_redirect_uri=""

# Token endpoint response key configs


auth_openid_access_token_expiry_key="expires_in"
auth_openid_refresh_token_expiry_key="refresh_expires_in"

# UserInfo response key configs for all users who log in to Driverless AI
# The userinfo_auth_key and userinfo_auth_value are
# a key value combination in the userinfo response that remain static for everyone
# If this key value pair does not exist in the user_info response,
# then the Authentication is considered failed
auth_openid_userinfo_auth_key=""
auth_openid_userinfo_auth_value=""

# Key that specifies username in user_info json (we will use value of this key as username in Driverless AI)
auth_openid_userinfo_username_key=""

# Enable advanced matching for OpenID authentication


# When enabled, the ObjectPath expression is used to evaluate the user's identity
# Disabled by default
# For more information, refer to http://objectpath.org/
auth_openid_use_objectpath_match=false

# Set the ObjectPath expression


# Used to evaluate whether a user is allowed to login to Driverless AI
# The user is allowed to log in when the expression evaluates to True
# Examples:
# $.our_claim is "our_value" (simple claim equality)
# "expected_role" in @.roles (list of claims contains required value)
auth_openid_use_objectpath_expression=""
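
As an illustration only, for a hypothetical Keycloak provider the endpoint-related options might be filled in as shown below. The hostname, realm, and paths are assumptions; take the exact values from your provider's well-known document rather than from this sketch.

auth_openid_provider_base_uri = "https://keycloak.example.com"
auth_openid_auth_uri = "/auth/realms/master/protocol/openid-connect/auth"
auth_openid_token_uri = "/auth/realms/master/protocol/openid-connect/token"
auth_openid_userinfo_uri = "/auth/realms/master/protocol/openid-connect/userinfo"
auth_openid_logout_uri = "/auth/realms/master/protocol/openid-connect/logout"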


12.5.4 Enabling OpenID Connect

The examples that follow describe how to start Driverless AI in the Docker image and with native installs after OpenID
has been configured.

Enabling OpenID Connect in Docker Images

1. Edit the OpenID configuration options in your config.toml file as described in the Open ID Configuration Options section.
2. Mount the edited config.toml file into the Docker container. Replace TAG below with your Driverless AI tag.
nvidia-docker run \
--net=openid-network \
--name="dai-with-openid" \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v "`pwd`/DAI_DATA/data":/data \
-v "`pwd`/DAI_DATA/log":/log \
-v "`pwd`/DAI_DATA/license":/license \
-v "`pwd`/DAI_DATA/tmp":/tmp \
-v "`pwd`/DAI_DATA/config":/config \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
h2oai/dai-centos7-x86_64:TAG

The next step is to launch and log in to Driverless AI. Refer to Logging in to Driverless AI.

Enabling OpenID Connect in Native Installs

1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:


# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Edit the OpenID configuration properties in the config.toml file as described in the Open ID Configuration
Options section.
3. Start (or restart) Driverless AI.
The next step is to launch and log in to Driverless AI. Refer to Logging in to Driverless AI.

12.5.5 Logging in to Driverless AI

Open a browser and launch Driverless AI. Note that you will be prompted to log in with OpenID.


12.6 PAM Authentication Example

The following sections describe how to enable Pluggable Authentication Modules (PAM) in Driverless AI. You can
do this by specifying environment variables in the Docker image or by updating the config.toml file.
Note: This assumes that the user has an understanding of how to grant permissions in their own environment in
order for PAM to work. Specifically for Driverless AI, be sure that the Driverless AI process owner has access to
/etc/shadow (without root); otherwise authentication will fail.
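
How you grant that access depends on your environment. As one possible approach, assuming the Driverless AI service runs as a local user named dai, you could give that account group-level read access to /etc/shadow rather than running the service as root:

# Create a shadow group if it does not already exist (-f makes this a no-op if it does)
sudo groupadd -f shadow
# Give the group read access to /etc/shadow
sudo chgrp shadow /etc/shadow
sudo chmod g+r /etc/shadow
# Add the Driverless AI service account to the group
sudo usermod -aG shadow dai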

12.6.1 Enabling PAM in Docker Images

Note: The following instructions are only applicable with a CentOS 7 host.
In this example, the host Linux system has PAM enabled for authentication and Docker running on that Linux system.
The goal is to enable PAM for Driverless AI authentication while the Linux system hosts the user information.
1. Verify that the username (“eric” in this case) is defined in the Linux system.
[root@Linux-Server]# cat /etc/shadow | grep eric
eric:$6$inOv3GsQuRanR1H4$kYgys3oc2dQ3u9it02WTvAYqiGiQgQ/yqOiOs.g4F9DM1UJGpruUVoGl5G6OD3MrX/3uy4gWflYJnbJofaAni/::0:99999:7:::

2. Start Docker on the Linux Server and enable PAM in Driverless AI. Replace TAG below with the image tag.
[root@Linux-Server]# docker run \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v /etc/passwd:/etc/passwd \
-v /etc/shadow:/etc/shadow \
-v /etc/pam.d/:/etc/pam.d/ \
-e DRIVERLESS_AI_AUTHENTICATION_METHOD="pam" \
h2oai/dai-centos7-x86_64:TAG

3. Obtain the Driverless AI container ID. This ID is required for the next step and will be different every time
Driverless AI is started.
[root@Linux-Server]# docker ps
CONTAINER ID  IMAGE                    COMMAND      CREATED          STATUS          PORTS                                                                                     NAMES
8e333475ffd8  opsh2oai/h2oai-runtime   "./run.sh"   36 seconds ago   Up 35 seconds   192.168.0.1:9090->9090/tcp, 192.168.0.1:12345->12345/tcp, 192.168.0.1:12348->12348/tcp   clever_swirles


4. From the Linux Server, verify that the Docker Driverless AI instance can see the shadow file. The example
below references 8e333475ffd8, which is the container ID obtained in the previous step.
[root@Linux-Server]# docker exec 8e333475ffd8 cat /etc/shadow|grep eric
eric:$6$inOv3GsQuRanR1H4$kYgys3oc2dQ3u9it02WTvAYqiGiQgQ/yqOiOs.g4F9DM1UJGpruUVoGl5G6OD3MrX/3uy4gWflYJnbJofaAni/::0:99999:7:::

5. Open a Web browser and navigate to port 12345 on the Linux system that is running the Driverless AI Docker
Image. Log in with credentials known to the Linux system. The login information will now be validated using
PAM.

12.6.2 Enabling PAM in the config.toml File for Native Installs

In this example, the host Linux system has PAM enabled for authentication. The goal is to enable PAM for Driverless
AI authentication while the Linux system hosts the user information.
This example shows how to edit the config.toml file to enable PAM. The config.toml file is available in the etc/dai folder
after the RPM or DEB is installed. Edit the authentication_method variable in this file to enable PAM authentication,
and then restart Driverless AI.
1. Verify that the username (“eric” in this case) is defined in the Linux system.
[root@Linux-Server]# cat /etc/shadow | grep eric
eric:$6$inOv3GsQuRanR1H4$kYgys3oc2dQ3u9it02WTvAYqiGiQgQ/yqOiOs.g4F9DM1UJGpruUVoGl5G6OD3MrX/3uy4gWflYJnbJofaAni/::0:99999:7:::

2. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:


# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

3. Edit the authentication_method variable in the config.toml file so that PAM is enabled.
# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam : Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server; look
#        for additional settings under LDAP settings
# local : Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "pam"

4. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Linux RPM or DEB with systemd
[root@Linux-Server]# sudo systemctl start dai

# Linux RPM or DEB without systemd


[root@Linux-Server]# sudo -H -u dai /opt/h2oai/dai/run-dai.sh

# Linux TAR SH
[root@Linux-Server]# ./run-dai.sh

5. Open a Web browser and navigate to port 12345 on the Linux system that is running Driverless AI. Log in with
credentials known to the Linux system (as verified in the first step). The login information will now be validated
using PAM.



CHAPTER

THIRTEEN

ENABLING NOTIFICATIONS

Driverless AI can be configured to trigger a user-defined script at the beginning and end of an experiment. This
functionality can be used to send notifications to services like Slack or to trigger a machine shutdown.
The config.toml file exposes the following variables:
• listeners_experiment_start: Registers an absolute location of a script that gets executed at the start
of an experiment.
• listeners_experiment_done: Registers an absolute location of a script that gets executed when an
experiment is finished successfully.
Driverless AI accepts any executable as a script. (For example, a script can be implemented in Bash or Python.) There
are only two requirements:
• The specified script can be executed (i.e., the file has the executable flag set).
• The script should be able to accept command line parameters.

13.1 Script Interfaces

When Driverless AI executes a script, it passes the following parameters as a script command line:
• Application ID: A unique identifier of a running Driverless AI instance.
• User ID: The identification of the user who is running the experiment.
• Experiment ID: A unique identifier of the experiment.
• Experiment Path: The location of the experiment results.
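
For example, a minimal notification script that posts a message to a Slack incoming webhook might look like the following sketch. The webhook URL is a placeholder that you would create in your own Slack workspace, and the positional parameters are read in the order listed above.

#!/usr/bin/env bash

app_id="${1}"
user_id="${2}"
experiment_id="${3}"
experiment_path="${4}"

# Placeholder webhook URL; create an incoming webhook in your own Slack workspace
slack_webhook_url="https://hooks.slack.com/services/XXX/YYY/ZZZ"

curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"Driverless AI experiment ${experiment_id} (user ${user_id}) finished. Results: ${experiment_path}\"}" \
  "${slack_webhook_url}"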

13.2 Example

The following example demonstrates how to use notification scripts to shut down an EC2 machine that is running
Driverless AI after all launched experiments are finished. The example shows how to use a notification script in a
Docker container and with native installations. The idea of a notification script is to maintain a simple counter (i.e.,
the number of files in a directory) that tracks the number of running experiments. When the counter reaches 0, the
specified action is performed.
In this example, we use the AWS command line utility to shut down the actual machine; however, the same functionality
can be achieved by executing sudo poweroff (if the actual user has password-less sudo capability configured)
or poweroff (if the poweroff script has the setuid bit set together with the executable bit; for more info, please
visit: https://unix.stackexchange.com/questions/85663/poweroff-or-reboot-as-normal-user).
• The on_start Script. This script increases the counter of running experiments.


#!/usr/bin/env bash

app_id="${1}"
experiment_id="${3}"
tmp_dir="${TMPDIR:-/tmp}/${app_id}"
exp_file="${tmp_dir}/${experiment_id}"

mkdir -p "${tmp_dir}"
touch "${exp_file}"

• The on_done Script. This script decreases the counter and executes a machine shutdown when the counter
reaches 0.
#!/usr/bin/env bash

app_id="${1}"
experiment_id="${3}"
tmp_dir="${TMPDIR:-/tmp}/${app_id}"
exp_file="${tmp_dir}/${experiment_id}"

if [ -f "${exp_file}" ]; then
  rm -f "${exp_file}"
fi

running_experiments=$(ls -1 "${tmp_dir}" | wc -l)

if [ "${running_experiments}" -gt 0 ]; then
  echo "There are still ${running_experiments} running experiments!"
else
  echo "No experiments running! Machine is going to shut down!"
  # Use the instance metadata API to get the instance ID, and then use the AWS CLI to shut down the machine.
  # This expects that the AWS CLI is properly configured and has the capability to stop instances.
  aws ec2 stop-instances --instance-ids $(curl http://169.254.169.254/latest/meta-data/instance-id)
fi
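
Both scripts must carry the executable flag before Driverless AI can run them. For example, assuming they are saved under /opt/h2oai/dai/scripts as in the native-install configuration shown later in this section:

# Mark the notification scripts as executable
chmod +x /opt/h2oai/dai/scripts/on_start.sh /opt/h2oai/dai/scripts/on_done.sh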

13.2.1 Docker Image Users

1. Copy the config.toml file from inside the Docker image to your local filesystem. (Change nvidia-docker
run to docker run for non-GPU environments.)
# In your Driverless AI folder (for example, dai_1.8.4.1),
# make config and scripts directories
mkdir config
mkdir scripts

# Copy the config.toml file to the new config directory.
nvidia-docker run \
--pid=host \
--rm \
--init \
-u `id -u`:`id -g` \
-v `pwd`/config:/config \
--entrypoint bash \
h2oai/dai-centos7-x86_64:TAG \
-c "cp /etc/dai/config.toml /config"

2. Edit the Notification scripts section in the config.toml file and save your changes. Note that in this example,
the scripts are saved to a dai_VERSION/scripts folder.
# Notification scripts
# - the variable points to a location of script which is executed at given event in experiment lifecycle
# - the script should have executable flag enabled
# - use of absolute path is suggested
# The on experiment start notification script location
listeners_experiment_start = "dai_VERSION/scripts/on_start.sh"
# The on experiment finished notification script location
listeners_experiment_done = "dai_VERSION/scripts/on_done.sh"

3. Start Driverless AI with the DRIVERLESS_AI_CONFIG_FILE environment variable. Make sure this points
to the location of the edited config.toml file so that the software finds the configuration file. (Change
nvidia-docker run to docker run for non-GPU environments.)
nvidia-docker run \
--pid=host \
--init \
--rm \
-u `id -u`:`id -g` \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v `pwd`/config:/config \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
-v `pwd`/scripts:/scripts \
h2oai/dai-centos7-x86_64:TAG


13.2.2 Native Install Users

1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:


# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Edit the Notification scripts section in the config.toml file to point to the new scripts. Save your changes when
you are done.
# Notification scripts
# - the variable points to a location of script which is executed at given event in experiment lifecycle
# - the script should have executable flag enabled
# - use of absolute path is suggested
# The on experiment start notification script location
listeners_experiment_start = "/opt/h2oai/dai/scripts/on_start.sh"
# The on experiment finished notification script location
listeners_experiment_done = "/opt/h2oai/dai/scripts/on_done.sh"

3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Deb or RPM with systemd (preferred for Deb and RPM):
# Start Driverless AI.
sudo systemctl start dai

# Deb or RPM without systemd:


# Start Driverless AI.
sudo -H -u dai /opt/h2oai/dai/run-dai.sh

# Tar.sh
# Start Driverless AI
./run-dai.sh



CHAPTER

FOURTEEN

EXPORT ARTIFACTS

In some cases, you might find that you do not want your users to download artifacts directly to their machines.
Driverless AI provides several configuration options/environment variables that enable exporting of artifacts instead
of downloading.

14.1 Enabling Artifact Exports

The config.toml file exposes the following variables:


• enable_artifacts_upload: Replaces all of the download options on the experiment page with export options, and allows users to push artifacts to the store configured with artifacts_store.
• artifacts_store: Specifies where artifacts are stored. Setting this to file_system stores artifacts in a file system directory denoted by artifacts_file_system_directory.
• artifacts_file_system_directory: The file system location where artifacts will be copied.
Notes:
• Currently, file_system is the only option that can be specified for artifacts_store. Additional options
will be available in future releases.
• The location for artifacts_file_system_directory is expected to be a directory on your server.
• The option to disable artifact downloads does not extend to datasets. Whether users can download datasets is
controlled by the enable_dataset_downloading configuration option, which is set to true by default.
Set this to false if you do not want users to download datasets to their local machine. There is currently no
configuration option that enables exporting datasets to a file system.
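
Because the artifacts directory must exist on the server and be writable by the Driverless AI process, you may need to create it and adjust its ownership before enabling the feature. The path and service account below are assumptions; adjust them for your installation.

# Create the export location and hand it to the account that runs Driverless AI
sudo mkdir -p /opt/dai-artifacts
sudo chown dai:dai /opt/dai-artifacts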

14.1.1 Docker Image Users

The following example shows how to start the Driverless AI Docker image with artifact exporting enabled.
docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLE_ARTIFACTS_UPLOAD="true" \
-e DRIVERLESS_AI_ARTIFACTS_STORE="file_system" \
-e DRIVERLESS_AI_ARTIFACTS_FILE_SYSTEM_DIRECTORY="tmp" \
-u `id -u`:`id -g` \
-p 12345:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG


14.1.2 Native Install Users

The following example shows how to start Driverless AI with artifact exporting enabled on native installs.
1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:
# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

2. Edit the following configuration option in the config.toml file. Save your changes when you are done.
# Replace all the downloads on the experiment page to exports and allow users to push to the artifact store configured with artifacts_store
enable_artifacts_upload = true

# Artifacts store.
# file_system: stores artifacts on a file system directory denoted by artifacts_file_system_directory.
#
artifacts_store = "file_system"

# File system location where artifacts will be copied in case artifacts_store is set to file_system
artifacts_file_system_directory = "tmp"

3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
# Deb or RPM with systemd (preferred for Deb and RPM):
# Start Driverless AI.
sudo systemctl start dai

# Deb or RPM without systemd:


# Start Driverless AI.
sudo -H -u dai /opt/h2oai/dai/run-dai.sh

# Tar.sh
# Start Driverless AI
./run-dai.sh

14.2 Exporting an Artifact

When the export artifacts options are enabled/configured, the menu options on the Completed Experiment page will
change. Specifically, all “Download” options (with the exception of Autoreport) will change to “Export.”

1. Click on an artifact to begin exporting. For example, click on Export Summary and Logs.


2. Specify a file name or use the default file name. This denotes the new name to be given to the exported artifact.
By default, this name matches the selected export artifact name.
3. Now click the Summary and Logs: Export to Data Store button. (Note that this button name changes depending on the artifact that you select.) This begins the export action. Upon completion, the exported artifact will display in the list of artifacts. The directory structure is: <path_to_export_to>/<user>/<experiment_id>/

4. Continue exporting additional artifacts for this experiment.



CHAPTER

FIFTEEN

LAUNCHING DRIVERLESS AI

Driverless AI is tested on Chrome and Firefox but is supported on all major browsers. For the best user experience,
we recommend using Chrome.
1. After Driverless AI is installed and started, open a browser and navigate to <server>:12345.
2. The first time you log in to Driverless AI, you will be prompted to read and accept the Evaluation Agreement.
You must accept the terms before continuing. Review the agreement, then click I agree to these terms to
continue.
3. Log in by entering unique credentials. For example:
Username: h2oai Password: h2oai
Note that these credentials do not restrict access to Driverless AI; they are used to tie experiments to users.
If you log in with different credentials, for example, then you will not see any previously run experiments.
4. As with accepting the Evaluation Agreement, the first time you log in, you will be prompted to enter your
License Key. Click the Enter License button, then paste the License Key into the License Key entry field.
Click Save to continue. This license key will be saved in the host machine’s /license folder.
Note: Contact [email protected] for information on how to purchase a Driverless AI license.
Upon successful completion, you will be ready to add datasets and run experiments.


15.1 Resources

The Resources dropdown menu provides you with links to view System Information and the Driverless AI User Guide.
From this dropdown menu, you can also download the following:
• Python Client (See The Python Client)
• R Client (See The R Client)
• MOJO2 Java Runtime (See Driverless AI MOJO Scoring Pipeline - Java Runtime)
• MOJO2 Python Runtime (See Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrappers)
• MOJO2 R Runtime (See Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrappers)


15.2 Messages

A Messages menu option is available in the top menu when you launch Driverless AI. Click this to view news and
upcoming events regarding Driverless AI.



CHAPTER

SIXTEEN

THE DATASETS PAGE

The Datasets Overview page is the Driverless AI Home page. This shows all datasets that have been imported. Note
that the first time you log in, this list will be empty.

16.1 Supported File Types

Driverless AI supports the following dataset file formats:


• arff
• bin
• bz2
• csv (See note below)
• dat
• feather


• gz
• jay (See note below)
• parquet (See notes below)
• pkl
• tgz
• tsv
• txt
• xls
• xlsx
• xz
• zip
Notes:
• CSV in UTF-16 encoding is only supported when implemented with a byte order mark (BOM). If a BOM is not
present, the dataset is read as UTF-8.
• For Parquet file formats, if you select to import multiple Parquet files, those files will be imported as multiple
datasets. If you select a folder of Parquet files, the folder will be imported as a single dataset. Tools like
Spark/Hive export data as multiple Parquet files that are stored in a directory with a user-defined name. For
example, if you export with Spark dataFrame.write.parquet("/data/big_parquet_dataset"), Spark creates a folder
/data/big_parquet_dataset, which will contain multiple Parquet files (depending on the number of partitions in
the input dataset) + metadata.
• You may receive a “Failed to ingest binary file with Parquet: lists with structs are not supported” error when
ingesting a Parquet file that has a struct as an element of an array. This is because PyArrow cannot handle a
struct that’s an element of an array. In Sparkling Water, we provide a workaround to flatten the Parquet file.
Refer to our Sparkling Water solution for more information.
• You can create new datasets from Python script files (custom recipes) by selecting Data Recipe URL or Upload
Data Recipe from the Add Dataset (or Drag & Drop) dropdown menu. If you select the Data Recipe URL
option, the URL must point to either a raw file, a GitHub repository or tree, or a local file. In addition, you can
create a new dataset by modifying an existing dataset with a custom recipe. Refer to modify_by_recipe for more
information. Datasets created or added from recipes will be saved as .jay files.

16.2 Adding Datasets

You can add datasets using one of the following methods:


Drag and drop files from your local machine directly onto this page. Note that this method currently works for files
that are less than 10 GB.
or
Click the Add Dataset (or Drag & Drop) button to upload or add a dataset.
Notes:
• Upload File, File System, HDFS, S3, Data Recipe URL, and Upload Data Recipe are enabled by default. These
can be disabled by removing them from the enabled_file_systems setting in the config.toml file. (Refer
to Using the config.toml file section for more information.)
• If File System is disabled, Driverless AI will open a local filebrowser by default.


• If Driverless AI was started with data connectors enabled for Azure Blob Store, BlueData Datatap, Google
Big Query, Google Cloud Storage, KDB+, Minio, Snowflake, or JDBC, then these options will appear in the
Add Dataset (or Drag & Drop) dropdown menu. Refer to the Enabling Data Connectors section for more
information.
• When specifying to add a dataset using Data Recipe URL, the URL must point to either a raw file, a GitHub
repository or tree, or a local file. When adding or uploading datasets via recipes, the dataset will be saved as a
.jay file.
• Datasets must be in delimited text format.
• Driverless AI can detect the following separators: , | ; and tab (\t)
• When importing a folder, the entire folder and all of its contents are read into Driverless AI as a single file.
• When importing a folder, all of the files in the folder must have the same columns.
• If you try to import a folder via a data connector on Windows, the import will fail if the folder contains files that
do not have file extensions (the resulting error is usually related to the above note).


Upon completion, the datasets will appear in the Datasets Overview page. Click on a dataset to open a submenu. From
this menu, you can specify to Rename, view Details of, Visualize, Split, Download, or Delete a dataset. Note: You
cannot delete a dataset that was used in an active experiment. You have to delete the experiment first.


16.3 Renaming Datasets

In Driverless AI, you can rename datasets from the Datasets Overview page.
To rename a dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to
rename, and then select Rename from the submenu that appears.
Note: If the name of a dataset is changed, every instance of the dataset in Driverless AI will be changed to reflect the
new name.

16.4 Dataset Details

To view a summary of a dataset or to preview the dataset, click on the dataset or select the [Click for Actions] button
beside the dataset that you want to view, and then click Details from the submenu that appears. This opens the Dataset
Details page.


16.4.1 Dataset Details Page

The Dataset Details page provides a summary of the dataset. This summary lists each of the dataset’s columns and
displays accompanying rows for logical type, format, storage type (see note below), count, number of missing values,
mean, minimum, maximum, standard deviation, frequency, and number of unique values.
Note: Driverless AI recognizes the following storage types: integer, string, real, boolean, and time.
Hover over the top of a column to view a summary of the first 20 rows of that column.


To view information for a specific column, type the column name in the field above the graph.

Changing a Column Type

Driverless AI also allows you to change a column type. If a column’s data type or distribution does not match the
manner in which you want the column to be handled during an experiment, changing the Logical Type can help to
make the column fit better. For example, an integer zip code can be changed into a categorical so that it is only used
with categorical-related feature engineering. For Date and Datetime columns, use the Format option. To change the
Logical Type or Format of a column, click on the group of square icons located to the right of the words Auto-detect.
(The squares light up when you hover over them with your cursor.) Then select the new column type for that column.


16.4.2 Dataset Rows Page

To switch the view and preview the dataset, click the Dataset Rows button in the top right portion of the UI. Then
click the Dataset Overview button to return to the original view.

16.4.3 Modify By Recipe

The option to create a new dataset by modifying an existing dataset with custom recipes is also available from this
page. Scoring pipelines can be created on the new dataset by building an experiment. This feature is useful when you
want to make changes to the training data that you would not need to make on the new data you are predicting on. For
example, you can change the target column from regression to classification, add a weight column to mark specific
training rows as being more important, or remove outliers that you do not want to model on. Refer to the Adding a
Data Recipe section for more information.
Click the Modify by Recipe button in the top right portion of the UI and select from the following options:
• Data Recipe URL: Load a custom recipe from a URL to use to modify the dataset. The URL must point to
either a raw file, a GitHub repository or tree, or a local file. Sample custom data recipes are available in the
https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.2/data repository.
• Upload Data Recipe: If you have a custom recipe available on your local system, click this button to upload
that recipe.
• Live Code: Manually enter custom recipe code to use to modify the dataset. Click the Get Preview button to
preview the code’s effect on the dataset, then click Save to create a new dataset.
Notes:
• These options are enabled by default. You can disable them by removing recipe_file and recipe_url
from the enabled_file_systems configuration option.
• Modifying a dataset with a recipe will not overwrite the original dataset. The dataset that is selected for modi-
fication will remain in the list of available datasets in its original form, and the modified dataset will appear in
this list as a new dataset.


• Changes made to the original dataset through this feature will not be applied to new data that is scored.

16.5 Downloading Datasets

In Driverless AI, you can download datasets from the Datasets Overview page.
To download a dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to
download, and then select Download from the submenu that appears.
Note: The option to download datasets will not be available if the enable_dataset_downloading option is set
to false when starting Driverless AI. This option can be specified in the config.toml file.


16.6 Splitting Datasets

In Driverless AI, you can split a training dataset into test and validation datasets.
Perform the following steps to split a dataset.
1. To split a dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to
split, and then select Split from the submenu that appears.


2. The Dataset Splitter form displays. Specify an Output Name 1 and an Output Name 2 for the first and second
part of the split. (For example, you can name one test and one valid.)
3. Optionally specify a Target column (for stratified sampling), a Fold column (to keep rows belonging to the same
group together), a Time column, and/or a Random Seed (defaults to 1234).
4. Use the slider to select a split ratio, or enter a value in the Train/Valid Split Ratio field.
5. Click Save when you are done.

Upon completion, the split datasets will be available on the Datasets page.

16.7 Visualizing Datasets

Perform one of the following steps to visualize a dataset:


• On the Datasets page, select the [Click for Actions] button beside the dataset that you want to view, and then
click Visualize from the submenu that appears.


• Click the Autoviz top menu link to go to the Visualizations list page, click the New Visualization button, then
select or import the dataset that you want to visualize.

16.7.1 The Visualization Page

The Visualization page shows all available graphs for the selected dataset. Note that the graphs on the Visualization
page can vary based on the information in your dataset. You can also view and download logs that were generated
during the visualization.


The following is a complete list of available graphs.


• Correlated Scatterplots: Correlated scatterplots are 2D plots with large values of the squared Pearson corre-
lation coefficient. All possible scatterplots based on pairs of features (variables) are examined for correlations.
The displayed plots are ranked according to the correlation. Some of these plots may not look like textbook
examples of correlation. The only criterion is that they have a large value of squared Pearson’s r (greater than
.95). When modeling with these variables, you may want to leave out variables that are perfectly correlated with
others.
Note that points in the scatterplot can have different sizes. Because Driverless AI aggregates the data and
does not display all points, the bigger the point is, the bigger number of exemplars (aggregated points) the
plot covers.
• Spikey Histograms: Spikey histograms are histograms with huge spikes. This often indicates an inordinate
number of single values (usually zeros) or highly similar values. The measure of “spikeyness” is a bin frequency
that is ten times the average frequency of all the bins. You should be careful when modeling (particularly
regression models) with spikey variables.
• Skewed Histograms: Skewed histograms are ones with especially large skewness (asymmetry). The robust
measure of skewness is derived from Groeneveld, R.A. and Meeden, G. (1984), “Measuring Skewness and
Kurtosis.” The Statistician, 33, 391-399. Highly skewed variables are often candidates for a transformation
(e.g., logging) before use in modeling. The histograms in the output are sorted in descending order of skewness.
• Varying Boxplots: Varying boxplots reveal unusual variability in a feature across the categories of a categor-
ical variable. The measure of variability is computed from a robust one-way analysis of variance (ANOVA).
Sufficiently diverse variables are flagged in the ANOVA. A boxplot is a graphical display of the fractiles of a
distribution. The center of the box denotes the median, the edges of a box denote the lower and upper quartiles,
and the ends of the “whiskers” denote that range of values. Sometimes outliers occur, in which case the adjacent
whisker is shortened to the next lower or upper value. For variables (features) having only a few values, the
boxes can be compressed, sometimes into a single horizontal line at the median.
• Heteroscedastic Boxplots: Heteroscedastic boxplots reveal unusual variability in a feature across the categories
of a categorical variable. Heteroscedasticity is calculated with a Brown-Forsythe test: Brown, M. B. and
Forsythe, A. B. (1974), “Robust tests for equality of variances.” Journal of the American Statistical Association,
69, 364-367. Plots are ranked according to their heteroscedasticity values. A boxplot is a graphical display of
the fractiles of a distribution. The center of the box denotes the median, the edges of a box denote the lower and
upper quartiles, and the ends of the “whiskers” denote that range of values. Sometimes outliers occur, in which
case the adjacent whisker is shortened to the next lower or upper value. For variables (features) having only a
few values, the boxes can be compressed, sometimes into a single horizontal line at the median.
few values, the boxes can be compressed, sometimes into a single horizontal line at the median.
• Biplots: A Biplot is an enhanced scatterplot that uses both points and vectors to represent structure simultane-
ously for rows and columns of a data matrix. Rows are represented as points (scores), and columns are repre-
sented as vectors (loadings). The plot is computed from the first two principal components of the correlation
matrix of the variables (features). You should look for unusual (non-elliptical) shapes in the points that might
reveal outliers or non-normal distributions. And you should look for purple vectors that are well-separated.
Overlapping vectors can indicate a high degree of correlation between variables.
• Outliers: Variables with anomalous or outlying values are displayed as red points in a dot plot. Dot plots are
constructed using an algorithm in Wilkinson, L. (1999). “Dot plots.” The American Statistician, 53, 276–281.
Not all anomalous points are outliers. Sometimes the algorithm will flag points that lie in an empty region (i.e.,
they are not near any other points). You should inspect outliers to see if they are miscodings or if they are due
to some other mistake. Outliers should ordinarily be eliminated from models only when there is a reasonable
explanation for their occurrence.
• Correlation Graph: The correlation network graph is constructed from all pairwise squared correlations be-
tween variables (features). For continuous-continuous variable pairs, the statistic used is the squared Pearson
correlation. For continuous-categorical variable pairs, the statistic is based on the squared intraclass correlation
(ICC). This statistic is computed from the mean squares from a one-way analysis of variance (ANOVA). The
formula is (MSbetween - MSwithin)/(MSbetween + (k - 1)MSwithin), where k is the number of categories in
the categorical variable. For categorical-categorical pairs, the statistic is computed from Cramer’s V squared.
If the first variable has k1 categories and the second variable has k2 categories, then a k1 x k2 table is created
from the joint frequencies of values. From this table, we compute a chi-square statistic. Cramer’s V squared
statistic is then (chi-square / n) / min(k1,k2), where n is the total of the joint frequencies in the table. Variables
with large values of these respective statistics appear near each other in the network diagram. The color scale
used for the connecting edges runs from low (blue) to high (red). Variables connected by short red edges tend
to be highly correlated.
• Parallel Coordinates Plot: A Parallel Coordinates Plot is a graph used for comparing multiple variables. Each
variable has its own vertical axis in the plot. Each profile connects the values on the axes for a single observation.
If the data contain clusters, these profiles will be colored by their cluster number.
• Radar Plot: A Radar Plot is a two-dimensional graph that is used for comparing multiple variables. Each
variable has its own axis that starts from the center of the graph. The data are standardized on each variable
between 0 and 1 so that values can be compared across variables. Each profile, which usually appears in the
form of a star, connects the values on the axes for a single observation. Multivariate outliers are represented
by red profiles. The Radar Plot is the polar version of the popular Parallel Coordinates plot. The polar layout
enables us to represent more variables in a single plot.
• Data Heatmap: The heatmap graphic is constructed from the transposed data matrix. Rows of the heatmap
represent variables, and columns represent cases (instances). The data are standardized before display so that
small values are yellow and large values are red. The rows and columns are permuted via a singular value
decomposition (SVD) of the data matrix so that similar rows and similar columns are near each other.
• Missing Values Heatmap: The missing values heatmap graphic is constructed from the transposed data matrix.
Rows of the heatmap represent variables and columns represent cases (instances). The data are coded into the
values 0 (missing) and 1 (nonmissing). Missing values are colored red and nonmissing values are left blank
(white). The rows and columns are permuted via a singular value decomposition (SVD) of the data matrix so
that similar rows and similar columns are near each other.
• Gaps Histogram: The gaps index is computed using an algorithm of Wainer and Schacht based on work by
John Tukey. (Wainer, H. and Schacht, Psychometrika, 43(2), 203–212.) Histograms with gaps can indicate a
mixture of two or more distributions based on possible subgroups not necessarily characterized in the dataset.
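The following is a minimal computational sketch of the two correlation statistics described in the Correlation Graph item above (the function names are illustrative and not part of Driverless AI; the formulas are taken directly from the description):

import numpy as np
from scipy.stats import chi2_contingency

def icc_from_anova(ms_between, ms_within, k):
    # Formula as given above: (MSbetween - MSwithin) / (MSbetween + (k - 1) * MSwithin),
    # where k is the number of categories in the categorical variable.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def cramers_v_squared(table):
    # table: k1 x k2 numpy array of joint frequencies for two categorical variables.
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k1, k2 = table.shape
    return (chi2 / n) / min(k1, k2)   # as stated in the description above

# Hypothetical 2 x 3 contingency table.
print(cramers_v_squared(np.array([[10, 20, 30], [15, 5, 40]])))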
The images on this page are thumbnails. You can click on any of the graphs to view and download a full-scale image.


You can also view an explanation for each graph by clicking the Help button in the lower-left corner of each expanded
graph.



CHAPTER SEVENTEEN

EXPERIMENTS

17.1 Before You Begin

This section describes how to run an experiment using the Driverless AI UI. Before you begin, it is best that you
understand the available options that you can specify. Note that only a dataset and a target column are required to be
specified, but Driverless AI provides a variety of experiment and expert settings that you can use to build your models.
After you have a comfortable working knowledge of these options, proceed to the New Experiments section.

17.2 Experiment Settings

This section describes the settings that are available when running an experiment.

17.2.1 Display Name

Optional: Specify a display name for the new experiment. There are no character or length restrictions for naming. If
this field is left blank, Driverless AI will automatically generate a name for the experiment.

17.2.2 Dropped Columns

Dropped columns are columns that you do not want to be used as predictors in the experiment. Note that Driver-
less AI will automatically drop ID columns and columns that contain a significant number of unique values (above
max_relative_cardinality in the config.toml file or Max. allowed fraction of uniques for integer and
categorical cols in Expert settings).

17.2.3 Validation Dataset

The validation dataset is used for tuning the modeling pipeline. If provided, the entire training data will be used for
training, and validation of the modeling pipeline is performed with only this validation dataset. When you do not
include a validation dataset, Driverless AI will do K-fold cross validation for I.I.D. experiments and multiple rolling
window validation splits for time series experiments. For this reason it is not generally recommended to include a
validation dataset as you are then validating on only a single dataset. Please note that time series experiments cannot
be used with a validation dataset: including a validation dataset will disable the ability to select a time column and
vice versa.
This dataset must have the same number of columns (and column types) as the training dataset. Also note that if
provided, the validation set is not sampled down, so it can lead to large memory usage, even if accuracy=1 (which
reduces the train size).


17.2.4 Test Dataset

The test dataset is used for testing the modeling pipeline and creating test predictions. The test set is never used during
training of the modeling pipeline. (Results are the same whether a test set is provided or not.) If a test dataset is
provided, then test set predictions will be available at the end of the experiment.

17.2.5 Weight Column

Optional: Column that indicates the observation weight (a.k.a. sample or row weight), if applicable. This column must
be numeric with values >= 0. Rows with higher weights have higher importance. The weight affects model training
through a weighted loss function and affects model scoring through weighted metrics. The weight column is not used
when making test set predictions, but a weight column (if specified) is used when computing the test score.

17.2.6 Fold Column

Optional: Rows with the same value in the fold column represent groups that should be kept together in the training,
validation, or cross-validation datasets.
By default, Driverless AI assumes that the dataset is i.i.d. (independent and identically distributed) and creates
validation datasets randomly for regression or with stratification of the target variable for classification.
The fold column is used to create the training and validation datasets so that all rows with the same Fold value will
be in the same dataset. This can prevent data leakage and improve generalization. For example, when viewing data
for a pneumonia dataset, person_id would be a good Fold Column. This is because the data may include multiple
diagnostic snapshots per person, and we want to ensure that the same person’s characteristics show up only in either
the training or validation frames, but not in both to avoid data leakage.
This column must be an integer or categorical variable and cannot be specified if a validation set is used or if a Time
Column is specified.

17.2.7 Time Column

Optional: Specify a column that provides a time order (time stamps for observations), if applicable. This can improve
model performance and model validation accuracy for problems where the target values are auto-correlated with
respect to the ordering (per time-series group).
The values in this column must be a datetime format understood by pandas.to_datetime(), like “2017-11-29 00:30:35”
or “2017/11/29”, or integer values. If [AUTO] is selected, all string columns are tested for potential date/datetime
content and considered as potential time columns. If a time column is found, feature engineering and model validation
will respect the causality of time. If [OFF] is selected, no time order is used for modeling and data may be shuffled
randomly (any potential temporal causality will be ignored).
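As a quick illustration (plain pandas usage, not Driverless AI code), both of the sample formats above are parsed directly by pandas.to_datetime():

import pandas as pd

pd.to_datetime("2017-11-29 00:30:35")   # Timestamp('2017-11-29 00:30:35')
pd.to_datetime("2017/11/29")            # Timestamp('2017-11-29 00:00:00')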
When your data has a date column, then in most cases, specifying [AUTO] for the Time Column will be sufficient.
However, if you select a specific date column, then Driverless AI will provide you with an additional side menu. From
this side menu, you can specify Time Group columns or specify [Auto] to let Driverless AI determine the best time
group columns. You can also specify the columns that will be unavailable at prediction time (see Notes below), the
Forecast Horizon in weeks, and the Gap between the train and test periods.
Refer to Time Series in Driverless AI for more information about time series experiments in Driverless AI and to see
a time series example.


Notes:
• Engineered features will be used for MLI when a time series experiment is built. This is because munged time
series features are more useful features for MLI compared to raw time series features.
• A Time Column cannot be specified if a Fold Column is specified. This is because both fold and time columns
are only used to split training datasets into training/validation, so once you split by time, you cannot also split
with the fold column. If a Time Column is specified, then the time group columns play the role of the fold
column for time series.
• A Time Column cannot be specified if a validation dataset is used.
• A column that is specified as being unavailable at prediction time will only have lag-related features created for
(or with) it.


17.2.8 Accuracy, Time, and Interpretability Knobs

The experiment preview describes what the Accuracy, Time, and Interpretability settings mean for your specific experiment. This preview will automatically update if any of the knob values change. The following is more detailed information describing how these values affect an experiment.

Accuracy

As accuracy increases (as indicated by the tournament_* toml settings), Driverless AI gradually adjusts the method for performing the evolution and ensemble. At low accuracy, Driverless AI varies features and models, but they all compete evenly against each other. At higher accuracy, each independent main model will evolve independently and be part of the final ensemble as an ensemble over different main models. At higher accuracies, Driverless AI will also evolve and ensemble variants with feature types such as Target Encoding toggled on and off, each evolving independently. Finally, at the highest accuracies, Driverless AI performs both model and feature tracking and ensembles all those variations.
Changing this value affects the feature evolution and final pipeline.
Note: A check for a shift in the distribution between train and test is done for accuracy >= 5.
Feature evolution: This represents the algorithms used to create the experiment. If a test set is provided without
a validation set, then Driverless AI will perform a 1/3 validation split during the experiment. If a validation set is
provided, then the experiment will perform external validation.
Final Pipeline: This represents the leveling of ensembling done for the final model (if no time column is selected)
along with the cross-validation values.


Time

This specifies the relative time for completing the experiment (i.e., higher settings take longer). Early stopping will
take place if the experiment doesn't improve the score for the specified number of iterations.

Interpretability

Specify the relative interpretability for this experiment. Higher values favor more interpretable models. Changing
the interpretability level affects the feature pre-pruning strategy, monotonicity constraints, and the feature engineering
search space.
Feature pre-pruning strategy: This represents the feature selection strategy (to prune-away features that do not
clearly give improvement to model score). Strategy = “FS” if interpretability >= 6; otherwise strategy is None.
Monotonicity constraints: If Monotonicity Constraints are enabled, the model will satisfy knowledge about mono-
tonicity in the data and monotone relationships between the predictors and the target variable. For example, in house
price prediction, the house price should increase with lot size and number of rooms, and should decrease with crime
rate in the area. If enabled, Driverless AI will automatically determine if monotonicity is present and enforce it in its
modeling pipelines. Depending on the correlation, Driverless AI will assign positive, negative, or no monotonicity
constraints. Monotonicity is enforced if the absolute correlation is greater than 0.1. All other predictors will not have
monotonicity enforced.
Note: Monotonicity constraints are used in Decision Trees, XGBoost Dart, XGBoost GBM, LightGBM,
and LightGBM Random Forest models.
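Driverless AI determines and applies these constraints automatically. For intuition only, the following hedged sketch shows how such a constraint is expressed in plain XGBoost (monotone_constraints is a standard XGBoost parameter; the feature meanings are hypothetical):

import xgboost as xgb

# Hypothetical housing example: price should increase with lot_size (+1),
# be unconstrained for age (0), and decrease with crime_rate (-1).
params = {
    "objective": "reg:squarederror",
    "monotone_constraints": "(1, 0, -1)",   # one entry per feature, in column order
}
# model = xgb.train(params, dtrain)  # dtrain would be an xgb.DMatrix with those three features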
Feature engineering search space: This represents the transformers used in the experiment. Note that when mixing GBM and GLM in parameter tuning, the search space is split 50%/50% between GBM and GLM.

17.2.9 Classification, Reproducible, and Enable GPUs Buttons

• Classification or Regression button. Driverless AI automatically determines the problem type based on the
response column. Though not recommended, you can override this setting by clicking this button.
• Reproducible: This button allows you to build an experiment with a random seed and get reproducible results.
If this is disabled (default), then results will vary between runs.
• Enable GPUs: Specify whether to enable GPUs. (Note that this option is ignored on CPU-only systems.)

17.3 Expert Settings

This section describes the Expert Settings that are available when starting an experiment. Driverless AI provides a
variety of options in the Expert Settings that allow you to customize your experiment. Use the search bar to refine the
list of settings or locate a specific setting.
The default values for these options are derived from the configuration options in the config.toml file. Refer to the
Sample Config.toml File section for more information about each of these options.
Note about Feature Brain Level: By default, the feature brain pulls in any better model regardless of the features
even if the new model disabled those features. For full control over features pulled in via changes in these Expert
Settings, users should set the Feature Brain Level option to 0.


17.3.1 Upload Custom Recipe

Driverless AI supports the use of custom recipes (optional). If you have a custom recipe available on your local system,
click this button to upload that recipe. If you do not have a custom recipe, you can select from a number of recipes
available in the https://github.com/h2oai/driverlessai-recipes repository. Clone this repository on your local machine
and upload the desired recipe. Refer to the Custom Recipes appendix for examples.

17.3.2 Load Custom Recipe from URL

If you have a custom recipe available on an external system, specify the URL for that recipe here. Note that this
must point to the raw recipe file (for example https://raw.githubusercontent.com/h2oai/driverlessai-recipes/master/
transformers/text_sentiment_transformer.py). Refer to the Custom Recipes appendix for examples.

17.3.3 Official Recipes (External)

Click this button to access H2O’s official recipes repository (https://github.com/h2oai/driverlessai-recipes).


17.3.4 Experiment Settings

Max Runtime in Minutes Before Triggering the Finish Button

Specify the maximum runtime in minutes for an experiment. This is equivalent to pushing the Finish button once half
of the specified time value has elapsed. Note that the overall enforced runtime is only an approximation.
This value defaults to 1440, which is the equivalent of a 24 hour approximate overall runtime. The Finish button will
be automatically selected once 12 hours have elapsed, and Driverless AI will subsequently attempt to complete the
overall experiment in the remaining 12 hours. Set this value to 0 to disable this setting.

Max Runtime in Minutes Before Triggering the Abort Button

Specify the maximum runtime in minutes for an experiment before triggering the abort button. This option preserves
experiment artifacts that have been generated for the summary and log zip files while continuing to generate additional
artifacts. This value defaults to 10080.

Pipeline Building Recipe

Specify the Pipeline Building recipe type (overrides GUI settings). Select from the following:
• AUTO: Specifies that all models and features are automatically determined by experiment settings, config.toml
settings, and the feature engineering effort. (Default)
• COMPLIANT: Similar to AUTO except for the following:
– Interpretability is set to 10.
– Only uses GLM.
– Fixed ensemble level is set to 0.
– Feature brain level is set to 0.
– Max feature interaction depth is set to 1.
– Target transformers is set to ‘identity’ for regression.
– Does not use distribution shift.
• KAGGLE: Similar to AUTO except for the following:
– Any external validation set is concatenated with the train set, with the target marked as missing.
– The test set is concatenated with the train set, with the target marked as missing
– Transformers that do not use the target are allowed to fit_transform across the entirety of the train,
validation, and test sets.
– Has several config.toml expert options with their limits opened up.


Make Python Scoring Pipeline

Specify whether to automatically build a Python Scoring Pipeline for the experiment. Select ON or AUTO (default) to
make the Python Scoring Pipeline immediately available for download when the experiment is finished. Select OFF
to disable the automatic creation of the Python Scoring Pipeline.

Make MOJO Scoring Pipeline

Specify whether to automatically build a MOJO (Java) Scoring Pipeline for the experiment. Select ON to make the
MOJO Scoring Pipeline immediately available for download when the experiment is finished. With this option, any
capabilities that prevent the creation of the pipeline are dropped. Select OFF to disable the automatic creation of the
MOJO Scoring Pipeline. Select AUTO (default) to attempt to create the MOJO Scoring Pipeline without dropping
any capabilities.

Measure MOJO Scoring Latency

Specify whether to measure the MOJO scoring latency at the time of MOJO creation. This is set to AUTO by default.
In this case, MOJO scoring latency will be measured if the pipeline.mojo file size is less than 100 MB.

Timeout in Seconds to Wait for MOJO Creation at End of Experiment

Specify the amount of time in seconds to wait for MOJO creation at the end of an experiment. If the MOJO creation process times out, a MOJO can still be made from the GUI or the R and Python clients (the timeout constraint is not applied to these). This value defaults to 1800 (30 minutes).

Number of Parallel Workers to Use During MOJO Creation

Specify the number of parallel workers to use during MOJO creation. Higher values can speed up MOJO creation but
use more memory. Set this value to -1 (default) to use all physical cores.

Make Pipeline Visualization

Specify whether to create a visualization of the scoring pipeline at the end of an experiment. This is set to AUTO by
default. Note that the Visualize Scoring Pipeline feature is experimental and is not available for deprecated models.
Visualizations are available for all newly created experiments.

Make Autoreport

Specify whether to create the experiment Autoreport after the experiment is finished. This is enabled by default.


Min Number of Rows Needed to Run an Experiment

Specify the minimum number of rows that a dataset must contain in order to run an experiment. This value defaults to
100.

Reproducibility Level

Specify one of the following levels of reproducibility (note that this setting is only active while reproducible mode is
enabled):
• 1 = Same experiment results for same O/S, same CPU(s), and same GPU(s) (Default)
• 2 = Same experiment results for same O/S, same CPU architecture, and same GPU architecture
• 3 = Same experiment results for same O/S, same CPU architecture (excludes GPUs)
• 4 = Same experiment results for same O/S (best approximation)
This value defaults to 1.

Random Seed

Specify a random seed for the experiment. When a seed is defined and the reproducible button is enabled (not by
default), the algorithm will behave deterministically.

Allow Different Sets of Classes Across All Train/Validation Fold Splits

(Note: Applicable for multiclass problems only.) Specify whether to enable full cross-validation (multiple folds)
during feature evolution as opposed to a single holdout split. This is enabled by default.

Max Number of Classes for Classification Problems

Specify the maximum number of classes to allow for a classification problem. A higher number of classes may make
certain processes more time-consuming. Memory requirements also increase with a higher number of classes. This
value defaults to 200.

Model/Feature Brain Level

Specify whether to use H2O.ai brain, which enables local caching and smart re-use (checkpointing) of prior experi-
ments to generate useful features and models for new experiments. It can also be used to control checkpointing for
experiments that have been paused or interrupted.
When enabled, this will use the H2O.ai brain cache if the cache file:
• has any matching column names and types for a similar experiment type
• has classes that match exactly
• has class labels that match exactly
• has basic time series choices that match
• the interpretability of the cache is equal or lower
• the main model (booster) is allowed by the new experiment
• -1: Don't use any brain cache


• 0: Don’t use any brain cache but still write to cache. Use case: Want to save the model for later use, but we want
the current model to be built without any brain models.
• 1: Smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model.
The match may not be precise, so use with caution.
• 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time
series options identically. Use case: Driverless AI scans through the H2O.ai brain cache for the best models to
restart from.
• 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient
size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete
first iteration.
• 4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient
size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete
first iteration.
• 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations to get the best scored
individuals. Note that this can be slower due to brain cache scanning if the cache is large.
When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the
default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file.
This value defaults to 2.

Feature Brain Save Every Which Iteration

Save feature brain iterations every iter_num % feature_brain_iterations_save_every_iteration == 0, to be able to restart/refit with which_iteration_brain >= 0. This is disabled (0) by default.
• -1: Don’t use any brain cache.
• 0: Don’t use any brain cache but still write to cache.
• 1: Smart checkpoint if an old experiment_id is passed in (for example, via running “resume one like this” in the
GUI).
• 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time
series options identically. (default)
• 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient
size.
• 4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient
size.
• 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations (starting from resumed
experiment if chosen) in order to get the best scored individuals.
When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the
default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file.


Feature Brain Restart from Which Iteration

When performing restart or re-fit of type feature_brain_level with a resumed ID, specify which iteration to start from
instead of only last best. Available options include:
• -1: Use the last best
• 1: Run one experiment with feature_brain_iterations_save_every_iteration=1 or some other number
• 2: Identify which iteration brain dump you want to restart/refit from
• 3: Restart/Refit from the original experiment, setting which_iteration_brain to that number here in expert settings.
Note: If restarting from a tuning iteration, this will pull in the entire scored tuning population and use that for feature
evolution. This value defaults to -1.

Feature Brain Refit Uses Same Best Individual

Specify whether to use the same best individual when performing a refit. Disabling this setting allows the order of best
individuals to be rearranged, leading to a better final result. Enabling this setting allows you to view the exact same
model or feature with only one new feature added. This is disabled by default.

Feature Brain Adds Features with New Columns Even During Retraining of Final Model

Specify whether to add additional features from new columns to the pipeline, even when performing a retrain of the
final model. Use this option if you want to keep the same pipeline regardless of new columns from a new dataset. New
data may lead to new dropped features due to shift or leak detection. Disable this to avoid adding any columns as new
features so that the pipeline is perfectly preserved when changing data. This is enabled by default.

Min DAI Iterations

Specify the minimum number of Driverless AI iterations for an experiment. This can be used during restarting, when
you want to continue for longer despite a score not improving. This value defaults to 0.

Select Target Transformation of the Target for Regression Problems

Specify whether to automatically select target transformation for regression problems. Selecting identity disables any
transformation. This is set to AUTO by default.

Tournament Model for Genetic Algorithm

Select a method to decide which models are best at each iteration. This is set to AUTO by default. Choose from the
following:
• auto: Choose based on scoring metric
• fullstack: Choose from optimal model and feature types
• feature: Individuals with similar feature types compete
• model: Individuals with same model type compete
• uniform: All individuals in population compete


Number of Cross-Validation Folds for Feature Evolution

Specify a fixed number of folds (if >= 2) for cross-validation. Note that the actual number of allowed folds can be
less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This
value defaults to -1 (auto).

Number of Cross-Validation Folds for Final Model

Specify a fixed number of folds (if >= 2) for cross-validation. Note that the actual number of allowed folds can be
less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This
value defaults to -1 (auto).

Max Number of Rows Times Number of Columns for Feature Evolution Data Splits

Specify the maximum number of rows allowed for feature evolution data splits (not for the final pipeline). This value
defaults to 100,000,000.

Max Number of Rows Times Number of Columns for Reducing Training Dataset

Specify the upper limit on the number of rows times the number of columns for training the final pipeline. This value
defaults to 500,000,000.

Maximum Size of Validation Data Relative to Training Data

Specify the maximum size of the validation data relative to the training data. Smaller values can make the final pipeline
model training process quicker. Note that final model predictions and scores will always be provided on the full dataset
provided. This value defaults to 2.0.

Perform Stratified Sampling for Binary Classification If the Target Is More Imbalanced Than This

For binary classification experiments, specify a threshold ratio of minority to majority class for the target column
beyond which stratified sampling is performed. If the threshold is not exceeded, random sampling is performed. This
value defaults to 0.1. You can choose to always perform random sampling by setting this value to 0, or to always
perform stratified sampling by setting this value to 1.

Add to config.toml via toml String

Specify any additional configuration overrides from the config.toml file that you want to include in the experiment.
(Refer to the Sample Config.toml File section to view options that can be overridden during an experiment.) Setting
this will override all other settings. Separate multiple config overrides with \n. For example, the following enables
Poisson distribution for LightGBM and disables Target Transformer Tuning. Note that in this example double quotes
are escaped (\" \").
params_lightgbm=\"{'objective':'poisson'}\" \n target_transformer=identity

Or you can specify config overrides similar to the following without having to escape double quotes:
""enable_glm="off" \n enable_xgboost_gbm="off" \n enable_lightgbm="off" \n enable_tensorflow="on"""
""max_cores=10 \n data_precision="float32" \n max_rows_feature_evolution=50000000000 \n ensemble_accuracy_switch=11 \n feature_engineering_effort=1 \n
˓→target_transformer="identity" \n tournament_feature_style_accuracy_switch=5 \n params_tensorflow="{'layers': [100, 100, 100, 100, 100, 100]}"""

When running the Python client, config overrides would be set as follows:


model = h2o.start_experiment_sync(
    dataset_key=train.key,
    target_col='target',
    is_classification=True,
    accuracy=7,
    time=5,
    interpretability=1,
    config_overrides="""
    feature_brain_level=0
    enable_lightgbm="off"
    enable_xgboost_gbm="off"
    enable_ftrl="off"
    """
)

17.3.5 Model Settings

XGBoost GBM Models

This option allows you to specify whether to build XGBoost models as part of the experiment (for both the feature
engineering part and the final model). XGBoost is a type of gradient boosting method that has been widely successful
in recent years due to its good regularization techniques and high accuracy. This is set to AUTO by default. In this
case, Driverless AI will use XGBoost unless the number of rows * columns is greater than a threshold. This threshold
is a config setting that is 100M by default for CPU and 30M by default for GPU.

XGBoost Dart Models

This option specifies whether to use XGBoost’s Dart method when building models for experiment (for both the feature
engineering part and the final model). This is set to AUTO (disabled) by default.

GLM Models

This option allows you to specify whether to build GLM models (generalized linear models) as part of the experiment
(usually only for the final model unless it’s used exclusively). GLMs are very interpretable models with one coefficient
per feature, an intercept term and a link function. This is set to AUTO by default (enabled if accuracy <= 5 and
interpretability >= 6).

Decision Tree Models

This option allows you to specify whether to build Decision Tree models as part of the experiment. This is set to
AUTO by default. In this case, Driverless AI will build Decision Tree models if interpretability is greater than or
equal to the value of decision_tree_interpretability_switch (which defaults to 7) and accuracy is less
than or equal to decision_tree_accuracy_switch (which defaults to 7).

LightGBM Models

This option allows you to specify whether to build LightGBM models as part of the experiment. LightGBM Models
are the default models. This is set to AUTO (enabled) by default.


TensorFlow Models

This option allows you to specify whether to build TensorFlow models as part of the experiment (usually only for text
features engineering and for the final model unless it's used exclusively). Enable this option for NLP experiments. This
is set to AUTO by default (not used unless the number of classes is greater than 10).
TensorFlow models are not yet supported by MOJOs (only Python scoring pipelines are supported).

FTRL Models

This option allows you to specify whether to build Follow the Regularized Leader (FTRL) models as part of the
experiment. Note that MOJOs are not yet supported (only Python scoring pipelines). FTRL supports binomial and
multinomial classification for categorical targets, as well as regression for continuous targets. This is set to AUTO
(disabled) by default.

RuleFit Models

This option allows you to specify whether to build RuleFit models as part of the experiment. Note that MOJOs are
not yet supported (only Python scoring pipelines). Note that multiclass classification is not yet supported for RuleFit
models. Rules are stored to text files in the experiment directory for now. This is set to AUTO (disabled) by default.

LightGBM Boosting Types

Specify which boosting types to enable for LightGBM. Select one or more of the following:
• gbdt: Boosted trees
• rf_early_stopping: Random Forest with early stopping
• rf: Random Forest
• dart: Dropout boosted trees with no early stopping
gbdt and rf are both enabled by default.

LightGBM Categorical Support

Specify whether to enable LightGBM categorical feature support (currently only available for CPU mode). This is
disabled by default.

Constant Models

Specify whether to enable constant models. This is set to AUTO (enabled) by default.


Whether to Show Constant Models in Iteration Panel

Specify whether to show constant models in the iteration panel. This is disabled by default.

Parameters for TensorFlow

Specify specific parameters for TensorFlow to override Driverless AI parameters. The following is an example of how
the parameters can be configured:
params_tensorflow = "{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30,
'layers': [100, 100], 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3,
'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}"

The following is an example of how layers can be configured:
[500, 500, 500], [100, 100, 100], [100, 100], [50, 50]

More information about TensorFlow parameters can be found in the Keras documentation. Different strategies for
using TensorFlow parameters can be viewed here.

Max Number of Trees/Iterations

Specify the upper limit on the number of trees (GBM) or iterations (GLM). This defaults to 3000. Depending on
accuracy settings, a fraction of this limit will be used.

n_estimators List to Sample From for Model Mutations for Models That Do Not Use Early Stopping

For LightGBM, the dart and normal random forest modes do not use early stopping. This setting allows you to specify
the n_estimators (number of trees in the forest) list to sample from for model mutations for these types of models.

Minimum Learning Rate for Final Ensemble GBM Models

Specify the minimum learning rate for final ensemble GBM models. This value defaults to 0.01.

Maximum Learning Rate for Final Ensemble GBM Models

Specify the maximum learning rate for final ensemble GBM models. This value defaults to 0.05.

Reduction Factor for Number of Trees/Iterations During Feature Evolution

Specify the factor by which max_nestimators is reduced for tuning and feature evolution. This option defaults to 0.2.
So by default, Driverless AI will produce no more than 0.2 * 3000 trees/iterations during feature evolution.


Minimum Learning Rate for Feature Engineering GBM Models

Specify the minimum learning rate for feature engineering GBM models. This value defaults to 0.05.

Max Learning Rate for Tree Models

Specify the maximum learning rate for tree models during feature engineering. Higher values can speed up feature
engineering but can hurt accuracy. This value defaults to 0.5.

Max Number of Epochs for TensorFlow/FTRL

When building TensorFlow or FTRL models, specify the maximum number of epochs to train models with (it might
stop earlier). This value defaults to 10. This option is ignored if TensorFlow models and/or FTRL models are disabled.

Max Tree Depth

Specify the maximum tree depth. The corresponding maximum value for max_leaves is double the specified value.
This value defaults to 12.

Max max_bin for Tree Features

Specify the maximum max_bin for tree features. This value defaults to 256.

Max Number of Rules for RuleFit

Specify the maximum number of rules to be used for RuleFit models. This defaults to -1, which specifies to use all
rules.

Ensemble Level for Final Modeling Pipeline

Specify one of the following ensemble levels:


• -1 = auto, based upon ensemble_accuracy_switch, accuracy, size of data, etc. (Default)
• 0 = No ensemble, only final single model on validated iteration/tree count. Note that holdout predicted proba-
bilities will not be available. (Refer to the following FAQ.)
• 1 = 1 model, multiple ensemble folds (cross-validation)
• 2 = 2 models, multiple ensemble folds (cross-validation)
• 3 = 3 models, multiple ensemble folds (cross-validation)
• 4 = 4 models, multiple ensemble folds (cross-validation)


Cross-Validate Single Final Model

Driverless AI normally produces a single final model for low accuracy settings (typically, less than 5). When the
Cross-validate single final model option is enabled (default for regular experiments), Driverless AI will perform
cross-validation to determine optimal parameters and early stopping before training the final single modeling pipeline
on the entire training data. The final pipeline will build N + 1 models, with N-fold cross-validation for the single final model. This also creates holdout predictions for all non-time-series experiments with a single final model.
Note that the setting for this option is ignored for time-series experiments or when a validation dataset is provided.

Number of Models During Tuning Phase

Specify the number of models to tune during pre-evolution phase. Specify a lower value to avoid excessive tuning, or
specify a higher value to perform enhanced tuning. This option defaults to -1 (auto).

Sampling Method for Imbalanced Binary Classification Problems

Specify the sampling method for imbalanced binary classification problems. This is set to off by default. Choose from
the following options:
• auto: sample both classes as needed, depending on data
• over_under_sampling: over-sample the minority class and under-sample the majority class, depending on data
• under_sampling: under-sample the majority class to reach class balance
• off: do not perform any sampling

Ratio of Majority to Minority Class for Imbalanced Binary Classification to Trigger Special Sampling
Techniques (if Enabled)

For imbalanced binary classification problems, specify the ratio of majority to minority class. Special imbalanced
models with sampling techniques are enabled when the ratio is equal to or greater than the specified ratio. This value
defaults to 5.

Ratio of Majority to Minority Class for Heavily Imbalanced Binary Classification to Only Enable Special Sampling Techniques if Enabled

For heavily imbalanced binary classification, specify the ratio of majority to minority class at or above which only special imbalanced models (without upfront sampling) are enabled on the full original data. This value defaults to 25.

Number of Bags for Sampling Methods for Imbalanced Binary Classification (if Enabled)

Specify the number of bags for sampling methods for imbalanced binary classification. This value defaults to -1.


Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification

Specify the limit on the number of bags for sampling methods for imbalanced binary classification. This value defaults
to 10.

Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification During
Feature Evolution Phase

Specify the limit on the number of bags for sampling methods for imbalanced binary classification. This value defaults
to 3. Note that this setting only applies to shift, leakage, tuning, and feature evolution models. To limit final models,
use the Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification setting.

Max Size of Data Sampled During Imbalanced Sampling

Specify the maximum size of the data sampled during imbalanced sampling in terms of the dataset’s size. This setting
controls the approximate number of bags and is only active when the “Hard limit on number of bags for sampling
methods for imbalanced binary classification during feature evolution phase” option is set to -1. This value defaults to
1.

Target Fraction of Minority Class After Applying Under/Over-Sampling Techniques

Specify the target fraction of a minority class after applying under/over-sampling techniques. A value of 0.5 means
that models/algorithms will be given a balanced target class distribution. When starting from an extremely imbalanced
original target, it can be advantageous to specify a smaller value such as 0.1 or 0.01. This value defaults to -1.

Max Number of Automatic FTRL Interactions Terms for 2nd, 3rd, 4th order interactions terms (Each)

Specify a limit for the number of FTRL interactions terms sampled for each of second, third, and fourth order terms.
This value defaults to 10,000.

Enable Detailed Scored Model Info

Specify whether to dump every scored individual's model parameters to a csv/tabulated file. If enabled, Driverless AI produces files such as "individual_scored_id%d.iter%d*params*". This is enabled by default.

Whether to Enable Bootstrap Sampling for Validation and Test Scores

Specify whether to enable bootstrap sampling. When enabled, this setting provides error bars to validation and test
scores based on the standard error of the bootstrap mean. This is enabled by default.
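As a rough illustration of the idea (a minimal sketch with made-up data, not the Driverless AI implementation), the error bar corresponds to the spread of the score over bootstrap resamples of the holdout rows:

import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                                 # hypothetical labels
y_pred = np.clip(y_true * 0.7 + rng.normal(0, 0.3, size=500), 0, 1)   # hypothetical predictions

def accuracy(t, p):
    return np.mean((p > 0.5) == t)

# Resample the rows with replacement, recompute the score each time, and use the
# standard deviation of those scores as the error bar.
idx = np.arange(len(y_true))
boot_scores = [accuracy(y_true[s], y_pred[s])
               for s in (rng.choice(idx, size=len(idx), replace=True) for _ in range(1000))]
print(f"score = {accuracy(y_true, y_pred):.4f} +/- {np.std(boot_scores):.4f}")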


For Classification Problems with This Many Classes, Default to TensorFlow

Specify the number of classes above which to use TensorFlow when it is enabled. Other models that are set to AUTO
will not be used above this number. (Models set to ON, however, are still used.) This value defaults to 10.

17.3.6 Features Settings

Feature Engineering Effort

Specify a value from 0 to 10 for the Driverless AI feature engineering effort. Higher values generally lead to more
time (and memory) spent in feature engineering. This value defaults to 5.
• 0: Keep only numeric features. Only model tuning during evolution.
• 1: Keep only numeric features and frequency-encoded categoricals. Only model tuning during evolution.
• 2: Similar to 1, but with no Text features. Some feature tuning before evolution.
• 3: Similar to 5 but only tuning during evolution. Mixed tuning of features and model parameters.
• 4: Similar to 5 but slightly more focused on model tuning.
• 5: Balanced feature-model tuning. (Default)
• 6-7: Similar to 5 but slightly more focused on feature engineering.
• 8: Similar to 6-7 but even more focused on feature engineering with high feature generation rate and no feature
dropping even if high interpretability.
• 9-10: Similar to 8 but no model tuning during feature evolution.

Data Distribution Shift Detection

Specify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided).
Currently, this information is only presented to the user and not acted upon.

Data Distribution Shift Detection Drop of Features

Specify whether to drop high-shift features. This defaults to AUTO. Note that Auto for time series experiments turns
this feature off.

Max Allowed Feature Shift (AUC) Before Dropping Feature

Specify the maximum allowed AUC value for a feature before dropping the feature.
When train and test differ (or train/valid or valid/test) in terms of distribution of data, then there can be a model built
that tells you for each row whether the row is in train or test. That model includes an AUC value. If the AUC is above
this specified threshold, then Driverless AI will consider it a strong enough shift to drop features that are shifted.
This value defaults to 0.999.


Leakage Detection

Specify whether to check leakage for each feature. Note that this is always disabled if a fold column is specified and
if the experiment is a time series experiment. This is set to AUTO by default.

Leakage Detection Dropping AUC/R2 Threshold

If Leakage Detection is enabled, specify to drop features for which the AUC (classification)/R2 (regression) is above
this value. This value defaults to 0.999.

Max Rows Times Columns for Leakage

Specify the maximum number of rows times the number of columns to trigger sampling for leakage checks. This value
defaults to 10,000,000.

Report Permutation Importance on Original Features

Specify whether Driverless AI reports permutation importance on original features. This is disabled by default.

Maximum Number of Rows to Perform Permutation-Based Feature Selection

Specify the maximum number of rows to use when performing permutation feature importance. This value defaults to
1,000,000.

Max Number of Original Features Used

Specify the maximum number of columns to be selected from an existing set of columns using feature selection. This
value defaults to 10,000.

Max Number of Original Non-Numeric Features

Specify the maximum number of non-numeric columns to be selected. Feature selection is performed on all features
when this value is exceeded. This value defaults to 300.

Max Number of Original Features Used for FS Individual

Specify the maximum number of features you want to be selected in an experiment. Additional columns above the
specified value trigger a special individual with the original columns reduced. This value defaults to 500.


Number of Original Numeric Features to Trigger Feature Selection Model Type

The maximum number of original numeric columns, above which Driverless AI will do feature selection. Note that this
is applicable only to special individuals with original columns reduced. A separate individual in the genetic algorithm
is created by doing feature selection by permutation importance on original features. This value defaults to 500.

Number of Original Non-Numeric Features to Trigger Feature Selection Model Type

The maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all
features. Note that this is applicable only to special individuals with original columns reduced. A separate individual
in the genetic algorithm is created by doing feature selection by permutation importance on original features. This
value defaults to 200.

Max Allowed Fraction of Uniques for Integer and Categorical Columns

Specify the maximum fraction of unique values for integer and categorical columns. If the column has a larger fraction
of unique values than that, it will be considered an ID column and ignored. This value defaults to 0.95.

Allow Treating Numerical as Categorical

Specify whether to allow some numerical features to be treated as categorical features. This is enabled by default.

Max Number of Unique Values for Int/Float to be Categoricals

Specify the number of unique values for integer or real columns to be treated as categoricals. This value defaults to
50.

Max Number of Engineered Features

Specify the maximum number of features to include in the final model’s feature engineering pipeline. If -1 is specified
(default), then Driverless AI will automatically determine the number of features.

Max Number of Genes

Specify the maximum number of genes (transformer instances) kept per model (and per each model within the final
model for ensembles). This controls the number of genes before features are scored, so Driverless AI will just randomly sample genes if pruning occurs. If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes. Instances include all possible transformers, including the original transformer for numeric features. A value of -1 means no restrictions except internally-determined memory and interpretability restrictions.


Correlation Beyond Which Triggers Monotonicity Constraints (if Enabled)

Specify the threshold of Pearson product-moment correlation coefficient between numerical and encoded transformed
feature and target. This value defaults to 0.1.

Max Feature Interaction Depth

Specify the maximum number of features to use for interaction features like grouping for target encoding, weight of
evidence, and other likelihood estimates.
Exploring feature interactions can be important in gaining better predictive performance. The interaction can take multiple forms (i.e., feature1 + feature2 or feature1 * feature2 + ... featureN). Although certain machine learning algorithms (like tree-based methods) can do well in capturing these interactions as part of their training process, explicitly generating them may still help those (or other) algorithms yield better performance.
The depth of the interaction level (as in “up to” how many features may be combined at once to create one single
feature) can be specified to control the complexity of the feature engineering process. Higher values might be able to
make more predictive models at the expense of time. This value defaults to 8.
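For illustration only (the column names are hypothetical and this is not the Driverless AI transformer), a depth-2 interaction can be as simple as combining two original columns into one engineered column:

import pandas as pd

df = pd.DataFrame({"lot_size": [500, 800, 650], "rooms": [3, 4, 3]})
# A simple multiplicative interaction of two features (interaction depth = 2).
df["lot_size_x_rooms"] = df["lot_size"] * df["rooms"]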

Fixed Feature Interaction Depth

Specify a fixed non-zero number of features to use for interaction features like grouping for target encoding, weight
of evidence, and other likelihood estimates. To use all features for each transformer, set this to be equal to the number
of columns. To do a 50/50 sample and a fixed feature interaction depth of n features, set this to -n.

Enable Target Encoding

Specify whether to use Target Encoding when building the model. Target encoding refers to several different feature
transformations (primarily focused on categorical data) that aim to represent the feature using information of the actual
target variable. A simple example can be to use the mean of the target to replace each unique category of a categorical
feature. These types of features can be very predictive but are prone to overfitting and require more memory as they
need to store mappings of the unique categories and the target values.
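As a minimal illustration of the mean-of-target example above (plain pandas, not the Driverless AI transformer; the column names are made up):

import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF", "NY", "LA", "SF"],
                   "target": [1, 0, 1, 0, 1]})
# Replace each category with the mean of the target for that category.
# (A real implementation would use out-of-fold means to limit overfitting.)
df["city_target_enc"] = df.groupby("city")["target"].transform("mean")
# NY -> 1.0, SF -> 0.5, LA -> 0.0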

Enable Lexicographical Label Encoding

Specify whether to enable lexicographical label encoding. This is disabled by default.

Enable Isolation Forest Anomaly Score Encoding

Isolation Forest is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by ran-
domly selecting a feature and then randomly selecting a split value between the maximum and minimum values of
that selected feature. This split depends on how long it takes to separate the points. Random partitioning produces
noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for
particular samples, they are highly likely to be anomalies.
This option allows you to specify whether to return the anomaly score of each sample. This is disabled by default.
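The following is a minimal scikit-learn sketch of the idea (illustrative only, not the Driverless AI transformer), producing one anomaly score per row:

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).normal(size=(1000, 5))        # hypothetical numeric features
iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
# Lower scores correspond to shorter average path lengths, i.e. more anomalous rows.
anomaly_score = iso.score_samples(X)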


Enable One-Hot Encoding

Specify whether one-hot encoding is enabled. The default AUTO setting is only applicable for small datasets and
GLMs.

Number of Estimators for Isolation Forest Encoding

Specify the number of estimators for Isolation Forest encoding. This value defaults to 200.

Drop Constant Columns

Specify whether to drop columns with constant values. This is enabled by default.

Drop ID Columns

Specify whether to drop columns that appear to be an ID. This is enabled by default.

Don’t Drop Any Columns

Specify whether to avoid dropping any columns (original or derived). This is disabled by default.

Features to Drop

Specify which features to drop. This setting allows you to select many features at once by copying and pasting a list
of column names (in quotes) separated by commas.
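For example, a pasted list might look like the following (the column names are hypothetical):

"customer_id", "signup_date", "internal_flag"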

Features to Group By

Specify which features to group columns by. When this field is left empty (default), Driverless AI automatically
searches all columns (either at random or based on which columns have high variable importance).

Sample from Features to Group By

Specify whether to sample from given features to group by or to always group all features. This is disabled by default.

Aggregation Functions (Non-Time-Series) for Group By Operations

Specify whether to enable aggregation functions to use for group by operations. Choose from the following (all are
selected by default):
• mean
• sd
• min
• max
• count


Number of Folds to Obtain Aggregation When Grouping

Specify the number of folds to obtain aggregation when grouping. Out-of-fold aggregations will result in less overfit-
ting, but they analyze less data in each fold.

Type of Mutation Strategy

Specify which strategy to apply when performing mutations on transformers. Select from the following:
• sample: Sample transformer parameters (Default)
• batched: Perform multiple types of the same transformation together
• full: Perform more types of the same transformation together than the above strategy

Enable Detailed Scored Features Info

Specify whether to dump every scored individual’s variable importance (both derived and original) to a
csv/tabulated/json file. If enabled, Driverless AI produces files such as “individual_scored_id%d.iter%d*features*”.
This is disabled by default.

Enable Detailed Logs for Timing and Types of Features Produced

Specify whether to dump every scored fold’s timing and feature info to a timings.txt file. This is disabled by default.

Compute Correlation Matrix

Specify whether to compute training, validation, and test correlation matrices. When enabled, this setting creates table and heatmap PDF files that are saved to disk. Note that this setting is currently a single-threaded process that may be slow for experiments with many columns. This is disabled by default.

17.3.7 Time Series Settings

Time-Series Lag-Based Recipe

This recipe specifies whether to include Time Series lag features when training a model with a provided (or autode-
tected) time column. This is enabled by default. Lag features are the primary automatically generated time series
features and represent a variable’s past values. At a given sample with time stamp 𝑡, features at some time difference
𝑇 (lag) in the past are considered. For example, if the sales today are 300, and sales of yesterday are 250, then the lag
of one day for sales is 250. Lags can be created on any feature as well as on the target. Lagging variables are important
in time series because knowing what happened in different time periods in the past can greatly facilitate predictions
for the future. Note: Ensembling is disabled when the lag-based recipe with time columns is activated because it only
supports a single final model. Ensembling is also disabled if a time column is selected or if time column is set to
[AUTO] on the experiment setup screen.
More information about time series lag is available in the Time Series Use Case: Sales Forecasting section.
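Continuing the sales example above, a one-day lag can be sketched as follows (plain pandas; illustrative only, not the Driverless AI lag transformer):

import pandas as pd

sales = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=5, freq="D"),
                      "sales": [250, 300, 280, 310, 295]})
# Yesterday's sales become a feature for today's row (a lag of one day).
sales["sales_lag_1"] = sales["sales"].shift(1)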


Custom Validation Splits for Time-Series Experiments

Specify date or datetime timestamps (in the same format as the time column) to use for custom training and validation
splits.

Timeout in Seconds for Time-Series Properties Detection in UI

Specify the timeout in seconds for time-series properties detection in Driverless AI’s user interface. This value defaults
to 30.

Generate Holiday Features

For time-series experiments, specify whether to generate holiday features for the experiment. This is enabled by
default.

Time-Series Lags Override

Specify the override lags to be used. These can be used to give more importance to the lags that are still considered
after the override is applied. The following examples show the variety of different methods that can be used to specify
override lags:
• “[7, 14, 21]” specifies this exact list
• “21” specifies every value from 1 to 21
• “21:3” specifies every value from 1 to 21 in steps of 3
• “5-21” specifies every value from 5 to 21
• “5-21:3” specifies every value from 5 to 21 in steps of 3
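For clarity, the notations above expand roughly as follows (an illustrative interpretation assuming inclusive endpoints, matching the descriptions above):

lags_exact   = [7, 14, 21]             # "[7, 14, 21]": this exact list
lags_to_21   = list(range(1, 22))      # "21": every value from 1 to 21
lags_step_3  = list(range(1, 22, 3))   # "21:3": 1, 4, 7, ..., 19
lags_5_to_21 = list(range(5, 22))      # "5-21": every value from 5 to 21
lags_5_21_3  = list(range(5, 22, 3))   # "5-21:3": 5, 8, 11, 14, 17, 20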


Smallest Considered Lag Size

Specify a minimum considered lag size. This value defaults to -1.

Enable Feature Engineering from Time Column

Specify whether to enable feature engineering based on the selected time column, e.g. Date~weekday. This is enabled
by default.

Allow Integer Time Column as Numeric Feature

Specify whether to allow an integer time column to be used as a numeric feature. Note that if you are using a time
series recipe, using a time column (numeric time stamps) as an input feature can lead to a model that memorizes the
actual timestamps instead of features that generalize to the future. This is disabled by default.

Allowed Date and Date-Time Transformations

Specify the date or date-time transformations to allow Driverless AI to use. Choose from the following transformers:
• year
• quarter
• month
• week
• weekday
• day
• dayofyear
• num (direct numeric value representing the floating point value of time, disabled by default)
• hour
• minute
• second
Features in Driverless AI will appear as get_ followed by the name of the transformation. Note that get_num can
lead to overfitting if used on IID problems and is disabled by default.

Consider Time Groups Columns as Standalone Features

Specify whether to consider time groups columns as standalone features. This is disabled by default.


Which TGC Feature Types to Consider as Standalone Features

If “Consider time groups columns as standalone features” is enabled, specify which time groups column (TGC) feature
types to consider as standalone features. Available types are numeric, categorical, ohe_categorical, datetime, date, and
text. All types are selected by default. Note that “time_column” is treated separately via the “Enable Feature Engineering
from Time Column” option. Also note that if “Time Series Lag-Based Recipe” is disabled, then all time group columns
are allowed features.

Enable Time Unaware Transformers

Specify whether various transformers (clustering, truncated SVD) are enabled, which otherwise would be disabled for
time series experiments due to the potential to overfit by leaking across time within the fit of each fold. This is set to
AUTO by default.

Always Group by All Time Groups Columns for Creating Lag Features

Specify whether to group by all time groups columns for creating lag features. This is enabled by default.

Generate Time-Series Holdout Predictions

Specify whether to create diagnostic holdout predictions on training data using moving windows. This is enabled by
default. This can be useful for MLI, but it will slow down the experiment considerably when enabled. Note that the
model itself remains unchanged when this setting is enabled.

Number of Time-Based Splits for Internal Model Validation

Specify a fixed number of time-based splits for internal model validation. Note that the actual number of allowed splits
can be less than the specified value, and that the number of allowed splits is determined at the time an experiment is
run. This value defaults to -1 (auto).

Maximum Overlap Between Two Time-Based Splits

Specify the maximum overlap between two time-based splits. The number of possible splits increases with higher
values. This value defaults to 0.5.

Maximum Number of Splits Used for Creating Final Time-Series Model’s Holdout Predictions

Specify the maximum number of splits used for creating the final time-series model's holdout predictions. The default
value (-1) will use the same number of splits that are used during model validation.


Whether to Speed up Calculation of Time-Series Holdout Predictions

Specify whether to speed up time-series holdout predictions for back-testing on training data. This setting is used for
MLI and calculating metrics. Note that predictions can be slightly less accurate when this setting is enabled. This is
disabled by default.

Whether to Speed up Calculation of Shapley Values for Time-Series Holdout Predictions

Specify whether to speed up Shapley values for time-series holdout predictions for back-testing on training data. This
setting is used for MLI. Note that predictions can be slightly less accurate when this setting is enabled. This is enabled
by default.

Generate Shapley Values for Time-Series Holdout Predictions at the Time of Experiment

Specify whether to enable the creation of Shapley values for holdout predictions on training data using moving win-
dows at the time of the experiment. This can be useful for MLI, but it can slow down the experiment when enabled. If
this setting is disabled, MLI will generate Shapley values on demand. This is enabled by default.

Lower Limit on Interpretability Setting for Time-Series Experiments (Implicitly Enforced)

Specify the lower limit on interpretability setting for time-series experiments. Values of 5 (default) or more can
improve generalization by more aggressively dropping the least important features. To disable this setting, set this
value to 1.

Dropout Mode for Lag Features

Specify the dropout mode for lag features in order to achieve an equal n.a. ratio between the training and validation/test
sets. Independent mode performs a simple feature-wise dropout. Dependent mode takes the lag-size dependencies per
sample/row into account. Dependent mode is enabled by default.

Probability to Create Non-Target Lag Features

Lags can be created on any feature as well as on the target. Specify a probability value for creating non-target lag
features. This value defaults to 0.1.

Method to Create Rolling Test Set Predictions

Specify the method used to create rolling test set predictions. Choose between test time augmentation (TTA) and a
successive refitting of the final pipeline. TTA is enabled by default.


Probability for New Time-Series Transformers to Use Default Lags

Specify the probability for new lags or the EWMA gene to use default lags. Default lags are determined independently
of the data, based on the frequency, gap, and horizon. This value defaults to 0.2.

Probability of Exploring Interaction-Based Lag Transformers

Specify the unnormalized probability of choosing other lag time-series transformers based on interactions. This value
defaults to 0.2.

Probability of Exploring Aggregation-Based Lag Transformers

Specify the unnormalized probability of choosing other lag time-series transformers based on aggregations. This value
defaults to 0.2.

17.3.8 NLP Settings

Max TensorFlow Epochs for NLP

When building TensorFlow NLP features (for text data), specify the maximum number of epochs to train feature
engineering models with (it might stop earlier). The higher the number of epochs, the higher the run time. This value
defaults to 2 and is ignored if TensorFlow models are disabled.

Accuracy Above Enable TensorFlow NLP by Default for All Models

Specify the accuracy threshold. Values equal to or above this threshold will add all enabled TensorFlow NLP models at the start of
the experiment for text-dominated problems when the following NLP expert settings are set to AUTO:
• Enable word-based CNN TensorFlow models for NLP
• Enable word-based BigRU TensorFlow models for NLP
• Enable character-based CNN TensorFlow models for NLP
If the above transformations are set to ON, this parameter is ignored.
At lower accuracy, TensorFlow NLP transformations will only be created as a mutation. This value defaults to 5.

Enable Word-Based CNN TensorFlow Models for NLP

Specify whether to use Word-based CNN TensorFlow models for NLP. This option is ignored if TensorFlow is dis-
abled. We recommend that you disable this option on systems that do not use GPUs.


Enable Word-Based BiGRU TensorFlow Models for NLP

Specify whether to use Word-based BiGRU TensorFlow models for NLP. This option is ignored if TensorFlow is
disabled. We recommend that you disable this option on systems that do not use GPUs.

Enable Character-Based CNN TensorFlow Models for NLP

Specify whether to use Character-level CNN TensorFlow models for NLP. This option is ignored if TensorFlow is
disabled. We recommend that you disable this option on systems that do not use GPUs.

Path to Pretrained Embeddings for TensorFlow NLP Models

Specify a path to pretrained embeddings that will be used for the TensorFlow NLP models. For example,
/path/on/server/to/file.txt
• You can download the GloVe embeddings from here and specify the local path in this box.
• You can download the fastText embeddings from here and specify the local path in this box.
• You can also train your own custom embeddings. Please refer to this code sample for creating custom embed-
dings that can be passed on to this option.
• If this field is left empty, embeddings will be trained from scratch.

Allow Training of Unfrozen Pretrained Embeddings

Specify whether to allow training of all weights of the neural network graph, including the pretrained embedding layer
weights. If this is disabled, the embedding layer will be frozen. All other weights, however, will still be fine-tuned.
This is disabled by default.

Whether Python/MOJO Scoring Runtime Will Have GPUs

Specify whether the Python/MOJO scoring runtime will have GPUs (otherwise BiGRU will fail in production if this
is enabled). Enabling this setting can speed up training for BiGRU, but doing so will require GPUs and CuDNN in
production. This is disabled by default.

Fraction of Text Columns Out of All Features to be Considered a Text-Dominated Problem

Specify the fraction of text columns out of all features to be considered as a text-dominated problem. This value
defaults to 0.3.
Specify when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable.
Higher values will favor string columns as categoricals, while lower values will favor string columns as text. This
value defaults to 0.3.


Fraction of Text per All Transformers to Trigger That Text Dominated

Specify the fraction of text transformers, out of all transformers, above which the problem is treated as text dominated.
This value defaults to 0.3.

Threshold for String Columns to be Treated as Text

Specify the threshold value (from 0 to 1) for string columns to be treated as text (0.0 - text; 1.0 - string). This value
defaults to 0.3.

17.3.9 Recipes Settings

Include Specific Transformers

Select the transformer(s) that you want to use in the experiment. Use the Check All/Uncheck All button to quickly
add or remove all transformers at once. Note: If you uncheck all transformers so that none is selected, Driverless AI
will ignore this and will use the default list of transformers for that experiment. This list of transformers will vary for
each experiment.

Include Specific Models

Specify the type(s) of models that you want Driverless AI to build in the experiment.

Include Specific Scorers

Specify the scorer(s) that you want Driverless AI to include when running the experiment.

Probability to Add Transformers

Specify the unnormalized probability to add genes or instances of transformers with specific attributes. If no genes
can be added, other mutations are attempted. This value defaults to 0.5.

Probability to Add Best Shared Transformers

Specify the unnormalized probability to add genes or instances of transformers with specific attributes that have been shown
to be beneficial to other individuals within the population. This value defaults to 0.5.

Probability to Prune Transformers

Specify the unnormalized probability to prune genes or instances of transformers with specific attributes. This value
defaults to 0.5.


Probability to Mutate Model Parameters

Specify the unnormalized probability to change model hyperparameters. This value defaults to 0.25.

Probability to Prune Weak Features

Specify the unnormalized probability to prune features that have low variable importance instead of pruning entire
instances of genes/transformers. This value defaults to 0.25.

Timeout in Minutes for Testing Acceptance of Each Recipe

Specify the number of minutes to wait until a recipe’s acceptance testing is aborted. A recipe is rejected if acceptance
testing is enabled and it times out. This value defaults to 20.0.

Whether to Skip Failures of Transformers

Specify whether to avoid failed transformers. This is enabled by default.

Whether to Skip Failures of Models

Specify whether to avoid failed models. Failures are logged according to the specified level for logging skipped
failures. This is enabled by default.

Level to Log for Skipped Failures

Specify one of the following levels for the verbosity of log failure messages for skipped transformers or models:
• 0 = Log simple message
• 1 = Log code line plus message (Default)
• 2 = Log detailed stack traces

17.3.10 System Settings

Number of Cores to Use

Specify the number of cores to use for the experiment. Note that if you specify 0, all available cores will be used.
Lower values can reduce memory usage but might slow down the experiment. This value defaults to 0.

Maximum Number of Cores to Use for Model Fit

Specify the maximum number of cores to use for a model’s fit call. Note that if you specify 0, all available cores will
be used. This value defaults to 10.


Maximum Number of Cores to Use for Model Predict

Specify the maximum number of cores to use for a model’s predict call. Note that if you specify 0, all available cores
will be used. This value defaults to 0.

Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, Autoreport

Specify the maximum number of cores to use for a model’s transform and predict call when doing operations in the
Driverless AI MLI GUI and the Driverless AI R and Python clients. Note that if you specify 0, all available cores will
be used. This value defaults to 4.

Tuning Workers per Batch for CPU

Specify the number of workers used in CPU mode for tuning. A value of 0 uses the socket count, while a value of -1
uses all physical cores greater than or equal to 1 that count. This value defaults to 0.

Number of Workers for CPU Training

Specify the number of workers used in CPU mode for training:
• 0: Use socket count (Default)
• -1: Use all physical cores >= 1 that count

#GPUs/Experiment

Specify the number of GPUs to use per experiment. A value of -1 (default) specifies to use all available GPUs. Must
be at least as large as the number of GPUs to use per model (or -1).

Num Cores/GPU

Specify the number of CPU cores per GPU. In order to have a sufficient number of cores per GPU, this setting limits
the number of GPUs used. This value defaults to 4.

#GPUs/Model

Specify the number of GPUs to use per model, with -1 meaning all GPUs per model. In all cases, XGBoost tree
and linear models use the number of GPUs specified per model, while LightGBM and TensorFlow revert to using 1
GPU/model and run multiple models on multiple GPUs. This value defaults to 1.
Note: FTRL does not use GPUs. RuleFit uses GPUs for parts involving obtaining the tree using LightGBM.


Num. of GPUs for Isolated Prediction/Transform

Specify the number of GPUs to use for predict for models and transform for transformers when running outside
of fit/fit_transform. If predict or transform are called in the same process as fit/fit_transform,
the number of GPUs will match. New processes will use this count for applicable models and transformers. Note that
enabling tensorflow_nlp_have_gpus_in_production will override this setting for relevant TensorFlow
NLP transformers. This value defaults to 0.

Max Number of Threads to Use for datatable and OpenBLAS for Munging and Model Training

Specify the maximum number of threads to use for datatable and OpenBLAS during data munging (applied on a per
process basis):
• 0 = Use all threads
• -1 = Automatically select number of threads (Default)

Max Number of Threads to Use for datatable Read and Write of Files

Specify the maximum number of threads to use for datatable during data reading and writing (applied on a per process
basis):
• 0 = Use all threads
• -1 = Automatically select number of threads (Default)

Max Number of Threads to Use for datatable Stats and OpenBLAS

Specify the maximum number of threads to use for datatable stats and OpenBLAS (applied on a per process basis):
• 0 = Use all threads
• -1 = Automatically select number of threads (Default)

GPU Starting ID

Specify which gpu_id to start with. If using CUDA_VISIBLE_DEVICES=... to control GPUs (preferred method),
gpu_id=0 is the first in that restricted list of devices. For example, if CUDA_VISIBLE_DEVICES='4,5' then
gpu_id_start=0 will refer to device #4.
From expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs, then:
• Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0
• Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1
From expert mode, to run 2 experiments, each using 4 GPUs out of 8 GPUs, then:
• Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0
• Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4
To run 2 experiments, each using 4 GPUs per model, then:
• Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0
• Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4


If num_gpus_per_model!=1, global GPU locking is disabled. This is because the underlying algorithms do not support
arbitrary gpu ids, only sequential ids, so be sure to set this value correctly to avoid overlap across all experiments by
all users.
More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation Note
that GPU selection does not wrap, so gpu_id_start + num_gpus_per_model must be less than the number of visible
GPUs.
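For illustration, the following minimal Python sketch shows the numbering convention described above (this is not Driverless AI code, and the device list is hypothetical):

import os

# Hypothetical illustration: gpu_id values index into the CUDA_VISIBLE_DEVICES
# list, so gpu_id_start=0 refers to the first visible device, not to physical GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5"

visible_devices = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
gpu_id_start = 0
print(visible_devices[gpu_id_start])  # prints "4": logical gpu_id 0 maps to physical device #4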

Enable Detailed Traces

Specify whether to enable detailed tracing in Driverless AI trace when running an experiment. This is disabled by
default.

Enable Debug Log Level

If enabled, the log files will also include debug logs. This is disabled by default.

Enable Logging of System Information for Each Experiment

Specify whether to include system information such as CPU, GPU, and disk space at the start of each experiment log.
Note that this information is already included in system logs. This is enabled by default.

17.4 Scorers

17.4.1 Classification or Regression

• GINI (Gini Coefficient): The Gini index is a well-established method to quantify the inequality among values
of a frequency distribution, and can be used to measure the quality of a binary classifier. A Gini index of zero
expresses perfect equality (or a totally useless classifier), while a Gini index of one expresses maximal inequality
(or a perfect classifier).
The Gini index is based on the Lorenz curve. The Lorenz curve plots the true positive rate (y-axis) as a
function of percentiles of the population (x-axis).
The Lorenz curve represents a collective of models represented by the classifier. The location on the
curve is given by the probability threshold of a particular model. (i.e., Lower probability thresholds for
classification typically lead to more true positives, but also to more false positives.)
The Gini index itself is independent of the model and only depends on the Lorenz curve determined by
the distribution of the scores (or probabilities) obtained from the classifier.


Regression

• R2 (R Squared): The R2 value represents the degree that the predicted value and the actual value move in
unison. The R2 value varies between 0 and 1 where 0 represents no correlation between the predicted and actual
value and 1 represents complete correlation.
Calculating the R2 value for linear models is mathematically equivalent to 1 − SSE/SST (or 1 −
residual sum of squares/total sum of squares). For all other models, this equivalence does not hold, so
the 1 − SSE/SST formula cannot be used. In some cases, this formula can produce negative R2 values,
which is not possible for a squared quantity. Because Driverless AI does not necessarily use
linear models, the R2 value is calculated using the squared Pearson correlation coefficient.
R2 equation:

R^2 = \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^2

Where:
• x is the predicted target value
• y is the actual target value
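As a rough sketch (not Driverless AI code), the squared Pearson correlation can be computed with NumPy; the values below are made up for illustration:

import numpy as np

x = np.array([2.3, 3.1, 4.0, 5.2])  # hypothetical predicted values
y = np.array([2.0, 3.0, 4.5, 5.0])  # hypothetical actual values

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
r2 = r ** 2                  # R2 as the squared Pearson correlation
print(round(r2, 4))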


• MSE (Mean Squared Error): The MSE metric measures the average of the squares of the errors or deviations.
MSE takes the distances from the points to the regression line (these distances are the “errors”) and squares
them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor.
MSE also gives more weight to larger differences. The bigger the error, the more it is penalized. For
example, if your correct answers are 2,3,4 and the algorithm guesses 1,4,3, then the absolute error on each
one is exactly 1, so squared error is also 1, and the MSE is 1. But if the algorithm guesses 2,3,6, then the
errors are 0,0,2, the squared errors are 0,0,4, and the MSE is a higher 1.333. The smaller the MSE, the
better the model’s performance. (Tip: MSE is sensitive to outliers. If you want a more robust metric, try
mean absolute error (MAE).)
MSE equation:

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
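The arithmetic in the example above can be checked with a few lines of NumPy (a sketch for illustration only):

import numpy as np

actual = np.array([2, 3, 4])
guess_a = np.array([1, 4, 3])  # absolute errors 1, 1, 1
guess_b = np.array([2, 3, 6])  # absolute errors 0, 0, 2

mse_a = np.mean((actual - guess_a) ** 2)  # 1.0
mse_b = np.mean((actual - guess_b) ** 2)  # 1.333...
print(mse_a, round(mse_b, 3))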

• RMSE (Root Mean Squared Error): The RMSE metric evaluates how well a model can predict a continuous
value. The RMSE units are the same as the predicted target, which is useful for understanding if the size of the
error is of concern or not. The smaller the RMSE, the better the model’s performance. (Tip: RMSE is sensitive
to outliers. If you want a more robust metric, try mean absolute error (MAE).)
RMSE equation:

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }

Where:
• N is the total number of rows (observations) of your corresponding dataframe.
• y is the actual target value.
• ŷ is the predicted target value.
• RMSLE (Root Mean Squared Logarithmic Error): This metric measures the ratio between actual values and
predicted values and takes the log of the predictions and actual values. Use this instead of RMSE if an under-
prediction is worse than an over-prediction. You can also use this when you don’t want to penalize large differ-
ences when both of the values are large numbers.
RMSLE equation:

RMSLE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \ln \frac{y_i + 1}{\hat{y}_i + 1} \right)^2 }

Where:
• N is the total number of rows (observations) of your corresponding dataframe.
• y is the actual target value.
• ŷ is the predicted target value.
• RMSPE (Root Mean Square Percentage Error): This metric is the RMSE expressed as a percentage. The
smaller the RMSPE, the better the model performance.
RMSPE equation:

RMSPE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \frac{(y_i - \hat{y}_i)^2}{y_i^2} }
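The three root-mean-square variants above differ only in what is averaged before taking the square root. A minimal NumPy sketch with made-up values:

import numpy as np

y = np.array([10.0, 20.0, 30.0])      # hypothetical actual values
y_hat = np.array([12.0, 18.0, 33.0])  # hypothetical predicted values

rmse = np.sqrt(np.mean((y - y_hat) ** 2))
rmsle = np.sqrt(np.mean(np.log((y + 1) / (y_hat + 1)) ** 2))
rmspe = np.sqrt(np.mean((y - y_hat) ** 2 / y ** 2))
print(round(rmse, 3), round(rmsle, 3), round(rmspe, 3))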


• MAE (Mean Absolute Error): The mean absolute error is an average of the absolute errors. The MAE units are
the same as the predicted target, which is useful for understanding whether the size of the error is of concern or
not. The smaller the MAE the better the model’s performance. (Tip: MAE is robust to outliers. If you want a
metric that is sensitive to outliers, try root mean squared error (RMSE).)
MAE equation:
MAE = \frac{1}{N} \sum_{i=1}^{N} |x_i - x|

Where:
– N is the total number of errors
– |x_i - x| equals the absolute errors.
• MAPE (Mean Absolute Percentage Error): MAPE measures the size of the error in percentage terms. It is
calculated as the average of the unsigned percentage error.
MAPE equation:
MAPE = \left( \frac{1}{N} \sum \frac{|Actual - Forecast|}{|Actual|} \right) \times 100
Because the MAPE measure is in percentage terms, it gives an indication of how large the error is across
different scales. Consider the following example:

Actual    Predicted    Absolute Error    Absolute Percentage Error
5         1            4                 80%
15,000    15,004       4                 0.03%

Both records have an absolute error of 4, but this error could be considered “small” or “big” when you
compare it to the actual value.
• SMAPE (Symmetric Mean Absolute Percentage Error): Unlike the MAPE, which divides the absolute errors
by the absolute actual values, the SMAPE divides by the mean of the absolute actual and the absolute predicted
values. This is important when the actual values can be 0 or near 0. Actual values near 0 cause the MAPE value
to become infinitely high. Because SMAPE includes both the actual and the predicted values, the SMAPE value
can never be greater than 200%.
Consider the following example:

Actual Predicted
0.01 0.05
0.03 0.04

The MAPE for this data is 216.67% but the SMAPE is only 80.95%.
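The figures in this example can be reproduced with a short NumPy sketch (illustration only):

import numpy as np

actual = np.array([0.01, 0.03])
predicted = np.array([0.05, 0.04])

mape = np.mean(np.abs(actual - predicted) / np.abs(actual)) * 100
smape = np.mean(np.abs(actual - predicted) / ((np.abs(actual) + np.abs(predicted)) / 2)) * 100
print(round(mape, 2), round(smape, 2))  # 216.67 80.95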
• MER (Median Error Rate or Median Absolute Percentage Error): MER measures the median size of the error
in percentage terms. It is calculated as the median of the unsigned percentage error.
MER equation:
MER = \mathrm{median}\left( \frac{|Actual - Forecast|}{|Actual|} \right) \times 100
Because the MER is the median, half the scored population has a lower absolute percentage error than the
MER, and half the population has a larger absolute percentage error than the MER.


Classification

• MCC (Matthews Correlation Coefficient): The goal of the MCC metric is to represent the confusion matrix of
a model as a single number. The MCC metric combines the true positives, false positives, true negatives, and
false negatives using the equation described below.
A Driverless AI model will return probabilities, not predicted classes. To convert probabilities to predicted
classes, a threshold needs to be defined. Driverless AI iterates over possible thresholds to calculate a
confusion matrix for each threshold. It does this to find the maximum MCC value. Driverless AI’s goal is
to continue increasing this maximum MCC.
Unlike metrics like Accuracy, MCC is a good scorer to use when the target variable is imbalanced. In the
case of imbalanced data, high Accuracy can be found by simply predicting the majority class. Metrics
like Accuracy and F1 can be misleading, especially in the case of imbalanced data, because they do not
consider the relative size of the four confusion matrix categories. MCC, on the other hand, takes the
proportion of each class into account. The MCC value ranges from -1 to 1 where -1 indicates a classifier
that predicts the opposite class from the actual value, 0 means the classifier does no better than random
guessing, and 1 indicates a perfect classifier.
MCC equation:

𝑇𝑃 𝑥 𝑇𝑁 − 𝐹𝑃 𝑥 𝐹𝑁
𝑀 𝐶𝐶 = √︀
(𝑇 𝑃 + 𝐹 𝑃 )(𝑇 𝑃 + 𝐹 𝑁 )(𝑇 𝑁 + 𝐹 𝑃 )(𝑇 𝑁 + 𝐹 𝑁 )
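A simplified sketch of the threshold search described above (illustrative only, with made-up probabilities; this is not Driverless AI's internal implementation):

import numpy as np

probs = np.array([0.10, 0.40, 0.35, 0.80, 0.65])  # hypothetical predicted probabilities
y_true = np.array([0, 0, 1, 1, 1])                # actual classes

def mcc(y, y_pred):
    # Build the confusion matrix counts and apply the MCC formula above.
    tp = np.sum((y_pred == 1) & (y == 1))
    tn = np.sum((y_pred == 0) & (y == 0))
    fp = np.sum((y_pred == 1) & (y == 0))
    fn = np.sum((y_pred == 0) & (y == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Evaluate MCC at every candidate threshold and keep the maximum.
best_mcc, best_threshold = max((mcc(y_true, (probs >= t).astype(int)), t) for t in np.unique(probs))
print(best_mcc, best_threshold)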

• F05, F1, and F2: A Driverless AI model will return probabilities, not predicted classes. To convert probabilities
to predicted classes, a threshold needs to be defined. Driverless AI iterates over possible thresholds to calculate
a confusion matrix for each threshold. It does this to find the maximum F metric value. Driverless AI’s goal is
to continue increasing this maximum F metric.
The F1 score provides a measure for how well a binary classifier can classify positive cases (given a
threshold value). The F1 score is calculated from the harmonic mean of the precision and recall. An F1
score of 1 means both precision and recall are perfect and the model correctly identified all the positive
cases and didn’t mark a negative case as a positive case. If either precision or recall is very low, it will
be reflected with an F1 score closer to 0.
F1 equation:

F_1 = 2 \left( \frac{(precision)(recall)}{precision + recall} \right)

Where:
• precision is the positive observations (true positives) the model correctly identified from all the
observations it labeled as positive (the true positives + the false positives).
• recall is the positive observations (true positives) the model correctly identified from all the actual
positive cases (the true positives + the false negatives).
The F0.5 score is the weighted harmonic mean of the precision and recall (given a threshold value).
Unlike the F1 score, which gives equal weight to precision and recall, the F0.5 score gives more weight
to precision than to recall. More weight should be given to precision for cases where False Positives are
considered worse than False Negatives. For example, if your use case is to predict which products you
will run out of, you may consider False Positives worse than False Negatives. In this case, you want your
predictions to be very precise and only capture the products that will definitely run out. If you predict
a product will need to be restocked when it actually doesn’t, you incur cost by having purchased more
inventory than you actually need.


F05 equation:
F_{0.5} = 1.25 \left( \frac{(precision)(recall)}{0.25\,precision + recall} \right)
Where:
• precision is the positive observations (true positives) the model correctly identified from all the
observations it labeled as positive (the true positives + the false positives).
• recall is the positive observations (true positives) the model correctly identified from all the actual
positive cases (the true positives + the false negatives).
The F2 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike
the F1 score, which gives equal weight to precision and recall, the F2 score gives more weight to recall
than to precision. More weight should be given to recall for cases where False Negatives are considered
worse than False Positives. For example, if your use case is to predict which customers will churn, you
may consider False Negatives worse than False Positives. In this case, you want your predictions to
capture all of the customers that will churn. Some of these customers may not be at risk for churning, but
the extra attention they receive is not harmful. More importantly, no customers actually at risk of churning
have been missed.
F2 equation:

F_2 = 5 \left( \frac{(precision)(recall)}{4\,precision + recall} \right)

Where:
• precision is the positive observations (true positives) the model correctly identified from all the
observations it labeled as positive (the true positives + the false positives).
• recall is the positive observations (true positives) the model correctly identified from all the actual
positive cases (the true positives + the false negatives).
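Given a precision and recall at a chosen threshold, the three F scores above can be computed as follows (a minimal sketch with hypothetical values):

precision, recall = 0.8, 0.6  # hypothetical values at one threshold

f1 = 2 * (precision * recall) / (precision + recall)
f05 = 1.25 * (precision * recall) / (0.25 * precision + recall)
f2 = 5 * (precision * recall) / (4 * precision + recall)
print(round(f1, 3), round(f05, 3), round(f2, 3))  # 0.686 0.75 0.632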
• Accuracy: In binary classification, Accuracy is the number of correct predictions made as a ratio of all pre-
dictions made. In multiclass classification, the set of labels predicted for a sample must exactly match the
corresponding set of labels in y_true.
A Driverless AI model will return probabilities, not predicted classes. To convert probabilities to predicted
classes, a threshold needs to be defined. Driverless AI iterates over possible thresholds to calculate a
confusion matrix for each threshold. It does this to find the maximum Accuracy value. Driverless AI’s
goal is to continue increasing this maximum Accuracy.
Accuracy equation:

Accuracy = \frac{\text{number correctly predicted}}{\text{number of observations}}
• Logloss: The logarithmic loss metric can be used to evaluate the performance of a binomial or multinomial
classifier. Unlike AUC which looks at how well a model can classify a binary target, logloss evaluates how close
a model’s predicted values (uncalibrated probability estimates) are to the actual target value. For example, does
a model tend to assign a high predicted value like .80 for the positive class, or does it show a poor ability to
recognize the positive class and assign a lower predicted value like .50? Logloss can be any value greater than
or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%.
Binary classification equation:


Logloss = -\frac{1}{N} \sum_{i=1}^{N} w_i \left( y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right)

Multiclass classification equation:

Logloss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} w_i \left( y_{i,j} \ln(p_{i,j}) \right)

Where:
• N is the total number of rows (observations) of your corresponding dataframe.
• w is the per-row user-defined weight (default is 1).
• C is the total number of classes (C=2 for binary classification).
• p is the predicted value (uncalibrated probability) assigned to a given row (observation).
• y is the actual target value.
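A minimal sketch of the binary equation above (unit weights, made-up values; not Driverless AI code):

import numpy as np

y = np.array([1, 0, 1, 1])          # actual classes
p = np.array([0.9, 0.2, 0.6, 0.8])  # predicted probabilities for the positive class
w = np.ones_like(p)                 # per-row weights (default 1)

logloss = -np.mean(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))
print(round(logloss, 4))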
• AUC (Area Under the Receiver Operating Characteristic Curve): This model metric is used to evaluate how well
a binary classification model is able to distinguish between true positives and false positives. For multi-class
problems, this score is computed by micro-averaging the ROC curves for each class. Use MACROAUC if you
prefer the macro average.
An AUC of 1 indicates a perfect classifier, while an AUC of .5 indicates a poor classifier whose perfor-
mance is no better than random guessing.
• AUCPR (Area Under the Precision-Recall Curve): This model metric is used to evaluate how well a binary
classification model is able to distinguish between precision recall pairs or points. These values are obtained
using different thresholds on a probabilistic or other continuous-output classifier. AUCPR is an average of the
precision-recall weighted by the probability of a given threshold.
The main difference between AUC and AUCPR is that AUC calculates the area under the ROC curve and
AUCPR calculates the area under the Precision Recall curve. The Precision Recall curve does not care
about True Negatives. For imbalanced data, a large quantity of True Negatives usually overshadows the
effects of changes in other metrics like False Positives. The AUCPR will be much more sensitive to True
Positives, False Positives, and False Negatives than AUC. As such, AUCPR is recommended over AUC
for highly imbalanced data.
• MACROAUC (Macro Average of Areas Under the Receiver Operating Characteristic Curves): For multiclass
classification problems, this score is computed by macro-averaging the ROC curves for each class (one per
class). The area under the curve is a constant. A MACROAUC of 1 indicates a perfect classifier, while a
MACROAUC of .5 indicates a poor classifier whose performance is no better than random guessing. This
option is not available for binary classification problems.


17.4.2 Scorer Best Practices - Regression

When deciding which scorer to use in a regression problem, some main questions to ask are:
• Do you want your scorer sensitive to outliers?
• What unit should the scorer be in?
Sensitive to Outliers
Certain scorers are more sensitive to outliers. When a scorer is sensitive to outliers, it means that it is important that
the model predictions are never “very” wrong. For example, let’s say we have an experiment predicting number of
days until an event. The graph below shows the absolute error in our predictions.

Usually our model is very good. We have an absolute error less than 1 day about 70% of the time. There is one
instance, however, where our model did very poorly. We have one prediction that was 30 days off.
Instances like this will more heavily penalize scorers that are sensitive to outliers. If we do not care about these occasional
large errors as long as we typically have a very accurate prediction, then we would want to select a scorer that
is robust to outliers. We can see this reflected in the behavior of the scorers MAE and RMSE.

             MAE     RMSE
Outlier      0.99    2.64
No Outlier   0.80    1.0

Calculating the RMSE and MAE on our error data, the RMSE is more than twice as large as the MAE because RMSE is
sensitive to outliers. If we remove the one outlier record from our calculation, RMSE drops down significantly.
Performance Units
Different scorers will show the performance of the Driverless AI experiment in different units. Let’s continue with our
example where our target is to predict the number of days until an event. Some possible performance units are:
• Same as target: The unit of the scorer is in days
– ex: MAE = 5 means the model predictions are off by 5 days on average


• Percent of target: The unit of the scorer is the percent of days
– ex: MAPE = 10% means the model predictions are off by 10 percent on average
• Square of target: The unit of the scorer is in days squared
– ex: MSE = 25 means the model predictions are off by 5 days on average (square root of 25 = 5)
Comparison

Metric    Units                            Sensitive to Outliers    Tip
R2        scaled between 0 and 1           No                       use when you want performance scaled between 0 and 1
MSE       square of target                 Yes
RMSE      same as target                   Yes
RMSLE     log of target                    Yes
RMSPE     percent of target                Yes                      use when target values are across different scales
MAE       same as target                   No
MAPE      percent of target                No                       use when target values are across different scales
SMAPE     percent of target divided by 2   No                       use when target values are close to 0

17.4.3 Scorer Best Practices - Classification

When deciding which scorer to use in a classification problem some main questions to ask are:
• Do you want the scorer to evaluate the predicted probabilities or the classes that those probabilities can be
converted to?
• Is your data imbalanced?
Scorer Evaluates Probabilities or Classes
The final output of a Driverless AI model is a predicted probability that a record is in a particular class. The scorer you
choose will either evaluate how accurate the probability is or how accurate the assigned class is from that probability.
Choosing this depends on the use of the Driverless AI model. Do we want to use the probabilities or do we want to
convert those probabilities into classes? For example, if we are predicting whether a customer will churn, we may take
the predicted probabilities and turn them into classes - customers who will churn vs customers who won’t churn. If
we are predicting the expected loss of revenue, we will instead use the predicted probabilities (predicted probability
of churn * value of customer).
If your use case requires a class assigned to each record, you will want to select a scorer that evaluates the model’s
performance based on how well it classifies the records. If your use case will use the probabilities, you will want to
select a scorer that evaluates the model’s performance based on the predicted probability.
Robust to Imbalanced Data
For certain use cases, positive classes may be very rare. In these instances, some scorers can be misleading. For
example, if I have a use case where 99% of the records have Class = No, then a model which always predicts No
will have 99% accuracy.
For these use cases, it is best to select a metric that does not include True Negatives or considers relative size of the
True Negatives like AUCPR or MCC.
Comparison


Metric      Evaluation Based On    Tip
MCC         Class                  good for imbalanced data
F1          Class
F0.5        Class                  good when you want to give more weight to precision
F2          Class                  good when you want to give more weight to recall
Accuracy    Class                  highly interpretable
Logloss     Probability
AUC         Class
AUCPR       Class                  good for imbalanced data

17.5 New Experiments

1. Run an experiment by selecting the [Click for Actions] button beside the dataset that you want to use. Click Predict
to begin an experiment.

2. The Experiment Settings form displays and auto-fills with the selected dataset. Optionally enter a custom name
for this experiment. If you do not add a name, Driverless AI will create one for you.
3. Optionally specify a validation dataset and/or a test dataset.
• The validation set is used to tune parameters (models, features, etc.). If a validation dataset is not
provided, the training data is used (with holdout splits). If a validation dataset is provided, training
data is not used for parameter tuning - only for training. A validation dataset can help to improve
the generalization performance on shifting data distributions.
• The test dataset is used for the final stage scoring and is the dataset for which model metrics will be
computed against. Test set predictions will be available at the end of the experiment. This dataset is
not used during training of the modeling pipeline.


Keep in mind that these datasets must have the same number of columns as the training dataset. Also
note that if provided, the validation set is not sampled down, so it can lead to large memory usage, even if
accuracy=1 (which reduces the train size).
4. Specify the target (response) column. Note that not all explanatory functionality will be available for multiclass
classification scenarios (scenarios with more than two outcomes). When the target column is selected, Driverless
AI automatically provides the target column type and the number of rows. If this is a classification problem,
then the UI shows unique and frequency statistics (Target Freq/Most Freq) for numerical columns. If this is a
regression problem, then the UI shows the dataset mean and standard deviation values.
Notes Regarding Frequency:
• For data imported in versions <= 1.0.19, TARGET FREQ and MOST FREQ both represent the count
of the least frequent class for numeric target columns and the count of the most frequent class for
categorical target columns.
• For data imported in versions 1.0.20-1.0.22, TARGET FREQ and MOST FREQ both represent the
frequency of the target class (second class in lexicographic order) for binomial target columns; the
count of the most frequent class for categorical multinomial target columns; and the count of the
least frequent class for numeric multinomial target columns.
• For data imported in version 1.0.23 (and later), TARGET FREQ is the frequency of the target class
for binomial target columns, and MOST FREQ is the most frequent class for multinomial target
columns.
5. The next step is to set the parameters and settings for the experiment. (Refer to the Experiment Settings section
for more information about these settings.) You can set the parameters individually, or you can let Driverless AI
infer the parameters and then override any that you disagree with. Available parameters and settings include the
following:
• Dropped Columns: The columns we do not want to use as predictors such as ID columns, columns
with data leakage, etc.
• Weight Column: The column that indicates the per row observation weights. If “None” is specified,
each row will have an observation weight of 1.
• Fold Column: The column that indicates the fold. If “None” is specified, the folds will be determined
by Driverless AI. This is set to “Disabled” if a validation set is used.
• Time Column: The column that provides a time order, if applicable. If “AUTO” is specified, Driver-
less AI will auto-detect a potential time order. If “OFF” is specified, auto-detection is disabled. This
is set to “Disabled” if a validation set is used.
• Specify the Scorer to use for this experiment. The available scorers vary based on whether this is a
classification or regression experiment. Scorers include:
– Regression: GINI, MAE, MAPE, MER, MSE, R2, RMSE (default), RMSLE, RMSPE,
SMAPE, TOPDECILE
– Classification: ACCURACY, AUC (default), AUCPR, F05, F1, F2, GINI, LOGLOSS,
MACROAUC, MCC
• Specify a desired relative Accuracy from 1 to 10
• Specify a desired relative Time from 1 to 10
• Specify a desired relative Interpretability from 1 to 10
Driverless AI will automatically infer the best settings for Accuracy, Time, and Interpretability and
provide you with an experiment preview based on those suggestions. If you adjust these knobs, the
experiment preview will automatically update based on the new settings.
Expert Settings (optional):


• Optionally specify additional expert settings for the experiment. Refer to the Expert Settings section
for more information about these settings. The default values for these options are derived from the
environment variables in the config.toml file. Refer to the Setting Environment Variables section for
more information.

Additional settings (optional):


• Classification or Regression button. Driverless AI automatically determines the problem type based
on the response column. Though not recommended, you can override this setting by clicking this
button.
• Reproducible: This button allows you to build an experiment with a random seed and get repro-
ducible results. If this is disabled (default), then results will vary between runs.
• Enable GPUs: Specify whether to enable GPUs. (Note that this option is ignored on CPU-only
systems.)
6. After your settings are made, review the Experiment Preview to learn what each of the settings means. Note:
When changing the algorithms used via Expert Settings, you may notice that those changes are not applied.
Driverless AI determines whether to include models and/or recipes based on a hierarchy of those expert settings.
Refer to the Why do my selected algorithms not show up in the Experiment Preview? FAQ for more information.


7. Click Launch Experiment to start the experiment.


The experiment launches with a randomly generated experiment name. You can change this name at
any time during or after the experiment. Mouse over the name of the experiment to view an edit icon, then
type in the desired name.
As the experiment runs, a running status displays in the upper middle portion of the UI. First Driverless
AI figures out the backend and determines whether GPUs are running. Then it starts parameter tuning,
followed by feature engineering. Finally, Driverless AI builds the scoring pipeline.

17.5.1 Understanding the Experiment Page

In addition to the status, as an experiment is running, the UI also displays the following:
• Details about the dataset.
• The iteration data (internal validation) for each cross validation fold along with the specified scorer value. Click
on a specific iteration or drag to view a range of iterations. Double click in the graph to reset the view. In this
graph, each “column” represents one iteration of the experiment. During the iteration, Driverless AI will train n
models. (This is called individuals in the experiment preview.) So for any column, you may see the score value
for those n models for each iteration on the graph.
• The variable importance values. To view variable importance for a specific iteration, just select that iteration in
the Iteration Data graph. The Variable Importance list will automatically update to show variable importance
information for that iteration. Hover over an entry to view more info. Note: When hovering over an entry,


you may notice the term “Internal[. . . ] specification.” This label is used for features that do not need to be
translated/explained and ensures that all features are uniquely identified.
The values that display are specific to the variable importance of the model class:
– XGBoost and LightGBM: Gains Variable importance. Gain-based importance is calculated from the gains
a specific variable brings to the model. In the case of a decision tree, the gain-based importance will sum
up the gains that occurred whenever the data was split by the given variable. The gain-based importance is
normalized between 0 and 1. If a variable is never used in the model, the gain-based importance will be 0.
– GLM: The variable importance is the absolute value of the coefficient for each predictor. The variable
importance is normalized between 0 and 1. If a variable is never used in the model, the importance will be
0.
– TensorFlow: TensorFlow follows the Gedeon method described in this paper: https://www.ncbi.nlm.nih.
gov/pubmed/9327276.
– RuleFit: Sums over a feature’s contribution to each rule. Specifically, Driverless AI:
1. Assigns all features to have zero importance.
2. Scans through all the rules. If a feature is in that rule, Driverless AI adds its contribution (i.e., the
absolute value of the rule's coefficient) to its overall feature importance.
3. Normalizes the importance.
The calculation for the shift of variable importance is determined by the ensemble level:
– Ensemble Level = 0: The shift is determined between the last best genetic algorithm (GA) and the single
final model.
– Ensemble Level >=1: GA individuals used for the final model have variable importance blended with
the model’s meta learner weights, and the final model itself has variable importance blended with its
final weights. The shift of variable importance is determined between these two final variable importance
blends.
This information is reported in the logs or in the GUI if the shift is beyond the absolute magnitude speci-
fied by the max_num_varimp_shift_to_log configuration option. The Experiment Summary also
includes experiment_features_shift files that contain information about shift.
• CPU/Memory information including Notifications, Logs, and Trace info. (Note that Trace is used for develop-
ment/debugging and to show what the system is doing at that moment.)
• For classification problems, the lower right section includes a toggle between an ROC curve, Precision-Recall
graph, Lift chart, Gains chart, and GPU Usage information (if GPUs are available). For regression problems,
the lower right section includes a toggle between a Residuals chart, an Actual vs. Predicted chart, and GPU
Usage information (if GPUs are available). (Refer to the Experiment Graphs section for more information.)
Upon completion, an Experiment Summary section will populate in the lower right section.
• The bottom portion of the experiment screen will show any warnings that Driverless AI encounters. You can
hide this pane by clicking the x icon.


17.5.2 Finishing/Aborting Experiments

You can finish and/or abort experiments that are currently running.
• Finish: Click the Finish button to stop a running experiment. Driverless AI will end the experiment and then
complete the ensembling and the deployment package.
• Abort: After clicking Finish, you have the option to click Abort, which terminates the experiment. (You will
be prompted to confirm the abort.) Aborted experiments will display on the Experiments page as Failed. You
can restart aborted experiments by clicking the right side of the experiment, then selecting Restart from Last
Checkpoint. This will start a new experiment based on the aborted one. Alternatively, you can start a new
experiment based on the aborted one by selecting New Model with Same Params. Refer to Checkpointing,
Rerunning, and Retraining for more information.

17.5.3 Aborting Experiment Report

The final step that Driverless AI performs during an experiment is to complete the experiment report. During this step,
you can click Abort to skip this report.


17.5.4 “Pausing” an Experiment

A trick for “pausing” an experiment is to:


1. Abort the experiment.
2. On the Experiments page, select Restart from Last Checkpoint for the aborted experiment.
3. On the Expert Settings page, specify 0 for the Ensemble level for final modeling pipeline option in the new
experiment’s Expert Settings.

17.6 Completed Experiment

After an experiment status changes from RUNNING to COMPLETE, the UI provides you with several options:


• Scores: Refer to Model Scores.


• Deploy (Local and Cloud): Refer to Deploying the MOJO Pipeline.
• Interpret this Model: Refer to Interpreting a Model. (Not supported for multiclass Time Series experiments.)
• Diagnose Model on New Dataset: Refer to Diagnosing a Model.
• Score on Another Dataset: Refer to Score on Another Dataset.
• Transform Another Dataset: Refer to Transform Another Dataset. (Not available for Time Series experi-
ments.)
• Download Predictions dropdown:
– Training (Holdout) Predictions: In csv format, available if a validation set was NOT provided.
– Validation Set Predictions: In csv format, available if a validation set was provided.
– Test Set Predictions: In csv format, available if a test dataset is used.
• Download Python Scoring Pipeline: A standalone Python scoring pipeline for H2O Driverless AI. Refer to
Driverless AI Standalone Python Scoring Pipeline.
• Download MOJO Scoring Pipeline: A standalone Model Object, Optimized scoring pipeline. Refer to MOJO
Scoring Pipelines. (Not available for TensorFlow or RuleFit models.)
• Visualize Scoring Pipeline (Experimental): Opens an experiment pipeline visualization page. Refer to Visu-
alizing the Scoring Pipeline.
• Download Summary & Logs: A zip file containing the following files. Refer to the Experiment Summary
section for more information.
– Experiment logs (regular and anonymized)
– A summary of the experiment


– The experiment features along with their relative importance


– Ensemble information
– An experiment preview
– Word version of an auto-generated report for the experiment
– A target transformations tuning leaderboard
– A tuning leaderboard
• Download Autoreport: A Word version of an auto-generated report for the experiment. This file is also avail-
able in the Experiment Summary zip file. Note that this option is not available for deprecated models. Refer to
the Experiment Autoreport section for more information.
Note: The “Download” options above (with the exception of Autoreport) will appear as “Export” options if artifacts
were enabled when Driverless AI was started. Refer to Export Artifacts for more information.

17.7 Experiment Graphs

This section describes the dashboard graphs that display for running and completed experiments. These graphs are
interactive. Hover over a point on the graph for more details about the point.

17.7.1 Binary Classification Experiments

For Binary Classification experiments, Driverless AI shows a ROC Curve, a Precision-Recall graph, a Lift chart, a
Kolmogorov-Smirnov chart, and a Gains chart.

• ROC: This shows Receiver Operating Characteristic curve stats on validation data along with the best Accuracy,
MCC, and F1 values. An ROC curve is a useful tool because it only focuses on how well the model was able to
distinguish between classes. Keep in mind, though, that for models where the prediction happens rarely, a high
AUC could provide a false sense that the model is correctly predicting the results. This is where the notion of
precision and recall become important.


The area under this curve is called AUC. The True Positive Rate (TPR) is the relative fraction of correct
positive predictions, and the False Positive Rate (FPR) is the relative fraction of incorrect positive
predictions. Each point corresponds to a classification threshold (e.g., YES if probability >= 0.3 else NO).
For each threshold, there is a unique confusion matrix that represents the balance between TPR and FPR.
Most useful operating points are in the top left corner in general.
Hover over a point in the ROC curve to see the True Negative, False Positive, False Negative, True
Positive, Threshold, FPR, TPR, Accuracy, F1, and MCC values for that point in the form of a confusion
matrix.

If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Precision-Recall: This shows the Precision-Recall curve on validation data along with the best Accuracy, MCC,
and F1 values. The area under this curve is called AUCPR. Prec-Recall is a complementary tool to ROC
curves, especially when the dataset has a significant skew. The Prec-Recall curve plots the precision or positive
predictive value (y-axis) versus sensitivity or true positive rate (x-axis) for every possible classification threshold.
At a high level, you can think of precision as a measure of exactness or quality of the results and recall as a
measure of completeness or quantity of the results obtained by the model. Prec-Recall measures the relevance
of the results obtained by the model.
• Precision: correct positive predictions (TP) / all positives (TP + FP).
• Recall: correct positive predictions (TP) / positive predictions (TP + FN).
Each point corresponds to a classification threshold (e.g., YES if probability >= 0.3 else NO). For each
threshold, there is a unique confusion matrix that represents the balance between Recall and Precision.
The Prec-Recall curve can be more insightful than the ROC curve for highly imbalanced datasets.
Hover over a point in this graph to see the True Positive, True Negative, False Positive, False Negative,
Threshold, Recall, Precision, Accuracy, F1, and MCC value for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Lift: This chart shows lift stats on validation data. For example, “How many times more observations of the
positive target class are in the top predicted 1%, 2%, 10%, etc. (cumulative) compared to selecting observations
randomly?” By definition, the Lift at 100% is 1.0. Lift can help answer the question of how much better you
can expect to do with the predictive model compared to a random model (or no model). Lift is a measure of
the effectiveness of a predictive model calculated as the ratio between the results obtained with a model and

with a random model (or no model). In other words, it is the ratio of the gain % to the random expectation % at a given
quantile. The random expectation of the xth quantile is x%.
Hover over a point in the Lift chart to view the quantile percentage and cumulative lift value for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Kolmogorov-Smirnov: This chart measures the degree of separation between positives and negatives for vali-
dation or test data.
Hover over a point in the chart to view the quantile percentage and Kolmogorov-Smirnov value for that
point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.
• Gains: This shows Gains stats on validation data. For example, “What fraction of all observations of the positive
target class are in the top predicted 1%, 2%, 10%, etc. (cumulative)?” By definition, the Gains at 100% are 1.0.
Hover over a point in the Gains chart to view the quantile percentage and cumulative gain value for that
point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data. (A minimal sketch showing how cumulative gains, lift, and the Kolmogorov-Smirnov
separation can be computed from predictions follows this list.)
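The quantile-based statistics above can be reproduced from a vector of predicted probabilities and actual labels. The
following Python sketch is illustrative only (it is not Driverless AI's internal implementation); the quantile grid and
the example data are assumptions.

    import numpy as np
    import pandas as pd

    def gains_lift_ks(actual, predicted, quantiles=np.arange(0.01, 1.01, 0.01)):
        # Cumulative gains, lift, and KS separation per quantile (illustrative sketch).
        df = pd.DataFrame({"actual": actual, "predicted": predicted})
        df = df.sort_values("predicted", ascending=False).reset_index(drop=True)
        total_pos = df["actual"].sum()
        total_neg = len(df) - total_pos
        rows = []
        for q in quantiles:
            top = df.iloc[: max(1, int(np.ceil(q * len(df))))]
            captured_pos = top["actual"].sum()
            gains = captured_pos / total_pos                        # fraction of positives captured
            lift = gains / q                                        # ratio to random expectation
            ks = abs(gains - (len(top) - captured_pos) / total_neg)  # separation at this quantile
            rows.append({"quantile": q, "gains": gains, "lift": lift, "ks": ks})
        return pd.DataFrame(rows)

    # By definition, gains and lift at the 100% quantile are both 1.0.
    table = gains_lift_ks(actual=[1, 0, 0, 1, 0, 1, 0, 0],
                          predicted=[0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.3, 0.05])
    print(table.tail(1))

The Kolmogorov-Smirnov value reported in the chart corresponds to the maximum of the per-quantile separation
computed above.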

17.7.2 Multiclass Classification Experiments

For multiclass classification experiments, a Confusion Matrix is available in addition to the ROC Curve, Precision-
Recall graph, Lift chart, Kolmogorov-Smirnov chart, and Gains chart. Driverless AI generates these graphs by con-
sidering the multiclass problem as multiple one-vs-all problems. These graphs and charts (Confusion Matrix ex-
cepted) are based on a method known as micro-averaging (reference: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings).
For example, you may want to predict the species in the iris data. The predictions would look something like this:

class.Iris-setosa    class.Iris-versicolor    class.Iris-virginica
0.9628               0.021                    0.0158
0.0182               0.3172                   0.6646
0.0191               0.9534                   0.0276

To create these charts, Driverless AI converts the results to 3 one-vs-all problems:

prob-setosa   actual-setosa   prob-versicolor   actual-versicolor   prob-virginica   actual-virginica
0.9628        1               0.021             0                   0.0158           0
0.0182        0               0.3172            1                   0.6646           0
0.0191        0               0.9534            1                   0.0276           0

The result is 3 vectors of predicted and actual values for binomial problems. Driverless AI concatenates these 3 vectors
together to compute the charts.
predicted = [0.9628, 0.0182, 0.0191, 0.021, 0.3172, 0.9534, 0.0158, 0.6646, 0.0276]
actual = [1, 0, 0, 0, 1, 1, 0, 0, 0]
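As a minimal sketch of this conversion (the class labels for the three example rows are taken from the table above;
this is illustrative and not the internal Driverless AI code), the one-vs-all vectors and their micro-averaged
concatenation can be reproduced as follows:

    import numpy as np

    # Predicted class probabilities (rows) and the true class labels from the example above.
    probs = np.array([[0.9628, 0.0210, 0.0158],
                      [0.0182, 0.3172, 0.6646],
                      [0.0191, 0.9534, 0.0276]])
    classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
    actual_labels = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor"]

    predicted, actual = [], []
    for k, cls in enumerate(classes):
        # One-vs-all: the probability of class k paired with a 1/0 indicator for class k.
        predicted.extend(probs[:, k].tolist())
        actual.extend([1 if label == cls else 0 for label in actual_labels])

    # Micro-averaging: the concatenated vectors are scored as a single binary problem.
    print(predicted)  # [0.9628, 0.0182, 0.0191, 0.021, 0.3172, 0.9534, 0.0158, 0.6646, 0.0276]
    print(actual)     # [1, 0, 0, 0, 1, 1, 0, 0, 0]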

Multiclass Confusion Matrix

A confusion matrix shows experiment performance in terms of false positives, false negatives, true positives, and true
negatives. For each threshold, the confusion matrix represents the balance between TPR and FPR (ROC) or Precision
and Recall (Prec-Recall). In general, most useful operating points are in the top left corner.
In this graph, the actual results display in the columns and the predictions display in the rows; correct predictions are
highlighted. In the example below, Iris-setosa was predicted correctly 30 times, while Iris-virginica was predicted
correctly 32 times, and Iris-versicolor was predicted as Iris-virginica 2 times (against the validation set).
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph to view
these stats on test data.

17.7.3 Regression Experiments

• Residuals: Residuals are the differences between observed responses and those predicted by a model. Any
pattern in the residuals is evidence of an inadequate model or of irregularities in the data, such as outliers, and
suggests how the model may be improved. This chart shows Residuals (Actual-Predicted) vs Predicted values
on validation or test data. Note that this plot preserves all outliers. For a perfect model, residuals are zero.
Hover over a point on the graph to view the Predicted and Residual values for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.

• Actual vs. Predicted: This chart shows Actual vs Predicted values on validation data. A small sample of values
is displayed. A perfect model has a diagonal line.
Hover over a point on the graph to view the Actual and Predicted values for that point.
If a test set was provided for the experiment, then click on the Validation Metrics button below the graph
to view these stats on test data.

17.8 Model Scores

You can view detailed information about model scores after an experiment is complete by clicking on the Scores
option.

The Model Scores page that opens includes the following tables:
• Model and feature tuning leaderboard: This leaderboard shows scoring information based on the scorer that
was selected in the experiment. This information is also available in the tuning_leaderboard.json file of the
Experiment Summary. You can download that file directly from the bottom of this table.

• Final pipeline scores across cross-validation folds and models: This table shows the final pipeline scores
across cross-validation folds and models. Note that if Constant Model was enabled (default), then that model
is added in this table as a baseline (reference) only and will be dropped in most cases. This information is also
included in the ensemble_base_learner_fold_scores.json file of the Experiment Summary. You can download
that file directly from a link at the bottom of this table.

• Pipeline Description: This shows how the final Stacked Ensemble pipeline was calculated. This information
is also included in the ensemble_model_description.json file of the Experiment Summary. You can download
that file directly from a link at the bottom of this table.

• Final Ensemble Scores: This shows the final scores for each scorer in DAI. If a custom scorer was used in the
experiment, that scorer will also appear here. This information is also included in the ensemble_scores.json file
of the Experiment Summary. You can download that file directly from a link at the bottom of this table.

17.9 Experiment Summary

An experiment summary is available for each completed experiment. Click the Download Summary & Logs button
to download the h2oai_experiment_summary_<experiment>.zip file.

The files within the experiment summary zip provide textual explanations of the graphical representations that are
shown on the Driverless AI UI. Details of each artifact are described below.

17.9.1 Experiment Autoreport

A report file (AutoDoc) is included in the experiment summary. This report provides insight into the training data and
any detected shifts in distribution, the validation schema selected, model parameter tuning, feature evolution and the
final set of features chosen during the experiment.
• report.docx: the report available in Word format
Click here to download and view a sample experiment report in Word format.

17.9.2 Autoreport Support

Autoreport only supports resumed experiments for certain Driverless AI versions. See the following table to check
what types of resumed experiments are supported for your version:

Autoreport Support for Resumed Experiments   Via LTS   1.7.0 and older   1.7.1   1.8.x
New model with same parameters               yes       yes               yes     yes
Restart from last checkpoint                 no        no                yes     yes
Retrain final pipeline                       no        no                no      yes

Notes:
• Autoreport does not support experiments that were built off of previously aborted or failed experiments.
• Reports for unsupported resumed experiments will still build, but they will only include the following text:
“AutoDoc not yet supported for resumed experiments.”

17.9.3 Experiment Artifacts Overview

The Experiment Summary contains artifacts that provide overviews of the experiment.
• preview.txt: Provides a preview of the experiment. (This is the same information that was included on the UI
before starting the experiment.)
• summary: Provides the same summary that appears in the lower-right portion of the UI for the experiment.
(Available in txt or json.)
• config.json: Provides a list of the settings used in the experiment.
• config_overrides_toml_string.txt: Provides any overrides for this experiment that were made to the config.toml
file.
• args_do_auto_dl.json: The internal arguments used in the Driverless AI experiment based on the dataset and
accuracy, time and interpretability settings.
• experiment_column_types.json: Provides the column types for each column included in the experiment.
• experiment_original_column.json: A list of all columns available in the dataset that was used in the experi-
ment.
• experiment_pipeline_original_required_columns.json: For columns used in the experiment, this includes the
column name and type.
• experiment_sampling_description.json: A description of the sampling performed on the dataset.

• timing.json: The timing and number of models generated in each part of the Driverless AI pipeline.

17.9.4 Tuning Artifacts

During the Driverless AI experiment, model tuning is performed to determine the optimal algorithm and parameter
settings for the provided dataset. For regression problems, target tuning is also performed to determine the best way
to represent the target column (for example, does taking the log of the target column improve results?). The results from these
tuning steps are available in the Experiment Summary.
• tuning_leaderboard: A table of the model tuning performed along with the score generated from the model
and training time. (Available in txt or json.)
• target_transform_tuning_leaderboard.txt: A table of the transforms applied to the target column along with
the score generated from the model and training time. (This will be empty for binary and multiclass use cases.)

17.9.5 Features Artifacts

Driverless AI performs feature engineering on the dataset to determine the optimal representation of the data. The
top features used in the final model can be seen in the GUI. The complete list of features used in the final model is
available in the Experiment Summary artifacts.
The Experiment Summary also provides a list of the original features and their estimated feature importance. For
example, given the features in the final Driverless AI model, we can estimate the feature importance of the original
features.

Feature                                   Feature Importance
NumToCatWoE:PAY_AMT2                      1
PAY_3                                     0.92
ClusterDist9:BILL_AMT1:LIMIT_BAL:PAY_3    0.90

To calculate the feature importance of PAY_3, we can aggregate the feature importance for all variables that used
PAY_3:
• NumToCatWoE:PAY_AMT2: 1 * 0 (PAY_3 not used.)
• PAY_3: 0.92 * 1 (PAY_3 is the only variable used.)
• ClusterDist9:BILL_AMT1:LIMIT_BAL:PAY_3: 0.90 * 1/3 (PAY_3 is one of three variables used.)
Estimated Feature Importance = (1*0) + (0.92*1) + (0.9*(1/3)) = 1.22
Note: The feature importance is converted to relative feature importance. (The feature with the highest estimated
feature importance will have a relative feature importance of 1).
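A minimal sketch of the aggregation just described follows. Splitting each transformed feature's importance evenly
over the original columns parsed from its name (by splitting on ":") is a simplifying assumption for illustration,
not the exact Driverless AI logic.

    from collections import defaultdict

    # Transformed-feature importances from the example above.
    transformed_importance = {
        "NumToCatWoE:PAY_AMT2": 1.0,
        "PAY_3": 0.92,
        "ClusterDist9:BILL_AMT1:LIMIT_BAL:PAY_3": 0.90,
    }

    def original_importance(transformed_importance):
        # Spread each transformed feature's importance evenly over the original
        # columns it uses, then normalize so the top original feature is 1.0.
        totals = defaultdict(float)
        for name, imp in transformed_importance.items():
            parts = name.split(":")
            originals = parts[1:] if len(parts) > 1 else parts
            for col in originals:
                totals[col] += imp / len(originals)
        top = max(totals.values())
        return {col: round(val / top, 3) for col, val in totals.items()}

    print(original_importance(transformed_importance))
    # PAY_3 receives 0.92 + 0.90/3 = 1.22 before normalization, as in the example above.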
• ensemble_features: A list of features used in the final model, a description of the feature, and the relative
feature importance. Feature importances for multiple models are linearly blended with the same weights as the final
ensemble of models. (Available in txt, table, or json.)
• ensemble_features_orig: A complete list of all original features used in the final model, a description of the
feature, the relative feature importance, and the standard deviation of relative importance. (Available in txt or
json.)
• ensemble_features_orig_shift: A list of original user features used in the final model and the difference in
relative feature importance between the final model and the corresponding feature importance of the final pop-
ulation. (Available in txt or json.)
• ensemble_features_prefit: A list of features used by the best individuals in the final population, with each model
blended using the same weights as the ensemble if the ensemble used blending. (Available in txt, table, or json.)

• ensemble_features_shift: A list of features used in the final model and the difference in relative feature impor-
tance between the final model and the corresponding feature importance of the final population. (Available in
txt, table, or json.)
• features: A list of features used by the best individual pipeline (identified by the genetic algorithm) and each
feature’s relative importance. (Available in txt, table, or json.)
• features_orig: A list of original user features used by the best individual pipeline (identified by the genetic
algorithm) and each feature’s estimated relative importance. (Available in txt or json.)
• leaked_features.json: A list of all leaked features provided along with the relative importance and the standard
deviation of relative importance. (Available in txt, table, or json.)
• leakage_features_orig.json: A list of leaked original features provided and an estimate of the relative feature
importance of that leaked original feature in the final model.
• shift_features.json: A list of all features provided along with the relative importance and the shift in standard
deviation of relative importance of that feature.
• shift_features_orig.json: A list of original features provided and an estimate of the shift in relative feature
importance of that original feature in the final model.

17.9.6 Final Model Artifacts

The Experiment Summary includes artifacts that describe the final model. This is the model that is used to score new
datasets and create the MOJO scoring pipeline. The final model may be an ensemble of models depending on the
Accuracy setting.
• coefs.json: A list of coefficients and standard deviation of coefficients for features.
• ensemble.txt: A summary of the final model which includes a description of the model(s), gains/lifts table,
confusion matrix, and scores of the final model for our list of scorers.
• ensemble_base_learner_fold_scores: The internal validation scorer metrics for each base learner when the
final model is an ensemble. (Available in txt or json.)
• ensemble_description.txt: A sentence describing the final model. (For example: “Final TensorFlowModel
pipeline with ensemble_level=0 transforming 21 original features -> 54 features in each of 1 models each fit on
full training data (i.e. no hold-out).”)
• ensemble_coefs: The coefficient and standard deviation of the coefficient for each feature in the ensemble. (Available
as txt or json.)
• ensemble_coefs_shift: The coefficient and shift of the coefficient for each feature in the ensemble. (Available as txt
or json.)
• ensemble_model_description.json/ensemble_model_extra_description: A json file describing the model(s)
and for ensembles how the model predictions are weighted.
• ensemble_model_params.json: A json file describing the parameters of the model(s).
• ensemble_folds_data.json: A json file describing the folds used for the final model(s). This includes the size of
each fold of data and the performance of the final model on each fold. (Available if a fold column was specified.)
• ensemble_features_orig: A list of the original features provided and an estimate of the relative feature impor-
tance of that original feature in the ensemble of models. (Available in txt or json.)
• ensemble_features: A complete list of all features used in the final ensemble of models, a description of the
feature, and the relative feature importance. (Available in txt, table, or json.)
• leakage_coefs.json: A list of coefficients and standard deviation of coefficients for leaked features.

• shift_coefs.json: A list of coefficients and the shift in standard deviation for those coefficients used in the
experiment.
The Experiment Summary also includes artifacts about the final model performance.
• ensemble_scores.json: The scores of the final model for our list of scorers.
• ensemble_confusion_matrix: The confusion matrix for the internal validation and test data if test data is pro-
vided.
• ensemble_confusion_matrix_stats_test.json: Confusion matrix statistics on the test data. (Only available if
test data provided)
• ensemble_gains: The lift and gains table for the internal validation and test data if test data is provided. (Visu-
alization of lift and gains can be seen in the UI.)
• ensemble_roc: The ROC and Precision Recall table for the internal validation and test data if test data is
provided. (Visualization of ROC and Precision Recall curve can be seen in the UI.)
• individual_scored.params_base: Detailed information about each iteration run in the experiment. (Available
in csv, table, or json.)

17.10 Viewing Experiments

The upper-right corner of the Driverless AI UI includes an Experiments link.

Click this link to open the Experiments page. From this page, you can rename an experiment, view previous experi-
ments, begin a new experiment, rerun an experiment, and delete an experiment.

17.10.1 Checkpointing, Rerunning, and Retraining

In Driverless AI, you can retry an experiment from the last checkpoint, you can run a new experiment using an existing
experiment’s settings, and you can retrain an experiment’s final pipeline.

Checkpointing Experiments

In real-world scenarios, data can change. For example, you may have a model currently in production that was built
using 1 million records. At a later date, you may receive several hundred thousand more records. Rather than building
a new model from scratch, Driverless AI includes H2O.ai Brain, which enables caching and smart re-use of prior
models to generate features for new models.
You can configure one of the following Brain levels in the experiment’s Expert Settings.
• -1: Don’t use any brain cache
• 0: Don’t use any brain cache but still write to cache
• 1: Smart checkpoint if an old experiment_id is passed in (for example, via running “resume one like this” in the
GUI)
• 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time
series options identically. (default)
• 3: Smart checkpoint like level #1, but for the entire population. Tune only if the brain population is of insufficient
size.
• 4: Smart checkpoint like level #2, but for the entire population. Tune only if the brain population is of insufficient
size.
• 5: Smart checkpoint like level #4, but will scan over the entire brain cache of populations (starting from resumed
experiment if chosen) in order to get the best scored individuals.
If you choose Level 2 (the default), then Level 1 is also applied when appropriate.
To make use of smart checkpointing, be sure that the new data has:
• The same data column names as the old experiment
• The same data types for each column as the old experiment. (This won’t match if, e.g., a column was all int and
then had one string row.)
• The same target as the old experiment
• The same target classes (if classification) as the old experiment

• For time series, all choices for intervals and gaps must be the same
When the above conditions are met, then you can:
• Start the same kind of experiment, just rerun for longer.
• Use a smaller or larger data set (i.e. fewer or more rows).
• Effectively do a final ensemble re-fit by varying the data rows and starting an experiment with a new accuracy,
time=1, and interpretability. Check the experiment preview for what the ensemble will be.
• Restart/Resume a cancelled, aborted, or completed experiment
To run smart checkpointing on an existing experiment, click the right side of the experiment that you want to retry,
then select Restart from Last Checkpoint. The experiment settings page opens. Specify the new dataset. If desired,
you can also change experiment settings, though the target column must be the same. Click Launch Experiment to
resume the experiment from the last checkpoint and build a new experiment.
The smart checkpointing continues by adding a prior model as another model used during tuning. If that prior model
is better (which is likely if it was run for more iterations), then that smart checkpoint model will be used during feature
evolution iterations and final ensemble.
Notes:
• Driverless AI does not guarantee exact continuation, only smart continuation from any last point.
• The directory where the H2O.ai Brain meta model files are stored is tmp/H2O.ai_brain. In addition, the default
maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file.

Rerunning Experiments

To run a new experiment using an existing experiment’s settings, click the right side of the experiment that you want
to use as the basis for the new experiment, then select New Model with Same Params. This opens the experiment
settings page. From this page, you can rerun the experiment using the original settings, or you can specify to use
new data and/or specify different experiment settings. Click Launch Experiment to create a new experiment with the
same options.

Retrain Final Pipeline

To retrain an experiment’s final pipeline, click the right side of the experiment that you want to use as the basis for the
new experiment, then select Retrain Final Pipeline. This opens the experiment settings page with the same settings
as the original experiment except that Time is set to 0.

17.10.2 “Pausing” an Experiment

A trick for “pausing” an experiment is to:


1. Abort the experiment.
2. On the Experiments page, select Restart from Last Checkpoint for the aborted experiment.
3. On the Expert Settings page, specify 0 for the Ensemble Level for Final Modeling Pipeline option.

17.10.3 Deleting Experiments

To delete an experiment, click the right side of the experiment that you want to remove, then select Delete. A confir-
mation message will display asking you to confirm the delete. Click OK to delete the experiment or Cancel to return
to the experiments page without deleting.



CHAPTER

EIGHTEEN

DIAGNOSING A MODEL

The Diagnose Model on New Dataset option allows you to view model performance for multiple scorers based on an
existing model and dataset.
On the completed experiment page, click the Diagnose Model on New Dataset button.
Note: You can also diagnose a model by selecting Diagnostic from the top menu, then selecting an experiment and
test dataset.

Select a dataset to use when diagnosing this experiment. Note that the dataset must include the target column that is
in the original dataset. At this point, Driverless AI will begin calculating all available scores for the experiment.
When the diagnosis is complete, it will be available on the Model Diagnostics page. Click on the new diagnosis.
From this page, you can download predictions. You can also view scores and metric plots. The plots are interactive.
Click a graph to enlarge. In the enlarged view, you can hover over the graph to view details for a specific point. You
can also download the graph in the enlarged view.

18.1 Classification Metric Plots

Classification metric plots include the following graphs:


• ROC Curve
• Precision-Recall Curve
• Cumulative Gains
• Lift Chart
• Kolmogorov-Smirnov Chart
• Confusion Matrix


Note: In the Confusion Matrix graph, the threshold value defaults to 0.5. For binary classification experiments, users
can specify a different threshold value. The threshold selector is available after clicking on the Confusion Matrix and
opening the enlarged view. When you specify a value or change the slider value, Driverless AI automatically computes
a diagnostic Confusion Matrix for that given threshold value.
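As a minimal sketch of what that recomputation involves (not the Driverless AI implementation itself; the example
labels and probabilities are assumptions), a confusion matrix for an arbitrary threshold can be built directly from
the predicted probabilities:

    import numpy as np

    def confusion_matrix_at(actual, predicted, threshold=0.5):
        # Confusion matrix for a binary problem at a chosen probability threshold.
        actual = np.asarray(actual)
        labels = (np.asarray(predicted) >= threshold).astype(int)  # YES if prob >= threshold
        tp = int(np.sum((labels == 1) & (actual == 1)))
        fp = int(np.sum((labels == 1) & (actual == 0)))
        fn = int(np.sum((labels == 0) & (actual == 1)))
        tn = int(np.sum((labels == 0) & (actual == 0)))
        return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

    # Lowering the threshold trades false negatives for false positives.
    print(confusion_matrix_at([1, 0, 1, 0, 1], [0.9, 0.4, 0.35, 0.2, 0.6], threshold=0.5))
    print(confusion_matrix_at([1, 0, 1, 0, 1], [0.9, 0.4, 0.35, 0.2, 0.6], threshold=0.3))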

18.2 Regression Metric Plots

Regression metric plots include the following graphs:


• Actual vs Predicted
• Residual Plot with LOESS curve
• Residual Histogram



CHAPTER

NINETEEN

PROJECT WORKSPACE

Driverless AI provides a Project Workspace for managing datasets and experiments related to a specific business
problem or use case. Whether you are trying to detect fraud or predict user retention, datasets and experiments can
be stored and saved in the individual projects. A Leaderboard on the Projects page allows you to easily compare
performance and results and identify the best solution for your problem.
To create a Project Workspace:
1. Click the Projects option on the top menu.
2. Click New Project.
3. Specify a name for the project and provide a description.
4. Click Create Project. This creates an empty Project page.

From the Projects page, you can link datasets and/or experiments, and you can run new experiments. When you link
an existing experiment to a Project, the datasets used for the experiment will automatically be linked to this project (if
not already linked).

19.1 Linking Datasets

Any dataset that has been added to Driverless AI can be linked to a project. In addition, when you link an experiment,
the datasets used for that experiment are also automatically linked to the project.
To link a dataset:
1. Select Training, Validation, or Test from the dropdown menu.
2. Click the Link Dataset button.
3. Select the dataset(s) that you want to link.


The list of available datasets to link includes those that were added on The Datasets Page; you can also browse for
datasets in your file system. Be sure to select the correct dropdown option before linking a training, validation, or
test dataset. This is because, when you run a new experiment in the project, the training data, validation data, and
test data options for that experiment come from the list of datasets linked here. You will not be able to, for example,
select any datasets from

When datasets are linked, the same menu options are available here as on the Datasets page. Refer to The Datasets
Page section for more information.

19.2 Linking Experiments

Existing experiments can be selected and linked to a Project. Additionally, you can run a new experiment or checkpoint
an existing experiment from this page, and those experiments will automatically be linked to this Project.
Link an existing experiment to the project by clicking Link Experiment and then selecting the experiment(s) to
include. When you link experiments, the datasets used to create the experiments are also automatically linked.

19.2.1 Selecting Datasets

In the Datasets section, you can select a training, validation, or testing dataset. The Experiments section will show
experiments in the Project that use the selected dataset.

19.2.2 New Experiments

When experiments are run from within a Project, only linked datasets or datasets available on the file system can be
used.
1. Click the New Experiment link to begin a new experiment.
2. Select your training data and optionally your validation and/or testing data.
3. Specify your desired experiment settings (refer to Experiment Settings and Expert Settings), and then click
Launch Experiment.
As the experiment is running, it will be listed at the top of the Experiments Leaderboard until it is completed. It will
also be available on the Experiments page.

19.2.3 Checkpointing Experiments

When experiments are linked to a Project, the same checkpointing options for experiments are available here as on the
Experiments page. Refer to Checkpointing, Rerunning, and Retraining for more information.

19.3 Experiments List

When attempting to solve a business problem, a normal workflow will include running multiple experiments, either
with different/new data or with a variety of settings, and the optimal solution can vary for different users and/or
business problems. For some users, the model with the highest accuracy for validation and test data could be the most
optimal one. Other users might be willing to make an acceptable compromise on the accuracy of the model for a
model with greater performance (faster prediction). For some, it could also mean how quickly the model could be
trained with acceptable levels of accuracy. The Experiments list makes it easy for you to find the best solution for your
business problem.
The list is organized based on experiment name. You can change the sorting of experiments by selecting the up/down
arrows beside a column heading in the experiment menu.
Hover over the right menu of an experiment to view additional information about the experiment, including the prob-
lem type, datasets used, and the target column.

19.3.1 Experiment Scoring

Experiments linked to projects do not automatically include a test score. To view Test Scores in the Leaderboard, you
must first complete the scoring step for a particular dataset and experiment combination. Without the scoring step, no
scoring data is available to populate in the Test Score and Score Time columns. Experiments that do not include a test
score or that have an invalid scorer (for example, if the R2 scorer is selected for classification experiments) show N/A
in the Leaderboard. Also, if None is selected for the scorer, then all experiments will show N/A.
To score the experiment:
1. Click Select Scoring Dataset at the top of the Experiments list and select a linked Test Dataset or a test dataset
available on the file system.
2. Select the model or models that you want to score.

3. Click the Select Scorer button at the top of the Experiments list and select a scorer.
4. Click the Score n Items button.
This starts the Model Diagnostic process and scores the selected experiment(s) against the selected scorer and dataset.
(Refer to Diagnosing a Model for more information.) Upon completion, the experiment(s) will be populated with a
test score, and the performance information will also be available on the Model Diagnostics page.

Notes:
• If an experiment has already scored a dataset, Driverless AI will not score it again. The scoring step is deter-
ministic, so for a particular test dataset and experiment combination, the score will be the same regardless of how
many times you repeat it.
• The test dataset must contain all of the columns that are expected by the experiments you are scoring it on.
However, its columns do not need to match the experiments' input features exactly: the test dataset can contain
additional columns, and any columns that were not used for training are ignored. This lets you train experiments
on different training datasets (i.e., having different features), and if you have an “uber test dataset” that includes
all of these feature columns, you can use that same dataset to score all of these experiments.
• You will notice a Score Time in the Experiments Leaderboard. This value shows the total time (in seconds) it took
to calculate the experiment scores for all applicable scorers for the experiment type. This is valuable for users who
need to estimate the runtime performance of an experiment.

19.3.2 Comparing Experiments

You can compare two or three experiments and view side-by-side detailed information about each.
1. Click the Select button at the top of the Leaderboard and select either two or three experiments that you want to
compare. You cannot compare more than three experiments.
2. Click the Compare n Items button.

This opens the Compare Experiments page. This page includes the experiment summary and metric plots for each
experiment. The metric plots vary depending on whether this is a classification or regression experiment.
For classification experiments, this page includes:
• Variable Importance list
• Confusion Matrix
• ROC Curve
• Precision Recall Curve
• Lift Chart
• Gains Chart
• Kolmogorov-Smirnov Chart
For regression experiments, this page includes:
• Variable Importance list
• Actual vs. Predicted Graph

19.4 Unlinking Data on a Projects Page

Unlinking datasets and/or experiments does not delete that data from Driverless AI. The datasets and experiments will
still be available on the Datasets and Experiments pages.
• Unlink a dataset by clicking on the dataset and selecting Unlink from the menu. Note: You cannot unlink
datasets that are tied to experiments in the same project.
• Unlink an experiment by clicking on the experiment and selecting Unlink from the menu. Note that this will
not automatically unlink datasets that were tied to the experiment.

19.5 Deleting Projects

To delete a project, click the Projects option on the top menu to open the main Projects page. Click the dotted menu
in the right-most column, and then select Delete. You will be prompted to confirm the deletion.
Note that deleting projects does not delete datasets and experiments from Driverless AI. Any datasets and experiments
from deleted projects will still be available on the Datasets and Experiments pages.



CHAPTER

TWENTY

MLI OVERVIEW

Driverless AI provides robust interpretability of machine learning models to explain modeling results in a human-
readable format. In the Machine Learning Interpretability (MLI) view, Driverless AI employs a host of different
techniques and methodologies for interpreting and explaining the results of its models. A number of charts are
generated automatically (depending on experiment type), including K-LIME, Shapley, Variable Importance, Decision Tree
Surrogate, Partial Dependence, Individual Conditional Expectation, Sensitivity Analysis, NLP Tokens, NLP LOCO,
and more. Additionally, you can download a CSV of LIME and Shapley reason codes from this view.
This chapter describes Machine Learning Interpretability (MLI) in Driverless AI for both regular and time-series
experiments. Refer to the following sections for more information:
• The Interpreted Models Page
• MLI for Regular (Non-Time-Series) Experiments
• MLI for Time-Series Experiments
Additional Resources
• Click here to download our MLI cheat sheet.
• “An Introduction to Machine Learning Interpretability” book.
• Click here to access the H2O.ai MLI Resources repository. This repo includes materials that illustrate applica-
tions or adaptations of various MLI techniques for practicing data scientists.
• Click here to view our H2O Driverless AI Machine Learning Interpretability walkthrough video.
Limitations
• This release deprecates experiments run in 1.7.0 and earlier. MLI will not be available for experiments from
versions <= 1.7.0.
• MLI is not supported for multiclass Time Series experiments.
• MLI does not require an Internet connection to run on current models.



CHAPTER

TWENTYONE

THE INTERPRETED MODELS PAGE

Click the MLI link in the upper-right corner of the UI to view a list of interpreted models.

You can sort this page by Name, Target, Model, Dataset, N-Folds, Feature Set, Cluster Col, LIME Method, Status, or
ETA/Runtime.
Click the right-most column of an interpreted model to view an additional menu. This menu allows you to open,
rename, or delete the interpretation.

Click on an interpreted model to view the MLI page for that interpretation. The MLI page that displays will vary
depending on whether the experiment was a regular experiment or a time series experiment.



CHAPTER

TWENTYTWO

MLI FOR REGULAR (NON-TIME-SERIES) EXPERIMENTS

This section describes MLI functionality and features for regular experiments. Refer to MLI for Time-Series Experi-
ments for MLI information with time-series experiments.

22.1 Interpreting a Model

There are two methods you can use for interpreting models:
• Using the Interpret this Model button on a completed experiment page to interpret a Driverless AI model on
original and transformed features.
• Using the MLI link in the upper right corner of the UI to interpret either a Driverless AI model or an external
model.
Notes:
• Experiments run in 1.7.0 and earlier are deprecated in this release. MLI will not be available for experiments
from versions <= 1.7.0.
• MLI does not require an Internet connection to run on current models.

22.1.1 Interpret this Model Button - Non-Time-Series

Clicking the Interpret this Model button on a completed experiment page launches the Model Interpretation for that
experiment. Python and Java logs can be viewed for non-time-series experiments while the interpretation is running.
For non-time-series experiments, this page provides several visual explanations and reason codes for the trained Driver-
less AI model and its results. More information about this page is available in the Understanding the Model Interpretation
Page section later in this chapter.


22.1.2 Model Interpretation on Driverless AI Models

This method allows you to run model interpretation on a Driverless AI model. This method is similar to clicking
“Interpret This Model” on an experiment summary page.
1. Click the MLI link in the upper-right corner of the UI to view a list of interpreted models.

2. Click the New Interpretation button.


3. Select the dataset that was used to train the model that you will use for interpretation.
4. Specify the Driverless AI model that you want to use for the interpretation. Once selected, the Target Column
used for the model will be selected.
5. Select a LIME method of either K-LIME (default) or LIME-SUP.
• K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate
GLMs on samples formed from k-means clusters in the training data. The features used for k-means are selected
from the Random Forest surrogate model’s variable importance. The number of features used for k-means is the
minimum of the top 25% of variables from the Random Forest surrogate model’s variable importance and the
max number of variables that can be used for k-means, which is set by the user in the config.toml setting for
mli_max_number_cluster_vars. (Note, if the number of features in the dataset is less than or equal to
6, then all features are used for k-means clustering.) The previous setting can be turned off to use all features
for k-means by setting use_all_columns_klime_kmeans in the config.toml file to true. All penalized
GLM surrogates are trained to model the predictions of the Driverless AI model. The number of clusters for
local explanations is chosen by a grid search in which the R² between the Driverless AI model predictions
and all of the local K-LIME model predictions is maximized. The global and local linear models' intercepts,
coefficients, R² values, accuracy, and predictions can all be used to debug and develop explanations for the
Driverless AI model’s behavior. (A minimal illustrative sketch of this surrogate approach appears after these steps.)
• LIME-SUP explains local regions of the trained Driverless AI model in terms of the original variables. Local
regions are defined by each leaf node path of the decision tree surrogate model instead of simulated, perturbed
observation samples - as in the original LIME. For each local region, a local GLM model is trained on the
original inputs and the predictions of the Driverless AI model. Then the parameters of this local GLM can be
used to generate approximate, local explanations of the Driverless AI model.
6. For K-LIME interpretations, specify the depth that you want for your decision tree surrogate model. The tree
depth value can be a value from 2-5 and defaults to 3. For LIME-SUP interpretations, specify the LIME-SUP
tree depth. This can be a value from 2-5 and defaults to 3.
7. Specify whether to use original features or transformed features in the surrogate model for the new interpretation.
Note: If Use Original Features for Surrogate Models is disabled, then the K-LIME clustering column option
will not be available, and quantile binning will not be available.
8. Specify whether to perform the interpretation on a sample of the training data. By default, MLI will sample the
training dataset if it is greater than 100k rows. (Note that this value can be modified in the config.toml setting
for mli_sample_size.) Turn this toggle off to run MLI on the entire dataset.
9. Optionally specify weight and dropped columns.
10. For K-LIME interpretations, optionally specify a clustering column. Note that this column should be categorical.
Also note that this is only available when K-LIME is used as the LIME method and when Use Original Features
for Surrogate Models is enabled. If the LIME method is changed to LIME-SUP, then this option is no longer
available.
11. Optionally specify the number of surrogate cross-validation folds to use (from 0 to 10). When running ex-
periments, Driverless AI automatically splits the training data and uses the validation data to determine the
performance of the model parameter tuning and feature engineering steps. For a new interpretation, Driverless
AI uses 3 cross-validation folds by default for the interpretation.
12. For K-LIME interpretations, optionally specify one or more columns to generate decile bins (uniform distribu-
tion) to help with MLI accuracy. Columns selected are added to top n columns for quantile binning selection.
If a column is not numeric or not in the dataset (transformed features), then the column will be skipped. Note:
This option is only available when Use Original Features for Surrogate Models is enabled.
13. For K-LIME interpretations, optionally specify the number of top variable importance numeric columns to run
decile binning to help with MLI accuracy. (Note that variable importances are generated from a Random Forest
model.) This defaults to 0, and the maximum value is 10. Note: This option is only available when Use Original
Features for Surrogate Models is enabled.
14. Optionally specify the number of top features for which partial dependence and ICE will be computed. This
value defaults to 10. Setting a value greater than 10 can significantly increase the computation time. Setting this
to -1 specifies to use all features.
15. Click the Launch MLI button.
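The following is a minimal, illustrative sketch of the K-LIME idea described in step 5: cluster the data with k-means
on a few important original features, then fit a penalized GLM per cluster (and one global GLM) to the Driverless AI
model's predictions. It is not the Driverless AI implementation; the dataset, the stand-in predictions, and the number
of clusters are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Ridge

    def klime_like_surrogates(X, dai_predictions, n_clusters=4):
        # X: numpy array of the (already selected) important original features.
        # dai_predictions: the Driverless AI model's predictions on the same rows.
        global_glm = Ridge(alpha=1.0).fit(X, dai_predictions)

        km = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
        local_glms = {}
        for c in range(n_clusters):
            mask = km.labels_ == c
            local_glms[c] = Ridge(alpha=1.0).fit(X[mask], dai_predictions[mask])
        return global_glm, km, local_glms

    # Example with random data standing in for important original features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    preds = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))  # stand-in for DAI predictions
    global_glm, km, local_glms = klime_like_surrogates(X, preds)

    # The intercepts and coefficients of global_glm / local_glms[c] provide the
    # approximate, local explanations (reason codes) described above.
    print(global_glm.coef_, local_glms[0].coef_)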


22.1.3 Model Interpretation on External Models

Model Interpretation does not need to be run on a Driverless AI experiment. You can train an external model and run
Model Interpretability on the predictions.
1. Click the MLI link in the upper-right corner of the UI to view a list of interpreted models.

2. Click the New Interpretation button.


3. Select the dataset that you want to use for the model interpretation. This must include a prediction column that
was generated by the external model. If the dataset does not have predictions, then you can join the external
predictions. An example showing how to do this in Python is available in the Run Model Interpretation on
External Model Predictions section of the Credit Card Demo.

Note: When running interpretations on an external model, leave the Select Model option empty. That option is
for selecting a Driverless AI model.
4. Specify a Target Column (actuals) and the Prediction Column (scores from the model).
5. Select a LIME method of either K-LIME (default) or LIME-SUP.
• K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate
GLMs on samples formed from k-means clusters in the training data. The features used for k-means are selected
from the Random Forest surrogate model’s variable importance. The number of features used for k-means is the
minimum of the top 25% of variables from the Random Forest surrogate model’s variable importance and the
max number of variables that can be used for k-means, which is set by the user in the config.toml setting for
mli_max_number_cluster_vars. (Note, if the number of features in the dataset is less than or equal to
6, then all features are used for k-means clustering.) The previous setting can be turned off to use all features
for k-means by setting use_all_columns_klime_kmeans in the config.toml file to true. All penalized
GLM surrogates are trained to model the predictions of the Driverless AI model. The number of clusters for
local explanations is chosen by a grid search in which the R² between the Driverless AI model predictions
and all of the local K-LIME model predictions is maximized. The global and local linear models' intercepts,
coefficients, R² values, accuracy, and predictions can all be used to debug and develop explanations for the
Driverless AI model’s behavior.
• LIME-SUP explains local regions of the trained Driverless AI model in terms of the original variables. Local
regions are defined by each leaf node path of the decision tree surrogate model instead of simulated, perturbed
observation samples - as in the original LIME. For each local region, a local GLM model is trained on the
original inputs and the predictions of the Driverless AI model. Then the parameters of this local GLM can be
used to generate approximate, local explanations of the Driverless AI model.
6. For K-LIME interpretations, specify the depth that you want for your decision tree surrogate model. The tree
depth value can be a value from 2-5 and defaults to 3. For LIME-SUP interpretations, specify the LIME-SUP
tree depth. This can be a value from 2-5 and defaults to 3.
7. Specify whether to perform the interpretation on a sample of the training data. By default, MLI will sample the
training dataset if it is greater than 100k rows. (Note that this value can be modified in the config.toml setting
for mli_sample_size.) Turn this toggle off to run MLI on the entire dataset.
8. Optionally specify weight and dropped columns.
9. For K-LIME interpretations, optionally specify a clustering column. Note that this column should be categorical.
Also note that this is only available when K-LIME is used as the LIME method. If the LIME method is changed
to LIME-SUP, then this option is no longer available.
10. Optionally specify the number of surrogate cross-validation folds to use (from 0 to 10). When running ex-
periments, Driverless AI automatically splits the training data and uses the validation data to determine the
performance of the model parameter tuning and feature engineering steps. For a new interpretation, Driverless
AI uses 3 cross-validation folds by default for the interpretation.
11. For K-LIME interpretations, optionally specify one or more columns to generate decile bins (uniform distribu-
tion) to help with MLI accuracy. Columns selected are added to top n columns for quantile binning selection. If
a column is not numeric or not in the dataset (transformed features), then the column will be skipped.
12. For K-LIME interpretations, optionally specify the number of top variable importance numeric columns to run
decile binning to help with MLI accuracy. (Note that variable importances are generated from a Random Forest
model.) This value is combined with any specific columns selected for quantile binning. This defaults to 0, and
the maximum value is 10.
13. Optionally specify the number of top features for which partial dependence and ICE will be computed. This
value defaults to 10. Setting a value greater than 10 can significantly increase the computation time. Setting this
to -1 specifies to use all features.
14. Click the Launch MLI button.


22.2 Understanding the Model Interpretation Page

This section describes the features on the Model Interpretation page for non-time-series experiments.
The Model Interpretation page opens with a Summary of the interpretation and also provides a row search feature on
the top of the page:
• Row Selection: Provides the ability to search for a particular row by Row Number or by Identifier Column. See
the Row Selection section for more information.
This page also provides left-hand navigation for viewing additional plots. This navigation includes:
• Summary: Provides a summary of the MLI experiment. See the Summary Page section for more information.
• DAI Model: See DAI Model Dropdown for more information.
– For binary classification and regression experiments, the DAI Model menu provides the following plots
for Driverless AI models:
– Feature Importance for transformed features
– Shapley plot for transformed features
– Partial Dependence/ICE
– Disparate Impact Analysis
– Sensitivity Analysis

– NLP Tokens (for text experiments only)


– NLP LOCO (for text experiments)
– Permutation Feature Importance (if the autodoc_include_permutation_feature_importance
configuration option is enabled)
– For multiclass classification experiments, the DAI Model menu provides the following plots for Driverless
AI models:
– Feature Importance for transformed features
– Shapley plots for transformed features
– Notes:
– Shapley plots are not supported for RuleFit and TensorFlow models. Shapley plots are also not supported
for BYOR models that DO NOT implement the has_pred_contribs method and DO implement
pred_contribs=True in predict.
– The Permutation-based feature importance plot is only available when the
autodoc_include_permutation_feature_importance configuration option is enabled
when starting Driverless AI or when starting the experiment.
• Surrogate Models: See Surrogate Models Dropdown for more information.
– For binary classification and regression experiments, the Surrogate Model menu provides K-LIME and
Decision Tree plots. This also includes a Random Forest submenu, which includes Global and Local
Feature Importance plots for original features and a Partial Dependence plot.
– For multiclass classification experiments, the Surrogate Model menu provides a Random Forest submenu
that includes a Global Feature Importance plot for the Random Forest surrogate model.
• Dashboard: See the Dashboard Page section for more information.
– For binary classification and regression experiments, the Dashboard page provides a single page with a
Global Interpretable Model Explanations plot, a Feature Importance plot, a Decision Tree plot, and a
Partial Dependence plot.
– The Dashboard page is not available for multinomial experiments.
• MLI Docs: A link to the Interpreting a Model section in the online help.
• Download MLI Logs: Downloads a zip file of the logs that were generated during this interpretation.
• Experiment: Provides a link back to the experiment that generated this interpretation.
• Scoring Pipeline:
– For binomial and regression experiments, Scoring Pipeline option downloads the scoring pipeline for this
interpretation.
– The Scoring Pipeline option is not available for multinomial experiments.
• Download Reason Codes:
– For binomial experiments, download a CSV file of LIME and/or Shapley reason codes.
– For multinomial experiments, download a CSV file of the Shapley reason codes.


22.2.1 Row Selection

The row selection feature allows a user to search for a particular observation by row number or by an identifier column.
Identifier columns cannot be specified by the user - MLI makes this choice automatically by choosing columns whose
values are unique (dataset row count equals the number of unique values in a column). To find a row by identifier
column, choose Identifier Column from the drop-down menu (if it meets the logic of being an identifier column), and
then specify a value. In addition to identifier columns, the drop-down menu also allows you to find a row using Row
Number.
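A minimal sketch of the uniqueness rule described above (not the MLI implementation; the example columns are
assumptions) using pandas:

    import pandas as pd

    def candidate_identifier_columns(df: pd.DataFrame):
        # Columns whose number of unique values equals the dataset row count.
        return [col for col in df.columns if df[col].nunique() == len(df)]

    df = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],    # unique -> identifier candidate
        "state":       ["CA", "NY", "CA", "TX"],
        "balance":     [120.5, 80.0, 99.9, 80.0],
    })
    print(candidate_identifier_columns(df))  # ['customer_id']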

22.2.2 Summary Page

The Summary page is the first page that opens when you view an interpretation. This page provides an overview of the
interpretation, including the dataset and Driverless AI experiment (if available) that were used for the interpretation
along with the feature space (original or transformed), target column, problem type, and k-Lime information. If the
interpretation was created from a Driverless AI model, then a table with the Driverless AI model summary is also
included along with the top variables for the model.

22.2.3 DAI Model Dropdown

This menu provides a Feature Importance plot and a Shapley plot (not supported for RuleFit and TensorFlow
models) for transformed features as well as Partial Dependence/ICE, Disparate Impact Analysis (DIA), Sensitiv-
ity Analysis, NLP Tokens and NLP LOCO (for text experiments), and Permutation Feature Importance (if the
autodoc_include_permutation_feature_importance configuration option is enabled) plots for Driver-
less AI models.
Note: On the Feature Importance and Shapley plots, the transformed feature names are encoded as follows:
<transformation/gene_details_id>_<transformation_name>:<orig>:<...>:<orig>.<extra>
So in 32_NumToCatTE:BILL_AMT1:EDUCATION:MARRIAGE:SEX.0, for example:
• 32_ is the transformation index for specific transformation parameters.

• NumToCatTE is the transformation type.


• BILL_AMT1:EDUCATION:MARRIAGE:SEX represents the original features used.
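A small sketch of how such a name can be split back into its parts follows. The parsing rules are an assumption based
on the pattern above, not an official Driverless AI API, and they assume the original column names themselves do not
contain "." or ":".

    def parse_transformed_feature(name: str):
        # Split '<id>_<transformation>:<orig>:...:<orig>.<extra>' into its parts.
        head, _, extra = name.partition(".")          # trailing '.<extra>' is optional
        prefix, *originals = head.split(":")
        idx, _, transformation = prefix.partition("_")
        return {"index": idx, "transformation": transformation,
                "original_features": originals, "extra": extra or None}

    print(parse_transformed_feature("32_NumToCatTE:BILL_AMT1:EDUCATION:MARRIAGE:SEX.0"))
    # {'index': '32', 'transformation': 'NumToCatTE',
    #  'original_features': ['BILL_AMT1', 'EDUCATION', 'MARRIAGE', 'SEX'], 'extra': '0'}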

Feature Importance

This plot is available for all models for binary classification, multiclass classification, and regression experiments.
This plot shows the Driverless AI feature importance. Driverless AI feature importance is a measure of the contribution
of an input variable to the overall predictions of the Driverless AI model. Global feature importance is calculated by
aggregating the improvement in splitting criterion caused by a single variable across all of the decision trees in the
Driverless AI model.

Shapley Plot

This plot is not available for RuleFit or TensorFlow models. For all other models, this plot is available for binary
classification, multiclass classification, and regression experiments.
Shapley explanations are a technique with credible theoretical support that presents consistent global and local variable
contributions. Local numeric Shapley values are calculated by tracing single rows of data through a trained tree en-
semble and aggregating the contribution of each input variable as the row of data moves through the trained ensemble.
For regression tasks, Shapley values sum to the prediction of the Driverless AI model. For classification problems,
Shapley values sum to the prediction of the Driverless AI model before applying the link function. Global Shapley
values are the average of the absolute Shapley values over every row of a dataset.
More information is available at https://arxiv.org/abs/1706.06060.
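As an illustration of the additivity property described above (using XGBoost's per-feature contribution output as a
stand-in for the Driverless AI ensemble; this is a sketch, not the Driverless AI implementation, and the synthetic
dataset is an assumption), local Shapley contributions sum to the model's margin prediction for each row:

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                        dtrain, num_boost_round=20)

    # Per-row Shapley contributions: one column per feature plus a bias column.
    contribs = booster.predict(dtrain, pred_contribs=True)
    margin = booster.predict(dtrain, output_margin=True)

    # For classification, contributions sum to the prediction before the link function.
    print(np.allclose(contribs.sum(axis=1), margin, atol=1e-4))   # True

    # Global Shapley values: mean absolute contribution per feature over the dataset.
    print(np.abs(contribs[:, :-1]).mean(axis=0))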


The Showing n Features dropdown for the Feature Importance and Shapley plots allows you to select between original and transformed features. If there is a significant number of features, they are organized in numbered pages that can be viewed individually. Note: The provided original values are approximations derived from the accompanying transformed values. For example, if the transformed feature feature1_feature2 has a value of 0.5, then the value of the original features (feature1 and feature2) will be 0.25.

Partial Dependence (PDP) and Individual Conditional Expectation (ICE)

A Partial Dependence and ICE plot is available for both Driverless AI and surrogate models.

The Partial Dependence Technique

Partial dependence is a measure of the average model prediction with respect to an input variable. Partial dependence
plots display how machine-learned response functions change based on the values of an input variable of interest, while
taking nonlinearity into consideration and averaging out the effects of all other input variables. Partial dependence plots
are well-known and described in the Elements of Statistical Learning (Hastie et al, 2001). Partial dependence plots
enable increased transparency in Driverless AI models and the ability to validate and debug Driverless AI models by
comparing a variable’s average predictions across its domain to known standards, domain knowledge, and reasonable
expectations.
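As a minimal sketch of the technique (not the Driverless AI implementation), one-dimensional partial dependence can be approximated for any fitted model that exposes a predict function; the model and column name below are placeholders:

import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average model prediction with `feature` forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                        # set the variable of interest everywhere
        averages.append(model.predict(X_mod).mean())  # average out all other variables
    return np.array(averages)

# Example (hypothetical model and pandas DataFrame X):
# grid = np.linspace(X["credit_score"].min(), X["credit_score"].max(), 10)
# pd_curve = partial_dependence(model, X, "credit_score", grid)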


The ICE Technique

This plot is available for binary classification and regression models.


Individual conditional expectation (ICE) plots, a newer and less well-known adaptation of partial dependence plots, can
be used to create more localized explanations for a single individual using the same basic ideas as partial dependence
plots. ICE Plots were described by Goldstein et al (2015). ICE values are simply disaggregated partial dependence, but
ICE is also a type of nonlinear sensitivity analysis in which the model predictions for a single row are measured while
a variable of interest is varied over its domain. ICE plots enable a user to determine whether the model’s treatment of
an individual row of data is outside one standard deviation from the average model behavior, whether the treatment of
a specific row is valid in comparison to average model behavior, known standards, domain knowledge, and reasonable
expectations, and how a model will behave in hypothetical situations where one variable in a selected row is varied
across its domain.
Given the row of input data with its corresponding Driverless AI and K-LIME predictions:

debt_to_income_ratio   credit_score   savings_acct_balance   observed_default   H2OAI_predicted_default   K-LIME_predicted_default
30                     600            1000                   1                  0.85                      0.9

Taking the Driverless AI model as F(X), assuming credit scores vary from 500 to 800 in the training data, and that
increments of 30 are used to plot the ICE curve, ICE is calculated as follows:
ICE_credit_score,500 = F(30, 500, 1000)
ICE_credit_score,530 = F(30, 530, 1000)
ICE_credit_score,560 = F(30, 560, 1000)
...
ICE_credit_score,800 = F(30, 800, 1000)
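The same sweep can be written as a short function for a single row; this hypothetical sketch mirrors the worked example above (the model F and column names are placeholders, not Driverless AI APIs):

import numpy as np

def ice_curve(model, row, feature, grid):
    """Predictions for one row as `feature` is varied across `grid`."""
    preds = []
    for value in grid:
        modified = row.copy()                          # `row` is a one-row pandas Series
        modified[feature] = value
        preds.append(model.predict(modified.to_frame().T)[0])
    return np.array(preds)

# grid = np.arange(500, 801, 30)                       # 500, 530, ..., 800
# ice_values = ice_curve(model, row, "credit_score", grid)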
The one-dimensional partial dependence plots displayed here do not take interactions into account. Large differences
in partial dependence and ICE are an indication that strong variable interactions may be present. In this case partial
dependence plots may be misleading because average model behavior may not accurately reflect local behavior.

The Partial Dependence Plot

This plot is available for binary classification and regression models.


Overlaying ICE plots onto partial dependence plots allows the comparison of the Driverless AI model’s treatment of
certain examples or individuals to the model’s average predictions over the domain of an input variable of interest.
This plot shows the partial dependence when a variable is selected and the ICE values when a specific row is selected.
Users may select a point on the graph to see the specific value at that point. Partial dependence (yellow) portrays
the average prediction behavior of the Driverless AI model across the domain of an input variable along with +/- 1
standard deviation bands. ICE (grey) displays the prediction behavior for an individual row of data when an input
variable is toggled across its domain. Currently, partial dependence and ICE plots are only available for the top ten
most important original input variables. Categorical variables with 20 or more unique values are never included in
these plots.


Disparate Impact Analysis

This plot is available for binary classification and regression models.


DIA is a technique that is used to evaluate fairness. Bias can be introduced to models during the process of collecting,
processing, and labeling data—as a result, it is important to determine whether a model is harming certain users by
making a significant number of biased decisions.
DIA typically works by comparing aggregate measurements of unprivileged groups to a privileged group. For instance,
the proportion of the unprivileged group that receives the potentially harmful outcome is divided by the proportion of
the privileged group that receives the same outcome—the resulting proportion is then used to determine whether the
model is biased. Refer to the Summary section to determine if a categorical level (for example, Fairness Female) is
fair in comparison to the specified reference level and user-defined thresholds. Fairness All is a true or false value that
is only true if every category is fair in comparison to the reference level.
Disparate impact testing is best suited for use with constrained models in Driverless AI, such as linear models, monotonic GBMs, or RuleFit. The average group metrics reported in most cases by DIA may miss cases of local discrimination, especially with complex, unconstrained models that can treat individuals very differently based on small changes in their data attributes.
DIA allows you to specify a disparate impact variable (the group variable that is analyzed), a reference level (the
group level that other groups are compared to), and user-defined thresholds for disparity. Several tables are provided
as part of the analysis:
• Group metrics: The aggregated metrics calculated per group. For example, true positive rates per group.
• Group disparity: This is calculated by dividing the metric_for_group by the
reference_group_metric. Disparity is observed if this value falls outside of the user-defined
thresholds.
• Group parity: This builds on Group disparity by converting the above calculation to a true or false value by
applying the user-defined thresholds to the disparity values.
In accordance with the established four-fifths rule, user-defined thresholds are set to 0.8 and 1.25 by default. These
thresholds will generally detect if the model is (on average) treating the non-reference group 20% more or less favorably than the reference group. Users are encouraged to set the user-defined thresholds to align with their organization’s
guidance on fairness thresholds.
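The disparity and parity calculations described above can be sketched as follows (a hypothetical illustration, not the DIA implementation; the DataFrame, column names, and group labels are made up):

import pandas as pd

def group_disparity(df, group_col, pred_col, reference_level, low=0.8, high=1.25):
    # Group metric: proportion of each group receiving the positive (potentially harmful) outcome.
    rates = df.groupby(group_col)[pred_col].mean()
    disparity = rates / rates[reference_level]     # metric_for_group / reference_group_metric
    parity = disparity.between(low, high)          # True if within the user-defined thresholds
    return pd.DataFrame({"group_metric": rates, "disparity": disparity, "parity": parity})

# Example with a hypothetical scored dataset:
# df = pd.DataFrame({"sex": ["F", "F", "M", "M"], "predicted": [1, 0, 1, 1]})
# print(group_disparity(df, "sex", "predicted", reference_level="M"))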


Metrics - Binary Classification

The following are formulas for error metrics and parity checks utilized by binary DIA. Note that in the tables below:
• tp = true positive
• fp = false positive
• tn = true negative
• fn = false negative

Error Metric / Parity Metric   Formula
Adverse Impact                 (tp + fp) / (tp + tn + fp + fn)
Accuracy                       (tp + tn) / (tp + tn + fp + fn)
True Positive Rate             tp / (tp + fn)
Precision                      tp / (tp + fp)
Specificity                    tn / (tn + fp)
Negative Predicted Value       tn / (tn + fn)
False Positive Rate            fp / (tn + fp)
False Discovery Rate           fp / (tp + fp)
False Negative Rate            fn / (tp + fn)
False Omissions Rate           fn / (tn + fn)

Parity Check          Description
Type I Parity         Fairness in both FDR Parity and FPR Parity
Type II Parity        Fairness in both FOR Parity and FNR Parity
Equalized Odds        Fairness in both FPR Parity and TPR Parity
Supervised Fairness   Fairness in both Type I and Type II Parity
Overall Fairness      Fairness across all parities for all metrics

Metrics - Regression

The following are metrics utilized by regression DIA:


• Mean Prediction: The mean of all predictions
• Std.Dev Prediction: The standard deviation of all predictions
• Maximum Prediction: The prediction with the highest value
• Minimum Prediction: The prediction with the lowest value
• R2: The measure that represents the proportion of the variance for a dependent variable that is explained by an
independent variable or variables
• RMSE: The measure of the differences between values predicted by a model and the values actually observed
Notes:
• Although the process of DIA is the same for both classification and regression experiments, the returned information is dependent on the type of experiment being interpreted. An analysis of a regression experiment returns an actual vs. predicted plot, while an analysis of a binary classification experiment returns confusion matrices.


• Users are encouraged to consider the explanation dashboard to understand and augment results from disparate
impact analysis. In addition to its established use as a fairness tool, users may want to consider disparate impact
for broader model debugging purposes. For example, users can analyze the supplied confusion matrices and
group metrics for important, non-demographic features in the Driverless AI model.

Fig. 1: Classification Experiment

Fig. 2: Regression Experiment


Sensitivity Analysis

Note: Sensitivity Analysis (SA) is not available for multiclass experiments.


Sensitivity Analysis (or “What if?”) is a simple and powerful model debugging, explanation, fairness, and security
tool. The idea behind SA is both direct and simple: Score your trained model on a single row, on multiple rows, or
on an entire dataset of potentially interesting simulated values and compare the model’s new outcome to the predicted
outcome on the original data.
Beyond traditional assessment practices, sensitivity analysis of machine learning model predictions is perhaps the
most important validation technique for machine learning models. Sensitivity analysis investigates whether model
behavior and outputs remain stable when data is intentionally perturbed or other changes are simulated in the data.
Machine learning models can make drastically differing predictions for only minor changes in input variable values.
For example, when looking at predictions that determine financial decisions, SA can be used to help you understand
the impact of changing the most important input variables and the impact of changing socially sensitive variables (such
as Sex, Age, Race, etc.) in the model. If the model changes in reasonable and expected ways when important variable
values are changed, this can enhance trust in the model. Similarly, if changes to sensitive variables have minimal impact on the model, then this is an indication of fairness in the model predictions.
This page utilizes the What If Tool for displaying the SA information.
The top portion of this page includes:
• A summary of the experiment
• Predictions for a specified column. Change the column on the Y axis to view predictions for that column.
• The current working score set. This updates each time you rescore.
The bottom portion of this page includes:
• A filter tool for filtering the analysis. Choose a different column, predictions, or residuals. Set the filter type (<,
>, etc.). Choose to filter by False Positive, False Negative, True Positive, or True Negative.
• Scoring chart. Click the Rescore button after applying a filter to update the scoring chart. This chart also allows
you to add or remove variables, toggle the main chart aggregation, reset the data, and delete the global history
while resetting the data.
• The current history of actions taken on this page. You can delete individual actions by selecting the action and
then clicking the Delete button that appears.


Use Case 1: Using SA on a Single Row or on a Small Group of Rows


This section describes scenarios for using SA for explanation, debugging, security, or fairness when scoring a trained
model on a single row or on a small group of rows.
• Explanation: Change values for a variable, and then rescore the model. View the difference between the original
prediction and the new model prediction. If the change is big, then the changed variable is locally important.
• Debugging: Change values for a variable, and then rescore the model. View the difference between the original
prediction and the new model prediction and determine whether the change to the variable made the model more or less accurate.
• Security: Change values for a variable, and then rescore the model. View the difference between the original
prediction and the new model prediction. If the change is big, then the user can, for example, inform their IT
department that this variable can be used in an adversarial attack or inform the model makers that this variable
should be more regularized.
• Fairness: Change values for a demographic variable, and then rescore the model. View the difference between
the original prediction and the new model prediction. If change is big, then the user can consider using a different
model, regularizing the model more, or applying post-hoc bias remediation techniques.
• Random: Set variables to random values, and then rescore the model. This can help you look for things you might not have thought of.
Use Case 2: Using SA on an Entire Dataset and Trained Model
This section describes scenarios for using SA for explanation, debugging, security, or fairness when scoring a trained
model for an entire dataset and trained predictive model.
• Financial Stress Testing: Assume the user wants to see how their loan default rates will change (according to
their trained probability of default model) when they change an entire dataset to simulate that all their customers
are under more financial stress (such as lower FICO scores, lower savings balances, higher unemployment, etc).
Change the values of the variables in their entire dataset, and look at the Percentage Change in the average
model score (default probability) on the original and new data. They can then use this discovered information


along with external information and processes to understand whether their institution has enough cash on hand
to be prepared for the simulated crisis.
• Random: Set variables to random values, and then rescore the model. This allows the user to look for things
the user might not have thought of.

Additional Resources

Sensitivity Analysis on a Driverless AI Model: This ipynb uses the UCI credit card default data to perform sensitivity
analysis and test model performance.

NLP Tokens

This plot is available for natural language processing (NLP) models.


This plot shows both the global and local importance values of each token in a corpus (a large and structured set of
texts). The corpus is automatically generated from text features used by Driverless AI models prior to the process of
tokenization.
Local importance values are calculated by using the term frequency–inverse document frequency (TFIDF) as a weighting factor for each token in each row. The TFIDF increases proportionally to the number of times a token appears in a given document and is offset by the number of documents in the corpus that contain the token. Specify the row that you want to view, then click the Search button to see the local importance of each token in that row.
Global importance values are calculated by using the inverse document frequency (IDF), which measures how common
or rare a given token is across all documents. (Default View)
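The following hedged sketch shows the same idea with scikit-learn on a toy corpus (an illustration only, not the MLI NLP implementation; the corpus is made up):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["late payment reported", "payment received on time", "account closed"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)                 # rows = documents, columns = tokens

tokens = vectorizer.get_feature_names_out()              # use get_feature_names() on older scikit-learn
row = 0
local_importance = dict(zip(tokens, tfidf[row].toarray().ravel()))   # per-row TF-IDF weights
global_importance = dict(zip(tokens, vectorizer.idf_))               # corpus-level IDF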
Notes:
• MLI support for NLP is not available for multinomial experiments.
• MLI for NLP does not currently feature the option to remove stop words.
• By default, up to 10,000 tokens are created during the tokenization process. This value can be changed in the
configuration.
• By default, Driverless AI uses up to 10,000 documents to extract tokens from. This value can be changed with
the config.mli_nlp_sample_limit parameter. Downsampling is used for datasets that are larger than
the default sample limit.
• Driverless AI does not currently generate a K-LIME scoring pipeline for MLI NLP problems.


NLP LOCO

This plot is available for natural language processing (NLP) models.


This plot applies a leave-one-covariate-out (LOCO) styled approach to NLP models by removing a specific token from
all text features in a record and predicting local importance without that token. The difference between the resulting
score and the original score (token included) is useful when trying to determine how specific changes to text features
alter the predictions made by the model.
Notes:
• MLI support for NLP is not available for multinomial experiments.
• Due to computational complexity, the global importance value is only calculated for N (20 by default) tokens. This value can be changed with the mli_nlp_top_n configuration option.
• A specific token selection method can be used by specifying one of the following options for the mli_nlp_min_token_mode configuration option:
• linspace: Selects N evenly spaced tokens according to their IDF score (Default)
• top: Selects top N tokens by IDF score
• bottom: Selects bottom N tokens by IDF score
• Local values for NLP LOCO can take a significant amount of time to calculate depending on the specifications
of your hardware.
• Driverless AI does not currently generate a K-LIME scoring pipeline for MLI NLP problems.


Permutation Feature Importance

Note: The Permutation-based feature importance plot is only available if the autodoc_include_permutation_feature_importance configuration option was enabled when starting Driverless AI or when starting the experiment. In addition, this plot is only available for binary classification and regression experiments.
Permutation-based feature importance shows how much a model’s performance would change if a feature’s values
were permuted. If the feature has little predictive power, shuffling its values should have little impact on the model’s
performance. If a feature is highly predictive, however, shuffling its values should decrease the model’s performance.
The difference between the model’s performance before and after permuting the feature provides the feature’s absolute
permutation importance.
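A minimal sketch of the calculation for a generic fitted model is shown below (an illustration under stated assumptions, not the Driverless AI implementation; `model`, `X`, `y`, and `metric` are placeholders):

import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=3, seed=0):
    """Drop in `metric` (higher is better) after shuffling each column of DataFrame `X`."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importance = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].values)   # shuffle only this feature
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importance[col] = float(np.mean(drops))                 # large drop => important feature
    return importance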


22.2.4 Surrogate Models Dropdown

The Surrogate Models dropdown includes K-LIME/LIME-SUP and Decision Tree plots as well as a Random Forest
submenu, which includes Global and Local Feature Importance plots for original features and a Partial Dependence
plot.
Note: For multiclass classification experiments, only the Global and Local Feature Importance plot for the Random
Forest surrogate model is available in this dropdown.

K-LIME and LIME-SUP

The MLI screen includes a K-LIME or LIME-SUP graph. A K-LIME graph is available by default when you interpret
a model from the experiment page. When you create a new interpretation, you can instead choose to use LIME-SUP
as the LIME method. Note that these graphs are essentially the same, but the K-LIME/LIME-SUP distinction provides
insight into the LIME method that was used during model interpretation.

The K-LIME Technique

This plot is available for binary classification and regression models.


K-LIME is a variant of the LIME technique proposed by Ribeiro et al (2016). K-LIME generates global and local explanations that increase the transparency of the Driverless AI model, and allow model behavior to be validated and debugged by analyzing the provided plots, and comparing global and local explanations to one another, to known standards, to domain knowledge, and to reasonable expectations.
K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate
GLMs on samples formed from k-means clusters in the training data. The features used for k-means are selected from the Random Forest surrogate model’s variable importance. The number of features used for k-means
is the minimum of the top 25% of variables from the Random Forest surrogate model’s variable importance and
the max number of variables that can be used for k-means, which is set by the user in the config.toml setting for


mli_max_number_cluster_vars. (Note: if the number of features in the dataset is less than or equal to 6, then all features are used for k-means clustering.) The previous setting can be turned off to use all features for k-means by setting use_all_columns_klime_kmeans in the config.toml file to true. All penalized GLM surrogates are trained to model the predictions of the Driverless AI model. The number of clusters for local explanations is chosen by a grid search in which the R² between the Driverless AI model predictions and all of the local K-LIME model predictions is maximized. The global and local linear model’s intercepts, coefficients, R² values, accuracy, and predictions can all be used to debug and develop explanations for the Driverless AI model’s behavior.
The parameters of the global K-LIME model give an indication of overall linear feature importance and the overall
average direction in which an input variable influences the Driverless AI model predictions. The global model is also
used to generate explanations for very small clusters (N < 20) where fitting a local linear model is inappropriate.
The in-cluster linear model parameters can be used to profile the local region, to give an average description of the
important variables in the local region, and to understand the average direction in which an input variable affects the
Driverless AI model predictions. For a point within a cluster, the sum of the local linear model intercept and the
products of each coefficient with their respective input variable value is the K-LIME prediction. By disaggregating
the K-LIME predictions into individual coefficient and input variable value products, the local linear impact of the
variable can be determined. This product is sometimes referred to as a reason code and is used to create explanations
for the Driverless AI model’s behavior.
In the following example, reason codes are created by evaluating and disaggregating a local linear model.
Given the row of input data with its corresponding Driverless AI and K-LIME predictions:

debt_to_income_ratio   credit_score   savings_acct_balance   observed_default   H2OAI_predicted_default   K-LIME_predicted_default
30                     600            1000                   1                  0.85                      0.9

And the local linear model:


y_K-LIME = 0.1 + 0.01 * debt_to_income_ratio + 0.0005 * credit_score + 0.0002 * savings_acct_balance
It can be seen that the local linear contributions for each variable are:
• debt_to_income_ratio: 0.01 * 30 = 0.3
• credit_score: 0.0005 * 600 = 0.3
• savings_acct_balance: 0.0002 * 1000 = 0.2
Each local contribution is positive and thus contributes positively to the Driverless AI model’s prediction of 0.85 for
H2OAI_predicted_default. By taking into consideration the value of each contribution, reason codes for the Driverless
AI decision can be derived. debt_to_income_ratio and credit_score would be the two largest negative reason codes,
followed by savings_acct_balance.
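The reason code arithmetic in this example can be reproduced directly from the local linear model, as in the following sketch (values are copied from the example above; this is an illustration, not the K-LIME implementation):

intercept = 0.1
coefficients = {"debt_to_income_ratio": 0.01,
                "credit_score": 0.0005,
                "savings_acct_balance": 0.0002}
row = {"debt_to_income_ratio": 30, "credit_score": 600, "savings_acct_balance": 1000}

# Disaggregate the local linear model into per-variable contributions (reason codes).
contributions = {name: coef * row[name] for name, coef in coefficients.items()}
klime_prediction = intercept + sum(contributions.values())

print(contributions)       # {'debt_to_income_ratio': 0.3, 'credit_score': 0.3, 'savings_acct_balance': 0.2}
print(klime_prediction)    # 0.9, within 5.5% of the Driverless AI prediction of 0.85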
The local linear model intercept and the products of each coefficient and corresponding value sum to the K-LIME
prediction. Moreover it can be seen that these linear explanations are reasonably representative of the nonlinear
model’s behavior for this individual because the K-LIME predictions are within 5.5% of the Driverless AI model
prediction. This information is encoded into English language rules which can be viewed by clicking the Explanations
button.
Like all LIME explanations based on linear models, the local explanations are linear in nature and are offsets from
the baseline prediction, or intercept, which represents the average of the penalized linear model residuals. Of course,
linear approximations to complex non-linear response functions will not always create suitable explanations and users
are urged to check the K-LIME plot, the local model R², and the accuracy of the K-LIME prediction to understand the
validity of the K-LIME local explanations. When K-LIME accuracy for a given point or set of points is quite low, this
can be an indication of extremely nonlinear behavior or the presence of strong or high-degree interactions in this local
region of the Driverless AI response function. In cases where K-LIME linear models are not fitting the Driverless AI


model well, nonlinear LOCO feature importance values may be a better explanatory tool for local model behavior. As
K-LIME local explanations rely on the creation of k-means clusters, extremely wide input data or strong correlation
between input variables may also degrade the quality of K-LIME local explanations.

The LIME-SUP Technique

This plot is available for binary classification and regression models.


LIME-SUP explains local regions of the trained Driverless AI model in terms of the original variables. Local regions
are defined by each leaf node path of the decision tree surrogate model instead of simulated, perturbed observation
samples - as in the original LIME. For each local region, a local GLM model is trained on the original inputs and the
predictions of the Driverless AI model. Then the parameters of this local GLM can be used to generate approximate,
local explanations of the Driverless AI model.
The Global Interpretable Model Explanation Plot
This plot shows Driverless AI model predictions and LIME model predictions in sorted order by the Driverless AI
model predictions. This graph is interactive. Hover over the Model Prediction, LIME Model Prediction, or Actual
Target radio buttons to magnify the selected predictions. Or click those radio buttons to disable the view in the graph.
You can also hover over any point in the graph to view LIME reason codes for that value. By default, this plot shows
information for the global LIME model, but you can change the plot view to show local results from a specific cluster.
The LIME plot also provides a visual indication of the linearity of the Driverless AI model and the trustworthiness of
the LIME explanations. The closer the local linear model approximates the Driverless AI model predictions, the more
linear the Driverless AI model and the more accurate the explanation generated by the LIME local linear models.


Decision Tree

The Decision Tree Surrogate Model Technique

The decision tree surrogate model increases the transparency of the Driverless AI model by displaying an approximate
flow-chart of the complex Driverless AI model’s decision making process. The decision tree surrogate model also
displays the most important variables in the Driverless AI model and the most important interactions in the Driverless
AI model. The decision tree surrogate model can be used for visualizing, validating, and debugging the Driverless
AI model by comparing the displayed decision-process, important variables, and important interactions to known
standards, domain knowledge, and reasonable expectations.
A surrogate model is a data mining and engineering technique in which a generally simpler model is used to explain
another, usually more complex, model or phenomenon. The decision tree surrogate is known to date back at least to
1996 (Craven and Shavlik). The decision tree surrogate model here is trained to predict the predictions of the more complex Driverless AI model using the original model inputs. The trained surrogate model enables a heuristic understanding (i.e., not a mathematically precise understanding) of the mechanisms of the highly complex and nonlinear Driverless AI model.
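A minimal sketch of the surrogate idea using scikit-learn is shown below (an illustration on synthetic data, not the Driverless AI implementation; the "complex model" here is a stand-in random forest):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=1000, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(6)])

complex_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# The surrogate is trained on the complex model's predictions, not on the true target.
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, complex_model.predict(X))
print(export_text(surrogate, feature_names=list(X.columns)))   # approximate flow chart of the complex model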

The Decision Tree Plot

This plot is available for binary classification and regression models.


In the Decision Tree plot, the highlighted row shows the path to the highest probability leaf node and indicates the
globally important variables and interactions that influence the Driverless AI model prediction for that row.


Random Forest Dropdown

The Random Forest dropdown provides a submenu that includes a Feature Importance plot, a Partial Dependence plot,
and a LOCO plot. These plots are for original features rather than transformed features.

Feature Importance

Global Feature Importance vs Local Feature Importance


Global feature importance (yellow) is a measure of the contribution of an input variable to the overall predictions of
the Driverless AI model. Global feature importance is calculated by aggregating the improvement in splitting criterion
caused by a single variable across all of the decision trees in the Driverless AI model.
Local feature importance (grey) is a measure of the contribution of an input variable to a single prediction of the
Driverless AI model. Local feature importance is calculated by removing the contribution of a variable from every
decision tree in the Driverless AI model and measuring the difference between the prediction with and without the
variable.
Both global and local variable importance are scaled so that the largest contributor has a value of 1.
Note: Engineered features are used for MLI when a time series experiment is built. This is because munged time
series features are more useful features for MLI than raw time series features, as raw time series features are not IID
(Independent and Identically Distributed).

LOCO

Local feature importance describes how the combination of the learned model rules or parameters and an individual
row’s attributes affect a model’s prediction for that row while taking nonlinearity and interactions into effect. Local
feature importance values reported here are based on a variant of the leave-one-covariate-out (LOCO) method (Lei et
al, 2017).
In the LOCO-variant method, each local feature importance is found by re-scoring the trained Driverless AI model
for each feature in the row of interest, while removing the contribution to the model prediction of splitting rules that
contain that feature throughout the ensemble. The original prediction is then subtracted from this modified prediction


to find the raw, signed importance for the feature. All local feature importance values for the row are then scaled
between 0 and 1 for direct comparison with global feature importance values.
Given the row of input data with its corresponding Driverless AI and K-LIME predictions:

debt_to_income_ratio   credit_score   savings_acct_balance   observed_default   H2OAI_predicted_default   K-LIME_predicted_default
30                     600            1000                   1                  0.85                      0.9

Taking the Driverless AI model as F(X), LOCO-variant feature importance values are calculated as follows.
First, the modified predictions are calculated:
F_debt_to_income_ratio = F(NA, 600, 1000) = 0.99
F_credit_score = F(30, NA, 1000) = 0.73
F_savings_acct_balance = F(30, 600, NA) = 0.82
Second, the original prediction is subtracted from each modified prediction to generate the unscaled local feature importance values:
LOCO_debt_to_income_ratio = F_debt_to_income_ratio - 0.85 = 0.99 - 0.85 = 0.14
LOCO_credit_score = F_credit_score - 0.85 = 0.73 - 0.85 = -0.12
LOCO_savings_acct_balance = F_savings_acct_balance - 0.85 = 0.82 - 0.85 = -0.03
Finally, LOCO values are scaled between 0 and 1 by dividing each value for the row by the maximum value for the row and taking the absolute magnitude of this quotient.
Scaled(LOCO_debt_to_income_ratio) = Abs(LOCO_debt_to_income_ratio / 0.14) = 1
Scaled(LOCO_credit_score) = Abs(LOCO_credit_score / 0.14) = 0.86
Scaled(LOCO_savings_acct_balance) = Abs(LOCO_savings_acct_balance / 0.14) = 0.21
One drawback to these LOCO-variant feature importance values is that, unlike K-LIME, it is difficult to generate a mathematical error rate to indicate when LOCO values may be questionable.
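The scaling step from the worked example can be reproduced with a few lines (values are copied from the example above; this sketch is illustrative only):

original_prediction = 0.85
modified_predictions = {"debt_to_income_ratio": 0.99,
                        "credit_score": 0.73,
                        "savings_acct_balance": 0.82}

# Raw, signed LOCO values: modified prediction minus the original prediction.
raw_loco = {name: pred - original_prediction for name, pred in modified_predictions.items()}

# Scale by the largest absolute value for the row and take the absolute magnitude.
max_abs = max(abs(v) for v in raw_loco.values())    # 0.14 in this example
scaled_loco = {name: abs(v) / max_abs for name, v in raw_loco.items()}
print(scaled_loco)   # debt_to_income_ratio = 1.0, credit_score ~ 0.86, savings_acct_balance ~ 0.21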


Partial Dependence and Individual Conditional Expectation

A Partial Dependence and ICE plot is available for both Driverless AI and surrogate models. Refer to the previous
Partial Dependence and Individual Conditional Expectation section for more information about this plot.

NLP Surrogate Models

These plots are available for natural language processing (NLP) models.
For NLP surrogate models, Driverless AI creates a TFIDF matrix by tokenizing all text features. The resulting frame
is appended to numerical or categorical columns from the training dataset, and the original text columns are removed.
This frame is then used for training surrogate models whose predictor columns consist of the tokens and the original numerical or categorical features.
Notes:
• MLI support for NLP is not available for multinomial experiments.
• Each row in the TFIDF matrix contains N columns, where N is the total number of tokens in the corpus, with values that are appropriate for that row (0 if absent).
• Driverless AI does not currently generate a K-LIME scoring pipeline for MLI NLP problems.

22.2.5 Dashboard Page

The Model Interpretation Dashboard includes the following information:


• Global interpretable model explanation plot
• Feature importance (Global for original features; LOCO for interpretations with predictions and when interpreting on raw features)
• Decision tree surrogate model
• Partial dependence and individual conditional expectation plots
Note: The Dashboard is not available for multiclass classification experiments.


22.3 Viewing Explanations

Note: Not all explanatory functionality is available for multinomial classification scenarios.
Driverless AI provides easy-to-read explanations for a completed model. You can view these by clicking the Explanations button on the Model Interpretation > Dashboard page for an interpreted model.
The UI allows you to view global, cluster-specific, and local reason codes. You can also export the explanations to
CSV.
• Global Reason Codes: To view global reason codes, select Global from the Cluster dropdown.

With Global selected, click the Explanations button beside the Cluster dropdown.


• Cluster Reason Codes: To view reason codes for a specific cluster, select a cluster from the Cluster dropdown.

With a cluster selected, click the Explanations button.

• Local Reason Codes by Row Number: To view local reason codes for a specific row, select a point on the
graph or type a value in the Row Number field.


With a value selected, click the Explanations button.


• Local Reason Codes by ID: Identifier columns cannot be specified by the user - MLI makes this choice automatically by choosing columns whose values are unique (dataset row count equals the number of unique values in a column). To find a row by identifier column, choose Identifier Column from the drop-down menu (if it meets the logic of being an identifier column), and then specify a value.

With a value selected, click the Explanations button.


22.4 General Considerations

22.4.1 Machine Learning and Approximate Explanations

For years, common sense has deemed the complex, intricate formulas created by training machine learning algorithms
to be uninterpretable. While great advances have been made in recent years to make these often nonlinear, non-monotonic, and non-continuous machine-learned response functions more understandable (Hall et al, 2017), it is
likely that such functions will never be as directly or universally interpretable as more traditional linear models.
Why consider machine learning approaches for inferential purposes? In general, linear models focus on understanding
and predicting average behavior, whereas machine-learned response functions can often make accurate, but more difficult to explain, predictions for subtler aspects of the modeled phenomenon. In a sense, linear models create very exact
interpretations for approximate models. The approach here seeks to make approximate explanations for very exact
models. It is quite possible that an approximate explanation of an exact model may have as much, or more, value and
meaning than the exact interpretations of an approximate model. Moreover, the use of machine learning techniques
for inferential or predictive purposes does not preclude using linear models for interpretation (Ribeiro et al, 2016).

22.4.2 The Multiplicity of Good Models in Machine Learning

It is well understood that for the same set of input variables and prediction targets, complex machine learning algorithms can produce multiple accurate models with very similar, but not exactly the same, internal architectures
(Breiman, 2001). This alone is an obstacle to interpretation, but when using these types of algorithms as interpretation
tools or with interpretation tools it is important to remember that details of explanations will change across multiple
accurate models.

22.4.3 Expectations for Consistency Between Explanatory Techniques

• The decision tree surrogate is a global, nonlinear description of the Driverless AI model behavior. Variables that
appear in the tree should have a direct relationship with variables that appear in the global feature importance
plot. For certain, more linear Driverless AI models, variables that appear in the decision tree surrogate model
may also have large coefficients in the global K-LIME model.
• K-LIME explanations are linear, do not consider interactions, and represent offsets from the local linear model
intercept. LOCO importance values are nonlinear, do consider interactions, and do not explicitly consider a
linear intercept or offset. LIME explanations and LOCO importance values are not expected to have a direct
relationship but can align roughly as both are measures of a variable’s local impact on a model’s predictions,
especially in more linear regions of the Driverless AI model’s learned response function.
• ICE is a type of nonlinear sensitivity analysis which has a complex relationship to LOCO feature importance
values. Comparing ICE to LOCO can only be done at the value of the selected variable that actually appears in
the selected row of the training data. When comparing ICE to LOCO the total value of the prediction for the
row, the value of the variable in the selected row, and the distance of the ICE value from the average prediction
for the selected variable at the value in the selected row must all be considered.
• ICE curves that are outside the standard deviation of partial dependence would be expected to fall into less
populated decision paths of the decision tree surrogate; ICE curves that lie within the standard deviation of
partial dependence would be expected to belong to more common decision paths.
• Partial dependence takes into consideration nonlinear, but average, behavior of the complex Driverless AI model
without considering interactions. Variables with consistently high partial dependence or partial dependence that
swings widely across an input variable’s domain will likely also have high global importance values. Strong
interactions between input variables can cause ICE values to diverge from partial dependence values.



CHAPTER

TWENTYTHREE

MLI FOR TIME-SERIES EXPERIMENTS

This section describes how to run MLI for time-series experiments. Refer to MLI for Regular (Non-Time-Series)
Experiments for MLI information with regular experiments.
There are two methods you can use for interpreting time-series models:
• Using the Interpret this Model button on a completed experiment page to interpret a Driverless AI model on
original and transformed features. (See below.)
• Using the MLI link in the upper right corner of the UI to interpret either a Driverless AI model or an external
model. This process is described in the Model Interpretation on Driverless AI Models and Model Interpretation
on External Models sections.
Limitations
• This release deprecates experiments run in 1.7.0 and earlier. MLI will not be available for experiments from
versions <= 1.7.0.
• MLI is not available for NLP experiments or for multiclass Time Series.
• When the test set contains actuals, you will see the time series metric plot and the group metrics table. If there
are no actuals, MLI will run, but you will see only the prediction value time series and a Shapley table.
• MLI does not require an Internet connection to run on current models.

23.1 Multi-Group Time Series MLI

This section describes how to run MLI on time series data for multiple groups.
1. Click the Interpret this Model button on a completed time series experiment to launch Model Interpretation
for that experiment. This page includes the following:
• A Help panel describing how to read and use this page. Click the Hide Help Button to hide this
text.
• If a test set is provided and the test set includes actuals, then a panel will display showing a time
series plot and the top and bottom group matrix tables based on the scorer that was used in the
experiment. The metric plot will show the metric of interest per time point for holdout predictions
and the test set. Likewise, the actual vs. predicted plot will show actuals vs. predicted values per
time point for the holdout set and the test set. Note that this panel can be resized if necessary.
• If a test set is not provided, then internal validation predictions will be used. The metric plot will
only show the metric of interest per time point for holdout predictions. Likewise, the actual vs.
predicted plot will only show actuals vs. predicted values per time point for the holdout set.
• A Download Logs button for retrieving logs that were generated when this interpretation was built.


• A Download Group Metrics button for retrieving the averages of each group’s scorer, as well as
each group’s sample size.
• A Show Summary button that provides details about the experiment settings that were used.
• A Group Search entry field (scroll to bottom) for selecting the groups to view.
• Use the zoom feature to magnify any portion of a graph by clicking the Enable Zoom icon near the
top-right corner of a graph. While this icon is selected, click and drag to draw a box over the portion
of the graph you want to magnify. Click the Disable Zoom icon to return to the default view.

2. Scroll to the bottom of the panel and select a grouping in the Group Search field to view a graph of Actual vs.
Predicted values for the group. The outputted graph can be downloaded to your local machine.


3. Click on a prediction point in the plot (white line) to view Shapley values for that prediction point. The Shapley
values plot can also be downloaded to your local machine.

4. Click Add Panel to add a new MLI Time Series panel. This allows you to compare different groups in the same
model and also provides the flexibility to do a “side-by-side” comparison between different models.


23.2 Single Time Series MLI

Time Series MLI can also be run when only one group is available.
1. Click the Interpret this Model button on a completed time series experiment to launch Model Interpretation
for that experiment. This page includes the following:
• A Help panel describing how to read and use this page. Click the Hide Help Button to hide this
text.
• If a test set is provided and the test set includes actuals, then a panel will display showing a time
series plot and the top and bottom group matrix tables based on the scorer that was used in the
experiment. The metric plot will show the metric of interest per time point for holdout predictions
and the test set. Likewise, the actual vs. predicted plot will show actuals vs. predicted values per
time point for the holdout set and the test set. Note that this panel can be resized if necessary.
• If a test set is not provided, then internal validation predictions will be used. The metric plot will
only show the metric of interest per time point for holdout predictions. Likewise, the actual vs.
predicted plot will only show actuals vs. predicted values per time point for the holdout set.
• A Download Logs button for retrieving logs that were generated when this interpretation was built.
• A Download Group Metrics button for retrieving the average of the group’s scorer, as well as the
group’s sample size.
• A Show Summary button that provides details about the experiment settings that were used.
• A Group Search entry field for selecting the group to view. Note that for Single Time Series MLI,
there will only be one option in this field.
• Use the zoom feature to magnify any portion of a graph by clicking the leftmost square icon near the
top-right corner of a graph. While this icon is selected, click and drag to draw a box over the portion
of the graph you want to magnify. To return to the default view, click the square-shaped arrow icon
to the right of the zoom icon.


2. Scroll to the bottom of the panel and select an option in the Group Search field to view a graph of Actual vs.
Predicted values for the group. (Note that for Single Time Series MLI, there will only be one option in this
field.) The outputted graph can be downloaded to your local machine.

3. Click on a prediction point in the plot (white line) to view Shapley values for that prediction point. The Shapley
values plot can also be downloaded to your local machine.


4. Click Add Panel to add a new MLI Time Series panel. This allows you to do a “side-by-side” comparison
between different models.



CHAPTER

TWENTYFOUR

SCORE ON ANOTHER DATASET

After you generate a model, you can use that model to make predictions on another dataset.
1. Click the Experiments link in the top menu and select the experiment that you want to use.
2. On the Experiment page, click the Score on Another Dataset button.
3. Locate the new dataset (test set) that you want to score on. Note that this new dataset must include the same columns as the dataset used in the selected experiment.
4. Select the columns from the test set to include in the predictions frame.
5. Click Done to start the scoring process.
6. Click the Download Predictions button after scoring is complete.
Note: This feature runs batch scoring on a new dataset. You may notice slow speeds if you attempt to perform
single-row scoring.



CHAPTER

TWENTYFIVE

TRANSFORM ANOTHER DATASET

When a training dataset is used in an experiment, Driverless AI transforms the data into an improved, feature-engineered dataset. (Refer to Driverless AI Transformations for more information about the transformations that are
provided in Driverless AI.) But what happens when new rows are added to your dataset? In this case, you can specify
to transform the new dataset after adding it to Driverless AI, and the same transformations that Driverless AI applied
to the original dataset will be applied to these new rows.
Follow these steps to transform another dataset. Note that this assumes the new dataset has been added to Driverless
AI already.
Note: Transform Another Dataset is not available for Time Series experiments.
1. On the completed experiment page for the original dataset, click the Transform Another Dataset button.
2. Select the new training dataset that you want to transform. Note that this must have the same number of columns as the original dataset.
3. In the Select drop down, specify a validation dataset to use with this dataset, or specify to split the training data.
If you specify to split the data, then you also specify the split value (defaults to 25%) and the seed (defaults to
1234). Note: To ensure the transformed dataset respects the row order, choose a validation dataset instead of
splitting the training data. Splitting the training data will result in a shuffling of the row order.
4. Optionally specify a test dataset. If specified, then the output also includes the final test dataset for final scoring.
5. Click Launch Transformation.


The following datasets will be available for download upon successful completion:
• Training dataset (not for cross validation)
• Validation dataset for parameter tuning
• Test dataset for final scoring. This option is available if a test dataset was used.



CHAPTER

TWENTYSIX

SCORING PIPELINES OVERVIEW

Driverless AI provides several Scoring Pipelines for experiments and/or interpreted models.
• A standalone Python Scoring Pipeline is available for experiments and interpreted models.
• A low-latency, standalone MOJO Scoring Pipeline is available for experiments, with both Java and C++ backends.
The Python Scoring Pipeline is implemented as a Python whl file. While this allows for a single process scoring
engine, the scoring service is generally implemented as a client/server architecture and supports interfaces for TCP
and HTTP.
The MOJO Scoring Pipeline provides a standalone scoring pipeline that converts experiments to MOJOs, which can
be scored in real time. The MOJO Scoring Pipeline is available as either a Java runtime or a C++ runtime. For the
C++ runtime, both Python and R wrappers are provided.
Examples are included with each scoring package.
Note: These sections describe scoring pipelines and not deployments of scoring pipelines. For information on how to
deploy a MOJO scoring pipeline, refer to the Deploying the MOJO Pipeline section.



CHAPTER

TWENTYSEVEN

VISUALIZING THE SCORING PIPELINE

A visualization of the scoring pipeline is available for each completed experiment.


Notes:
• This pipeline is best viewed in the latest version of Chrome.
• A .png image of this pipeline is available in the Autoreport and in the mojo.zip file ONLY with the Driverless AI
Docker image. For tar, deb, and rpm installs, you must install Graphviz manually in order for the visualization
pipeline to be included in the Autoreport and mojo.zip.

Click the Visualize Scoring Pipeline (Experimental) button on the completed experiment page to view the visualiza-
tion.


To view a visual representation of a specific model, click on the oval that corresponds with that model.


To change the orientation of the visualization, click the Transpose button in the bottom right corner of the screen.



CHAPTER

TWENTYEIGHT

WHICH PIPELINE SHOULD I USE?

Driverless AI provides a Python Scoring Pipeline, an MLI Standalone Scoring Pipeline, and a MOJO Scoring Pipeline.
Consider the following when determining the scoring pipeline that you want to use.
• For all pipelines, the higher the accuracy, the slower the scoring.
• The Python Scoring Pipeline is slower but easier to use than the MOJO scoring pipeline.
• When running the Python Scoring Pipeline:
– HTTP is easy and is supported by virtually any language. HTTP supports RESTful calls via curl, wget, or
supported packages in various scripting languages.
– TCP is a bit more complex, though faster. TCP also requires Thrift, which currently does not handle NAs.
• The MOJO Scoring Pipeline is flexible and is faster than the Python Scoring Pipeline, but it requires a bit more
coding. The MOJO Scoring Pipeline is available as either a Java runtime or a C++ runtime.
• The MLI Standalone Python Scoring Pipeline can be used to score interpreted models but only supports k-LIME
reason codes.
– For obtaining k-LIME reason codes from an MLI experiment, use the MLI Standalone Python Scoring
Pipeline. k-LIME reason codes are available for all models.
– For obtaining Shapley reason codes from an MLI experiment, use the DAI Standalone Python Scoring
Pipeline. Shapley is only available for XGBoost and LightGBM models. Note that obtaining Shapley
reason codes through the Python Scoring Pipeline can be time consuming.



CHAPTER

TWENTYNINE

DRIVERLESS AI STANDALONE PYTHON SCORING PIPELINE

As indicated earlier, a scoring pipeline is available after a successfully completed experiment. This package contains
an exported model and Python 3.6 source code examples for productionizing models built using H2O Driverless AI.
The files in this package allow you to transform and score on new data in a couple of different ways:
• From Python 3.6, you can import a scoring module, and then use the module to transform and score on new
data.
• From other languages and platforms, you can use the TCP/HTTP scoring service bundled with this package to
call into the scoring pipeline module through remote procedure calls (RPC).
Note about custom recipes and the Python Scoring Pipeline: By default, if a custom recipe has been uploaded
into Driverless AI but then not used in the experiment, the Python Scoring Pipeline still contains the H2O recipe
server. If this pipeline is then deployed in a container, the H2O recipe server causes the size of the pipeline to be much
larger. In addition, Java has to be installed in the container, which further increases the runtime storage and memory
requirements. A workaround is to set the following environment variable before running the Python Scoring Pipeline:
export dai_enable_custom_recipes=0

29.1 Python Scoring Pipeline Files

The scoring-pipeline folder includes the following notable files:


• example.py: An example Python script demonstrating how to import and score new records.
• run_example.sh: Runs example.py (also sets up a virtualenv with prerequisite libraries).
• tcp_server.py: A standalone TCP server for hosting scoring services.
• http_server.py: A standalone HTTP server for hosting scoring services.
• run_tcp_server.sh: Runs TCP scoring service (runs tcp_server.py).
• run_http_server.sh: Runs HTTP scoring service (runs http_server.py).
• example_client.py: An example Python script demonstrating how to communicate with the scoring server.
• run_tcp_client.sh: Demonstrates how to communicate with the scoring service via TCP (runs example_client.py).
• run_http_client.sh: Demonstrates how to communicate with the scoring service via HTTP (using curl).
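As a rough sketch of what scoring from Python looks like, the shape of example.py is typically along these lines (the module name embeds your experiment key, and the exact class and method names may differ by version, so treat the names below as assumptions and consult the example.py shipped with your pipeline):

# Illustrative only: the module name below is hypothetical and version-dependent.
from scoring_h2oai_experiment_abc123 import Scorer     # hypothetical module name

scorer = Scorer()                                      # requires DRIVERLESS_AI_LICENSE_KEY in the environment
row_prediction = scorer.score(["value1", 5, 3.2])      # score a single row (list of feature values)
# batch_predictions = scorer.score_batch(test_df)      # score a pandas DataFrame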


29.2 Quick Start

There are two methods for starting the Python Scoring Pipeline.

29.2.1 Quick Start - Recommended Method

This is the recommended method for running the Python Scoring Pipeline. Use this method if:
• You have an air gapped environment with no access to the Internet.
• You are running Power.
• You want an easy quick start approach.

Prerequisites

• A valid Driverless AI license key.


• A completed Driverless AI experiment.
• Downloaded Python Scoring Pipeline.

Running the Python Scoring Pipeline - Recommended

1. Download the TAR SH version of Driverless AI from https://www.h2o.ai/download/ (for either Linux or IBM Power).
2. Use bash to execute the download. This creates a new dai-<dai_version> folder, where <dai_version> represents
your version of Driverless AI (for example, 1.7.1-linux-x86_64).
3. Change directories into the new Driverless AI folder. (Replace <dai_version> below with the version that was
created in Step 2.)
cd dai-<dai_version>

4. Run the following to change permissions:


chmod -R a+w python

5. Run the following to install the Python Scoring Pipeline for your completed Driverless AI experiment:
./dai-env.sh pip install /path/to/your/scoring_experiment.whl

6. Run the following command to run the included scoring pipeline example:
DRIVERLESS_AI_LICENSE_KEY="pastekeyhere" SCORING_PIPELINE_INSTALL_DEPENDENCIES=0 ./dai-env.sh /path/to/your/run_example.sh


29.2.2 Quick Start - Alternative Method

Prerequisites

• The scoring module and scoring service are supported only on Linux with Python 3.6 and OpenBLAS.
• The scoring module and scoring service download additional packages at install time and require Internet access.
Depending on your network environment, you might need to set up internet access via a proxy.
• Valid Driverless AI license. Driverless AI requires a license to be specified in order to run the Python Scoring
Pipeline.
• Apache Thrift (to run the scoring service in TCP mode)
• Linux environment
• Python 3.6
• libopenblas-dev (required for H2O4GPU)
• OpenCL
Examples of how to install these prerequisites are below.
Installing Python 3.6
Installing Python 3.6 and OpenBLAS on Ubuntu 16.10+
sudo apt install python3.6 python3.6-dev python3-pip python3-dev \
python-virtualenv python3-virtualenv libopenblas-dev

Installing Python 3.6 and OpenBLAS on Ubuntu 16.04


sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.6 python3.6-dev python3-pip python3-dev \
python-virtualenv python3-virtualenv libopenblas-dev

Installing Conda 3.6:


You can install Conda using either Anaconda or Miniconda. Refer to the links below for more information:
• Anaconda - https://docs.anaconda.com/anaconda/install.html
• Miniconda - https://conda.io/docs/user-guide/install/index.html
Installing OpenCL
Install OpenCL on RHEL
yum -y clean all
yum -y makecache
yum -y update
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm
rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm
rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm
clinfo

mkdir -p /etc/OpenCL/vendors && \
  echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Install OpenCL on Ubuntu


sudo apt install ocl-icd-libopencl1

mkdir -p /etc/OpenCL/vendors && \
  echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

License Specification
Driverless AI requires a license to be specified in order to run the Python Scoring Pipeline. The license can be specified
via an environment variable in Python:


# Set DRIVERLESS_AI_LICENSE_FILE, the path to the Driverless AI license file


%env DRIVERLESS_AI_LICENSE_FILE="/home/ubuntu/license/license.sig"

# Set DRIVERLESS_AI_LICENSE_KEY, the Driverless AI license key (Base64 encoded string)


%env DRIVERLESS_AI_LICENSE_KEY="oLqLZXMI0y..."

The examples that follow use DRIVERLESS_AI_LICENSE_FILE. Using DRIVERLESS_AI_LICENSE_KEY would be similar.
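The %env syntax above is IPython/Jupyter magic. In a plain Python script, the same variables can be set through os.environ before the scoring module is imported; a minimal sketch:

# Set the license location (or DRIVERLESS_AI_LICENSE_KEY) before importing the scorer.
import os

os.environ["DRIVERLESS_AI_LICENSE_FILE"] = "/home/ubuntu/license/license.sig"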
Installing the Thrift Compiler
Thrift is required to run the scoring service in TCP mode, but it is not required to run the scoring module. The
following steps are available on the Thrift documentation site at: https://thrift.apache.org/docs/BuildingFromSource.
sudo apt-get install automake bison flex g++ git libevent-dev \
libssl-dev libtool make pkg-config libboost-all-dev ant
wget https://github.com/apache/thrift/archive/0.10.0.tar.gz
tar -xvf 0.10.0.tar.gz
cd thrift-0.10.0
./bootstrap.sh
./configure
make
sudo make install

Run the following to refresh the runtime shared libraries after installing Thrift:
sudo ldconfig /usr/local/lib

Running the Python Scoring Pipeline - Alternative Method

1. On the completed Experiment page, click on the Download Python Scoring Pipeline button to download the
scorer.zip file for this experiment onto your local machine.

2. Unzip the scoring pipeline.


After the pipeline is downloaded and unzipped, you will be able to run the scoring module and the scoring service.


Score from a Python Program


If you intend to score from a Python program, run the scoring module example. (Requires Linux and Python 3.6.)
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh

Score Using a Web Service


If you intend to score using a web service, run the HTTP scoring server example. (Requires Linux x86_64 and Python
3.6.)
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_http_server.sh
bash run_http_client.sh

Score Using a Thrift Service


If you intend to score using a Thrift service, run the TCP scoring server example. (Requires Linux x86_64, Python
3.6 and Thrift.)
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_tcp_server.sh
bash run_tcp_client.sh

Note: By default, the run_*.sh scripts mentioned above create a virtual environment using virtualenv and pip,
within which the Python code is executed. The scripts can also leverage Conda (Anaconda/Miniconda) to create a
Conda virtual environment and install required package dependencies. The package manager to use is provided as an
argument to the script.
# to use conda package manager
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh --pm conda

# to use pip package manager


export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh --pm pip

If you experience errors while running any of the above scripts, please check that your system has a properly
installed and configured Python 3.6 environment. Refer to the Troubleshooting Python Environment Issues section that
follows to see how to set up and test the scoring module using a cleanroom Ubuntu 16.04 virtual machine.

29.3 The Python Scoring Module

The scoring module is a Python module bundled into a standalone wheel file (named scoring_*.whl). All the prereq-
uisites for the scoring module to work correctly are listed in the requirements.txt file. To use the scoring module, all
you have to do is create a Python virtualenv, install the prerequisites, and then import and use the scoring module as
follows:
# See 'example.py' for complete example.
from scoring_487931_20170921174120_b4066 import Scorer
scorer = Scorer() # Create instance.
score = scorer.score([ # Call score()
7.416, # sepal_len
3.562, # sepal_wid
1.049, # petal_len
2.388, # petal_wid
])

The scorer instance provides the following methods (and more); a batch-scoring sketch follows this list:


• score(list): Score one row (list of values).
• score_batch(df): Score a Pandas dataframe.
• fit_transform_batch(df): Transform a Pandas dataframe.
• get_target_labels(): Get target column labels (for classification problems).
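In addition to single-row scoring, the scorer can score a whole Pandas frame at once. The following is a minimal sketch only; the module name and the Iris-style column names follow the example above and will differ for your experiment (example.py in the downloaded package is authoritative).

# Minimal batch-scoring sketch; module and column names are experiment-specific.
import pandas as pd
from scoring_487931_20170921174120_b4066 import Scorer

scorer = Scorer()

# Build a small frame with the same column names the model was trained on.
df = pd.DataFrame({
    "sepal_len": [7.416, 6.1],
    "sepal_wid": [3.562, 2.8],
    "petal_len": [1.049, 4.7],
    "petal_wid": [2.388, 1.2],
})

preds = scorer.score_batch(df)        # one prediction row per input row
labels = scorer.get_target_labels()   # class labels (classification problems only)
print(labels)
print(preds)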


The process of importing and using the scoring module is demonstrated by the bash script run_example.sh, which
effectively performs the following steps:
# See 'run_example.sh' for complete example.
virtualenv -p python3.6 env
source env/bin/activate
pip install -r requirements.txt
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python example.py

29.4 The Scoring Service

The scoring service hosts the scoring module as an HTTP or TCP service. Doing this exposes all the functions of
the scoring module through remote procedure calls (RPC). In effect, this mechanism allows you to invoke scoring
functions from languages other than Python on the same computer or from another computer on a shared network or
on the Internet.
The scoring service can be started in two ways:
• In TCP mode, the scoring service provides high-performance RPC calls via Apache Thrift (https://thrift.apache.org/)
using a binary wire protocol.
• In HTTP mode, the scoring service provides JSON-RPC 2.0 calls served by Tornado (http://www.tornadoweb.org).
Scoring operations can be performed on individual rows (row-by-row) or in batch mode (multiple rows at a time).

29.4.1 Scoring Service - TCP Mode (Thrift)

The TCP mode allows you to use the scoring service from any language supported by Thrift, including C, C++, C#,
Cocoa, D, Dart, Delphi, Go, Haxe, Java, Node.js, Lua, perl, PHP, Python, Ruby and Smalltalk.
To start the scoring service in TCP mode, you will need to generate the Thrift bindings once, then run the server:
# See 'run_tcp_server.sh' for complete example.
thrift --gen py scoring.thrift
python tcp_server.py --port=9090

Note that the Thrift compiler is only required at build-time. It is not a run time dependency, i.e. once the scoring
services are built and tested, you do not need to repeat this installation process on the machines where the scoring
services are intended to be deployed.
To call the scoring service, simply generate the Thrift bindings for your language of choice, then make RPC calls via
TCP sockets using Thrift’s buffered transport in conjunction with its binary protocol.
# See 'run_tcp_client.sh' for complete example.
thrift --gen py scoring.thrift

# See 'example_client.py' for complete example.


socket = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ScoringService.Client(protocol)
transport.open()
row = Row()
row.sepalLen = 7.416 # sepal_len
row.sepalWid = 3.562 # sepal_wid
row.petalLen = 1.049 # petal_len
row.petalWid = 2.388 # petal_wid
scores = client.score(row)
transport.close()

You can reproduce the exact same result from other languages, e.g. Java:
thrift --gen java scoring.thrift

// Dependencies:
// commons-codec-1.9.jar
// commons-logging-1.2.jar

// httpclient-4.4.1.jar
// httpcore-4.4.1.jar
// libthrift-0.10.0.jar
// slf4j-api-1.7.12.jar

import ai.h2o.scoring.Row;
import ai.h2o.scoring.ScoringService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import java.util.List;

public class Main {


public static void main(String[] args) {
try {
TTransport transport = new TSocket("localhost", 9090);
transport.open();

ScoringService.Client client = new ScoringService.Client(


new TBinaryProtocol(transport));

Row row = new Row(7.642, 3.436, 6.721, 1.020);


List<Double> scores = client.score(row);
System.out.println(scores);

transport.close();
} catch (TException ex) {
ex.printStackTrace();
}
}
}

Scoring Service - HTTP Mode (JSON-RPC 2.0)

The HTTP mode allows you to use the scoring service using plaintext JSON-RPC calls. This is usually less performant
compared to Thrift, but has the advantage of being usable from any HTTP client library in your language of choice,
without any dependency on Thrift.
For JSON-RPC documentation, see http://www.jsonrpc.org/specification.
To start the scoring service in HTTP mode:
# See 'run_http_server.sh' for complete example.
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python http_server.py --port=9090

To invoke scoring methods, compose a JSON-RPC message and make an HTTP POST request to http://host:port/rpc as
follows:
# See 'run_http_client.sh' for complete example.
curl http://localhost:9090/rpc \
--header "Content-Type: application/json" \
--data @- <<EOF
{
"id": 1,
"method": "score",
"params": {
"row": [ 7.486, 3.277, 4.755, 2.354 ]
}
}
EOF

Similarly, you can use any HTTP client library to reproduce the above result. For example, from Python, you can use
the requests module as follows:
import requests
row = [7.486, 3.277, 4.755, 2.354]
req = dict(id=1, method='score', params=dict(row=row))
res = requests.post('http://localhost:9090/rpc', json=req)  # send the JSON-RPC payload as JSON
print(res.json()['result'])


29.5 Python Scoring Pipeline FAQ

Why am I getting a “TensorFlow is disabled” message when I run the Python Scoring Pipeline?
If you ran an experiment when TensorFlow was enabled and then attempt to run the Python Scoring Pipeline, you may
receive a message similar to the following:
TensorFlow is disabled. To enable, export DRIVERLESS_AI_ENABLE_TENSORFLOW=1 or set enable_tensorflow=true in config.toml.

To successfully run the Python Scoring Pipeline, you must enable the DRIVERLESS_AI_ENABLE_TENSORFLOW
flag. For example:
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
DRIVERLESS_AI_ENABLE_TENSORFLOW=1 bash run_example.sh

29.6 Troubleshooting Python Environment Issues

The following instructions describe how to set up a cleanroom Ubuntu 16.04 virtual machine to test that this scoring
pipeline works correctly.
Prerequisites:
• Install VirtualBox: sudo apt-get install virtualbox
• Install Vagrant: https://www.vagrantup.com/downloads.html
1. Create configuration files for Vagrant.
• bootstrap.sh: contains commands to set up Python 3.6 and OpenBLAS.
• Vagrantfile: contains virtual machine configuration instructions for Vagrant and VirtualBox.
----- bootstrap.sh -----

#!/usr/bin/env bash

sudo apt-get -y update


sudo apt-get -y install apt-utils build-essential python-software-properties software-properties-common zip libopenblas-dev
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get update -yqq
sudo apt-get install -y python3.6 python3.6-dev python3-pip python3-dev python-virtualenv python3-virtualenv

# end of bootstrap.sh

----- Vagrantfile -----

# -*- mode: ruby -*-


# vi: set ft=ruby :

Vagrant.configure(2) do |config|
config.vm.box = "ubuntu/xenial64"
config.vm.provision :shell, path: "bootstrap.sh", privileged: false
config.vm.hostname = "h2o"
config.vm.provider "virtualbox" do |vb|
vb.memory = "4096"
end
end

# end of Vagrantfile

2. Launch the VM and SSH into it. Note that we’re also placing the scoring pipeline in the same directory so that
we can access it later inside the VM.
cp /path/to/scorer.zip .
vagrant up
vagrant ssh

3. Test the scoring pipeline inside the virtual machine.


cp /vagrant/scorer.zip .
unzip scorer.zip
cd scoring-pipeline/
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh


At this point, you should see scores printed out on the terminal. If not, contact us at [email protected].



CHAPTER

THIRTY

DRIVERLESS AI MLI STANDALONE PYTHON SCORING PACKAGE

This package contains an exported model and Python 3.6 source code examples for productionizing models built using
the H2O Driverless AI Machine Learning Interpretability (MLI) tool. This is only available for interpreted models and
can be downloaded by clicking the Scoring Pipeline button on the Interpreted Models page.
The files in this package allow you to obtain reason codes for a given row of data in a couple of different ways:
• From Python 3.6, you can import a scoring module, and then use the module to transform and score on new
data.
• From other languages and platforms, you can use the TCP/HTTP scoring service bundled with this package to
call into the scoring pipeline module through remote procedure calls (RPC).

30.1 MLI Python Scoring Package Files

The scoring-pipeline-mli folder includes the following notable files:


• example.py: An example Python script demonstrating how to import and interpret new records.
• run_example.sh: Runs example.py (This also sets up a virtualenv with prerequisite libraries.)
• run_example_shapley.sh: Runs example_shapley.py. This compares K-LIME and Driverless AI Shapley rea-
son codes.
• tcp_server.py: A standalone TCP server for hosting MLI services.
• http_server.py: A standalone HTTP server for hosting MLI services.
• run_tcp_server.sh: Runs the TCP scoring service (specifically, tcp_server.py).
• run_http_server.sh: Runs HTTP scoring service (runs http_server.py).
• example_client.py: An example Python script demonstrating how to communicate with the MLI server.
• example_shapley.py: An example Python script demonstrating how to compare K-LIME and Driverless AI
Shapley reason codes.
• run_tcp_client.sh: Demonstrates how to communicate with the MLI service via TCP (runs example_client.py).
• run_http_client.sh: Demonstrates how to communicate with the MLI service via HTTP (using curl).


30.2 Quick Start

There are two methods for starting the MLI Standalone Scoring Pipeline.

30.2.1 Quick Start - Recommended Method

This is the recommended method for running the MLI Scoring Pipeline. Use this method if:
• You have an air gapped environment with no access to the Internet.
• You are running Power.
• You want an easy quick start approach.

Prerequisites

• A valid Driverless AI license key.


• A completed Driverless AI experiment.
• Downloaded MLI Scoring Pipeline.

Running the MLI Scoring Pipeline - Recommended

1. Download the TAR SH version of Driverless AI from https://www.h2o.ai/download/ (for either Linux or IBM Power).
2. Use bash to execute the download. This creates a new dai-nnn folder.
3. Change directories into the new Driverless AI folder.
cd dai-nnn

4. Run the following to install the Python Scoring Pipeline for your completed Driverless AI experiment:
./dai-env.sh pip install /path/to/your/scoring_experiment.whl

5. Run the following command to run the included scoring pipeline example:
DRIVERLESS_AI_LICENSE_KEY="pastekeyhere" SCORING_PIPELINE_INSTALL_DEPENDENCIES=0 ./dai-env.sh /path/to/your/run_example.sh

30.2.2 Quick Start - Alternative Method

This section describes an alternative method for running the MLI Standalone Scoring Pipeline. This version requires
Internet access. It is also not supported on Power machines.


Prerequisites

• Valid Driverless AI license.


• The scoring module and scoring service are supported only on Linux with Python 3.6 and OpenBLAS.
• The scoring module and scoring service download additional packages at install time and require internet access.
Depending on your network environment, you might need to set up internet access via a proxy.
• Apache Thrift (to run the scoring service in TCP mode)
Examples of how to install these prerequisites are below.
Installing Python 3.6
Installing Python 3.6 on Ubuntu 16.10+:
sudo apt install python3.6 python3.6-dev python3-pip python3-dev \
python-virtualenv python3-virtualenv

Installing Python 3.6 on Ubuntu 16.04:


sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.6 python3.6-dev python3-pip python3-dev \
python-virtualenv python3-virtualenv

Installing Conda 3.6:


You can install Conda using either Anaconda or Miniconda. Refer to the links below for more information:
• Anaconda - https://docs.anaconda.com/anaconda/install.html
• Miniconda - https://conda.io/docs/user-guide/install/index.html
Installing the Thrift Compiler
Refer to the Thrift documentation at https://thrift.apache.org/docs/BuildingFromSource for more information.
sudo apt-get install automake bison flex g++ git libevent-dev \
libssl-dev libtool make pkg-config libboost-all-dev ant
wget https://github.com/apache/thrift/archive/0.10.0.tar.gz
tar -xvf 0.10.0.tar.gz
cd thrift-0.10.0
./bootstrap.sh
./configure
make
sudo make install

Run the following to refresh the runtime shared libraries after installing Thrift.
sudo ldconfig /usr/local/lib

Running the MLI Scoring Pipeline - Alternative Method

1. On the MLI page, click the Scoring Pipeline button.


2. Unzip the scoring pipeline, and run the following examples in the scoring-pipeline-mli folder.
Run the scoring module example. (This requires Linux and Python 3.6.)
bash run_example.sh

Run the TCP scoring server example. Use two terminal windows. (This requires Linux, Python 3.6 and
Thrift.)
bash run_tcp_server.sh
bash run_tcp_client.sh

Run the HTTP scoring server example. Use two terminal windows. (This requires Linux and Python 3.6.)
bash run_http_server.sh
bash run_http_client.sh

Note: By default, the run_*.sh scripts mentioned above create a virtual environment using virtualenv and pip,
within which the Python code is executed. The scripts can also leverage Conda (Anaconda/Miniconda) to create a
Conda virtual environment and install required package dependencies. The package manager to use is provided as an
argument to the script.
# to use conda package manager
bash run_example.sh --pm conda

# to use pip package manager


bash run_example.sh --pm pip

30.3 MLI Python Scoring Module

The MLI scoring module is a Python module bundled into a standalone wheel file (named scoring_*.whl). All the
prerequisites for the scoring module to work correctly are listed in the ‘requirements.txt’ file. To use the scoring
module, all you have to do is create a Python virtualenv, install the prerequisites, and then import and use the scoring
module as follows:
----- See 'example.py' for complete example. -----
from scoring_487931_20170921174120_b4066 import KLimeScorer
scorer = KLimeScorer() # Create instance.
score = scorer.score_reason_codes([ # Call score_reason_codes()
7.416, # sepal_len
3.562, # sepal_wid
1.049, # petal_len
2.388, # petal_wid
])

The scorer instance provides the following methods; a batch sketch follows this list:


• score_reason_codes(list): Get K-LIME reason codes for one row (list of values).
• score_reason_codes_batch(dataframe): Takes and outputs a Pandas Dataframe
• get_column_names(): Get the input column names
• get_reason_code_column_names(): Get the output column names
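A batch variant of the same call is sketched below; the module name and column names are experiment-specific, and example.py in your downloaded MLI package remains the authoritative reference.

# Batch reason-code sketch; module and column names are experiment-specific.
import pandas as pd
from scoring_487931_20170921174120_b4066 import KLimeScorer

scorer = KLimeScorer()
print(scorer.get_column_names())                      # input columns expected by the scorer

df = pd.DataFrame({
    "sepal_len": [7.416, 6.1],
    "sepal_wid": [3.562, 2.8],
    "petal_len": [1.049, 4.7],
    "petal_wid": [2.388, 1.2],
})

reason_codes = scorer.score_reason_codes_batch(df)    # returns a Pandas DataFrame
print(scorer.get_reason_code_column_names())          # names of the output columns
print(reason_codes)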
The process of importing and using the scoring module is demonstrated by the bash script run_example.sh, which
effectively performs the following steps:
----- See 'run_example.sh' for complete example. -----
virtualenv -p python3.6 env
source env/bin/activate
pip install -r requirements.txt
python example.py

30.4 K-LIME vs Shapley Reason Codes

There are times when the K-LIME model score is not close to the Driverless AI model score. In this case, it may be
better to use reason codes from the Shapley method on the Driverless AI model. Please note that the reason codes from
Shapley will be in the transformed feature space.
To see an example of using both K-LIME and Driverless AI Shapley reason codes in the same Python session, run:
bash run_example_shapley.sh

For this batch script to succeed, MLI must be run on a Driverless AI model. If you have run MLI in standalone
(external model) mode, there will not be a Driverless AI scoring pipeline.
If MLI was run with transformed features, the Shapley example scripts will not be exported. You can generate exact
reason codes directly from the Driverless AI model scoring pipeline.


30.5 MLI Scoring Service Overview

The MLI scoring service hosts the scoring module as an HTTP or TCP service. Doing this exposes all the functions of
the scoring module through remote procedure calls (RPC).
In effect, this mechanism allows you to invoke scoring functions from languages other than Python on the same
computer, or from another computer on a shared network or the internet.
The scoring service can be started in two ways:
• In TCP mode, the scoring service provides high-performance RPC calls via Apache Thrift (https://thrift.apache.org/)
using a binary wire protocol.
• In HTTP mode, the scoring service provides JSON-RPC 2.0 calls served by Tornado (http://www.tornadoweb.org).
Scoring operations can be performed on individual rows (row-by-row) using score or in batch mode (multiple rows
at a time) using score_batch. Both functions allow you to specify pred_contribs=[True|False] to get
MLI predictions (KLime/Shapley) on a new dataset. See the example_shapley.py file for more information.
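A hedged sketch of the pred_contribs flag follows; the module name is experiment-specific and example_shapley.py in the downloaded package is the authoritative reference for this usage.

# Hedged sketch of pred_contribs; see 'example_shapley.py' for the authoritative example.
import pandas as pd
from scoring_487931_20170921174120_b4066 import Scorer  # experiment-specific module name

scorer = Scorer()
df = pd.DataFrame({
    "sepal_len": [7.416],
    "sepal_wid": [3.562],
    "petal_len": [1.049],
    "petal_wid": [2.388],
})

preds = scorer.score_batch(df, pred_contribs=False)    # plain predictions
contribs = scorer.score_batch(df, pred_contribs=True)  # per-feature contributions (Shapley)
print(preds)
print(contribs)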

30.5.1 MLI Scoring Service - TCP Mode (Thrift)

The TCP mode allows you to use the scoring service from any language supported by Thrift, including C, C++, C#,
Cocoa, D, Dart, Delphi, Go, Haxe, Java, Node.js, Lua, perl, PHP, Python, Ruby and Smalltalk.
To start the scoring service in TCP mode, you will need to generate the Thrift bindings once, then run the server:
----- See 'run_tcp_server.sh' for complete example. -----
thrift --gen py scoring.thrift
python tcp_server.py --port=9090

Note that the Thrift compiler is only required at build-time. It is not a run time dependency, i.e. once the scoring
services are built and tested, you do not need to repeat this installation process on the machines where the scoring
services are intended to be deployed.
To call the scoring service, simply generate the Thrift bindings for your language of choice, then make RPC calls via
TCP sockets using Thrift’s buffered transport in conjunction with its binary protocol.
----- See 'run_tcp_client.sh' for complete example. -----
thrift --gen py scoring.thrift

----- See 'example_client.py' for complete example. -----


socket = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ScoringService.Client(protocol)
transport.open()
row = Row()
row.sepalLen = 7.416 # sepal_len
row.sepalWid = 3.562 # sepal_wid
row.petalLen = 1.049 # petal_len
row.petalWid = 2.388 # petal_wid
scores = client.score_reason_codes(row)
transport.close()

You can reproduce the exact same result from other languages, e.g. Java:
thrift --gen java scoring.thrift

// Dependencies:
// commons-codec-1.9.jar
// commons-logging-1.2.jar
// httpclient-4.4.1.jar
// httpcore-4.4.1.jar
// libthrift-0.10.0.jar
// slf4j-api-1.7.12.jar

import ai.h2o.scoring.Row;
import ai.h2o.scoring.ScoringService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

import java.util.List;

public class Main {


public static void main(String[] args) {
try {
TTransport transport = new TSocket("localhost", 9090);
transport.open();

ScoringService.Client client = new ScoringService.Client(


new TBinaryProtocol(transport));

Row row = new Row(7.642, 3.436, 6.721, 1.020);


List<Double> scores = client.score_reason_codes(row);
System.out.println(scores);

transport.close();
} catch (TException ex) {
ex.printStackTrace();
}
}
}

30.5.2 Scoring Service - HTTP Mode (JSON-RPC 2.0)

The HTTP mode allows you to use the scoring service using plaintext JSON-RPC calls. This is usually less performant
compared to Thrift, but has the advantage of being usable from any HTTP client library in your language of choice,
without any dependency on Thrift.
For JSON-RPC documentation, see http://www.jsonrpc.org/specification.
To start the scoring service in HTTP mode:
----- See 'run_http_server.sh' for complete example. -----
python http_server.py --port=9090

To invoke scoring methods, compose a JSON-RPC message and make an HTTP POST request to http://host:port/rpc as
follows:
----- See 'run_http_client.sh' for complete example. -----
curl http://localhost:9090/rpc \
--header "Content-Type: application/json" \
--data @- <<EOF
{
"id": 1,
"method": "score_reason_codes",
"params": {
"row": [ 7.486, 3.277, 4.755, 2.354 ]
}
}
EOF

Similarly, you can use any HTTP client library to reproduce the above result. For example, from Python, you can use
the requests module as follows:
import requests
row = [7.486, 3.277, 4.755, 2.354]
req = dict(id=1, method='score_reason_codes', params=dict(row=row))
res = requests.post('http://localhost:9090/rpc', json=req)  # send the JSON-RPC payload as JSON
print(res.json()['result'])



CHAPTER

THIRTYONE

MOJO SCORING PIPELINES

As indicated previously, the MOJO Scoring Pipeline provides a standalone scoring pipeline that converts experiments
to MOJOs, which can be scored in real time. The MOJO Scoring Pipeline is available as either a Java runtime or a
C++ runtime (with Python and R wrappers).

31.1 Driverless AI MOJO Scoring Pipeline - Java Runtime

For completed experiments, Driverless AI automatically converts models to MOJOs (Model Objects, Optimized). The
MOJO Scoring Pipeline is a scoring engine that can be deployed in any Java environment for scoring in real time.
(Refer to Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python and R Wrappers for information about
the C++ scoring runtime with Python and R wrappers.)
Keep in mind that, similar to H2O-3, MOJOs are tied to experiments. Experiments and MOJOs are not automatically
upgraded when Driverless AI is upgraded.
Notes:
• This scoring pipeline is not currently available for TensorFlow, RuleFit, or FTRL models.
• To disable the automatic creation of this scoring pipeline, set the Make MOJO Scoring Pipeline expert setting
to Off.

31.1.1 Prerequisites

The following are required in order to run the MOJO scoring pipeline.
• Java 7 runtime (JDK 1.7) or newer. NOTE: We recommend using Java 11+ due to a bug in Java. (See
https://bugs.openjdk.java.net/browse/JDK-8186464.)
• Valid Driverless AI license. You can download the license.sig file from the machine hosting Driverless AI
(usually in the license folder). Copy the license file into the downloaded mojo-pipeline folder.
• mojo2-runtime.jar file. This is available from the top navigation menu in the Driverless AI UI and in the
downloaded mojo-pipeline.zip file for an experiment.


License Specification

Driverless AI requires a license to be specified in order to run the MOJO Scoring Pipeline. The license can be specified
in one of the following ways:
• Via an environment variable:
– DRIVERLESS_AI_LICENSE_FILE: Path to the Driverless AI license file, or
– DRIVERLESS_AI_LICENSE_KEY: The Driverless AI license key (Base64 encoded string)
• Via a system property of JVM (-D option):
– ai.h2o.mojos.runtime.license.file: Path to the Driverless AI license file, or
– ai.h2o.mojos.runtime.license.key: The Driverless AI license key (Base64 encoded
string)
• Via an application classpath:
– The license is loaded from a resource called /license.sig.
– The default resource name can be changed via the JVM system property ai.h2o.mojos.
runtime.license.filename.
For example:
java -Dai.h2o.mojos.runtime.license.file=/etc/dai/license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

31.1.2 MOJO Scoring Pipeline Files

The mojo-pipeline folder includes the following files:


• run_example.sh: A bash script to score a sample test set.
• pipeline.mojo: Standalone scoring pipeline in MOJO format.
• mojo2-runtime.jar: MOJO Java runtime.
• example.csv: Sample test set (synthetic, of the correct format).
• DOT files: Text files that can be rendered as graphs that provide a visual representation of the MOJO scoring
pipeline (can be edited to change the appearance and structure of a rendered graph).
• PNG files: Image files that provide a visual representation of the MOJO scoring pipeline.

31.1.3 Quickstart

Before running the quickstart examples, be sure that the MOJO scoring pipeline is already downloaded and unzipped:
1. On the completed Experiment page, click on the Download MOJO Scoring Pipeline button.


2. In the pop-up menu that appears, click on the Download MOJO Scoring Pipeline button once again to download
the scorer.zip file for this experiment onto your local machine. Refer to the provided instructions for Java,
Python, or R.

3. To score all rows in the sample test set (example.csv) with the MOJO pipeline (pipeline.mojo) and
license stored in the environment variable DRIVERLESS_AI_LICENSE_KEY:
bash run_example.sh

4. To score a specific test set (example.csv) with MOJO pipeline (pipeline.mojo) and the license file
(license.sig):
bash run_example.sh pipeline.mojo example.csv license.sig

5. To run the Java application for data transformation directly:


java -Dai.h2o.mojos.runtime.license.file=license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

Note: For very large models, it may be necessary to increase the memory limit when running the Java
application for data transformation. This can be done by specifying -Xmx25g when running the above
command.

31.1.4 Compile and Run the MOJO from Java

1. Open a new terminal window and change directories to the experiment folder:
cd experiment

2. Create your main program in the experiment folder by creating a new file called Main.java (for example, using
vim Main.java). Include the following contents.
import java.io.IOException;

import ai.h2o.mojos.runtime.MojoPipeline;
import ai.h2o.mojos.runtime.frame.MojoFrame;
import ai.h2o.mojos.runtime.frame.MojoFrameBuilder;
import ai.h2o.mojos.runtime.frame.MojoRowBuilder;
import ai.h2o.mojos.runtime.utils.SimpleCSV;
import ai.h2o.mojos.runtime.lic.LicenseException;

public class Main {

public static void main(String[] args) throws IOException, LicenseException {


// Load model and csv
MojoPipeline model = MojoPipeline.loadFrom("pipeline.mojo");

// Get and fill the input columns


MojoFrameBuilder frameBuilder = model.getInputFrameBuilder();
MojoRowBuilder rowBuilder = frameBuilder.getMojoRowBuilder();
rowBuilder.setValue("AGE", "68");
rowBuilder.setValue("RACE", "2");
rowBuilder.setValue("DCAPS", "2");
rowBuilder.setValue("VOL", "0");
rowBuilder.setValue("GLEASON", "6");
frameBuilder.addRow(rowBuilder);

// Create a frame which can be transformed by MOJO pipeline


MojoFrame iframe = frameBuilder.toMojoFrame();

// Transform input frame by MOJO pipeline


MojoFrame oframe = model.transform(iframe);
// `MojoFrame.debug()` can be used to view the contents of a Frame
// oframe.debug();

// Output prediction as CSV


SimpleCSV outCsv = SimpleCSV.read(oframe);
outCsv.write(System.out);
}
}

3. Compile the source code:


javac -cp mojo2-runtime.jar -J-Xms2g -J-XX:MaxPermSize=128m Main.java

4. Run the MOJO example:


# Linux and OS X users
java -Dai.h2o.mojos.runtime.license.file=license.sig -cp .:mojo2-runtime.jar Main
# Windows users
java -Dai.h2o.mojos.runtime.license.file=license.sig -cp .;mojo2-runtime.jar Main

5. The following output is displayed:


CAPSULE.True
0.5442205910902282


31.1.5 Using the MOJO Scoring Pipeline with Spark/Sparkling Water

Note: The Driverless AI 1.5 release will be the last release with TOML-based MOJO2. Releases after 1.5 will include
protobuf-based MOJO2.
MOJO scoring pipeline artifacts can be used in Spark to deploy predictions in parallel using the Sparkling Water API.
This section shows how to load and run predictions on the MOJO scoring pipeline in Spark using Scala and the Python
API.
In the event that you upgrade H2O Driverless AI, we have good news: Sparkling Water is backwards compatible
with MOJO versions produced by older Driverless AI versions.

Requirements

• You must have a Spark cluster with the Sparkling Water JAR file passed to Spark.
• To run with PySparkling, you must have the PySparkling zip file.
The H2OContext does not have to be created if you only want to run predictions on MOJOs using Spark. This is
because the scoring is independent of the H2O run-time.

Preparing Your Environment

In order to use the MOJO scoring pipeline, the Driverless AI license has to be passed to Spark. This can be achieved
via the --jars argument of the Spark launcher scripts.
Note: In Local Spark mode, please use --driver-class-path to specify path to the license file.

PySparkling

First, start PySpark with PySparkling Python package and Driverless AI license.
./bin/pyspark --jars license.sig --py-files pysparkling.zip

Alternatively, you can download the official Sparkling Water distribution from the H2O Download page. Follow the
steps on the Sparkling Water download page. Once you are in the Sparkling Water directory, you can call:
./bin/pysparkling --jars license.sig

At this point, you should have a PySpark interactive terminal available where you can try out predictions. If you
would like to productionize the scoring process, you can use the same configuration, except instead of using
./bin/pyspark, you would use ./bin/spark-submit to submit your job to a cluster.
# First, specify the dependencies
from pysparkling.ml import H2OMOJOPipelineModel, H2OMOJOSettings

# The 'namedMojoOutputColumns' option ensures that the output columns are named properly.
# If you want to use old behavior when all output columns were stored inside an array,
# set it to False. However we strongly encourage users to use True which is defined as a default value.
settings = H2OMOJOSettings(namedMojoOutputColumns = True)

# Load the pipeline. 'settings' is an optional argument. If it's not specified, the default values are used.
mojo = H2OMOJOPipelineModel.createFromMojo("file:///path/to/the/pipeline.mojo", settings)

# Load the data as Spark's Data Frame


dataFrame = spark.read.csv("file:///path/to/the/data.csv", header=True)

# Run the predictions. The predictions contain all the original columns plus the predictions
# added as new columns
predictions = mojo.transform(dataFrame)

# You can easily get the predictions for a desired column using the helper function as
predictions.select(mojo.selectPredictionUDF("AGE")).collect()


Sparkling Water

First, start Spark with Sparkling Water Scala assembly and Driverless AI license.
./bin/spark-shell --jars license.sig,sparkling-water-assembly.jar

Alternatively, you can download the official Sparkling Water distribution from the H2O Download page. Follow the
steps on the Sparkling Water download page. Once you are in the Sparkling Water directory, you can call:
./bin/sparkling-shell --jars license.sig

At this point, you should have a Sparkling Water interactive terminal available where you can carry out predictions.
If you would like to productionize the scoring process, you can use the same configuration, except instead of using
./bin/spark-shell, you would use ./bin/spark-submit to submit your job to a cluster.
// First, specify the dependencies
import ai.h2o.sparkling.ml.models.{H2OMOJOPipelineModel, H2OMOJOSettings}

// The 'namedMojoOutputColumns' option ensures that the output columns are named properly.
// If you want to use old behavior when all output columns were stored inside an array,
// set it to false. However we strongly encourage users to use true which is defined as a default value.
val settings = H2OMOJOSettings(namedMojoOutputColumns = true)

// Load the pipeline. 'settings' is an optional argument. If it's not specified, the default values are used.
val mojo = H2OMOJOPipelineModel.createFromMojo("file:///path/to/the/pipeline.mojo", settings)

// Load the data as Spark's Data Frame


val dataFrame = spark.read.option("header", "true").csv("file:///path/to/the/data.csv")

// Run the predictions. The predictions contain all the original columns plus the predictions
// added as new columns
val predictions = mojo.transform(dataFrame)

// You can easily get the predictions for desired column using the helper function as follows:
predictions.select(mojo.selectPredictionUDF("AGE"))

31.2 Driverless AI MOJO Scoring Pipeline - C++ Runtime with Python


and R Wrappers

The C++ Scoring Pipeline is provided as R and Python packages for the protobuf-based MOJO2 protocol. The pack-
ages are self contained, so no additional software is required. Simply build the MOJO Scoring Pipeline and begin
using your preferred method.
Notes:
• These scoring pipelines are currently not available for RuleFit models.
• The Download MOJO Scoring Pipeline button appears as Build MOJO Scoring Pipeline if the MOJO Scor-
ing Pipeline is disabled.

31.2.1 Downloading the Scoring Pipeline Runtimes

The R and Python packages can be downloaded from within the Driverless AI application. To do this, click Resources,
then click MOJO2 R Runtime and MOJO2 Py Runtime from the drop-down menu. In the pop-up menu that appears,
click the button that corresponds to the OS you are using. Choose from Linux, Mac OS X, and IBM PowerPC.


31.2.2 Examples

The following examples show how to use the R and Python APIs of the C++ MOJO runtime.

R Example

Prerequisites

• Linux OS (x86 or PPC) or Mac OS X (10.9 or newer)


• Driverless AI License (either file or environment variable)
• Rcpp (>=1.0.0)
• data.table

Running the MOJO2 R Runtime

# Install the R MOJO runtime using one of the methods below

# Install the R MOJO runtime on PPC Linux


install.packages("./daimojo_2.2.0_ppc64le-linux.tar.gz")

# Install the R MOJO runtime on x86 Linux


install.packages("./daimojo_2.2.0_x86_64-linux.tar.gz")

#Install the R MOJO runtime on Mac OS X


install.packages("./daimojo_2.2.0_x86_64-darwin.tar.gz")

# Load the MOJO


library(daimojo)
m <- load.mojo("./mojo-pipeline/pipeline.mojo")

# retrieve the creation time of the MOJO


create.time(m)
## [1] "2019-11-18 22:00:24 UTC"

# retrieve the UUID of the experiment


uuid(m)
## [1] "65875c15-943a-4bc0-a162-b8984fe8e50d"

# Load data and make predictions


col_class <- setNames(feature.types(m), feature.names(m)) # column names and types

library(data.table)
d <- fread("./mojo-pipeline/example.csv", colClasses=col_class, header=TRUE, sep=",")

predict(m, d)
## label.B label.M
## 1 0.08287659 0.91712341
## 2 0.77655075 0.22344925
## 3 0.58438434 0.41561566
## 4 0.10570505 0.89429495
## 5 0.01685609 0.98314391
## 6 0.23656610 0.76343390
## 7 0.17410333 0.82589667
## 8 0.10157948 0.89842052
## 9 0.13546191 0.86453809
## 10 0.94778244 0.05221756

Python Example

Prerequisites

• Linux OS (x86 or PPC) or Mac OS X (10.9 or newer)


• Driverless AI License (either file or environment variable)
• Python 3.6
• datatable. Run the following to install:


# Install on Linux PPC, Linux x86, or Mac OS X


pip install datatable

• Python MOJO runtime. Run one of the following commands after downloading from the GUI:
# Install the MOJO runtime on Linux PPC
pip install daimojo-2.2.0-cp36-cp36m-linux_ppc64le.whl

# Install the MOJO runtime on Linux x86


pip install daimojo-2.2.0-cp36-cp36m-linux_x86_64.whl

# Install the MOJO runtime on Mac OS X


pip install daimojo-2.2.0-cp36-cp36m-macosx_10_7_x86_64.whl

Running the MOJO2 Python Runtime

# import the daimojo model package


import daimojo.model

# specify the location of the MOJO


m = daimojo.model("./mojo-pipeline/pipeline.mojo")

# retrieve the creation time of the MOJO


m.created_time
# 'Mon November 18 14:00:24 2019'

# retrieve the UUID of the experiment


m.uuid

# retrieve a list of missing values


m.missing_values
# ['',
# '?',
# 'None',
# 'nan',
# 'NA',
# 'N/A',
# 'unknown',
# 'inf',
# '-inf',
# '1.7976931348623157e+308',
# '-1.7976931348623157e+308']

# retrieve the feature names


m.feature_names
# ['clump_thickness',
# 'uniformity_cell_size',
# 'uniformity_cell_shape',
# 'marginal_adhesion',
# 'single_epithelial_cell_size',
# 'bare_nuclei',
# 'bland_chromatin',
# 'normal_nucleoli',
# 'mitoses']

# retrieve the feature types


m.feature_types
# ['float32',
# 'float32',
# 'float32',
# 'float32',
# 'float32',
# 'float32',
# 'float32',
# 'float32',
# 'float32']

# retrieve the output names


m.output_names
# ['label.B', 'label.M']

# retrieve the output types


m.output_types
# ['float64', 'float64']

# import the datatable module


import datatable as dt

# parse the example.csv file


pydt = dt.fread("./mojo-pipeline/example.csv", na_strings=m.missing_values, header=True, separator=',')
pydt
#    clump_thickness  uniformity_cell_size  uniformity_cell_shape  marginal_adhesion  single_epithelial_cell_size  bare_nuclei  bland_chromatin  normal_nucleoli  mitoses
# 0                8                     1                      3                 10                            6            6                9                1        1
# 1                2                     1                      2                  2                            5            3                4                8        8
# 2                1                     1                      1                  9                            4           10                3                5        4
# 3                2                     6                      9                 10                            4            8                1                1        3
# 4               10                    10                      8                  1                            8            3                6                3        4
# 5                1                     8                      4                  5                           10            1                2                5        3
# 6                2                    10                      2                  9                            1            2                9                3        8
# 7                2                     8                      9                  2                           10           10                3                5        4
# 8                6                     3                      8                  5                            2            3                5                3        4
# 9                4                     2                      2                  8                            1            2                8                9        1
#
# [10 rows × 9 columns]

# retrieve the column types


pydt.stypes
# (stype.float64,
# stype.float64,
# stype.float64,
# stype.float64,
# stype.float64,
# stype.float64,
# stype.float64,
# stype.float64,
# stype.float64)

# make predictions on the example.csv file


res = m.predict(pydt)

# retrieve the predictions


res
# label.B label.M
# 0 0.0828766 0.917123
# 1 0.776551 0.223449
# 2 0.584384 0.415616
# 3 0.105705 0.894295
# 4 0.0168561 0.983144
# 5 0.236566 0.763434
# 6 0.174103 0.825897
# 7 0.101579 0.898421
# 8 0.135462 0.864538
# 9 0.947782 0.0522176

# [10 rows × 2 columns]

# retrieve the prediction column names


res.names
# ('label.B', 'label.M')

# retrieve the prediction column types


res.stypes
# (stype.float64, stype.float64)

# convert datatable results to common data types


# res.to_pandas() # need pandas
# res.to_numpy() # need numpy
res.to_list()

31.3 MOJO2 Javadoc

The downloaded mojo.zip file contains the entire scoring pipeline. This pipeline also includes a MOJO2 Javadoc,
which can be extracted by running the following in the mojo-pipeline folder:
jar -xf mojo2-runtime-javadoc.jar

This extracts the following files:


SaraJanes-MBP13TB:mojo-pipeline sarajane$ jar -xf mojo2-runtime-javadoc.jar
SaraJanes-MBP13TB:mojo-pipeline sarajane$ ls
META-INF index-all.html
README.txt index.html
ai mojo2-runtime-javadoc.jar
allclasses-frame.html mojo2-runtime.jar
allclasses-noframe.html overview-frame.html
classref.txt overview-summary.html
constant-values.html overview-tree.html
deprecated-list.html package-list
example.csv pipeline.mojo
help-doc.html run_example.sh
highlight-LICENSE.txt script.js
highlight.css serialized-form.html
highlight.pack.js stylesheet.css
SaraJanes-MBP13TB:mojo-pipeline sarajane$

Open the index.html file to view the MOJO2 Javadoc.



CHAPTER

THIRTYTWO

DEPLOYING THE MOJO PIPELINE

Driverless AI can deploy the MOJO scoring pipeline for you to test and/or to integrate into a final product.
Notes:
• This section describes how to deploy a MOJO scoring pipeline and assumes that a MOJO scoring pipeline exists.
Refer to the MOJO Scoring Pipelines section for information on how to build a MOJO scoring pipeline.
• This is an early feature that will continue to support additional deployments.

32.1 Deployments Overview Page

All of the existing MOJO scoring pipeline deployments are available in the Deployments Overview page, which is
available from the top menu. This page lists all active deployments and the information needed to access the respective
endpoints. In addition, it allows you to stop any deployments that are no longer needed.

32.2 Amazon Lambda Deployment

Driverless AI can deploy the trained MOJO scoring pipeline as an AWS Lambda function, i.e., a serverless scorer
running in the Amazon cloud and charged by actual usage.

32.2.1 Additional Resources

Refer to the aws-lambda-scorer folder in the dai-deployment-templates repository to see different deployment tem-
plates for AWS Lambda scorer.


32.2.2 Driverless AI Prerequisites

• Driverless AI MOJO Scoring Pipeline: To deploy a MOJO scoring pipeline as an AWS Lambda function, the
MOJO pipeline archive has to be created first by choosing the Build MOJO Scoring Pipeline option on the
completed experiment page. Refer to the MOJO Scoring Pipelines section for information on how to build a
MOJO scoring pipeline.
• Terraform v0.11.x (specifically v0.11.10 or greater): In addition, the Terraform tool (https://www.terraform.io/)
has to be installed on the system running Driverless AI. The tool is included in the Driverless AI Docker
images but not in native install packages. To install Terraform, follow the steps on the Terraform installation page.
Notes:
• Terraform is not available on every platform. In particular, there is no Power build, so AWS Lambda
Deployment is currently not supported on Power installations of Driverless AI.
• Terraform v0.12 is not supported. If you have v0.12 installed, you will need to downgrade to v0.11.x
(specifically v0.11.10 or greater) in order to deploy a MOJO scoring pipeline as an AWS Lambda
function.

32.2.3 AWS Prerequisites

Usage Plans

Usage plans must be enabled in the target AWS region in order for API keys to work when accessing the AWS Lambda
via its REST API. Refer to https://aws.amazon.com/blogs/aws/new-usage-plans-for-amazon-api-gateway/ for more
information.

Access Permissions

The following AWS access permissions need to be provided to the role in order for Driverless AI Lambda deployment
to succeed.
• AWSLambdaFullAccess
• IAMFullAccess
• AmazonAPIGatewayAdministrator

The policy can be further stripped down to restrict Lambda and S3 rights using the JSON policy definition as follows:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",

"Action": [
"iam:GetPolicyVersion",
"iam:DeletePolicy",
"iam:CreateRole",
"iam:AttachRolePolicy",
"iam:ListInstanceProfilesForRole",
"iam:PassRole",
"iam:DetachRolePolicy",
"iam:ListAttachedRolePolicies",
"iam:GetRole",
"iam:GetPolicy",
"iam:DeleteRole",
"iam:CreatePolicy",
"iam:ListPolicyVersions"
],
"Resource": [
"arn:aws:iam::*:role/h2oai*",
"arn:aws:iam::*:policy/h2oai*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "apigateway:*",
"Resource": "*"
},
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": [
"lambda:CreateFunction",
"lambda:ListFunctions",
"lambda:InvokeFunction",
"lambda:GetFunction",
"lambda:UpdateFunctionConfiguration",
"lambda:DeleteFunctionConcurrency",
"lambda:RemovePermission",
"lambda:UpdateFunctionCode",
"lambda:AddPermission",
"lambda:ListVersionsByFunction",
"lambda:GetFunctionConfiguration",
"lambda:DeleteFunction",
"lambda:PutFunctionConcurrency",
"lambda:GetPolicy"
],
"Resource": "arn:aws:lambda:*:*:function:h2oai*"
},
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::h2oai*/*",
"arn:aws:s3:::h2oai*"
]
}
]
}

32.2.4 Deploying on Amazon Lambda

Once the MOJO pipeline archive is ready, Driverless AI provides a Deploy (Local & Cloud) option on the completed
experiment page.
Note: This button is only available after the MOJO Scoring Pipeline has been built.


This option opens a new dialog for setting the AWS account credentials (or use those supplied in the Driverless AI
configuration file or environment variables), AWS region, and the desired deployment name (which must be unique
per Driverless AI user and AWS account used).

Amazon Lambda deployment parameters:


• Deployment Name: A unique name of the deployment. By default, Driverless AI offers a name based on the
name of the experiment and the deployment type. This has to be unique both for the Driverless AI user and the
AWS account used.
• Region: The AWS region to deploy the MOJO scoring pipeline to. It makes sense to choose a region geograph-
ically close to any client code calling the endpoint in order to minimize request latency. (See also AWS Regions
and Availability Zones.)
• Use AWS environment variables: If enabled, the AWS creden-
tials are taken from the Driverless AI configuration file (see records
deployment_aws_access_key_id and deployment_aws_secret_access_key)
or environment variables (DRIVERLESS_AI_DEPLOYMENT_AWS_ACCESS_KEY_ID and
DRIVERLESS_AI_DEPLOYMENT_AWS_SECRET_ACCESS_KEY). This would usually be entered by
the Driverless AI installation administrator.
• AWS Access Key ID and AWS Secret Access Key: Credentials to access the AWS account. This pair of secrets
identifies the AWS user and the account and can be obtained from the AWS account console.

32.2.5 Testing the Lambda Deployment

On a successful deployment, all the information needed to access the new endpoint (URL and an API Key) is printed,
and the same information is available in the Deployments Overview Page after clicking on the deployment row.

Note that the actual scoring endpoint is located at the path /score. In addition, to prevent DDoS and other malicious
activities, the resulting AWS lambda is protected by an API Key, i.e., a secret that has to be passed in as a part of the
request using the x-api-key HTTP header.
The request is a JSON object containing attributes:
• fields: A list of input column names that should correspond to the training data columns.
• rows: A list of rows that are in turn lists of cell values to predict the target values for.


• optional includeFieldsInOutput: A list of input columns that should be included in the output.
An example request providing 2 columns on the input and asking to get one column copied to the output looks as
follows:
{
"fields": [
"age", "salary"
],
"includeFieldsInOutput": [
"salary"
],
"rows": [
[
"48.0", "15000.0"
],
[
"35.0", "35000.0"
],
[
"18.0", "22000.0"
]
]
}

Assuming the request is stored locally in a file named test.json, the request to the endpoint can be sent, e.g., using
the curl utility, as follows:
URL={place the endpoint URL here}
API_KEY={place the endpoint API key here}
curl \
-d @test.json \
-X POST \
-H "x-api-key: ${API_KEY}" \
${URL}/score

The response is a JSON object with a single attribute score, which contains the list of rows with the optional copied
input values and the predictions.
For the example above with a two-class target field, the result is likely to look something like the following snippet.
The particular values would of course depend on the scoring pipeline:
{
  "score": [
    [
      "48.0",
      "0.6240277982943945",
      "0.045458571508101536"
    ],
    [
      "35.0",
      "0.7209441819603676",
      "0.06299909138586585"
    ],
    [
      "18.0",
      "0.7209441819603676",
      "0.06299909138586585"
    ]
  ]
}
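The same request can be issued from Python with the requests module. The sketch below mirrors the curl call above; the URL and API key are placeholders to be replaced with the values printed for your deployment.

# Sketch mirroring the curl example above; URL and API_KEY are placeholders.
import requests

URL = "place the endpoint URL here"
API_KEY = "place the endpoint API key here"

payload = {
    "fields": ["age", "salary"],
    "includeFieldsInOutput": ["salary"],
    "rows": [
        ["48.0", "15000.0"],
        ["35.0", "35000.0"],
        ["18.0", "22000.0"],
    ],
}

res = requests.post(URL + "/score", json=payload, headers={"x-api-key": API_KEY})
print(res.json()["score"])   # list of rows: copied input columns followed by predictions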


32.2.6 AWS Deployment Issues

We create a new S3 bucket per AWS Lambda deployment. The bucket names have to be unique throughout AWS
S3, and one user can create a maximum of 100 buckets. Therefore, we recommend setting the bucket name used for
deployment with the deployment_aws_bucket_name config option.
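For example, the following config.toml entry sets the bucket name used for deployments (the bucket name shown is
only an illustration):

deployment_aws_bucket_name = "my-dai-lambda-deployments"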

32.3 REST Server Deployment

This section describes how to deploy the trained MOJO scoring pipeline as a local Representational State Transfer
(REST) Server.

32.3.1 Additional Resources

The REST server deployment supports API endpoints such as model metadata, file/CSV scoring, etc. It uses SpringFox
for both programmatic and manual inspection of the API. Refer to the local-rest-scorer folder in the dai-deployment-
templates repository to see different deployment templates for Local REST scorers.

32.3.2 Prerequisites

• Driverless AI MOJO Scoring Pipeline: To deploy a MOJO scoring pipeline as a Local REST Scorer, the MOJO
pipeline archive has to be created first by choosing the Build MOJO Scoring Pipeline option on the completed
experiment page. Refer to the MOJO Scoring Pipelines section for information on how to build a MOJO scoring
pipeline.
• When using a firewall or a virtual private cloud (VPC), the ports that are used by the REST server must be
exposed.
• Ensure that you have enough memory and CPUs to run the REST scorer. Typically, a good estimate of the
amount of required memory is 12 times the size of the pipeline.mojo file. For example, a 100MB pipeline.mojo
file will require approximately 1200MB of RAM. (Note: To conveniently view in-depth information about your
system in Driverless AI, click on Resources at the top of the screen, then click System Info.)
• When running Driverless AI in a Docker container, you must expose ports on Docker for the REST service
deployment within the Driverless AI Docker container. For example, the following exposes the Driverless AI
Docker container to listen to port 8094 for requests arriving at the host port at 18094.
docker run \
-d \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 12181:12345 \
-p 18094:8094 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/<dai-image-name>:TAG


32.3.3 Deploying on REST Server

Once the MOJO pipeline archive is ready, Driverless AI provides a Deploy (Local & Cloud) option on the completed
experiment page.
Notes:
• This button is only available after the MOJO Scoring Pipeline has been built.
• This button is not available on PPC64LE environments.

This option opens a new dialog for setting the REST Server deployment name, port number, and maximum heap size
(optional).


1. Specify a name for the REST scorer in order to help track the deployed REST scorers.
2. Provide a port number on which the REST scorer will run. For example, if port number 8081 is selected, the
scorer will be available at http://my-ip-address:8081/models (see the example request after this list).
3. Optionally specify the maximum heap size for the Java Virtual Machine (JVM) running the REST scorer. This
can help constrain the REST scorer from overconsuming memory of the machine. Because the REST scorer is
running on the same machine as Driverless AI, it may be helpful to limit the amount of memory that is allocated
to the REST scorer. This option will limit the amount of memory the REST scorer can use, but it will also
produce an error if the memory allocated is not enough to run the scorer. (The amount of memory required is
mostly dependent on the size of MOJO. See Prerequisites for more information.)
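For example, once the scorer from step 2 is running, you can verify that it is up by requesting the model metadata
endpoint (replace my-ip-address and the port with your own values):

curl http://my-ip-address:8081/models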


32.3.4 Testing the REST Server Deployment

Note that the actual scoring endpoint is located at the path /score.
The request is a JSON object containing attributes:
• fields: A list of input column names that should correspond to the training data columns.
• rows: A list of rows that are in turn lists of cell values to predict the target values for.
• optional includeFieldsInOutput: A list of input columns that should be included in the output.
An example request that provides two input columns and asks for one column to be copied to the output looks as
follows:
{
"fields": [
"age", "salary"
],
"includeFieldsInOutput": [
"salary"
],
"rows": [
[
"48.0", "15000.0"
],
[
"35.0", "35000.0"
],
[
"18.0", "22000.0"
]
]
}

Assuming the request is stored locally in a file named test.json, the request to the endpoint can be sent, e.g., using
the curl utility, as follows:
URL={place the endpoint URL here}
curl \
-X POST \
-d @test.json \
-H "Content-Type: application/json" \
${URL}/score


The response is a JSON object with a single attribute score, which contains the list of rows with the optional copied
input values and the predictions.
For the example above with a two class target field, the result is likely to look something like the following snippet.
The particular values would of course depend on the scoring pipeline:
{
"score": [
[
"48.0",
"0.6240277982943945",
"0.045458571508101536"
],
[
"35.0",
"0.7209441819603676",
"0.06299909138586585"
],
[
"18.0",
"0.7209441819603676",
"0.06299909138586585"
]
]
}

32.3.5 REST Server Deployment Issues

When using Docker, local REST scorers are deployed within the same container as Driverless AI. As a result, all
REST scorers will be turned off if the Driverless AI container is closed. When using native installs (rpm/deb/tar.sh),
the REST scorers will continue to run even if Driverless AI is shut down.



CHAPTER

THIRTYTHREE

WHAT’S HAPPENING IN DRIVERLESS AI?

H2O Driverless AI is an automatic machine learning platform that uses feature engineering recipes from some of
the world’s best data scientists to deliver highly accurate machine learning models. As part of the automatic feature
engineering process, the system uses a variety of transformers to enhance the available data. This section describes
what’s happening underneath the hood, including details about the feature engineering transformations and time series
and natural language processing functionality.
Refer to one of the following topics:
• Data Sampling
• Driverless AI Transformations
• Internal Validation Technique
• Missing and Unseen Levels Handling
• Imputation in Driverless AI
• Time Series in Driverless AI
• NLP in Driverless AI



CHAPTER

THIRTYFOUR

DATA SAMPLING

Driverless AI does not perform any type of data sampling unless the dataset is big or highly imbalanced (in which
case sampling is used to improve accuracy). What is considered big depends on your accuracy setting and the
statistical_threshold_data_size_large parameter in the config.toml or in the Expert Settings.
You can see whether the data will be sampled by viewing the Experiment Preview when you set up the experiment. In
the experiment preview below, the data was sampled down to 5 million rows.

If Driverless AI decides to sample the data based on these settings and the data size, then Driverless AI will perform
the following types of sampling at the start of the experiment:


• Random sampling for regression problems


• Stratified sampling for classification problems
• Imbalanced sampling for binary problems where the data is considered imbalanced
– By default, imbalanced is defined as when the majority class is 5 times more common than the minority
class. (This is also configurable.)
With imbalanced sampling, there are multiple approaches:
• Sample both classes as needed depending on the data (automatic)
• Under-sample the majority class to reach class balance
• Over-sample the minority class and under-sample the majority class, depending on data
• Do not perform any sampling
When imbalanced sampling is enabled, sampling is usually performed with replacement, and repeated multiple times
to improve accuracy (bagging). By default, the number of bags is automatically determined, but can be specified in
expert settings.



CHAPTER

THIRTYFIVE

DRIVERLESS AI TRANSFORMATIONS

Transformations in Driverless AI are applied to columns in the data. The transformers create the engineered features
in experiments.
Driverless AI provides a number of transformers. The downloaded experiment logs include the transformations that
were applied to your experiment. Note that you can exclude transformations in the config.toml file, and that list of
excluded transformers will also be available in the experiment log.

35.1 Available Transformers

The following transformers are available for classification (multiclass and binary) and regression experiments.

35.1.1 Numeric Transformers (Integer, Real, Binary)

• ClusterDist Transformer
The Cluster Distance Transformer clusters selected numeric columns and uses the distance to a spe-
cific cluster as a new feature.
• ClusterTE Transformer
The Cluster Target Encoding Transformer clusters selected numeric columns and calculates the mean
of the response column for each cluster. The mean of the response is used as a new feature. Cross
Validation is used to calculate mean response to prevent overfitting.
• Interactions Transformer
The Interactions Transformer adds, divides, multiplies, and subtracts two numeric columns in the
data to create a new feature. This transformation uses a smart search to identify which feature pairs
to transform. Only interactions that improve the baseline model score are kept.
• InteractionsSimple Transformer
The InteractionsSimple Transformer adds, divides, multiplies, and subtracts two numeric columns in
the data to create a new feature. This transformation randomly selects pairs of features to transform.
• NumCatTE Transformer
The Numeric Categorical Target Encoding Transformer calculates the mean of the response column
for several selected columns. If one of the selected columns is numeric, it is first converted to cate-
gorical by binning. The mean of the response column is used as a new feature. Cross Validation is
used to calculate mean response to prevent overfitting.
• NumToCatTE Transformer


The Numeric to Categorical Target Encoding Transformer converts numeric columns to categoricals
by binning and then calculates the mean of the response column for each group. The mean of the
response for the bin is used as a new feature. Cross Validation is used to calculate mean response to
prevent overfitting.
• NumToCatWoEMonotonic Transformer
The Numeric to Categorical Weight of Evidence Monotonic Transformer converts a numeric col-
umn to categorical by binning and then calculates Weight of Evidence for each bin. The monotonic
constraint ensures the bins of values are monotonically related to the Weight of Evidence value. The
Weight of Evidence is used as a new feature. Weight of Evidence measures the “strength” of a group-
ing for separating good and bad risk and is calculated by taking the log of the ratio of distributions
for a binary response column.
• NumToCatWoE Transformer
The Numeric to Categorical Weight of Evidence Transformer converts a numeric column to categor-
ical by binning and then calculates Weight of Evidence for each bin. The Weight of Evidence is used
as a new feature. Weight of Evidence measures the “strength” of a grouping for separating good and
bad risk and is calculated by taking the log of the ratio of distributions for a binary response column.
• Original Transformer
The Original Transformer applies an identity transformation to a numeric column.
• TruncSVDNum Transformer
Truncated SVD Transformer trains a Truncated SVD model on selected numeric columns and uses
the components of the truncated SVD matrix as new features.

35.1.2 Time Series Experiments Transformers

• DateOriginal Transformer
The Date Original Transformer retrieves date values such as year, quarter, month, day, day of the
year, week, and weekday values.
• DateTimeOriginal Transformer
The Date Time Original Transformer retrieves date and time values such as year, quarter, month, day,
day of the year, week, weekday, hour, minute, and second values.
• EwmaLags Transformer
The Exponentially Weighted Moving Average (EWMA) Transformer calculates the exponentially
weighted moving average of target or feature lags.
• LagsAggregates Transformer
The Lags Aggregates Transformer calculates aggregations of target/feature lags like mean(lag7,
lag14, lag21) with support for mean, min, max, median, sum, skew, kurtosis, std. The aggregation is
used as a new feature.
• LagsInteraction Transformer
The Lags Interaction Transformer creates target/feature lags and calculates interactions between the
lags (lag2 - lag1, for instance). The interaction is used as a new feature.
• Lags Transformer
The Lags Transformer creates target/feature lags, possibly over groups. Each lag is used as a new fea-
ture. Lag transformers may apply to categorical (strings) features or binary/multiclass string valued
targets after they have been internally numerically encoded.


• LinearLagsRegression Transformer
The Linear Lags Regression transformer trains a linear model on the target or feature lags to predict
the current target or feature value. The linear model prediction is used as a new feature.

35.1.3 Categorical Transformers (String)

• Cat Transformer
The Cat Transformer sorts a categorical column in lexicographical order and uses the order index
created as a new feature. This transformer works with models that can handle categorical features.
• CatOriginal Transformer
The Categorical Original Transformer applies an identity transformation that leaves categorical fea-
tures as they are. This transformer works with models that can handle non-numeric feature values.
• CVCatNumEncode Transformer
The Cross Validation Categorical to Numeric Encoding Transformer calculates an aggregation of a
numeric column for each value in a categorical column (ex: calculate the mean Temperature for each
City) and uses this aggregation as a new feature.
• CVTargetEncode Transformer
The Cross Validation Target Encoding Transformer calculates the mean of the response column for
each value in a categorical column and uses this as a new feature. Cross Validation is used to calculate
mean response to prevent overfitting.
• Frequent Transformer
The Frequent Transformer calculates the frequency for each value in categorical column(s) and uses
this as a new feature. This count can be either the raw count or the normalized count.
• LexiLabelEncoder Transformer
The Lexi Label Encoder sorts a categorical column in lexicographical order and uses the order index
created as a new feature.
• NumCatTE Transformer
The Numeric Categorical Target Encoding Transformer calculates the mean of the response column
for several selected columns. If one of the selected columns is numeric, it is first converted to cate-
gorical by binning. The mean of the response column is used as a new feature. Cross Validation is
used to calculate mean response to prevent overfitting.
• OneHotEncoding Transformer
The One-hot Encoding transformer converts a categorical column to a series of boolean features by
performing one-hot encoding. The boolean features are used as new features.
• SortedLE Transformer
The Sorted Label Encoding Transformer sorts a categorical column by the response column and uses
the order index created as a new feature.
• WeightOfEvidence Transformer
The Weight of Evidence Transformer calculates Weight of Evidence for each value in categorical
column(s). The Weight of Evidence is used as a new feature. Weight of Evidence measures the
“strength” of a grouping for separating good and bad risk and is calculated by taking the log of the
ratio of distributions for a binary response column.


This only works with a binary target variable. The likelihood needs to be created within a stratified
kfold if a fit_transform method is used. More information can be found here: http://ucanalytics.com/
blogs/information-value-and-weight-of-evidencebanking-case/.
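To illustrate the Weight of Evidence calculation described above, the following is a small sketch in Python. It is only
an illustration of the formula (the log of the ratio of the "good" and "bad" distributions for each categorical value),
not Driverless AI's internal implementation, and the sign convention can vary:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state":  ["NY", "NY", "NY", "CA", "CA", "CA", "TX", "TX"],
    "target": [1, 1, 0, 0, 0, 1, 1, 0],   # binary response: 1 = "bad", 0 = "good"
})

total_bad = (df["target"] == 1).sum()
total_good = (df["target"] == 0).sum()

for value, grp in df.groupby("state"):
    pct_bad = (grp["target"] == 1).sum() / total_bad    # share of all "bad" rows in this level
    pct_good = (grp["target"] == 0).sum() / total_good  # share of all "good" rows in this level
    woe = np.log(pct_good / pct_bad)                    # log of the ratio of distributions
    print(value, round(woe, 3))   # CA: 0.693, NY: -0.693, TX: 0.0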

35.1.4 Text Transformers (String)

• TextBiGRU Transformer
The Text Bidirectional GRU Transformer trains a bi-directional GRU TensorFlow model on word
embeddings created from a text feature to predict the response column. The GRU prediction is used
as a new feature. Cross Validation is used when training the GRU model to prevent overfitting.
• TextCharCNN Transformer
The Text Character CNN Transformer trains a CNN TensorFlow model on character embeddings
created from a text feature to predict the response column. The CNN prediction is used as a new
feature. Cross Validation is used when training the CNN model to prevent overfitting.
• TextCNN Transformer
The Text CNN Transformer trains a CNN TensorFlow model on word embeddings created from a
text feature to predict the response column. The CNN prediction is used as a new feature. Cross
Validation is used when training the CNN model to prevent overfitting.
• TextLinModel Transformer
The Text Linear Model Transformer trains a linear model on a TF-IDF matrix created from a text
feature to predict the response column. The linear model prediction is used as a new feature. Cross
Validation is used when training the linear model to prevent overfitting.
• Text Transformer
The Text Transformer tokenizes a text column and creates a TFIDF matrix (term frequency-inverse
document frequency) or count (count of the word) matrix. This may be followed by dimensionality
reduction using truncated SVD. Selected components of the TF-IDF/Count matrix are used as new
features.

35.1.5 Time Transformers (Date, Time)

• Dates Transformer
The Dates Transformer retrieves any date values, including:
– Year
– Quarter
– Month
– Day
– Day of year


– Week
– Week day
– Hour
– Minute
– Second
• IsHoliday Transformer
The Is Holiday Transformer determines if a date column is a holiday. A boolean column indicating
if the date is a holiday is added as a new feature. Creates a separate feature for holidays in the
United States, United Kingdom, Germany, Mexico, and the European Central Bank. Other countries
available in the python Holiday package can be added via the configuration file.

35.2 Example Transformations

In this section, we will describe some of the available transformations using the example of predicting house prices on
the example dataset.

Date Built Square Footage Num Beds Num Baths State Price
01/01/1920 1700 3 2 NY $700K

35.2.1 Frequent Transformer

• the count of each categorical value in the dataset


• the count can be either the raw count or the normalized count

Date Built Square Footage Num Beds Num Baths State Price Freq_State
01/01/1920 1700 3 2 NY 700,000 4,500

There are 4,500 properties in this dataset with state = NY.

35.2.2 Bulk Interactions Transformer

• add, divide, multiply, and subtract two columns in the data

Date Built  Square Footage  Num Beds  Num Baths  State  Price  Interaction_NumBeds#subtract#NumBaths
01/01/1920  1700  3  2  NY  700,000  1

There is one more bedroom than there are number of bathrooms for this property.


35.2.3 Truncated SVD Numeric Transformer

• truncated SVD trained on selected numeric columns of the data


• the components of the truncated SVD will be new features

Date Built  Square Footage  Num Beds  Num Baths  State  Price  TruncSVD_Price_NumBeds_NumBaths_1
01/01/1920  1700  3  2  NY  700,000  0.632

The first component of the truncated SVD of the columns Price, Number of Beds, Number of Baths.

35.2.4 Dates Transformer

• get year, get quarter, get month, get day, get day of year, get week, get week day, get hour, get minute, get
second

Date Built Square Footage Num Beds Num Baths State Price DateBuilt_Month
01/01/1920 1700 3 2 NY 700,000 1

The home was built in the month of January.

35.2.5 Text Transformer

• transform text column using methods: TFIDF or count (count of the word)
• this may be followed by dimensionality reduction using truncated SVD

35.2.6 Categorical Target Encoding Transformer

• cross validation target encoding done on a categorical column

Date Built Square Footage Num Beds Num Baths State Price CV_TE_State
01/01/1920 1700 3 2 NY 700,000 550,000

The average price of properties in NY state is $550,000*.


*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.

35.2.7 Numeric to Categorical Target Encoding Transformer

• numeric column converted to categorical by binning


• cross validation target encoding done on the binned numeric column

Date Built Square Footage Num Beds Num Baths State Price CV_TE_SquareFootage
01/01/1920 1700 3 2 NY 700,000 345,000

The column Square Footage has been bucketed into 10 equally populated bins. This property lies in the Square
Footage bucket 1,572 to 1,749. The average price of properties with this range of square footage is $345,000*.
*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.


35.2.8 Cluster Target Encoding Transformer

• selected columns in the data are clustered


• target encoding is done on the cluster ID

Date Built  Square Footage  Num Beds  Num Baths  State  Price  ClusterTE_4_NumBeds_NumBaths_SquareFootage
01/01/1920  1700  3  2  NY  700,000  450,000

The columns: Num Beds, Num Baths, Square Footage have been segmented into 4 clusters. The average
price of properties in the same cluster as the selected property is $450,000*.
*In order to prevent overfitting, Driverless AI calculates this average on out-of-fold data using cross validation.

35.2.9 Cluster Distance Transformer

• selected columns in the data are clustered


• the distance to a chosen cluster center is calculated

Date Built  Square Footage  Num Beds  Num Baths  State  Price  ClusterDist_4_NumBeds_NumBaths_SquareFootage_1
01/01/1920  1700  3  2  NY  700,000  0.83

The columns: Num Beds, Num Baths, Square Footage have been segmented into 4 clusters. The distance
from this record to Cluster 1 is 0.83.



CHAPTER

THIRTYSIX

INTERNAL VALIDATION TECHNIQUE

This section describes the technique behind internal validation in Driverless AI.
For the experiment, Driverless AI will either:
(1) split the data into a training set and internal validation set
or
(2) use cross validation to split the data into 𝑛 folds
Driverless AI chooses the method based on the size of the data and the Accuracy setting. For method 1, part of the
data is removed to be used for internal validation. (Note: This train and internal validation split may be repeated if the
data is small so that more data can be used for training.)
For method 2, however, no data is wasted for internal validation. With cross validation, the whole dataset is utilized,
and each model is trained on a different subset of the training data. The following visualization shows an example of
cross validation with 5 folds.

Driverless AI randomly splits the data into the specified number of folds for cross validation.
Driverless AI will not automatically create the internal validation data randomly if a user provides a Fold Column or a
Validation Dataset. If a Fold Column or a Validation Dataset is provided, Driverless AI will use that data to calculate
the performance of the Driverless AI models and to calculate all performance graphs and statistics.
If the experiment is a Time Series use case, and a Time Column is selected, Driverless AI will change the way the
internal validation data is created. In the case of temporal data, it is important to train on historical data and validate
on more recent data. Driverless AI does not perform random splits, but instead respects the temporal nature of the
data to prevent any data leakage. In addition, the train/validation split is a function of the time gap between train and test
as well as the forecast horizon (amount of time periods to predict). If test data is provided, Driverless AI will suggest
values for these parameters that lead to a validation set that resembles the test set as much as possible. But users can
control the creation of the validation split in order to adjust it to the actual application.



CHAPTER

THIRTYSEVEN

MISSING AND UNSEEN LEVELS HANDLING

This section describes how missing and unseen levels are handled by each algorithm during training and scoring.

37.1 How Does the Algorithm Handle Missing Values During Train-
ing?

37.1.1 LightGBM, XGBoost, RuleFit

Driverless AI treats missing values natively. (I.e., a missing value is treated as a special value.) Experiments rarely
benefit from imputation techniques, unless the user has a strong understanding of the data.

37.1.2 GLM

Driverless AI automatically performs mean value imputation (equivalent to setting the value to zero after standardiza-
tion).

37.1.3 TensorFlow

Driverless AI provides an imputation setting for TensorFlow in the config.toml file: tf_nan_impute_value (post-
normalization). If you set this option to 0, then missing values will be imputed by the mean. Setting it to (for
example) +5 will specify 5 standard deviations above the mean of the distribution. The default value in Driverless AI
is -5, which specifies that TensorFlow will treat missing values as outliers on the negative end of the spectrum. Specify
0 if you prefer mean imputation.
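For example, to switch TensorFlow models to mean imputation, this setting could be changed in config.toml:

# Default is -5 (treat missing values as strong negative outliers).
tf_nan_impute_value = 0   # 0 = impute missing values with the (post-normalization) mean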

37.1.4 FTRL

In FTRL, missing values have their own representation for each datable column type. These representations are used
to hash the missing value, with their column’s name, to an integer. This means FTRL replaces missing values with
special constants that are the same for each column type, and then treats these special constants like a normal data
value.


37.2 How Does the Algorithm Handle Missing Values During Scoring
(Production)?

37.2.1 LightGBM, XGBoost, RuleFit

If missing data is present during training, these tree-based algorithms learn the optimal direction for missing data for
each split (left or right). This optimal direction is then used for missing values during scoring. If no missing data was
present during training (for a particular feature), then the majority path is followed if the value is missing during scoring.

37.2.2 GLM

Missing values are replaced by the mean value (from training), same as in training.

37.2.3 TensorFlow

Missing values are replaced by the same value as specified during training (parameterized by tf_nan_impute_value).

37.2.4 FTRL

To ensure consistency, FTRL treats missing values during scoring in exactly the same way as during training.

37.2.5 Clustering in Transformers

Missing values are replaced with the mean along each column. This is used only on numeric columns.

37.2.6 Isolation Forest Anomaly Score Transformer

Isolation Forest uses out-of-range imputation, which fills missing values with values beyond the maximum.

37.3 What Happens When You Try to Predict on a Categorical Level


Not Seen During Training?

37.3.1 XGBoost, LightGBM, RuleFit, TensorFlow, GLM

Driverless AI’s feature engineering pipeline will compute a numeric value for every categorical level present in the
data, whether it’s a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For
target encoding, the global mean of the target value will be used.


37.3.2 FTRL

FTRL models don’t distinguish between categorical and numeric values. Whether or not FTRL saw a particular value
during training, it will hash all the data, row by row, to numeric and then make predictions. Because you can think of
FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions
for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable “overlap” in terms of
unique values with the ones used to make predictions.

37.4 What Happens if the Response Has Missing Values?

All algorithms will skip an observation (aka record) if the response value is missing.



CHAPTER

THIRTYEIGHT

IMPUTATION IN DRIVERLESS AI

The impute feature allows you to fill in missing values with substituted values. Missing values can be imputed based on
the column’s mean, median, minimum, maximum, or mode value. You can also impute based on a specific percentile
or by a constant value.
The imputation value is either precomputed on all of the data or computed inside the pipeline (based only on what is in the training split).
The following guidelines should be followed when performing imputation:
• For constant imputation on numeric columns, constant must be numeric.
• For constant imputation on string columns, constant must be a string.
• For percentile imputation, the percentage value must be between 0 and 100.
Notes:
• This feature is experimental.
• Time columns cannot be imputed.

38.1 Enabling Imputation

Imputation is disabled by default. It can be enabled by setting enable_imputation=true in the config.toml (for
native installs) or via the DRIVERLESS_AI_ENABLE_IMPUTATION=true environment variable (Docker image
installs). This enables imputation functionality in transformers.
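For example:

# Native installs -- config.toml:
enable_imputation = true

# Docker image installs -- environment variable:
DRIVERLESS_AI_ENABLE_IMPUTATION=true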

38.2 Running an Experiment with Imputation

Once imputation is enabled, you will have the option when running an experiment to add imputation columns.
1. Click on Columns Imputation in the Experiment Setup page.


2. Click on Add Imputation in the upper-right corner.


3. Select the column that contains missing values you want to impute.
4. Select the imputation type. Available options are:
• mean: The column’s numeric mean value displays by default. (Default method for numeric values.)
• median: When selected, the column’s numeric median value displays by default.
• min: When selected, the column’s numeric minimum value displays by default.
• max: When selected, the column’s numeric maximum value displays by default.
• const: Enter a string of characters. (Default method for string columns)
• mode: When selected, the column’s numeric mode value displays by default.
• percentile: Specify a percentile rank value between 0 and 100. (Defaults to 95.) In addition, specify a numeric
imputed value.
5. Optionally allow Driverless AI to compute the imputation value during validation instead of using the inputted
imputed value.
6. Click Save when you are done.


7. At this point, you can add additional imputations, delete the imputation you just created, or close this form and
return to the experiment. Note that each column can have only a single imputation.



CHAPTER

THIRTYNINE

TIME SERIES IN DRIVERLESS AI

Time-series forecasting is one of the most common and important tasks in business analytics. There are many real-
world applications like sales, weather, stock market, and energy demand, just to name a few. At H2O, we believe
that automation can help our users deliver business value in a timely manner. Therefore, we combined advanced time
series analysis and our Kaggle Grand Masters’ time-series recipes into Driverless AI.
The key features/recipes that make automation possible are:
• Automatic handling of time groups (e.g., different stores and departments)
• Robust time-series validation
– Accounts for gaps and forecast horizon
– Uses past information only (i.e., no data leakage)
• Time-series-specific feature engineering recipes
– Date features like day of week, day of month, etc.
– AutoRegressive features, like optimal lag and lag-features interaction
– Different types of exponentially weighted moving averages
– Aggregation of past information (different time groups and time intervals)
– Target transformations and differentiation
• Integration with existing feature engineering functions (recipes and optimization)
• Rolling-window based predictions for time series experiments with test-time augmentation or re-fit
• Automatic pipeline generation (See “From Kaggle Grand Masters’ Recipes to Production Ready in a Few
Clicks” blog post.)

39.1 Understanding Time Series

39.1.1 Modeling Approach

Driverless AI uses GBMs, GLMs and neural networks with a focus on time-series-specific feature engineering. The
feature engineering includes:
• Autoregressive elements: creating lag variables
• Aggregated features on lagged variables: moving averages, exponential smoothing, descriptive statistics, correlations
• Date-specific features: week number, day of week, month, year


• Target transformations: Integration/Differentiation, univariate transforms (like logs, square roots)


This approach is combined with AutoDL features as part of the genetic algorithm. The selection is still based on
validation accuracy. In other words, the same transformations/genes apply; plus there are new transformations that
come from time series. Some transformations (like target encoding) are deactivated.
When running a time-series experiment, Driverless AI builds multiple models by rolling the validation window back
in time (and potentially using less and less training data).

39.1.2 User-Configurable Options

Gap

The guiding principle for properly modeling a time series forecasting problem is to use the historical data in the model
training dataset such that it mimics the data/information environment at scoring time (i.e. deployed predictions).
Specifically, you want to partition the training set to account for: 1) the information available to the model when
making predictions and 2) the number of units out that the model should be optimized to predict.
Given a training dataset, the gap and forecast horizon are parameters that determine how to split the training dataset
into training samples and validation samples.
Gap is the number of missing time bins between the end of the training set and the start of the test set (with regard to time).
For example:
• Assume there are daily data with days 1/1/2019, 2/1/2019, 3/1/2019, 4/1/2019 in train. There are 4 days in total
for training.
• In addition, the test data will start from 6/1/2019. There is only 1 day in the test data.
• The previous day (5/1/2019) does not belong to the train data. It is a day that cannot be used for training (i.e
because information from that day may not be available at scoring time). This day cannot be used to derive
information (such as historical lags) for the test data either.
• Here the time bin (or time unit) is 1 day. This is the time interval that separates the different samples/rows in the
data.
• In summary, there are 4 time bins/units for the train data and 1 time bin/unit for the test data plus the Gap.
• In order to estimate the Gap between the end of the train data and the beginning of the test data, the following
formula is applied.
• Gap = min(time bin test) - max(time bin train) - 1.
• In this case min(time bin test) is 6 (or 6/1/2019). This is the earliest (and only) day in the test data.
• max(time bin train) is 4 (or 4/1/2019). This is the latest (or the most recent) day in the train data.
• Therefore the GAP is 1 time bin (or 1 day in this case), because Gap = 6 - 4 - 1 or Gap = 1
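The same calculation can be written as a small sketch in Python (using the daily time bins from the example above,
with 1/1/2019-4/1/2019 in train and 6/1/2019 in test):

from datetime import date

max_train_bin = date(2019, 1, 4)   # latest day in the train data (4/1/2019)
min_test_bin = date(2019, 1, 6)    # earliest day in the test data (6/1/2019)

# Gap = min(time bin test) - max(time bin train) - 1, measured in daily time bins
gap = (min_test_bin - max_train_bin).days - 1
print(gap)   # 1 -> one missing daily time bin (5/1/2019)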


Forecast Horizon

Quite often, it is not possible to have the most recent data available when applying a model (or it is costly to update
the data table too often); hence models need to be built accounting for a “future gap”. For example, if it takes a week
to update a certain data table, ideally we would like to predict “7 days ahead” with the data as it is “today”; hence a
gap of 7 days would be sensible. Not specifying a gap and predicting 7 days ahead with the data as it is would be
unrealistic (and cannot happen, as we update the data on a weekly basis in this example). Similarly, the gap can be
used by those who want to forecast further in advance. For example, if users want to know what will happen 7 days
in the future, they can set the gap to 7 days.
Forecast Horizon (or prediction length) is the period that the test data spans for (for example, one day, one week,
etc.). In other words it is the future period that the model can make predictions for (or the number of units out that
the model should be optimized to predict). Forecast horizon is used in feature selection and engineering and in model
selection. Note that forecast horizon might not equal the number of predictions. The actual predictions are determined
by the test dataset.

The periodicity of updating the data may require model predictions to account for significant time in the future. In
an ideal world where data can be updated very quickly, predictions can always be made having the most recent data
available. In this scenario there is no need for a model to be able to predict cases that are well into the future, but rather
focus on maximizing its ability to predict short term. However this is not always the case, and a model needs to be
able to make predictions that span deep into the future because it may be too costly to make predictions every single
day after the data gets updated.
In addition, each future data point is not the same. For example, predicting tomorrow with today’s data is easier than
predicting 2 days ahead with today’s data. Hence specifying the forecast horizon can facilitate building models that
optimize prediction accuracy for these future time intervals.

time_period_in_seconds

Note: This is only available in the Python and R clients. Time period in seconds cannot be specified in the UI.
In Driverless AI, the forecast horizon (a.k.a., num_prediction_periods) needs to be specified in periods, and the
real-time size of a period is unknown. To overcome this, you can use the optional time_period_in_seconds parameter
when running start_experiment_sync (in Python) or train (in R). This is used to specify the forecast horizon (as well
as the gap) in real time units. If this parameter is not specified, then Driverless AI will automatically detect the period
size in the experiment, and the forecast horizon value will respect this period. I.e., if you are sure that your data has a
1 week period, you can say num_prediction_periods=14; otherwise it is possible that the model will not work
correctly.
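The following is a rough sketch of how these parameters might be passed with the Python client. Only
num_prediction_periods and time_period_in_seconds are taken from the description above; the import, the connection
call, and the remaining argument names are assumptions that can differ between Driverless AI versions, so check the
client documentation for your release:

from h2oai_client import Client

h2oai = Client(address="http://localhost:12345", username="user", password="pass")

experiment = h2oai.start_experiment_sync(
    dataset_key="<training dataset key>",   # key of a dataset already added to Driverless AI (assumption)
    target_col="Weekly_Sales",              # assumption: example target column
    is_classification=False,
    time_col="Date",                        # assumption: example time column
    num_prediction_periods=1,               # forecast horizon, in periods
    time_period_in_seconds=7 * 24 * 3600,   # declare that one period is one week
    accuracy=5, time=5, interpretability=5,
)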


Groups

Groups are categorical columns in the data that can significantly help predict the target variable in time series problems.
For example, one may need to predict sales given information about stores and products. Being able to identify that
the combination of store and product can lead to very different sales is key for predicting the target variable, as a big
store or a popular product will have higher sales than a small store or an unpopular product.
For example, if we don’t know that the store is available in the data, and we try to see the distribution of sales over
time (with all stores mixed together), it may look like this:

The same graph grouped by store gives a much clearer view of what the sales look like for different stores.


Lag

The primary generated time series features are lag features, which are a variable’s past values. At a given sample with
time stamp 𝑡, features at some time difference 𝑇 (lag) in the past are considered. For example, if the sales today are
300, and sales of yesterday are 250, then the lag of one day for sales is 250. Lags can be created on any feature as well
as on the target.

As previously noted, the training dataset is appropriately split such that the amount of validation data samples equals
that of the testing dataset samples. If we want to determine valid lags, we must consider what happens when we
evaluate our model on the testing dataset. Essentially, the minimum lag size must be greater than the gap size.
Aside from the minimum usable lag, Driverless AI attempts to discover predictive lag sizes based on auto-
correlation.
“Lagging” variables are important in time series because knowing what happened in different time periods in the past
can greatly facilitate predictions for the future. Consider the following example to see the lag of 1 and 2 days:

Date Sales Lag1 Lag2


1/1/2018 100 - -
2/1/2018 150 100 -
3/1/2018 160 150 100
4/1/2018 200 160 150
5/1/2018 210 200 160
6/1/2018 150 210 200
7/1/2018 160 150 210
8/1/2018 120 160 150
9/1/2018 80 120 160
10/1/2018 70 80 120


39.1.3 Settings Determined by Driverless AI

Window/Moving Average

Using the above Lag table, a moving average of 2 would constitute the average of Lag1 and Lag2:

Date Sales Lag1 Lag2 MA2


1/1/2018 100 - - -
2/1/2018 150 100 - -
3/1/2018 160 150 100 125
4/1/2018 200 160 150 155
5/1/2018 210 200 160 180
6/1/2018 150 210 200 205
7/1/2018 160 150 210 180
8/1/2018 120 160 150 155
9/1/2018 80 120 160 140
10/1/2018 70 80 120 100

Aggregating multiple lags together (instead of using just one) can add stability when predicting the target variable. The
aggregation may include various lag values, for example lags [1-30], lags [20-40], or lags [7-70 by 7].
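As an illustration only (this is not Driverless AI’s internal code), the lag and moving-average columns from the tables
above can be reproduced with pandas:

import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "Sales": [100, 150, 160, 200, 210, 150, 160, 120, 80, 70],
})
df["Lag1"] = df["Sales"].shift(1)          # value one time bin in the past
df["Lag2"] = df["Sales"].shift(2)          # value two time bins in the past
df["MA2"] = (df["Lag1"] + df["Lag2"]) / 2  # moving average of the two lags
print(df)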

Exponential Weighting

Exponential weighting is a form of weighted moving average where more recent values have higher weight than less
recent values. The weight decreases exponentially over time based on an alpha (a) hyperparameter in (0, 1), which is
normally within the range of [0.9 - 0.99]. For example:
• Exponential Weight = a**(time)
• If sales 1 day ago = 3.0 and sales 2 days ago = 4.5 and a = 0.95:
• Exp. smooth = (3.0*(0.95**1) + 4.5*(0.95**2)) / ((0.95**1) + (0.95**2)) = 3.73 approx.
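The numbers in this example can be checked with a few lines of Python:

a = 0.95                       # alpha hyperparameter
sales = {1: 3.0, 2: 4.5}       # sales value by number of days in the past
weights = {t: a ** t for t in sales}
ewma = sum(sales[t] * weights[t] for t in sales) / sum(weights.values())
print(round(ewma, 2))          # 3.73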

39.2 Rolling-Window-Based Predictions

Driverless AI supports rolling-window-based predictions for time-series experiments with two options: Test Time
Augmentation (TTA) or re-fit.
This process is automated when the test set spans a longer period than the forecast horizon. If the user does not
provide a test set, but then scores one after the experiment is finished, rolling predictions will still be applied as long
as the selected horizon is shorter than the test set.
When using Rolling Windows and TTA, Driverless AI takes into account the Prediction Duration and the Rolling
Duration.
• Prediction Duration (PD): This is the duration configured as the forecast horizon while training the Driverless AI
experiment. If you don’t want to predict beyond the horizon configured during experiment training using the
experiment’s scoring pipeline, then PD may be the same as the Test Data Duration/Horizon, and the
situation is shown in the previous Horizon image (above).
Note: When using TTA, the prediction duration represents the forecast horizon during experiment train-
ing. During scoring, the prediction duration will be the duration of data passed to score for each invocation
of the score_batch method of the scoring module.


• Rolling Duration (RD): This is the amount of time by which we move ahead (roll) before we score again for the
next prediction duration of data.

39.3 Time Series Constraints

39.3.1 Dataset Size

Usually, the forecast horizon (prediction length) H equals the number of time periods in the testing data N_TEST (i.e.
N_TEST = H). You want to have enough training data time periods N_TRAIN to score well on the testing dataset. At
a minimum, the training dataset should contain at least three times as many time periods as the testing dataset (i.e.
N_TRAIN >= 3 * N_TEST). This allows for the training dataset to be split into a validation set with the same amount of
time periods as the testing dataset while maintaining enough historical data for feature engineering.

39.4 Time Series Use Case: Sales Forecasting

Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a
machine learning problem, we formulate the historical sales data and additional attributes as shown below:
Raw data


Data formulated for machine learning

The additional attributes are attributes that we will know at time of scoring. In this example, we want to forecast the
next week of sales. Therefore, all of the attributes included in our data must be known at least one week in advance.
In this case, we assume that we will know whether or not a Store and Department will be running a promotional
markdown. We will not use features like the temperature of the week since we will not have that information at the
time of scoring.
Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine
learning and sort out the rest. If this is your very first session, the Driverless AI assistant will guide you through the
journey.


Similar to previous Driverless AI examples, you need to select the dataset for training/test and define the target. For
time-series, you need to define the time column (by choosing AUTO or selecting the date column manually). If
weighted scoring is required (like the Walmart Kaggle competition), you can select the column with specific weights
for different samples.

If you prefer to use automatic handling of time groups, you can leave the setting for time groups columns as AUTO,
or you can define specific time groups. You can also specify the columns that will be unavailable at prediction time,
the forecast horizon (in weeks), and the gap (in weeks) between the train and test periods.
Once the experiment is finished, you can make new predictions and download the scoring pipeline just like with any
other Driverless AI experiment.

39.5 Time Series Expert Settings

The user may further configure the time series experiments with a dedicated set of options available through the
EXPERT SETTINGS. The EXPERT SETTINGS panel is available from within the experiment page right above the
Scorer knob.
Refer to Time Series Settings for information about the available Time Series Settings options.

39.6 Using a Driverless AI Time Series Model to Forecast

When you set the experiment’s forecast horizon, you are telling the Driverless AI experiment the dates this model will
be asked to forecast for. In the Walmart Sales example, we set the Driverless AI forecast horizon to 1 (1 week in the
future). This means that Driverless AI expects this model to be used to forecast 1 week after training ends. Since the
training data ends on 2012-10-26, then this model should be used to score for the week of 2012-11-02.
What should the user do once the 2012-11-02 week has passed?
There are two options:
Option 1: Trigger a Driverless AI experiment to be trained once the forecast horizon ends. A Driverless AI experiment
will need to be re-trained every week.
Option 2: Use Test Time Augmentation to update historical features so that we can use the same model to forecast
outside of the forecast horizon.
Test Time Augmentation refers to the process where the model stays the same but the features are refreshed using the
latest data. In our Walmart Sales Forecasting example, a feature that may be very important is the Weekly Sales from

470 Chapter 39. Time Series in Driverless AI


Using Driverless AI, Release 1.8.4.1

the previous week. Once we move outside of the forecast horizon, our model no longer knows the Weekly Sales from
the previous week. By performing Test Time Augmentation, Driverless AI will automatically generate these historical
features if new data is provided.
In Option 1, we would launch a new Driverless AI experiment every week with the latest data and use the resulting
model to forecast the next week. In Option 2, we would continue using the same Driverless AI experiment outside of
the forecast horizon by using Test Time Augmentation.
Both options have their advantages and disadvantages. By re-training an experiment with the latest data, Driverless
AI has the ability to possibly improve the model by changing the features used, choosing a different algorithm, and/or
selecting different parameters. As the data changes over time, for example, Driverless AI may find that the best
algorithm for this use case has changed.
There may be clear advantages for retraining an experiment after each forecast horizon or for using Test Time Aug-
mentation. Refer to this example to see how to use the scoring pipeline to predict future data instead of using the
prediction endpoint on the Driverless AI server.
Using Test Time Augmentation to continue using the same experiment over a longer period of time means there is
no need to continually repeat a model review process. However, the model may become out of date over time, and
the MOJO scoring pipeline is not supported with Test Time Augmentation.

Scoring Method             Retraining Model    Test Time Augmentation
Driverless AI Scoring      Supported           Supported
Python Scoring Pipeline    Supported           Supported
MOJO Scoring Pipeline      Supported           Not Supported

For different use cases, there may be clear advantages for retraining an experiment after each forecast horizon or for
using Test Time Augmentation. In this notebook, we show how to perform both and compare the performance: Time
Series Model Rolling Window.
How to trigger Test Time Augmentation?
To tell Driverless AI to perform Test Time Augmentation, simply create your forecast data to include any data that
occurred after the training data ended, up to the date you want a forecast for. The rows for the dates that you want
Driverless AI to forecast should have NA in the target column. Here is an example of forecasting 2012-11-09.

Date         Store  Dept  Mark Down 1  Mark Down 2  Weekly_Sales
2012-11-02   1      1     -1           -1           $40,000
2012-11-09   1      1     -1           -1           NA

If we do not include an NA in the Target column for the date we are interested in forecasting, then Test Time Augmen-
tation will not be triggered.
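As an illustration, such a forecast frame could be built with pandas before passing it to the scoring pipeline (the file
name and column names follow the example above; this is a sketch, not required tooling):

import numpy as np
import pandas as pd

forecast = pd.DataFrame({
    "Date": ["2012-11-02", "2012-11-09"],
    "Store": [1, 1],
    "Dept": [1, 1],
    "Mark Down 1": [-1, -1],
    "Mark Down 2": [-1, -1],
    "Weekly_Sales": [40000.0, np.nan],   # NA in the target marks the date to forecast
})
forecast.to_csv("forecast_with_tta.csv", index=False)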

39.7 Additional Resources

Refer to the following for examples showing how to run Time Series examples in Driverless AI:
• Training a Time Series Model
• Time Series Recipes with Rolling Window
• Time Series Pipeline with Time Test Augmentation



CHAPTER

FORTY

NLP IN DRIVERLESS AI

Driverless AI version 1.3 introduced support for TensorFlow Natural Language Processing (NLP) experiments for text
classification and regression problems. The Driverless AI platform has the ability to support both standalone text and
text with other numerical values as predictive features.
The following is the set of features created by the NLP recipe for a given text column:
• N-gram frequency / TFIDF followed by Truncated SVD
• N-gram frequency / TFIDF followed by Linear / Logistic regression
• Word embeddings followed by CNN model (TensorFlow)
• Word embeddings followed by BiGRU model (TensorFlow)
• Character embeddings followed by CNN model (TensorFlow)
In addition to these techniques, Driverless AI supports custom NLP recipes using, for example, PyTorch or Flair.

40.1 n-gram

An n-gram is a contiguous sequence of n items from a given sample of text or speech.

40.1.1 n-gram Frequency

Frequency-based features represent the count of each word from a given text in the form of vectors. These are created
for different n-gram values. For example, a one-gram is equivalent to a single word, a two-gram is equivalent to two
consecutive words paired together, and so on.
Words and n-grams that occur more often will receive a higher weight. The ones that are rare will receive a lower
weight.

40.1.2 TFIDF of n-grams

Frequency-based features can be multiplied with the inverse document frequency to get term frequency–inverse doc-
ument frequency (TFIDF) vectors. Doing so also gives importance to the rare terms that occur in the corpus, which
may be helpful in certain classification tasks.


40.2 Truncated SVD Features

TFIDF and the frequency of n-grams both result in higher dimensions of the representational vectors. To counteract
this, Truncated SVD is commonly used to decompose the vectorized arrays into lower dimensions.


40.3 Linear Models for TFIDF Vectors

Linear models are also available in the Driverless AI NLP recipe. These capture linear dependencies that are crucial
to the process of achieving high accuracy rates.

40.4 Word Embeddings

Word embeddings is the term for a collective set of feature engineering techniques for text where words or phrases
from the vocabulary are mapped to vectors of real numbers. Representations are made so that words with similar
meanings are placed close to or equidistant from one another. For example, the word “king” is closely associated with
the word “queen” in this kind of vector representation.

TFIDF and frequency-based models represent counts and significant word information, but they lack the semantic
context for these words. Word embedding techniques are used to make up for this lack of semantic information.

40.4.1 CNN Models for Word Embedding

Although Convolutional Neural Network (CNN) models are primarily used on image-level machine learning tasks,
their use for representing text has proven to be quite efficient and faster compared to RNN models.
In Driverless AI, we pass word embeddings as input to CNN models, which return cross validated predictions that can
be used as a new set of features.


40.4.2 Bi-direction GRU Models for Word Embedding

Recurrent neural networks like long short-term memory units (LSTM) and gated recurrent units (GRU) are state-
of-the-art algorithms for NLP problems. In Driverless AI, we implement bi-directional GRU features for previous
word steps and for later steps to predict the current state. For example, in the sentence “John is walking on the golf
course,” a unidirectional model would represent states that represent “golf” based on “John is walking on,” but would
not represent “course.” Using a bi-directional model, the representation would also account for the later words,
giving the model more predictive power.
In simple terms, a bi-directional GRU model combines two independent RNN models into a single model. A GRU
architecture provides high speeds and accuracy rates similar to an LSTM architecture. As with CNN models, we pass
word embeddings as input to these models, which return cross validated predictions that can be used as a new set of
features.

40.4.3 CNN Models for Character Embedding

For languages like Japanese and Mandarin Chinese, where characters play a major role, character level embedding is
available in the NLP recipe.
In character embedding, each character is represented in the form of vectors rather than words. Driverless AI uses
character level embedding as the input to CNN models and later extracts class probabilities to feed as features for
downstream models.
The image below represents the overall set of features created by our NLP recipes:


40.5 NLP Naming Conventions

The naming conventions of the NLP features help you understand the type of feature that has been created.
The syntax for the feature names is as follows:
[FEAT TYPE]:[COL].[TARGET_CLASS]
• [FEAT TYPE] represents one of the following:
  – Txt – Frequency / TFIDF of N-grams followed by SVD
  – TxtTE – Frequency / TFIDF of N-grams followed by linear model
  – TextCNN_TE – Word embeddings followed by CNN model
  – TextBiGRU_TE – Word embeddings followed by Bidirectional GRU model
  – TextCharCNN_TE – Character embeddings followed by CNN model
• [COL] represents the name of the text column.
• [TARGET_CLASS] represents the target class for which the model predictions are made.
For example, TxtTE:text.0 equates to class 0 predictions for the text column “text” using Frequency / TFIDF of n-
grams followed by a linear model.
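A tiny helper function (hypothetical, shown only to make the convention explicit) could split such a name into its parts:

# Hypothetical helper for illustrating the [FEAT TYPE]:[COL].[TARGET_CLASS] format
def parse_nlp_feature_name(name):
    feat_type, rest = name.split(":", 1)
    col, target_class = rest.rsplit(".", 1)
    return {"feat_type": feat_type, "column": col, "target_class": target_class}

print(parse_nlp_feature_name("TxtTE:text.0"))
# {'feat_type': 'TxtTE', 'column': 'text', 'target_class': '0'}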

40.6 NLP Expert Settings

A number of configurable settings are available for NLP in Driverless AI. Refer to NLP Settings in the Expert Settings
topic for more information.


40.7 A Typical NLP Example: Sentiment Analysis

The following section provides an NLP example. This information is based on the Automatic Feature Engineering for
Text Analytics blog post. A similar example using the Python Client is available in The Python Client.
This example covers a classic sentiment analysis task on tweets, using the US Airline Sentiment dataset from
Figure Eight’s Data for Everyone library. We will use only the tweets in the ‘text’ column and the sentiment (positive,
negative, or neutral) in the ‘airline_sentiment’ column for this demo; here are some samples from the dataset. The
dataset can be split into training and test sets with a simple script, as sketched below.
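The split itself takes only a few lines of pandas and scikit-learn; this sketch assumes the dataset has been downloaded locally as Airline-Sentiment-2-w-AA.csv and mirrors the split used in the Python client example later in this document:

# Split the airline sentiment data into training and test sets
import pandas as pd
from sklearn import model_selection

al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding="ISO-8859-1")
train_al, test_al = model_selection.train_test_split(al, test_size=0.2,
                                                     random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)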

Once we have our dataset ready in tabular format, we are all set to use Driverless AI. As with other problems
in Driverless AI, we need to choose the dataset and then specify the target column (‘airline_sentiment’).

Because we don’t want to use any other columns in the dataset, we need to click on Dropped Cols, and then exclude
everything but text as shown below:


Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to Expert Settings and enable
TensorFlow Models.

At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated
during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We
recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.


Once the experiment is done, users can make new predictions and download the scoring pipeline just like with any
other Driverless AI experiment.
Resources:
Resources:
• fastText: https://fasttext.cc/
• GloVe: https://nlp.stanford.edu/projects/glove/



CHAPTER

FORTYONE

THE PYTHON CLIENT

This section describes how to install the Driverless AI Python Client. It also provides some end-to-end examples
showing how to use the Driverless AI Python client. Additional examples are available in the
https://github.com/h2oai/driverlessai-tutorials repository.
Note: This section and the Python API are in a pre-release state and are a work in progress. The documentation and API will both
continue to change, and functions described in these examples may not work in other versions of Driverless AI.

41.1 Installing the Python Client

The Python Client is available on the Driverless AI UI and published on the h2oai channel at
https://anaconda.org/h2oai/repo.

41.1.1 Installing from Driverless AI

Requirements

• Python 3.6. This is the only supported version.


Download from UI

On the Driverless AI top menu, select the RESOURCES > PYTHON CLIENT link. This downloads the
h2oai_client wheel.

Download from Command Line

The Driverless AI Python client is exposed as the /clients/py HTTP end point. This can be accessed via the
command line:
wget --trust-server-names http://<Driverless AI address>/clients/py

Wheel Installation

Install this wheel to your local Python via the pip command. Once installed, you can launch a Jupyter notebook and begin
using the Driverless AI Python Client.
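For example (the exact wheel file name depends on your Driverless AI release, so the name pattern below is only illustrative):

# illustrative: substitute the actual name of the downloaded wheel file
pip install ./h2oai_client-*.whl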


41.1.2 Installing from Anaconda Cloud

Note: Conda installs of the Python client are not supported on Windows.

Requirements

• Conda Package Manager. You can install the Conda Package Manager either through Anaconda or Miniconda.
Note that the Driverless AI Python client requires Python 3.6, so ensure that you install the Python 3 version of
Anaconda or Miniconda.
– Anaconda Install Instructions
– Miniconda Install Instructions

Installation Procedure

After Conda is installed and the Conda executable is available in $PATH, create a new Anaconda environment for
h2oai_client:
conda create -n h2oaiclientenv -c h2oai -c conda-forge h2oai_client

The above command installs the latest version of the Python client. Include the version number to install a specific
version. For example, the following command installs the 1.6.3 Python client:
conda create -n h2oaiclientenv -c h2oai -c conda-forge h2oai_client=1.6.3

A list of available Python client versions is available here: https://anaconda.org/h2oai/h2oai_client/files


Upon completion, you can launch a Jupyter notebook and begin using the Driverless AI Python Client.

41.2 Driverless AI: Credit Card Demo

This notebook provides an H2OAI Client workflow of model building and scoring that parallels the Driverless AI
workflow.
Notes:
• This is an early release of the Driverless AI Python client.
• This notebook was tested in Driverless AI version 1.8.2.
• Python 3.6 is the only supported version.
• You must install the h2oai_client wheel to your local Python. This is available from the RESOURCES
link in the top menu of the UI.


41.2.1 Workflow Steps

Build an Experiment with Python API:


1. Sign in
2. Import train & test set/new data
3. Specify experiment parameters
4. Launch Experiment
5. Examine Experiment
6. Download Predictions
Build an Experiment in Web UI and Access Through Python:
1. Get pointer to experiment
Score on New Data:
1. Score on new data with H2OAI model
Model Diagnostics on New Data:
1. Run model diagnostics on new data with H2OAI model
Run Model Interpretation:
1. Run model interpretation on the raw features
2. Run Model Interpretation on External Model Predictions
Build Scoring Pipelines:
1. Build Python Scoring Pipeline
2. Build MOJO Scoring Pipeline


41.2.2 Build an Experiment with Python API

1. Sign In

Import the required modules and log in.


Pass in your credentials through the Client class, which creates an authentication token to send to the Driverless AI
server. In plain English: just as you would sign into the Driverless AI web page (which then sends requests to the
Driverless AI server), you instantiate the Client class with your Driverless AI address and login credentials.
[1]: from h2oai_client import Client
import matplotlib.pyplot as plt
import pandas as pd

[2]: address = 'http://ip_where_driverless_is_running:12345'


username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

Equivalent Steps in Driverless: Signing In

2. Upload Datasets

Upload training and testing datasets from the Driverless AI /data folder.
You can provide a training, validation, and testing dataset for an experiment. The validation and testing dataset are
optional. In this example, we will provide only training and testing.
[3]: train_path = '/data/Kaggle/CreditCard/CreditCard-train.csv'
test_path = '/data/Kaggle/CreditCard/CreditCard-test.csv'

train = h2oai.create_dataset_sync(train_path)
test = h2oai.create_dataset_sync(test_path)


Equivalent Steps in Driverless: Uploading Train & Test CSV Files

3. Set Experiment Parameters

We will now set the parameters of our experiment. Some of the parameters include:
• Target Column: The column we are trying to predict.
• Dropped Columns: The columns we do not want to use as predictors such as ID columns, columns with data
leakage, etc.
• Weight Column: The column that indicates the per row observation weights. If None, each row will have an
observation weight of 1.
• Fold Column: The column that indicates the fold. If None, the folds will be determined by Driverless AI.
• Is Time Series: Whether or not the experiment is a time-series use case.
For information on the experiment settings, refer to the Experiment Settings.
For this example, we will be predicting ``default payment next month``. The parameters that control the experiment
process are: accuracy, time, and interpretability. We can use the
get_experiment_preview_sync function to get a sense of what will happen during the experiment.
We will start out by seeing what the experiment will look like with accuracy, time, and interpretability
all set to 5.
[4]: target="default payment next month"
exp_preview = h2oai.get_experiment_preview_sync(dataset_key= train.key
, validset_key=''
, classification=True
, dropped_cols=[]
, target_col=target
, is_time_series=False
, enable_gpus=False
, accuracy=5, time=5, interpretability=5
, reproducible=True
, resumed_experiment_id=''
, time_col=''
, config_overrides=None)
exp_preview

[4]: ['ACCURACY [5/10]:',


'- Training data size: *23,999 rows, 25 cols*',
'- Feature evolution: *[LightGBM, XGBoostGBM]*, *1/3 validation split*',
'- Final pipeline: *Ensemble (4 models), 4-fold CV*',
'',
'TIME [5/10]:',
'- Feature evolution: *4 individuals*, up to *66 iterations*',
'- Early stopping: After *10* iterations of no improvement',
'',
'INTERPRETABILITY [5/10]:',
'- Feature pre-pruning strategy: None',
'- Monotonicity constraints: disabled',
'- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, ClusterDist, ClusterTE, Frequent, Interactions, NumCatTE, NumToCatTE, NumToCatWoE, Original, TruncSVDNum, WeightOfEvidence]',
'',
'[LightGBM, XGBoostGBM] models to train:',
'- Model and feature tuning: *16*',
'- Feature evolution: *104*',
'- Final pipeline: *4*',
'',
'Estimated runtime: *minutes*',
'Auto-click Finish/Abort if not done in: *1 day*/*7 days*']


With these settings, the Driverless AI experiment will train about 124 models:
• 16 for model and feature tuning
• 104 for feature evolution
• 4 for the final pipeline
When we start the experiment, we can either:
• specify parameters
• use Driverless AI to suggest parameters
Driverless AI can suggest the parameters based on the dataset and target column. Below we will use the
``get_experiment_tuning_suggestion`` to see what settings Driverless AI suggests.
[5]: # let Driverless suggest parameters for experiment
params = h2oai.get_experiment_tuning_suggestion(dataset_key = train.key, target_col = target,
is_classification = True, is_time_series = False,
config_overrides = None, cols_to_drop=[])

params.dump()

[5]: {'dataset': {'key': '6f09ae32-33dc-11ea-ba27-0242ac110002',


'display_name': ''},
'resumed_model': {'key': '', 'display_name': ''},
'target_col': 'default payment next month',
'weight_col': '',
'fold_col': '',
'orig_time_col': '',
'time_col': '',
'is_classification': True,
'cols_to_drop': [],
'validset': {'key': '', 'display_name': ''},
'testset': {'key': '', 'display_name': ''},
'enable_gpus': True,
'seed': False,
'accuracy': 5,
'time': 4,
'interpretability': 6,
'score_f_name': 'AUC',
'time_groups_columns': [],
'unavailable_columns_at_prediction_time': [],
'time_period_in_seconds': None,
'num_prediction_periods': None,
'num_gap_periods': None,
'is_timeseries': False,
'config_overrides': None}

Driverless AI has found that the best parameters are to set ``accuracy = 5``, ``time = 4``, ``interpretability = 6``. It
has selected ``AUC`` as the scorer (this is the default scorer for binomial problems).


Equivalent Steps in Driverless: Set the Knobs, Configuration & Launch

4. Launch Experiment: Feature Engineering + Final Model Training

Launch the experiment using the parameters that Driverless AI suggested along with the testset, scorer, and seed that
were added. We can launch the experiment with the suggested parameters or create our own.
[6]: experiment = h2oai.start_experiment_sync(dataset_key=train.key,
testset_key = test.key,
target_col=target,
is_classification=True,
accuracy=5,
time=4,
interpretability=6,
scorer="AUC",
enable_gpus=True,
seed=1234,
cols_to_drop=['ID'])


Equivalent Steps in Driverless: Launch Experiment

5. Examine Experiment

View the final model score for the validation and test datasets. When feature engineering is complete, an ensemble
model can be built depending on the accuracy setting. The experiment object also contains the score on the validation
and test data for this ensemble model. In this case, the validation score is the score on the training cross-validation
predictions.
[7]: print("Final Model Score on Validation Data: " + str(round(experiment.valid_score, 3)))
print("Final Model Score on Test Data: " + str(round(experiment.test_score, 3)))

Final Model Score on Validation Data: 0.779


Final Model Score on Test Data: 0.8

The experiment object also contains the scores calculated for each iteration on bootstrapped samples on the validation
data. In the iteration graph in the UI, we can see the mean performance for the best model (yellow dot) and +/- 1
standard deviation of the best model performance (yellow bar).
This information is saved in the experiment object.
[8]: # Add scores from experiment iterations
iteration_data = h2oai.list_model_iteration_data(experiment.key, 0, len(experiment.iteration_data))
iterations = list(map(lambda iteration: iteration.iteration, iteration_data))
scores_mean = list(map(lambda iteration: iteration.score_mean, iteration_data))
scores_sd = list(map(lambda iteration: iteration.score_sd, iteration_data))



# Add score from final ensemble
iterations = iterations + [max(iterations) + 1]
scores_mean = scores_mean + [experiment.valid_score]
scores_sd = scores_sd + [experiment.valid_score_sd]

plt.figure()
plt.errorbar(iterations, scores_mean, yerr=scores_sd, color = "y",
ecolor='yellow', fmt = '--o', elinewidth = 4, alpha = 0.5)
plt.xlabel("Iteration")
plt.ylabel("AUC")
plt.ylim([0.65, 0.82])
plt.show();

Equivalent Steps in Driverless: View Results


6. Download Results

Once an experiment is complete, we can see that the UI presents us with options for downloading the:
• predictions
– on the (holdout) train data
– on the test data
• experiment summary - summary of the experiment including feature importance
We will show an example of downloading the test predictions below. Note that equivalent commands can also be run
for downloading the train (holdout) predictions.
[9]: h2oai.download(src_path=experiment.test_predictions_path, dest_dir=".")

[9]: './test_preds.csv'

[10]: test_preds = pd.read_csv("./test_preds.csv")


test_preds.head()

[10]: default payment next month.0 default payment next month.1


0 0.399347 0.600653
1 0.858302 0.141698
2 0.938646 0.061354
3 0.619994 0.380006
4 0.865251 0.134749

41.2.3 Build an Experiment in Web UI and Access Through Python

It is also possible to use the Python API to examine an experiment that was started through the Web UI using the
experiment key.

1. Get pointer to experiment

You can get a pointer to the experiment by referencing the experiment key in the Web UI.
[11]: # Get list of experiments
experiment_list = list(map(lambda x: x.key, h2oai.list_models(offset=0, limit=100).models))
experiment_list

[11]: ['7f7b429e-33dc-11ea-ba27-0242ac110002',
'0be7d94a-33d8-11ea-ba27-0242ac110002',
'2e6bbcfa-30a1-11ea-83f9-0242ac110002',
'3c06c58c-27fd-11ea-9e09-0242ac110002',
'a3c6dfda-2353-11ea-9f4a-0242ac110002',
'15fe0c0a-203d-11ea-97b6-0242ac110002']

[12]: # Get pointer to experiment


experiment = h2oai.get_model_job(experiment_list[0]).entity


41.2.4 Score on New Data

You can use the Python API to score on new data. This is equivalent to the SCORE ON ANOTHER DATASET
button in the Web UI. The example below scores on the test data and then downloads the predictions.
Pass in any dataset that has the same columns as the original training set. If you passed a test set during the H2OAI
model building step, the predictions already exist.

1. Score Using the H2OAI Model

The following shows the predicted probability of default for each record in the test.
[13]: prediction = h2oai.make_prediction_sync(experiment.key, test.key, output_margin = False, pred_contribs = False)
pred_path = h2oai.download(prediction.predictions_csv_path, '.')
pred_table = pd.read_csv(pred_path)
pred_table.head()

[13]: default payment next month.0 default payment next month.1


0 0.399347 0.600653
1 0.858302 0.141698
2 0.938646 0.061354
3 0.619994 0.380006
4 0.865251 0.134749

We can also get the contribution each feature had to the final prediction by setting pred_contribs = True. This
will give us an idea of how each feature affects the predictions.
[14]: prediction_contributions = h2oai.make_prediction_sync(experiment.key, test.key,
output_margin = False, pred_contribs = True)
pred_contributions_path = h2oai.download(prediction_contributions.predictions_csv_path, '.')
pred_contributions_table = pd.read_csv(pred_contributions_path)
pred_contributions_table.head()

[14]: contrib_0_AGE contrib_10_PAY_0 contrib_11_PAY_2 contrib_12_PAY_3 \


0 -0.017284 1.559683 0.336257 -0.004992
1 -0.006587 -0.283259 -0.058643 -0.014668
2 -0.017300 -0.347706 -0.061856 -0.027669
3 -0.064728 0.916047 0.081330 0.153046
4 -0.029481 -0.260402 -0.058225 -0.155098

contrib_13_PAY_4 contrib_14_PAY_5 contrib_15_PAY_6 contrib_16_PAY_AMT1 \


0 -0.009970 -0.028749 -0.046268 0.064267
1 -0.025089 -0.021999 -0.021543 0.091787
2 -0.018144 -0.011283 -0.017874 0.047035
3 0.166219 0.170304 0.077901 0.059009
4 -0.047545 -0.022028 -0.024870 0.076568

contrib_17_PAY_AMT2 contrib_18_PAY_AMT3 ... contrib_22_SEX \


0 -0.021834 -0.015015 ... 0.000940
1 -0.061340 -0.057014 ... 0.056241
2 -0.229566 -0.065025 ... 0.024883
3 0.052934 0.125983 ... -0.001882
4 0.119231 0.123152 ... 0.021810

contrib_2_BILL_AMT2 contrib_3_BILL_AMT3 contrib_4_BILL_AMT4 \


0 -0.039787 -0.010946 -0.034707
1 -0.041630 0.027173 -0.000515
2 -0.041808 -0.024132 -0.050757
3 0.050526 0.059255 -0.033044
4 -0.015791 -0.161386 -0.032228

contrib_5_BILL_AMT5 contrib_6_BILL_AMT6 contrib_7_EDUCATION \


0 0.025392 0.013222 0.008324
1 -0.017727 -0.006212 0.029378
2 -0.016839 -0.021221 0.020798
3 -0.001438 0.011796 -0.858484
4 -0.005339 -0.016350 0.010969

contrib_8_LIMIT_BAL contrib_9_MARRIAGE contrib_bias


0 0.163888 -0.046693 -1.507163
1 0.323900 -0.066966 -1.507163
2 -0.335934 -0.069577 -1.507163
3 0.100657 -0.043778 -1.507163
4 0.112077 -0.039533 -1.507163

[5 rows x 24 columns]

We will examine the contributions for our first record more closely.
[15]: contrib = pd.DataFrame(pred_contributions_table.iloc[0][1:])
contrib.columns = ["contribution"]
contrib["abs_contribution"] = contrib.contribution.abs()
contrib.sort_values(by="abs_contribution", ascending=False)[["contribution"]].head()

[15]: contribution
contrib_10_PAY_0 1.559683
contrib_bias -1.507163
contrib_11_PAY_2 0.336257
contrib_8_LIMIT_BAL 0.163888
contrib_1_BILL_AMT1 -0.100378


This customer’s PAY_0, PAY_2, and LIMIT_BAL features had the greatest impact on their prediction.
Since the contribution is positive, we know that it increases the probability that they will default.
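As a quick sanity check, a row's contributions can be tied back to its predicted probability. This assumes the contributions are reported in margin (log-odds) space, as is typical for gradient boosted tree models; if the final pipeline applies a different link function, the transformation would differ:

# Assumed: contributions (plus bias) are in log-odds space for this model
import numpy as np

row_margin = pred_contributions_table.iloc[0].sum()   # sum includes contrib_bias
prob_default = 1.0 / (1.0 + np.exp(-row_margin))      # apply the sigmoid link
print(prob_default)  # compare with pred_table["default payment next month.1"].iloc[0]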

41.2.5 Model Diagnostics on New Data

You can use the Python API to also perform model diagnostics on new data. This is equivalent to the Model Diagnostics
tab in the Web UI.

1. Run model diagnostics on new data with H2OAI model

The example below performs model diagnostics on the test dataset but any data with the same columns can be selected.
[16]: test_diagnostics = h2oai.make_model_diagnostic_sync(experiment.key, test.key)

[17]: [{'scorer': x.score_f_name, 'score': x.score} for x in test_diagnostics.scores]

[17]: [{'scorer': 'ACCURACY', 'score': 0.8326666666666667},


{'scorer': 'AUC', 'score': 0.8003248324279806},
{'scorer': 'AUCPR', 'score': 0.5805185237503238},
{'scorer': 'F05', 'score': 0.5957993999142734},
{'scorer': 'F1', 'score': 0.5563139931740614},
{'scorer': 'F2', 'score': 0.6467051171922935},
{'scorer': 'GINI', 'score': 0.6006496648559612},
{'scorer': 'LOGLOSS', 'score': 0.406033500339215},
{'scorer': 'MACROAUC', 'score': 0.8003248324279806},
{'scorer': 'MCC', 'score': 0.4517110838237804}]

Here is the same model diagnostics displayed in the UI:


41.2.6 Run Model Interpretation

Once we have completed an experiment, we can interpret our H2OAI model. Model Interpretability is used to provide
model transparency and explanations. More information on Model Interpretability can be found here:
http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/interpreting.html.

1. Run Model Interpretation on the Raw Data

We can run the model interpretation in the Python client as shown below. By setting the parameter
use_raw_features to True, we are interpreting the model using only the raw features in the data. This will
not use the engineered features from our final model to explain the data.
[18]: mli_experiment = h2oai.run_interpretation_sync(dai_model_key = experiment.key,
dataset_key = train.key,
target_col = target,
use_raw_feature = True)

This is equivalent to clicking Interpret this Model on Original Features in the UI once the experiment has completed.


Once our interpretation is finished, we can navigate to the MLI tab in the UI to see our interpreted model.

We can also see the list of interpretations using the Python Client:
[19]: # Get list of interpretations
mli_list = list(map(lambda x: x.key, h2oai.list_interpretations(offset=0, limit=100)))
mli_list

[19]: ['2ff44e94-33de-11ea-ba27-0242ac110002',
'6ce9e9f6-33db-11ea-ba27-0242ac110002']

2. Run Model Interpretation on External Model Predictions

Model Interpretation does not need to be run on a Driverless AI experiment. We can also train an external model
and run Model Interpretability on the predictions. In this next section, we will walk through the steps to interpret an
external model.


Train External Model

We will begin by training a model with scikit-learn. Our end goal is to use Driverless AI to interpret the predictions
made by our scikit-learn model.
[25]: # Dataset must be located where Python client is running - you may need to download it locally
train_pd = pd.read_csv(train_path)

[26]: from sklearn.ensemble import GradientBoostingClassifier

predictors = list(set(train_pd.columns) - set([target]))

gbm_model = GradientBoostingClassifier(random_state=10)
gbm_model.fit(train_pd[predictors], train_pd[target])

[26]: GradientBoostingClassifier(criterion='friedman_mse', init=None,


learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=10, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)

[27]: predictions = gbm_model.predict_proba(train_pd[predictors])


predictions[0:5]

[27]: array([[0.20060823, 0.79939177],


[0.57311744, 0.42688256],
[0.88034896, 0.11965104],
[0.86483594, 0.13516406],
[0.87349721, 0.12650279]])

Interpret on External Predictions

Now that we have the predictions from our scikit-learn GBM model, we can call Driverless AI’s
``h2oai.run_interpretation_sync`` to create the interpretation screen.
[28]: train_gbm_path = "./CreditCard-train-gbm_pred.csv"
predictions = pd.concat([train_pd, pd.DataFrame(predictions[:, 1], columns = ["p1"])], axis = 1)
predictions.to_csv(path_or_buf=train_gbm_path, index = False)

[29]: train_gbm_pred = h2oai.upload_dataset_sync(train_gbm_path)

[30]: mli_external = h2oai.run_interpretation_sync(dai_model_key = "", # no experiment key


dataset_key = train_gbm_pred.key,
target_col = target,
prediction_col = "p1")

We can also run Model Interpretability on an external model in the UI as shown below:


[31]: # Get list of interpretations


mli_list = list(map(lambda x: x.key, h2oai.list_interpretations(offset=0, limit=100)))
mli_list

[31]: ['minemutu', 'nutiduha', 'senikahi']


41.2.7 Build Scoring Pipelines

In our last section, we will build the scoring pipelines from our experiment. There are two scoring pipeline options:
• Python Scoring Pipeline: requires Python runtime
• MOJO Scoring Pipeline: requires Java runtime
Documentation on the scoring pipelines is provided here:
http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/python-mojo-pipelines.html.

The experiment screen shows two scoring pipeline buttons: Download Python Scoring Pipeline or Build MOJO
Scoring Pipeline. Driverless AI determines if any scoring pipeline should be automatically built based on the
config.toml file. In this example, we have run Driverless AI with the settings:
# Whether to create the Python scoring pipeline at the end of each experiment
make_python_scoring_pipeline = true

# Whether to create the MOJO scoring pipeline at the end of each experiment
# Note: Not all transformers or main models are available for MOJO (e.g. no gblinear main model)
make_mojo_scoring_pipeline = false

Therefore, only the Python Scoring Pipeline will be built by default.

1. Build Python Scoring Pipeline

The Python Scoring Pipeline has been built by default based on our config.toml settings. We can get the path to the
Python Scoring Pipeline in our experiment object.
[32]: experiment.scoring_pipeline_path

[32]: 'h2oai_experiment_daguwofe/scoring_pipeline/scorer.zip'

We can also build the Python Scoring Pipeline - this is useful if the ``make_python_scoring_pipeline`` option was
set to false.
[58]: python_scoring_pipeline = h2oai.build_scoring_pipeline_sync(experiment.key)


[59]: python_scoring_pipeline.file_path

[59]: 'h2oai_experiment_adbb4dca-c460-11e9-b1a0-0242ac110002/scoring_pipeline/scorer.zip'

Now we will download the scoring pipeline zip file.


[60]: h2oai.download(python_scoring_pipeline.file_path, dest_dir=".")

[60]: './scorer.zip'

2. Build MOJO Scoring Pipeline

The MOJO Scoring Pipeline has not been built by default because of our config.toml settings. We can build the MOJO
Scoring Pipeline using the Python client. This is equivalent to selecting the Build MOJO Scoring Pipeline on the
experiment screen.
[61]: mojo_scoring_pipeline = h2oai.build_mojo_pipeline_sync(experiment.key)

[62]: mojo_scoring_pipeline.file_path

[62]: 'h2oai_experiment_adbb4dca-c460-11e9-b1a0-0242ac110002/mojo_pipeline/mojo.zip'

Now we can download the scoring pipeline zip file.


[63]: h2oai.download(mojo_scoring_pipeline.file_path, dest_dir=".")

[63]: './mojo.zip'

Once the MOJO Scoring Pipeline is built, the Build MOJO Scoring Pipeline button changes to Download MOJO Scoring
Pipeline.



41.3 Driverless AI - Training Time Series Model

The purpose of this notebook is to show an example of using Driverless AI to train a time series model. Our goal will
be to forecast the Weekly Sales for a particular Store and Department for the next week. The data used in this notebook
is from the Walmart Kaggle Competition, where features.csv and train.csv have been joined together.
Note: This notebook was tested and run on Driverless AI 1.8.1.


41.3.1 Workflow

1. Import data into Python


2. Format data for Time Series
3. Upload data to Driverless AI
4. Launch Driverless AI Experiment
5. Evaluate model performance
[1]: import pandas as pd
from h2oai_client import Client

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

41.3.2 Step 1: Import Data

We will begin by importing our data using pandas. We are going to first work with the data in Python to correctly
format it for a Driverless AI time series use case.
[2]: sales_data = pd.read_csv("./walmart_train.csv")
sales_data.head()

[2]: Store Dept Date Weekly_Sales Temperature Fuel_Price MarkDown1 \


0 1 1 2010-02-05 24924.50 42.31 2.572 -1.0
1 1 2 2010-02-05 50605.27 42.31 2.572 -1.0
2 1 3 2010-02-05 13740.12 42.31 2.572 -1.0
3 1 4 2010-02-05 39954.04 42.31 2.572 -1.0
4 1 5 2010-02-05 32229.38 42.31 2.572 -1.0

MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment \


0 -1.0 -1.0 -1.0 -1.0 211.096358 8.106
1 -1.0 -1.0 -1.0 -1.0 211.096358 8.106
2 -1.0 -1.0 -1.0 -1.0 211.096358 8.106
3 -1.0 -1.0 -1.0 -1.0 211.096358 8.106
4 -1.0 -1.0 -1.0 -1.0 211.096358 8.106

IsHoliday sample_weight
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

[3]: # Convert Date column to datetime


sales_data["Date"] = pd.to_datetime(sales_data["Date"], format="%Y-%m-%d")

41.3.3 Step 2: Format Data for Time Series

The data has one record per Store, Department, and Week. Our goal for this use case will be to forecast the total sales
for the next week.
The only features we should use as predictors are ones that we will have available at the time of scoring. Features
like the Temperature, Fuel Price, and Unemployment will not be known in advance. Therefore, before we start our
Driverless AI experiments, we will choose to use the previous week’s Temperature, Fuel Price, Unemployment, and
CPI attributes. We will know this information at the time of scoring.
[4]: lag_variables = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]
dai_data = sales_data.set_index(["Date", "Store", "Dept"])
lagged_data = dai_data.loc[:, lag_variables].groupby(level=["Store", "Dept"]).shift(1)

[5]: # Join lagged predictor variables to training data


dai_data = dai_data.join(lagged_data.rename(columns=lambda x: x +"_lag"))

[6]: # Drop original predictor variables - we do not want to use these in the model
dai_data = dai_data.drop(lagged_data, axis=1)
dai_data = dai_data.reset_index()


[7]: dai_data.head()

[7]: Date Store Dept Weekly_Sales MarkDown1 MarkDown2 MarkDown3 \


0 2010-02-05 1 1 24924.50 -1.0 -1.0 -1.0
1 2010-02-05 1 2 50605.27 -1.0 -1.0 -1.0
2 2010-02-05 1 3 13740.12 -1.0 -1.0 -1.0
3 2010-02-05 1 4 39954.04 -1.0 -1.0 -1.0
4 2010-02-05 1 5 32229.38 -1.0 -1.0 -1.0

MarkDown4 MarkDown5 IsHoliday sample_weight Temperature_lag \


0 -1.0 -1.0 0 1 NaN
1 -1.0 -1.0 0 1 NaN
2 -1.0 -1.0 0 1 NaN
3 -1.0 -1.0 0 1 NaN
4 -1.0 -1.0 0 1 NaN

Fuel_Price_lag CPI_lag Unemployment_lag


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

Now that our training data is correctly formatted, we can run a Driverless AI experiment to forecast the next week’s
sales.

41.3.4 Step 3: Upload Data to Driverless AI

We will split out data into two pieces: training and test (which consists of the last week of data).
[8]: train_data = dai_data.loc[dai_data["Date"] < "2012-10-26"]
test_data = dai_data.loc[dai_data["Date"] == "2012-10-26"]

To upload the datasets, we will sign into Driverless AI.


[9]: address = 'http://<ip_where_driverless_is_running>:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

[10]: train_path = "./train_data.csv"


test_path = "./test_data.csv"

train_data.to_csv(train_path, index = False)


test_data.to_csv(test_path, index = False)

[11]: # Add datasets to Driverless AI


train_dai = h2oai.upload_dataset_sync(train_path)
test_dai = h2oai.upload_dataset_sync(test_path)

Equivalent Steps in Driverless: Uploading Train & Test CSV Files


41.3.5 Step 4: Launch Driverless AI Experiment

We will now launch the Driverless AI experiment. To do that we will need to specify the parameters for our experiment.
Some of the parameters include:
• Target Column: The column we are trying to predict.
• Dropped Columns: The columns we do not want to use as predictors such as ID columns, columns with data
leakage, etc.
• Is Time Series: Whether or not the experiment is a time-series use case.
• Time Column: The column that contains the date/date-time information.
• Time Group Columns: The categorical columns that indicate how to group the data so that there is one time
series per group. In our example, our Time Groups Columns are Store and Dept. Each Store and Dept,
corresponds to a single time series.
• Number of Prediction Periods: How far in the future do we want to predict?
• Number of Gap Periods: After how many periods can we start predicting? If we assume that we can start
forecasting right after the training data ends, then the Number of Gap Periods will be 0.
For this experiment, we want to forecast next week’s sales for each Store and Dept. Therefore, we will use the
following time series parameters:
• Time Group Columns: [Store, Dept]
• Number of Prediction Periods: 1 (a.k.a., horizon)
• Number of Gap Periods: 0
Note that the period size is unknown to the Python client. To overcome this, you can also specify the optional
time_period_in_seconds parameter, which lets you express the horizon in real time units. If this parameter is
omitted, Driverless AI will automatically detect the period size in the experiment, and the horizon value will respect
this period. For example, if you are sure that your data has a one-week period, you can set num_prediction_periods=14;
otherwise it is possible that the model may not work out correctly.
[12]: experiment = h2oai.start_experiment_sync(dataset_key=train_dai.key,
testset_key = test_dai.key,
target_col="Weekly_Sales",
is_classification=False,
cols_to_drop = ["sample_weight"],
accuracy=5,
time=3,
interpretability=1,
scorer="RMSE",
enable_gpus=True,
seed=1234,
time_col = "Date",
time_groups_columns = ["Store", "Dept"],
num_prediction_periods = 1,
num_gap_periods = 0)


Equivalent Steps in Driverless: Launching Driverless AI Experiment

41.3.6 Step 5. Evaluate Model Performance

Now that our experiment is complete, we can view the model performance metrics within the experiment object.
[13]: print("Validation RMSE: ${:,.0f}".format(experiment.valid_score))
print("Test RMSE: ${:,.0f}".format(experiment.test_score))

Validation RMSE: $2,281


Test RMSE: $2,483

We can also plot the actual versus predicted values from the test data.
[14]: plt.scatter(experiment.test_act_vs_pred.x_values, experiment.test_act_vs_pred.y_values)
plt.plot([0, max(experiment.test_act_vs_pred.x_values)],[0, max(experiment.test_act_vs_pred.y_values)], 'b--',)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


Lastly, we can download the test predictions from Driverless AI and examine the forecasted sales vs actual for a
selected store and department.
[15]: preds_path = h2oai.download(src_path=experiment.test_predictions_path, dest_dir=".")
forecast_predictions = pd.read_csv(preds_path)
forecast_predictions.columns = ["predicted_Weekly_Sales"]

actual = test_data[["Date", "Store", "Dept", "Weekly_Sales"]].reset_index(drop = True)


forecast_predictions = pd.concat([actual, forecast_predictions], axis = 1)
forecast_predictions.head()

[15]: Date Store Dept Weekly_Sales predicted_Weekly_Sales


0 2012-10-26 1 1 27390.81 28837.857422
1 2012-10-26 1 2 43134.88 43528.121094
2 2012-10-26 1 3 9350.90 8774.910156
3 2012-10-26 1 4 36292.60 35721.511719
4 2012-10-26 1 5 25846.94 23501.814453

[16]: selected_ts = sales_data.loc[(sales_data["Store"] == 1) & (sales_data["Dept"] == 1)].tail(n = 51)

selected_ts_forecast = forecast_predictions.loc[(forecast_predictions["Store"] == 1) &


(forecast_predictions["Dept"] == 1)]

[17]: # Plot the forecast of a select store and department


years = mdates.MonthLocator()
yearsFmt = mdates.DateFormatter('%b')

fig, ax = plt.subplots()
ax.plot(selected_ts["Date"], selected_ts["Weekly_Sales"], label = "Actual")
ax.plot(selected_ts_forecast["Date"], selected_ts_forecast["predicted_Weekly_Sales"], marker='o', label = "Predicted")
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
plt.legend(loc='upper left')
plt.show()



41.4 Driverless AI - Time Series Recipes with Rolling Window

The purpose of this notebook is to show an example of using Driverless AI to train experiments on different subsets of
data. This would result in a collection of forecasted values that can be evaluated. The data used in this notebook is a
public dataset: S&P 500 Stock Data. In this example, we are using the all_stocks_5yr.csv dataset.

41.4.1 Workflow

1. Import data into Python


2. Create function that slices data by index
3. For each slice of data:
• import data into Driverless AI
• train an experiment
• combine test predictions

41.4.2 Import Data

We will begin by importing our data using pandas.


[1]: import pandas as pd

stock_data = pd.read_csv("./all_stocks_5yr.csv")
stock_data.head()

[1]: date open high low close volume Name


0 2013-02-08 15.07 15.12 14.63 14.75 8407500 AAL
1 2013-02-11 14.89 15.01 14.26 14.46 8882000 AAL
2 2013-02-12 14.45 14.51 14.10 14.27 8126000 AAL
3 2013-02-13 14.30 14.94 14.25 14.66 10259500 AAL
4 2013-02-14 14.94 14.96 13.16 13.99 31879900 AAL

[2]: # Convert Date column to datetime


stock_data["date"] = pd.to_datetime(stock_data["date"], format="%Y-%m-%d")


We will add a new column which is the index. We will use this later on to do a rolling window of training and testing.
We will use this index instead of the actual date because this data only occurs on weekdays (when the stock market is
open). When you use Driverless AI to perform a forecast, it will forecast the next n days. In this particular case, we
never want to forecast Saturdays and Sundays. We will instead treat our time column as the index of the record.
[3]: dates_index = pd.DataFrame(sorted(stock_data["date"].unique()), columns = ["date"])
dates_index["index"] = range(len(dates_index))
stock_data = pd.merge(stock_data, dates_index, on = "date")

stock_data.head()

[3]: date open high low close volume Name index


0 2013-02-08 15.0700 15.1200 14.6300 14.7500 8407500 AAL 0
1 2013-02-08 67.7142 68.4014 66.8928 67.8542 158168416 AAPL 0
2 2013-02-08 78.3400 79.7200 78.0100 78.9000 1298137 AAP 0
3 2013-02-08 36.3700 36.4200 35.8250 36.2500 13858795 ABBV 0
4 2013-02-08 46.5200 46.8950 46.4600 46.8900 1232802 ABC 0

41.4.3 Create Moving Window Function

Now we will create a function that can split our data by time to create multiple experiments.
We will start by first logging into Driverless AI.
[4]: import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters

[10]: address = 'http://ip_where_driverless_is_running:12345'


username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

Our function will split the data into training and testing based on the training length and testing length specified by the
user. It will then run an experiment in Driverless AI and download the test predictions.
[11]: def dai_moving_window(dataset, train_len, test_len, target, predictors, index_col, time_group_cols,
accuracy, time, interpretability):

# Calculate windows for the training and testing data based on the train_len and test_len arguments
num_dates = max(dataset[index_col])
num_windows = (num_dates - train_len) // test_len

windows = []
for i in range(num_windows):
train_start_id = i*test_len
train_end_id = train_start_id + (train_len - 1)
test_start_id = train_end_id + 1
test_end_id = test_start_id + (test_len - 1)

window = {'train_start_index': train_start_id,


'train_end_index': train_end_id,
'test_start_index': test_start_id,
'test_end_index': test_end_id}
windows.append(window)

# Split the data by the window


forecast_predictions = pd.DataFrame([])
for window in windows:
train_data = dataset[(dataset[index_col] >= window.get("train_start_index")) &
(dataset[index_col] <= window.get("train_end_index"))]

test_data = dataset[(dataset[index_col] >= window.get("test_start_index")) &


(dataset[index_col] <= window.get("test_end_index"))]

# Get the Driverless AI forecast predictions


window_preds = dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,
accuracy, time, interpretability)
forecast_predictions = forecast_predictions.append(window_preds)

return forecast_predictions

[12]: def dai_get_forecast(train_data, test_data, predictors, target, index_col, time_group_cols,


accuracy, time, interpretability):

# Save dataset
train_path = "./train_data.csv"
test_path = "./test_data.csv"
keep_cols = predictors + [target, index_col] + time_group_cols
train_data[keep_cols].to_csv(train_path)
test_data[keep_cols].to_csv(test_path)


# Add datasets to Driverless AI


train_dai = h2oai.upload_dataset_sync(train_path)
test_dai = h2oai.upload_dataset_sync(test_path)

# Run Driverless AI Experiment


experiment = h2oai.start_experiment_sync(dataset_key = train_dai.key,
testset_key = test_dai.key,
target_col = target,
cols_to_drop = [],
is_classification = False,
accuracy = accuracy,
time = time,
interpretability = interpretability,
scorer = "RMSE",
seed = 1234,
time_col = index_col,
time_groups_columns = time_group_cols,
num_prediction_periods = test_data[index_col].nunique(),
num_gap_periods = 0)

# Download the predictions on the test dataset


test_predictions_path = h2oai.download(experiment.test_predictions_path, "./")
test_predictions = pd.read_csv(test_predictions_path)
test_predictions.columns = ["Prediction"]

# Add predictions to original test data


keep_cols = [target, index_col] + time_group_cols
test_predictions = pd.concat([test_data[keep_cols].reset_index(drop=True), test_predictions], axis = 1)

return test_predictions

[13]: predictors = ["Name", "index"]


target = "close"
index_col = "index"
time_group_cols = ["Name"]

[ ]: # We will filter the dataset to the first 1030 dates for demo purposes
filtered_stock_data = stock_data[stock_data["index"] <= 1029]
forecast_predictions = dai_moving_window(filtered_stock_data, 1000, 3, target, predictors, index_col, time_group_cols,
accuracy = 1, time = 1, interpretability = 1)

[25]: forecast_predictions.head()

[25]: close index Name Prediction


0 44.90 1000 AAL 48.050527
1 121.63 1000 AAPL 119.485352
2 164.63 1000 AAP 167.960700
3 60.43 1000 ABBV 60.784213
4 83.62 1000 ABC 86.939174

[26]: # Calculate some error metric


mae = (forecast_predictions[target] - forecast_predictions["Prediction"]).abs().mean()
print("Mean Absolute Error: ${:,.2f}".format(mae))

Mean Absolute Error: $6.79

[ ]:

41.5 Time Series Analysis on a Driverless AI Model Scoring Pipeline

This example describes how to run the Python Scoring Pipeline on a time series model. This example has been tested
on a Linux machine.

41.5.1 Download the Python Scoring Pipeline

After successfully completing an experiment in DAI, click the DOWNLOAD PYTHON SCORING PIPELINE
button.


This downloads a scorer.zip file, which contains a scoring-pipeline folder.


After unzipping the scorer.zip file, run the following commands. (Note that the run_example.sh file can be found in
the scoring-pipeline folder):
# to use conda package manager
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh --pm conda

The above will create a conda environment with the requirements needed to run the scoring pipeline and will score
the example test.csv file, which confirms that the installation is successful.
Run the following to check the list of conda environments:
conda env list

An environment with a name following the format scoring_h2oai_experiment_xxx should be available, where xxx
is the name of your experiment.
At this point, you can run the example below.
[ ]: import os
import pandas as pd
from sklearn.model_selection import train_test_split
from dateutil.parser import parse
import datetime
from datetime import timedelta
import numpy as np
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999


41.5.2 Load datasets for scoring

For this example, we are using the walmart_before_20120205.zip and walmart_after_20120205.zip files.


[ ]: time1_path = "data/walmart_before_20120205.zip"
time2_path = "data/walmart_after_20120205.zip"

time1_pd = pd.read_csv(time1_path,parse_dates=['Date'])
time2_pd = pd.read_csv(time2_path,parse_dates=['Date'])

[ ]: # Import the scorer for the experiment. For example, below imports
# the scorer for experiment "hawolewo". Be sure to replace "hawolewo"
# with your experiment name.
from scoring_h2oai_experiment_hawolewo import Scorer

[ ]: %%capture
#Create a singleton Scorer instance.
#For optimal performance, create a Scorer instance once, and call score() or score_batch() multiple times.
scorer = Scorer()

41.5.3 Make predictions


[ ]: time2_pd["predict"] = scorer.score_batch(time2_pd)
time1_pd["predict"] = scorer.score_batch(time1_pd)

Join train and test datasets


[ ]: train_and_test= time1_pd.append(time2_pd,ignore_index=True)
train_and_test = train_and_test.reset_index(drop=True)

41.5.4 Model Evaluation

Here we look at the overall model performance in test and train. We also show the model horizon window in red to
illustrate the performance when the model is generating predictions beyond the horizon. We prefer to use R-squared
as the performance metric since the groups of Store and Department weekly sales are on vastly different scales.
[ ]: from sklearn.metrics import r2_score, mean_squared_error

def r2_rmse( g ):
r2 = r2_score( g['Weekly_Sales'], g['predict'] )
rmse = np.sqrt( mean_squared_error( g['Weekly_Sales'], g['predict'] ) )
return pd.Series( dict( r2 = r2, rmse = rmse ) )

metrics_ts = train_and_test.groupby( 'Date' ).apply( r2_rmse )


metrics_ts.index = [x for x in metrics_ts.index]

41.5.5 R2 Time Series Plot

This would be a useful plot to compare R2 over time between different DAI time series models, each with a different
prediction horizon.
[ ]: # Note: horizon_in_weeks is how many weeks the model can predict out to.
# In this example 34 had been picked
horizon_in_weeks = 34

# Gap between train dataset and when predictions will start


# In this example 0 had been picked
num_gap_periods = 0

[ ]: %matplotlib inline
import matplotlib.pyplot as plt
metrics_ts['r2'].plot(figsize=(20,10), title=("Training dataset in Green. Test data with model prediction horizon in Red"))

test_window_start = str((time2_pd["Date"].min()) + datetime.timedelta(weeks=num_gap_periods))


test_window_end = str(parse(test_window_start) + datetime.timedelta(weeks=horizon_in_weeks))

plt.axvspan(time2_pd["Date"].min(), test_window_end, facecolor='r', alpha=0.1)


plt.axvspan(time1_pd["Date"].min(), test_window_start, facecolor='g', alpha=0.1)
plt.suptitle("R2 Time Series Plot", fontsize=21, fontweight='bold')
plt.show()


41.5.6 Worst and Best Groups (in test period)

Here we generate the best and worst groups by R2. We filter out groups that have some missing data. We only calculate
R2 within the valid test horizon window.
[ ]: avg_count = train_and_test.groupby(["Store","Dept"]).size().mean()
print("average count: " + str(avg_count))
train_and_test_filtered = train_and_test.groupby(["Store","Dept"]).filter(lambda x: len(x) > 0.8 * avg_count)
train_and_test_filtered = train_and_test_filtered.loc[(train_and_test_filtered.Date < test_window_end) &
(train_and_test_filtered.Date > test_window_start)]

[ ]: grouped_r2s = train_and_test_filtered.groupby(["Store","Dept"]).apply( r2_rmse ).sort_values("r2")

[ ]: print(grouped_r2s.head()) # worst groups


print(grouped_r2s.tail()) # best groups

41.5.7 Worst and Best Groups (in train period)

Here we generate the best and worst groups by R2. We filter out groups that have some missing data. We only calculate
R2 within the train horizon window.
[ ]: avg_count = train_and_test.groupby(["Store","Dept"]).size().mean()
print("average count: " + str(avg_count))
train_and_test_filtered = train_and_test.groupby(["Store","Dept"]).filter(lambda x: len(x) > 0.8 * avg_count)
train_and_test_filtered = train_and_test_filtered.loc[(train_and_test_filtered.Date < test_window_start) &
(train_and_test_filtered.Date >= '2010-01-10')]

[ ]: grouped_r2s = train_and_test_filtered.groupby(["Store","Dept"]).apply( r2_rmse ).sort_values("r2")

[ ]: print(grouped_r2s.head()) # worst groups


print(grouped_r2s.tail()) # best groups

41.5.8 Choose group


[ ]: # Choose Group
store_num = 11
dept_num = 56
date_selection = '2012-02-10'

41.5.9 Plot Actual vs Predicted


[ ]: plot_df = train_and_test[(train_and_test.Store == store_num) & (train_and_test.Dept == dept_num)][["Date","Weekly_Sales","predict"]]
plot_df["Date"] = plot_df["Date"].apply(lambda x: (x))
plot_df = plot_df.set_index("Date")

[ ]: plot_df.plot(figsize=(20,10), title=("Training dataset in Green. Test data with model prediction horizon in Red. Date selection in Yellow"))

plt.axvspan(time2_pd["Date"].min(), test_window_end, facecolor='r', alpha=0.1)


plt.axvspan(time1_pd["Date"].min(), test_window_start, facecolor='g', alpha=0.1)
plt.axvline(x=date_selection, color='y')
plt.suptitle("Actual vs. Predicted", fontsize=21, fontweight='bold')
plt.show()


41.5.10 Plot Actual vs Predicted vs Deploy


[ ]: deploy_path = "data/walmart_deploy.zip"

deploy=pd.read_csv(deploy_path,parse_dates=['Date'])
deploy=deploy[pd.to_datetime(deploy['Date'])<=((time2_pd.Date.max()) + timedelta(days=7*horizon_in_weeks) )].reset_index(drop=True).copy()
#deploy.loc[deploy['Weekly_Sales'].isna(),'Weekly_Sales']=0

deploy=train_and_test.append(deploy,ignore_index=True).reset_index(drop=True)
deploy=deploy[['Store','Dept','Weekly_Sales','Date','IsHoliday','Hl', 'Size', 'ThanksG', 'Type', 'Unemployment', 'Xmas']].copy()

[ ]: deploy["predict"] = scorer.score_batch(deploy)
#deploy2["predict"] = scorer.score_batch(deploy2)

[ ]: #deploy= deploy.append(test,ignore_index=True)
#deploy = deploy.reset_index(drop=True)
plot_df2 = deploy[(deploy.Store == store_num) &
(deploy.Dept == dept_num)][["Date","Weekly_Sales","predict"]]
plot_df2["Date"] = plot_df2["Date"].apply(lambda x: (x))
plot_df2 = plot_df2.set_index("Date")

[ ]: plot_df2.plot(figsize=(20,10), title=("Training dataset in Green. Test data with model prediction horizon in Red. Date selection in Yellow"))

plt.axvspan(time2_pd["Date"].min(), test_window_end, facecolor='r', alpha=0.1)


plt.axvspan(time1_pd["Date"].min(), test_window_start, facecolor='g', alpha=0.1)
plt.axvline(x=date_selection, color='y')
plt.suptitle("Actual vs. Predicted vs. Deploy", fontsize=21, fontweight='bold')
plt.show()

41.5.11 Make Shapley Values

Shapley values show the contribution of engineered features to the predicted weekly sales generated by the model. With
Shapley values you can break down the components of a prediction and attribute precise values to specific features.
Please note that in some cases the model has a “link function” that has yet to be applied to make the sum of the Shapley
contributions equal to the prediction value.
[ ]: shapley = scorer.score_batch(train_and_test, pred_contribs=True, fast_approx=True)
shapley.columns = [x.replace('contrib_','',1) for x in shapley.columns]

41.5.12 Plot Shapley

This is a global vs local Shapley plot, with global being the average Shapley values for all of the predictions in the
selected group and local being the Shapley value for that specific prediction. Looking at this plot can give clues as to
which features contributed to the error in the prediction.
[ ]: shap_vals_group = shapley.loc[(train_and_test.Store==store_num) & (train_and_test.Dept==dept_num),:]
shap_vals_timestamp = shapley.loc[(train_and_test.Store==store_num)
& (train_and_test.Dept==dept_num)
& (train_and_test.Date==date_selection),:]
shap_vals = shap_vals_group.mean()
shap_vals = pd.concat([pd.DataFrame(shap_vals), shap_vals_timestamp.transpose()], axis=1, ignore_index=True)
shap_vals = shap_vals.sort_values(by=0)
bias = shap_vals.loc["bias",0]
shap_vals = shap_vals.drop("bias",axis=0)
shap_vals.columns = ["Global (Group)", "Local (Timestamp)"]

[ ]: shap_vals_group = shapley.loc[(train_and_test.Store==store_num) & (train_and_test.Dept==dept_num),:]


shap_vals_timestamp = shapley.loc[(train_and_test.Store==store_num)
& (train_and_test.Dept==dept_num)
& (train_and_test.Date==date_selection),:]
shap_vals = shap_vals_group.mean()
shap_vals = pd.concat([pd.DataFrame(shap_vals), shap_vals_timestamp.transpose()], axis=1, ignore_index=True)
shap_vals = shap_vals.sort_values(by=0)
bias = shap_vals.loc["bias",0]
shap_vals = shap_vals.drop("bias",axis=0)
shap_vals.columns = ["Global (Group)", "Local (Timestamp)"]

shap_vals_timestamp.transpose()

[ ]: from matplotlib.ticker import FuncFormatter


formatter = FuncFormatter(lambda x, y: str(round(float(x) + bias)))
ax = shap_vals.plot.barh(figsize=(8,30), fontsize=10, colormap="Set1")
ax.xaxis.set_major_formatter(formatter)
plt.show()


41.5.13 Summary

This notebook should get you started with all you need to diagnose and debug time series models from DAI. Try
different horizons during training and compare the model’s R2 over time to pick the best horizon for your use case.
Use the actual vs prediction plots to do detailed debugging. Find some interesting dates to examine and use the Shapley
plots to see how the features impacted the final prediction.

41.6 Driverless AI NLP Demo - Airline Sentiment Dataset

In this notebook, we will see how to use the Driverless AI Python client to build text classification models using the
airline sentiment Twitter dataset.
Import the necessary Python modules to get started, including the Driverless AI client. If not already installed, please
download and install the Python client from the Driverless AI GUI.
This notebook was tested in Driverless AI version 1.8.2.
[1]: import pandas as pd
from sklearn import model_selection
from h2oai_client import Client

The code below downloads the Twitter airline sentiment dataset and saves it in the current folder.
[2]: ! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv

--2020-01-17 09:38:39-- https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv


Resolving www.figure-eight.com (www.figure-eight.com)... 54.86.123.68, 35.169.155.50
Connecting to www.figure-eight.com (www.figure-eight.com)|54.86.123.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3704908 (3.5M) [application/octet-stream]
Saving to: ‘Airline-Sentiment-2-w-AA.csv.1’

Airline-Sentiment-2 100%[===================>] 3.53M 2.26MB/s in 1.6s

2020-01-17 09:38:41 (2.26 MB/s) - ‘Airline-Sentiment-2-w-AA.csv.1’ saved [3704908/3704908]

We can now split the data into training and testing datasets.
[3]: al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding='ISO-8859-1')
train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)

The first step is to establish a connection to Driverless AI using Client. Enter your credentials and the URL
address.
[4]: address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI

Read the train and test files into Driverless AI using the upload_dataset_sync command.
[5]: train_path = './train_airline_sentiment.csv'
test_path = './test_airline_sentiment.csv'

train = h2oai.upload_dataset_sync(train_path)
test = h2oai.upload_dataset_sync(test_path)

Now let us look at some basic information about the dataset.


[6]: print('Train Dataset: ', len(train.columns), 'x', train.row_count)
print('Test Dataset: ', len(test.columns), 'x', test.row_count)

[c.name for c in train.columns]

Train Dataset: 20 x 11712


Test Dataset: 20 x 2928


[6]: ['_unit_id',
'_golden',
'_unit_state',
'_trusted_judgments',
'_last_judgment_at',
'airline_sentiment',
'airline_sentiment:confidence',
'negativereason',
'negativereason:confidence',
'airline',
'airline_sentiment_gold',
'name',
'negativereason_gold',
'retweet_count',
'text',
'tweet_coord',
'tweet_created',
'tweet_id',
'tweet_location',
'user_timezone']

We just need two columns for our experiment: text, which contains the text of the tweet, and airline_sentiment,
which contains the sentiment of the tweet (the target column). We can drop the remaining columns for this experiment.
We will enable TensorFlow models and transformations to take advantage of CNN-based text features.
[7]: exp_preview = h2oai.get_experiment_preview_sync(
dataset_key=train.key
, validset_key=''
, target_col='airline_sentiment'
, classification=True
, dropped_cols=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
"airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
"airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
"tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"]
, accuracy=6
, time=4
, interpretability=5
, is_time_series=False
, time_col=''
, enable_gpus=True
, reproducible=False
, resumed_experiment_id=''
, config_overrides="""
enable_tensorflow='on'
enable_tensorflow_charcnn='on'
enable_tensorflow_textcnn='on'
enable_tensorflow_textbigru='on'
"""
)
exp_preview

[7]: ['ACCURACY [6/10]:',


'- Training data size: *11,712 rows, 2 cols*',
'- Feature evolution: *[LightGBM, TensorFlow, XGBoostGBM]*, *3-fold CV**, 2 reps*',
'- Final pipeline: *Ensemble (6 models), 3-fold CV*',
'',
'TIME [4/10]:',
'- Feature evolution: *4 individuals*, up to *46 iterations*',
'- Early stopping: After *5* iterations of no improvement',
'',
'INTERPRETABILITY [5/10]:',
'- Feature pre-pruning strategy: None',
'- Monotonicity constraints: disabled',
'- Feature engineering search space: [CVTargetEncode, Frequent, TextBiGRU, TextCNN, TextCharCNN, Text]',
'',
'[LightGBM, TensorFlow, XGBoostGBM] models to train:',
'- Model and feature tuning: *144*',
'- Feature evolution: *504*',
'- Final pipeline: *6*',
'',
'Estimated runtime: *minutes*',
'Auto-click Finish/Abort if not done in: *1 day*/*7 days*']

Please note that the Text and TextCNN features are enabled for this experiment.
Now we can start the experiment.
[8]: model = h2oai.start_experiment_sync(
dataset_key=train.key,
testset_key=test.key,
target_col='airline_sentiment',
scorer='F1',
is_classification=True,
cols_to_drop=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
"airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
"airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
"tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
accuracy=6,
time=2,
interpretability=5,
enable_gpus=True,
config_overrides="""
enable_tensorflow='on'
enable_tensorflow_charcnn='on'
enable_tensorflow_textcnn='on'
enable_tensorflow_textbigru='on'

"""
)

[9]: print('Modeling completed for model ' + model.key)

Modeling completed for model ce5935e6-3950-11ea-9465-0242ac110002

[10]: logs = h2oai.download(model.log_file_path, '.')
      print('Logs available at', logs)

We can download the predictions to the current folder.


[11]: test_preds = h2oai.download(model.test_predictions_path, '.')
print('Test set predictions available at', test_preds)

Test set predictions available at ./test_preds.csv
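To double-check the model's performance on the holdout set yourself, you can score the downloaded file against the labels kept in test_al. The sketch below assumes the prediction file contains one probability column per class named like airline_sentiment.negative (the exact column names may differ in your version) and that the rows come back in the same order as the uploaded test set.
[ ]: from sklearn.metrics import f1_score

     preds = pd.read_csv(test_preds)
     # Assumed naming: 'airline_sentiment.<class>' -- adjust if your file differs.
     predicted_label = preds.idxmax(axis=1).str.replace('airline_sentiment.', '', regex=False)
     print(f1_score(test_al['airline_sentiment'].values, predicted_label, average='micro'))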

[ ]:

41.7 Driverless AI Autoviz Python Client Example

This example shows how to use the Autoviz Python client in Driverless AI to retrieve graphs in Vega Lite format. (See
https://vega.github.io/vega-lite/.)
When running the Autoviz Python client in a Jupyter notebook, you can use https://github.com/vega/ipyvega (installed
through pip) to render the graphs directly in the notebook. You can also copy and paste the result into, e.g.,
https://vega.github.io/vega-editor/?mode=vega-lite. The final graph can then be downloaded as PNG/SVG/JSON.
The available API methods are listed at the end of this example.

41.7.1 Prerequisites

Using the Driverless AI Autoviz Python client doesn't require any additional packages other than the Driverless AI
client. However, if you are using Jupyter notebooks or JupyterLab, installing the Vega package improves the user
experience, as it allows you to render the produced graphs directly inside the Jupyter environment. In addition, it
provides options to download the generated files in SVG, PNG, or JSON formats.

41.7.2 Initialization

To initialize the Autoviz Python client, follow the same steps as when initializing the client for a new experiment. You
need to import the Client and initialize it, providing the Driverless AI host address and login credentials.
[1]: from h2oai_client import Client
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
cli = Client(address=address, username=username, password=password)


41.7.3 Upload the dataset


[18]: dataset_local_path = '/data/Kaggle/CreditCard/CreditCard-train.csv'
dataset = cli.create_dataset_sync(dataset_local_path)

41.7.4 Request specific visualizations


[19]: hist = cli.autoviz.get_histogram(dataset.key, variable_name='PAY_0')

[29]: scatterplot = cli.autoviz.get_scatterplot(dataset.key, x_variable_name='BILL_AMT3', y_variable_name='BILL_AMT4')

41.7.5 Visualize using Vega (Optional)

All of the methods provided in Client.autoviz return graphics in Vega Lite (v3) format. In order to visualize
them, you can either paste the returned graph (hist) into the online Vega editor, or you can use the Python Vega
package, which can render the charts directly in the Jupyter environment.
[32]: from vega import VegaLite

[33]: VegaLite(hist)

[33]: <vega.vegalite.VegaLite at 0x1098b90f0>

[34]: VegaLite(scatterplot)

[34]: <vega.vegalite.VegaLite at 0x1087c6eb8>
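Because the returned objects are plain Python dictionaries holding the Vega Lite specification, you can also persist them to disk and paste the contents into the online Vega editor mentioned above. A small sketch (the file name is arbitrary):
[ ]: import json

     # Write the histogram spec to a JSON file for use in the online Vega editor.
     with open('pay_0_histogram.json', 'w') as f:
         json.dump(hist, f, indent=2)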


41.7.6 API Methods


get_histogram(
dataset_key: str,
variable_name: str,
number_of_bars: int = 0,
transformation: str = "none",
mark: str = "bar",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
variable_name -- str, name of variable

---------------------------
Optional Keyword arguments:
---------------------------
number_of_bars -- int, number of bars
transformation -- str, default value is "none"
(otherwise, "log" or "square_root")
mark -- str, default value is "bar" (use "area" to get a density polygon)
"""

get_scatterplot(
dataset_key: str,
x_variable_name: str,
y_variable_name: str,
mark: str = "point",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)
y_variable_name -- str, name of y variable

---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point"
"""

get_bar_chart(
dataset_key: str,
x_variable_name: str,
y_variable_name: str = "",
transpose: bool = False,
mark: str = "bar",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)

---------------------------
Optional Keyword arguments:
---------------------------
y_variable_name -- str, name of y variable
transpose -- Boolean, default value is false

mark -- str, default value is "bar" (use "point" to get a Cleveland dot plot)
"""

get_parallel_coordinates_plot(
dataset_key: str,
variable_names: list = [],
permute: bool = False,
transpose: bool = False,
cluster: bool = False,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI

---------------------------
Optional Keyword arguments:
---------------------------
variable_names -- str, name of variables
(if no variables specified, all in dataset will be used)
permute -- Boolean, default value is false
(if true, use SVD to permute variables)
transpose -- Boolean, default value is false
cluster -- Boolean, k-means cluster variables and color plot by cluster IDs,
default value is false
"""

get_heatmap(
dataset_key: str,
variable_names: list = [],
permute: bool = False,
transpose: bool = False,
matrix_type: str = "rectangular",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI

---------------------------
Optional Keyword arguments:
---------------------------
variable_names -- str, name of variables
(if no variables specified, all in dataset will be used)
permute -- Boolean, default value is false
(if true, use SVD to permute rows and columns)
transpose -- Boolean, default value is false
matrix_type -- str, default value is "rectangular" (alternative is "symmetric")
"""

get_boxplot(
dataset_key: str,
variable_name: str,
group_variable_name: str = "",
transpose: bool = False,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
variable_name -- str, name of variable for box

---------------------------
Optional Keyword arguments:
---------------------------
group_variable_name -- str, name of grouping variable
transpose -- Boolean, default value is false
"""

get_linear_regression(
dataset_key: str,
x_variable_name: str,
y_variable_name: str,
mark: str = "point",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)
y_variable_name -- str, name of y variable

---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square")
"""

get_loess_regression(
dataset_key: str,
x_variable_name: str,
y_variable_name: str,
mark: str = "point",
bandwidth: float = 0.5,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------

dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable
(y variable assumed to be counts if no y variable specified)
y_variable_name -- str, name of y variable

---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square")
bandwidth -- float, number in the (0,1)
interval denoting proportion of cases in smoothing window (default is 0.5)
"""

get_dotplot(
dataset_key: str,
variable_name: str,
mark: str = "point",
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
variable_name -- str, name of variable on which dots are calculated

---------------------------
Optional Keyword arguments:
---------------------------
mark -- str, default value is "point" (alternative is "square" or "bar")
"""

get_distribution_plot(
dataset_key: str,
x_variable_name: str,
y_variable_name: str = "",
subtype: str = "probability_plot",
distribution: str = "normal",
mark: str = "point",
transpose: bool = False,
) -> dict:
"""
---------------------------
Required Keyword arguments:
---------------------------
dataset_key -- str, Key of visualized dataset in DriverlessAI
x_variable_name -- str, name of x variable

---------------------------
Optional Keyword arguments:
---------------------------
y_variable_name -- str, name of y variable for quantile plot
subtype -- str "probability_plot" or "quantile_plot"
(default is "probability_plot" done on x variable)
distribution -- str, type of distribution, "normal" or "uniform"
("normal" is default)
mark -- str, default value is "point" (alternative is "square")
transpose -- Boolean, default value is false
"""



CHAPTER

FORTYTWO

THE R CLIENT

This section describes how to install the Driverless AI R Client. It also provides an example tutorial showing how to
use the Driverless AI R client.

42.1 Installing the R Client

The R Client is available on the Driverless AI UI and from the command line. The installation process includes
downloading the R client and then installing the source package.

42.1.1 Prerequisites

The R client requires R version 3.3 or greater. In addition, the following R packages must be installed:
• RCurl
• jsonlite
• rlang
• methods

42.1.2 Download the R Client

The R Client can be downloaded from within Driverless AI or from the command line.

Download from UI

On the Driverless AI top menu, select the RESOURCES > R CLIENT link. This downloads the
dai_<version>.tar.gz file.


Download from Command Line

The Driverless AI R client is exposed as the /clients/ HTTP endpoint. This can be accessed via the command
line:
wget --trust-server-names http://<Driverless AI address>/clients/r

42.1.3 Install the Source Package

After you have downloaded the R client, the next step is to install the source package in R. This can be done by running
the following command in R.
install.packages('~/Downloads/dai_VERSION.tar.gz', type = 'source', repos = NULL)

After the package is installed, you can run the available dai-tutorial vignette to see an example of how to use the client:


vignette('dai-tutorial', package = 'dai')

42.2 R Client Tutorial

This tutorial describes how to use the Driverless AI R client package to control the Driverless AI platform. It
covers the main predictive data-science workflow, including:
1. Data load
2. Automated feature engineering and model tuning
3. Model inspection
4. Predicting on new data
5. Managing the datasets and models
Note: These steps assume that you have entered your license key in the Driverless AI UI.

42.2.1 Loading the Data

Before we can start working with the Driverless AI platform (DAI), we have to import the package and initialize the
connection:
library(dai)
dai.connect(uri = 'http://localhost:12345', username = 'h2oai', password = 'h2oai')

creditcard <- dai.create_dataset('/data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv')


#>
|
| | 0%
|
|================ | 24%
|
|=================================================================| 100%

The function dai.create_dataset() loads data located on the machine that hosts DAI. The above command
assumes that creditcard_train_cat.csv is in the /data folder on the machine running Driverless AI. This file is
available at https://s3.amazonaws.com/h2o-public-test-data/smalldata/kaggle/CreditCard/creditcard_train_cat.csv.
If you want to upload data located on your workstation, use dai.upload_dataset() instead.
If you already have the data loaded into R data.frame, you can simply convert it into a DAIFrame. For example:
iris.dai <- as.DAIFrame(iris)
#>
|
| | 0%
|
|=================================================================| 100%

print(iris.dai)
#> DAI frame '7c38cb84-5baa-11e9-a50b-b938de969cdb': 150 obs. of 5 variables
#> File path: ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin

You can switch off the progress bar whenever it is displayed by setting progress = FALSE.
Upon creation of the dataset, you can display the basic information and summary statistics by calling generics print
and summary:
print(creditcard)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

summary(creditcard)
#> variable num_classes is_numeric count
#> 1 ID 0 TRUE 23999
#> 2 LIMIT_BAL 79 TRUE 23999
#> 3 SEX 2 FALSE 23999
#> 4 EDUCATION 4 FALSE 23999
#> 5 MARRIAGE 4 FALSE 23999


#> 6 AGE 55 TRUE 23999
#> 7 PAY_1 11 TRUE 23999
#> 8 PAY_2 11 TRUE 23999
#> 9 PAY_3 11 TRUE 23999
#> 10 PAY_4 11 TRUE 23999
#> 11 PAY_5 10 TRUE 23999
#> 12 PAY_6 10 TRUE 23999
#> 13 BILL_AMT1 0 TRUE 23999
#> 14 BILL_AMT2 0 TRUE 23999
#> 15 BILL_AMT3 0 TRUE 23999
#> 16 BILL_AMT4 0 TRUE 23999
#> 17 BILL_AMT5 0 TRUE 23999
#> 18 BILL_AMT6 0 TRUE 23999
#> 19 PAY_AMT1 0 TRUE 23999
#> 20 PAY_AMT2 0 TRUE 23999
#> 21 PAY_AMT3 0 TRUE 23999
#> 22 PAY_AMT4 0 TRUE 23999
#> 23 PAY_AMT5 0 TRUE 23999
#> 24 PAY_AMT6 0 TRUE 23999
#> 25 DEFAULT_PAYMENT_NEXT_MONTH 2 TRUE 23999
#> mean std min max unique freq
#> 1 12000 6928.05889120466 1 23999 23999 1
#> 2 165498.715779824 129130.743065318 10000 1000000 79 2740
#> 3 2 8921
#> 4 4 11360
#> 5 4 12876
#> 6 35.3808492020501 9.2710457493384 21 79 55 1284
#> 7 -0.00312513021375891 1.12344874325651 -2 8 11 11738
#> 8 -0.123463477644902 1.20059118344043 -2 8 11 12543
#> 9 -0.154756448185341 1.20405796618856 -2 8 11 12576
#> 10 -0.211675486478603 1.16657279943005 -2 8 11 13250
#> 11 -0.252885536897371 1.13700672904 -2 8 10 13520
#> 12 -0.278011583815992 1.1581916495226 -2 8 10 12876
#> 13 50598.9286636943 72650.1978092856 -165580 964511 18717 1607
#> 14 48648.0474186424 70365.3956426641 -69777 983931 18367 2049
#> 15 46368.9035376474 68194.7195202748 -157264 1664089 18131 2325
#> 16 42369.8728280345 63071.4551670874 -170000 891586 17719 2547
#> 17 40002.3330972124 60345.7282797424 -81334 927171 17284 2840
#> 18 38565.2666361098 59156.5011434754 -339603 961664 16906 3258
#> 19 5543.09804575191 15068.86272958 0 505000 6918 4270
#> 20 5815.52852202175 20797.443884891 0 1684259 6839 4362
#> 21 4969.43139297471 16095.9292948255 0 896040 6424 4853
#> 22 4743.65686070253 14883.5548720259 0 497000 6028 5200
#> 23 4783.64369348723 15270.7039035392 0 417990 5984 5407
#> 24 5189.57360723363 17630.7185745277 0 528666 5988 5846
#> 25 0.223717654902288 0.41674368928609 FALSE TRUE 2 5369
#>
˓→ num_hist_ticks
#> 1 1.0, 2400.8, 4800.6, 7200.400000000001, 9600.2, 12000.0, 14399.800000000001, 16799.600000000002, 19199.4,
˓→ 21599.2, 23999.0
#> 2 10000.0, 109000.0, 208000.0, 307000.0, 406000.0, 505000.0, 604000.0, 703000.0, 802000.0,
˓→901000.0, 1000000.0
#> 3
#> 4
#> 5
#> 6 21.0, 26.8, 32.6, 38.4, 44.2, 50.0, 55.8, 61.6, 67.4, 73.
˓→19999999999999, 79.0
#> 7 -2, -1, 0, 1, 2,
˓→ 3, 4, 5, 6, 7, 8
#> 8 -2, -1, 0, 1, 2,
˓→ 3, 4, 5, 6, 7, 8
#> 9 -2, -1, 0, 1, 2,
˓→ 3, 4, 5, 6, 7, 8
#> 10 -2, -1, 0, 1, 2,
˓→ 3, 4, 5, 6, 7, 8
#> 11 -2, -1, 0, 2,
˓→ 3, 4, 5, 6, 7, 8
#> 12 -2, -1, 0, 2,
˓→ 3, 4, 5, 6, 7, 8
#> 13 -165580.0, -52570.899999999994, 60438.20000000001, 173447.30000000005, 286456.4, 399465.5, 512474.6000000001, 625483.7000000001, 738492.8,
˓→851501.9, 964511.0
#> 14 -69777.0, 35593.8, 140964.6, 246335.40000000002, 351706.2, 457077.0, 562447.8, 667818.6, 773189.4, 878560.
˓→2000000001, 983931.0
#> 15 -157264.0, 24871.29999999999, 207006.59999999998, 389141.8999999999, 571277.2, 753412.5, 935547.7999999998, 1117683.0999999999, 1299818.4,
˓→1481953.7, 1664089.0
#> 16 -170000.0, -63841.399999999994, 42317.20000000001, 148475.80000000005, 254634.40000000002, 360793.0, 466951.6000000001, 573110.2000000001, 679268.8,
˓→785427.4, 891586.0
#> 17 -81334.0, 19516.5, 120367.0, 221217.5, 322068.0, 422918.5, 523769.0, 624619.5, 725470.0,
˓→826320.5, 927171.0
#> 18 -339603.0, -209476.3, -79349.6, 50777.09999999998, 180903.8, 311030.5, 441157.19999999995, 571283.9, 701410.6,
˓→831537.3, 961664.0
#> 19 0.0, 50500.0, 101000.0, 151500.0, 202000.0, 252500.0, 303000.0, 353500.0, 404000.0,
˓→454500.0, 505000.0
#> 20 0.0, 168425.9, 336851.8, 505277.69999999995, 673703.6, 842129.5, 1010555.3999999999, 1178981.3, 1347407.2, 1515833.
˓→0999999999, 1684259.0
#> 21 0.0, 89604.0, 179208.0, 268812.0, 358416.0, 448020.0, 537624.0, 627228.0, 716832.0,
˓→806436.0, 896040.0
#> 22 0.0, 49700.0, 99400.0, 149100.0, 198800.0, 248500.0, 298200.0, 347900.0, 397600.0,
˓→447300.0, 497000.0
#> 23 0.0, 41799.0, 83598.0, 125397.0, 167196.0, 208995.0, 250794.0, 292593.0, 334392.0,
˓→376191.0, 417990.0
#> 24 0.0, 52866.6, 105733.2, 158599.8, 211466.4, 264333.0, 317199.6, 370066.2, 422932.8, 475799.
˓→39999999997, 528666.0
#> 25
˓→ False, True
#> num_hist_counts top
#> 1 2400, 2400, 2400, 2400, 2399, 2400, 2400, 2400, 2400, 2400
#> 2 10151, 6327, 3965, 2149, 1251, 96, 44, 15, 0, 1
#> 3 female
#> 4 university
#> 5 single
#> 6 4285, 6546, 5187, 3780, 2048, 1469, 501, 147, 34, 2
#> 7 2086, 4625, 11738, 2994, 2185, 254, 66, 17, 9, 7, 18
#> 8 2953, 4886, 12543, 20, 3204, 268, 76, 21, 9, 18, 1


#> 9 3197, 4787, 12576, 4, 3121, 183, 64, 17, 21, 27, 2
#> 10 3382, 4555, 13250, 2, 2515, 158, 55, 29, 5, 46, 2
#> 11 3539, 4482, 13520, 2178, 147, 71, 11, 3, 47, 1
#> 12 3818, 4722, 12876, 2324, 158, 37, 9, 16, 37, 2
#> 13 2, 17603, 4754, 1193, 316, 111, 18, 1, 0, 1
#> 14 14571, 7214, 1578, 429, 155, 43, 7, 1, 0, 1
#> 15 12977, 10150, 767, 99, 5, 0, 0, 0, 0, 1
#> 16 2, 16619, 5775, 1181, 311, 89, 20, 1, 0, 1
#> 17 12722, 9033, 1720, 374, 113, 31, 4, 0, 1, 1
#> 18 1, 1, 18312, 4788, 745, 131, 19, 1, 0, 1
#> 19 23643, 249, 56, 26, 14, 8, 0, 1, 1, 1
#> 20 23936, 50, 11, 1, 0, 0, 0, 0, 0, 1
#> 21 23836, 130, 20, 9, 3, 0, 0, 0, 0, 1
#> 22 23647, 235, 65, 29, 11, 5, 4, 0, 2, 1
#> 23 23588, 234, 94, 40, 22, 7, 3, 8, 0, 3
#> 24 23605, 235, 77, 56, 15, 5, 1, 3, 0, 2
#> 25 18630, 5369
#> nonnum_hist_ticks nonnum_hist_counts
#> 1
#> 2
#> 3 female, male, Other 15078, 8921, 0
#> 4 university, graduate, Other 11360, 8442, 4197
#> 5 single, married, Other 12876, 10813, 310
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25

A couple of other generics work as usual on a DAIFrame: dim, head, and format.
dim(creditcard)
#> [1] 23999 25

head(creditcard, 10)
#> ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1 1 20000 female university married 24 2 2 -1 -1
#> 2 2 120000 female university single 26 -1 2 0 0
#> 3 3 90000 female university single 34 0 0 0 0
#> 4 4 50000 female university married 37 0 0 0 0
#> 5 5 50000 male university married 57 -1 0 -1 0
#> 6 6 50000 male graduate single 37 0 0 0 0
#> 7 7 500000 male graduate single 29 0 0 0 0
#> 8 8 100000 female university single 23 0 -1 -1 0
#> 9 9 140000 female highschool married 28 0 0 2 0
#> 10 10 20000 male highschool single 35 -2 -2 -2 -2
#> PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1 -2 -2 3913 3102 689 0 0 0
#> 2 0 2 2682 1725 2682 3272 3455 3261
#> 3 0 0 29239 14027 13559 14331 14948 15549
#> 4 0 0 46990 48233 49291 28314 28959 29547
#> 5 0 0 8617 5670 35835 20940 19146 19131
#> 6 0 0 64400 57069 57608 19394 19619 20024
#> 7 0 0 367965 412023 445007 542653 483003 473944
#> 8 0 -1 11876 380 601 221 -159 567
#> 9 0 0 11285 14096 12108 12211 11793 3719
#> 10 -1 -1 0 0 0 0 13007 13912
#> PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1 0 689 0 0 0 0
#> 2 0 1000 1000 1000 0 2000
#> 3 1518 1500 1000 1000 1000 5000
#> 4 2000 2019 1200 1100 1069 1000
#> 5 2000 36681 10000 9000 689 679
#> 6 2500 1815 657 1000 1000 800
#> 7 55000 40000 38000 20239 13750 13770
#> 8 380 601 0 581 1687 1542
#> 9 3329 0 432 1000 1000 1000
#> 10 0 0 0 13007 1122 0
#> DEFAULT_PAYMENT_NEXT_MONTH
#> 1 TRUE
#> 2 TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE

You cannot, however, use DAIFrame to access all its data, nor can you use it to modify the data. It only represents
the data set loaded into the DAI platform. The head function gives access only to example data:


creditcard$example_data[1:10, ]
#> ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1 1 20000 female university married 24 2 2 -1 -1
#> 2 2 120000 female university single 26 -1 2 0 0
#> 3 3 90000 female university single 34 0 0 0 0
#> 4 4 50000 female university married 37 0 0 0 0
#> 5 5 50000 male university married 57 -1 0 -1 0
#> 6 6 50000 male graduate single 37 0 0 0 0
#> 7 7 500000 male graduate single 29 0 0 0 0
#> 8 8 100000 female university single 23 0 -1 -1 0
#> 9 9 140000 female highschool married 28 0 0 2 0
#> 10 10 20000 male highschool single 35 -2 -2 -2 -2
#> PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1 -2 -2 3913 3102 689 0 0 0
#> 2 0 2 2682 1725 2682 3272 3455 3261
#> 3 0 0 29239 14027 13559 14331 14948 15549
#> 4 0 0 46990 48233 49291 28314 28959 29547
#> 5 0 0 8617 5670 35835 20940 19146 19131
#> 6 0 0 64400 57069 57608 19394 19619 20024
#> 7 0 0 367965 412023 445007 542653 483003 473944
#> 8 0 -1 11876 380 601 221 -159 567
#> 9 0 0 11285 14096 12108 12211 11793 3719
#> 10 -1 -1 0 0 0 0 13007 13912
#> PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1 0 689 0 0 0 0
#> 2 0 1000 1000 1000 0 2000
#> 3 1518 1500 1000 1000 1000 5000
#> 4 2000 2019 1200 1100 1069 1000
#> 5 2000 36681 10000 9000 689 679
#> 6 2500 1815 657 1000 1000 800
#> 7 55000 40000 38000 20239 13750 13770
#> 8 380 601 0 581 1687 1542
#> 9 3329 0 432 1000 1000 1000
#> 10 0 0 0 13007 1122 0
#> DEFAULT_PAYMENT_NEXT_MONTH
#> 1 TRUE
#> 2 TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE

A dataset can be split into e.g. training and test sets directly in R:
creditcard.splits <- dai.split_dataset(creditcard,
output_name1 = 'train',
output_name2 = 'test',
ratio = .8,
seed = 25,
progress = FALSE)

In this case creditcard.splits is a list with two elements named "train" and "test", where 80% of the data went
into train and 20% into test.
creditcard.splits$train
#> DAI frame '7cf3024c-5baa-11e9-a50b-b938de969cdb': 19199 obs. of 25 variables
#> File path: ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin

creditcard.splits$test
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin

By default it yields a simple random sample, but you can do stratified or time-based splits as well. See the function’s
documentation for more details.

42.2.2 Automated Feature Engineering and Model Tuning

One of the main strengths of Driverless AI is the fully automated feature engineering along with hyperparameter
tuning, model selection and ensembling. The function dai.train() executes the experiment that results in a
DAIModel instance that represents the model.
model <- dai.train(training_frame = creditcard.splits$train,
testing_frame = creditcard.splits$test,
target_col = 'DEFAULT_PAYMENT_NEXT_MONTH',
is_classification = T,
is_timeseries = F,
accuracy = 1, time = 1, interpretability = 10,
seed = 25)
#>
|
| | 0%
|
|========================== | 40%
|


|=============================================== | 73%
|
|=========================================================== | 91%
|
|=================================================================| 100%

If you do not specify the accuracy, time, or interpretability, they will be suggested by the DAI platform. (See
dai.suggest_model_params.)

42.2.3 Model Inspection

As with DAIFrame, generic methods such as print, format, summary, and predict work with DAIModel:
print(model)
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#> Settings: 1/1/10, seed=25, GPUs enabled
#> Train data: train (19199, 25)
#> Validation data: N/A
#> Test data: test (4800, 24)
#> Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#> Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#> Validation scheme: stratified, 1 internal holdout
#> Feature engineering: 33 features scored (18 selected)
#> Timing:
#> Data preparation: 4.94 secs
#> Model and feature tuning: 10.13 secs (3 models trained)
#> Feature evolution: 5.54 secs (1 of 3 model trained)
#> Final pipeline training: 7.85 secs (1 model trained)
#> Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score: AUC = 0.7861 +/- 0.0064711 (final pipeline)

summary(model)$score
#> [1] 0.7780229

42.2.4 Predicting on New Data

New data can be scored in two different ways:


• Call predict() directly on the model in the R session.
• Download a scoring pipeline and embed it into your Python or Java workflow.

Predicting in R

Generic predict() either directly returns an R data.frame with the results (by default) or it returns a URL pointing
to a CSV file with the results (return_df=FALSE). The latter option may be useful when you predict on a large dataset.
predictions <- predict(model, newdata = creditcard.splits$test)
#>
|
| | 0%
|
|=================================================================| 100%
#> Loading required package: bitops

head(predictions)
#> DEFAULT_PAYMENT_NEXT_MONTH.0 DEFAULT_PAYMENT_NEXT_MONTH.1
#> 1 0.8879988 0.11200116
#> 2 0.9289870 0.07101299
#> 3 0.9550328 0.04496716
#> 4 0.3513577 0.64864230
#> 5 0.9183724 0.08162758
#> 6 0.9154425 0.08455751

predict(model, newdata = creditcard.splits$test, return_df = FALSE)


#>
|
| | 0%
|
|=================================================================| 100%
#> [1] "h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/7e2b70ae-5baa-11e9-a50b-b938de969cdb_preds_f854b49f.csv"


Downloading Python or MOJO Scoring Pipelines

For productizing your model in Python or Java, you can download the full Python or MOJO pipelines, respectively. For
more information about how to use the pipelines, please see the documentation.
dai.download_mojo(model, path = tempdir(), force = TRUE)
#>
|
| | 0%
|
|=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/mojo-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"

dai.download_python_pipeline(model, path = tempdir(), force = TRUE)


#>
|
| | 0%
|
|=================================================================| 100%
#> Downloading the pipeline:
#> [1] "/tmp/RtmppsLTZ9/python-pipeline-7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip"

42.2.5 Managing the Datasets and Models

After some time, you may have multiple datasets and models on your DAI server. The dai package offers a few utility
functions to find, reuse, and remove the existing datasets and models.
If you already have the dataset loaded into DAI, you can get the DAIFrame object by either dai.get_frame (if
you know the frame’s key) or dai.find_dataset (if you know the original path or at least a part of it):
dai.get_frame(creditcard$key)
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

dai.find_dataset('creditcard')
#> DAI frame '7abe28b2-5baa-11e9-a50b-b938de969cdb': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

The latter directly returns the frame if there is only one match. Otherwise it lets you select which frame to return
from all the matching candidates.
Furthermore, you can get a list of datasets or models:
datasets <- dai.list_datasets()
head(datasets)
#> key name
#> 1 7cf613a6-5baa-11e9-a50b-b938de969cdb test
#> 2 7cf3024c-5baa-11e9-a50b-b938de969cdb train
#> 3 7c38cb84-5baa-11e9-a50b-b938de969cdb iris9e1f15d2df00.csv
#> 4 7abe28b2-5baa-11e9-a50b-b938de969cdb creditcard_train_cat.csv
#> file_path
#> 1 ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin
#> 2 ./tmp/7cf3024c-5baa-11e9-a50b-b938de969cdb/train.1554912341.0864356.bin
#> 3 ./tmp/7c38cb84-5baa-11e9-a50b-b938de969cdb/iris9e1f15d2df00.csv.1554912339.9424415.bin
#> 4 tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
#> file_size data_source row_count column_count import_status import_error
#> 1 567584 upload 4800 25 0
#> 2 2265952 upload 19199 25 0
#> 3 7064 upload 150 5 0
#> 4 2832040 file 23999 25 0
#> aggregation_status aggregation_error aggregated_frame mapping_frame
#> 1 -1
#> 2 -1
#> 3 -1
#> 4 -1
#> uploaded
#> 1 TRUE
#> 2 TRUE
#> 3 TRUE
#> 4 FALSE

models <- dai.list_models()


head(models)
#> key description
#> 1 7e2b70ae-5baa-11e9-a50b-b938de969cdb mupulori
#> dataset_name parameters.dataset_key
#> 1 train.1554912341.0864356.bin 7cf3024c-5baa-11e9-a50b-b938de969cdb
#> parameters.resumed_model_key parameters.target_col
#> 1 DEFAULT_PAYMENT_NEXT_MONTH
#> parameters.weight_col parameters.fold_col parameters.orig_time_col
#> 1
#> parameters.time_col parameters.is_classification parameters.cols_to_drop
#> 1 [OFF] TRUE NULL
#> parameters.validset_key parameters.testset_key
#> 1 7cf613a6-5baa-11e9-a50b-b938de969cdb
#> parameters.enable_gpus parameters.seed parameters.accuracy


#> 1 TRUE 25 1
#> parameters.time parameters.interpretability parameters.scorer
#> 1 1 10 AUC
#> parameters.time_groups_columns parameters.time_period_in_seconds
#> 1 NULL NA
#> parameters.num_prediction_periods parameters.num_gap_periods
#> 1 NA NA
#> parameters.is_timeseries parameters.config_overrides
#> 1 FALSE NA
#> log_file_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/h2oai_experiment_logs_7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip
#> pickle_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/best_individual.pickle
#> summary_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/h2oai_experiment_summary_7e2b70ae-5baa-11e9-a50b-b938de969cdb.zip
#> train_predictions_path valid_predictions_path
#> 1
#> test_predictions_path
#> 1 h2oai_experiment_7e2b70ae-5baa-11e9-a50b-b938de969cdb/test_preds.csv
#> progress status training_duration scorer score test_score deprecated
#> 1 1 0 71.43582 AUC 0.7780229 0.7861 FALSE
#> model_file_size diagnostic_keys
#> 1 695996094 NULL

If you know the key of the dataset or model, you can obtain the instance of DAIFrame or DAIModel by
dai.get_model and dai.get_frame:
dai.get_model(models$key[1])
#> Status: Complete
#> Experiment: 7e2b70ae-5baa-11e9-a50b-b938de969cdb, 2019-04-10 18:06, 1.7.0+local_0c7d019-dirty
#> Settings: 1/1/10, seed=25, GPUs enabled
#> Train data: train (19199, 25)
#> Validation data: N/A
#> Test data: test (4800, 24)
#> Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 2/2 GPUs
#> Max memory usage: 0.406 GB, 0.167 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#> Validation scheme: stratified, 1 internal holdout
#> Feature engineering: 33 features scored (18 selected)
#> Timing:
#> Data preparation: 4.94 secs
#> Model and feature tuning: 10.13 secs (3 models trained)
#> Feature evolution: 5.54 secs (1 of 3 model trained)
#> Final pipeline training: 7.85 secs (1 model trained)
#> Python / MOJO scorer building: 42.05 secs / 0.00 secs
#> Validation score: AUC = 0.77802 +/- 0.0077539 (baseline)
#> Validation score: AUC = 0.77802 +/- 0.0077539 (final pipeline)
#> Test score: AUC = 0.7861 +/- 0.0064711 (final pipeline)
dai.get_frame(datasets$key[1])
#> DAI frame '7cf613a6-5baa-11e9-a50b-b938de969cdb': 4800 obs. of 25 variables
#> File path: ./tmp/7cf613a6-5baa-11e9-a50b-b938de969cdb/test.1554912341.0966916.bin

Finally, the datasets and models can be removed by dai.rm:


dai.rm(model, creditcard, creditcard.splits$train, creditcard.splits$test)
#> Model 7e2b70ae-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7abe28b2-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7cf3024c-5baa-11e9-a50b-b938de969cdb removed
#> Dataset 7cf613a6-5baa-11e9-a50b-b938de969cdb removed

The function dai.rm deletes the objects by default both from the server and from the R session. If you wish to remove
them only from the server, you can set from_session=FALSE. Please note that only named objects can be removed
from the session; in the example above, the creditcard.splits$train and creditcard.splits$test values
will not be removed from the R session because they are given as function calls rather than names (recall that $ is a function).



CHAPTER

FORTYTHREE

DRIVERLESS AI LOGS

This section describes how to access Driverless AI logs and includes information on which logs to send in the event
of a failure.

43.1 Accessing Driverless AI Logs

Driverless AI provides a number of logs that can be retrieved while visualizing datasets, while an experiment is
running, and after an experiment is completed.

43.1.1 While Visualizing Datasets

When running Autovisualization, you can access the Autoviz logs by clicking the Display Logs button on the Visualize
Datasets page.


This page presents logs created while the dataset visualization was being performed. You can download the
vis-data-server.log file by clicking the Download Logs button on this page. This file can be used to troubleshoot any
issues encountered during dataset visualization.

43.1.2 While an Experiment is Running

While the experiment is running, you can access the logs by clicking on the Log button on the experiment screen. The
Log button can be found in the CPU/Memory section.

Clicking on the Log button will present the experiment logs in real time. You can download these logs by clicking on
the Download Logs button in the upper right corner.


Only the h2oai_experiment.log can be downloaded while the experiment is running (for example:
h2oai_experiment_tobosoru.log). It will have the same information as the logs being presented in real time on
the screen.
For troubleshooting purposes, it is best to view the complete h2oai_experiment.log (or
h2oai_experiment_anonymized.log). This will be available after the experiment finishes, as described in the
next section.

43.1.3 After an Experiment has Finished

If the experiment has finished, you can download the logs by clicking on the Download Experiment & Logs button
on the completed experiment screen.


This will download a zip file that includes the following logs along with a summary of the experiment:
• h2oai_experiment.log: This is the log corresponding to the experiment.
• h2oai_experiment_anonymized.log: This is the log corresponding to the experiment where all data in the log
is anonymized.
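If you work from the Python client rather than the UI, the same experiment log archive can be fetched with the download call shown in the Python client chapter. A minimal sketch, assuming an established Client connection h2oai and a finished experiment object model:

logs_path = h2oai.download(model.log_file_path, '.')
print('Experiment logs downloaded to', logs_path)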

43.1.4 During Model Interpretation

Driverless AI allows you to view and download Python and/or Java logs while MLI is running. Note that these logs
are not available for time-series experiments.


• The Display MLI Python Logs button allows you to view or download the Python log for the model interpre-
tation. The downloaded file is named h2oai_experiment_{mli_key}.log.
• The Display MLI Java Logs button allows you to view or download the Java log for the model interpretation.
The downloaded file is named mli_experiment_{mli_key}.log.

43.1.5 After Model Interpretation

You can view an MLI log for completed model interpretations by selecting the Download MLI Logs link on the MLI
page.

This will download a zip file which includes the following logs:
• h2oai_experiment_{mli_key}.log: This is the log corresponding to the model interpretation.
• h2oai_experiment_{mli_key}_anonymized.log: This is the log corresponding to the model interpretation
where all data in the log is anonymized.
• mli_experiment_{mli_key}.log: This is the Java log corresponding to the model interpretation.
This file can be used to view logging information for successful interpretations. If MLI fails, then those
logs are in ./tmp/h2oai_experiment_{mli_key}.log, ./tmp/h2oai_experiment_{mli_key}_anonymized.log, and
./tmp/mli_experiment_{mli_key}.log.


43.2 Sending Logs to H2O

This section describes the logs to send in the event of failures when running Driverless AI.

43.2.1 Dataset Failures

• Adding Datasets: If a dataset fails to import, a message on the screen should provide the reason for the failure.
The logs to send are available in the Driverless AI ./tmp folder.
• Dataset Details: If a failure occurs when attempting to view Dataset Details, the logs to send are available in
the Driverless AI ./tmp folder.
• Autovisualization: If a failure occurs when attempting to Visualize Datasets, a message on the screen should
provide a reason for the failure. The logs to send are available in the Driverless AI ./tmp folder.

43.2.2 Experiments

• While Running an Experiment: As indicated previously, a Log button is available on the Experiment page.
Clicking on the Log button will present the experiment logs in real time. You can download these logs by
clicking on the Download Logs button in the upper right corner. You can also retrieve the h2oai_experiment.log
for the corresponding experiment in the Driverless AI ./tmp folder.

43.2.3 MLI

• During Model Interpretation: If a failure occurs during model interpretation, then the logs to send are
./tmp/h2oai_experiment_{mli_key}.log and ./tmp/h2oai_experiment_{mli_key}_anonymized.log.

43.2.4 Custom Recipes

• After Running an Experiment: If a Custom Recipe is producing errors, the entire zip file obtained by clicking
on the Download Experiments & Logs button can be sent for troubleshooting. Note that these files may contain
information that is not anonymized.



CHAPTER

FORTYFOUR

DRIVERLESS AI SECURITY

44.1 Objective

The goal of this document is to describe different aspects of Driverless AI security and to provide guidelines for
securing the system by reducing its attack surface.
This section covers the following areas of the product:
• User access
• Authentication
• Authorization
• Data security
• Data import
• Data export
• Logs
• User data isolation
• Transfer security
• Custom recipes security
• Web UI security

44.2 Important things to know

Warning: Security in a default installation of Driverless AI is DISABLED! By default, a Driverless AI installation
targets ease-of-use and does not enable all security features listed in this document. For production environments,
we recommend following this document and performing a secure Driverless AI installation.


44.3 User Access

44.3.1 Authentication Methods

• authentication_method
  Default value: "unvalidated"
  Recommended value: Any supported authentication method (e.g., LDAP, PAM) except "unvalidated" and "none". Consult your security requirements.
  Description: Defines the user authentication method.

• authentication_default_timeout_hours
  Default value: 72
  Description: Number of hours after which a user has to re-login.

44.3.2 Authorization Methods

At this point, Driverless AI does not perform any authorization.

44.4 Data Security

44.4.1 Data Import

• enabled_file_systems
  Default value: "upload, file, hdfs, s3"
  Recommended value: Configure only the needed data sources.
  Description: Controls the list of available/configured data sources.

• max_file_upload_size
  Default value: 104857600000B
  Recommended value: Configure based on the expected file size and the size of the Driverless AI deployment.
  Description: Limits the maximum size of an uploaded file.

• supported_file_types
  Default value: see config.toml
  Recommended value: It is recommended to limit file types to the extensions used in the target environment (e.g., parquet).
  Description: Supported file formats listed in filesystem browsers.

• show_all_filesystems
  Default value: true
  Recommended value: false
  Description: Show all available data sources in the Web UI (even those that are not configured). It is recommended to show only configured data sources.


44.4.2 Data Export

• enable_dataset_downloading
  Default value: true
  Recommended value: false (disable download of datasets)
  Description: Controls the ability to download any datasets (uploaded, predictions, MLI). Note: if dataset download is disabled, we strongly suggest disabling custom recipes as well, to remove another way data could be exported from the application.

• enable_artifacts_upload
  Default value: false
  Recommended value: false
  Description: Replaces all downloads on the experiment page with "exports" and allows users to push to the artifact store configured with artifacts_store. (See notes below.)

• artifacts_store
  Default value: file_system
  Recommended value: file_system
  Description: Stores a MOJO on a file system directory denoted by artifacts_file_system_directory. (See notes below.)

• artifacts_file_system_directory
  Default value: tmp
  Recommended value: tmp
  Description: File system location where artifacts will be copied in case artifacts_store is set to file_system. (See notes below.)

Notes about Artifacts:


• Currently, file_system is the only option that can be specified for artifacts_store. Additional options
will be available in future releases.
• The location for artifacts_file_system_directory is expected to be a directory on your server.
• When these artifacts are enabled/configured, the menu options on the Completed Experiment page will change.
Specifically, all “Download” options (with the exception of Autoreport) will change to “Export.” Refer to Export
Artifacts for more information.


44.4.3 Logs

Driverless AI produces several logs:

• audit logs
• server logs
• experiment logs
The administrator of the Driverless AI application (i.e., the person responsible for configuring and setting up the
application) has control over the content that is written to the logs.

• audit_log_retention_period
  Default value: 5 (days)
  Recommended value: 0 (disable audit log rotation)
  Description: Number of days to keep audit logs. The value 0 disables rotation.

• do_not_log_list
  Default value: see config.toml
  Description: Contains the list of configuration options that are not recorded in logs.

• log_level
  Default value: 1
  Recommended value: see config.toml
  Description: Defines the verbosity of logging.

• collect_server_logs_in_experiment_logs
  Default value: false
  Recommended value: false
  Description: Dump server logs with the experiment. Dangerous, since server logs can contain information about experiments of other users of Driverless AI.

• h2o_recipes_log_level
  Default value: None
  Description: Log level for OSS H2O instances used by custom recipes.

• debug_log
  Default value: false
  Recommended value: false
  Description: Enable debug logs.

• write_recipes_to_experiment_logger
  Default value: false
  Recommended value: false
  Description: Dump custom recipe source code into logs.

44.4.4 User Data Isolation

• data_directory
  Default value: "./tmp"
  Recommended value: Specify a proper name and location for the directory.
  Description: Directory where Driverless AI stores all computed experiments and datasets.

• file_hide_data_directory
  Default value: true
  Recommended value: true
  Description: Hide data_directory in the file-system browser. It is recommended to hide it to protect data_directory from browsing and corruption.

• file_path_filtering_enabled
  Default value: false
  Recommended value: true
  Description: Enable the path filter for the file-system browser (file data source). By default the filter is disabled, which means users can browse the entire application-local filesystem.

• file_path_filter_include
  Default value: []
  Recommended value: It is recommended to predefine the list of paths that users can access in the file browser. For example, ['/home', '/data'].
  Description: List of absolute path prefixes to restrict access to in the file browser.


44.5 Client-Server Communication Security

• enable_https
  Default value: false
  Recommended value: true
  Description: Enable HTTPS.

• ssl_key_file
  Default value: "/etc/dai/private_key.pem"
  Recommended value: Correct private key.
  Description: Private key used to set up HTTPS/SSL communication.

• ssl_crt_file
  Default value: "/etc/dai/cert.pem"
  Recommended value: Correct public certificate.
  Description: Public certificate used to set up HTTPS/SSL.

• ssl_no_sslv2
  Default value: true
  Recommended value: true
  Description: Prevents an SSLv2 connection.

• ssl_no_sslv3
  Default value: true
  Recommended value: true
  Description: Prevents an SSLv3 connection.

• ssl_no_tlsv1
  Default value: true
  Recommended value: true
  Description: Prevents a TLSv1 connection.

• ssl_no_tlsv1_1
  Default value: true
  Recommended value: true
  Description: Prevents a TLSv1.1 connection.

• ssl_no_tlsv1_2
  Default value: false
  Recommended value: false (disable TLSv1.2 only if TLSv1.3 is available)
  Description: Prevents a TLSv1.2 connection.

• ssl_no_tlsv1_3
  Default value: false
  Recommended value: false
  Description: Prevents a TLSv1.3 connection.

44.5.1 Response Headers

The response headers that are passed between the Driverless AI server and clients (browser, Python/R clients) are
controlled via the following option:

• extra_http_headers
  Default value: "{}"
  Recommended value: See below.
  Description: Configure HTTP headers returned in server responses.


Recommended Response Headers

• Strict-Transport-Security
  Example value: max-age=63072000
  Description: This header lets a web site tell browsers that it should only be accessed using HTTPS, instead of using HTTP. The max-age specifies the time, in seconds, that the browser should remember that the site is only to be accessed using HTTPS.
  Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security

• Content-Security-Policy
  Example value: default-src https: ; font-src 'self'; script-src 'self' 'unsafe-eval' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; object-src 'none'
  Description: Content Security Policy (CSP) is an added layer of security that helps to detect and mitigate certain types of attacks, including Cross Site Scripting and data injection attacks. It controls from where the page can download sources. Note: Driverless AI still requires unsafe-eval and unsafe-inline to be configured, which potentially makes the server vulnerable to XSS attacks.
  Links: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy, https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP, https://infosec.mozilla.org/guidelines/web_security#Examples_5

• X-Frame-Options
  Example value: deny
  Description: Controls where a page can get sources to render in a frame. The value here overrides the default, which is SAMEORIGIN.
  Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Frame-Options

• X-Content-Type-Options
  Example value: nosniff
  Description: Prevents the browser from trying to determine the content type of a resource that is different from the declared content type.
  Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options

• X-XSS-Protection
  Example value: 1; mode=block
  Description: The HTTP X-XSS-Protection response header is a feature of Internet Explorer, Chrome and Safari that stops pages from loading when they detect reflected cross-site scripting (XSS) attacks. When the value is set to 1 and a cross-site scripting attack is detected, the browser will sanitize the page (remove the unsafe parts).
  Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection

Other Headers to Consider

• Public-Key-Pins: https://developer.mozilla.org/en-US/docs/Web/HTTP/Public_Key_Pinning
• CORS-related headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS
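To verify that the configured headers are actually returned by the server, a quick check from any machine with Python and the requests package can help. This is only a sketch; replace the URL with your Driverless AI address and adjust certificate verification if you use a self-signed certificate.

import requests

resp = requests.get('https://your-driverless-ai-host:12345', verify=True)
for header in ('Strict-Transport-Security', 'Content-Security-Policy',
               'X-Frame-Options', 'X-Content-Type-Options', 'X-XSS-Protection'):
    print(header, '->', resp.headers.get(header))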


44.6 Web UI Security

Note: The Driverless AI UI is designed to be user-friendly, and by default all features such as auto-complete are enabled.
Disabling the user-friendly features increases the security of the application but impacts its usability.

• allow_form_autocomplete
  Default value: true
  Recommended value: false
  Description: Controls auto-completion in Web UI elements (e.g., login inputs).

• allow_localstorage
  Default value: true
  Recommended value: false
  Description: Disable the use of web browser local storage.

• show_all_filesystems
  Default value: true
  Recommended value: false
  Description: Show all available data sources in the Web UI (even those that are not configured). It is recommended to show only configured data sources.

44.7 Custom Recipe Security

Note: By default, Driverless AI enables custom recipes as the main route for data-science teams to extend the
application's capabilities. In enterprise environments, it is recommended to follow best software engineering practices
when developing custom recipes (i.e., code reviews, testing, staged releases, etc.) and to bundle only a pre-defined and
approved set of custom Driverless AI extensions.

• enable_custom_recipes
  Default value: true
  Recommended value: false
  Description: Enable custom Python recipes.

• enable_custom_recipes_upload
  Default value: true
  Recommended value: false
  Description: Enable uploading of custom recipes.

• include_custom_recipes_by_default
  Default value: false
  Recommended value: false
  Description: Include custom recipes in default inclusion lists (warning: enables all custom recipes).

44.8 Baseline Secure Configuration

The following Driverless AI configuration is an example of a secure configuration. Please make sure that you fill in all
necessary config options.
#
# Auth
#

# Configure auth method


#authentication_method="PAM"
# Force user re-login after 24 hours
authentication_default_timeout_hours=24

#
# Data
#


# Configure available connectors
#enabled_file_systems="hdfs"
show_all_filesystems=false

# Restrict Downloads
enable_dataset_downloading=false

#
# Logs
#
audit_log_retention_period=0
collect_server_logs_in_experiment_logs=false

#
# User data isolation
#
file_hide_data_directory=true
#file_path_filtering_enabled=true
#file_path_filter_include=[]

#
# Client-Server Communication
enable_https=true
ssl_key_file="<<FILL ME>>"
ssl_crt_file="<<FILL ME>>"
# Disable support of TLSv1.2 on server side only if your environment supports TLSv1.3
#ssl_no_tlsv1_2=true

#
# Web UI security
#
allow_form_autocomplete=false
allow_localstorage=false

extra_http_headers={ "Strict-Transport-Security" = "max-age=63072000", "Content-Security-Policy" = "default-src https: ; font-src 'self'; script-src 'self' 'unsafe-eval' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; object-src 'none'", "X-Frame-Options" = "deny", "X-Content-Type-Options" = "nosniff", "X-XSS-Protection" = "1; mode=block" }

#
# Custom Recipes
#
enable_custom_recipes=false
enable_custom_recipes_upload=false
include_custom_recipes_by_default=false



CHAPTER

FORTYFIVE

FAQ

H2O Driverless AI is an artificial intelligence (AI) platform for automatic machine learning. Driverless AI automates
some of the most difficult data science and machine learning workflows such as feature engineering, model validation,
model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy, comparable to
expert data scientists, but in a much shorter time thanks to end-to-end automation. Driverless AI also offers automatic
visualizations and machine learning interpretability (MLI). Especially in regulated industries, model transparency and
explanation are just as important as predictive performance. Modeling pipelines (feature engineering and models) are
exported (in full fidelity, without approximations) both as Python modules and as Java standalone scoring artifacts.
This section provides answers to frequently asked questions. If you have additional questions about using Driverless
AI, post them on Stack Overflow using the driverless-ai tag at http://stackoverflow.com/questions/tagged/driverless-ai.
General
• How is Driverless AI different than any other black box ML algorithm?
• How often do new versions come out?
Installation/Upgrade/Authentication
• How can I change my username and password?
• Can Driverless AI run on CPU-only machines?
• How can I upgrade to a newer version of Driverless AI?
• What kind of authentication is supported in Driverless AI?
• How can I automatically turn on persistence each time the GPU system reboots?
• How can I start Driverless AI on a different port than 12345?
• Can I set up TLS/SSL on Driverless AI?
• Can I set up TLS/SSL on Driverless AI in AWS?
• Why do I receive a “package dai-<version>.x86_64 does not verify: no digest” error during the installation?
• I received a “Must have exactly one OpenCL platform ‘NVIDIA CUDA’” error. How can I fix that?
• Is it possible for multiple users to share a single Driverless AI instance?
• Can multiple Driverless AI users share a GPU server?
• How can I retrieve a list of Driverless AI users?
• Start of Driverless AI fails on the message “Segmentation fault (core dumped)” on Ubuntu 18/RHEL 7.6. How
can I fix this?
• Why do I receive a “shared object file” error when running on RHEL 8?
Data


• Is there a file size limit for datasets?


Connectors
• Why can’t I import a folder as a file when using a data connector on Windows?
• I get a ClassNotFoundException error when I try to select a JDBC connection. How can I fix that?
Recipes
• Where can I retrieve H2O’s custom recipes?
• How can I create my own custom recipe?
• Are MOJOs supported for experiments that use custom recipes?
Experiments
• How much memory does Driverless AI require in order to run experiments?
• How many columns can Driverless AI handle?
• How should I use Driverless AI if I have large data?
• How does Driverless AI detect the ID column?
• Can Driverless AI handle data with missing values/nulls?
• How does Driverless AI deal with categorical variables? What if an integer column should really be treated as
categorical?
• How are outliers handled?
• If I drop several columns from the Train dataset, will Driverless AI understand that it needs to drop the same
columns from the Test dataset?
• Does Driverless AI treat numeric variables as categorical variables?
• Which algorithms are used in Driverless AI?
• Why do my selected algorithms not show up in the Experiment Preview?
• How can we turn on TensorFlow Neural Networks so they are evaluated?
• Does Driverless AI standardize the data?
• What objective function is used in XGBoost?
• Does Driverless AI perform internal or external validation?
• How does Driverless AI prevent overfitting?
• How does Driverless AI avoid the multiple hypothesis (MH) problem?
• How does Driverless AI suggest the experiment settings?
• What happens when I set Interpretability and Accuracy to the same number?
• Can I specify the number of GPUs to use when running Driverless AI?
• How can I create the simplest model in Driverless AI?
• Why is my experiment suddenly slow?
• When I run multiple experiments with different seeds, why do I see different scores, runtimes, and sizes on disk
in the Experiments listing page?
• Why does the final model performance appear to be worse than previous iterations?
• How can I find features that may be causing data leakages in my Driverless AI model?


• How can I see the performance metrics on the test data?


• How can I see all the performance metrics possible for my experiment?
• What if my training/validation and testing data sets come from different distributions?
• Does Driverless AI handle weighted data?
• How does Driverless AI handle fold assignments for weighted data?
• Why do I see that adding new features to a dataset deteriorates the performance of the model?
• How does Driverless AI handle imbalanced data for binary classification experiments?
Feature Transformations
• Where can I get details of the various transformations performed in an experiment?
Predictions
• How can I download the predictions onto the machine where Driverless AI is running?
• Why are predicted probabilities not available when I run an experiment without ensembling?
Deployment
• What drives the size of a MOJO?
• Running the scoring pipeline for my MOJO is taking several hours. How can I get this to run faster?
• Why have I encountered a “Best Score is not finite” error?
Time Series
• What if my data has a time dependency?
• What is a lag, and why does it help?
• Why can’t I specify a validation data set for time-series problems? Why do you look at the test set for time-series
problems?
• Why does the gap between train and test matter? Is it because of creating the lag features on the test set?
• In regards to applying the target lags to different subsets of the time group columns, are you saying Driverless AI
performs auto-correlation at “levels” of the time series? For example, consider the Walmart dataset where I have
Store and Dept (and my target is Weekly Sales). Are you saying that Driverless AI checks for auto-correlation
in Weekly Sales based on just Store, just Dept, and both Store and Dept?
• How does Driverless AI detect the time period?
• What is the logic behind the selectable numbers for forecast horizon length?
• Assume that in my Walmart dataset, all stores provided data at the week level, but one store provided data at the
day level. What would Driverless AI do?
• Assume that in my Walmart dataset, all stores and departments provided data at the weekly level, but one
department in a specific store provided weekly sales on a bi-weekly basis (every two weeks). What would
Driverless AI do?
• Why does the number of weeks that you want to start predicting matter?
• Are the scoring components of time series sensitive to the order in which new pieces of data arrive? I.e., is each
row independent at scoring time, or is there a real-time windowing effect in the scoring pieces?
• What happens if the user, at predict time, gives a row with a time value that is too small or too large?
• What’s the minimum data size for a time series recipe?
• How long must the training data be compared to the test data?


• How does the time series recipe deal with missing values?
• Can the time information be distributed across multiple columns in the input data (such as [year, day, month])?
• What type of modeling approach does Driverless AI use for time series?
• What’s the idea behind exponential weighting of moving averages?
Logging
• How can I reduce the size of the Audit Logger?

45.1 General

How is Driverless AI different than any other black box ML algorithm?


Driverless AI uses many techniques (some older and some cutting-edge) for interpreting black box models
including creating reason codes for every prediction the system makes. We have also created numerous
open source code examples and free publications that explain these techniques. See the list below for
links to these resources and for references for the interpretability techniques.
• Open source interpretability examples:
• https://github.com/jphall663/interpretable_machine_learning_with_python
• https://content.oreilly.com/oriole/Interpretable-machine-learning-with-Python-XGBoost-and-H2O
• https://github.com/h2oai/mli-resources
• Free Machine Learning Interpretability publications:
• http://www.oreilly.com/data/free/an-introduction-to-machine-learning-interpretability.csp
• http://docs.h2o.ai/driverless-ai/latest-stable/docs/booklets/MLIBooklet.pdf
• Machine Learning Techniques already in Driverless AI:
• Tree-based Variable Importance: https://web.stanford.edu/~hastie/ElemStatLearn/
printings/ESLII_print12.pdf
• Partial Dependence: https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_
print12.pdf
• LIME: http://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf
• LOCO: http://www.stat.cmu.edu/~ryantibs/papers/conformal.pdf
• ICE: https://arxiv.org/pdf/1309.6392.pdf
• Surrogate Models:
• https://papers.nips.cc/paper/1152-extracting-tree-structured-representations-of-trained-networks.
pdf
• https://arxiv.org/pdf/1705.08504.pdf
• Shapley Explanations: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions
How often do new versions come out?
The frequency of major new Driverless AI releases has historically been about every two months.


45.2 Installation/Upgrade/Authentication

How can I change my username and password?


The username and password are tied to the experiments you have created. For example, if I log in with
the username/password: megan/megan and start an experiment, then I would need to log back in with the
same username and password to see those experiments. The username and password, however, do not
limit your access to Driverless AI. If you want to use a new username and password, you can log in again
with a new username and password, but keep in mind that you won’t see your old experiments.
Can Driverless AI run on CPU-only machines?
Yes, Driverless AI can run on machines with CPUs only, though GPUs are recommended. Installation
instructions are available for GPU and CPU systems. Refer to Installing and Upgrading Driverless AI for
more information.
How can I upgrade to a newer version of Driverless AI?
Upgrade instructions vary depending on your environment. Refer to the installation section for your
environment. Upgrade instructions are included there.
What kind of authentication is supported in Driverless AI?
Driverless AI supports Client Certificate, LDAP, Local, mTLS, OpenID, none, and unvalidated (default)
authentication. These can be configured by setting the appropriate environment variables in the con-
fig.toml file or by specifying the environment variables when starting Driverless AI. Refer to Configuring
Authentication for more information.
How can I automatically turn on persistence each time the GPU system reboots?
For GPU machines, the sudo nvidia-persistenced --user dai command can be run af-
ter each reboot to enable persistence. For systems that have systemd, it is possible to automatically
enable persistence after each reboot by removing the --no-persistence-mode flag from nvidia-
persistenced.service. Before running the steps below, be sure to review the following for more informa-
tion:
• https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon
• https://docs.nvidia.com/deploy/driver-persistence/index.html#installation
1. Run the following to stop the nvidia-persistenced.service:
sudo systemctl stop nvidia-persistenced.service

2. Open the file /lib/systemd/system/nvidia-persistenced.service. This file includes a line


“ExecStart=/usr/bin/nvidia-persistenced –user nvidia-persistenced –no-persistence-mode –verbose”.
3. Remove the flag --no-persistence-mode from that line so that it reads:
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose

4. Run the following command to start the nvidia-persistenced.service:


sudo systemctl start nvidia-persistenced.service

How can I start Driverless AI on a different port than 12345?


Docker Installs: When starting Driverless AI in Docker, the -p option specifies the port on which Driver-
less AI will run. Change this option in the start script if you need to run on a port other than 12345. For
example, to run on port 443, use the following (change nvidia-docker run to docker run if
needed):


docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p 443:12345 \
-v `pwd`/data:/data \
-v `pwd`/log:/log \
-v `pwd`/license:/license \
-v `pwd`/tmp:/tmp \
h2oai/dai-centos7-x86_64:TAG

Native Installs: To run on a port other than 12345, update the port value in the config.toml file. For
example, edit the following to run Driverless AI on port 443:
# Export the Driverless AI config.toml file (or add it to ~/.bashrc)
export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"

# IP address and port for Driverless AI HTTP server.


ip = "127.0.0.1"
port = 443

Point to this updated config file when restarting Driverless AI.


Can I set up TLS/SSL on Driverless AI?
Yes, Driverless AI provides configuration options that allow you to set up HTTPS/TLS/SSL. You will
need to have your own SSL certificate, or you can create a self-signed certificate for yourself.
To enable HTTPS/TLS/SSL on the Driverless AI server, add the following to the config.toml file:
enable_https = true
ssl_key_file = "/etc/dai/private_key.pem"
ssl_crt_file = "/etc/dai/cert.pem"

You can make a self-signed certificate for testing with the following commands:
umask 077
openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 20 -nodes -subj '/O=Driverless AI'
sudo chown dai:dai cert.pem private_key.pem
sudo mv cert.pem private_key.pem /etc/dai

To configure specific versions of TLS/SSL, enable or disable the following settings in the config.toml file:
ssl_no_sslv2 = true
ssl_no_sslv3 = true
ssl_no_tlsv1 = true
ssl_no_tlsv1_1 = true
ssl_no_tlsv1_2 = false
ssl_no_tlsv1_3 = false

Can I set up TLS/SSL on Driverless AI in AWS?


Yes, you can set up HTTPS/TLS/SSL on Driverless AI running in an AWS environment.
HTTPS/TLS/SSL needs to be configured on the host machine, and the necessary ports will need to be
opened on the AWS side. You will need to have your own TLS/SSL cert or you can create a self signed
cert for yourself.
The following is a very simple example showing how to configure HTTPS with a proxy pass to the port on
the container 12345 with the keys placed in /etc/nginx/. Replace <server_name> with your server name.
server {
listen 80;
return 301 https://$host$request_uri;
}

server {
listen 443;

# Specify your server name here


server_name <server_name>;

ssl_certificate /etc/nginx/cert.crt;
ssl_certificate_key /etc/nginx/cert.key;
ssl on;
ssl_session_cache builtin:1000 shared:SSL:10m;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
ssl_prefer_server_ciphers on;

access_log /var/log/nginx/dai.access.log;

location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;

# Fix the "It appears that your reverse proxy set up is broken" error.
proxy_pass http://localhost:12345;
proxy_read_timeout 90;

# Specify your server name for the redirect


proxy_redirect http://localhost:12345 https://<server_name>;
}
}

More information about SSL for Nginx in Ubuntu 16.04 can be found here: https://www.digitalocean.
com/community/tutorials/how-to-create-a-self-signed-ssl-certificate-for-nginx-in-ubuntu-16-04.
I received a “package dai-<version>.x86_64 does not verify: no digest” error during the installation. How can
I fix this?
You will receive a “package dai-<version>.x86_64 does not verify: no digest” error when installing the
rpm using an RPM version newer than 4.11.3. You can run the following as a workaround, replacing
<version> with your DAI version:
rpm --nodigest -i dai-<version>.x86_64.rpm

I received a “Must have exactly one OpenCL platform ‘NVIDIA CUDA’” error. How can I fix that?
If you encounter problems with opencl errors at server time, you may see the following message:
2018-11-08 14:26:15,341 C: D:452.2GB M:246.0GB 21603 ERROR : Must have exactly one OpenCL platform 'NVIDIA CUDA', but got:
Platform #0: Clover
Platform #1: NVIDIA CUDA
+-- Device #0: GeForce GTX 1080 Ti
+-- Device #1: GeForce GTX 1080 Ti
+-- Device #2: GeForce GTX 1080 Ti

Uninstall all but 'NVIDIA CUDA' platform.

For Ubuntu, the solution is to run the following:


sudo apt-get remove mesa-opencl-icd

Is it possible for multiple users to share a single Driverless AI instance?


Driverless AI supports multiple users, and Driverless AI is licensed per a single named user. Therefore, in
order to have different users run experiments simultaneously, they would each need a license. Driverless
AI manages the GPU(s) that it is given and ensures that different experiments from different users can
run safely simultaneously and don’t interfere with each other. So when two licensed users log in with
different credentials, then neither of them will see the other’s experiment. Similarly, if a licensed user
logs in using a different set of credentials, then that user will not see any previously run experiments.
Can multiple Driverless AI users share a GPU server?
Yes, you can allocate multiple users in a single GPU box. For example, a single box with four GPUs can
be set up so that User1 has two GPUs and User2 has the other two GPUs. This is accomplished by running
two separate Driverless AI instances on the same server.
There are two ways to assign specific GPUs to Driverless AI. In the scenario with four GPUs (two
GPUs allocated to two users), both of these options allow each Docker container to see only two GPUs.
• Use the CUDA_VISIBLE_DEVICES environment variable. In the case of Docker deployment, this
translates into passing -e CUDA_VISIBLE_DEVICES="0,1" to the nvidia-docker run command.
• Pass the NV_GPU option at the beginning of the nvidia-docker run command. (See the example
below.)


#Team 1
NV_GPU='0,1' nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p port-to-team1:12345 \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v /data:/data \
-v /log:/log \
-v /license:/license \
-v /tmp:/tmp \
-v /config:/config \
h2oai/dai-centos7-x86_64:TAG

#Team 2
NV_GPU='2,3' nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-u `id -u`:`id -g` \
-p port-to-team2:12345 \
-e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
-v /data:/data \
-v /log:/log \
-v /license:/license \
-v /tmp:/tmp \
-v /config:/config \
h2oai/dai-centos7-x86_64:TAG

Note, however, that a Driverless AI instance expects to fully utilize and not share the GPUs that are
assigned to it. Sharing a GPU with other Driverless AI instances or other running programs can result in
out-of-memory issues.
How can I retrieve a list of Driverless AI users?
A list of users can be retrieved using the Python client (assuming the h2oai_client package that ships with Driverless AI):
from h2oai_client import Client

h2o = Client(address='http://<client_url>:12345', username='<username>', password='<password>')
h2o.get_users()

Start of Driverless AI fails on the message “Segmentation fault (core dumped)” on Ubuntu 18/RHEL 7.6. How
can I fix this?
This problem is caused by the font NotoColorEmoji.ttf, which cannot be processed by the Python
matplotlib library. A workaround is to disable the font by renaming it. (Do not use fontconfig because it
is ignored by matplotlib.) The following will print out the command that should be executed.
sudo find / -name "NotoColorEmoji.ttf" 2>/dev/null | xargs -I{} echo sudo mv {} {}.backup

Why do I receive a “shared object file” error when running on RHEL 8?


RHEL 8 is not currently supported. Supported Linux systems include x86_64 RHEL 7, CentOS 7, and
SLES 12.

45.3 Data

Is there a file size limit for datasets?


For GBMs, the file size for datasets is limited by the collective CPU or GPU memory on the system, but
we continue to make optimizations for getting more data into an experiment, such as using TensorFlow
streaming to stream arbitrarily large datasets.


45.4 Connectors

Why can’t I import a folder as a file when using a data connector on Windows?
If you try to use the Import Folder as File option via a data connector on Windows, the import will fail if
the folder contains files that do not have file extensions. For example, if a folder contains the files file1.csv,
file2.csv, file3.csv, and _SUCCESS, the function will fail due to the presence of the _SUCCESS file.
Note that this only occurs if the data is sourced from a volume that is mounted from the Windows filesys-
tem onto the Docker container via -v /path/to/windows/filesystem:/path/in/docker/
container flags. This error occurs because of the difference in how files without file extensions are
treated in Windows and in the Docker container (CentOS Linux).
I get a ClassNotFoundException error when I try to select a JDBC connection. How can I fix that?
The folder storing the JDBC jar file must be visible/readable by the dai process user.
If you downloaded the JDBC jar file from Oracle, they may provide you with a tar.gz file that you can
unpackage with the following command:
tar --no-same-permissions --no-same-owner -xzvf <my-jdbc-driver>.tar.gz

Alternatively you can ensure that the permissions on the file are correct in general by running the follow-
ing:
chmod -R o+rx /path/to/folder_containing_jar_file

Finally, if you just want to check the permissions use the command ls -altr and check the final 3
values in the permissions output.

45.5 Recipes

Where can I retrieve H2O’s custom recipes?


H2O’s custom recipes can be obtained from the official Recipes for Driverless AI repo.
How can I create my own custom recipe?
Refer to the How to Write a Recipe guide for details on how to create your own custom recipe.
Are MOJOs supported for experiments that use custom recipes?
In most cases, MOJOs will not be available for custom recipes. Unless the recipe is simple, creating
the MOJO is only possible with additional MOJO runtime support. Contact [email protected] for more
information about creating MOJOs for custom recipes. (Note: The Python Scoring Pipeline features full
support for custom recipes.)

45.6 Experiments

How much memory does Driverless AI require in order to run experiments?


Right now, Driverless AI requires approximately 10x the size of the data in system memory.
How many columns can Driverless AI handle?
Driverless AI has been tested on datasets with 10k columns. When running experiments on wide data,
Driverless AI automatically checks if it is running out of memory, and if it is, it reduces the number of


features until it can fit in memory. This may lead to a worse model, but Driverless AI shouldn’t crash
because the data is wide.
How should I use Driverless AI if I have large data?
Driverless AI can handle large datasets out of the box. For very large datasets (more than 10 billion rows
x columns), we recommend sampling your data for Driverless AI. Keep in mind that the goal of Driverless
AI is to go through many features and models to find the best modeling pipeline, and not to just train a
few models on the raw data (H2O-3 is ideally suited for that case).
For large datasets, the recommended steps are:
1. Run with the recommended accuracy/time/interpretability settings first, especially accuracy <= 7
2. Gradually increase accuracy settings to 7 and choose accuracy 9 or 10 only after observing runs with
<= 7.
How does Driverless AI detect the ID column?
The ID column logic is one of the following:
• The column is named ‘id’, ‘Id’, ‘ID’ or ‘iD’ exactly
• The column contains a significant number of unique values (above
max_relative_cardinality in the config.toml file or Max. allowed fraction of uniques
for integer and categorical cols in Expert settings)
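As a rough illustration of this heuristic, here is a short pandas sketch; it is not Driverless AI's internal code, and the helper name and threshold value are invented for the example:
import pandas as pd

def looks_like_id(col: pd.Series, max_relative_cardinality: float = 0.95) -> bool:
    # Rule 1: the column is named 'id', 'Id', 'ID' or 'iD' exactly.
    if col.name in ("id", "Id", "ID", "iD"):
        return True
    # Rule 2: the fraction of unique values exceeds the configured threshold.
    return col.nunique() / max(len(col), 1) > max_relative_cardinality

df = pd.DataFrame({"ID": range(1000), "city": ["NYC", "SF"] * 500})
print(looks_like_id(df["ID"]))    # True: name match (and high cardinality)
print(looks_like_id(df["city"]))  # False: only 2 unique values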
Can Driverless AI handle data with missing values/nulls?
Yes, data that is imported into Driverless AI can include missing values. Feature engineering is fully
aware of missing values, and missing values are treated as information - either as a special categorical
level or as a special number. So for target encoding, for example, rows with a certain missing feature
will belong to the same group. For Categorical Encoding where aggregations of a numeric columns are
calculated for a grouped categorical column, missing values are kept. The formula for calculating the
mean is the sum of non-missing values divided by the count of all non-missing values. For clustering,
we impute missing values. And for frequency encoding, we count the number of rows that have a certain
missing feature.
The imputation strategy is as follows:
• XGBoost/LightGBM do not need missing value imputation and may, in fact, perform worse with
any other specific strategy unless the user has a strong understanding of the data.
• Driverless AI automatically imputes missing values using the mean for GLM.
• Driverless AI provides an imputation setting for TensorFlow in the config.toml file:
tf_nan_impute_value post-normalization. If you set this option to 0, then missing
values will be imputed. Setting it to (for example) +5 will specify 5 standard deviations outside the
distribution. The default for TensorFlow is -5, which specifies that TensorFlow will treat NAs like a
missing value. We recommend that you specify 0 if the mean is better.
More information is available in the Missing Values Handling section.
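As a small illustration of the grouped-mean arithmetic described above (sum of non-missing values divided by the count of non-missing values), consider this pandas sketch; it is not Driverless AI code, and the column names are invented:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF", None],    # grouped categorical column (with a missing level)
    "income": [50.0, np.nan, 80.0, 60.0, 40.0],  # numeric column with a missing value
})

# mean() skips NaN: sum of non-missing values / count of non-missing values.
# dropna=False keeps the missing category as its own group.
print(df.groupby("city", dropna=False)["income"].mean())
# NYC -> 50.0 (the NaN income is excluded from both sum and count)
# SF  -> 70.0
# NaN -> 40.0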
How does Driverless AI deal with categorical variables? What if an integer column should really be treated as
categorical?
If a column has string values, then Driverless AI will treat it as a categorical feature. There are multiple
methods for how Driverless AI converts the categorical variables to numeric. These include:
• One Hot Encoding: creating dummy variables for each value
• Frequency Encoding: replace category with how frequently it is seen in the data


• Target Encoding: replace category with the average target value (additional steps included to prevent
overfitting)
• Weight of Evidence: calculate weight of evidence for each category (http://ucanalytics.com/blogs/
information-value-and-weight-of-evidencebanking-case/)
Driverless AI will try multiple methods for representing the column and determine which representation(s)
are best.
If the column has integers, Driverless AI will try treating the column as a categorical column and numeric
column. It will treat any integer column as both categorical and numeric if the number of unique values
is less than 50.
This is configurable in the config.toml file:
# Whether to treat some numerical features as categorical
# For instance, sometimes an integer column may not represent a numerical feature but
# represents different numerical codes instead.
num_as_cat = true

# Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques = 50

(Note: Driverless AI will also check if the distribution of any numeric column differs significantly from
the distribution of typical numerical data using Benford’s Law. If the column distribution does not obey
Benford’s Law, we will also try to treat it as categorical even if there are more than 50 unique values.)
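For intuition only, here is a minimal pandas sketch of two of the simpler encodings listed above (frequency and one-hot); Driverless AI's actual transformers add safeguards such as out-of-fold averaging for target encoding, which is sketched later in the overfitting question:
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red"]})

# Frequency encoding: replace each category with how often it appears in the data.
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

# One-hot encoding: one dummy indicator column per category value.
dummies = pd.get_dummies(df["color"], prefix="color")
print(pd.concat([df, dummies], axis=1))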
How are outliers handled?
Outliers are not removed from the data. Instead Driverless AI finds the best way to represent data with
outliers. For example, Driverless AI may find that binning a variable with outliers improves performance.
For target columns, Driverless AI first determines the best representation of the column. It may find that
for a target column with outliers, it is best to predict the log of the column.
If I drop several columns from the Train dataset, will Driverless AI understand that it needs to drop the same
columns from the Test dataset?
If you drop columns from the training dataset, Driverless AI will do the same for the validation and test
datasets (if the columns are present). There is no need for these columns because no features will be
created from them.
Does Driverless AI treat numeric variables as categorical variables?
In certain cases, yes. You can prevent this behavior by setting the num_as_cat variable in your instal-
lation’s config.toml file to false. You can have finer grain control over this behavior by excluding
the Numeric to Categorical Target Encoding Transformer and the Numeric To
Categorical Weight of Evidence Transformer and their corresponding genes in your in-
stallation’s config.toml file. To learn more about the config.toml file, see the Using the config.toml File
section.
Which algorithms are used in Driverless AI?
Features are engineered with a proprietary stack of Kaggle-winning statistical approaches including some
of the most sophisticated target encoding and likelihood estimates based on groupings, aggregations and
joins, but we also employ linear models, neural nets, clustering and dimensionality reduction models and
many traditional approaches such as one-hot encoding etc.
On top of the engineered features, sophisticated models are fitted, including, but not limited to: XG-
Boost (both original XGBoost and ‘lossguide’ (LightGBM) mode), Decision Trees, GLM, TensorFlow
(including a TensorFlow NLP recipe based on CNN Deeplearning models), RuleFit, FTRL (Follow the
Regularized Leader), Isolation Forest, and Constant Models. (Refer to Supported Algorithms for more
information.) And additional algorithms can be added via recipes.


In general, GBMs are the best single-shot algorithms. Since 2006, boosting methods have proven to
be the most accurate for noisy predictive modeling tasks outside of pattern recognition in images and
sound (https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf). The advent of XGBoost
and Kaggle only cemented this position.
Why do my selected algorithms not show up in the Experiment Preview?
When changing the algorithms used via Expert Settings > Model and Expert Settings > Recipes, you may notice
in the Experiment Preview that those changes are not applied. Driverless AI determines whether to include models
and/or recipes based on a hierarchy of those expert settings.

• Setting an Algorithm to “OFF” in Expert Settings: If an algorithm is turned OFF in Expert Settings (for example,
GLM Models) when running, then that algorithm will not be included in the experiment.
• Algorithms Not Included from Recipes (BYOR): If an algorithm from a custom recipe is not selected for the
experiment in the Include specific models option, then that algorithm will not be included in the experiment,
regardless of whether that same algorithm is set to AUTO or ON on the Expert Settings > Model page.
• Algorithms Not Specified as “OFF” and Included from Recipes: If a Driverless AI algorithm is specified as
either “AUTO” or “ON” and additional models are selected for the experiment in the Include specific models
option, then those algorithms may or may not be included in the experiment. Driverless AI will determine the
algorithms to use based on the data and experiment type.
How can we turn on TensorFlow Neural Networks so they are evaluated?
Neural networks are considered by Driverless AI, although they may not be evaluated by default. To
ensure that neural networks are tried, you can turn on TensorFlow in the Expert Settings.


Once you have set TensorFlow to ON, you should see the Experiment Preview on the left-hand side
change and mention that it will evaluate TensorFlow models.


We recommend using TensorFlow neural networks if you have a multinomial use case with more than 5
unique values.
Does Driverless AI standardize the data?
Driverless AI will automatically do variable standardization for certain algorithms. For example, with
Linear Models and Neural Networks, the data is automatically standardized. For decision tree algorithms,
however, we do not perform standardization since these algorithms do not benefit from standardization.
What objective function is used in XGBoost?
The objective function used in XGBoost is:
• reg:linear and custom absolute error objective function for regression
• binary:logistic or multi:softprob for classification
The objective function does not change depending on the scorer chosen. The scorer influences parameter
tuning only.
For regression, Tweedie/Gamma/Poisson/etc. regression is not yet supported, but Driverless AI handles
various target transforms so many target distributions can be handled very effectively already. Driverless
AI handles quantile regression for alpha=0.5 (median), and general quantiles are on the roadmap.
Further details for the XGBoost instantiations can be found in the logs and in the model summary, both
of which can be downloaded from the GUI or are found in the /tmp/h2oai_experiment_<name>/ folder
on the server.
Does Driverless AI perform internal or external validation?
Driverless AI does internal validation when only training data is provided. It does external validation
when training and validation data are provided. In either scenario, the validation data is used for all
parameter tuning (models and features), not just for feature selection. Parameter tuning includes target
transformation, model selection, feature engineering, feature selection, stacking, etc.
Specifically:
• Internal validation (only training data given):
– Ideal when data is either close to i.i.d., or for time-series problems
– Internal holdouts are used for parameter tuning, with temporal causality for time-series prob-
lems
– Will do the full spectrum from single holdout split to 5-fold CV, depending on accuracy settings
– No need to split training data manually
– Final models are trained using CV on the training data
• External validation (training + validation data given):
– Ideal when there’s some amount of drift in the data, and the validation set mimics the test set
data better than the training data
– No training data wasted during training because training data not used for parameter tuning
– Validation data is used only for parameter tuning, and is not part of training data
– No CV possible because we explicitly do not want to overfit on the training data
– Not allowed for time-series problems (see Time Series FAQ section that follows)


Tip: If you want both training and validation data to be used for parameter tuning (the training process),
just concatenate the datasets together and turn them both into training data for the “internal validation”
method.
How does Driverless AI prevent overfitting?
Driverless AI performs a number of checks to prevent overfitting. For example, during certain transfor-
mations, Driverless AI calculates the average on out-of-fold data using cross validation. Driverless AI
also performs early stopping for every model built, ensuring that the model build will stop when it ceases
to improve on holdout data. And additional steps to prevent overfitting include checking for i.i.d. and
avoiding leakage during feature engineering.
A blog post describing Driverless AI overfitting protection in greater detail is available here: https://www.
h2o.ai/blog/driverless-ai-prevents-overfitting-leakage/.
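The out-of-fold averaging mentioned above can be sketched as follows; this is a generic out-of-fold target-encoding pattern for illustration, not Driverless AI's exact implementation:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "cat":    ["a", "a", "b", "b", "a", "b", "a", "b"],
    "target": [1, 0, 1, 1, 0, 1, 1, 0],
})

df["cat_te"] = np.nan
for train_idx, valid_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Per-category target mean computed on the training folds only...
    means = df.iloc[train_idx].groupby("cat")["target"].mean()
    # ...and applied to the held-out fold, so no row ever sees its own target.
    df.loc[df.index[valid_idx], "cat_te"] = df["cat"].iloc[valid_idx].map(means)

print(df)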
How does Driverless AI avoid the multiple hypothesis (MH) problem?
Or more specifically, like many brute force methods for tuning hyperparameters/model selection, Driver-
less AI runs up against the multihypothesis problem (MH). For example, if I randomly generated a gazil-
lion models, the odds that a few will do awesome on the test data that they are all measured against is
pretty large, simply by sheer luck. How does Driverless AI address this?
Driverless AI uses a variant of the reusable holdout technique to address the multiple hypothesis problem.
Refer to https://pdfs.semanticscholar.org/25fe/96591144f4af3d8f8f79c95b37f415e5bb75.pdf for more in-
formation.
How does Driverless AI suggest the experiment settings?
When you run an experiment on a dataset, the experiment settings (Accuracy, Time, and Interpretability)
are automatically suggested by Driverless AI. For example, Driverless AI may suggest the parameters
Accuracy = 7, Time = 3, Interpretability = 6, based on your data.
Driverless AI will automatically suggest experiment settings based on the number of columns and number
of rows in your dataset. The settings are suggested to ensure best handling when the data is small. If the
data is small, Driverless AI will suggest the settings that prevent overfitting and ensure the full dataset is
utilized.
If the number of rows and number of columns are each below a certain threshold, then:
• Accuracy will be increased up to 8.
– The accuracy is increased so that cross validation is done. (We don’t want to “throw away” any
data for internal validation purposes.)
• Interpretability will be increased up to 8.
– The higher the interpretability setting, the smaller the number of features in the final model.
– More complex features are not allowed.
– This prevents overfitting.
• Time will be decreased down to 2.
– There will be fewer feature engineering iterations to prevent overfitting.
What happens when I set Interpretability and Accuracy to the same number?
The answer is currently that interpretability controls which features are created and what features are
kept. (Also above interpretability = 6, monotonicity constraints are used in XGBoost GBM, XGBoost
Dart, LightGBM, and LightGBM Random Forest models.) The accuracy refers to how hard Driverless AI
then tries to make those features into the most accurate model.
Can I specify the number of GPUs to use when running Driverless AI?


When running an experiment, the Expert Settings allow you to specify the starting GPU ID for Driverless
AI to use. You can also specify the maximum number of GPUs to use per model and per experiment.
Refer to the Expert Settings section for more information.
How can I create the simplest model in Driverless AI?
To create the simplest model in Driverless AI, set the following Experiment Settings:
• Set Accuracy to 1. Note that this can hurt performance as a sample will be used. If necessary, adjust
the knob until the preview shows no sampling.
• Set Time to 1.
• Set Interpretability to 10.
Next, configure the following Expert Settings:
• Turn OFF all algorithms except GLM.
• Set GLM models to ON.
• Set Ensemble level to 0.
• Set Select target transformation of the target for regression problems to Identity.
• Disable Data distribution shift detection.
• Disable Target Encoding.
Alternatively, you can set Pipeline Building Recipe to Compliant. Compliant automatically configures
the following experiment and expert settings:
• interpretability=10 (To avoid complexity. This overrides GUI or Python client settings for Inter-
pretability.)
• enable_glm=’on’ (Remaining algorithms are ‘off’, to avoid complexity and be compatible with algorithms
supported by MLI.)
• num_as_cat=true: Treat some numerical features as categorical. For instance, sometimes an integer
column may not represent a numerical feature but represent different numerical codes instead.
• fixed_ensemble_level=0: Don’t use any ensemble (to avoid complexity).
• feature_brain_level=0: No feature brain used (to ensure every restart is identical).
• max_feature_interaction_depth=1: Interaction depth is set to 1 (no multi-feature interactions to
avoid complexity).
• target_transformer=”identity”: For regression (to avoid complexity).
• check_distribution_shift=”off”: Don’t use distribution shift between train, valid, and test to drop
features (bit risky without fine-tuning).
Why is my experiment suddenly slow?
It is possible that your experiment has gone from using GPUs to using CPUs due to a change of the host
system outside of Driverless AI’s control. You can verify this using any of the following methods:
• Check GPU usage by going to your Driverless AI experiment page and clicking on the GPU USAGE
tab in the lower-right quadrant of the experiment.
• Run nvidia-smi in a terminal to see if any processes are using GPU resources in an unexpected
way (such as those using a large amount of memory).
• Check if System/GPU memory is being consumed by prior jobs or other tasks or if older jobs are
still running some tasks.


• Check for and disable automatic NVIDIA driver updates on your system (as they can interfere with
running experiments).
The general solution to these kinds of sudden slowdown problems is to restart:
• Restart Docker if using Docker
• pkill --signal 9 h2oai if using the native installation method
• Restart the system if nvidia-smi does not work as expected (e.g., after a driver update)
More ML-related issues that can lead to a slow experiment are:
• Choosing high accuracy settings on a system with insufficient memory
• Choosing low interpretability settings (can lead to more feature engineering which can increase
memory usage)
• Using a dataset with a lot of columns (> 500)
• Doing multi-class classification with a GBM model when there are many target classes (> 5)
When I run multiple experiments with different seeds, why do I see different scores, runtimes, and sizes on disk
in the Experiments listing page?
When running multiple experiments with all of the same settings except the seed, understand that a feature
brain level > 0 can lead to variations in models, features, timing, and sizes on disk. (The default value is
2.) These variations can be disabled by setting the Feature Brain Level to 0 in the Expert Settings or in
the config.toml file.
In addition, if you use a different seed for each experiment, then each experiment can be different due to
the randomness in the genetic algorithm that searches for the best features and model parameters. Only
if Reproducible is set with the same seed and with a feature brain level of 0 should users expect the
same outcome. Once a different seed is set, the models, features, timing, and sizes on disk can all vary
within the constraints set by the choices made for the experiment. (I.e., accuracy, time, interpretability,
expert settings, etc., all constrain the outcome, and then a different seed can change things within those
constraints.)
Why does the final model performance appear to be worse than previous iterations?
There are a few things to remember:
• Driverless AI creates a best effort estimate of the generalization performance of the best modeling
pipeline found so far
• The performance estimation is always based on holdout data (data unseen by the model).
• If no validation dataset is provided, the training data is split internally to create internal validation
holdout data (once or multiple times or cross-validation, depending on the accuracy settings).
• If no validation dataset is provided, for accuracy <= 7, a single holdout split is used, and a “lucky”
or “unlucky” split can bias estimates for small datasets or datasets with high variance.
• If a validation dataset is provided, then all performance estimates are solely based on the entire
validation dataset (independent of accuracy settings).
• All scores reported are based on bootstrap-based statistical methods and come with error bars
that represent a range of estimate uncertainty.
After the final iteration, a best final model is trained on a final set of engineered features. Depending on
accuracy settings, a more accurate estimation of generalization performance may be done using cross-
validation. Also, the final model may be a stacked ensemble consisting of multiple base models, which
generally leads to better performance. Consequently, in rare cases, the difference in performance estima-
tion method can lead to the final model’s estimated performance seeming poorer than those from previous


iterations. (i.e., The final model’s estimated score is significantly worse than the last iteration score and
error bars don’t overlap.) In that case, it is very likely that the final model performance estimation is more
accurate, and the prior estimates were biased due to a “lucky” split. To confirm this, you can re-run the
experiment multiple times (without setting the reproducible flag).
If you would like to minimize the likelihood of the final model performance appearing worse than previous
iterations, here are some recommendations:
• Increase accuracy settings
• Provide a validation dataset
• Provide more data
How can I find features that may be causing data leakages in my Driverless AI model?
To find original features that are causing leakage, have a look at features_orig.txt in the experiment sum-
mary download. Features causing leakage will have high importance there. To get a hint at derived
features that might be causing leakage, create a new experiment with dials set to 2/2/8, and run the new
experiment on your data with all your features and response. Then analyze the top 1-2 features in the
model variable importance. They are likely the main contributors to data leakage if it is occurring.
How can I see the performance metrics on the test data?
As long as you provide a target column in the test set, Driverless AI will show the best estimate of the
final model’s performance on the test set at the end of the experiment. The test set is never used to tune
parameters (unlike what Kagglers often do), so this is purely a convenience. Of course, you can still
make test set predictions and compute your own metrics using a method of your choice.
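For example, assuming a binary target and downloaded test-set probabilities (the file names below are placeholders), the metrics can be computed with scikit-learn:
import pandas as pd
from sklearn.metrics import log_loss, roc_auc_score

actual = pd.read_csv("test.csv")["target"]          # placeholder test set with a target column
preds = pd.read_csv("test_preds.csv").iloc[:, -1]   # assumed: last column holds P(positive class)

print("AUC     :", roc_auc_score(actual, preds))
print("LogLoss :", log_loss(actual, preds))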
How can I see all the performance metrics possible for my experiment?
At the end of the experiment, the model’s estimated performance on all provided datasets with a target
column is printed in the experiment logs. For example, for the test set:
Final scores on test (external holdout) +/- stddev:
GINI = 0.87794 +/- 0.035305 (more is better)
MCC = 0.71124 +/- 0.043232 (more is better)
F05 = 0.79175 +/- 0.04209 (more is better)
F1 = 0.75823 +/- 0.038675 (more is better)
F2 = 0.82752 +/- 0.03604 (more is better)
ACCURACY = 0.91513 +/- 0.011975 (more is better)
LOGLOSS = 0.28429 +/- 0.016682 (less is better)
AUCPR = 0.79074 +/- 0.046223 (more is better)
optimized: AUC = 0.93386 +/- 0.018856 (more is better)

What if my training/validation and testing data sets come from different distributions?
In general, Driverless AI uses training data to engineer features and train models and validation data to
tune all parameters. If no external validation data is given, the training data is used to create internal
holdouts. The way holdouts are created internally depends on whether there is a strong time dependence,
see the point below. If the data has no obvious time dependency (e.g., if there is no time column, either
implicit or explicit), or if the data can be sorted arbitrarily and it won’t affect the outcome (e.g., Iris data,
predicting flower species from measurements), and if the test dataset is different (e.g., new flowers or only
large flowers), then the model performance on validation (either internal or external) as measured during
training won’t be achieved during final testing due to the obvious inability of the model to generalize.
Does Driverless AI handle weighted data?
Yes. You can optionally provide an extra weight column in your training (and validation) data with non-
negative observation weights. This can be useful to implement domain-specific effects such as exponential
weighting in time or class weights. All of our algorithms and metrics in Driverless AI support observation
weights, but note that estimated likelihoods can be skewed as a consequence.
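A minimal sketch of one such use, adding an exponential time-decay weight column before importing the data (the half-life value is arbitrary and only for illustration):
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=5, freq="W"),
    "target": [0, 1, 0, 1, 1],
})

# Exponentially down-weight older rows: the weight halves every `half_life_days` days.
half_life_days = 30.0
age_days = (df["date"].max() - df["date"]).dt.days
df["weight"] = 0.5 ** (age_days / half_life_days)
print(df)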
How does Driverless AI handle fold assignments for weighted data?
Currently, Driverless AI does not take the weights into account during fold creation, but you can provide
a fold column to enforce your own grouping, i.e., to keep rows that belong to the same group together


(either in train or valid). The fold column has to be a categorical column (integers ok) that assigns a group
ID to each row. (It needs to have at least 5 groups because we do up to 5-fold CV.)
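Outside of Driverless AI, a rough analogue of this fold-column behavior is scikit-learn's GroupKFold; the sketch below is purely illustrative:
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({
    "x": range(10),
    "fold_col": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],  # group ID per row (at least 5 groups)
})

# Rows that share a fold_col value always land on the same side of a split.
for train_idx, valid_idx in GroupKFold(n_splits=5).split(df, groups=df["fold_col"]):
    print("validation groups:", sorted(df["fold_col"].iloc[valid_idx].unique()))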
Why do I see that adding new features to a dataset deteriorates the performance of the model?
You may notice that after adding one or more new features to a dataset, it deteriorates the performance of
the Driverless AI model. In Driverless AI, the feature engineering sequence is fairly random and may end
up not doing the same things with the original features if you restart entirely fresh with new columns.
Beginning in Driverless AI v1.4.0, you now have the option to Restart from Last Checkpoint. This
allows you to pull in a new dataset with more columns, and Driverless AI will more iteratively take
advantage of the new columns.
How does Driverless AI handle imbalanced data for binary classification experiments?
If you have data that is imbalanced, a binary imbalanced model can help to improve scoring with a variety
of imbalanced sampling methods. An imbalanced model is able to take advantage of most (or even all)
of the imbalanced dataset’s positive values during sampling, while a regular model significantly limits
the population of positive values. Imbalanced models, however, take more time to make predictions,
and they are not always more accurate than regular models. We still recommend that you try using an
imbalanced model if your data is imbalanced to see if scoring is improved over a regular model. Note that
this information only applies to binary models.
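For background, the simplest imbalanced-sampling idea, random undersampling of the majority class, looks like the sketch below; Driverless AI's imbalanced models use more elaborate sampling schemes, so this is only meant to convey the concept:
import pandas as pd

df = pd.DataFrame({"x": range(1000), "target": [1] * 50 + [0] * 950})

minority = df[df["target"] == 1]
# Keep only a random subset of the majority class (here a 4:1 ratio, chosen arbitrarily).
majority = df[df["target"] == 0].sample(n=len(minority) * 4, random_state=0)

balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)
print(balanced["target"].value_counts())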

45.7 Feature Transformations

Where can I get details of the various transformations performed in an experiment?


Download the experiment’s log .zip file from the GUI. This zip file includes summary information, log
information, and a gene_summary.txt file with details of the transformations used in the experiment.
Specifically, there is a details folder with all subprocess logs.
On the server, the experiment specific files are inside the /tmp/h2oai_experiment_<name>/
folder after the experiment completes, particularly h2oai_experiment_logs_<name>.zip and
h2oai_experiment_summary_<name>.zip.

45.8 Predictions

How can I download the predictions onto the machine where Driverless AI is running?
When you select Score on Another Dataset, the predictions will automatically be stored on the machine
where Driverless AI is running. They will be saved in the following locations (and can be opened again
by Driverless AI, both for .csv and .bin):
• Training Data Predictions: tmp/h2oai_experiment_<name>/train_preds.csv (also saved as .bin)
• Testing Data Predictions: tmp/h2oai_experiment_<name>/test_preds.csv (also saved as .bin)
• New Data Predictions: tmp/h2oai_experiment_<name>/automatically_generated_name.csv. Note
that the automatically generated name will match the name of the file downloaded to your local
computer.
Why are predicted probabilities not available when I run an experiment without ensembling?
When Driverless AI provides pre-computed predictions after completing an experiment, it uses only those
parts of the modeling pipeline that were not trained on the particular rows for which the predictions are
made. This means that Driverless AI needs holdout data in order to create predictions, such as validation
or test sets, where the model is trained on training data only. In the case of ensembles, Driverless AI


uses cross-validation to generate holdout folds on the training data, so we are able to provide out-of-fold
estimates for every row in the training data and, hence, can also provide training holdout predictions (that
will provide a good estimate of generalization performance). In the case of a single model, though, that
is trained on 100% of the training data. There is no way to create unbiased estimates for any row in the
training data. While DAI uses an internal validation dataset, this is a re-usable holdout, and therefore
will not contain holdout predictions for the full training dataset. You need cross-validation in order to
get out-of-fold estimates, and then that’s not a single model anymore. If you want to still get predictions
for the training data for a single model, then you have to use the scoring API to create predictions on the
training set. From the GUI, this can be done using the Score on Another Dataset button for a completed
experiment. Note, though, that the results will likely be overly optimistic, too good to be true, and virtually
useless.

45.9 Deployment

What drives the size of a MOJO?


The size of the MOJO is based on the complexity of the final modeling pipeline (i.e., feature engineering
and models). One of the biggest factors is the amount of higher-order interactions between features, espe-
cially target encoding and related features, which have to store lookup tables for all possible combinations
observed in the training data. You can reduce the amount of these transformations by reducing the value
of Max. feature interaction depth and/or Feature engineering effort under Expert Settings, or by in-
creasing the interpretability settings for the experiment. Ensembles also contribute to the final modeling
pipeline’s complexity as each model has its own pipeline. Lowering the accuracy settings or increasing
the ensemble_accuracy_switch setting in the config.toml file can help here. The number of fea-
tures Max. pipeline features also affects the MOJO size. Text transformers are pretty bulky as well and
can add to the MOJO size.
Running the scoring pipeline for my MOJO is taking several hours. How can I get this to run faster?
When running example.sh, Driverless AI implements a memory setting, which is suitable for most use
cases. For very large models, however, it may be necessary to increase the memory limit when running the
Java application for data transformation. This can be done using the -Xmx25g parameter. For example:
java -Xmx25g -Dai.h2o.mojos.runtime.license.file=license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

Why have I encountered a “Best Score is not finite” error?


Driverless AI uses 32-bit floats by default. You may encounter this error if your data value exceeds 1E38 or
if you are resolving more than 1 part in 10 million. You can resolve this error using one of the following
methods:
• Enable the Force 64-bit Precision option in the experiment’s Expert Settings.
or
• Set data_precision="float64" and transformer_precision="float64" in con-
fig.toml.


45.10 Time Series

What if my data has a time dependency?


If you know that your data has a strong time dependency, select a time column before starting the experi-
ment. The time column must be in a Datetime format that can be parsed by pandas, such as “2017-11-06
14:32:21”, “Monday, June 18, 2012” or “Jun 18 2018 14:34:00” etc., or contain only integers.
If you are unsure about the strength of the time dependency, run two experiments: one with the time column set to "[OFF]" and one with the time column set to "[AUTO]" (or pick a time column yourself).
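If you want to verify ahead of time that your time stamps can be parsed, a quick check with pandas (outside of Driverless AI) is enough; the formats below are the ones quoted above:

    import pandas as pd

    # Each of these should parse without error; a format that pandas cannot
    # parse will raise an exception and is likely not usable as a time column.
    for ts in ["2017-11-06 14:32:21", "Monday, June 18, 2012", "Jun 18 2018 14:34:00"]:
        print(pd.to_datetime(ts))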
What is a lag, and why does it help?
A lag is a feature value from a previous point in time. Lags are useful to take advantage of the fact that
the current (unknown) target value is often correlated with previous (known) target values. Hence, they
can better capture target patterns along the time axis.
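To make this concrete, here is a small pandas sketch (for illustration only, not how Driverless AI computes lag features internally): a lag-1 feature is simply the previous target value within each time series group.

    import pandas as pd

    df = pd.DataFrame({
        "store": ["A", "A", "A", "B", "B", "B"],
        "week":  [1, 2, 3, 1, 2, 3],
        "sales": [10, 12, 11, 5, 7, 6],
    })
    df = df.sort_values(["store", "week"])
    # The lag-1 feature: last week's sales for the same store.
    df["sales_lag_1"] = df.groupby("store")["sales"].shift(1)
    print(df)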
Why can’t I specify a validation data set for time-series problems? Why do you look at the test set for time-series problems?
The problem with validation vs. test in the time-series setting is that there is only one valid way to define the split. If a test set is given, its length in time defines the validation split, and the validation data has to be part of the training data. Otherwise, the time-series validation won’t be useful.
For instance: Let’s assume we have train = [1,2,3,4,5,6,7,8,9,10] and test = [12,13], where integers define
time periods (e.g., weeks). For this example, the most natural train/valid split that mimics the test scenario
would be: train = [1,2,3,4,5,6,7] and valid = [9,10], and period 8 is not included in the training set to allow for a gap. Note that we will look at the start time and the duration of the test set only (if provided), and not at the contents of the test data (neither features nor target). If the user provides validation = [8,9,10] instead of test data, then this could lead to an inferior validation strategy and worse generalization. Hence,
we use the user-given test set only to create the optimal internal train/validation splits. If no test set is
provided, the user can provide the length of the test set (in periods), the length of the train/test gap (in
periods) and the length of the period itself (in seconds).
Why does the gap between train and test matter? Is it because of creating the lag features on the test set?
Taking the gap into account is necessary in order to avoid overly optimistic estimates of the true error and
to avoid creating history-based features like lags for the training and validation data (which cannot be
created for the test data due to the missing information).
In regards to applying the target lags to different subsets of the time group columns, are you saying Driverless
AI performs auto-correlation at “levels” of the time series? For example, consider the Walmart dataset where
I have Store and Dept (and my target is Weekly Sales). Are you saying that Driverless AI checks for auto-
correlation in Weekly Sales based on just Store, just Dept, and both Store and Dept?
Currently, auto-correlation is only applied on the detected superkey (entire TGC) of the training dataset
relation at the very beginning. It’s used to rank potential lag sizes, with the goal of pruning the search space
for the GA optimization process, which is responsible for selecting the lag features.
How does Driverless AI detect the time period?
Driverless AI treats each time series as a function with some frequency 1/ns. The actual value is estimated by the median of the time deltas across maximal-length TGC subgroups. The chosen time unit is the available (SI-style) unit that is closest to this estimate.
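As a rough illustration of the idea (not the exact internal algorithm), the period of a single series can be estimated as the median spacing between consecutive time stamps:

    import pandas as pd

    stamps = pd.Series(pd.to_datetime(
        ["2019-01-07", "2019-01-14", "2019-01-21", "2019-02-04"]))
    deltas = stamps.diff().dropna()
    print(deltas.median())   # 7 days -> the closest available time unit is weekly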
What is the logic behind the selectable numbers for forecast horizon length?
The shown forecast horizon options are based on quantiles of valid splits. This is necessary because
Driverless AI cannot display all possible options in general.


Assume that in my Walmart dataset, all stores provided data at the week level, but one store provided data at
the day level. What would Driverless AI do?
Driverless AI would still assume “weekly data” in this case because the majority of stores exhibit this property. The “daily” store would be resampled to the detected overall frequency.
Assume that in my Walmart dataset, all stores and departments provided data at the weekly level, but one
department in a specific store provided weekly sales on a bi-weekly basis (every two weeks). What would
Driverless AI do?
That’s similar to having missing data. Due to proper resampling, Driverless AI can handle this without
any issues.
Why does the number of weeks that you want to start predicting matter?
That’s an option to provide a train-test gap if no test data is available. That is to say, “I don’t have
my test data yet, but I know it will have a gap to train of x.”
Are the scoring components of time series sensitive to the order in which new pieces of data arrive? I.e., is each
row independent at scoring time, or is there a real-time windowing effect in the scoring pieces?
Each row is independent at scoring time.
What happens if the user, at predict time, gives a row with a time value that is too small or too large?
Internally, “out-of bounds” time values are encoded with special values. The samples will still be scored,
but the predictions won’t be trustworthy.
What’s the minimum data size for a time series recipe?
We recommend that you have around 10,000 validation samples in order to get a reliable estimate of
the true error. The time series recipe can still be applied for smaller data, but the validation error might be
inaccurate.
How long must the training data be compared to the test data?
At a minimum, the training data has to be at least twice as long as the test data along the time axis.
However, we recommend that the training data be at least three times as long as the test data.
How does the time series recipe deal with missing values?
Missing values will be converted to a special value, which is different from any non-missing feature value.
Explicit imputation techniques won’t be applied.
Can the time information be distributed across multiple columns in the input data (such as [year, day, month])?
Currently, Driverless AI requires the time stamps to be given in a single column. Driverless AI will create additional time features like [year, day, month] on its own, if they turn out to be useful.
What type of modeling approach does Driverless AI use for time series?
Driverless AI combines the creation of history-based features (like lags and moving averages) with the modeling techniques that are also applied to i.i.d. data. The primary model of choice is XGBoost.
What’s the idea behind exponential weighting of moving averages?
Exponential weighting accounts for the possibility that more recent observations are better suited to explain
the present than older observations.
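As a small illustration of that weighting (the span value here is an arbitrary example, not a Driverless AI default):

    import pandas as pd

    s = pd.Series([10, 12, 11, 15, 14, 18])
    # Exponentially weighted moving average: recent values receive more weight.
    print(s.ewm(span=3, adjust=False).mean())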


45.11 Logging

How can I reduce the size of the Audit Logger?


An Audit Logger file is created every day that Driverless AI is in use. The audit_log_retention_period config variable allows you to specify the number of days after which the audit.log will be overwritten. This option defaults to 5 days, which means that Driverless AI will maintain Audit Logger files for the last 5 days; audit.log files older than 5 days are removed and replaced with newer log files. When this option is set to 0, the audit.log file will not be overwritten.
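For example, to keep audit logs for 7 days instead of the default 5 (7 is just an example value), set the following in config.toml:

    audit_log_retention_period = 7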



CHAPTER

FORTYSIX

TIPS ‘N TRICKS

This section includes Arno’s tips for running Driverless AI.

46.1 Pipeline Tips

Given training data and a target column to predict, H2O Driverless AI produces an end-to-end pipeline tuned for high
predictive performance (and/or high interpretability) for general classification and regression tasks. The pipeline has
only one purpose: to take a test set, row by row, and turn its feature values into predictions.
A typical pipeline creates dozens or even hundreds of derived features from the user-given dataset. Those transforma-
tions are often based on precomputed lookup tables and parameterized mathematical operations that were selected and
optimized during training. It then feeds all these derived features to one or several machine learning algorithms such
as linear models, deep learning models, or gradient boosting models (and several more derived models). If there are
multiple models, then their output is post-processed to form the final prediction (either probabilities or target values).
The pipeline is a directed acyclic graph.
It is important to note that the training dataset is processed as a whole for better results (e.g., aggregate statistics).
For scoring, however, every row of the test dataset must be processed independently to mimic the actual production
scenario.
To facilitate deployment to various production environments, there are multiple ways to obtain predictions from a
completed Driverless AI experiment, either from the GUI, from the R or Python client API, or from a standalone
pipeline.
GUI
• Score on Another Dataset - Convenient, parallelized, ideal for imported data
• Download Predictions - Available if a test set was provided during training
• Deploy - Creates an Amazon Lambda endpoint (more endpoints coming soon)
• Diagnostics - Useful if the test set includes a target column
Client APIs
• Python client - Use the make_prediction_sync() method. An optional argument can be used to get
per-row and per-feature ‘Shapley’ prediction contributions. (Pass pred_contribs=True.)
• R client - Use the predict() method. An optional argument can be used to get per-row and per-feature ‘Shapley’ prediction contributions. (Pass pred_contribs = TRUE.)
Standalone Pipelines
• Python - Supports all models and transformers, and supports ‘Shapley’ prediction contributions and MLI reason
codes


• Java - Most portable, low latency, supports all models and transformers that are enabled by default (except
TensorFlow NLP transformers), can be used in Spark/H2O-3/SparklingWater for scale
• C++ - Highly portable, low latency, standalone runtime with a convenient Python and R wrapper

46.2 Time Series Tips

H2O Driverless AI handles time-series forecasting problems out of the box.


All you need to do when starting a time-series experiment is to provide a regular columnar dataset containing your
features. Then pick a target column and also pick a “time column” - a designated column containing time stamps
for every record (row) such as “April 10 2019 09:13:41” or “2019/04/10”. If you have a test set for which you want
predictions for every record, make sure to provide future time stamps and features as well.
In most cases, that’s it. You can launch the experiment and let Driverless AI do the rest. It will even auto-detect
multiple time series in the same dataset for different groups such as weekly sales for stores and departments (by
finding the columns that identify stores and departments to group by). Driverless AI will also auto-detect the time
period including potential gaps during weekends, as well as the forecast horizon, a possible time gap between training
and testing time periods (to optimize for deployment delay) and even keeps track of holiday calendars. Of course, it
automatically creates multiple causal time-based validation splits (sliding time windows) for proper validation, and
incorporates many other related grand-master recipes such as automatic target and non-target lag feature generation as
well as interactions between lags, first and second derivatives and exponential smoothing.
• If you find that the automatic lag-based time-series recipe isn’t performing well for your dataset, we recommend
that you try to disable the creation of lag-based features by disabling “Time-series lag-based recipe” in the expert
settings. This will lead to regular feature engineering but with time-based causal validation splits. Especially
for small datasets and short forecast periods, this can lead to better results.
• If the target column is present in the test set and has partially filled information (non-missing values), then
Driverless AI will automatically augment the model with those future target values to make better predictions.
This can be used to extend the usable lifetime of the model into the future without the need for retraining by
providing past known outcomes. Contact us if you’re interested in learning more about test-time augmentation.
• For now, training and test datasets should have the same input features available, so think about which of the
predictors (input features) will be available during production time and drop the rest (or create your own lag
features that can be available to both train and test sets).
• For datasets that are non-stationary in time, create a test set from the last temporal portion of data, and create
time-based features. This allows the model to be optimized for your production scenario.
• We are working on further improving many aspects of our time-series recipe. For example, we will add support
to automatically generate lags for features that are only available in the training set, but not in the test set, such as
environmental or economic factors. We’ll also improve the performance of back-testing using rolling windows.
• Since version 1.7.0, you can bring your own recipes (BYOR) for features, models and scorers, and that includes time-series recipes! We are very excited about that. Please contact us if you are interested in learning more about BYOR.


46.3 Scorer Tips

A core capability of H2O Driverless AI is the creation of automatic machine learning modeling pipelines for supervised
problems. In addition to the data and the target column to be predicted, the user can pick a scorer. A scorer is a function
that takes actual and predicted values for a dataset and returns a number. Looking at this single number is the most
common way to estimate the generalization performance of a predictive model on unseen data by comparing the
model’s predictions on the dataset with its actual values. There are more detailed ways to estimate the performance
of a machine learning model such as residual plots (available on the Diagnostics page in Driverless AI), but we will
focus on scorers here.
For a given scorer, Driverless AI optimizes the pipeline to end up with the best possible score for this scorer. The
default scorer for regression problems is RMSE (root mean squared error), where 0 is the best possible value. For
example, for a dataset containing 4 rows, if actual target values are [1, 1, 10, 0], but predictions are [2, 3, 4, -1], then
the RMSE is sqrt((1+4+36+1)/4) and the largest misprediction dominates the overall score (quadratically). Driverless
AI will focus on improving the predictions for the third data point, which can be very difficult when hard-to-predict
outliers are present in the data. If outliers are not that important to get right, a metric like the MAE (mean absolute
error) can lead to better results. For this case, the MAE is (1+2+6+1)/4 and the optimization process will consider
all errors equally (linearly). Another scorer that is robust to outliers is RMSLE (root mean square logarithmic error),
which is like RMSE but after taking the logarithm of actual and predicted values - however, it is restricted to positive
values. For price predictions, scorers such as MAPE (mean absolute percentage error) or MER (median absolute
percentage error) are useful, but have problems with zero or small positive values. SMAPE (symmetric mean absolute
percentage error) is designed to improve upon that.
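The numbers in the 4-row example above can be reproduced directly with a few lines of Python (illustration only):

    import numpy as np

    actual    = np.array([1, 1, 10, 0])
    predicted = np.array([2, 3, 4, -1])
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # sqrt((1+4+36+1)/4), about 3.24
    mae  = np.mean(np.abs(actual - predicted))          # (1+2+6+1)/4 = 2.5
    print(rmse, mae)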
For classification problems, the default scorer is either the AUC (area under the receiver operating characteristic
curve) or LOGLOSS (logarithmic loss) for imbalanced problems. LOGLOSS focuses on getting the probabilities
right (strongly penalizes wrong probabilities), while AUC is designed for ranking problems. Gini is similar to the
AUC, but measures the quality of ranking (inequality) for regression problems. For general imbalanced classification
problems, AUCPR and MCC are good choices, while F05, F1 and F2 are designed to balance recall against precision.
We highly suggest experimenting with different scorers and studying their impact on the resulting models. Using the
Diagnostics page in Driverless AI, all applicable scores can be computed for any given model, no matter which scorer
was used during training.

46.4 Knob Settings Tips

H2O Driverless AI allows you to customize every experiment in great detail via the expert settings. The most important
controls, however, are the three knobs for accuracy, time and interpretability. A higher accuracy setting results in a
better estimate of the model generalization performance, usually through using more data, more holdout sets, more
parameter tuning rounds and other advanced techniques. A higher time setting means the experiment is given more time to converge to an optimal solution. A higher interpretability setting reduces the model’s complexity through less feature engineering and the use of simpler models. In general, a setting of 1/1/10 will lead to the simplest and usually
least accurate modeling pipeline, while a setting of 10/10/1 will lead to the most complex and most time consuming
experiment possible. Generally, it is sufficient to use settings of 7/5/5 or similar, and we recommend starting with the
default settings. We highly recommend studying the experiment preview on the left-hand side of the GUI before each
experiment - it can help you fine-tune the settings and save time overall.
Note that you can always finish an experiment early, either by clicking ‘Finish’ to get the deployable final pipeline out,
or by clicking ‘Abort’ to instantly terminate the experiment. In either case, the experiment can be continued seamlessly
at a later time with ‘Restart from last Checkpoint’ or ‘Retrain Final Pipeline’, and you can always turn the knobs (or
modify the expert settings) to adapt to your requirements.


46.5 Tips for Running an Experiment

H2O Driverless AI is an automatic machine learning platform designed to create highly accurate modeling pipelines
from tabular training data. The predictive performance of the pipeline is a function of both the training data and
the parameters of the pipeline (details of feature engineering and modeling). During an experiment, Driverless AI
automatically tunes these parameters by scoring candidate pipelines on held out (“validation”) data. This important
validation data is either provided by the user (for experts) or automatically created (random, time-based or fold-based)
by Driverless AI. Once a final pipeline has been created, it should be scored on yet another held out dataset (“test data”)
to estimate its generalization performance. Understanding the origin of the training, validation and test datasets (“the
validation scheme”) is critical for success with machine learning, and we welcome your feedback and suggestions to
help us create the right validation schemes for your use cases.

46.6 Expert Settings Tips

H2O Driverless AI offers a range of ‘Expert Settings’ that allow you to customize each experiment. For example, you
can limit the amount of feature engineering by reducing the value for ‘Feature engineering effort’ or ‘Max. feature
interaction depth’ or by disabling ‘Target Encoding’. You can also select the model types to be used for training
on the engineered features (such as XGBoost, LightGBM, GLM, TensorFlow, FTRL, or RuleFit). For time-series
problems where the selected time_column leads to an error message (this can currently happen if the time structure
is not regular enough - we are working on an improved version), you can disable the ‘Time-series lag-based recipe’
and Driverless AI will create train/validation splits based on the time order instead, which can increase the model’s
performance if the time column is important.

46.7 Checkpointing Tips

Driverless AI provides the option to checkpoint experiments to speed up feature engineering and model tuning when
running multiple experiments on the same dataset. By default, H2O Driverless AI automatically scans all prior exper-
iments (including aborted ones) for an optimal checkpoint to restart from. You can select a specific prior experiment
to restart a new experiment from with “Restart from Last Checkpoint” in the experiment listing page (click on the 3
yellow bars on the right). You can disable checkpointing by setting ‘Feature Brain Level’ in the expert settings (or
feature_brain_level in the configuration file) to 0 to force the experiment to start from scratch.
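For example, to disable checkpointing via the configuration file:

    # config.toml: force every experiment to start from scratch
    feature_brain_level = 0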

46.8 Text Data Tips

For datasets that contain text (string) columns - where each value can be a few words, a paragraph or an entire document
- Driverless AI automatically creates NLP features based on bag of words, tf-idf, singular value decomposition and
out-of-fold likelihood estimates. In versions 1.3 and above, you can enable TensorFlow in the expert settings to see
how CNN (convolutional neural net) based learned word embeddings can improve predictive accuracy even more. Try
this for sentiment analysis, document classification, and generic text-enriched datasets.
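For a flavor of what the bag-of-words/tf-idf/SVD features look like, here is a small scikit-learn sketch (illustration only; Driverless AI’s internal NLP transformers differ in the details):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["great product, works well",
            "terrible, broke after a day",
            "works as expected"]
    tfidf = TfidfVectorizer().fit_transform(docs)   # sparse bag-of-words weights
    svd = TruncatedSVD(n_components=2, random_state=0)
    features = svd.fit_transform(tfidf)             # dense numeric features per document
    print(features.shape)                           # (3, 2)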



CHAPTER

FORTYSEVEN

APPENDIX A: CUSTOM RECIPES

This appendix describes how to use custom recipes in Driverless AI. You’re welcome to create your own recipes, or
you can select from a number of recipes available in the https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.4
repository.
Notes:
• Recipes only need to be added once. After a recipe is added to an experiment, that recipe will then be available
for all future experiments.
• In most cases (especially for complex recipes), MOJOs won’t be available out of the box. But, it is possible to
get the MOJO. Contact [email protected] for more information about creating MOJOs for custom recipes. (Note
that the Python Scoring Pipeline features full support for custom recipes.)

47.1 Additional Resources

• Custom Recipes FAQ: For answers to common questions about custom recipes.
• How to Write a Recipe: A guide for writing your own recipes.
• Data Template: A template for creating your own Data recipe.
• Model Template: A template for creating your own Model recipe.
• Scorer Template: A template for creating your own Scorer recipe.
• Transformer Template: A template for creating your own Transformer recipe.

47.2 Examples

47.2.1 Adding a Data Recipe

Driverless AI allows you to create a new dataset by modifying an existing dataset with a data recipe. (Refer to the
modify_by_recipe section for more information.) This example shows you how to use the Live Code option to create
a new dataset by adding a data recipe.
1. Navigate to the Datasets page, then click on the dataset you want to modify.


2. Click Details from the submenu that appears to open the Dataset Details page.
3. Click the Modify by Recipe button in the top right portion of the UI, then click Live Code from the submenu
that appears.


4. Enter the code for the data recipe you want to use to modify the dataset. Click the Get Preview button to see
a preview of how the data recipe will modify the dataset. In this simple example, the data recipe modifies the
number of rows and columns in the dataset.

5. Click the Save button to confirm the changes and create a new dataset. (The original dataset will still be available
on the Datasets page.)

47.2.2 Driverless AI with H2O-3 Algorithms

Driverless AI already supports a variety of algorithms. This example shows how you can use our h2o-3-models-py
recipe to include H2O-3 supervised learning algorithms in your experiment. The available H2O-3 algorithms in the
recipe include:
• Naive Bayes
• GBM
• Random Forest
• Deep Learning
• GLM
• AutoML
Caution: Because AutoML is treated as a regular ML algorithm here, the runtime requirements can be large.
We recommend that you adjust the max_runtime_secs parameter as suggested here: https://github.com/h2oai/driverlessai-recipes/blob/rel-1.8.4/models/algorithms/h2o-3-models.py#L39
1. Start an experiment in Driverless AI by selecting your training dataset along with (optionally) validation and
testing datasets and then specifying a Target Column. Notice the list of algorithms that will be used in the Fea-
ture evolution section of the experiment summary. In the example below, the experiment will use LightGBM
and XGBoostGBM.


2. Click on Expert Settings.


3. Specify the custom recipe using one of the following methods:
• On your local machine, clone the https://github.com/h2oai/driverlessai-recipes repository for this release branch. Then use the Upload Custom Recipe button to upload the driverlessai-recipes/models/algorithms/h2o-3-models.py file.
• Click the Load Custom Recipe from URL button, then enter the URL for the raw h2o-3-models.py file (for example, https://raw.githubusercontent.com/h2oai/driverlessai-recipes/rel-1.8.4/models/algorithms/h2o-3-models.py).
Note: Click the Official Recipes (External) button to browse the driverlessai-recipes repository.

Driverless AI will begin uploading and verifying the new custom recipe.


4. In the Expert Settings page, specify any additional settings and then click Save. This returns you to the experi-
ment summary.
5. To include each of the new models in your experiment, return to the Expert Settings option. Click the Recipes
> Include Specific Models option. Select the algorithm(s) that you want to include. Click Done to return to the
experiment summary.

Notice the updated list of available algorithms in the experiment.


6. Edit any additional experiment settings, and then click Launch Experiment.
Upon completion, you can download the Experiment Summary and review the Model Tuning section of the re-
port.docx file to see how each of the algorithms compare.

47.2.3 Using a Custom Scorer

Driverless AI supports a number of scorers, including:


• Regression: GINI, MAE, MAPE, MER, MSE, R2, RMSE (default), RMSLE, RMSPE, SMAPE, TOPDECILE
• Classification: ACCURACY, AUC (default), AUCPR, F05, F1, F2, GINI, LOGLOSS, MACROAUC, MCC
This example shows how you can include a custom scorer in your experiment. This example will use the Explained
Variance scorer, which is used for regression experiments.
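Before adding the recipe, it can help to see what the scorer measures. Explained variance is also available in scikit-learn, so a quick illustration (unrelated to the recipe code itself) is:

    from sklearn.metrics import explained_variance_score

    # 1.0 is a perfect score; lower values mean the predictions explain less of
    # the variance of the actual values.
    print(explained_variance_score([1, 1, 10, 0], [2, 3, 4, -1]))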
1. Start an experiment in Driverless AI by selecting your training dataset along with (optionally) validation and
testing datasets and then specifying a (regression) Target Column.
2. The scorer defaults to RMSE. Click on Expert Settings.


3. Specify the custom scorer recipe using one of the following methods:
• On your local machine, clone the https://github.com/h2oai/driverlessai-recipes repository for this release branch. Then use the Upload Custom Recipe button to upload the driverlessai-recipes/scorers/regression/explained_variance.py file.
• Click the Load Custom Recipe from URL button, then enter the URL for the raw explained_variance.py file (for example, https://raw.githubusercontent.com/h2oai/driverlessai-recipes/rel-1.8.4/scorers/regression/explained_variance.py).
Note: Click the Official Recipes (External) button to browse the driverlessai-recipes repository.

Driverless AI will begin uploading and verifying the new custom recipe.
4. In the Experiment Summary page, select the new Explained Variance (EXPVAR) scorer. (Note: If you do not
see the EXPVAR option, return to the Expert Settings, select Recipes > Include Specific Scorers, then click the
Enable Custom button in the top right corner. Click Done and then Save to return to the Experiment Summary.)


5. Edit any additional experiment settings, and then click Launch Experiment. The experiment will run using the
custom Explained Variance scorer.

47.2.4 Using a Custom Transformer

Driverless AI supports a number of feature transformers as described in Driverless AI Transformations. This example
shows how you can include a custom transformer in your experiment. Specifically, this example will show how to add
the ExpandingMean transformer.
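To see what an “expanding mean” target encoding does conceptually, here is a small pandas sketch (illustration only; the actual recipe in the repository differs in the details): each row’s category is encoded by the mean target of the rows that came before it in the same category.

    import pandas as pd

    df = pd.DataFrame({"cat": ["a", "a", "b", "a", "b"],
                       "target": [1, 0, 1, 1, 0]})
    # Mean of all *previous* targets within the same category (NaN for the first row).
    df["cat_expanding_mean"] = (
        df.groupby("cat")["target"]
          .transform(lambda s: s.expanding().mean().shift(1))
    )
    print(df)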
1. Start an experiment in Driverless AI by selecting your training dataset along with (optionally) validation and
testing datasets and then specifying a Target Column. Notice the list of transformers that will be used in the
Feature engineering search space (where applicable) section of the experiment summary. Driverless AI
determines this list based on the dataset and experiment.


2. Click on Expert Settings.


3. Specify the custom recipe using one of the following methods:
• On your local machine, clone the https://github.com/h2oai/driverlessai-recipes repository for this release branch. Then use the Upload Custom Recipe button to upload the driverlessai-recipes/transformers/targetencoding/ExpandingMean.py file.
• Click the Load Custom Recipe from URL button, then enter the URL for the raw ExpandingMean.py file (for example, https://raw.githubusercontent.com/h2oai/driverlessai-recipes/rel-1.8.4/transformers/targetencoding/ExpandingMean.py).
Note: Click the Official Recipes (External) button to browse the driverlessai-recipes repository.

Driverless AI will begin uploading and verifying the new custom recipe.


4. Navigate to the Expert Settings > Recipes tab and click the Include Specific Transformers button. Notice
that all transformers are selected by default, including the new ExpandingMean transformer (bottom of page).

5. Select the transformers that you want to include in the experiment. Use the Check All/Uncheck All button to
quickly add or remove all transformers at once. This example removes all transformers except for OriginalTransformer and ExpandingMean.
Note: If you uncheck all transformers so that none is selected, Driverless AI will ignore this and will use
the default list of transformers for that experiment. (See the image in Step 1.) This list of transformers
will vary for each experiment.


6. Edit any additional experiment settings, and then click Launch Experiment. The experiment will run using the
custom ExpandingMean transformer.



CHAPTER

FORTYEIGHT

APPENDIX B: THIRD-PARTY INTEGRATIONS

H2O Driverless AI integrates with a (continuously growing) number of third-party products. Please contact
[email protected] to schedule a discussion with one of our Solution Engineers for more information.
If you are interested in a product not yet listed here, please ask us about it!

48.1 Instance Life-Cycle Management

The following products are able to manage (start and stop) Driverless AI instances themselves:

• BlueData - DAI runs in a BlueData container
• Domino - DAI runs in a Domino container
• IBM Spectrum Conductor - DAI runs in user mode via TAR SH distribution
• IBM Cloud Private (ICP) - Uses Kubernetes underneath; DAI runs in a docker container; requires a HELM chart
• Kubernetes - DAI runs as a long-running service via a Docker container
• Kubeflow - Abstraction of Kubernetes; allows additional monitoring and management of Kubernetes deployments
• Puddle (from H2O.ai) - Multi-tenant orchestration platform for DAI instances (not a third party, but listed here for completeness)
• SageMaker - Bring your own algorithm docker container

48.2 API Clients

The following products have Driverless AI client API integrations:

• Alteryx - Allows users to interact with a remote DAI server from Alteryx Designer
• Cinchy - Data collaboration for the Enterprise; use MOJOs to enrich data and use the Cinchy data network to train models
• Jupyter/Python - The DAI Python API client library can be downloaded from the Web UI of a running instance
• KDB - Use KDB as a data source in Driverless AI for training
• RStudio/R - Under development; please ask for the DAI R API client library


48.3 Scoring

The following products have Driverless AI scoring integrations:

• KDB - Call a MOJO to score streaming data from the KDB Ticker Service
• ParallelM - Deploy and monitor MOJO models
• Qlik - Call a MOJO from a Qlik dashboard
• SageMaker - Host a scoring-only docker image that uses a MOJO
• Trifacta - Call a MOJO as a UDF
• UiPath - Call a MOJO from within an RPA workflow

48.4 Storage

• Network Appliance - A mounted expandable volume is convenient for the Driverless AI working (tmp) directory

48.5 Data Sources

Please visit the section on Enabling Data Connectors for information about data sources supported by Driverless AI.



CHAPTER

FORTYNINE

REFERENCES

Adebayo, Julius A. “Fairml: Toolbox for diagnosing bias in predictive modeling.” Master’s Thesis, MIT, 2016.
Breiman, Leo. “Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author).” Statistical
Science 16, no. 3, 2001.
Craven, Mark W. and Shavlik, Jude W. “Extracting tree structured representations of trained networks.” Advances in
Neural Information Processing Systems, 1996.
Goldstein, Alex, Kapelner, Adam, Bleich, Justin, and Pitkin, Emil. “Peeking inside the black box: Visualizing statisti-
cal learning with plots of individual conditional expectation.” Journal of Computational and Graphical Statistics, no.
24, 2015.
Groeneveld, R.A. and Meeden, G. (1984), “Measuring Skewness and Kurtosis.” The Statistician, 33, 391-399.
Hall, Patrick, Wen Phan, and SriSatish Ambati. “Ideas for Interpreting Machine Learning.” O’Reilly Ideas. O’Reilly
Media, 2017.
Hartigan, J. A. and Mohanty, S. (1992), “The RUNT test for multimodality,” Journal of Classification, 9, 63–70.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer, 2008.
Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-Free Predictive
Inference for Regression.” Journal of the American Statistical Association (just-accepted), 2017.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Why Should I Trust You?: Explaining the Predictions of
Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, 2016.
Wilkinson, L. (1999). “Dot plots.” The American Statistician, 53, 276–281.
Wilkinson, L., Anand, A., and Grossman, R. (2005), “Graph-theoretic Scagnostics,” in Proceedings of the IEEE
Information Visualization 2005, pp. 157–164.
