Accelerated Data Science
Release 2.6.8
1 Release Notes
1.1 2.6.8
1.2 2.6.7
1.3 2.6.6
1.4 2.6.5
1.5 2.6.4
1.6 2.6.3
1.7 2.6.2
1.8 2.6.1
1.9 2.5.10
1.10 2.5.9
1.11 2.5.8
1.12 2.5.7
1.13 2.5.6
1.14 2.5.5
1.15 2.5.4
1.16 2.5.3
1.17 2.5.2
1.18 2.5.0
1.19 2.4.2
1.20 2.4.1
1.21 2.4.0
1.22 2.3.4
1.23 2.3.3
1.24 2.3.1
1.25 2.2.1
1.26 January 13, 2021
1.27 August 11, 2020
1.28 June 9, 2020
1.29 April 30, 2020
1.30 March 18, 2020
2 Quick Start
3.2.2.2 Installing extras libraries
4 Authentication
4.1 1. Authenticating Using Resource Principals
4.2 2. Authenticating Using API Keys
4.3 3. Overriding Defaults
5 CLI Configuration
7 Load Data
7.1 Connecting to Data Sources
7.1.1 Object Storage
7.1.2 Local Storage
7.1.3 Oracle Database
7.1.3.1 Oracle ADB to Pandas
7.1.3.2 Oracle Database to Pandas - No Wallet
7.1.3.3 Performance
7.1.3.4 Large Result Set
7.1.3.5 Very Large Result Set
7.1.3.6 Pandas to Oracle Database
7.1.4 MySQL
7.1.5 BDS Hive
7.1.5.1 Connection Parameters
7.1.5.2 Partition
7.1.5.3 Large Dataframe
7.1.6 HTTP(S) Sources
7.1.7 Convert Pandas DataFrame to ADSDataset
7.1.8 Using PyArrow
7.2 DataSetFactory
7.2.1 Connect with DatasetFactory
7.2.1.1 Object Storage
7.2.1.2 Local Storage
7.2.1.2.1 Oracle Database
7.2.1.3 Autonomous Database
7.2.1.3.1 Load from ADB
7.2.1.3.2 Query ADB
7.2.1.4 Train a Model with ADB
7.2.1.5 Update ADB Tables
7.2.1.6 Amazon S3
7.2.1.7 HTTP(S) Sources
7.2.1.8 DatasetBrowser
8 Label Data
8.1 Overview
8.2 Quick Start
8.3 Export Metadata
8.4 List
8.5 Load
8.5.1 LabeledDatasetReader
8.5.2 Pandas Accessor
8.6 Visualize
8.6.1 Image
8.6.2 Text
8.7 Examples
8.7.1 Binary Text Classification
8.7.1.1 Dataset
8.7.1.2 Load
8.7.1.3 Preprocess
8.7.1.4 Train
8.7.1.5 Predict
8.7.2 Image Classification
8.7.2.1 Data Source
8.7.2.2 Load
8.7.2.3 Visualize
8.7.2.4 Preprocess
8.7.2.5 Train
8.7.2.6 Predict
8.7.3 Multinomial Text Classification
8.7.3.1 Dataset
8.7.3.2 Load
8.7.3.3 Preprocess
8.7.3.4 Train
8.7.3.5 Predict
8.7.4 Named Entity Recognition
8.7.4.1 Dataset
8.7.4.2 Load
8.7.4.3 Preprocess
8.7.4.4 Train
8.7.4.5 Predict
9 Transform Data
9.1 Loading the Dataset
9.2 Automated Transformations
9.3 Row Operations
9.3.1 Delete Rows
9.3.2 Reset Index
9.3.3 Append Rows
9.3.4 Row Filtering
9.3.5 Removing Duplicated Rows
9.4 Column Operations
9.4.1 Delete a Column
9.4.2 Rename a Column
9.4.3 Counts of Unique Values
9.4.4 Normalize a Column
9.4.5 Combine Columns
9.4.6 Apply a Function to a Column
9.4.7 Change Data Type
9.5 Dataset Manipulation
9.5.1 Categorical Encoding
9.5.2 One-Hot Encoding
9.5.3 Extract Null Values
9.5.4 Imputation
9.5.5 Combine Datasets
9.5.5.1 Join Datasets
9.5.5.2 Concatenate Datasets
9.6 Train/Test Datasets
9.7 Text Data
9.7.1 TextStrings
9.7.1.1 Overview
9.7.1.2 Quick Start
9.7.1.2.1 NLP Parse
9.7.1.2.2 Plugin
9.7.1.2.3 RegEx Match
9.7.1.3 NLP Parse
9.7.1.3.1 NLTK
9.7.1.3.2 spaCy
9.7.1.4 Plugin
9.7.1.4.1 Custom Plugin
9.7.1.4.2 OCI Language Services
9.7.1.5 RegEx Match
9.7.1.6 Still a String
9.7.2 Text Extraction
9.7.2.1 Introduction
9.7.2.1.1 Configure the Data Source
9.7.2.2 Load
9.7.2.2.1 Read a Dataset
9.7.2.2.2 Read Options
9.7.2.3 Augment Records
9.7.2.3.1 Examples
9.7.2.4 Custom File Processor and Backend
9.7.2.4.1 Custom Backend
9.7.2.4.2 Custom File Processor
9.7.2.4.3 Example
11.2.2.1 Networks
11.2.2.2 OCI Policies
11.2.2.3 Policy Syntax
11.2.3 Developer Guide
11.2.3.1 Build Image
11.2.3.2 Publish Docker Image
11.2.3.3 Run the Container Image on OCI Data Science or Locally
11.2.3.4 Development Flow
11.2.3.5 Diagnosing Infrastructure Setup
11.2.4 Dask
11.2.4.1 Creating Workloads
11.2.4.2 Writing Dask Code
11.2.4.3 Distributed XGBoost & LightGBM
11.2.4.3.1 LightGBM
11.2.4.3.2 XGBoost
11.2.4.4 Securing with TLS
11.2.4.5 Dask Cluster Tuning
11.2.4.5.1 Configuring dask startup options
11.2.4.5.2 Configuration through Environment Variables
11.2.4.6 Dask dashboard
11.2.4.6.1 Bastion Host
11.2.5 Horovod
11.2.5.1 Creating Horovod Workloads
11.2.5.2 Writing Distributed Code with the Horovod Framework
11.2.5.2.1 TensorFlow
11.2.5.2.2 PyTorch
11.2.5.3 Monitoring Training
11.2.6 PyTorch Distributed
11.2.6.1 Creating PyTorch Distributed Workloads
11.2.7 TensorFlow
11.2.7.1 Creating TensorFlow Workloads
11.2.8 Run Source Code from Git or Object Storage
11.2.8.1 Git Repository
11.2.8.2 Object Storage
11.2.9 YAML Schema
11.2.10 Troubleshooting
11.3 TensorBoard
11.3.1 Setting up local environment
11.3.2 Viewing logs from your experiments
11.3.3 Writing TensorBoard logs to Object Storage
11.3.3.1 PyTorch
11.3.3.2 TensorFlow
11.3.3.2.1 OCI Data Science Notebook
11.3.3.2.2 OCI Data Science Jobs
11.4 Model Evaluation
11.4.1 Quick Start
11.4.1.1 Comparing Binary Classification Models
11.4.1.2 Comparing Multi Classification Models
11.4.1.3 Comparing Regression Models
11.4.2 Binary Classification
11.4.2.1 Fairness Metrics
11.4.3 Multinomial Classification
11.4.4 Regression
11.5 Model Explainability
11.5.1 Accumulated Local Effects
11.5.1.1 Overview
11.5.1.2 Description
11.5.1.3 Interpretation
11.5.1.4 Limitations
11.5.1.5 Examples
11.5.1.6 References
11.5.2 Feature Dependence Explanations
11.5.2.1 Overview
11.5.2.2 Description
11.5.2.3 Interpretation
11.5.2.3.1 PDP
11.5.2.3.2 ICE
11.5.2.4 Examples
11.5.2.5 References
11.5.3 Feature Importance Explanations
11.5.3.1 Overview
11.5.3.2 Description
11.5.3.3 Interpretation
11.5.3.4 Examples
11.5.3.5 References
11.5.4 Enhanced LIME
11.5.4.1 Overview
11.5.4.2 Description
11.5.4.3 Interpretation
11.5.4.4 Example
11.5.4.5 References
11.5.5 WhatIf Explainer
11.5.5.1 Description
11.5.5.2 Example
12.2.3 Model Schema
12.2.3.1 Schema Model
12.2.3.2 Generating Schema
12.2.3.3 Update the Schema
12.2.4 Model Metadata
12.2.4.1 Taxonomy Metadata
12.2.4.2 Custom Metadata
12.2.5 Customizing the Model
12.2.5.1 Customize score.py
12.2.6 Large Model Artifacts
12.2.6.1 Saving
12.2.6.2 Loading
12.2.7 Downloading Models from OCI Data Science
12.2.7.1 Download Registered Model
12.2.7.2 Download Deployed Model
12.3 Deploying model
12.3.1 Deploy
12.3.2 Predict
12.3.3 Observability
12.4 Frameworks
12.4.1 SklearnModel
12.4.1.1 Overview
12.4.1.2 Prepare Model Artifact
12.4.1.3 Summary Status
12.4.1.4 Register Model
12.4.1.5 Deploy and Generate Endpoint
12.4.1.6 Run Prediction against Endpoint
12.4.1.7 Examples
12.4.2 PyTorchModel
12.4.2.1 Overview
12.4.2.2 Prepare Model Artifact
12.4.2.3 Verify Changes to score.py
12.4.2.4 Summary Status
12.4.2.5 Register Model
12.4.2.6 Deploy and Generate Endpoint
12.4.2.7 Run Prediction against Endpoint
12.4.2.7.1 Predict with Image
12.4.2.8 Example
12.4.3 TensorFlowModel
12.4.3.1 Overview
12.4.3.2 Prepare Model Artifact
12.4.3.3 Summary Status
12.4.3.4 Register Model
12.4.3.5 Deploy and Generate Endpoint
12.4.3.6 Run Prediction against Endpoint
12.4.3.6.1 Predict with Image
12.4.3.7 Example
12.4.4 SparkPipelineModel
12.4.4.1 Overview
12.4.4.2 Prepare Model Artifact
12.4.4.3 Summary Status
12.4.4.4 Register Model
12.4.4.5 Deploy and Generate Endpoint
12.4.4.6 Run Prediction against Endpoint
12.4.4.7 Example
12.4.5 LightGBMModel
12.4.5.1 Overview
12.4.5.2 Prepare Model Artifact
12.4.5.3 Summary Status
12.4.5.4 Register Model
12.4.5.5 Deploy and Generate Endpoint
12.4.5.6 Run Prediction against Endpoint
12.4.5.7 Example
12.4.6 XGBoostModel
12.4.6.1 Overview
12.4.6.2 Prepare Model Artifact
12.4.6.3 Summary Status
12.4.6.4 Register Model
12.4.6.5 Deploy and Generate Endpoint
12.4.6.6 Run Prediction against Endpoint
12.4.6.7 Example
12.4.7 AutoMLModel
12.4.7.1 Overview
12.4.7.2 Initialize
12.4.7.3 Summary Status
12.4.7.4 Example
12.4.8 Other Frameworks
12.4.8.1 Overview
12.4.8.2 Prepare Model Artifact
12.4.8.3 Summary Status
12.4.8.4 Example
13.5 Data Catalog Metastore
13.5.1 Prerequisite
13.5.2 Quick Start
13.5.2.1 Data Flow
13.5.2.2 Interactive Spark
13.5.3 Data Flow
13.5.3.1 PySpark Script
13.5.3.2 Create Application
13.5.3.3 Run
13.5.4 Interactive Spark
13.6 [Legacy]
13.6.1 Prerequisite
13.6.2 Create an Instance
13.6.3 Generate a Script Using a Template
13.6.4 Create an Application
13.6.4.1 Load an Existing Application
13.6.4.2 Listing Applications
13.6.4.3 Create a Run
13.6.4.4 Fetching Logs
13.6.4.5 Edit and Synchronize PySpark Script
13.6.4.6 Arguments and Parameters
13.6.4.7 Add Third-Party Libraries
13.6.4.8 Fetching PySpark Output
13.6.5 Example Notebook: Develop PySpark jobs locally - from local to remote workflows
13.6.6 Example Notebook: Using the ADB with PySpark
14.4.3.1 Connect
14.4.3.2 File Handle
14.4.3.3 URL
14.4.4 PyArrow
14.4.4.1 Connect
14.4.4.2 Filesystem
14.5 SQL Data Management
14.5.1 Ibis
14.5.1.1 Connect
14.5.1.2 Query
14.5.1.3 Close a Connection
14.5.2 Impala
14.5.2.1 Connect
14.5.2.2 Create a Table
14.5.2.3 Query
14.5.2.4 Drop a Table
14.5.2.5 Close a Connection
14.5.3 PyHive
14.5.3.1 Connect
14.5.3.2 Create a Table
14.5.3.3 Query
14.5.3.4 Drop a Table
14.5.3.5 Close a Connection
15.6.4.2 YAML
15.7 Run Code in ZIP or Folder
15.7.1 ScriptRuntime
15.7.1.1 Python
15.7.1.2 YAML
15.7.2 PythonRuntime
15.7.2.1 Python
15.7.2.2 YAML
15.8 Working with OCI Data Science Jobs Using CLI
15.8.1 Prerequisite
15.8.2 Running a Pre-Defined Job
15.8.3 Delete Job or Job Run
15.8.4 Cancel Job Run
15.8.5 Cancel Distributed Training Job
15.9 Monitoring With CLI
15.9.1 watch
16.3.1.2.2 With the Wallet File
16.3.2 Load Credentials
16.3.2.1 Load
16.3.2.1.1 Using a with Statement
16.3.2.1.2 Without Using a with Statement
16.3.2.2 Examples
16.3.2.2.1 Using a with Statement
16.3.2.2.2 Export to Environment Variables Using a with Statement
16.3.2.2.3 Wallet File Location
16.4 Big Data Service
16.4.1 Save Credentials
16.4.1.1 BDSSecretKeeper
16.4.1.1.1 Save
16.4.1.2 Examples
16.4.1.2.1 With the Keytab and kerb5 Config Files
16.4.1.2.2 Without the Keytab and kerb5 Config Files
16.4.2 Load Credentials
16.4.2.1 Load
16.4.2.1.1 Using a with Statement
16.4.2.1.2 Without Using a with Statement
16.4.2.2 Examples
16.4.2.2.1 Using a with Statement
16.4.2.2.2 Without Using a with Statement
16.5 MySQL
16.5.1 Save Credentials
16.5.1.1 MySQLDBSecretKeeper
16.5.1.1.1 Save
16.5.1.2 Examples
16.5.1.2.1 Save Credentials
16.5.1.2.2 Save as a YAML File
16.5.2 Load Credentials
16.5.2.1 Load
16.5.2.1.1 Using a with Statement
16.5.2.1.2 Without Using a with Statement
16.5.2.2 Examples
16.5.2.2.1 Using a with Statement
16.5.2.2.2 Export the Environment Variables Using a with Statement
16.6 Oracle Database
16.6.1 Save Credentials
16.6.1.1 OracleDBSecretKeeper
16.6.1.2 Save
16.6.1.3 Examples
16.6.1.3.1 Save Credentials
16.6.1.3.2 Save as a YAML File
16.6.2 Load Credentials
16.6.2.1 Load
16.6.2.1.1 Using a with Statement
16.6.2.1.2 Without Using a with Statement
16.6.2.2 Examples
16.6.2.2.1 Using a with Statement
16.6.2.2.2 Export the Environment Variable Using a with Statement
17.1.1 Subpackages
17.1.1.1 ads.automl package
17.1.1.1.1 Submodules
17.1.1.1.2 ads.automl.driver module
17.1.1.1.3 ads.automl.provider module
17.1.1.1.4 Module contents
17.1.1.2 ads.catalog package
17.1.1.2.1 Submodules
17.1.1.2.2 ads.catalog.model module
17.1.1.2.3 ads.catalog.notebook module
17.1.1.2.4 ads.catalog.project module
17.1.1.2.5 ads.catalog.summary module
17.1.1.2.6 Module contents
17.1.1.3 ads.common package
17.1.1.3.1 Submodules
17.1.1.3.2 ads.common.card_identifier module
17.1.1.3.3 ads.common.auth module
17.1.1.3.4 ads.common.data module
17.1.1.3.5 ads.common.model module
17.1.1.3.6 ads.common.model_metadata module
17.1.1.3.7 ads.common.decorator.runtime_dependency module
17.1.1.3.8 ads.common.decorator.deprecate module
17.1.1.3.9 ads.common.model_introspect module
17.1.1.3.10 ads.common.model_export_util module
17.1.1.3.11 ads.common.function.fn_util module
17.1.1.3.12 ads.common.utils module
17.1.1.3.13 Module contents
17.1.1.3.14 ads.common.model_metadata_mixin module
17.1.1.4 ads.bds package
17.1.1.4.1 Submodules
17.1.1.4.2 ads.bds.auth module
17.1.1.4.3 Module contents
17.1.1.5 ads.data_labeling package
17.1.1.5.1 Submodules
17.1.1.5.2 ads.data_labeling.interface.loader module
17.1.1.5.3 ads.data_labeling.interface.parser module
17.1.1.5.4 ads.data_labeling.interface.reader module
17.1.1.5.5 ads.data_labeling.boundingbox module
17.1.1.5.6 ads.data_labeling.constants module
17.1.1.5.7 ads.data_labeling.data_labeling_service module
17.1.1.5.8 ads.data_labeling.metadata module
17.1.1.5.9 ads.data_labeling.ner module
17.1.1.5.10 ads.data_labeling.record module
17.1.1.5.11 ads.data_labeling.mixin.data_labeling module
17.1.1.5.12 ads.data_labeling.parser.export_metadata_parser module
17.1.1.5.13 ads.data_labeling.parser.export_record_parser module
17.1.1.5.14 ads.data_labeling.reader.dataset_reader module
17.1.1.5.15 ads.data_labeling.reader.jsonl_reader module
17.1.1.5.16 ads.data_labeling.reader.metadata_reader module
17.1.1.5.17 ads.data_labeling.reader.record_reader module
17.1.1.5.18 ads.data_labeling.visualizer.image_visualizer module
17.1.1.5.19 ads.data_labeling.visualizer.text_visualizer module
17.1.1.5.20 Module contents
17.1.1.6 ads.database package
17.1.1.6.1 Subpackages
17.1.1.6.2 Submodules
17.1.1.6.3 ads.database.connection module
17.1.1.6.4 Module contents
17.1.1.7 ads.dataflow package
17.1.1.7.1 Submodules
17.1.1.7.2 ads.dataflow.dataflow module
17.1.1.7.3 ads.dataflow.dataflowsummary module
17.1.1.7.4 Module contents
17.1.1.8 ads.dataset package
17.1.1.8.1 Submodules
17.1.1.8.2 ads.dataset.classification_dataset module
17.1.1.8.3 ads.dataset.correlation module
17.1.1.8.4 ads.dataset.correlation_plot module
17.1.1.8.5 ads.dataset.dask_series module
17.1.1.8.6 ads.dataset.dataframe_transformer module
17.1.1.8.7 ads.dataset.dataset module
17.1.1.8.8 ads.dataset.dataset_browser module
17.1.1.8.9 ads.dataset.dataset_with_target module
17.1.1.8.10 ads.dataset.exception module
17.1.1.8.11 ads.dataset.factory module
17.1.1.8.12 ads.dataset.feature_engineering_transformer module
17.1.1.8.13 ads.dataset.feature_selection module
17.1.1.8.14 ads.dataset.forecasting_dataset module
17.1.1.8.15 ads.dataset.helper module
17.1.1.8.16 ads.dataset.label_encoder module
17.1.1.8.17 ads.dataset.pipeline module
17.1.1.8.18 ads.dataset.plot module
17.1.1.8.19 ads.dataset.progress module
17.1.1.8.20 ads.dataset.recommendation module
17.1.1.8.21 ads.dataset.recommendation_transformer module
17.1.1.8.22 ads.dataset.regression_dataset module
17.1.1.8.23 ads.dataset.sampled_dataset module
17.1.1.8.24 ads.dataset.target module
17.1.1.8.25 ads.dataset.timeseries module
17.1.1.8.26 Module contents
17.1.1.9 ads.evaluations package
17.1.1.9.1 Submodules
17.1.1.9.2 ads.evaluations.evaluation_plot module
17.1.1.9.3 ads.evaluations.evaluator module
17.1.1.9.4 ads.evaluations.statistical_metrics module
17.1.1.9.5 Module contents
17.1.1.10 ads.explanations package
17.1.1.10.1 Submodules
17.1.1.10.2 ads.explanations.base_explainer module
17.1.1.10.3 ads.explanations.explainer module
17.1.1.10.4 ads.explanations.mlx_global_explainer module
17.1.1.10.5 ads.explanations.mlx_interface module
17.1.1.10.6 ads.explanations.mlx_local_explainer module
17.1.1.10.7 ads.explanations.mlx_whatif_explainer module
17.1.1.10.8 Module contents
17.1.1.11 ads.feature_engineering package
17.1.1.11.1 Submodules
17.1.1.11.2 ads.feature_engineering.exceptions module
17.1.1.11.3 ads.feature_engineering.feature_type_manager module
17.1.1.11.4 ads.feature_engineering.accessor.dataframe_accessor module
17.1.1.11.5 ads.feature_engineering.accessor.series_accessor module
17.1.1.11.6 ads.feature_engineering.accessor.mixin.correlation module
17.1.1.11.7 ads.feature_engineering.accessor.mixin.eda_mixin module
17.1.1.11.8 ads.feature_engineering.accessor.mixin.eda_mixin_series module
17.1.1.11.9 ads.feature_engineering.accessor.mixin.feature_types_mixin module
17.1.1.11.10 ads.feature_engineering.adsstring.common_regex_mixin module
17.1.1.11.11 ads.feature_engineering.adsstring.oci_language module
17.1.1.11.12 ads.feature_engineering.adsstring.string module
17.1.1.11.13 ads.feature_engineering.feature_type.address module
17.1.1.11.14 ads.feature_engineering.feature_type.base module
17.1.1.11.15 ads.feature_engineering.feature_type.boolean module
17.1.1.11.16 ads.feature_engineering.feature_type.category module
17.1.1.11.17 ads.feature_engineering.feature_type.constant module
17.1.1.11.18 ads.feature_engineering.feature_type.continuous module
17.1.1.11.19 ads.feature_engineering.feature_type.creditcard module
17.1.1.11.20 ads.feature_engineering.feature_type.datetime module
17.1.1.11.21 ads.feature_engineering.feature_type.discrete module
17.1.1.11.22 ads.feature_engineering.feature_type.document module
17.1.1.11.23 ads.feature_engineering.feature_type.gis module
17.1.1.11.24 ads.feature_engineering.feature_type.integer module
17.1.1.11.25 ads.feature_engineering.feature_type.ip_address module
17.1.1.11.26 ads.feature_engineering.feature_type.ip_address_v4 module
17.1.1.11.27 ads.feature_engineering.feature_type.ip_address_v6 module
17.1.1.11.28 ads.feature_engineering.feature_type.lat_long module
17.1.1.11.29 ads.feature_engineering.feature_type.object module
17.1.1.11.30 ads.feature_engineering.feature_type.ordinal module
17.1.1.11.31 ads.feature_engineering.feature_type.phone_number module
17.1.1.11.32 ads.feature_engineering.feature_type.string module
17.1.1.11.33 ads.feature_engineering.feature_type.text module
17.1.1.11.34 ads.feature_engineering.feature_type.unknown module
17.1.1.11.35 ads.feature_engineering.feature_type.zip_code module
17.1.1.11.36 ads.feature_engineering.feature_type.handler.feature_validator module
17.1.1.11.37 ads.feature_engineering.feature_type.handler.feature_warning module
17.1.1.11.38 ads.feature_engineering.feature_type.handler.warnings module
17.1.1.11.39 Module contents
17.1.1.12 ads.hpo package
17.1.1.12.1 Submodules
17.1.1.12.2 ads.hpo.distributions module
17.1.1.12.3 ads.hpo.search_cv module
17.1.1.12.4 ads.hpo.stopping_criterion
17.1.1.12.5 Module contents
17.1.1.13 ads.jobs package
17.1.1.13.1 Submodules
17.1.1.13.2 ads.jobs.ads_job module
17.1.1.13.3 ads.jobs.builders.runtimes.python_runtime module
17.1.1.13.4 ads.jobs.builders.infrastructure.dataflow module
17.1.1.13.5 ads.jobs.builders.infrastructure.dsc_job module
17.1.1.13.6 Module contents
17.1.1.14 ads.model.framework other package
17.1.1.14.1 Submodules
17.1.1.14.2 ads.model.artifact module
17.1.1.14.3 ads.model.generic_model module
17.1.1.14.4 ads.model.model_properties module . . . . . . . . . . . . . . . . . . . . 760
17.1.1.14.5 ads.model.runtime.runtime_info module . . . . . . . . . . . . . . . . . . 761
17.1.1.14.6 ads.model.extractor.model_info_extractor_factory module . . . . . . . . . 762
17.1.1.14.7 ads.model.extractor.model_artifact module . . . . . . . . . . . . . . . . . 763
17.1.1.14.8 ads.model.extractor.automl_extractor module . . . . . . . . . . . . . . . 763
17.1.1.14.9 ads.model.extractor.xgboost_extractor module . . . . . . . . . . . . . . . 764
17.1.1.14.10 ads.model.extractor.lightgbm_extractor module . . . . . . . . . . . . . . 765
17.1.1.14.11 ads.model.extractor.model_info_extractor module . . . . . . . . . . . . . 766
17.1.1.14.12 ads.model.extractor.sklearn_extractor module . . . . . . . . . . . . . . . 767
17.1.1.14.13 ads.model.extractor.keras_extractor module . . . . . . . . . . . . . . . . 768
17.1.1.14.14 ads.model.extractor.tensorflow_extractor module . . . . . . . . . . . . . 769
17.1.1.14.15 ads.model.extractor.pytorch_extractor module . . . . . . . . . . . . . . . 770
17.1.1.14.16 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
17.1.1.15 ads.model.deployment package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
17.1.1.15.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
17.1.1.15.2 ads.model.deployment.model_deployer module . . . . . . . . . . . . . . 772
17.1.1.15.3 ads.model.deployment.model_deployment module . . . . . . . . . . . . . 776
17.1.1.15.4 ads.model.deployment.model_deployment_properties module . . . . . . . 780
17.1.1.15.5 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
17.1.1.16 ads.model.framework package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
17.1.1.16.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
17.1.1.16.2 ads.model.framework.automl_model module . . . . . . . . . . . . . . . . 785
17.1.1.16.3 ads.model.framework.lightgbm_model module . . . . . . . . . . . . . . 789
17.1.1.16.4 ads.model.framework.pytorch_model module . . . . . . . . . . . . . . . 794
17.1.1.16.5 ads.model.framework.sklearn_model module . . . . . . . . . . . . . . . 799
17.1.1.16.6 ads.model.framework.tensorflow_model module . . . . . . . . . . . . . . 804
17.1.1.16.7 ads.model.framework.xgboost_model module . . . . . . . . . . . . . . . 809
17.1.1.16.8 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
17.1.1.17 ads.model.runtime package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
17.1.1.17.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
17.1.1.17.2 ads.model.runtime.env_info module . . . . . . . . . . . . . . . . . . . . 814
17.1.1.17.3 ads.model.runtime.model_deployment_details module . . . . . . . . . . 816
17.1.1.17.4 ads.model.runtime.model_provenance_details module . . . . . . . . . . . 816
17.1.1.17.5 ads.model.runtime.runtime_info module . . . . . . . . . . . . . . . . . . 817
17.1.1.17.6 ads.model.runtime.utils module . . . . . . . . . . . . . . . . . . . . . . . 817
17.1.1.17.7 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.18 ads.oracledb package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.18.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.18.2 ads.oracledb.oracle_db module . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.19 ads.secrets package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.19.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.19.2 ads.secrets.secrets module . . . . . . . . . . . . . . . . . . . . . . . . . 818
17.1.1.19.3 ads.secrets.adb module . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
17.1.1.19.4 ads.secrets.mysqldb module . . . . . . . . . . . . . . . . . . . . . . . . . 826
17.1.1.19.5 ads.secrets.oracledb module . . . . . . . . . . . . . . . . . . . . . . . . . 828
17.1.1.19.6 ads.secrets.big_data_service module . . . . . . . . . . . . . . . . . . . . 830
17.1.1.19.7 ads.secrets.auth_token module . . . . . . . . . . . . . . . . . . . . . . . 834
17.1.1.19.8 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
17.1.1.20 ads.text_dataset package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
17.1.1.20.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
17.1.1.20.2 ads.text_dataset.backends module . . . . . . . . . . . . . . . . . . . . . 836
17.1.1.20.3 ads.text_dataset.dataset module . . . . . . . . . . . . . . . . . . . . . . . 838
17.1.1.20.4 ads.text_dataset.extractor module . . . . . . . . . . . . . . . . . . . . . . 843
17.1.1.20.5 ads.text_dataset.options module . . . . . . . . . . . . . . . . . . . . . . 845
17.1.1.20.6 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
17.1.1.21 ads.vault package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
17.1.1.21.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
17.1.1.21.2 ads.vault module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
17.1.1.21.3 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
17.1.2 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
17.1.3 ads.config module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
17.1.4 Module contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
19 Examples 853
19.1 Load data from Object Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
19.2 Load data from Autonomous DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
19.3 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854
20 Contributing 855
21 Security 857
22 License 859
Index 865
CHAPTER
ONE
RELEASE NOTES
1.1 2.6.8
1.2 2.6.7
1.3 2.6.6
1.4 2.6.5
1.5 2.6.4
1.6 2.6.3
1.7 2.6.2
1.8 2.6.1
1.9 2.5.10
• Fixed a bug in import ads.jobs so that users are notified to install the ADS optional dependencies.
• Fixed a bug in the generated score.py file, where a Pandas dataframe's dtypes changed when deserializing; the
dtypes can now be recovered from the input schema.
• Updated requirements to oci>=2.59.0.
1.10 2.5.9
1.11 2.5.8
1.12 2.5.7
1.13 2.5.6
1.14 2.5.5
1.15 2.5.4
1.16 2.5.3
1.17 2.5.2
1.18 2.5.0
1.19 2.4.2
1.20 2.4.1
1.21 2.4.0
1.22 2.3.4
1.23 2.3.3
1.24 2.3.1
1.25 2.2.1
• A full distribution of this release of ADS is found in the General Machine Learning for CPU and GPU environ-
ments. The Classic environments include the previous release of ADS.
• A distribution of ADS without AutoML and MLX is found in the remaining environments.
• DatasetFactory can now download files first before opening them in memory using the .download() method.
• Added support to archive files in creating Data Flow applications and runs.
• Support was added for loading Avro format data into ADS.
• Changed model serialization to use ONNX by default when possible on supported models.
• Added ADSTuner, which is a framework- and model-agnostic hyperparameter optimizer; see the adstuner.ipynb
notebook for examples of how to use this feature.
• Corrected the up_sample() method in get_recommendations() so that it does not fail when all features are
categorical. Up-sampling is possible for datasets containing continuous and categorical features.
• Resolved issues with serializing ndarray objects into JSON.
• A table of all of the ADS notebook examples can be found in our service documentation: Oracle Cloud Infras-
tructure Data Science
• Changed set_documentation_mode to false by default.
• Added unit-tests related to the dataset helper.
• Fixed the _check_object_exists to handle situations where the object storage bucket has more than 1000 objects.
• Added option overwrite_script in the create_app() method to allow a user to override a pre-existing file.
• Added support for newer fsspec versions.
• Added support for the C library Snappy.
• Fixed an issue with uploading model provenance data due to an inconsistency with the OCI interface.
• Resolved issue with multiple versions of Cryptography being installed when installing fbprophet.
AutoML is upgraded to AutoML v0.5.2 and the changes include:
• AutoML is now distributed in the General Machine Learning and Data Exploration conda environments.
• Support for ONNX. AutoML models can now be serialized using ONNX by calling the to_onnx() API on the
AutoML estimator.
• Pre-processing has been overhauled to use sklearn pipelines to allow serialization using ONNX. Numerical,
categorical, and text columns are supported for ONNX serialization. Datetime and time series columns are not
supported.
• Torch-based deep learning models, TorchMLPClassifier and TorchMLPRegressor, have been added.
• GPU support for XGBoost and torch-based models has been added. This is disabled by default and can be
enabled by passing 'gpu_id': 'auto' in engine_opts in the constructor. ONNX serialization for GPUs
has not been tested.
• Adaptive sampling's learning curve has been smoothed. This allows adaptive sampling to converge faster on
some datasets.
• Improvements to ranking performance in feature selection were added. Feature selection is now much faster on
large datasets.
• The default execution engine for AutoML has been switched to Dask. You can still use Python multiprocessing
by passing engine='local', engine_opts={'n_jobs': -1} to init().
• GaussianNB has been enabled in the interface by default.
• The AdaBoostClassifier has been disabled in the pipeline-interface by default. The ONNX converter for
AdaBoost should not be used.
• The issue ValueError: Found unknown categories during transform has been fixed.
• You can manually specify a hyperparameter search space to AutoML. A new parameter was added to the pipeline.
This allows you to freeze some hyperparameters or to expose further ones for tuning.
• New API: Refit an AutoML pipeline to another dataset. This is primarily used to handle updated training data,
where you train the pipeline once, and refit it on newer data.
• AutoML no longer closes a user-specified Dask cluster.
• AutoML properly cleans up any existing futures on the Dask cluster at the end of fit.
MLX is upgraded to MLX v1.0.16; the changes include:
• MLX is now distributed in the General Machine Learning conda environments.
• Updated the explanation descriptions to use a base64 representation of the static plots. This obviates the need
for creating a mlx_static directory.
• Replaced the boolean indexing in slicing Pandas dataframes with integer indexing. After updating to Pandas >=
1.1.0 the boolean indexing caused some issues. Integer indexing addresses these issues.
• Fixed MLX-related import warnings.
• Corrected an issue with ALE when the target values are strings.
• Support was added to use resource principles as an authentication mechanism for ADS.
• Support was added to MLX for an additional model explanation diagnostic, Accumulated Local Effects (ALEs).
• Support was added to MLX for “What-if” scenarios in model explainability.
• Improvements were made to the correlation heatmap calculations in show_in_notebook().
• Improvements were made to the model artifact.
The following bugs were fixed:
• Data Flow applications inherit the compartment assignment of the client. Runs inherit from applications by
default. Compartment OCIDs can also be specified independently at the client, application, and run levels.
• The Data Flow log link for logs pulled from an application loaded into the notebook session is fixed.
• Progress bars now complete fully (in ADSModel.prepare() and prepare_generic_model()).
• BaselineModel is now significantly faster and can be opted out of.
MLX is upgraded to MLX v1.0.10; the changes include:
• Added support to specify the mlx_static root path (used for ALE summary).
• Added support for making mlx_static directory hidden (for example, <path>/.mlx_static/).
• Fixed issue with the boolean features in ALE.
• Support for datetime columns. AutoML should automatically infer datetime columns based on the Pandas
dataframe, and perform feature engineering on them. This can also be forced by using the col_types argument
in pipeline.fit(). The supported types are: ['categorical', 'numerical', 'datetime']
MLX is upgraded to MLX v1.0.7; the changes include:
• Updated the feature distributions in the PDP/ICE plots (performance improvement).
• All distributions are now shown as PMFs. Categorical features show the category frequency and continuous
features are computed using a NumPy histogram (with ‘auto’). They are also separate sub-plots, which are
interactive.
• Classification PDP: The y-axis for continuous features is now auto-scaled (not fixed to 0-1).
• 1-feature PDP/ICE: The x-axis for continuous features now shows the entire feature distribution, whereas the plot
may show a subset depending on the partial_range parameter (for example, partial_range=[0.2, 0.8]
computes the PDP between the 20th and 80th percentile. The plot now shows the full distribution on the x-axis,
but the line charts are only drawn between the specified percentile ranges).
• 2-feature PDP: The plot x and y axes are now auto-set to match the partial_range specified by the user. This
ensures that the heatmap fills the entire plot by default. However, the entire feature distribution can be viewed
by zooming out or clicking Autoscale in plotly.
• Support for plotting scatter plots using WebGL (show_in_notebook(..., use_webgl=True)) was added.
• The side issues that were causing the MLX Visualization Omitted warnings in JupyterLab were fixed.
• ADS integration with the Oracle Cloud Infrastructure Data Flow service provides a more efficient and convenient
way to launch a Spark application and run Spark jobs.
• show_in_notebook() has had “head” removed from accordion and is replaced with dataset “warnings”.
• get_recommendations() is deprecated and replaced with suggest_recommendations(), which returns a
Pandas dataframe with all the recommendations and suggested code to implement each action.
• A progress indication of Autonomous Data Warehouse reads has been added.
AutoML updated to version 0.4.1 from 0.3.1:
• More consistent handling of stratification and random state.
• Bug-fix for LightGBM and XGBoost crashing on AMD shapes was implemented.
• Unified proxy models were implemented across all stages of the AutoML pipeline, ensuring that leaderboard
rankings are consistent.
• Removed the visual option from the interface.
• The default tuning metric for both binary and multi-class classification has been changed to neg_log_loss.
• Bug-fix in AutoML XGBoost, where the predicted probabilities were sometimes NaN, was implemented.
• Fixed several corner case issues in Hyperparameter Optimization.
MLX updated to version 1.0.3 from 1.0.0:
• Added support for specifying the 'average' parameter in sklearn metrics by <metric>_<average>, for example
F1_avg.
• Fixed an issue with the detailed scatter plot visualizations and cutoff feature/axis names.
• Fixed an issue with the balanced sampling in the Global Feature Permutation Importance explainer.
• Updated the supported scoring metrics in MLX. The PermutationImportance explainer now supports a large
number of classification and regression metrics. Also, many of the metrics’ names were changed.
• Updated LIME and PermutationImportance explainer descriptions.
• Fixed an issue where sklearn.pipeline wasn’t imported.
• Fixed deprecated asscalar warnings.
ds = DatasetFactory.open(
connection_string,
format = 'sql',
table = """
SELECT *
FROM sh.times
WHERE rownum <= 30
"""
)
Example:
The output shows two lines, one for the total CPU percentage used by all the workers, and one for total memory used.
Dask Upgrade
Dask is updated to version 2.10.1 with support for Oracle Cloud Infrastructure Object Storage. The 2.10.1 version
provides better performance than the older version.
TWO
QUICK START
• Install
• Read and Write to Object Storage, Databases and other OCI Resources
• OCI serverless Spark - Data Flow
• Evaluate Trained Models
• Register and Deploy Models
• Store and Retrieve your data source credentials
• Connect to existing OCI Big Data Service
THREE
Prerequisites
• Linux/Mac (Intel CPU)
• For Mac on M series - Experimental.
• For Windows: Use Windows Subsystem for Linux (WSL)
• python >=3.7, <3.10
The ads CLI provides a command line interface to Jobs API related features. Set up your development environment, build
docker images compliant with Notebook Sessions and Data Science Jobs, build and publish conda packs locally, start
distributed training, and more.
Installation
Install ADS and enable the CLI; for example, python3 -m pip install "oracle-ads[opctl]" (assuming the opctl module described below provides the CLI tooling).
Tip
The ads opctl subcommand lets you set up your local development environment for Data Science Jobs. More information
can be found by running ads opctl -h
ADS is installed in the Data Science conda environments. Upgrade your existing oracle-ads package by running, for example, python3 -m pip install oracle-ads --upgrade.
To work with gradient boosting models, install the boosted module. This module includes XGBoost and LightGBM
model classes.
For big data use cases using Oracle Big Data Service (BDS), install the bds module. It includes the following libraries:
ibis-framework[impala], hdfs[kerberos] and sqlalchemy.
To work with a broad set of data formats (for example, Excel, Avro, etc.) install the data module. It includes the
following libraries: fastavro, openpyxl, pandavro, asteval, datefinder, htmllistparse, and sqlalchemy.
To work with geospatial data install the geo module. It includes the geopandas and libraries from the viz module.
Install the notebook module to use ADS within the Oracle Cloud Infrastructure Data Science service Notebook Ses-
sion. This module installs ipywidgets and ipython libraries.
To work with ONNX-compatible run times and libraries designed to maximize performance and model portability, in-
stall the onnx module. It includes the following libraries, onnx, onnxruntime, onnxmltools, skl2onnx, xgboost, lightgbm
and libraries from the viz module.
For infrastructure tasks, install the opctl module. It includes the following libraries, oci-cli, docker, conda-pack,
nbconvert, nbformat, and inflection.
For hyperparameter optimization tasks install the optuna module. It includes the optuna and libraries from the viz
module.
Install the tensorflow module to include tensorflow and libraries from the viz module.
For text related tasks, install the text module. This will include the wordcloud, spacy libraries.
Install the torch module to include pytorch and libraries from the viz module.
Install the viz module to include libraries for visualization tasks. Some of the key packages are bokeh, folium, seaborn
and related packages.
Note
Multiple extra dependencies can be installed together; for example, python3 -m pip install "oracle-ads[notebook,viz]" (assuming each module above corresponds to a pip extra of the same name).
FOUR
AUTHENTICATION
When you are working within a notebook session, you are operating as the datascience Linux user. This user does
not have an OCI Identity and Access Management (IAM) identity, so it has no access to the Oracle Cloud Infrastructure
API. Oracle Cloud Infrastructure resources include Data Science projects, models, jobs, model deployment, and the
resources of other OCI services, such as Object Storage, Functions, Vault, Data Flow, and so on. To access these
resources, you must use one of the two provided authentication approaches:
Prerequisite
• You are operating within an OCI service that has resource principal based authentication configured
• You have set up the required policies allowing the resource type within which you are operating to use or manage
the target OCI resources.
This is the generally preferred way to authenticate with an OCI service. A resource principal is a feature of IAM that
enables resources to be authorized principal actors that can perform actions on service resources. Each resource has its
own identity, and it authenticates using the certificates that are added to it. These certificates are automatically created,
assigned to resources, and rotated avoiding the need for you to upload credentials to your notebook session.
Data Science enables you to authenticate using your notebook session’s resource principal to access other OCI re-
sources. When compared to using the OCI configuration and key files approach, using resource principals provides a
more secure and easy way to authenticate to the OCI APIs.
You can choose to use the resource principal to authenticate while using the Accelerated Data Science (ADS) SDK by
running ads.set_auth(auth='resource_principal') in a notebook cell. For example:
import ads
import os
from ads.catalog.project import ProjectCatalog  # assumed import location for ProjectCatalog

ads.set_auth(auth='resource_principal')
compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID']
pc = ProjectCatalog(compartment_id=compartment_id)
pc.list_projects()
Prerequisite
• You have set up API keys as per the instructions here
Use the API key setup when you are working from a local workstation or on a platform that does not support resource
principals.
This is the default method of authentication. You can also authenticate as your own personal IAM user by creating
or uploading OCI configuration and API key files inside your notebook session environment. The OCI configuration
file contains the necessary credentials to authenticate your user against the model catalog and other OCI services like
Object Storage. The example notebook, api_keys.ipynb demonstrates how to create these files.
You can follow the steps in api_keys.ipynb (https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/ads_notebooks/api_keys.ipynb) for step-by-step instructions on setting up API keys.
Note: If you already have an OCI configuration file (config) and associated keys, you can upload them directly to
the /home/datascience/.oci directory using the JupyterLab Upload Files or the drag-and-drop option.
The default authentication that is used by ADS is set with the set_auth() method. However, each relevant ADS
method has an optional parameter to specify the authentication method to use. The most common use case for this is
when you have different permissions in different API keys or there are differences between the permissions granted in
the resource principals and your API keys.
Most ADS methods do not require a signer to be explicitly given. By default, ADS uses the API keys to sign requests
to OCI resources. The set_auth() method is used to explicitly set a default signing method. This method accepts
one of two strings "api_key" or "resource_principal".
The ~/.oci/config configuration file allows multiple configurations to be stored in the same file. The set_auth()
method takes an oci_config_location parameter that specifies the location of the configuration file; the default
is "~/.oci/config". Each configuration is called a profile, and the default profile is DEFAULT. The set_auth()
method also takes a profile parameter that specifies which profile in the ~/.oci/config configuration file to use. In
this context, the profile parameter is only used when API keys are being used. If no value for profile is specified,
then the DEFAULT profile section is used.
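For example, a minimal sketch of selecting a non-default profile for API key authentication (the profile name below is a placeholder):

import ads

# Use the API keys from a specific profile of an OCI configuration file.
ads.set_auth(
    auth="api_key",
    oci_config_location="~/.oci/config",
    profile="<my_profile>",
)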
The auth module has helper functions that return a signer which is used for authentication. The api_keys() method
returns a signer that uses the API keys in the .oci configuration directory. There are optional parameters to specify
the location of the API keys and the profile section. The resource_principal() method returns a signer that uses
resource principals. The method default_signer() returns either a signer for API Keys or resource principals
depending on the defaults that have been set. The set_auth() method determines which signer type is the default. If
nothing is set then API keys are the default.
# Example 2: Create Object Storage client with timeout set to 6000 using resource principal authentication.
# Example 3: Create Object Storage client with timeout set to 6000 using API Key authentication.
oc.OCIClientFactory(**auth).object_storage
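A minimal sketch of these two examples, assuming the ads.common.auth helpers accept a client_kwargs dictionary and that OCIClientFactory is imported from ads.common.oci_client:

import ads
from ads.common.auth import api_keys, resource_principal
from ads.common import oci_client as oc

ads.set_auth(auth="api_key")  # the default signer uses API keys

# Example 2: resource principal signer with the client timeout set to 6000.
os_auth = resource_principal(client_kwargs={"timeout": 6000})
oc.OCIClientFactory(**os_auth).object_storage

# Example 3: API key signer with the client timeout set to 6000.
auth = api_keys(client_kwargs={"timeout": 6000})
oc.OCIClientFactory(**auth).object_storage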
In this example, the default authentication uses API keys specified with the set_auth method. However, since
os_auth is specified to use resource principals, the notebook session uses the resource principal to access OCI Object
Storage.
CHAPTER
FIVE
CLI CONFIGURATION
Prerequisite
• You have completed ADS CLI installation
Set up default values for different options when running OCI Data Science Jobs or OCI Data Flow. By setting
defaults, you can avoid inputting the compartment OCID, project OCID, and so on.
To set up the configuration, run ads opctl configure.
This prompts you to set up default ADS CLI configurations for each OCI profile defined in your OCI config. By
default, all the files are generated in the ~/.ads_ops folder.
~/.ads_ops/config.ini will contain OCI profile defaults and conda pack related information. For example:
[OCI]
oci_config = ~/.oci/config
oci_profile = ANOTHERPROF
[CONDA]
conda_pack_folder = </local/path/for/saving/condapack>
conda_pack_os_prefix = oci://my-bucket@mynamespace/conda_environments/
~/.ads_ops/ml_job_config.ini will contain defaults for running Data Science Job. Defaults are set for each
profile listed in your oci config file. Here is a sample -
[DEFAULT]
compartment_id = oci.xxxx.<compartment_ocid>
project_id = oci.xxxx.<project_ocid>
subnet_id = oci.xxxx.<subnet-ocid>
log_group_id = oci.xxxx.<log_group_ocid>
log_id = oci.xxxx.<log_ocid>
shape_name = VM.Standard2.2
block_storage_size_in_GBs = 100
[ANOTHERPROF]
compartment_id = oci.xxxx.<compartment_ocid>
project_id = oci.xxxx.<project_ocid>
subnet_id = oci.xxxx.<subnet-ocid>
shape_name = VM.Standard2.1
log_group_id = ocid1.loggroup.oc1.xxx.xxxxx
~/.ads_ops/dataflow_config.ini will contain defaults for running OCI Data Flow applications. Defaults are set for
each profile listed in your oci config file. Here is a sample -
[MYTENANCYPROF]
compartment_id = oci.xxxx.<compartment_ocid>
driver_shape = VM.Standard2.1
executor_shape = VM.Standard2.1
logs_bucket_uri = oci://mybucket@mytenancy/dataflow/logs
script_bucket = oci://mybucket@mytenancy/dataflow/mycode/
num_executors = 3
spark_version = 3.0.2
archive_bucket = oci://mybucket@mytenancy/dataflow/archive
SIX
Prerequisite
• You have completed ADS CLI installation
• You have completed Configuration
Set up your workstation for developing and testing your code locally before you submit it as an OCI Data Science
Job. This section guides you on how to set up an environment for -
• Building an OCI Data Science compatible conda environments on your workstation or CICD pipeline and pub-
lishing to object storage
• Developing and testing code with a conda environment that is compatible with OCI Data Science Notebooks and
OCI Data Science Jobs
• Developing and testing code for running Bring Your Own Container (BYOC) jobs.
Note
• In this version you cannot directly access the Service provided conda environments from ADS CLI, but you can
publish a service provided conda pack from an OCI Data Science Notebook session to your object storage bucket
and then use the CLI to access the published version.
To set up an environment that matches OCI Data Science, a container image must be built. With a Data Science
compatible container image you can do the following -
• Build and Publish custom conda packs that can be used within Data Science environment. Enable building conda
packs in your CICD pipeline.
• Install an existing conda pack that was published from an OCI Data Science Notebook.
• Develop code locally against the same conda pack that will be used within an OCI Data Science image.
Prerequisites
1. Install docker on your workstation
2. Internet connection to pull dependencies
3. If the access is restricted through a proxy -
• Set up the proxy environment variables http_proxy, https_proxy, and no_proxy
• For a Linux workstation - update the proxy variables in the docker.service file and restart docker
• For Mac - update the proxy settings in Docker Desktop
Visual Studio Code can automatically run the code that you are developing inside a preconfigured container. An OCI
Data Science compatible container on your workstation can be used as a development environment. Visual Studio
Code can automatically launch the container using the information from devcontainer.json, which is created in the
code directory. Automatically generate this file and further customize it with plugins. For more details see
Prerequisites
1. ADS CLI is configured
2. Install Visual Studio Code
3. Build Development Container Image
4. Install Visual Studio Code extension for Remote Development
The generated .devcontainer.json includes the python extension for Visual Studio Code by default.
Open the source_folder using Visual Studio Code. More details on running the workspace within the container can
be found here
Conda packs provide runtime dependencies and a python runtime for your code. The conda packs can be built inside
an OCI Data Science Notebook session or you can build it locally on your workstation. ads opctl cli provides
a way to setup a development environment to build and use the conda packs. You can push the conda packs that you
build locally to Object Storage and use them in Jobs, Notebooks, Pipelines, or in Model Deployments.
Prerequisites
1. Build a local OCI Data Science Job compatible docker image
2. Connect to Object Storage through the Internet
3. Setup conda pack bucket, namespace, and authentication information using ads opctl configure. Refer to
configuration instructions.
Note
• In this version you cannot directly access the Service provided conda environments from ADS CLI, but you can
publish a service provided conda pack from an OCI Data Science Notebook session to your object storage bucket
and then use the CLI to access the published version.
6.3.1 create
Build conda packs from your workstation using ads opctl conda create subcommand.
6.3.2 publish
Publish conda pack to the object storage bucket from your laptop or workstation. You can use this conda pack inside
OCI Data Science Jobs, OCI Data Science Notebooks and OCI Data Science Model Deployment
6.3.3 install
Install conda pack using its URI. The conda pack can be used inside the docker image that you built. Use Visual Studio
Code that is configured with the conda pack to help you test your code locally before submitting to OCI.
OCI Data Science Jobs allows you to use custom container images. ads cli can help you test a container image locally,
publish it, and run it in OCI with a uniform interface.
Running an image locally can be conveniently achieved with “docker run” directly. “ads opctl” commands are provided
here only to be symmetric to remote runs on OCI ML Job. The command looks like
ads opctl run -i <image-name> -e <docker entrypoint> -c "docker cmd" --env-var ENV_NAME=value -b <backend>
The -b option can take either local (runs the container locally) or job (runs the container on OCI).
During the course of development, it is more productive to work within the container environment to iterate over the
code. You can setup your VS Code environment to use the container as your development environment as shown here
-
{
"image": "ubuntu",
"mounts": [
"source=/Users/<username>/.oci,target=/root/.oci,type=bind"
],
"extensions": [
"ms-python.python"
],
"containerEnv": {
"TEST": "test"
}
}
To run a container image with OCI Data Science Job, the image needs to be in a registry accessible by OCI Data Science
Job. “ads opctl publish-image” is a thin wrapper on “docker push”. The command looks like
The image will be pushed to the docker registry specified in ml_job_config.ini. Check the configuration for
defaults. To overwrite the registry, use -r <registry>.
To run a container on OCI Data Science, provide ml_job for -b option. Here is an example -
SEVEN
LOAD DATA
You can load data into ADS in several different ways from Oracle Cloud Infrastructure Object Storage, cx_Oracle, or
S3. Following are some examples.
Begin by loading the required libraries and modules:
import ads
import numpy as np
import pandas as pd
from ads.common.auth import default_signer
To load a dataframe from Object Storage using the API keys, you can use the following example, replacing the angle
bracketed content with the location and name of your file:
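A minimal sketch, mirroring the resource principal example below but relying on the default API key authentication:

import ads
import pandas as pd
from ads.common.auth import default_signer

ads.set_auth(auth="api_key")  # API keys are read from ~/.oci/config
bucket_name = "<bucket-name>"
file_name = "<file-name>"
namespace = "<namespace>"
df = pd.read_csv(f"oci://{bucket_name}@{namespace}/{file_name}", storage_options=default_signer())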
For a list of pandas functions to read different file formats, refer to the Pandas documentation.
To load a dataframe from Object Storage using the resource principal method, you can use the following example,
replacing the angle bracketed content with the location and name of your file:
ads.set_auth(auth='resource_principal')
bucket_name = <bucket-name>
file_name = <file-name>
namespace = <namespace>
df = pd.read_csv(f"oci://{bucket_name}@{namespace}/{file_name}", storage_options=default_
˓→signer())
To write a Pandas dataframe to object storage, provide the file name in the following format - oci://<mybucket>@<mynamespace>/<path/to/file/name>
ads.set_auth(auth='resource_principal')
bucket_name = <bucket-name>
file_name = <file-name>
namespace = <namespace>
df.to_csv(f"oci://{bucket_name}@{namespace}/{file_name}", index=False, storage_options=default_signer())

# To set the content type while writing to object storage, add the oci_additional_kwargs
# attribute to storage_options with the desired content type
storage_options = default_signer()
storage_options['oci_additional_kwargs'] = {"content_type": "application/octet-stream"}
df.to_csv(f"oci://{bucket_name}@{namespace}/{file_name}", index=False, storage_options=storage_options)
To load a dataframe from a local source, use functions from pandas directly:
df = pd.read_csv("/path/to/data.data")
When using the Oracle ADB with Python the most common representation of tabular data is a Pandas dataframe.
When you’re in a dataframe, you can perform many operations from visualization to persisting in a variety of formats.
The Pandas read_sql(...) function is a general, database independent approach that uses the SQLAlchemy - Object
Relational Mapper to arbitrate between specific database types and Pandas.
Read SQL query or database table into a dataframe.
This function is a convenience wrapper around read_sql_table and read_sql_query (for backward com-
patibility). It delegates to the specific function depending on the provided input. A SQL query is routed
to read_sql_query, while a database table name is routed to read_sql_table.
Use the Pandas ADS accessor drop-in replacement, pd.DataFrame.ads.read_sql(...), instead of using pd.
read_sql.
See how to save and retrieve credentials from OCI Vault
# read of a SQL query into a dataframe with a bind variable. Use bind variables
# rather than string substitution to avoid the SQL injection attack vector.
df = pd.DataFrame.ads.read_sql(
"""
SELECT
*
FROM
SH.SALES
WHERE
ROWNUM <= :max_rows
""",
bind_variables={
"max_rows" : 100
}
,
connection_parameters=connection_parameters,
)
connection_parameters = {
"user_name": "<username>",
"password": "<password>",
"service_name": "<service_name>",
"host": "<database hostname>",
"port": "<database port number>""
}
import pandas as pd
import ads
# read of a SQL query into a dataframe with a bind variable. Use bind variables
# rather than string substitution to avoid the SQL injection attack vector.
df = pd.DataFrame.ads.read_sql(
    """
    SELECT
        *
    FROM
        SH.SALES
    WHERE
        ROWNUM <= :max_rows
    """,
    bind_variables={"max_rows": 100},
    connection_parameters=connection_parameters,
)
7.1.3.3 Performance
If a database query returns more rows than the memory of the client permits, you have a couple of options. The simplest
is to use a larger client shape, along with increased compute performance because larger shapes come with more RAM.
If that's not an option, then you can use the pd.DataFrame.ads.read_sql mixin in chunk mode, where the result is
no longer a Pandas dataframe but an iterator over a sequence of dataframes. You could use this to read a large data set
and write it to Object Storage or a local file system with the following example:
for i, df in enumerate(pd.DataFrame.ads.read_sql(
    "SELECT * FROM SH.SALES",
    chunksize=100000,  # rows per chunk
    connection_parameters=connection_parameters,
)):
    # each df will contain up to 100000 rows (chunksize)
    # to write the data to object storage use oci://bucket@namespace/part_{i}.csv
    df.to_csv(f"part_{i}.csv")
If the data exceeds what’s practical in a notebook, then the next step is to use the Data Flow service to partition the data
across multiple nodes and handle data of any size up to the size of the cluster.
Typically, you would do this using df.to_sql. However, this uses Oracle Resource Manager to collect data and is less
efficient than code that has been optimized for a specific database.
Instead, use the Pandas ADS accessor mixin.
With a df dataframe, writing this to the database is as simple as:
df.ads.to_sql(
"MY_TABLE",
connection_parameters=connection_parameters, # Should contain wallet location if you␣
˓→are connecting to ADB
if_exists="replace"
)
The resulting data types (if the table was created by ADS as opposed to inserting into an existing table), are governed
by the following:
Pandas Oracle
bool NUMBER(1)
int16 INTEGER
int32 INTEGER
int64 INTEGER
float16 FLOAT
float32 FLOAT
float64 FLOAT
datetime64 TIMESTAMP
string VARCHAR2 (Maximum length of the actual data.)
When a table is created, the length of any VARCHAR2 column is computed from the longest string in the column. The
ORM defaults to CLOB data, which is not correct or efficient. CLOBs are stored efficiently by the database, but the C
API to query them works differently. The non-LOB columns are returned to the client through a cursor, but LOBs are
handled differently, resulting in an additional network fetch per row, per LOB column. ADS deals with this by creating
the correct data type, and setting the correct VARCHAR2 length.
7.1.4 MySQL
connection_parameters = {
    "user_name": "<username>",
    "password": "<password>",
    "host": "<database hostname>",
    "port": "<database port number>",
    # plus any other details (for example, the database name) that your MySQL connection requires
}
# read of a SQL query into a dataframe with a bind variable. Use bind variables
# rather than string substitution to avoid the SQL injection attack vector.
df = pd.DataFrame.ads.read_sql(
"""
SELECT
*
FROM
EMPLOYEE
WHERE
emp_no <= ?
""",
bind_variables=(1000,)
,
connection_parameters=connection_parameters,
engine="mysql"
)
df.ads.to_sql(
"MY_TABLE",
connection_parameters=connection_parameters,
if_exists="replace",
engine="mysql"
)
The resulting data types (if the table was created by ADS as opposed to inserting into an existing table), are governed
by the following:
Pandas MySQL
bool NUMBER(1)
int16 INTEGER
int32 INTEGER
int64 INTEGER
float16 FLOAT
float32 FLOAT
float64 FLOAT
datetime64 DATETIME (Format: %Y-%m-%d %H:%M:%S)
string VARCHAR (Maximum length of the actual data.)
connection_parameters = {
"host": "<hive hostname>",
"port": "<hive port number>",
}
connection_parameters = {
"host": "<hive hostname>",
"port": "<hive port number>",
"auth_mechanism": "PLAIN" # for connection with unsecure BDS
}
Example
connection_parameters = {
"host": "<database hostname>",
"port": "<database port number>",
}
import pandas as pd
import ads
# read of a SQL query into a dataframe with a bind variable. Use bind variables
# rather than string substitution to avoid the SQL injection attack vector.
df = pd.DataFrame.ads.read_sql(
"""
SELECT
*
FROM
EMPLOYEE
WHERE
`emp_no` <= ?
""",
bind_variables=(1000,)
,
connection_parameters=connection_parameters,
engine="hive"
)
To save the dataframe df to BDS Hive, use df.ads.to_sql API with engine="hive".
df.ads.to_sql(
"MY_TABLE",
connection_parameters=connection_parameters,
if_exists="replace",
engine="hive"
)
7.1.5.2 Partition
You can create a table with partitions, and then use the df.ads.to_sql API with engine="hive", if_exists="append"
to insert data into the table.
create_table_sql = f'''
CREATE TABLE {table_name} (col1_name datatype, ...)
partitioned by (col_name datatype, ...)
'''
df.ads.to_sql(
"MY_TABLE",
connection_parameters=connection_parameters,
if_exists="append",
engine="hive"
)
If the dataframe waiting to be uploaded has many rows, and the .to_sql() method is slow, you have other options.
The simplest is to use a larger client shape, along with increased compute performance because larger shapes come
with more RAM. If that’s not an option, then you can follow these steps:
To load a dataframe from a remote web server source, use pandas directly and specify the URL of the data:
df = pd.read_csv('https://example.com/path/to/data.csv')
To convert a Pandas dataframe to ADSDataset, pass the pandas.DataFrame object directly into the ADS
DatasetFactory.open method:
import pandas as pd
from ads.dataset.factory import DatasetFactory
# use open...
# alternative form...
ds = DatasetFactory.from_dataframe(df)
# an example using Pandas to parse data on the clipboard as a CSV and construct an ADS Dataset object
# this allows easily transferring data from an application like Microsoft Excel, Apple Numbers, etc.
ds = DatasetFactory.from_dataframe(pd.read_clipboard())
ADS supports reading files into a PyArrow dataset directly via ocifs. ocifs is installed as an ADS dependency.
import ocifs
import pyarrow.dataset as ds
from ads.common.auth import default_signer  # provides the credentials used below

bucket_name = <bucket_name>
namespace = <namespace>
path = <path>
fs = ocifs.OCIFileSystem(**default_signer())
ds = ds.dataset(f"{bucket_name}@{namespace}/{path}/", filesystem=fs)
7.2 DataSetFactory
Deprecation Note
• DataSetFactory.open is deprecated in favor of Pandas to read from file systems.
• Pandas (>1.2.1) can connect to object storage using the uri format - oci://bucket@namespace/path/to/data.
• To read from Oracle database or MySQL, see DataBase sections under Connecting to Datasources
• DataSetFactory.from_dataframe is supported to create ADSDataset class from pandas dataframe
See Connecting to Datasources for examples.
You can load data into ADS in several different ways from Oracle Cloud Infrastructure Object Storage, cx_Oracle, or
S3. Following are some examples.
import ads
import numpy as np
import pandas as pd
To open a dataset from Object Storage using the resource principal method, you can use the following example, replac-
ing the angle bracketed content with the location and name of your file:
import ads
import os
from ads.dataset.factory import DatasetFactory

ads.set_auth(auth='resource_principal')
bucket_name = <bucket-name>
file_name = <file-name>
namespace = <namespace>
storage_options = {'config': {}, 'tenancy': os.environ['TENANCY_OCID'], 'region': os.environ['NB_REGION']}
ds = DatasetFactory.open(f"oci://{bucket_name}@{namespace}/{file_name}", storage_options=storage_options)
To open a dataset from Object Storage using the Oracle Cloud Infrastructure configuration file method, include the
location of the file using this format oci://<bucket_name>@<namespace>/<file_name> and modify the optional
parameter storage_options. Insert:
• The path to your Oracle Cloud Infrastructure configuration file,
• The profile name you want to use.
For example:
ds = DatasetFactory.open("oci://<bucket_name>@<namespace>/<file_name>", storage_options␣
˓→= {
"config": "~/.oci/config",
"profile": "DEFAULT"
})
To open a dataset from a local source, use DatasetFactory.open and specify the path of the data file:
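For example, a minimal sketch with placeholder values:

from ads.dataset.factory import DatasetFactory

ds = DatasetFactory.open("/path/to/data.csv", target="<target_column>")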
To connect to Oracle Databases from Python, you use the cx_Oracle package that conforms to the Python database
API specification.
You must have the client credentials and connection information to connect to the database. The client credentials
include the wallet, which is required for all types of connections. Use these steps to work with ADB and wallet files:
1. From the Console, go to the Oracle Cloud Infrastructure ADW or ATP instance page that you want to load the
dataset from, and then click DB Connection.
2. Click Download Wallet.
3. You have to enter a password. This password is used for some ADB connections, but not the ones that are used
in the notebook.
4. Create a folder for your wallet in the notebook environment (<path_to_wallet_folder>).
5. Upload your wallet files into <path_to_wallet_folder> folder using the Jupyterlab Upload Files button.
6. Open the sqlnet.ora file from the wallet files, and then configure the METHOD_DATA to be: METHOD_DATA
= (DIRECTORY="<path_to_wallet_folder>")
7. Set the environment variable TNS_ADMIN to point to the wallet you want to use.
In this example a Python dictionary, creds, is used to store the credentials. However, it is poor security practice to
store this information in a notebook. The notebook ads-examples/ADB_working_with.ipynb gives an example of
how to store them in Block Storage.
creds = {}
creds['tns_admin'] = <path_to_wallet_folder>
creds['sid'] = <your SID>
creds['user'] = <database username>
creds['password'] = <database password>
Once your Oracle client is set up, you can use cx_Oracle directly with Pandas as in this example:
import pandas as pd
import cx_Oracle
import os
os.environ['TNS_ADMIN'] = creds['tns_admin']
with cx_Oracle.connect(creds['user'], creds['password'], creds['sid']) as ora_conn:
    df = pd.read_sql('''
        SELECT ename, dname, job, empno, hiredate, loc
        FROM emp, dept
        WHERE emp.deptno = dept.deptno
    ''', con=ora_conn)
You can also use cx_Oracle within ADS by creating a connection string:
os.environ['TNS_ADMIN'] = creds['tns_admin']
from ads.dataset.factory import DatasetFactory
uri = 'oracle+cx_oracle://' + creds['user'] + ':' + creds['password'] + '@' + creds['sid']
Oracle has two configurations of Autonomous Databases. They are the Autonomous Data Warehouse (ADW) and
the Autonomous Transaction Processing (ATP) database. Both are fully autonomous databases that scale elastically,
deliver fast query performance, and require minimal database administration.
Note: To access ADW, review the Autonomous Database configuration section. It shows you how to get the client
credentials (wallet) and set up the proper environment variable.
After you have stored the ADB username, password, and database name (SID) as variables, you can build the URI as
your connection source.
You can use ADS to query a table from your database, and then load that table as an ADSDataset object through
DatasetFactory. When you open DatasetFactory, specify the name of the table you want to pull using the table
variable for a given table. For SQL expressions, use the table parameter also, for example,
table="SELECT * FROM sh.times WHERE rownum <= 30".
os.environ['TNS_ADMIN'] = creds['tns_admin']
ds = DatasetFactory.open(uri, format="sql", table=table, target='label')
from sqlalchemy import create_engine  # needed for create_engine() below

os.environ['TNS_ADMIN'] = creds['tns_admin']
engine = create_engine(uri)
df = pd.read_sql('SELECT * from <TABLENAME>', con=engine)
You can convert the pd.DataFrame into ADSDataset using the DatasetFactory.from_dataframe() function.
ds = DatasetFactory.from_dataframe(df)
These two examples run a simple query on ADW data. With read_sql_query you can use SQL expressions not just
for tables, but also to limit the number of rows and to apply conditions with filters, such as (where).
import cx_Oracle
import pandas as pd
import numpy as np
import os
os.environ['TNS_ADMIN'] = creds['tns_admin']
connection = cx_Oracle.connect(creds['user'], creds['password'], creds['sid'])
cursor = connection.cursor()
results = cursor.execute("SELECT * from <TABLENAME>")
data = results.fetchall()
df = pd.DataFrame(np.array(data))
ds = DatasetFactory.from_dataframe(df)
cursor.close()
connection.close()
After you load your data from ADB, the ADSDataset object is created, which allows you to build models using Au-
toML.
To add predictions to a table, you can either update an existing table, or create a new table with the added predictions.
There are many ways to do this. One way is to use the model to update a CSV file, and then use Oracle SQL*Loader
or SQL*Plus.
This example adds predictions programmatically using cx_Oracle. It uses executemany to insert rows as tuples created
using the model’s predict method:
ds = DatasetFactory.open("iris.csv")
cursor.executemany("""
insert into IRIS_PREDICTED
(sepal_length, sepal_width, petal_length, petal_width, SPECIES, yhat)
values (:1, :2, :3, :4, :5, :6)""",
rows
)
connection.commit()
cursor.close()
connection.close()
For some models, you could also use predict_proba to get an array of predictions and their confidence probability.
7.2.1.6 Amazon S3
You can open Amazon S3 public or private files in ADS. For private files, you must pass the right credentials through
the ADS storage_options dictionary. If you have large S3 files, then you benefit from an increased blocksize.
ds = DatasetFactory.open("s3://bucket_name/iris.csv", storage_options = {
'key': 'aws key',
'secret': 'aws secret',
'blocksize': 1000000,
'client_kwargs': {
"endpoint_url": "https://s3-us-west-1.amazonaws.com"
}
})
To open a dataset from a remote web server source, use DatasetFactory.open() and specify the URL of the data:
ds = DatasetFactory.open('https://example.com/path/to/data.csv', target='label')
7.2.1.8 DatasetBrowser
DatasetBrowser allows easy access to datasets from reference libraries and index websites, such as scikit-learn. To
see the supported libraries, use the list() function:
DatasetBrowser.list()
sklearn = DatasetBrowser.sklearn()
sklearn.list()
Datasets are provided as a convenience. Datasets are considered Third Party Content and are not considered Materials
under Your agreement with Oracle applicable to the Services. Review the dataset license.
To explore one of the datasets, use open() specifying the name of the dataset:
ds = sklearn.open('wine')
EIGHT
LABEL DATA
8.1 Overview
The Oracle Cloud Infrastructure (OCI) Data Labeling service allows you to create and browse datasets, view data
records (text, images) and apply labels for the purposes of building AI/machine learning (ML) models. The service
also provides interactive user interfaces that enable the labeling process. After you label records, you can export the
dataset as line-delimited JSON Lines (JSONL) for use in model development.
Datasets are the core resource available within the Data Labeling service. They contain records and their associated
labels. A record represents a single image or text document. Records are stored by reference to their original source
such as path on Object Storage. You can also upload records from local storage. Labels are annotations that describe
a data record. There are three different dataset formats, each having its respective annotation classes:
• Images: Single label, multiple label, and object detection. Supported image types are .png, .jpeg, and .jpg.
• Text: Single label, multiple label, and entity extraction. Plain text, .txt, files are supported.
• Document: Single label and multiple label. Supported document types are .pdf and .tiff.
The following examples provide an overview of how to use ADS to work with the Data Labeling service.
List all the datasets in the compartment:
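A minimal sketch, assuming the DataLabeling class can be imported from ads.data_labeling:

from ads.data_labeling import DataLabeling

dls = DataLabeling()               # defaults to the notebook session's compartment
df_datasets = dls.list_dataset()   # Pandas dataframe with one row per dataset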
With a labeled dataset, the details of the labeling are called the export. To generate the export and get the path to the
metadata JSONL file, you can use export() with these parameters:
• dataset_id: The OCID of the Data Labeling dataset to take a snapshot of.
• path: The Object Storage path to store the generated snapshot.
metadata_path = dls.export(
dataset_id="<dataset_id>",
path="oci://<bucket_name>@<namespace>/<prefix>"
)
To load the labeled data into a Pandas dataframe, you can use LabeledDatasetReader object that has these parame-
ters:
• materialize: Load the contents of the dataset. This can be quite large. The default is False.
• path: The metadata file path that can be local or object storage path.
You can also read labeled datasets from the OCI Data Labeling Service into a Pandas dataframe using
LabeledDatasetReader object by specifying dataset_id:
Alternatively, you can use the .read_labeled_data() method by either specifying path or dataset_id.
This example loads a labeled dataset and returns a Pandas dataframe containing the content and the annotations:
df = pd.DataFrame.ads.read_labeled_data(
path="<metadata_path>",
materialize=True
)
The following example loads a labeled dataset from the OCI Data Labeling, and returns a Pandas dataframe containing
the content and the annotations:
df = pd.DataFrame.ads.read_labeled_data(
dataset_id="<dataset_ocid>",
materialize=True
)
To obtain a handle to a DataLabeling object, you call the DataLabeling() constructor. The default compartment is
the same compartment as the notebook session, but the compartment_id parameter can be used to select a different
compartment.
To work with the labeled data, you need a snapshot of the dataset. The export() method copies the labeled data from
the Data Labeling service into a bucket in Object Storage. The .export() method has the following parameters:
• dataset_id: The OCID of the Data Labeling dataset to take a snapshot of.
• path: The Object Storage path to store the generated snapshot.
The export process creates a JSONL file that contains metadata about the labeled dataset in the specified bucket. There
is also a record JSONL file that stores the image, text, or document file path of each record and its label.
The export() method returns the path to the metadata file that was created in the export operation.
8.4 List
The .list_dataset() method generates a list of the available labeled datasets in the compartment. The compartment
is set when you call DataLabeling(). The .list_dataset() method returns a Pandas dataframe where each row
is a dataset.
8.5 Load
The returned value from the .export() method is used to load a dataset. You can load a dataset into a Pandas dataframe
using LabeledDatasetReader or a Pandas accessor. The LabeledDatasetReader creates an object that allows you
to perform operations, such as getting information about the dataset without having to load the entire dataset. It also
allows you to read the data directly into a Pandas dataframe or to use an iterator to process the records one at a time.
The Pandas accessor approach provides a convenient method to load the data in a single command.
8.5.1 LabeledDatasetReader
Call the .from_export() method on LabeledDatasetReader to construct an object that allows you to read the data.
You need the metadata path that was generated by the .export() method. Optionally, you can set materialize to
True to load the contents of the dataset. It’s set to False by default.
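For example, assuming metadata_path holds the value returned by export() and that LabeledDatasetReader is importable from ads.data_labeling:
from ads.data_labeling import LabeledDatasetReader

ds_reader = LabeledDatasetReader.from_export(
    path=metadata_path,
    materialize=True,
)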
You can explore the metadata information of the dataset by calling info() on the LabeledDatasetReader object.
You can also convert the metadata object to a dictionary using to_dict:
metadata = ds_reader.info()
metadata.labels
metadata.to_dict()
On the LabeledDatasetReader object, you call read() to load the labeled dataset. By default, it’s read into a Pandas
dataframe. You can specify the output annotation format to be spacy for the Entity Extraction dataset or yolo for the
Object Detection dataset.
An Entity Extraction dataset is a dataset type that supports natural language processing named entity recognition (NLP
NER). Here is an example of the spacy format. An Object Detection dataset is a dataset type that contains data from detecting
instances of objects of a certain class within an image. Here is an example of the yolo format.
df = ds_reader.read()
df = ds_reader.read(format="spacy")
df = ds_reader.read(format="yolo")
When a dataset is too large, you can read it in small portions. The result is presented as a generator.
for df in ds_reader.read(chunksize=10):
    df.head()
Alternatively, you can call read(iterator=True) to return a generator of the loaded dataset, and loop all the records
in the ds_generator by running:
ds_generator = ds_reader.read(iterator=True)
for item in ds_generator:
    print(item)
The iterator parameter can be combined with the chunksize parameter. When you use the two parameters, the
result is also presented as a generator. Every item in the generator is a list of dataset records.
The Pandas accessor approach allows you to read a labeled dataset into a Pandas dataframe using a single command.
Use the .read_labeled_data() method to read the metadata file, record file, and all the corpus documents. To
do this, you must know the metadata path that was created by the .export() method. Optionally, you can set
materialize to True to load the content of the dataset. It’s set to False by default. The read_labeled_data() method
returns a dataframe that is easy to work with.
This example loads a labeled dataset and returns a Pandas dataframe containing the content and the annotations:
import pandas as pd
df = pd.DataFrame.ads.read_labeled_data(
path="<metadata_path>",
materialize=True
)
If you’d like to load a labeled dataset from the OCI Data Labeling service, you can specify the dataset_id, which is
the dataset OCID that you’d like to read.
The following example loads a labeled dataset from the OCI Data Labeling and returns a Pandas dataframe containing
the content and the annotations:
import pandas as pd
df = pd.DataFrame.ads.read_labeled_data(
dataset_id="<dataset_ocid>",
materialize=True
)
You can specify the output annotation format to be spacy for the Entity Extraction dataset or yolo for the Object
Detection dataset.
import pandas as pd
df = pd.DataFrame.ads.read_labeled_data(
dataset_id="<dataset_ocid>",
materialize=True,
format="spacy"
)
8.6 Visualize
After the labeled dataset is loaded into a Pandas dataframe, you can visualize it using ADS. The visualization
functionality only works if no transformations have been made to the Annotations column.
8.6.1 Image
An image dataset, with an Object Detection annotation class, can have selected image records visualized by calling
the .render_bounding_box() method. You can provide customized colors for each label. If the path parameter is
specified, the annotated image file is saved to that path. Otherwise, the image is displayed in the notebook session. The
maximum number of records to display is set to 50 by default. This setting can be changed with the limit parameter:
df.iloc[1:3,:].ads.render_bounding_box(
options={"default_color": "white",
"colors": {"flower":"orange", "temple":"green"}},
path="test.png"
)
Optionally, you can convert the bounding box to YOLO format by calling to_yolo() on the bounding box. The labels are
mapped to the index value of each label in the metadata.labels list.
df["Annotations"] = df.Annotations.apply(
lambda items: [item.to_yolo(metadata.labels) for item in items] if items else None
)
8.6.2 Text
For a text dataset, with an entity extraction annotation class, you can also visualize selected text records by calling
.render_ner(), and optionally providing customized colors for each label. By default, a maximum of 50 records are
displayed. However, you can adjust this using the limit parameter:
df.iloc[1:3,:].ads.render_ner(options={"default_color":"#DDEECC",
"colors": {"company":"#DDEECC",
"person":"#FFAAAA",
"city":"#CCC"}})
df["Annotations"] = df.Annotations.apply(
lambda items: [item.to_spacy() for item in items] if items else None
)
8.7 Examples
This example demonstrates how to do binary text classification. It demonstrates a typical data science workflow
using a single label dataset from the Data Labeling Service (DLS).
Start by loading in the required libraries:
import ads
import oci
import os
import pandas as pd
8.7.1.1 Dataset
A subset of the 20 Newsgroups dataset is used in this example. The complete dataset is a collection of approximately
20,000 newsgroup documents partitioned across 20 different newsgroups. The dataset is popular for experiments where
the machine learning application predicts which newsgroup a record belongs to.
Since this example is a binary classification, only the rec.sport.baseball and sci.space newsgroups are used.
The dataset was previously labeled in the Data Labeling service. The metadata was exported and saved in a publicly
accessible Object Storage bucket. The metadata JSONL file is used to import the data and labels.
8.7.1.2 Load
You use the .read_labeled_data() method to read in the metadata file, record file, and the entire corpus of docu-
ments. Only the metadata file has to be specified because it contains references to the record and corpus documents.
The .read_labeled_data() method returns a dataframe that is easy to work with.
The next example loads a labeled dataset, and returns the text from each email and the labeled annotation:
df = pd.DataFrame.ads.read_labeled_data(
"oci://hosted-ds-datasets@bigdatadatasciencelarge/DLS/text_single_label_20news/
˓→metadata.jsonl",
materialize=True
)
8.7.1.3 Preprocess
The data needs to be standardized. The next example performs the following operations:
• Converts the text to lower case.
• Uses a regular expression (RegEx) command to remove any character that is not alphanumeric, underscore, or whitespace.
• Replaces the sequence of characters \n with a space.
The binary classifier model you train is a decision tree where the features are based on n-grams of the words. You use
n-grams that are one, two, and three words long (unigrams, bigrams, and trigrams). The vectorizer removes English
stop words because they provide little value to the model being built. A weight is assigned to these features using the
term frequency-inverse document frequency (TF*IDF) approach. A sketch of these preprocessing steps follows.
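The following sketch assumes the column names Content and text_clean, and uses scikit-learn's TfidfVectorizer for the TF*IDF weighting; the exact cleaning expressions mirror the description above.
from sklearn.feature_extraction.text import TfidfVectorizer

# Standardize the text: lower case, strip non-word characters, collapse newlines
df["text_clean"] = (
    df["Content"]
    .str.lower()
    .str.replace(r"[^\w\s]", " ", regex=True)
    .str.replace("\n", " ")
)

# Unigrams, bigrams, and trigrams; drop English stop words; weight with TF*IDF
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))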
8.7.1.4 Train
In this example, you skip splitting the dataset into the training and test sets since the goal is to build a toy model. You
assign 0 for the rec.sport.baseball label and 1 for the sci.space label:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
feature = vectorizer.fit_transform(df['text_clean'])
model = classifier.fit(feature, df['Annotations'])
8.7.1.5 Predict
Use the following to predict the category of a given text using the trained binary classifier:
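A minimal sketch, assuming the model and vectorizer objects from the previous steps; the sample sentence is illustrative only.
sample = ["The rocket launch was delayed until the next orbital window."]
predicted = model.predict(vectorizer.transform(sample))
predicted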
This example demonstrates how to read image files and labels, normalize the size of the images, train an SVC model, and
make predictions. The SVC model is used to try to determine what class an image belongs to.
To start, import the required libraries:
import ads
import matplotlib.pyplot as plt
import oci
import os
import pandas as pd
The data for this example was taken from a set of x-rays that were previously labeled in the Data Labeling service
according to whether or not they show pneumonia. The metadata was exported and saved in a publicly accessible Object Storage
bucket. The following command defines the path to the metadata JSONL file:
metadata_path = "oci://hosted-ds-datasets@bigdatadatasciencelarge/DLS/image_single_label_xray/metadata.jsonl"
8.7.2.2 Load
This example loads and materializes the data in the dataframe. That is, the dataframe contains a copy of each image
file. You do this with the .ads.read_labeled_data() method:
df = pd.DataFrame.ads.read_labeled_data(path=metadata_path,
materialize=True)
8.7.2.3 Visualize
The next example extracts images from the dataframe, and plots them along with their labels:
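A sketch of such a plot, assuming the materialized images are in the Content column and the labels in the Annotations column.
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(16, 4))
for ax, (_, row) in zip(axes, df.iterrows()):
    ax.imshow(row["Content"], cmap="gray")   # Content holds the materialized image
    ax.set_title(row["Annotations"])         # Annotations holds the label
    ax.axis("off")
plt.show()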
8.7.2.4 Preprocess
The image files are a mixture of RGB and grayscale. Convert all the images to single-channel grayscale so that the input
to the SVC model is consistent.
The images are also different sizes, so normalize them to a common size.
Convert each image to a NumPy array, because that is what the SVC expects. Each pixel in the image is now a dimension
in hyperspace.
Finally, the model needs to be trained on one set of data and have its performance assessed on data that it
has not seen before. Therefore, split the data into training and testing sets. A sketch of these steps follows.
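In the following sketch, the 128x128 target size, the column names, and the use of train_test_split are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

# Convert each materialized image to single-channel grayscale and normalize its size
images = [img.convert("L").resize((128, 128)) for img in df["Content"]]

# Flatten each image so that every pixel becomes a dimension in hyperspace
X = np.stack([np.asarray(img).ravel() for img in images])
y = df["Annotations"]

# Hold out part of the data to assess the model on records it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)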
8.7.2.5 Train
The following obtains an SVC classifier object, and trains it on the training set:
from sklearn import svm

clf = svm.SVC(gamma=0.001)
clf.fit(X_train, y_train)
8.7.2.6 Predict
With the trained SVC model, you can now make predictions using the testing dataset:
predicted = clf.predict(X_test)
predicted
Building a multinomial text classifier is similar to creating a binary text classifier, except that you make a classifier for
each class. You use a one-vs-the-rest (OvR) multinomial strategy. That is, you create one classifier for each class, where
one class is the class you are trying to predict, and the other class is all the other classes treated as
if they were one class. The classifier predicts whether the observation is in the class or not. If there are m classes, then
there are m classifiers. Classification is based on which classifier has the most confidence that an observation is in
the class.
Start by loading in the required libraries:
import ads
import nltk
import oci
import os
import pandas as pd
8.7.3.1 Dataset
A subset of the Reuters Corpus dataset is used in this example. You use the scikit-learn and nltk packages to build a
multinomial classifier. The Reuters data is a benchmark dataset for document classification. More precisely, it is a
dataset where the target variable is multinomial. It has 90 categories, 7,769 training documents, and 3,019 testing
documents.
The data was previously labeled in the Data Labeling service. The metadata was exported and was saved in a publicly
accessible Object Storage bucket. The metadata JSONL file is used to import the data and labels.
8.7.3.2 Load
This example loads a dataset with a target variable that is multinomial. It returns the text and the class annotation in a
dataframe:
df = pd.DataFrame.ads.read_labeled_data(
    "oci://hosted-ds-datasets@bigdatadatasciencelarge/DLS/text_multi_label_nltk_reuters/metadata.jsonl",
    materialize=True
)
8.7.3.3 Preprocess
You can use the MultiLabelBinarizer() method to convert the labels into the scikit-learn classification format
during the dataset preprocessing. This transformer converts a list of sets or tuples into the supported multilabel format,
a binary matrix of samples*classes.
The next step is to vectorize the input text to feed it into a supervised machine learning system. In this example, TF*IDF
vectorization is used.
For performance reasons, the TfidfVectorizer is limited to 10,000 words.
nltk.download('stopwords')
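A sketch of these preprocessing steps; the column names Content and Annotations are assumptions, and the labels are assumed to be lists of classes per record.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Convert the lists of labels into a binary matrix of samples x classes
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["Annotations"])

# TF*IDF features, limited to 10,000 words for performance
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"), max_features=10000)
X = vectorizer.fit_transform(df["Content"])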
8.7.3.4 Train
You train a Linear Support Vector, LinearSVC, classifier using the text data to generate features and the annotations to
represent the response variable.
The data from the class under consideration is treated as positive, and the data from all the other classes is treated as negative.
This example uses the scalable Linear Support Vector Machine, LinearSVC, for classification. It’s quick to train and
empirically adequate on NLP problems. A sketch follows:
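Wrapping LinearSVC in scikit-learn's OneVsRestClassifier is one way to implement the OvR strategy described above; this is a sketch, not the only possible setup.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# One LinearSVC per class; each treats its class as positive and all others as negative
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)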
8.7.3.5 Predict
The next example applies cross-validation to estimate the prediction error. K-fold cross-validation works by partitioning
a dataset into K splits. For the k-th part, it fits the model to the other K-1 splits of the data and calculates the
prediction error on the k-th part. For more details about this process, see the scikit-learn cross-validation documentation.
By performing cross-validation, five separate models are trained on different train and test splits to get an estimate
of the error that is expected when the model is generalized to an independent dataset. This example uses the
cross_val_score method to estimate the mean and standard deviation of errors:
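A minimal sketch, assuming the clf, X, and y objects from the previous steps.
from sklearn.model_selection import cross_val_score

# Five-fold cross-validation; report the mean score and its spread
scores = cross_val_score(clf, X, y, cv=5)
print(f"{scores.mean():.3f} accuracy with a standard deviation of {scores.std():.3f}")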
This example shows you how to use a labeled dataset to create a named entity recognition model. The dataset is labeled
using the Oracle Cloud Infrastructure (OCI) Data Labeling Service (DLS).
To start, load the required libraries
import ads
import os
import pandas as pd
import spacy
8.7.4.1 Dataset
The Reuters Corpus is a benchmark dataset that is used in the evaluation of document classification models. It is based
on Reuters’ financial newswire service articles from 1987. It contains the title and text of the article in addition to a
list of people, places and organizations that are referenced in the article. It is this information that is used to label the
dataset. A subset of the news articles were labeled using the DLS.
8.7.4.2 Load
This labeled dataset has been exported from the DLS, and the metadata has been stored in a publicly accessible Object
Storage bucket. The .read_labeled_data() method is used to load the data. The materialize parameter causes
the original data to also be returned with the dataframe.
path = 'oci://hosted-ds-datasets@bigdatadatasciencelarge/DLS/text_entity_extraction_nltk_reuters/metadata.jsonl'
df = pd.DataFrame.ads.read_labeled_data(
    path,
    materialize=True
)
8.7.4.3 Preprocess
Convert the annotations data to the spaCy format. This gives you the start and end position of each entity and the
type of entity, such as person, place, or organization.
In this example, you will not be evaluating the performance of the model. Therefore, the data is not split into train
and test sets. Instead, you use all the data as training data. The following code snippet creates a list of tuples that
contain the original article text and the annotation data.
train_data = []
for i, row in df.iterrows():
    train_data.append((row['Content'], {'entities': row['Annotations']}))
...
The DocBin format will be used as it provides faster serialization and efficient storage. The following code snippet
does the conversion and writes the resulting DocBin object to a file.
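A sketch of the conversion, assuming each annotation is a (start, end, label) tuple in the spaCy format produced above; the output file name is also an assumption.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        # char_span() returns None when a span does not align with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")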
8.7.4.4 Train
The model will be trained using spaCy. Since this is done through the command line, a configuration file is needed. In
spaCy, this is a two-step process. You create a base_config.cfg file that contains the non-default settings for
the model. Then the init fill-config command of the spaCy module is used to auto-fill a partial config.cfg file
with the default values for the parameters that are not given in the base_config.cfg file. The resulting config.cfg
file contains all the settings and hyperparameters that are needed to train the model. See the spaCy training
documentation for more details.
The following code snippet writes the base_config.cfg configuration file, which contains all the non-default
parameter values.
config = """
[paths]
train = null
dev = null
[system]
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
[initialize]
vectors = ${paths.vectors}
"""
The following code snippet calls a new Python interpreter that runs the spaCy module. It loads the base_config.cfg
file and writes out the configuration file config.cfg that has all of the training parameters that will be used. It contains
the default values plus the ones that were specified in the base_config.cfg file.
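A sketch of this step; the config string is written to disk first, and the file names mirror the ones used above.
import subprocess

# Persist the partial configuration, then let spaCy fill in the defaults
with open("base_config.cfg", "w") as f:
    f.write(config)

subprocess.run(
    ["python", "-m", "spacy", "init", "fill-config", "base_config.cfg", "config.cfg"],
    check=True,
)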
To train the model, you will call a new Python interpreter to run the spaCy module using the train command-line
argument and other arguments that point to the training files that you have created.
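A sketch of the training call; the output directory and the reuse of the same DocBin file for the train and dev corpora are assumptions.
subprocess.run(
    [
        "python", "-m", "spacy", "train", "config.cfg",
        "--output", "./output",
        "--paths.train", "./train.spacy",
        "--paths.dev", "./train.spacy",
    ],
    check=True,
)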
8.7.4.5 Predict
The spaCy training procedure creates a number of models. The best model is stored in model-best under the output
directory that was specified. The following code snippet loads that model and creates a sample document. The model
is run, and the detected entities in the output document are highlighted.
# The path assumes the ./output directory used in the training step above
nlp = spacy.load("./output/model-best")
doc = nlp("The Japanese minister for post and telecommunications was reported as saying that he opposed Cable and Wireless having a managerial role in the new company.")
NINE
TRANSFORM DATA
When datasets are loaded with DatasetFactory, they can be transformed and manipulated easily with the built-in functions.
Under the hood, an ADSDataset object is a Pandas dataframe. Any operation that can be performed on a Pandas
dataframe can also be applied to an ADSDataset.
ds = DatasetFactory.from_dataframe(df)
ADS has built-in automatic transform tools for datasets. When the get_recommendations() tool is applied to an
ADSDataset object, it shows the user the detected issues with the data and recommends changes to apply to the dataset.
Accepting the changes is as easy as clicking a button in the drop-down menu. After all the changes are applied,
the transformed dataset can be retrieved by calling get_transformed_dataset().
wine_ds.get_recommendations()
Alternatively, you can use auto_transform() to apply all the recommended transformations at once.
auto_transform() returns a transformed dataset with several optimizations applied automatically. The optimiza-
tions include:
• Dropping constant and primary key columns, which have no predictive quality.
• Imputation to fill in missing values in noisy data.
• Dropping strongly co-correlated columns that tend to produce less generalizable models.
• Balancing a dataset using up or down sampling.
One optional argument to auto_transform() is fix_imbalance, which is set to True by default. When True,
auto_transform() corrects any imbalance between the classes. ADS downsamples the dominant class first unless
there are too few data points. In that case, ADS upsamples the minority class.
ds = wine_ds.auto_transform()
You can visualize the transformation that has been performed on a dataset by calling visualize_transforms().
Note: visualize_transforms() is only applied to the automated transformations and does not capture any custom
transformations that you may have applied to the dataset.
ds.visualize_transforms()
The operations that can be applied to a Pandas dataframe can be applied to an ADSDataset object.
Examples of some of the most common row operations you can apply on an ADSDataset object follow.
Rows within a dataset can be filtered out by row numbers. The index of the new dataset can be reset accordingly.
#Filter out rows by row number and reset index of new data
ds_subset = ds.loc[10:100]
ds_subset = ds_subset.reset_index()
Reset the index to the default index. When you reset the index, the old index is added as a column and a new
sequential index is used. You can use the drop parameter to avoid the old index being added as a column:
ds_subset = ds.loc[10:100]
ds_subset = ds_subset.reset_index(drop=True)
ds_subset.head()
The index restarts at zero for each partition. This is due to the inability to statically know the full length of the index.
Alternatively, you can use the append() method of a Pandas dataframe to achieve a similar result:
ds2 = wine_ds.df.append(ds)
You can remove duplicate rows with the drop_duplicates() method:
ds_without_dup = ds.drop_duplicates()
The column operations that can be applied to a Pandas dataframe can be applied to an ADS dataset as in the following
examples.
To delete specific columns from the dataset, the drop_columns function can be used along with the names of the columns
to be deleted from the dataset. The ravel Pandas command returns the flattened underlying data as an ndarray. The
name_of_df.columns[:].ravel() command returns the names of all the columns in a dataframe as an array. A sketch of both follows.
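This sketch uses column names from the wine dataset used elsewhere in this chapter; drop_columns accepting a list of names is an assumption based on the description above.
# List all the column names as an array
ds.df.columns[:].ravel()

# Drop two columns by name
ds_smaller = ds.drop_columns(['alcohol', 'hue'])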
The count per unique value can be obtained with the value_counts() method:
ds['target'].value_counts()
class_1 71
class_0 59
class_2 48
Name: target, dtype: int64
You can apply a variety of normalization techniques to numerical columns (both continuous and discrete). You can
leverage the built in max() and min() methods to perform a minmax normalization:
max_alcohol = wine_ds['alcohol'].max()
min_alcohol = wine_ds['alcohol'].min()
alcohol_range = max_alcohol - min_alcohol
wine_ds.df['norm_alcohol'] = (wine_ds['alcohol'] - min_alcohol) / alcohol_range
This example creates a new column by combining two or more existing columns. Alternatively, you can create the
new column directly in the underlying Pandas dataframe attribute. A sketch of both approaches follows.
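This sketch uses two columns from the wine dataset; passing a precomputed array to assign_column() mirrors the noise example below and is an assumption.
# Compute the combined values from the underlying Pandas dataframe
ratio = wine_ds.df['flavanoids'] / wine_ds.df['total_phenols']

# Option 1: add the column through the ADSDataset API
ds = wine_ds.assign_column('flavanoid_ratio', ratio.values)

# Option 2: create the column directly on the Pandas dataframe attribute
wine_ds.df['flavanoid_ratio'] = ratio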
To add a new column, use a new name for it. You can add a new column and then change it by combining it with an existing column:
import numpy as np

noise = np.random.normal(0, .1, wine_ds.shape[0])
ds_noise = wine_ds.assign_column('noise', noise)
You can apply functions to update the values in an existing column. This example updates the column using a
lambda expression:
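A sketch, using assign_column() with a callable as in the binning example later in this section; the column and transformation are illustrative.
ds = wine_ds.assign_column('magnesium', lambda x: x / 100)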
You can change the data type of columns with the astype() method. ADS uses the Pandas method, astype(), on
dataframe objects. For specifics, see astype for a Pandas Dataframe, using numpy.dtype, or Pandas dtypes.
When you change the type of a column, ADS updates its semantic type to categorical, continuous, datetime, or ordinal.
For example, if you update a column type to integer, its semantic type updates to ordinal. For data type details, see
the section on specifying data types when loading data.
This example converts a dataframe column from float to the low-level integer type, and ADS updates its semantic type
to ordinal:
# Note: When you cast a float column to integer, you lose precision.
# The astype(types={...}) signature below is assumed for ADSDataset.
wine_ds = wine_ds.astype(types={'proline': 'int64'})
wine_ds['proline'].head()
To convert a column of type float to categorical, you convert it to integer first. This example converts a column data
type from float to integer, then to categorical, and then the number of categories in the column is reduced:
# create a new dataset with a renamed column for binned data and update the values
ds = wine_ds.rename_columns({'color_intensity': 'color_intensity_bin'})
ds = ds.assign_column('color_intensity_bin', lambda x: x/3)
You can use feature_types to see if the semantic data type of the converted column is categorical:
ds.feature_types['color_intensity_bin']['type']
'categorical'
ds['color_intensity_bin'].head()
0 1
1 1
2 1
3 2
4 1
Name: color_intensity_bin, dtype: category
Categories (5, int64): [0, 1, 2, 3, 4]
ADS has built-in functions that support categorical encoding, null values, and imputation.
ADS has a built-in categorical encoder that is accessed with from ads.dataset.label_encoder import
DataFrameLabelEncoder. This example encodes the three classes of wine that make up the dataset; a sketch follows:
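Applying the encoder's fit_transform() to the underlying dataframe is an assumption in this sketch.
from ads.dataset.label_encoder import DataFrameLabelEncoder

# Encode the categorical columns, including the target
ds_encoded = DataFrameLabelEncoder().fit_transform(ds.df)
ds_encoded['target'].value_counts()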
1 71
0 59
2 48
One-hot encoding transforms one categorical column with n categories into n or n-1 columns with indicator variables.
You can prepare one of the columns to be categorical with categories low, medium, and high:
def convert_to_level(value):
    if value < 12:
        return 'low'
    elif value > 13:
        return 'high'
    else:
        return 'medium'

ds = wine_ds
ds = ds.assign_column('alcohol', convert_to_level)
You can use the Pandas method get_dummies() to perform one-hot encoding on a column. Use the prefix parameter
to assign a prefix to the new columns that contain the indicator variables. This creates n columns with one-hot
encoding. To create n-1 columns, use drop_first=True when converting the categorical column. You can add the one-hot columns
to the initial dataset with the merge() method, as in the sketch below.
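A sketch of these steps, operating on the underlying Pandas dataframe of the dataset prepared above.
import pandas as pd

data = ds.df

# n indicator columns, prefixed with "alcohol"
onehot = pd.get_dummies(data['alcohol'], prefix='alcohol')

# n-1 indicator columns
onehot_reduced = pd.get_dummies(data['alcohol'], prefix='alcohol', drop_first=True)

# Add the indicator columns back onto the original data
data_onehot = data.merge(onehot, left_index=True, right_index=True)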
Encoding for all categorical columns can be accomplished with the fit_transform() method:
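A sketch, reusing the DataFrameLabelEncoder shown above; the value counts that follow are assumed to be for the encoded alcohol levels.
ds_encoded = DataFrameLabelEncoder().fit_transform(ds.df)
ds_encoded['alcohol'].value_counts()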
0 92
2 67
1 19
To drop the initial categorical column that you transformed into one-hot, use one of these examples:
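A sketch of both options; the column name follows the one-hot example above.
# Pandas: drop the original column after one-hot encoding
data_onehot = data_onehot.drop(columns=['alcohol'])

# ADS: drop the original column from the dataset
ds_no_alcohol = ds.drop_columns(['alcohol'])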
To detect all nulls in a dataset, use the isnull function to return a boolean dataset matching the dimension of our
input:
ds_null = ds.isnull()
np.any(ds_null)
alcohol False
malic_acid False
ash False
alcalinity_of_ash False
magnesium False
total_phenols False
flavanoids False
nonflavanoid_phenols False
proanthocyanins False
color_intensity False
hue False
od280/od315_of_diluted_wines False
proline False
target False
9.5.4 Imputation
The fillna function is used to replace null values with specific values. Generate null values by replacing entries
below a certain value with null, and then impute them with a value:
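A sketch of generating the nulls with assign_column() and a callable; the threshold of 2 is chosen so that the head of the column matches the output below.
ds_with_null = ds.assign_column('malic_acid', lambda x: None if x < 2 else x)
ds_with_null['malic_acid'].head()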
0 NaN
1 NaN
2 2.36
3 NaN
ds_impute = ds_with_null.fillna(method='bfill')
ds_impute['malic_acid'].head()
0 2.36
1 2.36
2 2.36
3 2.59
4 2.59
Name: malic_acid, dtype: float64
ADS datasets can be merged and combined together to form a new dataset.
You can merge two datasets together with a database-styled join on columns or indexes by specifying the type of join:
left, right, outer, or inner. These types are defined as:
• left: Use only keys from the left dataset, similar to SQL left outer join.
• right: Use only keys from the right dataset, similar to SQL right outer join.
• inner: Intersection of keys from both datasets, similar to SQL inner join.
• outer: Union of keys from both datasets, similar to SQL outer join.
This is an example of performing an outer join on two datasets. The datasets are subsets of the wine dataset, and each
dataset contains only one class of wine.
ds_class1 = ds[ds['target']=='class_1']
ds_class2 = ds[ds['target']=='class_2']
ds_merged_outer = ds_class1.merge(ds_class2, how='outer')
ds_merged_outer['target'].value_counts()
class_1 71
class_2 48
class_0 0
Name: target, dtype: int64
Two datasets can be concatenated along a particular axis (vertical or horizontal) with the option of performing set logic
(union or intersection) of the indexes on the other axes. You can stack two datasets vertically with:
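A sketch of the vertical stack, reusing the class subsets from the merge example; concatenating the underlying dataframes with pd.concat is an assumption.
import pandas as pd

ds_stacked = pd.concat([ds_class1.df, ds_class2.df])
ds_stacked['target'].value_counts()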
class_1 71
class_2 48
class_0 0
Name: target, dtype: int64
After all data transformations are complete, you can split the data into a train and test or train, test, and validation set.
To split data into a train and test set with a train size of 80% and test size of 20%:
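A minimal sketch, using the train_test_split() method shown at the end of this section.
train, test = wine_ds.train_test_split(test_size=0.2)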
For a train, test, and validation set, the defaults are set to 80% of the data for training, 10% for testing, and 10% for
validation. This example sets split to 70%, 15%, and 15%:
data_split = wine_ds.train_validation_test_split(
test_size=0.15,
validation_size=0.15
)
train, validation, test = data_split
print(data_split) # print out shape of train, validation, test sets in split
The resulting three data subsets each have separate data (X) and labels (y).
You can split the dataset right after the DatasetFactory.open() statement:
ds = DatasetFactory.open("path/data.csv").set_target('target')
train, test = ds.train_test_split(test_size=0.25)
9.7.1 TextStrings
9.7.1.1 Overview
Text analytics uses a set of powerful tools to understand the content of unstructured data, such as text. It’s becoming an
increasingly important tool in feature engineering as product reviews, media content, research papers, and more
are being mined for their content. In many data science areas, such as marketing analytics, the use of unstructured
text is becoming as popular as structured data. This is largely due to the relatively low cost of collecting the data.
However, the downside is the complexity of working with it. To work with unstructured text, you need to clean,
summarize, and create features from it before you create a model. The ADSString class provides tools that allow you
to quickly do this work. More importantly, you can expand the tool to meet your specific needs.
Data scientists need to be able to quickly and easily manipulate strings. ADS SDK provides an enhanced string class,
called ADSString. It adds functionality like regular expression (RegEx) matching and natural language processing
(NLP) parsing. The class can be expanded by registering custom plugins so that you can process a string in a way that
it fits your specific needs. For example, you can register the OCI Language service plugin to bind functionalities from
the OCI Language service to ADSString.
The following example parses a text corpus using the NLTK and spaCy engines.
from ads.feature_engineering.adsstring.string import ADSString

s = ADSString("""
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate,
investor, and philanthropist who is a co-founder, the executive chairman and
chief technology officer (CTO) of Oracle Corporation. As of October 2019, he was
listed by Forbes magazine as the fourth-wealthiest person in the United States
and as the sixth-wealthiest in the world, with a fortune of $69.1 billion,
increased from $54.5 billion in 2018.[4] He is also the owner of the 41st
largest island in the United States, Lanai in the Hawaiian Islands with a
population of just over 3000.
""".strip())
# NLTK
ADSString.nlp_backend("nltk")
noun = s.noun
adj = s.adjective
pos = s.pos # Parts of Speech
# spaCy
ADSString.nlp_backend("spacy")
noun = s.noun
adj = s.adjective
pos = s.pos # Parts of Speech
9.7.1.2.2 Plugin
Custom Plugin
This example demonstrates how to create a custom plugin that will take a string, detect the credit card numbers, and
return a list of the last four digits of the credit card number.
class CreditCardLast4:
    @property
    def credit_card_last_4(self):
        return [x[len(x)-4:len(x)] for x in ADSString(self.string).credit_card]
ADSString.plugin_register(CreditCardLast4)
creditcard_numbers = "I purchased the gift on this card 4532640527811543 and the dinner␣
˓→on 340984902710890"
s = ADSString(creditcard_numbers)
s.credit_card_last_4
This example uses the OCI Language service to perform an aspect-based sentiment analysis, language detection, key
phrase extraction, and a named entity recognition.
ADSString.plugin_register(OCILanguage)
s = ADSString("""
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate,
investor, and philanthropist who is a co-founder, the executive chairman and
chief technology officer (CTO) of Oracle Corporation. As of October 2019, he was
listed by Forbes magazine as the fourth-wealthiest person in the United States
and as the sixth-wealthiest in the world, with a fortune of $69.1 billion,
increased from $54.5 billion in 2018.[4] He is also the owner of the 41st
largest island in the United States, Lanai in the Hawaiian Islands with a
population of just over 3000.
""".strip())
# Language Detection
language = s.language_dominant
# Text Classification
classification = s.text_classification
In this example, the dates and prices are extracted from the text using regular expression matching.
from ads.feature_engineering.adsstring.string import ADSString
s = ADSString("""
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate,
investor, and philanthropist who is a co-founder, the executive chairman and
chief technology officer (CTO) of Oracle Corporation. As of October 2019, he was
listed by Forbes magazine as the fourth-wealthiest person in the United States
and as the sixth-wealthiest in the world, with a fortune of $69.1 billion,
increased from $54.5 billion in 2018.[4] He is also the owner of the 41st
largest island in the United States, Lanai in the Hawaiian Islands with a
population of just over 3000.
""".strip())
dates = s.date
prices = s.price
ADSString also supports NLP parsing and is backed by Natural Language Toolkit (NLTK) or spaCy. Unless otherwise
specified, NLTK is used by default. You can extract properties, such as nouns, adjectives, word counts, parts of speech
tags, and so on from text with NLP.
The ADSString class can have one backend enabled at a time. What properties are available depends on the backend,
as do the results of calling the property. The following examples provide an overview of the available parsers, and how
to use them. Generally, each parser supports the adjective, adverb, bigram, noun, pos, sentence, trigram, verb,
word, and word_count base properties. Parsers can also support additional properties.
9.7.1.3.1 NLTK
The Natural Language Toolkit (NLTK) is a powerful platform for processing human language data. It supports all
the base properties and, in addition, stem and token. The stem property returns a list of all the stemmed tokens. It
reduces a token to its word stem by removing affixes (suffixes and prefixes), or to the root of the word, known as the lemma. The
token property is similar to the word property, except it returns non-alphanumeric tokens and doesn’t force tokens to
be lowercase.
The following example uses a sample of text about Larry Ellison to demonstrate the use of the NLTK properties.
test_text = """
Lawrence Joseph Ellison (born August 17, 1944) is an American business␣
˓→magnate,
s.noun[1:5]
s.adjective
s.word[1:5]
By taking the difference between token and word, you can see that the token set contains non-alphanumeric tokens, and also the
uppercase versions of words.
list(set(s.token) - set(s.word))[1:5]
The stem property takes the list of words and stems them. It produces morphological variations of a word’s root form.
The following example stems some words, and shows some of the stemmed words that were changed.
list(set(s.stem) - set(s.word))[1:5]
Part of speech (POS) is a category in which a word is assigned based on its syntactic function. POS depends on
the language. For English, the most common POS are adjective, adverb, conjunction, determiner, interjection, noun,
preposition, pronoun, and verb. However, each POS system has its own set of POS tags that vary based on their
respective training set. The NLTK parsers produce the following POS tags:
s.pos[1:5]
9.7.1.3.2 spaCy
spaCy is an advanced NLP toolkit. It helps you understand what the words mean in context, and who is doing what
to whom. It helps you determine what companies and products are mentioned in a document. The spaCy backend is
used to parse the adjective, adverb, bigram, noun, pos, sentence, trigram, verb, word, and word_count base
properties. It also supports the following additional properties:
• entity: All entities in the text.
• entity_artwork: The titles of books, songs, and so on.
• entity_location: Locations, facilities, and geopolitical entities, such as countries, cities, and states.
• entity_organization: Companies, agencies, and institutions.
• entity_person: Fictional and real people.
• entity_product: Product names and so on.
• lemmas: A rule-based estimation of the roots of a word.
• tokens: The base tokens of the tokenization process. This is similar to word, but it includes non-alphanumeric
values and the word case is preserved.
If the spacy module is installed, you can change the NLP backend using the ADSString.nlp_backend('spacy')
command.
ADSString.nlp_backend("spacy")
s = ADSString(test_text)
s.noun[1:5]
s.adjective
s.word[1:5]
You can identify all the locations that are mentioned in the text.
s.entity_location
s.entity_organization
The POS tagger in spaCy uses a smaller number of categories. For example, spaCy has the ADJ POS for all adjectives,
while NLTK has JJ to mean an adjective. JJR refers to a comparative adjective, and JJS refers to a superlative adjective.
For fine-grained analysis of different parts of speech, NLTK is the preferred backend. However, spaCy’s reduced category
set tends to produce fewer errors, at the cost of not being as specific.
The spaCy parsers produce the following POS tags:
• ADJ: adjective; big, old, green, incomprehensible, first
• ADP: adposition; in, to, during
• ADV: adverb; very, tomorrow, down, where, there
• AUX: auxiliary; is, has (done), will (do), should (do)
• CONJ: conjunction; and, or, but
• CCONJ: coordinating conjunction; and, or, but
• DET: determiner; a, an, the
s.pos[1:5]
9.7.1.4 Plugin
One of the most powerful features of ADSString is that you can expand and customize it. The .plugin_register()
method allows you to add properties to the ADSString class. These plugins can be provided by third-party providers
or developed by you. This section demonstrates how to connect to the OCI Language service, and how to create a
custom plugin.
You can bind additional properties to ADSString using custom plugins. This allows you to create custom text processing
extensions. A plugin has access to the self.string property of the ADSString class. You can define functions
that perform a transformation on the text in the object. All functions defined in a plugin are bound to ADSString and
accessible across all objects of that class.
Assume that your text is "I purchased the gift on this card 4532640527811543 and the dinner on
340984902710890" and you want to know what credit cards were used. The .credit_card property returns the
entire credit card numbers. However, for privacy reasons you don’t want the entire credit card number, only the last four
digits.
To solve this problem, you can create the class CreditCardLast4 and use the self.string property in ADSString
to access the text associated with the object. It then calls the .credit_card method to get the credit card numbers.
Then it parses this to return the last four characters in each credit card.
The first step is to define the class that you want to bind to ADSString. Use the @property decorator and define a
property function. This function only takes self. The self.string is accessible with the text that is defined for a
given object. The property returns a list.
class CreditCardLast4:
    @property
    def credit_card_last_4(self):
        return [x[len(x)-4:len(x)] for x in ADSString(self.string).credit_card]
After the class is defined, it must be registered with ADSString using the .plugin_register() method.
ADSString.plugin_register(CreditCardLast4)
Take the text and make it an ADSString object, and call the .credit_card_last_4 property to obtain the last four
digits of the credit cards that were used.
creditcard_numbers = "I purchased the gift on this card 4532640527811543 and the dinner␣
˓→on 340984902710890"
s = ADSString(creditcard_numbers)
s.credit_card_last_4
['1543', '0890']
The OCI Language service provides pretrained models that provide sophisticated text analysis at scale.
The Language service contains these pretrained language processing capabilities:
• Aspect-Based Sentiment Analysis: Identifies aspects from the given text and classifies each into positive,
negative, or neutral polarity.
• Key Phrase Extraction: Extracts an important set of phrases from a block of text.
• Language Detection: Detects languages based on the given text, and includes a confidence score.
• Named Entity Recognition: Identifies common entities, people, places, locations, email, and so on.
• Text Classification: Identifies the document category and subcategory that the text belongs to.
Those are accessible in ADS using the OCILanguage plugin.
ADSString.plugin_register(OCILanguage)
Aspect-based sentiment analysis can be used to gauge the mood or the tone of the text.
The aspect-based sentiment analysis (ABSA) supports fine-grained sentiment analysis by extracting the individual
aspects in the input document. For example, a restaurant review “The driver was really friendly, but the taxi was falling
apart.” contains positive sentiment toward the taxi driver aspect. Also, it has a strong negative sentiment toward the
service mechanical aspect of the taxi. Classifying the overall sentiment as negative would neglect the fact that the taxi
driver was nice.
ABSA classifies each of the aspects into one of the polarity classes: positive, negative, mixed, or neutral. Along with the
predicted sentiment for each aspect, it also provides a confidence score for each of the classes and their corresponding
offsets in the input. The range of the confidence score for each class is between 0 and 1, and the cumulative scores of
all the classes sum to 1.
In the next example, the sample sentence is analyzed. The two aspects, taxi cab and driver, have their sentiments
determined. It defines the location of the aspect by giving its offset position in the text, and the length of the aspect
in characters. It also gives the text that defines the aspect along with the sentiment scores and which sentiment is
dominant.
t = ADSString("The driver was really friendly, but the taxi was falling apart.")
t.absa
Key phrase (KP) extraction is the process of extracting the words with the most relevance, and expressions from the
input text. It helps summarize the content and recognizes the main topics. The KP extraction finds insights related to
the main points of the text. It understands the unstructured input text, and returns keywords and KPs. The KPs consist
of subjects and objects that are being talked about in the document. Any modifiers, like adjectives associated with these
subjects and objects, are also included in the output. Each key phrase has a confidence score that signifies how confident
the algorithm is that the identified phrase is a KP. Confidence scores are a value from 0 to 1.
The following example determines the key phrases and the importance of these phrases in the text (which is the value
of test_text):
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate,
investor, and philanthropist who is a co-founder, the executive chairman and
chief technology officer (CTO) of Oracle Corporation. As of October 2019, he was
listed by Forbes magazine as the fourth-wealthiest person in the United States
and as the sixth-wealthiest in the world, with a fortune of $69.1 billion,
increased from $54.5 billion in 2018.[4] He is also the owner of the 41st
largest island in the United States, Lanai in the Hawaiian Islands with a
population of just over 3000.
s = ADSString(test_text)
s.key_phrase
Language Detection
The language detection tool identifies which natural language the input text is in. If the document contains more
than one language, the results may not be what you expect. Language detection can help make customer support
interactions more personable and quicker. Customer service chatbots can interact with customers based on the language
of their input text and respond accordingly. If a customer needs help with a product, the chatbot server can field the
corresponding language product manual, or transfer it to a call center for the specific language.
The following is a list of some of the supported languages:
The next example determines the language of the text, the ISO 639-1 language code, and a probability score.
s.language_dominant
Named entity recognition (NER) detects named entities in text. The NER model uses NLP, which uses machine learning
to find predefined named entities. This model also provides a confidence score for each entity, which is a value from 0 to
1. The returned data includes the text of the entity, its position in the document, and its length. It also identifies the type of
entity and a probability score that it is an entity of the stated type.
The following are the supported entity types:
• DATE: Absolute or relative dates, periods, and date range.
• EMAIL: Email address.
• EVENT: Named hurricanes, sports events, and so on.
• FAC: Facilities; Buildings, airports, highways, bridges, and so on.
s.ner
The output gives the named entity, its location, and its offset position in the text. It also gives a probability score that
this text is actually a named entity, along with its type.
Text Classification
Text classification analyses the text and identifies categories for the content, with a confidence score. Text classification
uses NLP techniques to find insights from textual data. It returns a category from a set of predefined categories. This
text classification uses NLP and relies mainly on zero-shot learning: it classifies text with no or
minimal training data. The content of a collection of documents is analyzed to determine common themes.
The next example classifies the text and gives a probability score that the text is in that category.
s.text_classification
Text documents are often parsed looking for specific patterns to extract information like emails, dates, times, web links,
and so on. This pattern matching is often done using RegEx, which is hard to write, modify, and understand. Custom
written RegEx often misses the edge cases. ADSString provides a number of common RegEx patterns so that your
work is simplified. You can use the following patterns:
• credit_card: Credit card number.
• dates: Dates in a variety of standard formats.
• email: Email address.
• ip: IP addresses, versions IPV4 and IPV6.
• link: Text that appears to be a link to a website.
• phone_number_US: USA phone numbers including those with extensions.
• price: Text that appears to be a price.
• ssn: USA social security number.
• street_address: Street address.
• time: Text that appears to be a time and less than 24 hours.
• zip_code: USA zip code.
The preceding ADSString properties return an array with each match of the pattern. The following examples demonstrate
how to extract email addresses, dates, and links from the text. Note that the text is extracted as is. For example,
the dates aren’t converted to a standard format. The returned value is the text as it is represented in the input text. Use
the datetime.strptime() method to convert a date to a timestamp.
s.email
['[email protected]', '[email protected]']

s.link
['www.oracle.com']
While ADSString expands your feature engineering capabilities, it can still be treated as a str object. Any standard
operation on str is preserved in ADSString. For instance, you can convert it to lowercase:

hello_world = "HELLO WORLD"
s = ADSString(hello_world)
s.lower()
'hello world'

s.split()
['HELLO', 'WORLD']
You can use all the str methods, such as the .replace() method, to replace text.
s.replace("L", "N")
'HENNO WORND'
You can perform a number of str manipulation operations, such as .lower() and .upper() to get an ADSString
object back.
isinstance(s.lower().upper(), ADSString)
True
While a new ADSString object is created with str manipulation operations, the equality operation holds.
s.lower().upper() == s
True
The equality operation even holds between ADSString objects (s) and str objects (hello_world).
s == hello_world
True
Convert files, such as PDF and Microsoft Word files, into plain text. The data is stored in Pandas dataframes and therefore
it can easily be manipulated and saved. The text extraction module allows you to read files of various formats and
convert them into different formats that can be used for text manipulation. The most common DataLoader commands
are demonstrated, along with some advanced features, such as defining a custom backend and file processor.
import ads
import fsspec

# Assumed import path for the TextDatasetFactory alias used in the examples below
from ads.text_dataset.dataset import TextDatasetFactory as textfactory

ads.set_debug_mode()
ads.set_auth("resource_principal")
9.7.2.1 Introduction
Text extraction is the process of extracting text from one document and converting it into another form, typically plain
text. For example, you can extract the body of text from a PDF document that has figures, tables, images, and text.
The process can also be used to extract metadata about the document. Generally, text extraction takes a corpus of
documents and returns the extracted text in a structured format. In the ADS text extraction module, that format is a
Pandas dataframe.
The Pandas dataframe has a record in each row. That record can be an entire document, a sentence, a line of text, or
some other unit of text. In the examples, you explore using a row to indicate a line of text and an entire document.
The ADS text extraction module supports:
• Input formats: text, pdf and docx or doc.
• Output formats: Use pandas for Pandas dataframe, or cudf for a cuDF dataframe.
• Backends: Apache Tika (default) and pdfplumber (for PDF).
• Source location: local block volume, and in cloud storage such as the Oracle Cloud Infrastructure (OCI) Object
Storage.
• Options to extract metadata.
You can manipulate files through the DataLoader object. Some of the most common commands are:
• .convert_to_text(): Convert document to text and then save them as plain text files.
• .metadata_all() and .metadata_schema(): Extract metadata from each file.
• .read_line(): Read files line-by-line. Each line corresponds to a record in the corpus.
• .read_text(): Read files where each file corresponds to a record in the corpus.
The OCI Data Science service has a corpus of text documents that are used in the examples. This corpus is stored in a
publicly accessible OCI Object Storage bucket. The following variables define the Object Storage namespace and the
bucket name. You can update these variables to point at your Object Storage bucket, but you might also have to change
some of the code in the examples so that the keys are correct.
namespace = 'bigdatadatasciencelarge'
bucket = 'hosted-ds-datasets'
9.7.2.2 Load
The TextDatasetFactory, which is aliased to textfactory in this notebook, provides access to the DataLoader,
and FileProcessor objects. The DataLoader is a file format-specific object for reading in documents such as PDF
and Word documents. Internally, a data loader binds together a file system interface (in this case fsspec) for opening
files. The FileProcessor object is used to convert these files into plain text. It also has an engine object to control
the output format. For a given DataLoader object, you can customize both the FileProcessor and engine.
Generally, the first step in reading a corpus of documents is to obtain a DataLoader object. For example,
TextDatasetFactory.format('pdf') returns a DataLoader for PDFs. Likewise, you can get a Word document
loader by passing in docx or doc. You can choose an engine that controls how the data is returned. The default
engine is a Python generator. If you want to use the data as a dataframe, then use the .engine() method. A call to
.engine('pandas') returns the data as a Pandas dataframe. On a GPU machine, you can use cuDF dataframes with
a call to .engine('cudf').
The .format() method controls the backend with Apache Tika and pdfplumber being builtin. In addition, you can
write your own backend and plug it into the system. This allows you complete control over the backend. The file
processor is used to actually process a specific file format.
To obtain a DataLoader object, call the .format() method on textfactory. This returns a DataLoader
object that can then be configured with the .backend(), .engine(), and .options() methods. The .backend()
method is used to define which backend is to manage the process of parsing the corpus. If this is not specified then a
sensible default backend is chosen based on the file format that is being processed. The .engine() method is used
to control the output format of the data. If it is not specified, then an iterator is returned. The .options() method is
used to add extra fields to each record. These would be things such as the filename, or metadata about the file. There
are more details about this and the other configuration methods in the examples.
In this example you create a DataLoader object by calling textfactory.format('pdf'). This DataLoader
object is configured to read PDF documents. You then change the backend to use pdfplumber with the method
.backend('pdfplumber'). It’s easier to work with the results if they are in a dataframe. So, the method .
engine('pandas') returns a Pandas dataframe.
After you have the DataLoader object configured, you process the corpus. In this example, the corpus is a single PDF
file. It is read from a publicly accessible OCI Object Storage bucket. The .read_line() method is used to read in
the corpus where each line of the document is treated as a record. Thus, each row in the returned dataframe is a line of
text from the corpus.
dl = textfactory.format('pdf').backend('pdfplumber').engine('pandas')
df = dl.read_line(
    f'oci://{bucket}@{namespace}/pdf_sample/paper-0.pdf',
    storage_options={"config": {}},
)
Typically, you want to treat each line of a document or each document as a record. The method .read_line()
processes a corpus, and return each line in the documents as a text string. The method .read_text() treats each
document in the corpus as a record.
Both the .read_line() and .read_text() methods parse the corpus, convert it to text, and read it into memory.
The .convert_to_text() method does the same processing as .read_text(), but it outputs the plain text to files.
This allows you to post-process the data without having to again convert the raw documents into plain text documents,
which can be an expensive process.
Each document can have a custom set of metadata that describes the document. The .metadata_all() and
.metadata_schema() methods allow you to access this metadata. Metadata is represented as key-value pairs. The
.metadata_all() method returns a set of key-value pairs for each document. The .metadata_schema() method returns the keys
that are used in defining the metadata. These can vary from document to document, and this method creates a list of all
observed keys. You use this to understand what metadata is available in the corpus.
``.read_line()``
The .read_line() method allows you to read a corpus line-by-line. In other words, each line in a file corresponds to
one record. The only required argument to this method is path. It sets the path to the corpus, and it can contain a glob
pattern. For example, oci://{bucket}@{namespace}/pdf_sample/**.pdf, 'oci://{bucket}@{namespace}/
20news-small/**/[1-9]*', or /home/datascience/<path-to-folder>/[A-Za-z]*.docx are all valid paths
that contain a glob pattern for selecting multiple files. The path parameter can also be a list of paths. This allows for
reading files from different file paths.
The optional parameter udf stands for a user-defined function. This parameter can be a callable Python object, or a
regular expression (RegEx). If it is a callable Python object, then the function must accept a string as an argument and
return a tuple. If the parameter is a RegEx, then the returned values are the captured RegEx patterns. If there is no
match, then the record is ignored. This is a convenient method to selectively capture text from a corpus. In either case,
the udf is applied on the record level, and is a powerful tool for data transformation and filtering.
The .read_line() method has the following arguments:
• df_args: Arguments to pass to the engine. It only applies to Pandas and cuDF dataframes.
• n_lines_per_file: Maximal number of lines to read from a single file.
• path: The path to the corpus.
• storage_options: Options that are necessary for connecting to OCI Object Storage.
• total_lines: Maximal number of lines to read from all files.
Examples
In the next example, a lambda function is used to create a Python callable object that is passed to the udf parameter.
The lambda function takes a line and splits it on white space into tokens. It then counts the number of tokens, and
returns a tuple where the first element is the token count and the second element is the line itself.
The df_args parameter is used to change the column names into user-friendly values.
dl = textfactory.format('docx').engine('pandas')
df = dl.read_line(
    path=f'oci://{bucket}@{namespace}/docx_sample/*.docx',
    udf=lambda x: (len(x.strip().split()), x),
    storage_options={"config": {}},
    df_args={'columns': ['token count', 'text']},
)
df.head()
In this example, the corpus is a collection of log files. A RegEx is used to parse the standard Apache log format. If a
line does not match the pattern, it is discarded. If it does match the pattern, then a tuple is returned where each element
is a value in the RegEx capture group.
This example uses the default engine, which returns an iterator. The next() method is used to iterate through the
values.
APACHE_LOG_PATTERN = r'^\[(\S+)\s(\S+)\s(\d+)\s(\d+\:\d+\:\d+)\s(\d+)]\s(\S+)\s(\S+)\s(\S+)\s(\S+)'

dl = textfactory.format('txt')
df = dl.read_line(
    f'oci://{bucket}@{namespace}/log_sample/*.log',
    udf=APACHE_LOG_PATTERN,
    storage_options={"config": {}},
)
next(df)
['Sun',
'Dec',
'04',
'04:47:44',
'2005',
'[notice]',
'workerEnv.init()',
'ok',
'/etc/httpd/conf/workers2.properties']
``.read_text()``
If you want to treat each document in a corpus as a record, use the .read_text() method. The path parameter is the
only required parameter as it defines the location of the corpus.
The optional udf parameter stands for a user-defined function. This parameter can be a callable Python object or a
RegEx.
The .read_text() method has the following arguments:
• df_args: Arguments to pass to the engine. It only applies to Pandas and cuDF dataframes.
• path: The path to the corpus.
• storage_options: Options that are necessary for connecting to OCI Object Storage.
• total_files: The maximum number of files that should be processed.
• udf: User-defined function for data transformation and filtering.
Examples
total_files
In this example, there are six files in the corpus. However, the total_files parameter is set to 4 so only four
files are processed. There is no guarantee which four will actually be processed. However, this parameter is commonly
used to limit the size of the data when you are developing the code for the model. Later on, it is often removed so the
entire corpus is processed.
This example also demonstrates the use of a list, plus globbing, to define the corpus. Notice that the path parameter is
a list with two file paths. The output shows the dataframe has four rows and so only four files were processed.
dl = textfactory.format('docx').engine('pandas')
df = dl.read_text(
    path=[f'oci://{bucket}@{namespace}/docx_sample/*.docx',
          f'oci://{bucket}@{namespace}/docx_sample/*.doc'],
    total_files=4,
    storage_options={"config": {}},
)
df.shape

(4, 1)
.convert_to_text()
Converting a set of raw documents can be an expensive process. The .convert_to_text() method allows you to
convert a corpus of source documents and write them out as plain text files. Each input document is written
to a separate file that has the same name as the source file. However, the file extension is changed to .txt. Converting
the raw documents allows you to post-process the raw text multiple times while only having to convert it once.
The src_path parameter defines the location of the corpus. The dst_path parameter gives the location where the
plain text files are to be written. It can be an Object Storage bucket or the local block storage. If the directory does not
exist, it is created. It overwrites any files in the directory.
The .convert_to_text() method has the following arguments:
• dst_path: Object Storage or local block storage path where plain text files are written.
• encoding: Encoding for files. The default is utf-8.
• src_path: The path to the corpus.
• storage_options: Options that are necessary for connecting to Object Storage.
The following example converts a corpus and writes it to a temporary directory. It then lists all the plain text files that
were created in the conversion process.
import os
import shutil
import tempfile

dst_path = tempfile.mkdtemp()
dl = textfactory.format('pdf')
dl.convert_to_text(
    src_path=f'oci://{bucket}@{namespace}/pdf_sample/*.pdf',
    dst_path=dst_path,
    storage_options={"config": {}},
)
print(os.listdir(dst_path))
shutil.rmtree(dst_path)
Each document can contain metadata. The purpose of the .metadata_all() method is to capture this information
for each document in the corpus. There is no standard set of metadata across all documents so each document could
return a different set of values.
The path parameter is the only required parameter as it defines the location of the corpus.
The .metadata_all() method has the following arguments:
• encoding: Encoding for files. The default is utf-8.
• path: The path to the corpus.
• storage_options: Options that are necessary for connecting to Object Storage.
The next example processes a corpus of PDF documents using pdfplumber, and prints the metadata for the first
document.
dl = textfactory.format('pdf').backend('pdfplumber').option(Options.FILE_NAME)
metadata = dl.metadata_all(
    path=f'oci://{bucket}@{namespace}/pdf_sample/Emerging Infectious Diseases copyright info.pdf',
    storage_options={"config": {}}
)
next(metadata)
The backend that is used can affect what metadata is returned. For example, the Tika backend returns more metadata
than pdfplumber, and the names of the metadata elements are also different. The following example processes
the same PDF document as previously used, but you can see that there is a difference in the metadata.
dl = textfactory.format('pdf').backend('default')
metadata = dl.metadata_all(
    path=f'oci://{bucket}@{namespace}/pdf_sample/Emerging Infectious Diseases copyright info.pdf',
    storage_options={"config": {}}
)
next(metadata)
{'Content-Type': 'application/pdf',
'Creation-Date': '2021-08-02T23:40:12Z',
'Last-Modified': '2021-08-02T23:40:12Z',
'Last-Save-Date': '2021-08-02T23:40:12Z',
'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
'org.apache.tika.parser.pdf.PDFParser'],
'access_permission:assemble_document': 'true',
'access_permission:can_modify': 'true',
'access_permission:can_print': 'true',
'access_permission:can_print_degraded': 'true',
'access_permission:extract_content': 'true',
'access_permission:extract_for_accessibility': 'true',
'access_permission:fill_in_form': 'true',
'access_permission:modify_annotations': 'true',
'created': '2021-08-02T23:40:12Z',
'date': '2021-08-02T23:40:12Z',
'dc:format': 'application/pdf; version=1.4',
'dcterms:created': '2021-08-02T23:40:12Z',
'dcterms:modified': '2021-08-02T23:40:12Z',
'meta:creation-date': '2021-08-02T23:40:12Z',
'meta:save-date': '2021-08-02T23:40:12Z',
'modified': '2021-08-02T23:40:12Z',
'pdf:PDFVersion': '1.4',
'pdf:charsPerPage': '2660',
'pdf:docinfo:created': '2021-08-02T23:40:12Z',
'pdf:docinfo:creator_tool': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'pdf:docinfo:modified': '2021-08-02T23:40:12Z',
'pdf:docinfo:producer': 'Skia/PDF m91',
'pdf:encrypted': 'false',
'xmpTPg:NPages': '1'}
``.metadata_schema()``
As briefly discussed in the .metadata_all() method section, there is no standard set of metadata across all doc-
uments. The .metadata_schema() method is a convenience method that returns what metadata is available in the
corpus. It returns a list of all observed metadata fields in the corpus. Since each document can have a different set of
metadata, all the values returned may not exist in all documents. It should also be noted that the engine used can return
different metadata for the same document.
The path parameter is the only required parameter as it defines the location of the corpus.
Often, you don’t want to process an entire corpus of documents to get a sense of what metadata is available. Generally,
the engine returns a fairly consistent set of metadata. The n_files option is handy because it limits the number of
files that are processed.
The .metadata_schema() method has the following arguments:
• encoding: Encoding for files. The default is utf-8.
• n_files: Maximum number of files to process. The default is 1.
• path: The path to the corpus.
• storage_options: Options that are necessary for connecting to Object Storage.
The following example uses the .metadata_schema() method to collect the metadata fields on the first two files in
the corpus. The n_files=2 parameter is used to control the number of files that are processed.
dl = textfactory.format('pdf').backend('pdfplumber')
schema = dl.metadata_schema(
    f'oci://{bucket}@{namespace}/pdf_sample/*.pdf',
    storage_options={"config": {}},
    n_files=2
)
print(schema)
The text_dataset module has the ability to augment the returned records with additional information using the
.option() method. This method takes an enum from the Options class. The .option() method can be used multiple
times on the same DataLoader to select a set of additional information that is returned. The Options.FILE_NAME
enum returns the filename that is associated with the record. The Options.FILE_METADATA enum allows you to
extract individual values from the document’s metadata. Notice that the engine used can return different metadata for
the same document.
9.7.2.3.1 Examples
``Options.FILE_NAME``
The following example uses .option(Options.FILE_NAME) to add the filename of each record that is
returned. The example uses the txt for the FileProcessor, and Tika for the backend. The engine is Pandas so a
dataframe is returned. The df_args option is used to rename the columns of the dataframe. Notice that the returned
dataframe has a column named path. This is the information that was added to the record from the .option(Options.
FILE_NAME) method.
dl = textfactory.format('txt').backend('tika').engine('pandas').option(Options.FILE_NAME)
df = dl.read_text(
    path=f'oci://{bucket}@{namespace}/20news-small/**/[1-9]*',
    storage_options={"config": {}},
    df_args={'columns': ['path', 'text']}
)
df.head()
``Options.FILE_METADATA``
You can add metadata about a document to a record using .option(Options.FILE_METADATA, {'extract':
['<key1>', '<key2>']}). When using Options.FILE_METADATA, there is a required second parameter. It takes a
dictionary where the key is the action to be taken. In the next example, the extract key provides a list of metadata
that can be extracted. When a list is used, the returned value is also a list of the metadata values. The example uses
repeated calls to .option() where different metadata values are extracted. In this case, a list is not returned, but each
value is in a separate Pandas column.
dl = textfactory.format('docx').engine('pandas') \
    .option(Options.FILE_METADATA, {'extract': ['Character Count']}) \
    .option(Options.FILE_METADATA, {'extract': ['Paragraph-Count']}) \
    .option(Options.FILE_METADATA, {'extract': ['Author']})
df = dl.read_text(
    path=f'oci://{bucket}@{namespace}/docx_sample/*.docx',
    storage_options={"config": {}},
    df_args={'columns': ['character count', 'paragraph count', 'author', 'content']},
)
df.head()
The text_dataset module supports a number of file processors and backends. However, it isn’t practical to provide
these for all possible documents. So, the text_dataset allows you to create your own.
When creating a custom file processor, you must register it with ADS using the FileProcessorFactory.
register() method. The first parameter is the name that you want to associate with the file processor. The second
parameter is the class that is to be registered. There is no need to register the backend class.
To create a backend, you need to develop a class that inherits from the ads.text_dataset.backends.Base class.
In your class, you need to overload any of the following methods that you want to use: .read_line(),
.read_text(), .convert_to_text(), and .get_metadata(). The .get_metadata() method must be overloaded if
you want to use the .metadata_all() and .metadata_schema() methods in your backend.
The .convert_to_text() method takes a file handler, destination path, filename, and storage options as parameters.
This method must write the plain text file to the destination path, and return the path of the file.
The .get_metadata() method takes a file handler as an input parameter, and returns a dictionary of the metadata.
The .metadata_all() and .metadata_schema() methods don't need to be overloaded because they use the
.get_metadata() method to return their results.
The .read_line() method must take a file handle, and have a yield statement that returns a plain text line from the
document.
The .read_text() method has the same requirements as the .read_line() method, except it must yield the entire
document as plain text.
The following are the method signatures:
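The signatures themselves are not reproduced here; the following is a minimal sketch of the shape such a backend might take, based on the descriptions above (the parameter names are assumptions for illustration):

from ads.text_dataset.backends import Base

class MyBackend(Base):
    def read_line(self, fhandler):
        # yield one plain text line at a time from the open file handle
        with fhandler as f:
            for line in f:
                yield line.decode()

    def read_text(self, fhandler):
        # yield the entire document as a single plain text string
        with fhandler as f:
            yield f.read().decode()

    def convert_to_text(self, fhandler, dst_path, fname, storage_options=None):
        # write the plain text to dst_path and return the path of the written file
        raise NotImplementedError

    def get_metadata(self, fhandler):
        # return a dictionary of metadata describing the document
        return {}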
To create a custom file processor you must develop a class that inherits from ads.text_dataset.extractor.
FileProcessor. Generally, there are no methods that need to be overloaded. However, the backend_map class
variable has to be defined. This is a dictionary where the key is the name of a backend that the processor supports, and the value
is the backend class. There must be a key called default that is used when no backend is defined for the
DataLoader. An example of the backend_map is:
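The example itself is not shown above; a minimal sketch, reusing the backend from the previous section (the class names are illustrative, not part of ADS), might look like:

from ads.text_dataset.extractor import FileProcessor

class MyFileProcessor(FileProcessor):
    # 'default' is required; additional keys name alternative backends for this format
    backend_map = {'default': MyBackend}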
9.7.2.4.3 Example
In the next example, you create a custom backend class called TextReverseBackend. It overloads the .read_line() and
.read_text() methods. This toy backend returns the records in reverse order.
The TextReverseFileProcessor class is used to create a new file processor for use with the backend. This class
has the backend_map class variable that maps the backend label to the backend object. In this case, the only format
that is provided is the default class.
Having defined the backend (TextReverseBackend) and file processor (TextReverseFileProcessor) classes,
the format must be registered. You register it with the FileProcessorFactory.register('text_reverse',
TextReverseFileProcessor) command where the first parameter is the format and the second parameter is the
file processor class.
class TextReverseBackend(Base):
    def read_line(self, fhandler):
        with fhandler as f:
            for line in f:
                yield line.decode()[::-1]

class TextReverseFileProcessor(FileProcessor):
    backend_map = {'default': TextReverseBackend}

FileProcessorFactory.register('text_reverse', TextReverseFileProcessor)
Having created the custom backend and file processor, you use the .read_line() method to read in one record and
print it.
dl = textfactory.format('text_reverse')
reverse_text = dl.read_line(
    f'oci://{bucket}@{namespace}/20news-small/rec.sport.baseball/100521',
    total_lines=1,
    storage_options={"config": {}},
)
text = next(reverse_text)[0]
print(text)
The .read_line() method in the TextReverseBackend class reverses the characters in each line of text that is
processed. You can confirm this by reversing it back.
text[::-1]
TEN
VISUALIZE DATA
Data visualization is an important aspect of data exploration, analysis, and communication. Generally, visualization of
the data is one of the first steps in any analysis. It allows the analysts to efficiently gain an understanding of the data
and guides the exploratory data analysis (EDA) and the modeling process.
An efficient and flexible data visualization tool can provide a lot of insight into the data. ADS provides a smart visual-
ization tool. It automatically detects the data type and renders plots that optimally represent the characteristics of the
data. Within ADS, custom visualizations can be created using any plotting library.
10.1 Automatic
The ADS show_in_notebook() method creates a comprehensive preview of all the basic information about a dataset
including:
• The predictive data type (for example, regression, binary classification, or multinomial classification).
• The number of columns and rows.
• Feature type information.
• Summary visualization of each feature.
• The correlation map.
• Any warnings about data conditions that you should be aware of.
To improve plotting performance, the ADS show_in_notebook() method uses an optimized subset of the data. This
smart sample is selected so that it is statistically representative of the full dataset. The correlation map is only displayed
when the data has only numerical (continuous or ordinal) columns.
ds.show_in_notebook()
To visualize the correlation, call the show_corr() method. If the correlation matrices have not been cached, this call
triggers the corr() function which calculates the correlation matrices.
corr() uses the following methods to calculate the correlation based on the data types:
• Continuous-Continuous: Pearson method (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
The correlations range from -1 to 1.
• Categorical-Categorical: Cramer's V method (https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V).
The correlations range from 0 to 1.
• Continuous-Categorical: Correlation Ratio method (https://en.wikipedia.org/wiki/Correlation_ratio).
The correlations range from 0 to 1.
Correlations are displayed independently because the correlations are calculated using different methodologies and the
ranges are not the same. Consolidating them into one matrix could be confusing and inconsistent.
Note: Continuous features consist of continuous and ordinal types. Categorical features consist of
categorical and zipcode types.
ds.show_corr(nan_threshold=0.8, correlation_methods='all')
By default, nan_threshold is set to 0.8. This means that if more than 80% of the values in a column are missing,
that column is dropped from the correlation calculation. nan_threshold should be between 0 and 1. Other options
include:
• correlation_methods: Methods to calculate the correlation. By default, only the Pearson correlation is
calculated and shown. You can select one or more from pearson, cramers v, and correlation ratio, or set it
to all to show all correlation charts.
• correlation_target: Defaults to None. It can be any columns of type continuous, ordinal, categorical
or zipcode. When correlation_target is set, only pairs that contain correlation_target display.
• correlation_threshold: Apply a filter to the correlation matrices and only exhibit the pairs whose correlation
values are greater than or equal to the correlation_threshold.
• force_recompute: Defaults to False. Correlation matrices are cached. Set force_recompute to True
to recalculate the correlation. Note that both the corr() and show_corr() methods can trigger calculation of the
correlation matrices if run with force_recompute set to True, or when no cached value exists.
show_in_notebook() calculates the correlation only when there are only numerical columns in the dataset.
• frac: Defaults to 1. The portion of the original data to calculate the correlation on. frac must be between 0
and 1.
• plot_type: Defaults to heatmap. Valid values are heatmap and bar. If bar is chosen, correlation_target
also has to be set and the bar chart will only show the correlation values of the pairs which have the target in
them.
ds.show_corr(correlation_target='col01', plot_type='bar')
To explore features, use the smart plot() method. It accepts one or two feature names. The show_in_notebook()
method automatically determines the best type of plot based on the type of features that are to be plotted.
Three different examples are described. They use a binary classification dataset with 1,500 rows and 21 columns. 13
of the columns have a continuous data type, and 8 are categorical.
• A single categorical feature: The plot() method detects that the feature is categorical because it only has the
values of 0 and 1. It then automatically renders a plot of the count of each category.
ds.plot("col02").show_in_notebook(figsize=(4,4))
• Categorical and continuous feature pair: ADS chooses the best plotting method, which is a violin plot.
ds.plot("col02", y="col01").show_in_notebook(figsize=(4,4))
• A pair of continuous features: ADS chooses a Gaussian heatmap as the best visualization. It generates a scatter
plot and assigns a color to each data point based on the local density (Gaussian kernel).
ds.plot("col01", y="col03").show_in_notebook()
10.2 Customized
ADS provides intelligent default options for your plots. However, the visualization API is flexible enough to let you
customize your charts or choose your own plotting library. You can use the ADS call() method to select your own
plotting routine.
10.2.1 Seaborn
In this example, a dataframe is passed directly to the Seaborn pair plot function. It does a faceted, pairwise plot between
all the features in the dataset. The function creates a grid of axes such that each variable in the data is shared in the
y-axis across a row and in the x-axis across a column. The diagonal axes are treated differently by drawing a histogram
of each feature.
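The code for this example is not reproduced above; a minimal sketch of the kind of Seaborn pair plot being described (the iris data here is a stand-in for the dataset's dataframe) is:

import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Build a small feature dataframe and draw a faceted, pairwise plot of its columns.
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
sns.pairplot(df.sample(50, random_state=42))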
10.2.2 Matplotlib
• Using Matplotlib:

ts_plot(df, figsize=(7,7))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = {'data': [1109, 696, 353, 192, 168, 86, 74, 65, 53]}
df = pd.DataFrame(data, index=['20-50 km', '50-75 km', '10-20 km', '75-100 km',
                               '3-5 km', '7-10 km', '5-7 km', '>100 km', '2-3 km'])
bar_plot(df, figsize=(7,7))
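The ts_plot() and bar_plot() helpers above are user-defined plotting routines whose definitions are not included here; a minimal sketch of what bar_plot() might look like, using plain Matplotlib, is:

import matplotlib.pyplot as plt

def bar_plot(df, figsize):
    # horizontal bar chart of the single 'data' column, labelled by the dataframe index
    ax = df['data'].plot(kind='barh', figsize=figsize, color='teal')
    ax.set_xlabel('counts')
    plt.show()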
This example uses the California earthquake data retrieved from United States Geological Survey (USGS) earthquake
catalog. It visualizes the location of major earthquakes.
earthquake.plot_gis_scatter(lon="longitude", lat="latitude")
ELEVEN
TRAIN MODELS
In this section you will learn about model training on the Data Science cloud service using a variety of popular frameworks.
This section covers the popular sklearn framework, along with gradient boosted tree estimators like LightGBM
and XGBoost, Oracle AutoML, and deep learning packages like TensorFlow and PyTorch.
The section covers how to serialize models and make use of the OCI Model Catalog to store model artifacts and
metadata, all using ADS to prepare the upload.
In the distributed training section you will see examples of how to work with Dask, Horovod, TensorFlow and PyTorch
to do multinode training.
TensorBoard provides the visualization and the tooling that is needed to watch and record model training progress
throughout the tuning stages.
11.1 ADSTuner
In addition to the other services for training models, ADS includes a hyperparameter tuning framework called
ADSTuner.
ADSTuner supports using several hyperparameter search strategies that plug into common model architectures like
sklearn.
ADSTuner further supports users defining their own search spaces and strategies. This makes ADSTuner functional
and useful with any ML library that doesn’t include hyperparameter tuning.
First, import the packages:
import category_encoders as ce
import lightgbm
import logging
import numpy as np
import os
import pandas as pd
import pytest
import sklearn
import xgboost
This is an example of running ADSTuner on a supported model, SGDClassifier from sklearn:
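The code for this example is not reproduced above; a minimal sketch, assuming the ADSTuner and exit criterion imports come from the ads.hpo modules, is:

from ads.hpo.search_cv import ADSTuner
from ads.hpo.stopping_criterion import NTrials

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

# Use the default search space for SGDClassifier and stop after 10 trials.
tuner = ADSTuner(SGDClassifier(), cv=3)
tuner.tune(X_train, y_train, exit_criterion=[NTrials(10)])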
ADSTuner generates a tuning report that lists its trials, best performing hyperparameters, and performance statistics
with:
You can use tuner.best_score to get the best score on the scoring metric used (accessible as tuner.scoring_name).
The best selected parameters are obtained with tuner.best_params and the complete record of trials with
tuner.trials.
If you have further compute resources and want to continue hyperparameter optimization on a model that has already
been optimized, you can use:
tuner.resume(exit_criterion=[TimeBudget(5)], loglevel=logging.NOTSET)
print('So far the best {} score is {}'.format(tuner.scoring_name, tuner.best_score))
print("The best trial found was number: " + str(tuner.best_index))
tuner.plot_best_scores()
tuner.plot_intermediate_scores()
tuner.search_space()
tuner.plot_contour_scores(params=['penalty', 'alpha'])
ADSTuner supports custom scoring functions and custom search spaces. This example uses a different model:
model2 = LogisticRegression()
tuner = ADSTuner(model2,
                 strategy={
                     'C': LogUniformDistribution(low=1e-05, high=1),
                     'solver': CategoricalDistribution(['saga']),
                     'max_iter': IntUniformDistribution(500, 1000, 50)},
                 scoring=make_scorer(f1_score, average='weighted'),
                 cv=3)
tuner.tune(X_train, y_train, exit_criterion=[NTrials(5)])
• ‘LGBMRegressor’,
• ‘SGDClassifier’,
• ‘SGDRegressor’
The AdaBoostRegressor model is not supported. This is an example of a custom strategy to use with this model:
model3 = AdaBoostRegressor()
X, y = load_boston(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
tuner = ADSTuner(model3, strategy={'n_estimators': IntUniformDistribution(50, 100)})
tuner.tune(X_train, y_train, exit_criterion=[TimeBudget(5)])
X = df.drop(target, axis=1)
y = df[target]
y = preprocessing.LabelEncoder().fit_transform(y)

numeric_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('num_scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_encoder', ce.woe.WOEEncoder())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('feature_selection', SelectKBest(f_classif, k=int(0.9 * num_features))),
        ('classifier', LogisticRegression())
    ]
)
score = make_scorer(customerize_score)

ads_search = ADSTuner(
    pipe,
    scoring=score,
    strategy='detailed',
    cv=2,
    random_state=42
)
ads_search.tune(X=X_train, y=y_train, exit_criterion=[NTrials(20)])
Overview:
A hyperparameter is a parameter that is used to control a learning process. This is in contrast to other parameters
that are learned in the training process. The process of hyperparameter optimization is to search for hyperparameter
values by building many models and assessing their quality. This notebook provides an overview of the ADSTuner
hyperparameter optimization engine. ADSTuner can optimize any estimator object that follows the scikit-learn API.
Objectives:
• Introduction
– Synchronous Tuning with Exit Criterion Based on Number of Trials
– Asynchronously Tuning with Exit Criterion Based on Time Budget
– Inspecting the Tuning Trials
• Defining a Custom Search Space and Score
– Changing the Search Space Strategy
• Optimizing a scikit-learn Pipeline()
• References
Important:
Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated
content. For example, when adding a database name, database_name = "<database_name>" would become
database_name = "production".
Datasets are provided as a convenience. Datasets are considered third party content and are not considered materials
under your agreement with Oracle applicable to the services. The iris dataset is distributed under the BSD license.
import category_encoders as ce
import lightgbm
import logging
import numpy as np
import os
import pandas as pd
import sklearn
import time
Introduction
Hyperparameter optimization requires a model, dataset, and an ADSTuner object to perform the search.
ADSTuner() performs a hyperparameter search using cross-validation. You can specify the number of folds you want
to use with the cv parameter.
Because the ADSTuner() needs a search space in which to tune the hyperparameters, you must use the strategy
parameter. This parameter can be set in two ways. You can specify detailed search criteria or you can use the built-
in defaults. For the supported model classes, ADSTuner provides perfunctory and detailed search spaces that are
optimized for the chosen class of model. The perfunctory option is optimized for a small search space so that the most
important hyperparameters are tuned. Generally, this option is used early in your search as it reduces the computational
cost and allows you to assess the quality of the model class that you are using. The detailed search space instructs
ADSTuner to cover a broad search space by tuning more hyperparameters. Typically, you would use it when you have
determined what class of model is best suited for the dataset and type of problem you are working on. If you have
experience with the dataset and have a good idea of what the best hyperparameter values are, you can explicitly specify
the search space. You pass a dictionary that defines the search space into the strategy.
The parameter storage takes a database URL. For example, sqlite:////home/datascience/example.db. When
storage is set to the default value None, a new sqlite database file is created internally in the tmp folder with a unique
name. The name format is sqlite:////tmp/hpo_*.db. study_name is the name of this study for this ADSTuner
object. Each ADSTuner object has a unique study_name. However, one database file can be shared among different
ADSTuner objects. load_if_exists controls whether to load an existing study from an existing database file. If
False, it raises a DuplicatedStudyError when the study_name exists.
The loglevel parameter controls the amount of logging information displayed in the notebook.
This notebook uses the scikit-learn SGDClassifier() model and the iris dataset. This model object is a regularized
linear model with stochastic gradient descent (SGD) used to optimize the model parameters.
The next cell creates the SGDClassifier() model, initializes an ADSTuner object, and loads the iris data.
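The cell itself is not shown; a minimal sketch, under the same assumption about the ADSTuner import path, is:

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from ads.hpo.search_cv import ADSTuner

X, y = load_iris(return_X_y=True)
tuner = ADSTuner(SGDClassifier(), cv=3)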
Each model class has a set of hyperparameters that you need to optimize. The strategy attribute returns what
strategy is being used. This can be perfunctory, detailed, or a dictionary that defines the strategy. The method
search_space() always returns a dictionary of hyperparameters that are to be searched. Any hyperparameter that is
required by the model, but is not listed, uses the default value that is defined by the model class. To see what search
space is being used for your model class when strategy is perfunctory or detailed use the search_space()
method to see the details.
The adstuner_search_space_update.ipynb notebook has detailed examples about how to work with and update
the search space.
The next cell displays the search strategy and the search space.
The tune() method starts a tuning process. It has a synchronous and asynchronous mode for tuning. The mode is set
with the synchronous parameter. When it is set to False, the tuning process runs asynchronously so it runs in the
background and allows you to continue your work in the notebook. When synchronous is set to True, the notebook
is blocked until tune() finishes running. The adntuner_sync_and_async.ipynb notebook illustrates this feature
in a more detailed way.
The ADSTuner object needs to know when to stop tuning. The exit_criterion parameter accepts a list of criteria
that cause the tuning to finish. If any of the criteria are met, then the tuning process stops. Valid exit criteria are:
• NTrials(n): Run for n number of trials.
• TimeBudget(t): Run for t seconds.
• ScoreValue(s): Run until the score value exceeds s.
The default behavior is to run for 50 trials (NTrials(50)).
The stopping criteria are listed in the ads.hpo.stopping_criterion module.
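For example, the criteria used throughout this section can be imported along these lines (the exact import is an assumption based on the module named above):

from ads.hpo.stopping_criterion import NTrials, ScoreValue, TimeBudget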
Synchronous Tuning with Exit Criterion Based on Number of Trials
This section demonstrates how to perform a synchronous tuning process with the exit criteria based on the number of
trials. In the next cell, the synchronous parameter is set to True and the exit_criterion is set to [NTrials(5)].
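The cell is not reproduced here; based on the parameters just described, it would be along the lines of:

tuner.tune(X, y, exit_criterion=[NTrials(5)], synchronous=True)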
You can access a summary of the trials by looking at the various attributes of the tuner object. The scoring_name
attribute is a string that defines the name of the scoring metric. The best_score attribute gives the best score of all
the completed trials. The best_params attribute gives the values of the hyperparameters that led to the
best score. Hyperparameters that are not in the search criteria are not reported.
print(f"So far the best {tuner.scoring_name} score is {tuner.best_score} and the best␣
˓→hyperparameters are {tuner.best_params}")
So far the best mean accuracy score is 0.9666666666666667 and the best hyperparameters␣
˓→are {'alpha': 0.002623793623610696, 'penalty': 'none'}
You can also look at the detailed table of all the trials attempted:
tuner.trials.tail()
# This cell returns right away since it's running asynchronously.
tuner.tune(exit_criterion=[TimeBudget(5)])
while tuner.status == State.RUNNING:
    print(f"So far the best score is {tuner.best_score} and the time left is {tuner.time_remaining}")
    time.sleep(1)
So far the best score is 0.9666666666666667 and the time left is 4.977275848388672
So far the best score is 0.9666666666666667 and the time left is 3.9661824703216553
So far the best score is 0.9666666666666667 and the time left is 2.9267797470092773
So far the best score is 0.9666666666666667 and the time left is 1.912914752960205
So far the best score is 0.9733333333333333 and the time left is 0.9021461009979248
So far the best score is 0.9733333333333333 and the time left is 0
The attribute best_index gives you the index in the trials data frame where the best model is located.
tuner.trials.loc[tuner.best_index, :]
number 10
value 0.98
datetime_start 2021-04-21 20:04:17.013347
datetime_complete 2021-04-21 20:04:18.623813
duration 0 days 00:00:01.610466
params_alpha 0.014094
params_penalty l1
user_attrs_mean_fit_time 0.16474
user_attrs_mean_score_time 0.024773
user_attrs_mean_test_score 0.98
user_attrs_metric mean accuracy
user_attrs_split0_test_score 1.0
user_attrs_split1_test_score 1.0
user_attrs_split2_test_score 0.94
user_attrs_std_fit_time 0.006884
user_attrs_std_score_time 0.00124
user_attrs_std_test_score 0.028284
tuner.plot_best_scores()
tuner.plot_intermediate_scores()
tuner.plot_contour_scores(params=['penalty', 'alpha'])
tuner.plot_parallel_coordinate_scores(params=['penalty', 'alpha'])
tuner.plot_edf_scores()
tuner.plot_param_importance()
tuner = ADSTuner(LogisticRegression(),
                 strategy={'C': LogUniformDistribution(low=1e-05, high=1),
                           'solver': CategoricalDistribution(['saga']),
                           'max_iter': IntUniformDistribution(500, 2000, 50)},
                 scoring=make_scorer(f1_score, average='weighted'),
                 cv=3)
tuner.tune(X, y, exit_criterion=[NTrials(5)], synchronous=True, loglevel=logging.WARNING)
tuner.search_space(strategy='detailed')
Alternatively, you can edit a subset of the search space by changing the range.
Here’s an example of using overwrite=True to reset to the default values for detailed:
tuner.search_space(strategy='detailed', overwrite=True)
X, y = load_iris(return_X_y=True)
X = pd.DataFrame(data=X, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = pd.DataFrame(data=y)
y = preprocessing.LabelEncoder().fit_transform(y)

numeric_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('num_scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_encoder', ce.woe.WOEEncoder())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('feature_selection', SelectKBest(f_classif, k=int(0.9 * num_features))),
        ('classifier', LogisticRegression())
    ]
)
You can define a custom score function. In this example, it is directly measuring how close the predicted y-values are
to the true y-values by taking the weighted average of the number of direct matches between the y-values.
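The definition of custom_score is not shown above; a minimal sketch that matches this description (a weighted average of exact matches between the true and predicted labels) could be:

import numpy as np

def custom_score(y_true, y_pred, sample_weight=None):
    # optionally weighted fraction of predictions that exactly match the true labels
    matches = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    return np.average(matches, weights=sample_weight)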
score = make_scorer(custom_score)
Again, you instantiate the ADSTuner() object and use it to tune the iris dataset:
ads_search = ADSTuner(
    pipe,
    scoring=score,
    strategy='detailed',
    cv=2,
    random_state=42
)
ads_search.tune(X=X_train, y=y_train, exit_criterion=[NTrials(20)])
The ads_search tuner can provide useful information about the tuning process, like the best parameter that was
optimized, the best score achieved, the number of trials, and so on.
ads_search.sklearn_steps
{'classifier__C': 9.47220908749299,
'classifier__dual': False,
'classifier__l1_ratio': 0.9967712201895031,
'classifier__penalty': 'elasticnet',
'classifier__solver': 'saga'}
ads_search.best_params
{'C': 9.47220908749299,
'dual': False,
'l1_ratio': 0.9967712201895031,
'penalty': 'elasticnet',
'solver': 'saga'}
ads_search.best_score
0.9733333333333334
ads_search.best_index
12
ads_search.trials.head()
ads_search.n_trials
20
References
• ADS Library Documentation
• Cross-Validation
• OCI Data Science Documentation
• Oracle Data & AI Blog
• Stochastic Gradient Descent
Distributed training is the process of taking a training workload which comprises training code and training data and
making both of these available in a cluster.
The conceptual difference with distributed training is that multiple workers coordinated in a cluster running on multiple
VM instances allow horizontal scaling of parallelizable tasks. While single node training is well suited to traditional
ML models, very large datasets or compute intensive workloads like deep learning and deep neural networks tend to
be better suited to distributed computing environments.
Distributed training benefits two classes of problems: one where the data is parallelizable, the other where the model
network is parallelizable. The most common and easiest to develop is data parallelism. Both forms of parallelism can
be combined to handle both large models and large datasets.
Data Parallelism
In this form of distributed training the training data is partitioned into some multiple of the number of nodes in the
compute cluster. Each node holds the model and is in communication with the other nodes participating in a coordinated
optimization effort.
Sometimes data sampling is possible, but often at the expense of model accuracy. With distributed training you can
avoid having to sample the data to fit a single node.
Model Parallelism
This form of distributed training is used when worker nodes need to synchronize and share model parameters.
The data fits into the memory of each worker, but the training takes too long. With model parallelism more epochs can
run and more hyper-parameters can be explored.
Distributed Training with OCI Data Science
The process by which you create distributed training workloads is the same regardless of the framework used.
Sections of the configuration differ between frameworks but the experience is consistent. The user brings only the
(framework specific) training python code, along with the yaml declarative definition.
ADS makes use of yaml to express configurations. The yaml specification has sections to describe the cluster infras-
tructure, the python runtime code, and the cluster framework.
The architecture is extensible to support well known frameworks and future versions of these. The set of service
provided frameworks for distributed training include:
• Dask for LightGBM, XGBoost, Scikit-Learn, and Dask-ML
• Horovod for PyTorch & Tensorflow
• PyTorch Distributed for PyTorch native using DistributedDataParallel - no training code changes to run
PyTorch model training on a cluster. You can use Horovod to do the same, which has some advanced features
like auto-tuning to improve allreduce performance, and fp16 gradient compression.
• Tensorflow Distributed for Tensorflow distributed training strategies like MirroredStrategy,
MultiWorkerMirroredStrategy and ParameterServerStrategy
Prerequisite:
1. Internet Connection
2. ADS cli is installed
3. Docker engine
To run a distributed workload on OCI Data Science Jobs, you need to prepare a container image with the source code
that you want to run and the framework (Dask|Horovod|PyTorch) setup. OCI Data Science provides you with the
Dockerfiles and bootstrapping scripts to build framework specific container images. This step creates a folder in the
current working directory called oci_distributed_training. This folder contains all the artifacts required to set up
and bootstrap the framework code. Refer to the README.md file for more details on how to build and push the container
image to OCIR.
11.2.1.3 Check Config File generated by the Main and the Worker Nodes
Prerequisite:
1. A cluster that is in In-Progress, Succeeded or Failed (Supported only in some cases)
2. Job OCID and the work dir of the cluster or a yaml file which contains all the details displayed during cluster
creation.
The main node generates MAIN_config.json and worker nodes generate WORKER_<job run ocid>_config.json.
You may want to check the configuration details to find the IP address of the main node. This can be useful to bring
up the dashboard for Dask or for debugging.
11.2.2 Configurations
11.2.2.1 Networks
You need to use a private subnet for distributed training and configure the security list to allow traffic through specific
ports for communication between nodes. The following default ports are used by the corresponding frameworks:
• Dask:
– Scheduler Port: 8786. More information here
– Dashboard Port: 8787. More information here
– Worker Ports: Default is Random. It is good to open a specific range of ports and then provide the value
in the startup option. More information here
– Nanny Process Ports: Default is Random. It is good to open a specific range of ports and then provide
the value in the startup option. More information here
• PyTorch: By default, PyTorch uses 29400.
• Horovod: allow TCP traffic on all ports within the subnet.
• Tensorflow: Worker Port: Allow traffic from all source ports to one worker port (default: 12345). If changed,
provide this in train.yaml config.
See also: Security Lists
Policy subject
In the following example, group <your_data_science_users> is the subject of the policy. When starting the
job from an OCI notebook session using resource principal, the subject should be dynamic-group, for example,
dynamic-group <your_notebook_sessions>
Distributed training uses OCI Container Registry to store the container image.
To push images to container registry, the manage repos policy is needed, for example:
To pull images from container registry for local testing, the use repos policy is needed, for example:
You can also restrict the permission to a specific repository, for example:
• read repos
• manage data-science-jobs
• manage data-science-job-runs
• use virtual-network-family
• manage log-groups
• use log-content
• read metrics
For example:
Tip
Use -h option to see options and usage help
Args
Note :
This command can be used to build a docker image from the ads CLI. It writes the config.ini file to the user's runtime
environment, which can then be referred to by other CLI commands.
If the -push flag is used in the command, then the docker image is pushed to the mentioned repository.
Sample config.ini file
[main]
tag = $TAG
registry = $NAME_OF_REGISTRY
dockerfile = $PATH_TO_DOCKERFILE
source_folder = $MOUNT_FOLDER_PATH
; mount oci keys for local testing
oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
Args
• -image: Name of the Docker image (default value is picked from config.ini file)
Command
Note
This command can be used to push images to the OCI repository. In case the name of the image is not mentioned it
refers to the image name from the config.ini file.
11.2.3.3 Run the container image on OCI Data Science or locally
Tip
Use -h option to see options and usage help
Args
• -f: Path to train.yaml file (required argument)
• -b :
– local → Run DT workflow on the local environment
– job → Run DT workflow on the OCI ML Jobs
– Note : default value is set to jobs
• -i: Auto increments the tag of the image
• -nopush: Doesn’t push the latest image to OCIR
• -nobuild: Doesn’t build the image
• -t: Tag of the docker image
• -reg: Docker repository
• -df: Dockerfile used to build the docker image
• -s: Source code directory
Note : The value “@image” for the image attribute in train.yaml is replaced at runtime using a combination of the -t and
-r params.
Command
Local Command
Jobs Command
Note
The command ads opctl run -f train.yaml is used to run distributed training jobs on OCI Data Science. By
default, it builds the new image and pushes it to the OCIR.
If required OCI API keys can be mounted by specifying the location in the config.ini file
Step 1:
Build the Docker image and run it locally.
If required, mount the code folder using the -s flag.
Step 2:
If the user has changed files only in the mounted folder and needs to run it locally, a build is not required.
If there are changes apart from the mounted folder and the user needs to run it locally, a build is required.
The -i flag is required only if the user needs to increment the tag of the image.
Step 3:
Finally, to run on a jobs platform
Before submitting your code to Data Science Jobs, check if the infra setup meets the framework requirement. Each
framework has a specific set of requirements.
ads opctl check runs a diagnosis by starting a single node job run using the container image specified in the
train.yaml file.
The train.yaml is the same yaml file that is defined for running distributed training code. The diagnostic report is saved
to the file provided in the --output option.
Here is a sample report generated for a Horovod cluster -
11.2.4 Dask
Dask is a flexible library for parallel computing in Python. The documentation is split between the two areas of writing
distributed training code using the Dask framework and creating both the container and yaml spec to run the distributed
workload.
Dask
This is a good choice when you want to use Scikit-Learn, XGBoost, LightGBM or have data parallel tasks for very
large datasets where the data can be partitioned.
Prerequisites
1. Internet Connection
2. ADS cli is installed
3. Install docker: https://docs.docker.com/get-docker
Write your training code:
While running distributed workload, the IP address of the scheduler is known only during the runtime. The IP address
is exported as environment variable - SCHEDULER_IP in all the nodes when the Job Run is in IN_PROGRESS state.
Create a dask.distributed.Client object using the environment variable to specify the IP address. For example:

client = Client(f"{os.environ['SCHEDULER_IP']}:{os.environ.get('SCHEDULER_PORT','8786')}")
Listing 1: gridsearch.py

from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

import pandas as pd
import joblib
import os
import argparse

# Default dataset size (assumed here; the original listing is truncated in the source).
default_n_samples = 1000

parser = argparse.ArgumentParser()
parser.add_argument("--n_samples", default=default_n_samples, type=int, help="size of dataset")
parser.add_argument("--cv", default=3, type=int, help="number of cross-validation folds")
args = parser.parse_args()

# Connect to the Dask scheduler using the IP address exported by the cluster.
client = Client(f"{os.environ['SCHEDULER_IP']}:{os.environ.get('SCHEDULER_PORT', '8786')}")

X, y = make_classification(n_samples=args.n_samples, random_state=42)

with joblib.parallel_backend("dask"):
    GridSearchCV(
        SVC(gamma="auto", random_state=0, probability=True),
        param_grid={
            "C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
            "kernel": ["rbf", "poly", "sigmoid"],
            "shrinking": [True, False],
        },
        return_train_score=False,
        cv=args.cv,
        n_jobs=-1,
    ).fit(X, y)
export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
export TAG=latest
The code is assumed to be in the current working directory. To override the source code directory, use the -s flag and
specify the code dir. This folder should be within the current working directory.
If you are behind a proxy, ads opctl will automatically use your proxy settings (defined via no_proxy, http_proxy and
https_proxy).
Define your workload yaml:
The yaml file is a declarative way to express the workload. Refer to the YAML schema for more details.
Listing 2: train.yaml

kind: distributed
apiVersion: v1.0
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    apiVersion: v1.0
    spec:
      projectId: oci.xxxx.<project_ocid>
      compartmentId: oci.xxxx.<compartment_ocid>
      displayName: my_distributed_training
      logGroupId: oci.xxxx.<log_group_ocid>
      logId: oci.xxx.<log_ocid>
      subnetId: oci.xxxx.<subnet-ocid>
      shapeName: VM.Standard2.4
      blockStorageSize: 50
  cluster:
    kind: dask
    apiVersion: v1.0
    spec:
      image: my-region.ocir.io/my-tenancy/dask-cluster-examples:dev
      workDir: "oci://my-bucket@my-namespace/daskexample/001"
      name: GridSearch Dask
      main:
        config:
      worker:
Use ads opctl to create the cluster infrastructure and run the workload:
Do a dry run to inspect how the yaml translates to Jobs and Job Runs. This does not create an actual Job or Job Run.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Creating Main Job with following details:
Name: main
Environment Variables:
OCI__MODE:MAIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating 2 worker jobs with following details:
Name: worker
Environment Variables:
OCI__MODE:WORKER
-----------------------------Ending dryrun mode----------------------------------
Test Locally:
Before submitting the workload to jobs, you can run it locally to test your code, dependencies, configurations etc. With
-b local flag, it uses a local backend. Further, when you need to run this workload on odsc jobs, simply use the
-b job flag instead.
ads opctl run -f train.yaml -b local
If your code requires the use of any OCI services (like an Object Storage bucket), you need to mount OCI keys from your local host
machine onto the container. This is already done for you assuming the typical location of OCI keys, ~/.oci. You can
modify it though, in case you have keys at a different location. You need to do this in the config.ini file.
oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
Note: This will automatically push the docker image to the OCI container registry repo.
Once running, you will see on the terminal an output similar to the below. Note that this yaml can be used as input to
ads opctl distributed-training show-config -f <info.yaml> - to both save and see the run info use tee
- for example:
ads opctl run -f train.yaml | tee info.yaml
Listing 3: info.yaml
jobId: oci.xxxx.<job_ocid>
mainJobRunId:
mainJobRunIdName: oci.xxxx.<job_run_ocid>
workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
You could stream the logs from any of the job run OCIDs using the ads opctl watch command. You could run this
command from multiple terminals to watch all of the job runs. Typically, watching mainJobRunId should yield the most
informative log.
To find the IP address of the scheduler dashboard, you could check the configuration file generated by the Main job by
running -
Main Info:
OCI__MAIN_IP: <ip address>
SCHEDULER_IP: <ip address>
tmpdir: oci://my-bucket@my-namesapce/daskcluster-testing/005/oci.xxxx.<job_ocid>
The Dask dashboard is hosted at http://{SCHEDULER_IP}:8787. If the IP address is reachable from your workstation
network, you can access the dashboard directly from your workstation. The alternative approach is to use a Bastion
host on the same subnet as the Job Runs and create an ssh tunnel from your workstation.
For more information about the dashboard, checkout https://docs.dask.org/en/stable/diagnostics-distributed.html
Saving Artifacts to Object Storage Buckets
In case you want to save the artifacts generated by the training process (model checkpoints, TensorBoard logs, etc.)
to an object bucket you can use the ‘sync’ feature. The environment variable OCI__SYNC_DIR exposes the directory
location that will be automatically synchronized to the configured object storage bucket location. Use this directory in
your training script to save the artifacts.
To configure the destination object storage bucket location, use the following settings in the workload yaml
file (train.yaml).
- name: SYNC_ARTIFACTS
  value: 1
- name: WORKSPACE
  value: "<bucket_name>"
- name: WORKSPACE_PREFIX
  value: "<bucket_prefix>"
Note: Change SYNC_ARTIFACTS to 0 to disable this feature. Use the OCI__SYNC_DIR env variable in your code to save
the artifacts. For example:
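The example itself is not included here; a minimal sketch of saving a trained model into the sync directory (the model and file name are illustrative) is:

import os
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Anything written under OCI__SYNC_DIR is synchronized to the configured
# WORKSPACE/WORKSPACE_PREFIX Object Storage location when SYNC_ARTIFACTS is 1.
sync_dir = os.environ.get("OCI__SYNC_DIR", "/tmp/sync")
os.makedirs(sync_dir, exist_ok=True)
joblib.dump(model, os.path.join(sync_dir, "model.joblib"))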
import time
import joblib

def long_running_function(i):
    time.sleep(.1)
    return i
This function can be called under Dask as a dask task which will be scheduled automatically by Dask across the cluster.
Watching the cluster utilization will show the tasks run on the workers.
with joblib.parallel_backend('dask'):
    joblib.Parallel(verbose=100)(
        joblib.delayed(long_running_function)(i)
        for i in range(10))
import os
from dask.distributed import Client
import joblib

# the cluster once created will make available the IP address of the Dask scheduler
# through the SCHEDULER_IP environment variable
client = Client(f"{os.environ['SCHEDULER_IP']}:8786")

with joblib.parallel_backend('dask'):
    # Your scikit-learn code
    ...
A full example showing scaling out CPU-bound workloads; workloads with datasets that fit in RAM, but have many
individual operations that can be done in parallel. To scale out to RAM-bound workloads (larger-than-memory datasets)
use one of the dask-ml provided parallel estimators, or the dask-ml wrapped XGBoost & LightGBM estimators.
import os

import numpy as np
import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

client = Client(f"{os.environ['SCHEDULER_IP']}:8786")

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
11.2.4.3.1 LightGBM
For further examples and comprehensive documentation see LightGBM and Github Examples
import os
import joblib
import dask.array as da
import lightgbm as lgb
from dask.distributed import Client
from sklearn.datasets import make_blobs

if __name__ == "__main__":
    print("loading data")
    size = int(os.environ.get("SIZE", 1000))
    X, y = make_blobs(n_samples=size, n_features=50, centers=2)

    client = Client(
        f"{os.environ['SCHEDULER_IP']}:{os.environ.get('SCHEDULER_PORT','8786')}"
    )

    # Partition the data as Dask arrays so training is distributed across the
    # cluster (chunk sizes shown here are illustrative assumptions).
    dX = da.from_array(X, chunks=(100, 50))
    dy = da.from_array(y, chunks=(100,))

    print("beginning training")
    dask_model = lgb.DaskLGBMClassifier(n_estimators=10)
    dask_model.fit(dX, dy)
    assert dask_model.fitted_
11.2.4.3.2 XGBoost
import os
from distributed import LocalCluster, Client
import xgboost as xgb

if __name__ == "__main__":
    with Client(f"{os.environ['SCHEDULER_IP']}:8786") as client:
        main(client)
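The main() function referenced above is not shown; a minimal sketch using XGBoost's Dask API (the dataset, chunk sizes, and parameters are illustrative) might be:

import xgboost as xgb
from dask import array as da
from sklearn.datasets import make_classification

def main(client):
    # build a small dataset and partition it across the cluster as Dask arrays
    X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
    dX = da.from_array(X, chunks=(1000, 20))
    dy = da.from_array(y, chunks=(1000,))

    # train with XGBoost's distributed Dask interface
    dtrain = xgb.dask.DaskDMatrix(client, dX, dy)
    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "eval_metric": "logloss"},
        dtrain,
        num_boost_round=10,
    )
    print(output["booster"])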
You can set up a Dask cluster to run using TLS. To do so, you need three things -
1. CA Certificate
2. A Certificate signed by CA
3. Private key of the certificate
For more details refer to the Dask documentation.
Self signed Certificate using openssl
openssl lets you create the test CA and certificates required to set up TLS connectivity for the Dask cluster. Use the commands
below to create the certificates in your code folder. When the container image is built, all the artifacts in the code folder are
copied to the /code directory inside the container image.
1. Generate CA Certificate
openssl req -x509 -nodes -newkey rsa:4096 -days 10 -keyout dask-tls-ca-key.pem -out dask-tls-ca-cert.pem -subj "/C=US/ST=CA/CN=ODSC CLUSTER PROVISIONER"
2. Generate CSR
openssl req -nodes -newkey rsa:4096 -keyout dask-tls-key.pem -out dask-tls-req.pem -subj "/C=US/ST=CA/CN=DASK CLUSTER"
3. Sign CSR
4. Follow the container build instructions here to build, tag, and push the image to OCIR.
5. Create a cluster definition YAML and configure the certificate information under cluster/config/startOptions. Here is an example:
kind: distributed
apiVersion: v1.0
spec:
infrastructure:
kind: infrastructure
type: dataScienceJob
apiVersion: v1.0
spec:
projectId: oci.xxxx.<project_ocid>
compartmentId: oci.xxxx.<compartment_ocid>
displayName: my_distributed_training
logGroupId: oci.xxxx.<log_group_ocid>
logId: oci.xxx.<log_ocid>
subnetId: oci.xxxx.<subnet-ocid>
shapeName: VM.Standard2.4
blockStorageSize: 50
cluster:
kind: dask
apiVersion: v1.0
spec:
image: iad.ocir.io/mytenancy/dask-cluster-examples:dev
workDir: oci://mybucket@mytenancy/daskexample/001
name: LGBM Dask
main:
config:
startOptions:
- --tls-ca-file /code/dask-tls-ca-cert.pem
- --tls-cert /code/dask-tls-cert.pem
- --tls-key /code/dask-tls-key.pem
worker:
config:
startOptions:
- --tls-ca-file /code/dask-tls-ca-cert.pem
- --tls-cert /code/dask-tls-cert.pem
- --tls-key /code/dask-tls-key.pem
replicas: 2
runtime:
kind: python
apiVersion: v1.0
1. Create the certificate authority, certificate, and private key inside the OCI Certificates console.
2. Create a cluster definition YAML and configure the certificate information under cluster/config/startOptions. Here is an example:
kind: distributed
apiVersion: v1.0
spec:
infrastructure:
kind: infrastructure
type: dataScienceJob
apiVersion: v1.0
spec:
projectId: oci.xxxx.<project_ocid>
compartmentId: oci.xxxx.<compartment_ocid>
displayName: my_distributed_training
logGroupId: oci.xxxx.<log_group_ocid>
logId: oci.xxx.<log_ocid>
subnetId: oci.xxxx.<subnet-ocid>
shapeName: VM.Standard2.4
blockStorageSize: 50
cluster:
kind: dask
apiVersion: v1.0
spec:
image: iad.ocir.io/mytenancy/dask-cluster-examples:dev
workDir: oci://mybucket@mytenancy/daskexample/001
name: LGBM Dask
certificate:
caCert:
id: ocid1.certificateauthority.oc1.xxx.xxxxxxx
downloadLocation: /code/dask-tls-ca-cert.pem
cert:
Dask scheduler
The Dask scheduler is launched with the dask-scheduler command. By default, no arguments are supplied to dask-scheduler. You can influence the startup options by adding them to startOptions under the cluster/spec/main/config section of the cluster YAML definition.
For example, here is how you could change the scheduler port number:
# Note only portion of the yaml file is shown here for brevity.
cluster:
kind: dask
apiVersion: v1.0
spec:
image: region.ocir.io/my-tenancy/image:tag
workDir: "oci://my-bucket@my-namespace/daskcluster-testing/005"
ephemeral: True
name: My Precious
main:
config:
startOptions:
- --port 8788
Dask worker
The Dask worker is launched with the dask-worker command. By default, no arguments are supplied to dask-worker. You can influence the startup options by adding them to startOptions under the cluster/spec/worker/config section of the cluster YAML definition.
For example, here is how you could change the worker port, nanny port, number of workers per host, and number of threads per process:
# Note only portion of the yaml file is shown here for brevity.
cluster:
kind: dask
apiVersion: v1.0
spec:
image: region.ocir.io/my-tenancy/image:tag
workDir: "oci://my-bucket@my-namespace/daskcluster-testing/005"
ephemeral: True
name: My Precious
main:
config:
worker:
config:
startOptions:
- --worker-port 8700:8800
- --nanny-port 3000:3100
- --nworkers 8
- --nthreads 2
You can set configuration parameters that Dask recognizes by adding them to cluster/spec/config/env, cluster/spec/main/config/env, or cluster/spec/worker/config/env. If a configuration value is the same for both the scheduler and worker sections, set it in the cluster/spec/config/env section.
# Note only portion of the yaml file is shown here for brevity.
cluster:
kind: dask
apiVersion: v1.0
spec:
image: region.ocir.io/my-tenancy/image:tag
workDir: "oci://my-bucket@my-tenancy/daskcluster-testing/005"
ephemeral: True
name: My Precious
config:
env:
- name: DASK_ARRAY__CHUNK_SIZE
value: 128 MiB
- name: DASK_DISTRIBUTED__WORKERS__MEMORY__SPILL
value: 0.85
- name: DASK_DISTRIBUTED__WORKERS__MEMORY__TARGET
value: 0.75
The Dask dashboard allows you to monitor the progress of the tasks. It gives you a real-time view of resource usage, task status, number of workers, task distribution, and so on. To learn more about the Dask dashboard, refer to this link.
Prerequisite
1. IP address of the Main/Scheduler Node. Use ads opctl distributed-training show-config or find the
IP address from the logs of the main job run.
2. The default port is 8787. You can override this port in cluster/main/config/startOptions in the cluster
definition file.
3. Allow ingress to the port 8787 in the security list associated with the Subnet of Main/Scheduler node.
The dashboard is accessible at <SCHEDULER_IP>:8787. The IP address may not always be reachable from your workstation, especially if you are using a subnet that is not connected to your corporate network. To overcome this, you can set up a bastion host on the private regional subnet that was added to the job run and create an SSH tunnel from your workstation, through the bastion host, to the Job Run instance at <SCHEDULER_IP>.
Here are the steps to set up a bastion host that allows you to connect to the scheduler dashboard:
1. Launch a compute instance (Linux or Windows) with its primary VNIC on a public subnet or on a subnet that is connected to your corporate network.
2. Attach a secondary VNIC on the subnet used for starting the cluster. Follow the steps detailed here on how to set up and configure the host for the secondary VNIC.
3. Create a public IP if you need access to the dashboard over the internet.
Linux instance
If you set up a Linux instance, you can create an SSH tunnel from your workstation and access the scheduler dashboard at localhost:8787. To set up the SSH tunnel:
Windows instance
RDP to the Windows instance and access the dashboard at <SCHEDULER_IP>:8787 from a browser running within the Windows instance.
11.2.5 Horovod
Prerequisites
1. Internet Connection
2. ADS cli is installed
3. Install docker: https://docs.docker.com/get-docker
Write your training code:
Your model training script (TensorFlow or PyTorch) needs to be adapted to use (Elastic) Horovod APIs for distributed training. Refer to Writing distributed code with the Horovod framework.
Also see: Horovod Examples
For this example, the code to run was inspired by an example found here. There are minimal changes to this script to save the training artifacts and TensorBoard logs to a folder referenced by the OCI__SYNC_DIR environment variable. OCI__SYNC_DIR is a pre-provisioned folder which is synchronized with an object storage bucket during the training process.
Listing 4: train.py
# Script adapted from https://github.com/horovod/horovod/blob/master/examples/elastic/tensorflow2/tensorflow2_keras_mnist_elastic.py
# ==============================================================================
import argparse
import tensorflow as tf
import horovod.tensorflow.keras as hvd
from distutils.version import LooseVersion
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
parser.add_argument(
"--data-dir",
help="location of the training dataset in the local filesystem (will be downloaded␣
˓→if needed)",
default='/code/data/mnist.npz'
)
args = parser.parse_args()
if args.use_mixed_precision:
print(f"using mixed precision {args.use_mixed_precision}")
if LooseVersion(tf.__version__) >= LooseVersion("2.4.0"):
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy("mixed_float16")
else:
policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
tf.keras.mixed_precision.experimental.set_policy(policy)
# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")
import numpy as np
minist_local = args.data_dir
def load_data():
print("using pre-fetched dataset")
with np.load(minist_local, allow_pickle=True) as f:
x_train, y_train = f["x_train"], f["y_train"]
x_test, y_test = f["x_test"], f["y_test"]
return (x_train, y_train), (x_test, y_test)
(mnist_images, mnist_labels), _ = (
load_data()
if os.path.exists(minist_local)
dataset = tf.data.Dataset.from_tensor_slices(
(
tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
tf.cast(mnist_labels, tf.int64),
)
)
dataset = dataset.repeat().shuffle(10000).batch(128)
model = tf.keras.Sequential(
[
tf.keras.layers.Conv2D(32, [3, 3], activation="relu"),
tf.keras.layers.Conv2D(64, [3, 3], activation="relu"),
tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Dropout(0.25),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
state.register_reset_callbacks([on_state_reset])
callbacks = [
hvd.callbacks.MetricAverageCallback(),
hvd.elastic.UpdateEpochStateCallback(state),
hvd.elastic.UpdateBatchStateCallback(state),
hvd.elastic.CommitStateCallback(state),
]
# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
train(state)
Note: If you choose to run a PyTorch example instead, use horovod-pytorch as the framework.
export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
export TAG=latest
The code is assumed to be in the current working directory. To override the source code directory, use the -s flag and
specify the code dir. This folder should be within the current working directory.
If you are behind a proxy, ads opctl will automatically use your proxy settings (defined via no_proxy, http_proxy, and https_proxy).
SSH Setup:
In Horovod distributed training, communication between the scheduler and worker(s) uses a secure SSH connection. For this purpose, SSH keys need to be provisioned on the scheduler and worker nodes. This is already taken care of in the Docker images: when the Docker image is built, an SSH key pair is placed inside the image with the required configuration changes (adding the public key to the authorized_keys file). This enables a secure connection between the scheduler and the workers.
Define your workload yaml:
The yaml file is a declarative way to express the workload.
Listing 5: train.yaml
kind: distributed
apiVersion: v1.0
spec:
infrastructure: # This section maps to Job definition. Does not include environment variables
kind: infrastructure
type: dataScienceJob
apiVersion: v1.0
spec:
projectId: oci.xxxx.<project_ocid>
compartmentId: oci.xxxx.<compartment_ocid>
displayName: HVD-Distributed-TF
# - name: MIN_NP
# value: 2
# - name: MAX_NP
# value: 4
# - name: SLOTS
# value: 2
- name: WORKER_PORT
value: 12345
- name: START_TIMEOUT #Optional: Defaults to 600.
value: 600
- name: ENABLE_TIMELINE # Optional: Disabled by default. Significantly increases training duration if switched on (1).
value: 0
- name: SYNC_ARTIFACTS #Mandatory: Switched on by Default.
value: 1
- name: WORKSPACE #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket to sync generated artifacts to.
value: "<bucket_name>"
- name: WORKSPACE_PREFIX #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket folder to sync generated artifacts to.
value: "<bucket_prefix>"
- name: HOROVOD_ARGS # Parameters for cluster tuning.
value: "--verbose"
main:
name: "scheduler"
replicas: 1 #this will be always 1
worker:
name: "worker"
replicas: 2 #number of workers
runtime:
kind: python
apiVersion: v1.0
spec:
entryPoint: "/code/train.py" #location of user's training script in docker image.
args: #any arguments that the training script requires.
env:
Use ads opctl to create the cluster infrastructure and run the workload:
Do a dry run to inspect how the yaml translates to Job and Job Runs
ads opctl run -f train.yaml --dry-run
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Creating Main Job with following details:
Name: scheduler
Environment Variables:
OCI__MODE:MAIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating 2 worker jobs with following details:
Name: worker
Environment Variables:
OCI__MODE:WORKER
-----------------------------Ending dryrun mode----------------------------------
Test Locally:
Before submitting the workload to Jobs, you can run it locally to test your code, dependencies, configurations, etc. With the -b local flag, it uses a local backend. When you need to run this workload on ODSC Jobs, simply use the -b job flag instead.
If your code needs to use any OCI services (such as an object storage bucket), you need to mount your OCI keys from your local host machine onto the container. This is already done for you, assuming the typical location of the OCI keys, ~/.oci. If your keys are at a different location, you can modify this in the config.ini file.
oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
Note: This will automatically push the Docker image to the OCI Container Registry repository.
Once running, you will see output on the terminal similar to the one below. Note that this YAML can be used as input to ads opctl distributed-training show-config -f <info.yaml>. To both save and see the run information, use tee, for example:
Listing 6: info.yaml
jobId: oci.xxxx.<job_ocid>
mainJobRunId:
mainJobRunIdName: oci.xxxx.<job_run_ocid>
workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
otherJobRunIds:
- workerJobRunIdName_1: oci.xxxx.<job_run_ocid>
- workerJobRunIdName_2: oci.xxxx.<job_run_ocid>
- workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
To save the training artifacts to an object storage bucket, you can use the 'sync' feature. The environment variable OCI__SYNC_DIR exposes the directory location that will be automatically synchronized to the configured object storage bucket location. Use this directory in your training script to save the artifacts.
To configure the destination object storage bucket location, use the following settings in the workload YAML file (train.yaml).
- name: SYNC_ARTIFACTS
value: 1
- name: WORKSPACE
value: "<bucket_name>"
- name: WORKSPACE_PREFIX
value: "<bucket_prefix>"
Note: Change SYNC_ARTIFACTS to 0 to disable this feature. Use the OCI__SYNC_DIR environment variable in your code to save the artifacts. For example:
tf.keras.callbacks.ModelCheckpoint(os.path.join(os.environ.get("OCI__SYNC_DIR"),"ckpts",'checkpoint-{epoch}.h5'))
11.2.5.2.1 TensorFlow
To use Horovod in TensorFlow, the following modifications are required in the training script:
1. Import Horovod and initialize it.
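A minimal sketch of this step (standard Horovod usage for TensorFlow Keras, not taken from the listing above):
import horovod.tensorflow.keras as hvd

# Initialize the Horovod runtime; this must happen before any other Horovod call.
hvd.init()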
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
opt = hvd.DistributedOptimizer(opt)
5. Modify your code to save checkpoints (and any other artifacts) only in the rank-0 training process to prevent other workers from corrupting them.
if hvd.rank() == 0:
tf.keras.callbacks.ModelCheckpoint(ckpts_path)
tf.keras.callbacks.TensorBoard(tblogs_path)
6. OCI Data Science Horovod workloads are based on Elastic Horovod. In addition to the above changes, the training script also needs to use state synchronization (see the sketch after this list). In summary, this means:
a. Use the decorator hvd.elastic.run to wrap the main training process.
b. Use hvd.elastic.State to add all variables that need to be synchronized across the workers.
c. Save the state periodically, using hvd.elastic.State.
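A hedged sketch of how these pieces fit together for a Keras model (the tiny model and synthetic data here are placeholders, not part of the documented example):
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Tiny placeholder model and data, just to show the elastic-state wiring.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=hvd.DistributedOptimizer(tf.keras.optimizers.Adam()), loss="mse")
x, y = np.random.rand(64, 4), np.random.rand(64, 1)

# Wrap the mutable training state so Horovod can restore and re-broadcast it
# when workers join or leave.
state = hvd.elastic.KerasState(model, batch=0, epoch=0)

callbacks = [
    hvd.elastic.UpdateEpochStateCallback(state),
    hvd.elastic.CommitStateCallback(state),
]

@hvd.elastic.run
def train(state):
    # Resume from the last committed epoch after a membership change.
    model.fit(x, y, epochs=3, initial_epoch=state.epoch, callbacks=callbacks, verbose=0)

train(state)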
A complete example can be found in the Write your training code section. More examples can be found here. Refer to Horovod with TensorFlow and Horovod with Keras for more details.
11.2.5.2.2 PyTorch
To use Horovod in PyTorch, the following modifications are required in the training script:
1. Import Horovod and initialize it.
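A minimal sketch of this step (standard Horovod usage for PyTorch, not taken from the listing below):
import horovod.torch as hvd

# Initialize the Horovod runtime; this must happen before any other Horovod call.
hvd.init()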
torch.manual_seed(args.seed)
if args.cuda:
# Horovod: pin GPU to local rank.
torch.cuda.set_device(hvd.local_rank())
torch.cuda.manual_seed(args.seed)
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=model.named_parameters(),
compression=compression,
op=hvd.Adasum if args.use_adasum else hvd.Average
)
5. Modify your code to save checkpoints only in the rank-0 training process to prevent other workers from corrupting them.
6. Like TensorFlow, Horovod PyTorch scripts also need to use state synchronization. Refer to the TensorFlow section above.
Here is a complete PyTorch sample, inspired by examples found here and here.
Listing 7: train.py
# Script adapted from https://github.com/horovod/horovod/blob/master/examples/elastic/pytorch/pytorch_mnist_elastic.py
# ==============================================================================
import argparse
import os
from filelock import FileLock
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import torch.utils.data.distributed
import horovod.torch as hvd
from torch.utils.tensorboard import SummaryWriter
# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
help='disables CUDA training')
parser.add_argument('--seed', type=int, default=42, metavar='S',
help='random seed (default: 42)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',
help='how many batches to wait before logging training status')
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
help='use fp16 compression during allreduce')
parser.add_argument('--use-adasum', action='store_true', default=False,
help='use adasum algorithm to do reduction')
parser.add_argument('--data-dir',
                    help='location of the training dataset in the local filesystem (will be downloaded if needed)')
args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()
if args.cuda:
# Horovod: pin GPU to local rank.
torch.cuda.set_device(hvd.local_rank())
torch.cuda.manual_seed(args.seed)
test_dataset = \
datasets.MNIST(data_dir, train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
]))
# Horovod: use DistributedSampler to partition the test data.
test_sampler = torch.utils.data.distributed.DistributedSampler(
test_dataset, num_replicas=hvd.size(), rank=hvd.rank())
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.test_batch_size,
sampler=test_sampler, **kwargs)
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
self.conv2_drop = nn.Dropout2d()
self.fc1 = nn.Linear(320, 50)
self.fc2 = nn.Linear(50, 10)
model = Net()
if args.cuda:
# Move model to GPU.
model.cuda()
# If using GPU Adasum allreduce, scale learning rate by local_size.
if args.use_adasum and hvd.nccl_built():
lr_scaler = hvd.local_size()
def create_dir(dir):
if not os.path.exists(dir):
os.makedirs(dir)
# Horovod: average metrics from distributed training.
class Metric(object):
def __init__(self, name):
self.name = name
self.sum = torch.tensor(0.)
self.n = torch.tensor(0.)
@property
def avg(self):
return self.sum / self.n
train_sampler.set_epoch(state.epoch)
steps_remaining = len(train_loader) - state.batch
if args.cuda:
data, target = data.cuda(), target.cuda()
state.optimizer.zero_grad()
output = state.model(data)
loss = F.nll_loss(output, target)
train_loss.update(loss)
loss.backward()
state.optimizer.step()
if state.batch % args.log_interval == 0:
# Horovod: use train_sampler to determine the number of examples in
# this worker's partition.
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
state.epoch, state.batch * len(data), len(train_sampler),
100.0 * state.batch / len(train_loader), loss.item()))
state.commit()
if writer:
writer.add_scalar("Loss", train_loss.avg, state.epoch)
if hvd.rank() == 0:
chkpt_path = os.path.join(chkpts_dir, checkpoint_format.format(epoch=state.epoch + 1))
chkpt = {
'model': state.model.state_dict(),
'optimizer': state.optimizer.state_dict(),
}
torch.save(chkpt, chkpt_path)
state.batch = 0
def test():
model.eval()
test_loss = 0.
test_accuracy = 0.
for data, target in test_loader:
if args.cuda:
data, target = data.cuda(), target.cuda()
output = model(data)
# sum up batch loss
test_loss += F.nll_loss(output, target, size_average=False).item()
# get the index of the max log-probability
pred = output.data.max(1, keepdim=True)[1]
test_accuracy += pred.eq(target.data.view_as(pred)).cpu().float().sum()
Refer to more examples here, and to Horovod with PyTorch for more details.
Next Steps
Once you have the training code ready (either in TensorFlow or PyTorch), you can proceed to creating Horovod work-
loads.
Monitoring Horovod training using TensorBoard is similar to how it is usually done for TensorFlow or PyTorch workloads. Your training script generates the TensorBoard logs and saves them to the directory referenced by the OCI__SYNC_DIR environment variable. With SYNC_ARTIFACTS=1, these TensorBoard logs are periodically synchronized with the configured object storage bucket.
Please refer to Saving Artifacts to Object Storage Buckets.
Aggregating metrics:
In a distributed setup, the metrics (loss, accuracy, etc.) need to be aggregated from all the workers. Horovod provides the MetricAverageCallback callback (for TensorFlow), which should be added to the model training step. For PyTorch, refer to this PyTorch example.
Using TensorBoard Logs:
TensorBoard can be set up on a local machine and pointed to object storage. This enables live monitoring of the TensorBoard logs.
Note: The logs take some initial time (a few minutes) to appear on the TensorBoard dashboard.
Horovod Timelines:
Horovod also provides Timelines, which give a snapshot of the training activities. Timeline files can optionally be generated with the following environment variable (part of the workload YAML).
config:
env:
- name: ENABLE_TIMELINE #Disabled by Default(0).
value: 1
PyTorch is an open source machine learning framework used for applications such as computer vision and natural
language processing, primarily developed by Facebook’s AI Research lab. ADS supports running PyTorch’s native
distributed training code (torch.distributed and DistributedDataParallel) with OCI Data Science Jobs. Pro-
vided you are following the official PyTorch distributed data parallel guidelines, no changes to your PyTorch code
are required.
PyTorch distributed training requires initialization using the torch.distributed.init_process_group() function. By default, this function uses environment variables to initialize communications for the training cluster. When using ADS to run PyTorch distributed training on OCI Data Science Jobs, the environment variables, including MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK, are automatically set in the job runs. By default, MASTER_PORT is set to 29400.
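A minimal sketch of such an initialization (not taken from the example below; the backend choice and timeout value are illustrative):
import datetime

import torch
import torch.distributed as dist

# The env:// init method reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK,
# which the job runs set automatically.
dist.init_process_group(
    backend="nccl" if torch.cuda.is_available() else "gloo",
    init_method="env://",
    timeout=datetime.timedelta(minutes=30),
)

print("initialized rank", dist.get_rank(), "of", dist.get_world_size())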
Prerequisites
1. Internet Connection
2. ADS cli is installed
3. Install docker: https://docs.docker.com/get-docker
Write your training code:
For this example, the code to run was inspired by an example found here.
Note that MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK are environment variables that will
automatically be set.
Listing 8: train.py
# Copyright (c) 2017 Facebook, Inc. All rights reserved.
# BSD 3-Clause License
#
# Script adapted from:
# https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/pytorch/cifar-distributed/src/train.py
# ==============================================================================
import datetime
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import os, argparse
# define functions
def train(train_loader, model, criterion, optimizer, epoch, device, print_freq, rank):
running_loss = 0.0
for i, data in enumerate(train_loader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data[0].to(device), data[1].to(device)
# print statistics
running_loss += loss.item()
if i % print_freq == 0: # print every print_freq mini-batches
print(
"Rank %d: [%d, %5d] loss: %.3f "
% (rank, epoch + 1, i + 1, running_loss / print_freq)
)
running_loss = 0.0
model.eval()
correct = 0
total = 0
class_correct = list(0.0 for i in range(10))
class_total = list(0.0 for i in range(10))
with torch.no_grad():
def main(args):
# get PyTorch environment variables
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
if torch.cuda.is_available():
print("CUDA is available.")
else:
print("CUDA is not available.")
# set device
if distributed:
if torch.cuda.is_available():
device = torch.device("cuda", local_rank)
else:
device = torch.device("cpu")
else:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
train_set = torchvision.datasets.CIFAR10(
root=args.data_dir, train=True, download=True, transform=transform
)
if distributed:
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
else:
train_sampler = None
train_loader = torch.utils.data.DataLoader(
train_set,
batch_size=args.batch_size,
shuffle=(train_sampler is None),
num_workers=args.workers,
sampler=train_sampler,
)
test_set = torchvision.datasets.CIFAR10(
root=args.data_dir, train=False, download=True, transform=transform
)
test_loader = torch.utils.data.DataLoader(
test_set, batch_size=args.batch_size, shuffle=False, num_workers=args.workers
)
model = Net().to(device)
# run script
if __name__ == "__main__":
# setup argparse
parser = argparse.ArgumentParser()
parser.add_argument(
"--data-dir", type=str, help="directory containing CIFAR-10 dataset"
)
parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
parser.add_argument(
"--batch-size",
default=16,
type=int,
help="mini batch size for each gpu/process",
)
parser.add_argument(
"--workers",
default=2,
type=int,
help="number of data loading workers for each gpu/process",
)
parser.add_argument(
"--learning-rate", default=0.001, type=float, help="learning rate"
)
parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
parser.add_argument(
"--output-dir", default="outputs", type=str, help="directory to save model to"
)
)
args = parser.parse_args()
export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
export TAG=latest
The code is assumed to be in the current working directory. To override the source code directory, use the -s flag and
specify the code dir. This folder should be within the current working directory.
-s <code_dir>
If you are behind a proxy, ads opctl will automatically use your proxy settings (defined via no_proxy, http_proxy, and https_proxy).
Define your workload yaml:
The YAML file is a declarative way to express the workload. Following is the YAML for running the example code; you will need to replace the values in the spec sections for your project:
• infrastructure contains the spec for OCI Data Science Jobs. Here you need to specify a subnet that allows communication between nodes. The VM.GPU2.1 shape is used in this example.
• cluster contains the spec for the image you built and a working directory on OCI object storage, which is used by the job runs to share internal configurations. Environment variables specified in cluster.spec.config are available in all nodes. Here NCCL_ASYNC_ERROR_HANDLING is used to enable the timeout for the NCCL backend. The job runs are terminated if the nodes fail to connect to each other within the number of minutes specified in your training code when calling init_process_group().
• runtime contains the spec for the name of your training script and the command line arguments for running the script. Here the nccl backend is used for communication between GPUs. For CPU training, you can use the gloo backend. The timeout argument specifies the maximum number of minutes for the nodes to wait when calling init_process_group(). This is useful to prevent the job runs from waiting forever in case of node failure.
Listing 9: train.yaml
kind: distributed
apiVersion: v1.0
spec:
infrastructure:
kind: infrastructure
type: dataScienceJob
apiVersion: v1.0
spec:
projectId: oci.xxxx.<project_ocid>
compartmentId: oci.xxxx.<compartment_ocid>
displayName: PyTorch-Distributed
logGroupId: oci.xxxx.<log_group_ocid>
logId: oci.xxx.<log_ocid>
subnetId: oci.xxxx.<subnet-ocid>
shapeName: VM.GPU2.1
blockStorageSize: 50
cluster:
kind: pytorch
apiVersion: v1.0
spec:
image: <region.ocir.io/my-tenancy/image-name>
workDir: "oci://my-bucket@my-namespace/pytorch/distributed"
config:
env:
- name: NCCL_ASYNC_ERROR_HANDLING
value: '1'
main:
name: PyTorch-Distributed-main
replicas: 1
worker:
name: PyTorch-Distributed-worker
replicas: 3
Use ads opctl to create the cluster infrastructure and dry-run the workload:
The output from the dry run will show all the actions and the infrastructure configuration.
Use ads opctl to create the cluster infrastructure and run the workload:
Test Locally:
Before submitting the workload to Jobs, you can run it locally to test your code, dependencies, configurations, etc. With the -b local flag, it uses a local backend. When you need to run this workload on ODSC Jobs, simply use the -b job flag instead.
If your code needs to use any OCI services (such as an object storage bucket), you need to mount your OCI keys from your local host machine onto the container. This is already done for you, assuming the typical location of the OCI keys, ~/.oci. If your keys are at a different location, you can modify this in the config.ini file.
oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
Note: This will automatically push the Docker image to the OCI Container Registry repository.
Once running, you will see output on the terminal similar to the one below. Note that this YAML can be used as input to ads opctl distributed-training show-config -f <info.yaml>. To both save and see the run information, use tee, for example:
- name: SYNC_ARTIFACTS
value: 1
- name: WORKSPACE
value: "<bucket_name>"
- name: WORKSPACE_PREFIX
value: "<bucket_prefix>"
Note: Change SYNC_ARTIFACTS to 0 to disable this feature. Use the OCI__SYNC_DIR environment variable in your code to save the artifacts. For example:
model_path = os.path.join(os.environ.get("OCI__SYNC_DIR"),"model.pt")
torch.save(model, model_path)
Profiling
You may want to profile your training setup for optimization or performance tuning. Profiling typically provides a detailed analysis of CPU utilization, GPU utilization, top CUDA kernels, top operators, etc. You can choose to profile your training setup using the native PyTorch profiler or a third-party profiler such as Nvidia Nsight Systems.
Profiling using PyTorch Profiler
PyTorch Profiler is a native offering from PyTorch for performance profiling. Profiling is invoked using code instrumentation with the torch.profiler.profile API.
Refer to this link for the changes that you need to make in your training script for instrumentation. You should choose the OCI__SYNC_DIR directory to save the profiling logs. For example:
prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler(os.environ.get("OCI__SYNC_DIR") + "/logs"),
    with_stack=False)
prof.start()
# training code; call prof.step() at the end of each training step so the schedule advances
prof.stop()
Also, the sync feature SYNC_ARTIFACTS should be enabled (set to '1') to sync the profiling logs to the configured object storage.
You also need to install the PyTorch TensorBoard plugin.
Thereafter, use TensorBoard to view the logs. Refer to the TensorBoard setup for setting it up on your computer.
Profiling using Nvidia Nsight Systems
Nvidia Nsight Systems is a system-wide profiling tool from Nvidia that can be used to profile deep learning workloads. It requires no changes in your training code and works at the process level. You can enable this experimental feature in your training setup via the following configuration in the runtime YAML file (highlighted).
spec:
image: "@image"
workDir: "oci://@/"
name: "tf_multiworker"
config:
env:
- name: WORKER_PORT
value: 12345
- name: SYNC_ARTIFACTS
value: 1
- name: WORKSPACE
value: "<bucket_name>"
- name: WORKSPACE_PREFIX
value: "<bucket_prefix>"
- name: PROFILE
value: 1
- name: PROFILE_CMD
value: "nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o /opt/
˓→ml/nsight_report -x true"
main:
name: "main"
replicas: 1
worker:
name: "worker"
replicas: 1
Refer to this for nsys profile command options. You can modify the command within PROFILE_CMD, but remember that this is all experimental. The profiling reports are generated per node. You need to download the reports to your computer, either manually or via the OCI CLI.
Note: -bn == WORKSPACE and --prefix path == WORKSPACE_PREFIX/<job_id>, as configured in the runtime YAML file. To view the reports, you need to install the Nsight Systems app from here. Thereafter, open the downloaded reports in the Nsight Systems app.
11.2.7 Tensorflow
Prerequisites
1. Internet Connection
2. ADS cli is installed
3. Install docker: https://docs.docker.com/get-docker
Write your training code:
Your model training script needs to use one of the distributed strategies in TensorFlow.
For example, you can have the following TensorFlow training script for MultiWorkerMirroredStrategy saved as mnist.py:
# Script adapted from tensorflow tutorial: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
import tensorflow as tf
import tensorflow_datasets as tfds
import os
import sys
import time
import ads
from ocifs import OCIFileSystem
from tensorflow.data.experimental import AutoShardPolicy
BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 64
def create_dir(dir):
if not os.path.exists(dir):
os.makedirs(dir)
info.splits['test'].num_examples)
train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
train = shard(train_dataset)
test = shard(test_dataset)
return train, test, info
def shard(dataset):
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
return dataset.with_options(options)
def decay(epoch):
if epoch < 3:
return 1e-3
elif epoch >= 3 and epoch < 7:
return 1e-4
else:
return 1e-5
class PrintLR(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs=None):
print('\nLearning rate for epoch {} is {}'.format(epoch + 1, model.optimizer.lr.numpy()), flush=True)
callbacks = [
tf.keras.callbacks.TensorBoard(log_dir='./logs'),
tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
# save_weights_only=True
),
tf.keras.callbacks.LearningRateScheduler(decay),
PrintLR()
]
return callbacks
def build_and_compile_cnn_model():
print("TF_CONFIG in model:", os.environ.get("TF_CONFIG"))
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(10)
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer=tf.keras.optimizers.Adam(),
metrics=['accuracy'])
return model
import tensorflow as tf
import argparse
import mnist
print(tf.__version__)
default='/code/data')
parser.add_argument('--data-bckt',
help='location of the training dataset in an object storage bucket',
default=None)
args = parser.parse_args()
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
num_replicas=strategy.num_replicas_in_sync)
with strategy.scope():
model = mnist.build_and_compile_cnn_model()
model.save(model_dir, save_format='tf')
This will download the TensorFlow framework artifacts and place them inside the 'oci_dist_training_artifacts' folder.
Note: Whenever you change the code, you have to build, tag, and push the image to the repository. This is done automatically by the `ads opctl run` CLI command.
Containerize your code and build the container:
The required Python dependencies are provided in the conda environment file oci_dist_training_artifacts/tensorflow/v1/environments.yaml. If your code requires additional dependencies, update this file.
While updating environments.yaml, do not remove the existing libraries; you can append to the list.
Update the TAG and the IMAGE_NAME as per your needs -
export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
export TAG=latest
export MOUNT_FOLDER_PATH=.
The code is assumed to be in the current working directory. To override the source code directory, use the -s flag and
specify the code dir. This folder should be within the current working directory.
If you are behind a proxy, ads opctl will automatically use your proxy settings (defined via no_proxy, http_proxy, and https_proxy).
Define your workload yaml:
The YAML file is a declarative way to express the workload. In this example, we bring up one worker node and one chief worker node. The training code to run is train.py. All your training code is assumed to be present inside the /code directory within the container. Additionally, you can also put any data files inside the same directory (and pass their location, for example /code/data/**, as an argument to your training script using runtime->spec->args).
kind: distributed
apiVersion: v1.0
spec:
infrastructure:
kind: infrastructure
type: dataScienceJob
apiVersion: v1.0
spec:
projectId: oci.xxxx.<project_ocid>
compartmentId: oci.xxxx.<compartment_ocid>
displayName: Tensorflow
logGroupId: oci.xxxx.<log_group_ocid>
subnetId: oci.xxxx.<subnet-ocid>
shapeName: VM.GPU2.1
blockStorageSize: 50
cluster:
kind: TENSORFLOW
apiVersion: v1.0
spec:
image: "@image"
workDir: "oci://<bucket_name>@<bucket_namespace>/<bucket_prefix>"
name: "tf_multiworker"
config:
env:
- name: WORKER_PORT #Optional. Defaults to 12345
value: 12345
- name: SYNC_ARTIFACTS #Mandatory: Switched on by Default.
value: 1
- name: WORKSPACE #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket to sync generated artifacts to.
  value: "<bucket_name>"
- name: WORKSPACE_PREFIX #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket folder to sync generated artifacts to.
  value: "<bucket_prefix>"
main:
name: "chief"
replicas: 1 #this will be always 1.
worker:
name: "worker"
replicas: 1 #number of workers. This is in addition to the 'chief' worker. Could be more than 1
runtime:
kind: python
apiVersion: v1.0
spec:
entryPoint: "/code/train.py" #location of user's training script in the container␣
˓→image.
Use ads opctl to create the cluster infrastructure and run the workload:
Do a dry run to inspect how the yaml translates to Job and Job Runs
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Creating Main Job Run with following details:
Name: chief
Environment Variables:
OCI__MODE:MAIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating Job Runs with following details:
Name: worker_0
Environment Variables:
OCI__MODE:WORKER
-----------------------------Ending dryrun mode----------------------------------
Test Locally:
Before submitting the workload to Jobs, you can run it locally to test your code, dependencies, configurations, etc. With the -b local flag, it uses a local backend. When you need to run this workload on ODSC Jobs, simply use the -b job flag instead.
If your code needs to use any OCI services (such as an object storage bucket), you need to mount your OCI keys from your local host machine onto the container. This is already done for you, assuming the typical location of the OCI keys, ~/.oci. If your keys are at a different location, you can modify this in the config.ini file.
oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
Note: This will automatically push the Docker image to the OCI Container Registry repository.
Once running, you will see output on the terminal similar to the one below. Note that this YAML can be used as input to ads opctl distributed-training show-config -f <info.yaml>. To both save and see the run information, use tee, for example:
- name: SYNC_ARTIFACTS
value: 1
- name: WORKSPACE
value: "<bucket_name>"
- name: WORKSPACE_PREFIX
value: "<bucket_prefix>"
Note: Change SYNC_ARTIFACTS to 0 to disable this feature. Use the OCI__SYNC_DIR environment variable in your code to save the artifacts. For example:
tf.keras.callbacks.ModelCheckpoint(os.path.join(os.environ.get("OCI__SYNC_DIR"),"ckpts",'checkpoint-{epoch}.h5'))
Profiling
You may want to profile your training setup for optimization or performance tuning. Profiling typically provides a detailed analysis of CPU utilization, GPU utilization, top CUDA kernels, top operators, etc. You can choose to profile your training setup using the native TensorFlow profiler or a third-party profiler such as Nvidia Nsight Systems.
Profiling using TensorFlow Profiler
TensorFlow Profiler is a native offering from TensorFlow for performance profiling.
Profiling is invoked using code instrumentation with one of the following APIs:
tf.keras.callbacks.TensorBoard
tf.profiler.experimental.Profile
Refer to the links above for the changes that you need to make in your training script for instrumentation.
You should choose the OCI__SYNC_DIR directory to save the profiling logs. For example:
options = tf.profiler.experimental.ProfilerOptions(
host_tracer_level=2,
python_tracer_level=1,
device_tracer_level=1,
delay_ms=None)
with tf.profiler.experimental.Profile(os.environ.get("OCI__SYNC_DIR") + "/logs", options=options):
    # training code

# TensorBoard callback variant: profile batches 500-520 and write logs under OCI__SYNC_DIR.
tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=os.environ.get("OCI__SYNC_DIR") + "/logs",
                                                 histogram_freq=1,
                                                 profile_batch='500,520')
model.fit(..., callbacks=[tboard_callback])
Also, the sync feature SYNC_ARTIFACTS should be enabled (set to '1') to sync the profiling logs to the configured object storage.
Thereafter, use TensorBoard to view the logs. Refer to the TensorBoard setup for setting it up on your computer.
Profiling using Nvidia Nsight Systems
Nvidia Nsight Systems is a system-wide profiling tool from Nvidia that can be used to profile deep learning workloads. It requires no changes in your training code and works at the process level. You can enable this experimental feature in your training setup via the following configuration in the runtime YAML file (highlighted).
spec:
image: "@image"
workDir: "oci://@/"
name: "tf_multiworker"
config:
env:
- name: WORKER_PORT
value: 12345
- name: SYNC_ARTIFACTS
value: 1
- name: WORKSPACE
value: "<bucket_name>"
- name: WORKSPACE_PREFIX
value: "<bucket_prefix>"
- name: PROFILE
value: 1
- name: PROFILE_CMD
value: "nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o /opt/
˓→ml/nsight_report -x true"
main:
Refer to this for nsys profile command options. You can modify the command within PROFILE_CMD, but remember that this is all experimental. The profiling reports are generated per node. You need to download the reports to your computer, either manually or via the OCI CLI.
Note: -bn == WORKSPACE and --prefix path == WORKSPACE_PREFIX/<job_id>, as configured in the runtime YAML file. To view the reports, you need to install the Nsight Systems app from here. Thereafter, open the downloaded reports in the Nsight Systems app.
Other TensorFlow strategies supported
TensorFlow has two multi-worker strategies: MultiWorkerMirroredStrategy and ParameterServerStrategy. Let's look at the changes you would need to make to run a ParameterServerStrategy workload.
You can have the following TensorFlow training script for ParameterServerStrategy saved as train.py (just like mnist.py and train.py in the case of MultiWorkerMirroredStrategy):
import os
import tensorflow as tf
import json
import multiprocessing
NUM_PS = len(json.loads(os.environ['TF_CONFIG'])['cluster']['ps'])
global_batch_size = 64
for i in range(num_workers):
print("cluster_resolver.task_id: ", cluster_resolver.task_id, flush=True)
s = tf.distribute.Server(
cluster_resolver.cluster_spec(),
job_name=cluster_resolver.task_type,
task_index=cluster_resolver.task_id,
config=worker_config,
if mode.lower() == 'worker':
print("Starting worker server...", flush=True)
worker(num_workers, cluster_resolver)
else:
print("Starting ps server...", flush=True)
ps(num_ps, cluster_resolver)
def decay(epoch):
if epoch < 3:
return 1e-3
elif epoch >= 3 and epoch < 7:
return 1e-4
else:
return 1e-5
def get_callbacks(model):
class PrintLR(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs=None):
print('\nLearning rate for epoch {} is {}'.format(epoch + 1, model.optimizer.lr.numpy()), flush=True)
callbacks = [
tf.keras.callbacks.TensorBoard(log_dir='./logs'),
tf.keras.callbacks.LearningRateScheduler(decay),
PrintLR()
]
return callbacks
def create_dir(dir):
if not os.path.exists(dir):
os.makedirs(dir)
def get_artificial_data():
x = tf.random.uniform((10, 10))
y = tf.random.uniform((10,))
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
if not os.environ["OCI__MODE"] == "MAIN":
create_cluster(cluster_resolver, num_workers=1, num_ps=1, mode=os.environ["OCI__MODE"])
pass
variable_partitioner = (
tf.distribute.experimental.partitioners.MinSizePartitioner(
min_shard_bytes=(256 << 10),
max_shards=NUM_PS))
strategy = tf.distribute.ParameterServerStrategy(
cluster_resolver,
variable_partitioner=variable_partitioner)
dataset = get_artificial_data()
with strategy.scope():
model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
model.compile(tf.keras.optimizers.SGD(), loss="mse", steps_per_execution=10)
callbacks = get_callbacks(model)
model.fit(dataset, epochs=5, steps_per_epoch=20, callbacks=callbacks)
Train.yaml: The only difference here is that the parameter server train.yaml also needs to have a ps worker pool. This will create dedicated instance(s) for the TensorFlow parameter servers.
Use the following train.yaml:
kind: distributed
apiVersion: v1.0
spec:
infrastructure:
kind: infrastructure
type: dataScienceJob
apiVersion: v1.0
spec:
projectId: oci.xxxx.<project_ocid>
compartmentId: oci.xxxx.<compartment_ocid>
displayName: Distributed-TF
logGroupId: oci.xxxx.<log_group_ocid>
value: "<bucket_name>"
- name: WORKSPACE_PREFIX #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket folder to sync generated artifacts to.
value: "<bucket_prefix>"
main:
name: "coordinator"
replicas: 1 #this will be always 1.
worker:
name: "worker"
replicas: 1 #number of workers; any number > 0
ps:
name: "ps" # number of parameter servers; any number > 0
replicas: 1
runtime:
kind: python
apiVersion: v1.0
spec:
entryPoint: "/code/train.py" #location of user's training script in the container␣
˓→image.
The rest of the steps remain the same and should be followed as they are.
Instead of adding the training source code to the Docker image, you can also fetch the code at runtime from a Git repository or from object storage.
To fetch code from a Git repository, update the runtime section of the YAML to specify type as git and add the uri of the Git repository to the runtime.spec section. For example:
runtime:
  apiVersion: v1
  kind: python
  type: git
  spec:
    uri: git@github.com:username/repository.git
    branch: develop
    commit: abcdef
    gitSecretId: ocid1.xxxxxx
    entryPoint: "train.py"
To fetch code from object storage, update the runtime section of the YAML to specify type as remote and add the uri of the OCI object storage location to the runtime.spec section. For example:
runtime:
  apiVersion: v1
  kind: python
  type: remote
  spec:
    uri: oci://bucket@namespace/prefix/to/source_code_dir
    entryPoint: "/code/source_code_dir/train.py"
The uri can be a single file or a prefix (directory). The entryPoint is the file path to start the training code. When using a relative path, if uri is a single file, entryPoint should be the filename; if uri is a directory, entryPoint should contain the name of the directory, as in the example above. The source code is cloned to the /code directory. You may also use an absolute path.
The distributed training workload is defined in YAML and can be launched by invoking the ads opctl run -f path/to/yaml command.
Following is the YAML schema for validating the YAML using Cerberus:
kind:
  type: string
  allowed:
    - distributed
apiVersion:
  type: string
spec:
  type: dict
  schema:
    infrastructure:
      type: dict
      schema:
        kind:
          type: string
          allowed:
            - infrastructure
        type:
          type: string
          allowed:
            - dataScienceJob
        apiVersion:
          type: string
        spec:
          type: dict
11.2.10 Troubleshooting
11.3 TensorBoard
TensorBoard helps you visualize your experiments. You bring up a TensorBoard session on your workstation and point it to the directory that contains the TensorBoard logs.
Prerequisite
1. Object storage bucket
2. Access to Object Storage bucket from your workstation
3. ocifs version 1.1.0 and above
It is required that tensorboard is installed in a dedicated conda environment or virtual environment. Prepare an environment YAML file for creating the conda environment with the following command:
Create the conda environment from the YAML file generated in the preceding step.
This will create a conda environment called tensorboard. Activate the conda environment by running:
This will bring up the TensorBoard app on your workstation. Access TensorBoard at http://localhost:6006/
Note: The logs take some initial time (few minutes) to reflect on the tensorboard dashboard.
Prerequisite
1. tensorboard is installed.
2. ocifs version is 1.1.0 and above.
3. oracle-ads version 2.6.0 and above.
11.3.3.1 PyTorch
You can write the logs from your PyTorch experiments directly to object storage and view them on TensorBoard running on your local workstation in real time. Here is an example of running a PyTorch experiment and writing TensorBoard logs from an OCI Data Science Notebook:
1. Create or Open an existing OCI Data Science Notebook session
2. Run odsc conda install -s pytorch110_p37_cpu_v1 on terminal inside the notebook session
3. Activate conda environment - conda activate /home/datascience/conda/pytorch110_p37_cpu_v1
4. Install TensorBoard - python3 -m pip install tensorboard
5. Upgrade to latest ocifs - python3 -m pip install ocifs --upgrade
6. Create a notebook and select pytorch110_p37_cpu_v1 kernel
7. Copy the following code into a cell and update the object storage path in the code snippet
# Reference: https://github.com/pytorch/tutorials/blob/master/recipes_source/recipes/tensorboard_with_pytorch.py
import torch
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("oci://my-bucket@my-namespace/path/to/logs")
x = torch.arange(-5, 5, 0.1).view(-1, 1)
y = -5 * x + 0.1 * torch.randn(x.size())
model = torch.nn.Linear(1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1)
def train_model(iter):
for epoch in range(iter):
y1 = model(x)
loss = criterion(y1, y)
writer.add_scalar("Loss/train", loss, epoch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_model(10)
writer.flush()
writer.close()
For more possibilities with TensorBoard and PyTorch, check this link.
11.3.3.2 TensorFlow
Currently, TensorFlow cannot write directly to object storage. However, you can create the logs in a local directory and then copy them over to object storage, where they can be viewed from the TensorBoard running on your local workstation.
When you run an OCI Data Science Job with ads.jobs.NotebookRuntime or ads.jobs.GitRuntime, all the output is automatically copied over to the configured object storage bucket.
Here is an example of running a TensorFlow experiment in an OCI Data Science Notebook and then viewing the logs from TensorBoard:
1. Create or open an existing notebook session.
2. Download notebook - https://raw.githubusercontent.com/mayoor/stats-ml-exps/master/tensorboard_tf.ipynb
!wget https://raw.githubusercontent.com/mayoor/stats-ml-exps/master/tensorboard_tf.ipynb
3. Run odsc conda install -s tensorflow27_p37_cpu_v1 on terminal to install TensorFlow 2.6 environ-
ment.
4. Open the downloaded notebook - tensorboard_tf.ipynb
5. Select tensorflow27_p37_cpu_v1 kernel.
6. Run all cells.
7. Copy the TensorBoard logs folder, tflogs, to object storage using the OCI CLI (a Python alternative is sketched below).
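The exact OCI CLI command is not shown in this excerpt. As an alternative, a hedged sketch of uploading the logs from Python using ocifs (the bucket path is a placeholder, and API-key authentication via the default OCI config file is assumed):
from ocifs import OCIFileSystem

# Recursively upload the local tflogs folder to the object storage bucket.
fs = OCIFileSystem(config="~/.oci/config")
fs.put("tflogs", "oci://my-bucket@my-namespace/myexperiment/tflogs/", recursive=True)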
View the logs from your workstation once the logs are uploaded, by launching TensorBoard with the following command:
Here is an example of running a TensorFlow experiment in OCI Data Science Jobs and then viewing the logs from TensorBoard:
1. Run the following code to submit a notebook to an OCI Data Science Job. You can run this code snippet from your local workstation or from an OCI Data Science Notebook session. You need oracle-ads version >= 2.6.0.
.with_service_conda("tensorflow27_p37_cpu_v1")
# Saves the notebook with outputs to OCI object storage.
.with_output("oci://my-bucket@my-namespace/myexperiment/jobs/")
)
).create()
# Run and monitor the job
run = job.run().watch()
View the logs from your workstation once the job is complete, by launching TensorBoard with the following command:
With the ever-growing suite of models at the disposal of data scientists, the problems with selecting a model have grown similarly. ADS offers the Evaluation Class, a collection of tools, metrics, and charts concerned with the contradistinction of several models.
After working hard to architect and train your model, it's important to understand how it performs across a series of benchmarks. Evaluation is a set of functions that convert the output of your test data into an interpretable, standardized series of scores and charts, ranging from the accuracy and the ROC curve to residual QQ plots.
Evaluation can help machine learning developers to:
• Quickly compare models across several industry-standard metrics.
– For example, what’s the accuracy, and F1-Score of my binary classification model?
• Discover where a model is failing to feedback into future model development.
– For example, while accuracy is high, precision is low, which is why the examples I care about are failing.
• Increase understanding of the trade-offs of various model types.
Evaluation helps you understand where the model is likely to perform well or not. For example, model A performs well
when the weather is clear, but is much more uncertain during inclement conditions.
There are three types of ADS Evaluators, binary classifier, multinomial classifier, and regression.
seed = 42
lr_clf = LogisticRegression(
random_state=0, solver="lbfgs", multi_class="multinomial"
).fit(trainx, trainy)
evaluator = ADSEvaluator(
ADSData(testx, testy),
models=[bin_lr_model, bin_rf_model],
training_data=ADSData(trainx, trainy),
)
print(evaluator.metrics)
X, y = make_classification(
n_samples=10000, n_features=25, n_classes=3, flip_y=0.1, n_clusters_per_class=1
)
lr_multi_clf = LogisticRegression(
random_state=0, solver="lbfgs", multi_class="multinomial"
).fit(trainx, trainy)
multi_lr_model = ADSModel.from_estimator(lr_multi_clf)
multi_rf_model = ADSModel.from_estimator(rf_multi_clf)
evaluator = ADSEvaluator(
ADSData(testx, testy),
models=[multi_lr_model, multi_rf_model],
)
print(evaluator.metrics)
seed = 42
lin_reg_model = ADSModel.from_estimator(lin_reg)
lasso_reg_model = ADSModel.from_estimator(lasso_reg)
reg_evaluator = ADSEvaluator(
ADSData(testx, testy), models=[lin_reg_model, lasso_reg_model]
)
print(reg_evaluator.metrics)
Binary classification is a type of modeling wherein the output is binary. For example, Yes or No, Up or Down, 1 or 0. These models are a special case of multinomial classification, and so have specifically catered metrics.
The prevailing metrics for evaluating a binary classification model are accuracy, hamming loss, kappa score, precision,
recall, 𝐹1 and AUC. Most information about binary classification uses a few of these metrics to speak to the importance
of the model.
• Accuracy: The proportion of predictions that were correct. It is generally converted to a percentage where 100%
is a perfect classifier. An accuracy of 50% is random (for a balanced dataset) and an accuracy of 0% is a perfectly
wrong classifier.
• AUC: Area Under the Curve (AUC) refers to the area under an ROC curve. This is a numerical way to summarize
the robustness of a model to its discrimination threshold. The AUC is computed by integrating the area under
the ROC curve. It is equivalent to the probability that the model ranks a randomly chosen positive example
higher than a randomly chosen negative one. Thus, 1.0 is a perfect score, 0.5 is the average score of a random
classifier, and 0.0 is a perfectly backward scoring classifier.
• F1 Score: There is generally a trade-off between the precision and recall and the 𝐹1 score is a metric that
combines them into a single number. The 𝐹1 Score is the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
Therefore a perfect 𝐹1 score is 1. That is, the classifier has perfect precision and recall. The worst 𝐹1 score is 0.
The 𝐹1 score of a random classifier is heavily dependent on the nature of the data.
• Hamming Loss: The proportion of predictions that were incorrectly classified and is equivalent to 1−𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦.
This means a Hamming Loss of 0 is a perfect classifier. A score of 0.5 is a random classifier (for a balanced
dataset), and 1 is a perfectly incorrect classifier.
• Kappa Score: Cohen’s 𝜅 coefficient is a statistic that measures inter-annotator agreement. This function com-
putes Cohen’s 𝜅, a score that expresses the level of agreement between two annotators on a classification problem.
It is defined as:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
𝑝𝑜 is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio).
𝑝𝑒 is the expected agreement when both annotators assign labels randomly. 𝑝𝑒 is estimated using a per-annotator
empirical prior over the class labels.
• Precision: The proportion of the samples that were predicted to be True and are actually in the True class,
$\frac{TP}{TP + FP}$. This is also known as Positive Predictive Value (PPV). A precision of 1.0 is perfect precision, 0.0 is
bad precision. However, the precision of a random classifier varies highly based on the nature of the data, as does
what counts as bad precision.
• Recall: This is the proportion of the actual True class that was correctly predicted, $\frac{TP}{TP + FN}$.
This is also known as True Positive Rate (TPR) or Sensitivity.
A recall of 1.0 is perfect recall, 0.0 is bad recall. However, the recall of a random classifier varies highly based
on the nature of the data, as does what counts as bad recall.
The prevailing charts and plots for binary classification are the Precision-Recall Curve, the ROC curve, the Lift Chart,
the Gain Chart, and the Confusion Matrix. These are inter-related with the previously described metrics and are com-
monly used in the binary classification literature.
• Confusion Matrix
• Gain Chart
• Lift Chart
• Precision-Recall Curve
• ROC curve
This code snippet demonstrates how to generate the above metrics and charts. The data has to be split into a testing
and training set, with the features in trainx and testx and the responses in trainy and testy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from ads.common.data import ADSData
from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator

seed = 42
# The dataset, split, and random forest are assumed; the source omits them.
X, y = make_classification(n_samples=10000, n_features=25, flip_y=0.1, random_state=seed)
trainx, testx, trainy, testy = train_test_split(X, y, test_size=0.30, random_state=seed)
lr_clf = LogisticRegression(random_state=0, solver="lbfgs").fit(trainx, trainy)
rf_clf = RandomForestClassifier(random_state=seed).fit(trainx, trainy)
bin_lr_model = ADSModel.from_estimator(lr_clf)
bin_rf_model = ADSModel.from_estimator(rf_clf)
evaluator = ADSEvaluator(
    ADSData(testx, testy),
    models=[bin_lr_model, bin_rf_model],
    training_data=ADSData(trainx, trainy),
)
print(evaluator.metrics)
evaluator.metrics
evaluator.show_in_notebook(perfect=True)
Important parameters:
• If perfect is set to True, ADS plots a perfect classifier for comparison in Lift and Gain charts.
• If baseline is set to True, ADS includes a baseline for the comparison of the various plots.
• If use_training_data is set to True, ADS plots the evaluations of the training data.
• If plots contains a list of plot types, ADS plots only those plot types.
This code snippet demonstrates how to add a custom metric, an F2 score, to the evaluator.
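The evaluator exposes an add_metrics() method for this. The following is a minimal sketch, assuming the binary
evaluator from the snippet above; the metric name and the use of scikit-learn's fbeta_score are illustrative:
from sklearn.metrics import fbeta_score

def F2_Score(y_true, y_pred):
    # F-beta with beta=2 weighs recall twice as heavily as precision.
    return fbeta_score(y_true, y_pred, beta=2)

evaluator.add_metrics([F2_Score], ["F2 Score"])
evaluator.metrics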
• Equal Opportunity: For each of the protected_features specified, Equal Opportunity is a ratio between the
true positive rates for each class within that feature. The closer this value is to 1, the less biased the model is
with respect to the feature F. In other terms, for a binary feature F with classes A and B, Equal Opportunity is
calculated using the following formula:
$$\frac{P(\hat{y} = 1 \mid Y = 1, F = A)}{P(\hat{y} = 1 \mid Y = 1, F = B)}$$
• Statistical Parity: For each of the protected_features specified, Statistical Parity is a ratio between the prediction
rates for each class within that feature. The closer this value is to 1, the less biased the model and data are with
respect to the feature F. In other terms, for a binary feature F with classes A and B, Statistical Parity is calculated
using the following formula:
$$\frac{P(\hat{y} \mid F = A)}{P(\hat{y} \mid F = B)}$$
Multinomial classification is a type of modeling wherein the output is discrete. For example, an integer 1-10, an animal
at the zoo, or a primary color. These models have a specialized set of charts and metrics for their evaluation.
The prevailing metrics for evaluating a multinomial classification model are:
• Accuracy: The proportion of predictions that were correct. It is generally converted to a percentage where 100%
is a perfect classifier. For a balanced dataset, an accuracy of $\frac{100\%}{k}$, where $k$ is the number of classes, is a random
classifier. An accuracy of 0% is a perfectly wrong classifier.
• F1 Score (weighted, macro or micro): There is generally a trade-off between the precision and recall and the
𝐹1 score is a metric that combines them into a single number. The per-class 𝐹1 score is the harmonic mean of
precision and recall:
$$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
As with precision, there are a number of other versions of 𝐹1 that are used in multinomial classification. The
macro and weighted 𝐹1 are computed the same way as with precision, but with the per-class 𝐹1 replacing the per-class
precision. However, the micro 𝐹1 is computed a little differently. The precision and recall are computed by
summing the TP, FN, and FP across all classes, and then using them in the standard formulas.
• Hamming Loss: The proportion of predictions that were incorrectly classified and is equivalent to $1 - accuracy$.
This means a Hamming loss score of 0 is a perfect classifier. A score of $\frac{k-1}{k}$ is a random classifier for a balanced
dataset, and 1.0 is a perfectly incorrect classifier.
• Kappa Score: Cohen’s 𝜅 coefficient is a statistic that measures inter-annotator agreement. This function com-
putes Cohen’s 𝜅, a score that expresses the level of agreement between two annotators on a classification problem.
It is defined as:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
𝑝𝑜 is the empirical probability of agreement on the class assigned to any sample (the observed agreement ratio).
𝑝𝑒 is the expected agreement when both annotators assign classes randomly. 𝑝𝑒 is estimated using a per-annotator
empirical prior over the class.
• Precision (weighted, macro or micro): This is the proportion of samples that were predicted to be in a given class
and are actually in that class. In multinomial classification, it is common to report the precision for each class,
and this is called the per-class precision. It is computed using the same approach used in binary classification,
$\frac{TP}{TP + FP}$, but only the class under consideration is used. A value of 1 means that the classifier was able
to perfectly predict that class. A value of 0 means that the classifier was never correct for that class. There
are three other versions of precision that are used in multinomial classification: weighted, macro, and
micro-precision. Weighted precision, $P_w$, combines the per-class precisions weighted by the proportion of true
instances in each class:
$$P_w = W_1 P_1 + \cdots + W_n P_n$$
where $W_i$ is the proportion of true instances in class $i$ and $P_i$ is the per-class precision for the $i^{th}$ class.
The macro-precision, $P_m$, is the mean of all the per-class precisions, $P_i$:
$$P_m = \frac{1}{n} \sum_i P_i$$
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from ads.common.data import ADSData
from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator

seed = 42
X, y = make_classification(
    n_samples=10000, n_features=25, n_classes=3, flip_y=0.1, n_clusters_per_class=1
)
# The split and the random forest below are assumed; the source omits them.
trainx, testx, trainy, testy = train_test_split(X, y, test_size=0.30, random_state=seed)
lr_multi_clf = LogisticRegression(
    random_state=0, solver="lbfgs", multi_class="multinomial"
).fit(trainx, trainy)
rf_multi_clf = RandomForestClassifier(random_state=seed).fit(trainx, trainy)
multi_lr_model = ADSModel.from_estimator(lr_multi_clf)
multi_rf_model = ADSModel.from_estimator(rf_multi_clf)
evaluator = ADSEvaluator(
    ADSData(testx, testy),
    models=[multi_lr_model, multi_rf_model],
)
print(evaluator.metrics)
evaluator.metrics
evaluator.show_in_notebook()
• precision_weighted: The weighted average of precision_by_label. Weights are proportional to the num-
ber of true instances for each class.
• precision_micro: Global precision. Calculated by using global true positives and false positives.
• recall_weighted: The weighted average of recall_by_label. Weights are proportional to the number of
true instances for each class.
• recall_micro: Global recall. Calculated by using global true positives and false negatives.
• f1_weighted: The weighted average of f1_by_label. Weights are proportional to the number of true instances
for each class.
• f1_micro: Global 𝐹1 . It is calculated using the harmonic mean of micro precision and recall metrics.
All of these metrics can be computed directly from the confusion matrix.
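As a small illustration of that point (y_true and y_pred are assumed label arrays; the names are illustrative), the
micro-averaged value can be read directly off the confusion matrix, and scikit-learn reports the averaged variants:
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Micro precision from the confusion matrix: global true positives over all predictions.
cm = confusion_matrix(y_true, y_pred)
precision_micro = np.trace(cm) / cm.sum()
# The same value via scikit-learn, plus the weighted variant for comparison.
assert np.isclose(precision_micro, precision_score(y_true, y_pred, average="micro"))
precision_weighted = precision_score(y_true, y_pred, average="weighted")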
If the preceding metrics don’t include the specific metric you want to use, maybe an F2 score, simply add it to your
evaluator object as in this example:
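The example code is cut off in this extract; the following is a minimal sketch of the same idea for the multinomial
evaluator above (the metric name and the macro-averaged fbeta_score are illustrative):
from sklearn.metrics import fbeta_score

def F2_Score(y_true, y_pred):
    # Macro-averaged F-beta with beta=2 for the multinomial case.
    return fbeta_score(y_true, y_pred, beta=2, average="macro")

evaluator.add_metrics([F2_Score], ["F2 Score"])
evaluator.metrics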
11.4.4 Regression
Regression is a type of modeling wherein the output is continuous. For example, price, height, sales, length. These
models have their own specific metrics that help to benchmark the model. How close is close enough?
The prevailing metrics for evaluating a regression model are:
• Explained variance score: The variance of the model’s predictions. The mean of the squared difference between
the predicted values and the true mean of the data, see [Read More].
• Mean absolute error (MAE): The mean of the absolute difference between the true values and predicted values,
see [Read More].
• Mean squared error (MSE): The mean of the squared difference between the true values and predicted values,
see [Read More].
• R-squared: Also known as the coefficient of determination. It is the proportion in the data of the variance that
is explained by the model, see [Read More].
• Root mean squared error (RMSE): The square root of the mean squared error, see [Read More].
• Mean residuals: The mean of the difference between the true values and predicted values, see [Read More].
The prevailing charts and plots for regression are:
• Observed vs. predicted: A plot of the observed, or actual values, against the predicted values output by the
models.
• Residuals QQ: The quantile-quantile plot shows the quantiles of the residuals against the quantiles of a standard
normal distribution. It should be close to a straight line for a good model.
• Residuals vs observed: A plot of residuals vs observed values. This should not carry a lot of structure in a good
model.
• Residuals vs. predicted: A plot of residuals versus predicted values. This should not carry a lot of structure in
a good model.
This code snippet demonstrates how to generate the above metrics and charts. The data has to be split into a testing
and training set, with the features in trainx and testx and the responses in trainy and testy.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from ads.common.data import ADSData
from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator

seed = 42
# The dataset, split, and model fitting are assumed; the source omits them.
X, y = make_regression(n_samples=10000, n_features=10, noise=10.0, random_state=seed)
trainx, testx, trainy, testy = train_test_split(X, y, test_size=0.30, random_state=seed)
lin_reg = LinearRegression().fit(trainx, trainy)
lasso_reg = Lasso(alpha=0.1).fit(trainx, trainy)
lin_reg_model = ADSModel.from_estimator(lin_reg)
lasso_reg_model = ADSModel.from_estimator(lasso_reg)
reg_evaluator = ADSEvaluator(
    ADSData(testx, testy), models=[lin_reg_model, lasso_reg_model]
)
print(reg_evaluator.metrics)
reg_evaluator.metrics
reg_evaluator.show_in_notebook()
This code snippet demonstrates how to add a custom metric, Number Correct, to the evaluator.
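That snippet is also cut off in this extract; a minimal sketch of the idea, assuming the regression evaluator above and
counting predictions that land within a tolerance of the true value (the helper name and tolerance are illustrative):
import numpy as np

def num_correct(y_true, y_pred, tolerance=1.0):
    # Count predictions within `tolerance` of the true value.
    return int(np.sum(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= tolerance))

reg_evaluator.add_metrics([num_correct], ["Number Correct"])
reg_evaluator.metrics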
Prerequisites
• Currently Oracle AutoML and MLX libraries are available only via Data Science Conda Packs.
• See here for supported conda packs
• To install conda packs locally, see Working with Conda Packs
Machine learning and deep learning are becoming ubiquitous due to:
• The ability to solve complex problems in a variety of different domains.
• The growth in the performance and efficiency of modern computing resources.
• The widespread availability of large amounts of data.
However, as the size and complexity of problems continue to increase, so does the complexity of the machine learning
algorithms applied to these problems. The inherent and growing complexity of machine learning algorithms limits the
ability to understand what the model has learned or why a given prediction was made, acting as a barrier to the adoption
of machine learning. Additionally, there may be legal or regulatory requirements to be able to explain the outcome of
a prediction from a machine learning model, which can force the use of simpler, more interpretable models at the cost of accuracy.
Machine learning explainability (MLX) is the process of explaining and interpreting machine learning and deep learn-
ing models.
MLX can help machine learning developers to:
11.5.1.1 Overview
Similar to Partial Dependence Plots (PDP), Accumulated Local Effects (ALE) is a model-agnostic global explanation
method that evaluates the relationship between feature values and target variables. However, in the event that features
are highly correlated, PDP may include unlikely combinations of feature values in the average prediction calculation
due to the independent manipulation of feature values across the marginal distribution. This lowers the trust in the PDP
explanation when features have strong correlation. Unlike PDP, ALE handles feature correlations by averaging and
accumulating the difference in predictions across the conditional distribution, which isolates the effects of the specific
feature. This comes at the cost of requiring a larger number of observations and a near uniform distribution of those
observations so that the conditional distribution can be reliably determined.
11.5.1.2 Description
ALE highlights the effects that specific features have on the predictions of a machine learning model by partially
isolating the effects of other features. Therefore, it tends to be robust against correlated features. The resulting ALE
explanation is centered around the mean effect of the feature, such that the main feature effect is compared relative to
the average prediction of the data.
Correlated features can negatively affect the quality of many explanation techniques. Specifically, many challenges
arise when the black-box model is used to make predictions on unlikely artificial data, that is, data that fall outside
of the expected data distribution but are used in an explanation because the features are not independent and the technique
is not sensitive to this possibility. This can occur, for example, when the augmented data samples are not generated
according to the feature correlations, or when the effects of other correlated features are included in the evaluation of the feature
of interest. Consequently, the resulting explanations may be misleading. In the context of PDP, the effect of a given
feature may be heavily biased by the interactions with other features.
To address the issues associated with correlated features, ALE:
• Uses the conditional distribution of the feature of interest to generate augmented data. This tends to create more
realistic data than using the marginal distribution. This helps to ensure that evaluated feature values, e.g., xi, are
only compared with instances from the dataset that have similar values to xi.
• Calculates the average of the differences in model predictions over the augmented data, instead of the average of
the predictions themselves. This helps to isolate the effect of the feature of interest. For example, assuming we are
evaluating the effect of a feature at value xi, ALE computes the average of the difference in model predictions of
the values in the neighborhood of xi, that is, the observations within a small interval around xi that meet the
conditional requirement. This helps to reduce the effects of correlated features.
The following example demonstrates the challenges with accurately evaluating the effect of a feature on a model’s
predictions when features are highly correlated. Let us assume that features x1 and x2 are highly correlated. We can
artificially construct x2 by starting with x1 and adding a small amount of random noise. Further assume that the target
value is the product of these two features (e.g., y = x1 * x2). Since x1 and x2 are almost identical, the target value has
a quadratic relationship with them. A decision tree is trained on this dataset. Then different explanation techniques,
PDP (first column), ICE (second column), and ALE (third column), are used to evaluate the effect of the features on
the model predictions. Features x1 and x2 are evaluated in the first and second row, respectively. The following image
demonstrates that PDP is unable to accurately identify the expected relationship due to the assumption that the features
are not correlated. An examination of the ICE plots reveals the quadratic relationship between the features and the
target. However, when taken as an aggregate, this effect disappears. In contrast, ALE is able to properly capture
the isolated effect of each feature, highlighting the quadratic relationship.
The following summarizes the steps in computing ALE explanation (note: MLX supports one-feature ALE):
• Start with a trained model.
• Select a feature to explain (for example, one of the important features identified in the global feature importance
explanations).
• Compute the intervals of the selected feature to define the upper and lower bounds used to compute the difference
in model predictions when the feature is increased or decreased.
– Numerical features: using the selected feature’s value distribution extracted from the train dataset, MLX
selects multiple different intervals from the feature’s distribution to evaluate (e.g., based on percentiles).
The number of intervals to use and the range of the feature’s distribution to consider are configurable.
– Categorical features: since ALE computes the difference in model predictions between an increase and
decrease in a feature’s value, features must have some notion of order. This can be challenging for cate-
gorical features, as there may not be a notion of order (e.g., eye color). To address this, MLX estimates
the order of categorical feature values based on a categorical feature encoding technique. MLX provides
multiple different encoding techniques based on the input data (e.g., distance_similarity computes
a similarity matrix between all categorical feature values and the other feature values, and orders based on
similarity). Target-based approaches estimate the similarity/order based on the relationship of categorical
feature values with the target variable. The supported techniques include target encoding (target), James-
Stein encoding (jamesstein), Generalized Linear Mixed Model encoding (glmm), M-estimate encoding
(mestimate), and Weight of Evidence encoding (woe). The categorical feature value order is then used to
compute the upper (larger categorical value) and lower (smaller categorical value) bounds for the selected
categorical feature.
• For each interval, MLX approximates the conditional distribution by identifying the samples that are in the
neighborhood of the sample of interest. It then calculates the difference in the model prediction when the selected
feature’s value of the samples is replaced by the upper and lower limits of the interval. If N different intervals are
selected from the feature’s distribution, this process results in 2N different augmented datasets (it is 2N because the
selected feature of each sample is replaced with both the upper and the lower limit of the interval). The model inference
then generates 2N different model predictions, which are used to calculate the N differences.
• The prediction differences within each interval are averaged and accumulated in order, such that the ALE of a
feature value that lies in the k-th interval is the sum of the effects of the first through the k-th interval.
• Finally, the accumulated feature effects at each interval are centered, such that the mean effect is zero. A minimal sketch of this procedure for a numerical feature is shown below.
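The following is a NumPy-based sketch of one-feature ALE for a numerical feature, not the MLX implementation itself;
it assumes a fitted model with a .predict() method and a pandas DataFrame X, and uses simple unweighted centering
for brevity:
import numpy as np
import pandas as pd

def ale_1d(model, X: pd.DataFrame, feature: str, n_intervals: int = 20):
    # Interval edges taken from the feature's empirical distribution (percentiles).
    edges = np.unique(np.quantile(X[feature], np.linspace(0, 1, n_intervals + 1)))
    # Assign every observation to the interval its feature value falls into.
    idx = np.digitize(X[feature], edges[1:-1])
    effects = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        members = X[idx == k]
        if members.empty:
            continue
        lower, upper = members.copy(), members.copy()
        lower[feature] = edges[k]      # replace with the interval's lower bound
        upper[feature] = edges[k + 1]  # replace with the interval's upper bound
        # The average difference in predictions isolates the local effect of the feature.
        effects[k] = np.mean(model.predict(upper) - model.predict(lower))
    ale = np.cumsum(effects)           # accumulate the per-interval effects in order
    return edges, ale - ale.mean()     # center so that the mean effect is zero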
11.5.1.3 Interpretation
• Continuous or discrete numerical features: Visualized as line graphs. Each line represents the change in the
model prediction when the selected feature has the given value compared to the average prediction. For example,
an ALE value of ±b at xj = k indicates that when the value of feature j is equal to k, the model prediction is
higher/lower by b compared to the average prediction. The x-axis shows the selected feature values and the
y-axis shows the delta in the target prediction variable relative to the average prediction (e.g., the prediction
probability for classification tasks and the raw predicted values for regression tasks).
• Categorical features: Visualized as vertical bar charts. Each bar represents the change in the model prediction
when the selected feature has the given value compared to the average prediction. The interpretation of the value
of the bar is similar to continuous features. The x-axis shows the different categorical values for the selected
feature and the y-axis shows the change in the predicted value relative to the average prediction. This would be
the prediction probability for classification tasks and the raw predicted values for regression tasks.
11.5.1.4 Limitations
There is an increased computational cost for performing an ALE analysis because of the large number of model
evaluations that need to be computed relative to PDP. On a small dataset, this is generally not an issue. However, on
larger datasets it can be. It is possible to parallelize the process and also to compute it in a distributed manner.
The main disadvantage comes from the problem of sparsity of data. There needs to be a sufficient number of observa-
tions in each neighborhood that is used in order to make a reasonable estimation. Even with a large dataset this can be
problematic if the data is not uniformly sampled, which is rarely the case. Also, with higher dimensionality the problem
is made increasingly more difficult because of the curse of dimensionality.
Depending on the class of model that is being used, it is common practice to remove highly correlated features. In
these cases there is some rationale for using a PDP for interpretation. However, if there is correlation in the data and the
sampling of the data is suitable for an ALE analysis, it may be the preferred approach.
11.5.1.5 Examples
The following is a purposefully extreme, but realistic, example that demonstrates the effects of highly correlated features
on PDP and ALE explanations. The data set has three columns, x1, x2 and y.
• x1 is generated from a uniform distribution with a range of [-5, 5].
• x2 is x1 with some noise. x1 and x2 are highly correlated for illustration purposes.
• y is our target, which is generated from an interaction term of x1 * x2 and x2 (that is, y = x1 * x2 + x2).
This model is trained using a Sklearn RegressorMixin model and wrapped in an ADSModel object. Please note that the
ADS model explainers work with any model that is wrapped in an ADSModel object.
import numpy as np
import pandas as pd
from ads.dataset.factory import DatasetFactory
from ads.common.model import ADSModel
from sklearn.base import RegressorMixin

x1 = (np.random.rand(500) - 0.5) * 10
x2 = x1 + np.random.normal(loc=0, scale=0.5, size=500)
y = x1 * x2 + x2
# The dataset assembly below is assumed; the source omits it.
correlated_ds = DatasetFactory.open(pd.DataFrame({"x1": x1, "x2": x2, "y": y}), target="y")
correlated_train, _ = correlated_ds.train_test_split(test_size=0)

class CorrelatedRegressor(RegressorMixin):
    '''
    implement the true model
    '''
    def fit(self, X=None, y=None):
        self.y_bar_ = X.iloc[:, 0].to_numpy() * X.iloc[:, 1].to_numpy() + X.iloc[:, 1].to_numpy()

    def predict(self, X=None):
        # The predict() method is assumed; the source cuts off after fit().
        return X.iloc[:, 0].to_numpy() * X.iloc[:, 1].to_numpy() + X.iloc[:, 1].to_numpy()
The PDP plot shows a rug plot of the actual x1 values along the x-axis, and the relationship between x1 and y appears
as a line. However, it is known that the true relationship is not linear. y is the product of x1 and x2. Since x2 is nearly
identical to x1, effectively the relationship between x1 and y is quadratic. The high level of correlation between x1
and x2 violates one of the assumptions of the PDP. As demonstrated, the bias created by this correlation results in a
poor representation of the global relationship between x1 and y.
In comparison, the ALE plot does not have as strong a requirement that the features are uncorrelated. As such, there is
very little bias introduced when they are. The following ALE plot demonstrates that it is able to accurately represent
the relationship between x1 and y as being quadratic. This is due to the fact that ALE uses the conditional distribution
of these two features. This can be thought of as only using those instances where the values of x1 and x2 are close.
In general, ALE plots are unbiased with correlated features as they use conditional probabilities. The PDP method uses
the marginal probability, which can introduce a bias when there are highly correlated features. The advantage of PDP is that
when the data is not rich enough to adequately determine all of the conditional probabilities, or when the features are
not highly correlated, it can be an effective method to assess the global impact of a feature in a model.
11.5.1.6 References
11.5.2.1 Overview
Feature Dependence Explanations (PDP and ICE) are model-agnostic global explanation methods that evaluate the
relationship between feature values and model target predictions.
11.5.2.2 Description
PDP and ICE highlight the marginal effect that specific features have on the predictions of a machine learning model.
These explanation methods visualize the effects that different feature values have on the model’s predictions.
These are the main steps in computing PDP or ICE explanations:
• Start with a trained machine learning model.
• Select a feature to explain (for example, one of the important features identified in the global feature permutation
importance explanations.)
• Using the selected feature’s value distribution extracted from the training dataset, ADS selects multiple different
values from the feature’s distribution to evaluate. The number of values to use and the range of the feature’s
distribution to consider are configurable.
• ADS replaces every sample in the provided dataset with the same feature value from the feature distribution
and computes the model inference on the augmented dataset. This process is repeated for all of the selected
values from the feature’s distribution. If N different values are selected from the feature’s distribution, this
process results in N different datasets. Each with the selected feature having the same value for all samples in the
corresponding dataset. The model inference then generates N different model predictions, each with M values
(one for each sample in the augmented dataset.)
• For ICE, the model predictions for each augmented sample in the provided dataset are considered separately
when the selected feature’s value is replaced with a value from the feature distribution. This results in N x M
different values.
• For PDP, the average model prediction is computed across all augmented dataset samples. This results in N
different values (each an average of M predictions).
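The one-feature procedure above can be condensed into a short NumPy sketch; this is an illustration of the algorithm,
not the ADS implementation, and it assumes a fitted model with .predict() and a pandas DataFrame X:
import numpy as np
import pandas as pd

def pdp_ice_1d(model, X: pd.DataFrame, feature: str, n_values: int = 30):
    # Values to evaluate, drawn from the feature's empirical distribution.
    grid = np.quantile(X[feature], np.linspace(0.05, 0.95, n_values))
    ice = np.empty((len(grid), len(X)))  # N x M individual (ICE) predictions
    for i, value in enumerate(grid):
        X_aug = X.copy()
        X_aug[feature] = value           # every sample gets the same feature value
        ice[i, :] = model.predict(X_aug)
    pdp = ice.mean(axis=1)               # average over the M samples -> N PDP values
    return grid, pdp, ice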
The preceding is an example of one-feature PDP and ICE explanations. PDP also supports two-feature explanations
while ICE only supports one feature. The main steps of the algorithm are the same though the explanation is computed
on two features instead of one.
• Select two features to explain.
• ADS computes the cross-product of values selected from the feature distributions to generate a list of different
value combinations for the two selected features. For example, assuming we have selected N values from the
feature distribution for each feature:
$[(X_1^1, X_2^1), (X_1^1, X_2^2), \ldots, (X_1^1, X_2^{N-1}), (X_1^1, X_2^N), (X_1^2, X_2^1), (X_1^2, X_2^2), \ldots, (X_1^N, X_2^{N-1}), (X_1^N, X_2^N)]$
• For each feature value combination, ADS replaces every sample in the provided set with these two feature values
and computes the model inference on the augmented dataset. There are M different samples in the provided
dataset and N different values for each selected feature. This results in 𝑁 2 predictions from the model, each an
average of M predictions.
11.5.2.3 Interpretation
11.5.2.3.1 PDP
• One-feature
– Continuous or discrete numerical features: Visualized as line graphs, each line represents the average pre-
diction from the model (across all samples in the provided dataset) when the selected feature is replaced
with the given value. The x-axis shows the selected feature values and the y-axis shows the predicted target
(e.g., the prediction probability for classification tasks and the raw predicted values for regression tasks).
– Categorical features: Visualized as vertical bar charts. Each bar represents the average prediction from
the model (across all samples in the provided dataset) when the selected feature is replaced with the given
value. The x-axis shows the different values for the selected feature and the y-axis shows the predicted
target (e.g., the prediction probability for classification tasks and the raw predicted values for regression
tasks).
• Two-feature
– Visualized as a heat map. The x and y-axis both show the selected feature values. The heat map color
represents the average prediction from the model (across all samples in the provided dataset) when the
selected features are replaced with the corresponding values.
11.5.2.3.2 ICE
• Continuous or discrete numerical features: Visualized as line graphs. While PDP shows the average prediction
across all samples in the provided dataset, ICE plots every sample from the provided dataset (when the selected
feature is replaced with the given value) separately. The x-axis shows the selected feature values and the y-axis
shows the predicted target (for example, the prediction probability for classification tasks and the raw predicted
values for regression tasks). The median value can be plotted to highlight the trend. The ICE plots can also
be centered around the first prediction from the feature distribution (for example, each prediction subtracts the
predicted value from the first sample).
• Categorical features: Visualized as violin plots. The x-axis shows the different values for the selected feature
and the y-axis shows the predicted target (for example, the prediction probability for classification tasks and the
raw predicted values for regression tasks).
Both PDP and ICE visualizations display the feature value distribution from the training dataset on the corresponding
axis. For example, the one-feature line graphs, bar charts, and violin plots show the feature value distribution on the
x-axis. The heat map shows the feature value distributions on the respective x-axis or y-axis.
11.5.2.4 Examples
The following example generates and visualizes global partial dependence plot (PDP) and Individual Conditional Ex-
pectation (ICE) explanations on the Titanic dataset. The model is constructed using the ADS OracleAutoMLProvider
(selected model: XGBClassifier), however, the ADS model explainers work with any model (classifier or regressor)
that is wrapped in an ADSModel object.
# `pdp_age` below is assumed to be a PDP/ICE explanation for the "age" feature computed
# earlier with the ADS global explainer; that setup code is omitted in this extract.
# Visualize the numerical feature PDP for the True (Survived) label
pdp_age.show_in_notebook(labels=True)
# Visualize the ICE plot for the numerical feature, "age", and center
# around the first prediction (smallest age)
pdp_age.show_in_notebook(mode='ice', labels=True, centered=True)
11.5.2.5 References
11.5.3.1 Overview
Feature permutation importance is a model-agnostic global explanation method that provides insights into a machine
learning model’s behavior. It estimates and ranks feature importance based on the impact each feature has on the trained
machine learning model’s predictions.
11.5.3.2 Description
Feature permutation importance measures the predictive value of a feature for any black box estimator, classifier, or
regressor. It does this by evaluating how the prediction error increases when a feature is not available. Any scoring
metric can be used to measure the prediction error. For example, 𝐹1 for classification or R2 for regression. To avoid
actually removing features and retraining the estimator for each feature, the algorithm randomly shuffles the feature
values effectively adding noise to the feature. Then, the prediction error of the new dataset is compared with the
prediction error of the original dataset. If the model heavily relies on the column being shuffled to accurately predict
the target variable, this random re-ordering causes less accurate predictions. If the model does not rely on the feature
for its predictions, the prediction error remains unchanged.
The following summarizes the main steps in computing feature permutation importance explanations:
• Start with a trained machine learning model.
• Calculate the baseline prediction error on the given dataset. For example, train dataset or test dataset.
• For each feature:
1. Randomly shuffle the feature column in the given dataset.
2. Calculate the prediction error on the shuffled dataset.
3. Store the difference between the baseline score and the shuffled dataset score as the feature importance. For
example, baseline score - shuffled score.
• Repeat the preceding three steps multiple times then report the average. Averaging mitigates the effects of random
shuffling.
• Rank the features based on the average impact each feature has on the model’s score. Features that have a larger
impact on the score when shuffled are assigned higher importance than features with minimal impact on the
model’s score.
• In some cases, randomly permuting an unimportant feature can actually have a positive effect on the model’s
prediction so the feature’s contribution to the model’s predictions is effectively noise. In the feature permutation
importance visualizations, ADS caps any negative feature importance values at zero.
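The following is a minimal sketch of this procedure, not the MLX implementation; it assumes a fitted classifier with
.predict(), a pandas DataFrame X, labels y, and uses a weighted F1 as the scoring metric:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

def permutation_importance(model, X: pd.DataFrame, y, n_iterations: int = 5, seed: int = 42):
    rng = np.random.default_rng(seed)
    baseline = f1_score(y, model.predict(X), average="weighted")
    importances = {}
    for feature in X.columns:
        drops = []
        for _ in range(n_iterations):
            X_shuffled = X.copy()
            X_shuffled[feature] = rng.permutation(X_shuffled[feature].to_numpy())
            drops.append(baseline - f1_score(y, model.predict(X_shuffled), average="weighted"))
        # Average over iterations and cap negative importances at zero, as ADS does.
        importances[feature] = max(float(np.mean(drops)), 0.0)
    # Rank features by their average impact on the score.
    return dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True))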
11.5.3.3 Interpretation
Feature permutation importance explanations generate an ordered list of features along with their importance values.
Interpreting the output of this algorithm is straightforward. Features located at higher ranks have more impact on the
model predictions. Features at lower ranks have less impact on the model predictions. Additionally, the importance
values represent the relative importance of features.
The output supports three types of visualizations. They are all based on the same data but present the data differently
for various use cases:
• Bar chart ('bar'): The bar chart shows the model’s view of the relative feature importance. The x-axis high-
lights feature importance. A longer bar indicates higher importance than a shorter bar. Each bar also shows the
average feature importance value along with the standard deviation of importance values across all iterations of
the algorithm (mean importance +/- standard deviation). Negative importance values are capped at zero. The
y-axis shows the different features in the relative importance order. The top being the most important, and the
bottom being the least important.
• Box plot ('box_plot'): The detailed box plot shows the feature importance values across the iterations of
the algorithm. These values are used to compute the average feature importance and the corresponding standard
deviations shown in the bar chart. The x-axis shows the impact that permuting a given feature had on the model’s
prediction score. The y-axis shows the different features in the relative importance order. The top being the most
important, and the bottom being the least important. The minimum, first quartile, median, third quartile, and a
maximum of the feature importance values across different iterations of the algorithm are shown by each box.
• Detailed scatter plot ('detailed'): The detailed scatter plot shows the feature importance values for each iter-
ation of the algorithm. These values are used to compute the average feature importance values and the corre-
sponding standard deviations shown in the bar chart. The x-axis shows the impact that permuting a given feature
had on the model’s prediction score. The y-axis shows the different features in the relative importance order.
The top being the most important, and the bottom being the least important. The color of each dot in the graph
indicates the quality of the permutation for this iteration, which is computed by measuring the correlation of
the permuted feature column relative to the original feature column. For example, how different is the permuted
feature column versus the original feature column.
11.5.3.4 Examples
This example generates and visualizes a global feature permutation importance explanation on the Titanic dataset. The
model is constructed using the ADS OracleAutoMLProvider. However, the ADS model explainers work with any
model (classifier or regressor) that is wrapped in an ADSModel object.
import logging
import requests
11.5.3.5 References
• Feature importance
• Permutation importance
• Vanderbilt Biostatistics - titanic data
11.5.4.1 Overview
Local explanations target specific predictions from the machine learning model. The goal is to understand why the
model made a particular prediction.
There are multiple different forms of local explanations, such as feature attribution explanations and exemplar-based
explanations. ADS supports local feature attribution explanations. They help to identify the most important features
leading towards a given prediction.
While a given feature might be important for the model in general, the values in a particular sample may cause certain
features to have a larger impact on the model’s predictions than others. Furthermore, given the feature values in a
specific sample, local explanations can also estimate the contribution that each feature had towards or against a target
prediction. For example, does the current value of the feature have a positive or negative effect on the prediction
probability of the target class? Does the feature increase or decrease the predicted regression target value?
The Enhanced Local Interpretable Model-Agnostic Explanation (LIME) is a model-agnostic local explanation method.
It provides insights into why a machine learning model made a specific prediction.
11.5.4.2 Description
ADS provides an enhanced version of Local Interpretable Model-Agnostic Explanations (LIME), which improves on
the explanation quality, performance, and interpretability. The key idea behind LIME is that while the global behavior
of a machine learning model might be very complex, the local behavior may be much simpler. In ADS, local refers to
the behavior of the model on similar samples. LIME tries to approximate the local behavior of the complex machine
learning model through the use of a simple, inherently interpretable surrogate model. For example, a linear model. If
the surrogate model is able to accurately approximate the complex model’s local behavior, ADS can generate a local
explanation of the complex model from the interpretable surrogate model. For example, when data is centered and
scaled the magnitude and sign of the coefficients in a linear model indicate the contribution each feature has towards
the target variable.
The predictions from complex machine learning models are challenging to explain and are generally considered as a
black box. As such, ADS refers to the model to be explained as the black box model. ADS supports classification and
regression models on tabular or text-based datasets (containing a single text-based feature).
The main steps in computing a local explanation for tabular datasets are:
• Start with a trained machine learning model (the black box model).
• Select a specific sample to explain (xexp ).
• Randomly generate a large sample space in a nearby neighborhood around xexp . The sample space is generated
based on the feature distributions from the training dataset. Each sample is then weighted based on its distance
from xexp to give higher weight to samples that are closer to xexp . ADS provides several enhancements, over the
standard algorithm, to improve the quality and locality of the sample generation and weighting methods.
• Using the black box model, generate a prediction for each of the randomly generated local samples. For classifi-
cation tasks, compute the prediction probabilities using predict_proba(). For regression tasks, compute the
predicted regression value using predict().
• Fit a linear surrogate model on the predicted values from the black box model on the local generated sample space.
If the surrogate model is able to accurately match the output of the black box model (referred to as surrogate model
fidelity), the surrogate model can act as a proxy for explaining the local behavior of the black box model. For
classification tasks, the surrogate model is a linear regression model fit on the prediction probabilities of the
black box model. Consequently, for multinomial classification tasks, a separate surrogate model is required to
explain each class. In that case, the explanation indicates if a feature contributes towards the specified class or
against the specified class (for example, towards one of the other N classes). For regression tasks, the surrogate
model is a linear regression model fit on the predicted regression values from the black box model.
• There are two available techniques for fitting the surrogate model:
– Use the features directly:
The raw (normalized) feature values are used to fit the linear surrogate model directly. This results in
a normal linear model. A positive coefficient indicates that when the feature value increases, the target
variable increases. A negative coefficient indicates that when a feature value increases, the target variable
decreases. Categorical features are converted to binary values. A value of 1 indicates that the feature in
the generated sample has the same value as xexp and a value of 0 indicates that the feature in the generated
sample has a different value than xexp .
– Translate the features to an interpretable feature space:
Continuous features are converted to categorical features by discretizing the feature values (for example,
quartiles, deciles, and entropy-based). Then, all features are converted to binary values. A value of 1 indi-
cates that the feature in the generated sample has the same value as xexp (for example, the same categorical
value or the continuous feature falls in the same bin) and a value of 0 indicates that the feature in the gen-
erated sample has a different value than xexp (for example, a different categorical value or the continuous
feature falls in a different bin). The interpretation of the linear model here is a bit different from the regres-
sion model. A positive coefficient indicates that when a feature has the same value as xexp (for example, the
same category), the feature increased the prediction output from the black box model. Similarly, negative
coefficients indicate that when a feature has the same value as xexp , the feature decreased the prediction
output from the black box model. This does not say what happens when the feature is in a different cate-
gory than xexp . It only provides information when the specific feature has the same value as xexp and if it
positively or negatively impacts the black box model’s prediction.
• The explanation is an ordered list of feature importances extracted from the coefficients of the linear surrogate
model. The magnitude of the coefficients indicates the relative feature importance and the sign indicates whether
the feature has a positive or negative impact on the black box model’s prediction.
• The algorithm is similar for text-based datasets. The main difference is in the random local sample space genera-
tion. Instead of randomly generating samples based on the feature distributions, a large number of local samples
are generated by randomly removing subsets of words from the text sample. Each of the randomly generated
samples is converted to a binary vector based on the existence of a word. For example, the original sample to
explain, xexp , contains 1s for every word. If the randomly generated sample has the same word as xexp , it is a
value of 1. If the word has been removed in the randomly generated sample, it is a value of 0. In this case, the
linear surrogate model evaluates the behavior of the model when the word is there or not.
Additionally, an upper bound can be set on the number of features to include in the explanation (for example, explain
the top-N most important features). If the specified number of features is less than the total number of features, a simple
feature selection method is applied prior to fitting the linear surrogate model. The black box model is still evaluated on
all features, but the surrogate model is only fit on the subset of features.
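The following is a LIME-style sketch for a tabular classifier, not the enhanced ADS implementation; it assumes a fitted
black-box model with .predict_proba(), a training DataFrame X_train, and a single row x_exp to explain, and the
kernel-width and ridge-penalty choices are illustrative:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def lime_local_explanation(model, X_train: pd.DataFrame, x_exp: pd.Series,
                           target_class: int = 1, n_samples: int = 5000, seed: int = 42):
    rng = np.random.default_rng(seed)
    mu = X_train.mean().to_numpy()
    sigma = X_train.std().to_numpy() + 1e-12
    # Randomly generate a local sample space around x_exp from the feature distributions.
    samples = pd.DataFrame(
        rng.normal(loc=x_exp.to_numpy(), scale=sigma, size=(n_samples, X_train.shape[1])),
        columns=X_train.columns,
    )
    # Weight each sample by its distance from x_exp (closer samples get higher weight).
    distances = np.linalg.norm((samples.to_numpy() - x_exp.to_numpy()) / sigma, axis=1)
    weights = np.exp(-(distances ** 2) / (np.median(distances) ** 2))
    # Black-box predictions, then a weighted linear surrogate fit on the probabilities.
    proba = model.predict_proba(samples)[:, target_class]
    surrogate = Ridge(alpha=1.0).fit((samples.to_numpy() - mu) / sigma, proba, sample_weight=weights)
    # The coefficients give the ordered local feature importances (sign = direction).
    return sorted(zip(X_train.columns, surrogate.coef_), key=lambda t: abs(t[1]), reverse=True)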
11.5.4.3 Interpretation
ADS provides multiple enhancements to the local visualizations from LIME. The explanation is presented as a grid con-
taining information about the black box model, information about the local explainer, and the actual local explanation.
Each row in the grid is described as:
• Model (first row)
– The left column presents information about the black box model and the model’s prediction. For example,
the type of the black box model, the true label/value for the selected sample to explain, the predicted value
from the black box model, and the prediction probabilities (classification) or prediction values (regression).
– The right column displays the sample to explain. For tabular datasets, this is a table showing the feature
names and corresponding values for this sample. For text datasets, this shows the text sample to explain.
• Explainer (second row)
– The left column presents the explainer configuration parameters, such as the underlying local explanation
algorithm used (for example, LIME), the type of surrogate model (for example, linear), the number of
randomly generated local samples (for example, 5000) to train the local surrogate model (𝑁𝑡 ), whether
continuous features were discretized or not.
– The right column provides a legend describing how to interpret the model explanations.
• Explanations (remaining rows)
– For classification tasks, a local explanation can be generated for each of the target labels (since the surrogate
model is fit to the prediction probabilities from the black box model). For binary classification, the expla-
nation for one class will mirror the other. For multinomial classification, the explanations describe how
each feature contributes towards or against the specified target class. If the feature contributes against the
specified target class (for example, decreases the prediction probability), it increases the prediction proba-
bility of one or more other target classes. The explanation for each target class is shown as a separate row
in the Explanation section.
– The Feature Importances section presents the actual local explanation. The explanation is visualized as a
horizontal bar chart of feature importance values, ordered by relative feature importance. Features with
larger bars (top) are more important than features with shorter bars (bottom). Positive feature importance
values (to the right) indicate that the feature increases the prediction target value. Negative feature im-
portance values (to the left) indicate that the feature decreases the prediction target value. Depending on
whether continuous features are discretized or not changes the interpretation of this value (for example,
whether the specific feature value indicates a positive/negative attribution, or whether an increase/decrease
in the feature value indicates a positive/negative attribution). If the features are discretized, the correspond-
ing range is included. The feature importance value is shown beside each bar. This can either be the raw
coefficient taken from the linear surrogate model or can be normalized such that all importance values sum
to one. For text datasets, the explanation is visualized as a word cloud. Important words that have a large
positive contribution towards a given prediction (for example, increase the prediction value) are shown
larger than unimportant words that have a less positive impact on the target prediction.
• The Explanation Quality section presents information about the quality of the explanation. It is further broken
down into two sections:
– Sample Distance Distributions
This section presents the sample distributions used to train (𝑁𝑡 ) and evaluate (𝑁𝑣# ) the local surrogate
model based on the distances (Euclidean) of the generated samples from the sample to explain. This high-
lights the locality of generated sample spaces where the surrogate model (explainer) is trained and evaluated.
The distance distribution from the sample to explain for the actual dataset used to train the black box model,
Train, is also shown. This highlights the locality of 𝑁𝑡 relative to the entire train dataset. For the generated
evaluation sample spaces (𝑁𝑣# ), the sample space is generated based on a percentile value of the distances
in Train relative to the sample to explain. For example, 𝑁𝑣4 is generated with the maximum distance being
limited to the 4th percentile of the distances in train from the sample to explain.
– Evaluation Metrics
This section presents the fidelity of the surrogate model relative to the black box model on the randomly
generated sample spaces used to fit and evaluate the surrogate model. In other words, this section evaluates
how accurately the surrogate model approximates the local behavior of the complex black box model. Mul-
tiple different regression and classification metrics are supported. For classification tasks, ADS supports
both regression and classification metrics. Regression metrics are computed on the raw prediction prob-
abilities between the surrogate model and the black box model. For classification metrics, the prediction
probabilities are converted to the corresponding target labels and are compared between the surrogate model
and the black box model. Explanations for regression tasks only support regression metrics. Supported re-
gression metrics: MSE, RMSE (default), R2 , MAPE, SMAPE, Two-Sample Kolmogorov-Smirnov Test,
Pearson Correlation (default), and Spearman Correlation. Supported classification metrics: 𝐹1 , Accuracy,
Recall, and ROC_AUC.
– Performance
Explanation time in seconds.
11.5.4.4 Example
This example generates and visualizes local explanations on the Titanic dataset. The model is constructed using the
ADS OracleAutoMLProvider. However, the ADS model explainers work with any model (classifier or regressor)
that is wrapped in an ADSModel object.
import logging
import requests

# `local_explainer` is assumed to be an ADS local explanation object for the trained model,
# and `sample` an integer index into the test set; that setup is omitted in this extract.
# Compute the local explanation on our sample from the test set
explanation = local_explainer.explain(
    test.X.iloc[sample:sample + 1], test.y.iloc[sample:sample + 1]
)
11.5.4.5 References
• LIME
• Vanderbilt Biostatistics - titanic data
• Why Should I Trust You? Explaining the Predictions of Any Classifier
11.5.5.1 Description
The WhatIf explainer tool helps to understand how changes in an observation affect a model’s prediction. Use it to
explore a model’s behavior on a single observation or the entire dataset by asking “what if” questions.
The WhatIf explainer has the following methods:
• explore_predictions: Explore the relationship between feature values and the model predictions.
• explore_sample: Modify the values in an observation and see how the prediction changes.
11.5.5.2 Example
In this example, a WhatIf explainer is created, and then the explore_predictions(), and explore_sample()
methods are demonstrated. A tree-based model is used to make predictions on the Boston housing dataset.
import logging
import warnings
from ads.common.model import ADSModel
from ads.dataset.dataset_browser import DatasetBrowser
from ads.dataset.label_encoder import DataFrameLabelEncoder
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import make_pipeline

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)
warnings.filterwarnings('ignore')

ds = DatasetBrowser.sklearn().open("boston").set_target("target")
train, test = ds.train_test_split(test_size=0.2)
X_boston = train.X.copy()
y_boston = train.y.copy()
le = DataFrameLabelEncoder()
X_boston = le.fit_transform(X_boston)

# Model Training
ensemble_regressor = ExtraTreesRegressor(n_estimators=245, random_state=42)
ensemble_regressor.fit(X_boston, y_boston)
model = ADSModel.from_estimator(make_pipeline(le, ensemble_regressor), name="ExtraTreesRegressor")
# `whatif_explainer` (used below) is assumed to be created from this model and the test
# set with the ADS WhatIf explainer; that step is omitted in this extract.
The Sample Explorer method, explore_sample(), opens a GUI that has a single observation. The values of that
sample can then be changed. By clicking Run Inference, the model computes the prediction with the updated feature
values. The interface shows the original values and the values that have been changed.
explore_sample() accepts the row_idx parameter that specifies the index of the observation that is to be evaluated.
The default is zero (0). The features parameter lists the feature names that are shown in the interface. By default,
it displays all features. For datasets with a large number of features, this can be cumbersome, so the max_features
parameter can be used to display only the first n features.
The following command opens the Sample Explorer. Change the values then click Run Inference to see how the
prediction changes.
whatif_explainer.explore_sample()
The Predictions Explorer method, explore_predictions(), allows the exploration of model predictions across either
the marginal distribution (1-feature) or the joint distribution (2-features).
The method explore_predictions() has several optional parameters including:
• discretization: (str, optional) Discretization method applied to the x-axis if the feature x is continuous. The
valid options are ‘quartile’, ‘decile’, or ‘percentile’. The default is None.
• label: (str or int, optional) Target label or target class name to explore only for classification problems. The
default is None.
• plot_type: (str, optional) Type of plot. For classification problems the valid options are ‘scatter’, ‘box’, or ‘bar’.
For a regression problem, the valid options are ‘scatter’ or ‘box’. The default is ‘scatter’.
• x: (str, optional) Feature column on x-axis. The default is None.
• y: (str, optional) Feature column or model prediction column on the y-axis, by default it is the target.
When only x is set, the chart shows the relationship between the features x and the target y.
whatif_explainer.explore_predictions(x='AGE')
If features are specified for both x and y, the plot uses color to indicate the value of the target.
whatif_explainer.explore_predictions(x='AGE', y='CRIM')
You can register your model with the OCI Data Science service through ADS. Alternatively, the Oracle Cloud Infras-
tructure (OCI) Console can be used by going to the Data Science projects page, selecting a project, and then clicking Models.
The models page shows the model artifacts that are in the model catalog for a given project.
After a model and its artifacts are registered, they become available for other data scientists if they have the correct
permissions.
Data scientists can:
• List, read, download, and load models from the catalog to their own notebook sessions.
• Download the model artifact from the catalog, and run the model on their laptop or some other machine.
• Deploy the model artifact as a model deployment.
• Document the model use case and algorithm using taxonomy metadata.
• Add custom metadata that describes the model.
• Document the model provenance including the resources and tags used to create the model (notebook session),
and the code used in training.
• Document the input data schema, and the returned inference schema.
• Run introspection tests on the model artifact to ensure that common model artifact errors are flagged. Thus, they
can be remediated before the model is saved to the catalog.
The ADS SDK automatically captures some of the metadata for you. It captures provenance, taxonomy, and some
custom metadata.
12.1 Workflow
ADS has a set of framework-specific classes that take your model and push it to production in a few quick steps.
The first step is to create a model serialization object. This object wraps your model and has a number of methods to
assist in deploying it. There are different serialization classes for different model types. For example, if you have a PyTorch
model you would use the PyTorchModel class. If you have a TensorFlow model you would use the TensorFlowModel
class. ADS has model serialization classes for many different frameworks. However, it is not feasible to have a model
serialization class for all model types. Therefore, the GenericModel class can be used for any model that has a .predict()
method.
After creating the model serialization object, the next step is to use the .prepare() method to create the model artifacts. The score.py file is created and customized to your model class. You may still need to modify it for your specific use case, but this is generally not required. The .prepare() method can also be used to store metadata about the model, the code used to create the model, the input and output schemas, and much more.
If you make changes to the score.py file, call the .verify() method to confirm that the load_model() and
predict() functions in this file are working. This speeds up your debugging as you do not need to deploy a model to
test it.
The .save() method is then used to store the model in the model catalog. A call to the .deploy() method creates a
load balancer and the instances needed to have an HTTPS access point to perform inference on the model. Using the
.predict() method, you can send data to the model deployment endpoint and it will return the predictions.
12.2 Register
ADS can auto-generate the required files to register and deploy your models. Check out the examples below to learn how to deploy models of different frameworks.
12.2.1.1 Sklearn
import tempfile
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
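Only the imports of this quick-start example are shown above. A minimal end-to-end sketch, assuming the iris dataset, a logistic regression estimator, and the generalml_p37_cpu_v1 conda environment (all of these choices are illustrative), could look like the following:
import tempfile
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train a small estimator (illustrative data and model choice)
X, y = load_iris(return_X_y=True)
trainx, testx, trainy, testy = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(trainx, trainy)

# Wrap, prepare, verify, save, deploy, predict, and clean up
sklearn_model = SklearnModel(estimator=model, artifact_dir=tempfile.mkdtemp())
sklearn_model.prepare(inference_conda_env="generalml_p37_cpu_v1", X_sample=trainx, y_sample=trainy)
sklearn_model.verify(testx[:2])
model_id = sklearn_model.save(display_name="SKLearn Model")
sklearn_model.deploy()
sklearn_model.predict(testx[:2])
sklearn_model.delete_deployment(wait_for_completion=True)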
12.2.1.2 XGBoost
Create a model, prepare it, verify that it works, save it to the model catalog, deploy it, make a prediction, and then
delete the deployment.
import tempfile
import xgboost as xgb
from ads.model.framework.xgboost_model import XGBoostModel
from sklearn.datasets import load_iris
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
xgboost_model.prepare(inference_conda_env="generalml_p37_cpu_v1")
12.2.1.3 LightGBM
Create a model, prepare it, verify that it works, save it to the model catalog, deploy it, make a prediction, and then
delete the deployment.
lightgbm_model.prepare(inference_conda_env="generalml_p37_cpu_v1")
12.2.1.4 PyTorch
Create a model, prepare it, verify that it works, save it to the model catalog, deploy it, make a prediction, and then
delete the deployment.
import tempfile
import torch
import torchvision
from ads.model.framework.pytorch_model import PyTorchModel
torch_model.prepare(inference_conda_env="computervision_p37_cpu_v1")
12.2.1.5 Spark
Create a model, prepare it, verify that it works, save it to the model catalog, deploy it, make a prediction, and then
delete the deployment.
import tempfile
import os
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from ads.model.framework.spark_model import SparkPipelineModel
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
# create data
training = spark.createDataFrame(
[
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
],
["id", "text", "label"],
)
12.2.1.6 TensorFlow
Create a model, prepare it, verify that it works, save it to the model catalog, deploy it, make a prediction, and then
delete the deployment.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
tf_estimator = tf.keras.models.Sequential(
[
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10),
]
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
tf_estimator.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
tf_estimator.fit(x_train, y_train, epochs=1)
tf_model.prepare(inference_conda_env="tensorflow27_p37_cpu_v1")
import tempfile
from ads.model.generic_model import GenericModel
generic_model.prepare(
inference_conda_env="dataexpl_p37_cpu_v3",
model_file_name="toy_model.pkl",
force_overwrite=True
)
generic_model.verify(2)
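The prepare() and verify() calls above assume a generic_model object has already been created. A sketch of the surrounding code, using an illustrative Toy estimator (any object with a .predict() method works), could be:
import tempfile
from ads.model.generic_model import GenericModel

class Toy:
    # Any object with a .predict() method can be wrapped by GenericModel.
    def predict(self, x):
        return x ** 2

generic_model = GenericModel(estimator=Toy(), artifact_dir=tempfile.mkdtemp(), serialize=True)

# After the prepare() and verify() calls shown above:
model_id = generic_model.save(display_name="Generic Model")
generic_model.deploy()
generic_model.predict(2)
generic_model.delete_deployment(wait_for_completion=True)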
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
tf_estimator = tf.keras.models.Sequential(
[
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10),
]
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
tf_estimator.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
tf_estimator.fit(x_train, y_train, epochs=1)
tf_model.prepare(inference_conda_env="generalml_p37_cpu_v1",
use_case_type=UseCaseType.MULTINOMIAL_CLASSIFICATION,
X_sample=trainx,
y_sample=trainy
)
12.2.2.3 score.py
In the prepare step, the service automatically generates a score.py file in the artifact directory.
score.py is used by the Data Science Model Deployment service to generate predictions for the input features. Here is a minimal score.py implementation:
import os
import joblib

model_name = "model.joblib"

def load_model(model_file_name=model_name):
    model_dir = os.path.dirname(os.path.realpath(__file__))
    with open(os.path.join(model_dir, model_file_name), "rb") as mfile:
        model = joblib.load(mfile)
    return model
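A minimal score.py also needs a predict() function. The one below is a sketch only, assuming a scikit-learn style estimator whose JSON payload can be converted to a pandas DataFrame; adapt the conversion to whatever input format your model expects:
import pandas as pd

def predict(data, model=load_model()):
    # Convert the incoming JSON payload into a DataFrame and score it with the loaded model.
    df = pd.DataFrame.from_dict(data)
    return {"prediction": model.predict(df).tolist()}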
12.2.2.3.1 load_model
During deployment, the load_model method loads the serialized model. The load_model method is always fully
populated, except when you set serialize=False for GenericModel.
• For the GenericModel class, if you choose serialize=True in the init function, the model is pickled and
score.py is fully auto-populated to support loading the pickled model. Otherwise, the user is responsible for
filling in the load_model function.
• For other frameworks, this part is fully populated.
Note: load_model must return a non-None value for a successful deployment.
12.2.2.3.2 predict
The predict method is triggered every time a payload is sent to the model deployment endpoint. The method takes
the payload and the loaded model as inputs. Based on the payload, the method returns the predicted results output by
the model.
12.2.2.3.3 pre_inference
If the payload passed to the endpoint needs preprocessing, this function does the preprocessing step. The user is fully
responsible for the preprocessing step.
12.2.2.3.4 post_inference
If the predicted result from the model needs some postprocessing, the user can put the logic in this function.
12.2.2.3.5 deserialize
When you use the .verify() or .predict() methods from model classes such as GenericModel or SklearnModel,
if the data passed in is not in bytes or JsonSerializable, the models try to serialize the data. For example, if a pandas
dataframe is passed and not accepted by the deployment endpoint, the pandas dataframe is converted to JSON inter-
nally. When the X_sample variable is passed into the .prepare() function, the data type of pandas dataframe is
passed to the endpoint, and the schema of the dataframe is recorded in the input_schema.json file. Then, the JSON
payload is sent to the endpoint. Because the model expects to take a pandas dataframe, the .deserialize() method
converts the JSON back to the pandas dataframe using the schema and the data type. For all frameworks except for the
GenericModel class, the .deserialize() method is auto-populated. Note that for each framework, only specific
data types are supported.
Starting with version 2.6.3, you can send bytes to the endpoint directly. If a bytes payload is sent to the endpoint, the bytes are passed directly to the model. If the model expects a specific data format, you need to write the conversion logic yourself.
12.2.2.3.6 fetch_data_type_from_schema
This function is used to load the schema from the input_schema.json when needed.
The .introspect() method runs some sanity checks on the runtime.yaml and score.py files. This helps you identify potential errors that might occur during model deployment. It checks fields such as the environment path, validates the path's existence on Object Storage, checks whether the load_model() and predict() functions are defined in score.py, and so on. The result of model introspection is automatically saved to the taxonomy metadata and model artifacts.
tf_model.introspect()
Reloading model artifacts automatically invokes model introspection. However, you can also invoke introspection manually by calling tf_model.introspect():
The ArtifactTestResults field is populated in metadata_taxonomy when introspection is triggered:
tf_model.metadata_taxonomy['ArtifactTestResults']
key: ArtifactTestResults
value:
runtime_env_path:
category: conda_env
description: Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_PATH is set
...
The .save() method saves the model, introspection results, schema, metadata, and so on, to the OCI Data Science service and returns the model OCID.
See API documentation for more details.
The data schema provides a definition of the format and nature of the data that the model expects. It also defines the output data from the model inference. The .populate_schema() method accepts the parameters data_sample or X_sample, and y_sample. When these parameters are used, the model artifact populates the input and output data schemas.
The .schema_input and .schema_output properties are Schema objects that define the schema of each input column
and the output. The Schema object contains these fields:
• description: Description of the data in the column.
• domain: A data structure that defines the domain of the data. The restrictions on the data and summary statistics
of its distribution.
– constraints: A data structure that is a list of expression objects that defines the constraints of the data.
∗ expression: A string representation of an expression that can be evaluated by the language corre-
sponding to the value provided in language attribute. The default value for language is python.
· expression: Required. Use the string.Template format for specifying the expression. $x is
used to represent the variable.
· language: The default value is python. Only python is supported.
– stats: A set of summary statistics that defines the distribution of the data. These are determined using the
feature type statistics as defined in ADS.
– values: A description of the values of the data.
• dtype: Pandas data type
• feature_type: The primary feature type as defined by ADS.
• name: Name of the column.
• required: Boolean value indicating if a value is always required.
{
"description": {
"nullable": true,
"required": false,
"type": "string"
},
"domain": {
"nullable": true,
"required": false,
"schema": {
"constraints": {
"nullable": true,
"required": false,
(continues on next page)
To auto-generate the schema from the training data, provide X_sample and y_sample when preparing the model artifact. For example:
import tempfile
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
sklearn_model.prepare(inference_conda_env="dataexpl_p37_cpu_v3", X_sample=trainx, y_sample=trainy)
{
"schema": [
{
"dtype": "int64",
"feature_type": "Integer",
"name": "class",
"domain": {
"values": "Integer",
"stats": {
"count": 465.0,
"mean": 0.5225806451612903,
"standard deviation": 0.5000278079030275,
"sample minimum": 0.0,
"lower quartile": 0.0,
"median": 1.0,
"upper quartile": 1.0,
"sample maximum": 1.0
(continues on next page)
You can specify a constraint for your data using Expression, and call evaluate to check if the data satisfies the
constraint:
sklearn_model.schema_input['col01'].domain.constraints[0].evaluate(x=0)
True
sklearn_model.model_artifact.populate_schema(X_sample=test.X, y_sample=test.y)
You can also load your schema from a JSON or YAML file:
sklearn_model.schema_output = Schema.from_file('schema.json')
When you register a model, you can add metadata to help with the documentation of the model. Service-defined metadata fields are known as Taxonomy Metadata and user-defined metadata fields are known as Custom Metadata.
Taxonomy metadata includes the type of the model, use case type, libraries, framework, and so on. This metadata provides a way of documenting the schema of the model. The UseCaseType, FrameWork, FrameWorkVersion, Algorithm, and Hyperparameters are fixed taxonomy metadata. These fields are automatically populated when the .prepare() method is called. You can also manually update the values of those fields.
• ads.common.model_metadata.UseCaseType: The machine learning problem associated with the Estimator class. The UseCaseType.values() method returns the most current list of allowed values:
– UseCaseType.ANOMALY_DETECTION
– UseCaseType.BINARY_CLASSIFICATION
– UseCaseType.CLUSTERING
– UseCaseType.DIMENSIONALITY_REDUCTION
– UseCaseType.IMAGE_CLASSIFICATION
– UseCaseType.MULTINOMIAL_CLASSIFICATION
– UseCaseType.NER
– UseCaseType.OBJECT_LOCALIZATION
– UseCaseType.OTHER
– UseCaseType.RECOMMENDER
– UseCaseType.REGRESSION
– UseCaseType.SENTIMENT_ANALYSIS
– UseCaseType.TIME_SERIES_FORECASTING
– UseCaseType.TOPIC_MODELING
• ads.common.model_metadata.FrameWork: The FrameWork of the estimator object. You can get the list
of allowed values using Framework.values():
– FrameWork.BERT
– FrameWork.CUML
– FrameWork.EMCEE
– FrameWork.ENSEMBLE
– FrameWork.FLAIR
– FrameWork.GENSIM
– FrameWork.H2O
– FrameWork.KERAS
– FrameWork.LIGHTgbm
– FrameWork.MXNET
– FrameWork.NLTK
– FrameWork.ORACLE_AUTOML
– FrameWork.OTHER
– FrameWork.PROPHET
– FrameWork.PYOD
– FrameWork.PYMC3
– FrameWork.PYSTAN
– FrameWork.PYTORCH
– FrameWork.SCIKIT_LEARN
– FrameWork.SKTIME
– FrameWork.SPACY
– FrameWork.STATSMODELS
– FrameWork.TENSORFLOW
– FrameWork.TRANSFORMERS
– FrameWork.WORD2VEC
– FrameWork.XGBOOST
• FrameWorkVersion: The framework version of the estimator object. For example, 2.3.1.
• Algorithm: The model class.
• Hyperparameters: The hyperparameters of the estimator object.
You can’t add or delete any of the fields, or change the key of those fields.
You can populate the use_case_type by passing it in the .prepare() method. Or you can set and update it directly.
import tempfile
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from ads.common.model_metadata import UseCaseType
sklearn_model.prepare(inference_conda_env="dataexpl_p37_cpu_v3", X_sample=trainx, y_sample=trainy)
sklearn_model.metadata_taxonomy['UseCaseType'].value = UseCaseType.BINARY_CLASSIFICATION
Update metadata_taxonomy
Update any of the taxonomy fields with allowed values:
sklearn_model.metadata_taxonomy['FrameworkVersion'].value = '0.24.2'
sklearn_model.metadata_taxonomy['UseCaseType'].update(value=UseCaseType.BINARY_CLASSIFICATION)
You can view the metadata_taxonomy in dataframe format by calling .to_dataframe():
sklearn_model.metadata_taxonomy.to_dataframe()
Alternatively, you can view the metadata_taxonomy in YAML format by accessing it directly:
sklearn_model.metadata_taxonomy
data:
- key: FrameworkVersion
value: 0.24.2
- key: ArtifactTestResults
value:
runtime_env_path:
category: conda_env
description: Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_PATH is set
error_msg: In runtime.yaml, the key MODEL_DEPLOYMENT.INFERENCE_ENV_PATH must
have a value.
success: true
value: oci://licence_checker@ociodscdev/conda_environments/cpu/Oracle Database/1.0/database_p37_cpu_v1.0
runtime_env_python:
category: conda_env
description: Check that field MODEL_DEPLOYMENT.INFERENCE_PYTHON_VERSION is set
to a value of 3.6 or higher
error_msg: In runtime.yaml, the key MODEL_DEPLOYMENT.INFERENCE_PYTHON_VERSION
must be set to a value of 3.6 or higher.
success: true
value: 3.7.10
runtime_env_slug:
category: conda_env
description: Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_SLUG is set
error_msg: In runtime.yaml, the key MODEL_DEPLOYMENT.INFERENCE_ENV_SLUG must
have a value.
success: true
value: database_p37_cpu_v1.0
runtime_env_type:
category: conda_env
description: Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE is set to
a value in (published, data_science)
error_msg: In runtime.yaml, the key MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE must
be set to published or data_science.
success: true
value: published
runtime_path_exist:
category: conda_env
description: If MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE is data_science and MODEL_
˓→DEPLOYMENT.INFERENCE_ENV_SLUG
max_features: auto
max_leaf_nodes: null
max_samples: null
min_impurity_decrease: 0.0
min_impurity_split: null
min_samples_leaf: 1
min_samples_split: 2
min_weight_fraction_leaf: 0.0
n_estimators: 10
n_jobs: null
oob_score: false
random_state: null
verbose: 0
warm_start: false
Update your custom metadata using the key, value, category, and description fields. The key and value fields are required.
You can see the allowed values for custom metadata category using MetadataCustomCategory.values():
• MetadataCustomCategory.PERFORMANCE
• MetadataCustomCategory.TRAINING_PROFILE
• MetadataCustomCategory.TRAINING_AND_VALIDATION_DATASETS
• MetadataCustomCategory.TRAINING_ENVIRONMENT
• MetadataCustomCategory.OTHER
Add New Custom Metadata
To add a new custom metadata item, call .add():
sklearn_model.metadata_custom.add(key='test', value='test', category=MetadataCustomCategory.OTHER, description='test', replace=True)
Update Custom Metadata
To update an existing custom metadata item, call .update(), or set the value, description, and category attributes directly:
sklearn_model.metadata_custom['test'].update(value='test1', description=None, category=MetadataCustomCategory.TRAINING_ENV)
sklearn_model.metadata_custom['test'].value = 'test1'
sklearn_model.metadata_custom['test'].description = None
sklearn_model.metadata_custom['test'].category = MetadataCustomCategory.TRAINING_ENV
You can view the custom metadata in the dataframe by calling .to_dataframe():
sklearn_model.metadata_custom.to_dataframe()
Alternatively, you can view the custom metadata in YAML format by calling .metadata_custom:
sklearn_model.metadata_custom
data:
- category: Training Environment
description: The conda env where model was trained
key: CondaEnvironment
value: database_p37_cpu_v1.0
- category: Training Environment
description: null
key: test
value: test1
- category: Training Environment
description: The env type, could be published conda or datascience conda
key: EnvironmentType
value: published
- category: Training Environment
description: The list of files located in artifacts folder
key: ModelArtifacts
value: score.py, runtime.yaml, onnx_data_transformer.json, model.onnx, .model-ignore
- category: Training Environment
description: The slug name of the conda env where model was trained
key: SlugName
value: database_p37_cpu_v1.0
- category: Training Environment
description: The oci path of the conda env where model was trained
key: CondaEnvironmentPath
value: oci://licence_checker@ociodscdev/conda_environments/cpu/Oracle Database/1.0/database_p37_cpu_v1.0
- category: Other
description: ''
key: ClientLibrary
value: ADS
- category: Training Profile
description: The model serialization format
key: ModelSerializationFormat
value: onnx
When the combined total size of metadata_custom and metadata_taxonomy exceeds 32000 bytes, an error occurs
when you save the model to the model catalog. You can save the metadata_custom and metadata_taxonomy to the
artifacts folder:
sklearn_model.metadata_custom.to_json_file(path_to_ADS_model_artifact)
You can also save individual items from the custom and taxonomy metadata:
sklearn_model.metadata_taxonomy['Hyperparameters'].to_json_file(path_to_ADS_model_artifact)
If you already have the training or validation dataset saved in Object Storage and want to document this information in
this model artifact object, you can add that information into metadata_custom:
sklearn_model.metadata_custom.set_training_data(path='oci://bucket_name@namespace/train_data_filename', data_size='(200,100)')
sklearn_model.metadata_custom.set_validation_data(path='oci://bucket_name@namespace/validation_data_filename', data_size='(100,100)')
Here is an example of preparing a model artifact for a TensorFlow model that is trained on the MNIST dataset. The final layer of the model produces 10 values, one for each digit. The default score.py produces an array of 10 elements for each input vector. Suppose you want to change the default behavior of the predict function in score.py so that it returns the most likely digit instead of a probability distribution over all the digits. To do so, return the position corresponding to the maximum value within the output array. Here are the steps to customize score.py:
Step 1: Train your estimator and then generate the model artifact as shown below:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
tf_estimator = tf.keras.models.Sequential(
[
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10),
]
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
tf_estimator.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
tf_estimator.fit(x_train, y_train, epochs=1)
Verify the output produced by the auto-generated score.py by calling verify on the model object.
print(tensorflow_model.verify(testx[:3])['prediction'])
import os
import sys
from functools import lru_cache
import pandas as pd
import numpy as np
import json
import tensorflow as tf
from io import BytesIO
import base64
model_name = 'model.h5'
"""
Inference script. This script is used for prediction by scoring server when schema is known.
"""
@lru_cache(maxsize=10)
def load_model(model_file_name=model_name):
"""
Loads model from the serialized format
Returns
-------
model: a model instance on which predict API can be invoked
"""
model_dir = os.path.dirname(os.path.realpath(__file__))
(continues on next page)
@lru_cache(maxsize=1)
def fetch_data_type_from_schema(input_schema_path=os.path.join(os.path.dirname(os.path.realpath(__file__)), "input_schema.json")):
"""
Returns data type information fetch from input_schema.json.
Parameters
----------
input_schema_path: path of input schema.
Returns
-------
data_type: data type fetch from input_schema.json.
"""
data_type = {}
if os.path.exists(input_schema_path):
schema = json.load(open(input_schema_path))
for col in schema['schema']:
data_type[col['name']] = col['dtype']
else:
print("input_schema has to be passed in in order to recover the same data type.␣
˓→pass `X_sample` in `ads.model.framework.tensorflow_model.TensorFlowModel.prepare`␣
˓→function to generate the input_schema. Otherwise, the data type might be changed after␣
˓→serialization/deserialization.")
return data_type
def deserialize(data, input_schema_path):
"""
Parameters
----------
data: serialized input data.
input_schema_path: path of input schema.
"""
data_type = data.get('data_type', '')
json_data = data.get('data', data)
if "numpy.ndarray" in data_type:
load_bytes = BytesIO(base64.b64decode(json_data.encode('utf-8')))
return np.load(load_bytes, allow_pickle=True)
if "pandas.core.series.Series" in data_type:
return pd.Series(json_data)
if "pandas.core.frame.DataFrame" in data_type:
return pd.read_json(json_data, dtype=fetch_data_type_from_schema(input_schema_path))
if "tensorflow.python.framework.ops.EagerTensor" in data_type:
load_bytes = BytesIO(base64.b64decode(json_data.encode('utf-8')))
return tf.convert_to_tensor(np.load(load_bytes, allow_pickle=True))
return json_data
def pre_inference(data, input_schema_path):
"""
Parameters
----------
data: Data format as expected by the predict API of the core estimator.
input_schema_path: path of input schema.
Returns
-------
data: Data format after any processing.
"""
data = deserialize(data, input_schema_path)
return data
def post_inference(yhat):
"""
Post-process the model results.
Parameters
----------
yhat: Data format after calling model.predict.
Returns
-------
yhat: Data format after any processing.
"""
return yhat.numpy().tolist()
def predict(data, model=load_model(), input_schema_path=os.path.join(os.path.dirname(os.path.realpath(__file__)), "input_schema.json")):
"""
Returns prediction given the model and data to predict.
Parameters
----------
model: Model instance returned by load_model API
data: Data format as expected by the predict API of the core estimator.
input_schema_path: path of input schema.
Returns
-------
predictions: Output from scoring server
Format: {'prediction': output from model.predict method}
"""
inputs = pre_inference(data, input_schema_path)
yhat = post_inference(
model(inputs)
)
return {'prediction': yhat}
Step 2: Update the post_inference method in score.py to find the index corresponding to the maximum value and return it. We can use the argmax function from TensorFlow to achieve that. Here is the modified code:
1 import os
2 import sys
3 from functools import lru_cache
4 import pandas as pd
5 import numpy as np
6 import json
7 import tensorflow as tf
8 from io import BytesIO
9 import base64
10
11 model_name = 'model.h5'
12
13
14 """
15 Inference script. This script is used for prediction by scoring server when schema is known.
16 """
17
18
19 @lru_cache(maxsize=10)
20 def load_model(model_file_name=model_name):
(continues on next page)
24 Returns
25 -------
26 model: a model instance on which predict API can be invoked
27 """
28 model_dir = os.path.dirname(os.path.realpath(__file__))
29 if model_dir not in sys.path:
30 sys.path.insert(0, model_dir)
31 contents = os.listdir(model_dir)
32 if model_file_name in contents:
33 print(f'Start loading {model_file_name} from model directory {model_dir} ...')
34 loaded_model = tf.keras.models.load_model(os.path.join(model_dir, model_file_name))
35
41
42 @lru_cache(maxsize=1)
43 def fetch_data_type_from_schema(input_schema_path=os.path.join(os.path.dirname(os.path.realpath(__file__)), "input_schema.json")):
44 """
45 Returns data type information fetch from input_schema.json.
46
47 Parameters
48 ----------
49 input_schema_path: path of input schema.
50
51 Returns
52 -------
53 data_type: data type fetch from input_schema.json.
54
55 """
56 data_type = {}
57 if os.path.exists(input_schema_path):
58 schema = json.load(open(input_schema_path))
59 for col in schema['schema']:
60 data_type[col['name']] = col['dtype']
61 else:
62 print("input_schema has to be passed in in order to recover the same data type.␣
˓→pass `X_sample` in `ads.model.framework.tensorflow_model.TensorFlowModel.prepare`␣
˓→function to generate the input_schema. Otherwise, the data type might be changed after␣
˓→serialization/deserialization.")
63 return data_type
64
65
71 Parameters
72 ----------
73 data: serialized input data.
74 input_schema_path: path of input schema.
75
76 Returns
77 -------
78 data: deserialized input data.
79
80 """
81 data_type = data.get('data_type', '')
82 json_data = data.get('data', data)
83
84 if "numpy.ndarray" in data_type:
85 load_bytes = BytesIO(base64.b64decode(json_data.encode('utf-8')))
86 return np.load(load_bytes, allow_pickle=True)
87 if "pandas.core.series.Series" in data_type:
88 return pd.Series(json_data)
89 if "pandas.core.frame.DataFrame" in data_type:
90 return pd.read_json(json_data, dtype=fetch_data_type_from_schema(input_schema_path))
91 if "tensorflow.python.framework.ops.EagerTensor" in data_type:
92 load_bytes = BytesIO(base64.b64decode(json_data.encode('utf-8')))
93 return tf.convert_to_tensor(np.load(load_bytes, allow_pickle=True))
94
95 return json_data
96
101 Parameters
102 ----------
103 data: Data format as expected by the predict API of the core estimator.
104 input_schema_path: path of input schema.
105
106 Returns
107 -------
108 data: Data format after any processing.
109 """
110 data = deserialize(data, input_schema_path)
111
123 Returns
124 -------
125 yhat: Data format after any processing.
126
127 """
128 yhat = tf.argmax(yhat, axis=1) # Get the index of the max value
129 return yhat.numpy().tolist()
130
132 """
133 Returns prediction given the model and data to predict.
134
135 Parameters
136 ----------
137 model: Model instance returned by load_model API
138 data: Data format as expected by the predict API of the core estimator.
139 input_schema_path: path of input schema.
140
141 Returns
142 -------
143 predictions: Output from scoring server
144 Format: {'prediction': output from model.predict method}
145
146 """
147 inputs = pre_inference(data, input_schema_path)
148
print(tensorflow_model.verify(testx[:3])['prediction'])
model_id = tensorflow_model.save()
print(tensorflow_model.predict(testx[:3])['prediction'])
[7, 2, 1]
12.2.6.1 Saving
We recommend that you work with model artifacts using the framework-specific wrapper classes in ADS. After you prepare and verify the model, the model is ready to be stored in the model catalog. The standard method to do this is to use the .save() method. If the bucket_uri parameter is provided, large model artifacts are supported.
The URI syntax for the bucket_uri is:
oci://<bucket_name>@<namespace>/<path>/
The following saves the framework specific wrapper object, model, to the model catalog and returns the OCID from
the model catalog:
model_catalog_id = model.save(
display_name='Model With Large Artifact',
bucket_uri=<provide bucket url>,
overwrite_existing_artifact = True,
remove_existing_artifact = True,
)
12.2.6.2 Loading
We recommend that you transfer a model artifact from the model catalog to your notebook session using the framework
specific wrapper classes in ADS. The .from_model_catalog() method takes the model catalog OCID and some file
parameters. If the bucket_uri parameter is present, then a large model artifact is used.
The following example downloads a model from the model catalog using the large model artifact approach. The
bucket_uri has the following syntax:
oci://<bucket_name>@<namespace>/<path>/
large_model = model.from_model_catalog(
model_id=model_catalog_id,
model_file_name="model.pkl",
artifact_dir="./artifact/",
bucket_uri=<provide bucket url>,
force_overwrite=True,
remove_existing_artifact=True,
)
Download and recreate framework-specific wrapper objects using the OCID value of your model.
The downloaded artifact can be used for running inference in a local environment. You can update the artifact files to change your score.py or model, and then register it as a new model. See here to learn how to change score.py.
Here is an example of loading back a LightGBM model that was previously registered.
lgbm_model = LightGBMModel.from_model_catalog(
"ocid1.datasciencemodel.oc1.xxx.xxxxx",
model_file_name="model.joblib",
artifact_dir="lgbm-download-test",
)
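After downloading the artifact, you can verify the (possibly edited) local copy without deploying it and, if you changed score.py or the model file, register it again. This is a sketch only; the sample input and display name are illustrative:
# Re-run the verification against the local artifact
lgbm_model.verify(sample_data)  # sample_data: input in the format the model expects

# Register the updated artifact as a new model in the model catalog
new_model_id = lgbm_model.save(display_name="LightGBM Model - updated score.py")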
Download and recreate framework-specific wrapper objects using the OCID value of your OCI Model Deployment instance.
The downloaded artifact can be used for running inference in a local environment. You can update the artifact files to change your score.py or model, and then register it as a new model. See here to learn how to change score.py.
Here is an example of loading back a PyTorch model that was previously deployed.
pytorchmodel = PyTorchModel.from_model_deployment(
"ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx",
model_file_name="model.pt",
artifact_dir="pytorch-download-test",
)
print(pytorchmodel.model_deployment.url)
https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxx
Once you have an ADS model object, you can call the .deploy() method to deploy the model and generate the endpoint. Here is an example of deploying a LightGBM model:
lightgbm_model.prepare(
inference_conda_env="generalml_p37_cpu_v1",
X_sample=trainx,
y_sample=trainy,
use_case_type=UseCaseType.BINARY_CLASSIFICATION,
)
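The prepare() call above is only the first step. The following sketch shows the remaining save, deploy, and predict calls, assuming the lightgbm_model object from above; the display name and log OCIDs are illustrative placeholders:
model_id = lightgbm_model.save(display_name="LightGBM Model For Classification")
lightgbm_model.deploy(
    display_name="LightGBM Model For Classification",
    deployment_log_group_id="ocid1.loggroup.oc1.xxx.xxxxx",
    deployment_access_log_id="ocid1.log.oc1.xxx.xxxxx",
    deployment_predict_log_id="ocid1.log.oc1.xxx.xxxxx",
)
print(f"Endpoint: {lightgbm_model.model_deployment.url}")
lightgbm_model.predict(testx)["prediction"]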
12.3.1 Deploy
You can use the .deploy() method to deploy a model. You must first save the model to the model catalog, and then
deploy it.
The .deploy() method returns a ModelDeployment object. Specify deployment attributes such as display name, instance type, number of instances, maximum router bandwidth, and logging groups.
See the API documentation for more details about the parameters.
Tips
• Providing deployment_access_log_id and deployment_predict_log_id helps in debugging your model
inference setup.
• The default Load Balancer configuration has a bandwidth of 10 Mbps. Refer to the service documentation to help you choose the right setup.
• Check the supported instance shapes here.
12.3.2 Predict
To invoke the endpoint of your deployed model, call the .predict() method. The .predict() method sends a
request to the deployed endpoint, and computes the inference values based on the data that you input in the .predict()
method.
See how to deploy and invoke the deployed endpoint for different frameworks here.
See the API documentation for more details about the parameters.
12.3.3 Observability
You can access the logs of your model deployment through the model deployment object, for example to tail the most recent log entries:
lightgbm_model.model_deployment.logs().tail()
12.4 Frameworks
12.4.1 SklearnModel
12.4.1.1 Overview
The SklearnModel class in ADS is designed to allow you to rapidly get a Scikit-learn model into production. The
.prepare() method creates the model artifacts that are needed to deploy a functioning model without you having to
configure it or write code. However, you can customize the required score.py file.
The .verify() method simulates a model deployment by calling the load_model() and predict() methods in the
score.py file. With the .verify() method, you can debug your score.py file without deploying any models. The
.save() method deploys a model artifact to the model catalog. The .deploy() method deploys a model to a REST
endpoint.
The following steps take your trained scikit-learn model and deploy it into production with a few lines of code.
Create a Scikit-learn Model
seed = 42
• project_id: str
• remove_existing_artifact: bool
• training_conda_env: str
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from the environment variables when not specified. For example, in note-
book sessions the environment variables are preset and stored in project id (PROJECT_OCID) and compartment id
(NB_SESSION_COMPARTMENT_OCID). So properties populates these environment variables, and uses the values in
methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that in-
cludes an instance of properties, then properties records the values that you pass in. For example, when you pass
inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties
file in different places, you can export the properties file using the .to_yaml() method then reload it into a different
machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
'ocid1.datasciencemodel.oc1.xxx.xxxxx'
>>> # Deploy and create an endpoint for the Random Forest model
>>> sklearn_model.deploy(
display_name="Random Forest Model For Classification",
deployment_log_group_id="ocid1.loggroup.oc1.xxx.xxxxx",
deployment_access_log_id="ocid1.log.oc1.xxx.xxxxx",
deployment_predict_log_id="ocid1.log.oc1.xxx.xxxxx",
)
>>> print(f"Endpoint: {sklearn_model.model_deployment.url}")
https://modeldeployment.{region}.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx
12.4.1.7 Examples
import tempfile
seed = 42
sklearn_model.verify(testx[:10])["prediction"]
sklearn_model.save(display_name="SKLearn Model")
print(f"Endpoint: {sklearn_model.model_deployment.url}")
sklearn_model.predict(testx)["prediction"]
12.4.2 PyTorchModel
12.4.2.1 Overview
import torch
import torchvision
model = torchvision.models.resnet18(pretrained=True)
model.eval()
import tempfile
# The score.py generated requires you to create the class instance of the Model before the weights are loaded.
Open pytorch_model_artifact/score.py and edit the code to instantiate the model class. The edits are highlighted:
import os
import sys
from functools import lru_cache
import torch
import json
from typing import Dict, List
import numpy as np
import pandas as pd
from io import BytesIO
import base64
import logging
import torchvision
the_model = torchvision.models.resnet18()
model_name = 'model.pt'
"""
Inference script. This script is used for prediction by scoring server when schema is known.
"""
@lru_cache(maxsize=10)
def load_model(model_file_name=model_name):
"""
Loads model from the serialized format
Returns
-------
model: a model instance on which predict API can be invoked
"""
model_dir = os.path.dirname(os.path.realpath(__file__))
if model_dir not in sys.path:
sys.path.insert(0, model_dir)
contents = os.listdir(model_dir)
if model_file_name in contents:
print(f'Start loading {model_file_name} from model directory {model_dir} ...')
model_state_dict = torch.load(os.path.join(model_dir, model_file_name))
print(f"loading {model_file_name} is complete.")
else:
raise Exception(f'{model_file_name} is not found in model directory {model_dir}')
except Exception as e:
(continues on next page)
the_model.eval()
print("Model is successfully loaded.")
return the_model
Instantiate a PyTorchModel() object with a PyTorch model. Each instance accepts the following parameters:
• artifact_dir: str. Artifact directory to store the files needed for deployment.
• auth: (Dict, optional): Defaults to None. The default authentication is set using the ads.
set_auth API. To override the default, use ads.common.auth.api_keys() or ads.common.auth.
resource_principal() and create the appropriate authentication signer and the **kwargs required to in-
stantiate the IdentityClient object.
• estimator: Callable. Any model object generated by the PyTorch framework.
• properties: (ModelProperties, optional). Defaults to None. The ModelProperties object required
to save and deploy model.
The properties is an instance of the ModelProperties class and has the following predefined fields:
• bucket_uri: str
• compartment_id: str
• deployment_access_log_id: str
• deployment_bandwidth_mbps: int
• deployment_instance_count: int
• deployment_instance_shape: str
• deployment_log_group_id: str
• deployment_predict_log_id: str
• deployment_memory_in_gbs: Union[float, int]
• deployment_ocpus: Union[float, int]
• inference_conda_env: str
• inference_python_version: str
• overwrite_existing_artifact: bool
• project_id: str
• remove_existing_artifact: bool
• training_conda_env: str
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from the environment variables when not specified. For example, in note-
book sessions the environment variables are preset and stored in project id (PROJECT_OCID) and compartment id
(NB_SESSION_COMPARTMENT_OCID). So properties populates these environment variables, and uses the values in
methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that in-
cludes an instance of properties, then properties records the values that you pass in. For example, when you pass
inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties
file in different places, you can export the properties file using the .to_yaml() method then reload it into a different
machine using the .from_yaml() method.
# Download an image
import urllib.request
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, filename)
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
'ocid1.datasciencemodel.oc1.xxx.xxxxx'
https://modeldeployment.{region}.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx
# Download an image
import urllib.request
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, filename)
uri = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg")
12.4.2.8 Example
import numpy as np
from PIL import Image
import tempfile
import torchvision
from torchvision import transforms
import urllib
# The score.py generated requires you to create the class instance of the Model before the weights are loaded.
# Load image
input_image = Image.open(filename)
preprocess = transforms.Compose(
[
(continues on next page)
prediction = pytorch_model.verify(input_batch)["prediction"]
print(np.argmax(prediction))
print(np.argmax(prediction))
12.4.3 TensorFlowModel
12.4.3.1 Overview
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(trainx, trainy), (testx, testy) = mnist.load_data()
trainx, testx = trainx / 255.0, testx / 255.0
model = tf.keras.models.Sequential(
[
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(trainx, trainy, epochs=1)
artifact_dir = tempfile.mkdtemp()
tensorflow_model = TensorFlowModel(estimator=model, artifact_dir=artifact_dir)
tensorflow_model.prepare(
inference_conda_env="tensorflow27_p37_cpu_v1",
training_conda_env="tensorflow27_p37_cpu_v1",
X_sample=trainx,
y_sample=trainy,
use_case_type=UseCaseType.MULTINOMIAL_CLASSIFICATION,
)
• deployment_bandwidth_mbps: int
• deployment_instance_count: int
• deployment_instance_shape: str
• deployment_log_group_id: str
• deployment_predict_log_id: str
• deployment_memory_in_gbs: Union[float, int]
• deployment_ocpus: Union[float, int]
• inference_conda_env: str
• inference_python_version: str
• overwrite_existing_artifact: bool
• project_id: str
• remove_existing_artifact: bool
• training_conda_env: str
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from the environment variables when not specified. For example, in note-
book sessions the environment variables are preset and stored in project id (PROJECT_OCID) and compartment id
(NB_SESSION_COMPARTMENT_OCID). So properties populates these environment variables, and uses the values in
methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that in-
cludes an instance of properties, then properties records the values that you pass in. For example, when you pass
inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties
file in different places, you can export the properties file using the .to_yaml() method then reload it into a different
machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
In TensorFlowModel, data serialization is supported for JSON-serializable objects. Plus, there is support for a dictionary, string, list, np.ndarray, and tf.python.framework.ops.EagerTensor. Not all of these objects are JSON serializable; however, support is provided to automatically serialize and deserialize them.
'ocid1.datasciencemodel.oc1.xxx.xxxxx'
https://modeldeployment.{region}.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx
uri = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg")
12.4.3.7 Example
import tensorflow as tf
import tempfile
artifact_dir = tempfile.mkdtemp()
tensorflow_model.verify(testx[:10])["prediction"]
12.4.4 SparkPipelineModel
12.4.4.1 Overview
The SparkPipelineModel class in ADS is designed to allow you to rapidly get a PySpark model into production.
The .prepare() method creates the model artifacts that are needed to deploy a functioning model without you having
to configure it or write code. However, you can customize the required score.py file.
The .verify() method simulates a model deployment by calling the load_model() and predict() methods in the
score.py file. With the .verify() method, you can debug your score.py file without deploying any models. The
.save() method deploys a model artifact to the model catalog. The .deploy() method deploys a model to a REST
endpoint.
The following steps take your trained PySpark model and deploy it into production with a few lines of code.
Create a Spark Pipeline Model
Generate a synthetic dataset:
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
training = spark.createDataFrame(
[
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
],
["id", "text", "label"],
)
test = spark.createDataFrame(
[
(4, "spark i j k"),
(5, "l m n"),
(6, "spark hadoop spark"),
(7, "apache hadoop"),
],
["id", "text"],
)
Create a Spark Pipeline. (Note that a Spark Pipeline can be made with just 1 stage.)
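The pipeline construction itself is shown below as a sketch of the standard Apache example that this section is adapted from; the Tokenizer, HashingTF, and LogisticRegression stages and their hyperparameter values are illustrative:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Configure an ML pipeline with three stages: tokenizer, hashingTF, and lr
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline on the training DataFrame created above
model = pipeline.fit(training)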
import tempfile
from ads.model.framework.spark_model import SparkPipelineModel
from ads.common.model_metadata import UseCaseType
artifact_dir=tempfile.mkdtemp()
spark_model = SparkPipelineModel(estimator=model, artifact_dir=artifact_dir)
spark_model.prepare(inference_conda_env="pyspark30_p37_cpu_v5",
X_sample=training,
force_overwrite=True,
use_case_type=UseCaseType.BINARY_CLASSIFICATION)
Instantiate a SparkPipelineModel() object with a PySpark model. Each instance accepts the following parameters:
• artifact_dir: str. Artifact directory to store the files needed for deployment.
• auth: (Dict, optional): Defaults to None. The default authentication is set using the ads.
set_auth API. To override the default, use ads.common.auth.api_keys() or ads.common.auth.
resource_principal() and create the appropriate authentication signer and the **kwargs required to in-
stantiate the IdentityClient object.
• estimator: Callable. Any model object generated by the PySpark framework.
• properties: (ModelProperties, optional). Defaults to None. The ModelProperties object required
to save and deploy model.
The properties is an instance of the ModelProperties class and has the following predefined fields:
• bucket_uri: str
• compartment_id: str
• deployment_access_log_id: str
• deployment_bandwidth_mbps: int
• deployment_instance_count: int
• deployment_instance_shape: str
• deployment_log_group_id: str
• deployment_predict_log_id: str
• deployment_memory_in_gbs: Union[float, int]
• deployment_ocpus: Union[float, int]
• inference_conda_env: str
• inference_python_version: str
• overwrite_existing_artifact: bool
• project_id: str
• remove_existing_artifact: bool
• training_conda_env: str
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from the environment variables when not specified. For example, in note-
book sessions the environment variables are preset and stored in project id (PROJECT_OCID) and compartment id
(NB_SESSION_COMPARTMENT_OCID). So properties populates these environment variables, and uses the values in
methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that in-
cludes an instance of properties, then properties records the values that you pass in. For example, when you pass
inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties
file in different places, you can export the properties file using the .to_yaml() method then reload it into a different
machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
model_id = spark_model.save()
'ocid1.datasciencemodel.oc1.xxx.xxxxx'
spark_model.deploy(
display_name="Spark Pipeline Model For Classification",
deployment_log_group_id="ocid1.loggroup.oc1.xxx.xxxxx",
deployment_access_log_id="ocid1.log.oc1.xxx.xxxxx",
deployment_predict_log_id="ocid1.log.oc1.xxx.xxxxx",
)
print(f"Endpoint: {spark_model.model_deployment.url}")
# https://modeldeployment.{region}.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx
spark_model.predict(test)['prediction']
# [0.0, 0.0, 1.0, 0.0]
12.4.4.7 Example
Adapted from an example provided by Apache in the PySpark API Reference Documentation.
import tempfile
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession
from ads.model.framework.spark_model import SparkPipelineModel
from ads.common.model_metadata import UseCaseType
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
artifact_dir=tempfile.mkdtemp()
test = spark.createDataFrame(
[
(4, "spark i j k"),
(5, "l m n"),
(6, "spark hadoop spark"),
(7, "apache hadoop"),
],
["id", "text"],
)
model = pipeline.fit(training)
spark_model.prepare(inference_conda_env="pyspark30_p37_cpu_v5",
X_sample=training,
force_overwrite=True,
use_case_type=UseCaseType.BINARY_CLASSIFICATION)
prediction = spark_model.verify(test)
12.4.5 LightGBMModel
12.4.5.1 Overview
seed = 42
artifact_dir = tempfile.mkdtemp()
lightgbm_model = LightGBMModel(estimator=model, artifact_dir=artifact_dir)
lightgbm_model.prepare(
inference_conda_env="generalml_p37_cpu_v1",
training_conda_env="generalml_p37_cpu_v1",
X_sample=trainx,
y_sample=trainy,
use_case_type=UseCaseType.BINARY_CLASSIFICATION,
)
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from the environment variables when not specified. For example, in note-
book sessions the environment variables are preset and stored in project id (PROJECT_OCID) and compartment id
(NB_SESSION_COMPARTMENT_OCID). So properties populates these environment variables, and uses the values in
methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that in-
cludes an instance of properties, then properties records the values that you pass in. For example, when you pass
inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties
file in different places, you can export the properties file using the .to_yaml() method then reload it into a different
machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
'ocid1.datasciencemodel.oc1.xxx.xxxxx'
https://modeldeployment.{region}.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx
[1,0,...,1]
12.4.5.7 Example
import tempfile
seed = 42
lightgbm_model.verify(testx[:10])["prediction"]
print(f"Endpoint: {lightgbm_model.model_deployment.url}")
12.4.6 XGBoostModel
12.4.6.1 Overview
import tempfile
import xgboost
seed = 42
artifact_dir = tempfile.mkdtemp()
xgb_model = XGBoostModel(estimator=model, artifact_dir=artifact_dir)
xgb_model.prepare(
inference_conda_env="generalml_p37_cpu_v1",
training_conda_env="generalml_p37_cpu_v1",
X_sample=trainx,
y_sample=trainy,
use_case_type=UseCaseType.BINARY_CLASSIFICATION,
)
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from the environment variables when not specified. For example, in note-
book sessions the environment variables are preset and stored in project id (PROJECT_OCID) and compartment id
(NB_SESSION_COMPARTMENT_OCID). So properties populates these environment variables, and uses the values in
methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that in-
cludes an instance of properties, then properties records the values that you pass in. For example, when you pass
inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties
file in different places, you can export the properties file using the .to_yaml() method then reload it into a different
machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
'ocid1.datasciencemodel.oc1.xxx.xxxxx'
https://modeldeployment.{region}.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx
12.4.6.7 Example
import tempfile
import xgboost
seed = 42
artifact_dir = tempfile.mkdtemp()
xgb_model = XGBoostModel(estimator=model, artifact_dir=artifact_dir)
xgb_model.prepare(
inference_conda_env="generalml_p37_cpu_v1",
training_conda_env="generalml_p37_cpu_v1",
X_sample=trainx,
y_sample=trainy,
use_case_type=UseCaseType.BINARY_CLASSIFICATION,
)
xgb_model.verify(testx)
print(f"Endpoint: {xgb_model.model_deployment.url}")
12.4.7 AutoMLModel
12.4.7.1 Overview
import logging
from ads.automl.driver import AutoML
from ads.automl.provider import OracleAutoMLProvider
from ads.dataset.dataset_browser import DatasetBrowser
ds = DatasetBrowser.sklearn().open("wine").set_target("target")
train, test = ds.train_test_split(test_size=0.1, random_state = 42)
12.4.7.2 Initialize
Instantiate an AutoMLModel() object with an AutoML model. Each instance accepts the following parameters:
• artifact_dir: str: Artifact directory to store the files needed for deployment.
• auth: (Dict, optional): Defaults to None. The default authentication is set using the ads.
set_auth API. To override the default, use ads.common.auth.api_keys() or ads.common.auth.
resource_principal() and create the appropriate authentication signer and the **kwargs required to in-
stantiate the IdentityClient object.
• estimator: (Callable): Trained AutoML model.
• properties: (ModelProperties, optional): Defaults to None. The ModelProperties object required
to save and deploy a model.
The properties is an instance of the ModelProperties class and has the following predefined fields:
• bucket_uri: str
• compartment_id: str
• deployment_access_log_id: str
• deployment_bandwidth_mbps: int
• deployment_instance_count: int
• deployment_instance_shape: str
• deployment_log_group_id: str
• deployment_predict_log_id: str
• deployment_memory_in_gbs: Union[float, int]
• deployment_ocpus: Union[float, int]
• inference_conda_env: str
• inference_python_version: str
• overwrite_existing_artifact: bool
• project_id: str
• remove_existing_artifact: bool
• training_conda_env: str
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from environment variables when it is not specified. For example, in notebook sessions the environment variables PROJECT_OCID and NB_SESSION_COMPARTMENT_OCID are preset, so properties picks up the project id and compartment id from them and uses those values in methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that includes an instance of properties, then properties records the values that you pass in. For example, when you pass inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties file in different places, you can export the properties file using the .to_yaml() method and then reload it on a different machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
12.4.7.4 Example
import logging
import tempfile
ds = DatasetBrowser.sklearn().open("wine").set_target("target")
train, test = ds.train_test_split(test_size=0.1, random_state = 42)
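# The `model` passed to AutoMLModel below is assumed to be a trained AutoML
# estimator. A hedged sketch of obtaining one with the classes imported in the
# Overview above (signatures may differ by release):
#
#     ml_engine = OracleAutoMLProvider(n_jobs=-1, loglevel=logging.ERROR)
#     oracle_automl = AutoML(train, provider=ml_engine)
#     model, baseline = oracle_automl.train()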
artifact_dir = tempfile.mkdtemp()
automl_model = AutoMLModel(estimator=model, artifact_dir=artifact_dir)
automl_model.prepare(
inference_conda_env="generalml_p37_cpu_v1",
training_conda_env="generalml_p37_cpu_v1",
use_case_type=UseCaseType.BINARY_CLASSIFICATION,
X_sample=test.X,
force_overwrite=True,
training_id=None
)
automl_model.verify(test.X.iloc[:10])
model_id = automl_model.save(display_name='Demo AutoMLModel model')
deploy = automl_model.deploy(display_name='Demo AutoMLModel deployment')
automl_model.predict(test.X.iloc[:10])
automl_model.delete_deployment(wait_for_completion=True)
12.4.8.1 Overview
The ads.model.generic_model.GenericModel class in ADS provides an efficient way to serialize almost any
model class. This section demonstrates how to use the GenericModel class to prepare model artifacts, verify models,
save models to the model catalog, deploy models, and perform predictions on model deployment endpoints.
The GenericModel class works with any model framework that is not otherwise supported, as long as the model has a .predict() method. For the most common model classes, such as scikit-learn, XGBoost, LightGBM, TensorFlow, PyTorch, and AutoML, we recommend that you use the ADS-provided, framework-specific serialization models. For example, for a scikit-learn model, use SklearnModel. For other models, use the GenericModel class.
The .verify() method simulates a model deployment by calling the load_model() and predict() methods in the
score.py file. With the .verify() method, you can debug your score.py file without deploying any models. The
.save() method deploys a model artifact to the model catalog. The .deploy() method deploys a model to a REST
endpoint.
These simple steps take your trained model and deploy it into production with just a few lines of code.
Instantiate a GenericModel() object by giving it any model object. It accepts the following parameters:
• artifact_dir: str: Artifact directory to store the files needed for deployment.
• auth: (Dict, optional): Defaults to None. The default authentication is set using the ads.
set_auth API. To override the default, use ads.common.auth.api_keys() or ads.common.auth.
resource_principal() and create the appropriate authentication signer and the **kwargs required to in-
stantiate the IdentityClient object.
• estimator: (Callable): Trained model.
• properties: (ModelProperties, optional): Defaults to None. ModelProperties object required to save
and deploy the model.
• serialize: (bool, optional): Defaults to True. If True the model will be serialized into a pickle file. If
False, you must set the model_file_name in the .prepare() method, serialize the model manually, and save
it in the artifact_dir. You will also need to update the score.py file to work with this model.
The properties is an instance of the ModelProperties class and has the following predefined fields:
• bucket_uri: str
• compartment_id: str
• deployment_access_log_id: str
• deployment_bandwidth_mbps: int
• deployment_instance_count: int
• deployment_instance_shape: str
• deployment_log_group_id: str
• deployment_predict_log_id: str
• deployment_memory_in_gbs: Union[float, int]
• deployment_ocpus: Union[float, int]
• inference_conda_env: str
• inference_python_version: str
• overwrite_existing_artifact: bool
• project_id: str
• remove_existing_artifact: bool
• training_conda_env: str
• training_id: str
• training_python_version: str
• training_resource_id: str
• training_script_path: str
By default, properties is populated from environment variables when it is not specified. For example, in notebook sessions the environment variables PROJECT_OCID and NB_SESSION_COMPARTMENT_OCID are preset, so properties picks up the project id and compartment id from them and uses those values in methods such as .save() and .deploy(). Pass in values to overwrite the defaults. When you use a method that includes an instance of properties, then properties records the values that you pass in. For example, when you pass inference_conda_env into the .prepare() method, then properties records the value. To reuse the properties file in different places, you can export the properties file using the .to_yaml() method and then reload it on a different machine using the .from_yaml() method.
You can call the .summary_status() method after a model serialization instance such as AutoMLModel,
GenericModel, SklearnModel, TensorFlowModel, or PyTorchModel is created. The .summary_status()
method returns a Pandas dataframe that guides you through the entire workflow. It shows which methods are available
to call and which ones aren’t. Plus it outlines what each method does. If extra actions are required, it also shows those
actions.
The following image displays an example summary status table created after a user initiates a model instance. The
table’s Step column displays a Status of Done for the initiate step. And the Details column explains what the initiate
step did such as generating a score.py file. The Step column also displays the prepare(), verify(), save(),
deploy(), and predict() methods for the model. The Status column displays which method is available next. After
the initiate step, the prepare() method is available. The next step is to call the prepare() method.
12.4.8.4 Example
By default, the GenericModel serializes to a pickle file. In the following example, the user creates a model. In the prepare step, the user saves the model as a pickle file with the name toy_model.pkl. Then the user verifies the model, saves it to the model catalog, deploys the model, and makes a prediction. Finally, the user deletes the model deployment and then deletes the model.
import tempfile
from ads.model.generic_model import GenericModel

class Toy:
    def predict(self, x):
        return x ** 2

model = Toy()

generic_model = GenericModel(estimator=model, artifact_dir=tempfile.mkdtemp())
generic_model.prepare(
    inference_conda_env="dataexpl_p37_cpu_v3",
    model_file_name="toy_model.pkl",
    force_overwrite=True,
)
generic_model.verify(2)
generic_model.save(display_name="Demo GenericModel model")
generic_model.deploy(display_name="Demo GenericModel deployment")
print(f"Endpoint: {generic_model.model_deployment.url}")
generic_model.predict(2)
generic_model.delete_deployment(wait_for_completion=True)
You can also use the shortcut .prepare_save_deploy() instead of calling .prepare(), .save() and .deploy() separately.
import tempfile
from ads.model.generic_model import GenericModel
class Toy:
def predict(self, x):
return x ** 2
estimator = Toy()
model = GenericModel(estimator=estimator)
model.summary_status()
# If you are running the code inside a notebook session and using a service pack, `inference_conda_env` can be omitted.
model.prepare_save_deploy(inference_conda_env="dataexpl_p37_cpu_v3")
model.verify(2)
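Once the shortcut has finished, the instance behaves like any other prepared and deployed model. A hedged sketch of exercising and cleaning up the deployment:

# Call the deployed endpoint, then tear the deployment down.
model.predict(2)
model.delete_deployment(wait_for_completion=True)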
THIRTEEN
APACHE SPARK
DataFlow
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed, serverless, and on-demand Apache Spark Service
that performs data processing or model training tasks on extremely large datasets without infrastructure to deploy or
manage.
Getting Started with Data Flow
ads integrates with OCI Data Flow allowing users to submit and monitor Spark jobs from Notebook Sessions or from
their local machine.
Data Flow is a hosted Apache Spark server. It is quick to start, and can scale to handle large datasets in parallel. ADS
provides a convenient API for creating and maintaining workloads on Data Flow.
Submit a Python script to Data Flow entirely from your Python environment. The following snippet uses a dummy Python script that prints "Hello World" followed by the Spark version, 3.2.1.
from uuid import uuid4
import os
import tempfile

from ads.jobs import DataFlow, DataFlowRuntime, Job

# Write the dummy PySpark script to a temporary directory, then configure and submit the job.
with tempfile.TemporaryDirectory() as td:
    with open(os.path.join(td, "script.py"), "w") as f:
        f.write(
            """
import pyspark

def main():
    print("Hello World")
    print("Spark version is", pyspark.__version__)

if __name__ == "__main__":
    main()
"""
        )

    name = f"dataflow-app-{str(uuid4())}"
    dataflow_configs = (
        DataFlow()
        .with_compartment_id("oci.xx.<compartment_id>")
        .with_logs_bucket_uri("oci://<mybucket>@<mynamespace>/<dataflow-logs-prefix>")
        .with_driver_shape("VM.Standard.E4.Flex")
        .with_driver_shape_config(ocpus=2, memory_in_gbs=32)
        .with_executor_shape("VM.Standard.E4.Flex")
        .with_executor_shape_config(ocpus=4, memory_in_gbs=64)
        .with_spark_version("3.2.1")
    )
    runtime_config = (
        DataFlowRuntime()
        .with_script_uri(os.path.join(td, "script.py"))
        .with_script_bucket("oci://<mybucket>@<mynamespace>/<subdir_to_put_and_get_script>")
    )
    df = Job(name=name, infrastructure=dataflow_configs, runtime=runtime_config)
    df.create()
    df_run = df.run()
The same result can be achieved from the command line using ads CLI and a yaml file.
Assuming you have the following two files written in your current directory as script.py and dataflow.yaml re-
spectively:
# script.py
import pyspark
def main():
print("Hello World")
print("Spark version is", pyspark.__version__)
if __name__ == "__main__":
main()
# dataflow.yaml
kind: job
spec:
  name: dataflow-app-<uuid>
  infrastructure:
    kind: infrastructure
    spec:
      compartmentId: oci.xx.<compartment_id>
      logsBucketUri: oci://<mybucket>@<mynamespace>/<dataflow-logs-prefix>
      driverShape: VM.Standard.E4.Flex
      driverShapeConfig:
        ocpus: 2
        memory_in_gbs: 32
      executorShape: VM.Standard.E4.Flex
      executorShapeConfig:
        ocpus: 4
        memory_in_gbs: 64
      sparkVersion: 3.2.1
      numExecutors: 1
    type: dataFlow
  runtime:
    kind: runtime
    spec:
      scriptUri: script.py
      scriptBucket: oci://<mybucket>@<mynamespace>/<subdir_to_put_and_get_script>
    type: dataFlow
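With both files in place, the job can be submitted from the terminal. The exact flags can vary by release, so treat this invocation as illustrative:

ads opctl run -f dataflow.yaml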
From PySpark v3.0.0 and onwards, Data Flow allows a published conda environment as the Spark runtime environment
when built with ADS. Data Flow supports published conda environments only. Conda packs are tar’d conda environ-
ments. When you publish your own conda packs to object storage, ensure that the DataFlow Resource has access to
read the object or bucket. Below is a more built-out example using conda packs:
@click.command()
@click.argument("app_name")
@click.option(
"--limit", "-l", help="max number of row to print", default=10, required=False
)
@click.option("--verbose", "-v", help="print out result in verbose mode", is_flag=True)
def main(app_name, limit, verbose):
)
)
if verbose:
rows = query_result_df.toJSON().collect()
for i, row in enumerate(rows):
print(f"record {i}")
print(row)
if __name__ == "__main__":
main()
'''
)
name = f"dataflow-app-{str(uuid4())}"
dataflow_configs = (
DataFlow()
.with_compartment_id("oci.xx.<compartment_id>")
.with_logs_bucket_uri("oci://<mybucket>@<mynamespace>/<dataflow-logs-prefix>")
.with_driver_shape("VM.Standard.E4.Flex")
.with_driver_shape_config(ocpus=2, memory_in_gbs=32)
.with_executor_shape("VM.Standard.E4.Flex")
.with_executor_shape_config(ocpus=4, memory_in_gbs=64)
.with_spark_version("3.2.1")
)
runtime_config = (
Again, assume you have the following two files written in your current directory as script.py and dataflow.yaml
respectively:
# script.py
from pyspark.sql import SparkSession
import click
@click.command()
@click.argument("app_name")
@click.option(
"--limit", "-l", help="max number of row to print", default=10, required=False
)
@click.option("--verbose", "-v", help="print out result in verbose mode", is_flag=True)
def main(app_name, limit, verbose):
# Create a Spark session
spark = SparkSession.builder.appName(app_name).getOrCreate()
)
)
if verbose:
rows = query_result_df.toJSON().collect()
for i, row in enumerate(rows):
print(f"record {i}")
print(row)
if __name__ == "__main__":
main()
# dataflow.yaml
kind: job
spec:
name: dataflow-app-<uuid>
infrastructure:
kind: infrastructure
spec:
compartmentId: oci.xx.<compartment_id>
logsBucketUri: oci://<mybucket>@<mynamespace>/<dataflow-logs-prefix>
driverShape: VM.Standard.E4.Flex
driverShapeConfig:
ocpus: 2
memory_in_gbs: 32
executorShape: VM.Standard.E4.Flex
executorShapeConfig:
ocpus: 4
memory_in_gbs: 64
sparkVersion: 3.2.1
numExecutors: 1
type: dataFlow
runtime:
kind: runtime
spec:
scriptUri: script.py
scriptBucket: oci://<mybucket>@<mynamespace>/<subdir_to_put_and_get_script>
conda:
uri: oci://<mybucket>@<mynamespace>/<path_to_conda_pack>
type: published
args:
- "run-test"
- "-v"
- "-l"
- "5"
Follow these set up instructions to submit Spark Jobs to Data Flow from an OCI Notebook Session.
To set up the PySpark environment, install one of the PySpark conda environments from the Environment Explorer. You can find information about the latest PySpark conda environment there. Then activate the conda environment to upgrade to the latest oracle-ads.
When the conda environment is installed, a templated version of core-site.xml is also installed. You can update the
core-site.xml file using an automated configuration or manually.
Authentication with Resource Principals
Authentication to Object Storage can be done with a resource principal.
For automated configuration, run the following command in a terminal
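The invocation is presumably of this form (the option list below is the authoritative reference):

odsc core-site config -a resource_principal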
This command will populate the file ~/spark_conf_dir/core-site.xml with the values needed to connect to Ob-
ject Storage.
The following command line options are available:
• -a, --authentication Authentication mode. Supports resource_principal and api_key (default).
• -r, --region Name of the region.
• -o, --overwrite Overwrite core-site.xml.
• -O, --output Output path for core-site.xml.
• -q, --quiet Suppress non-error output.
• -h, --help Show help message and exit.
To manually configure the core-site.xml file, you edit the file, and then specify these values:
• fs.oci.client.hostname: The Object Storage endpoint for your region. See https://docs.oracle.com/iaas/api/#/en/objectstorage/20160918/ for available endpoints.
• fs.oci.client.custom.authenticator: Set the value to com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator.
When using resource principals, these properties don’t need to be configured:
• fs.oci.client.auth.tenantId
• fs.oci.client.auth.userId
• fs.oci.client.auth.fingerprint
• fs.oci.client.auth.pemfilepath
The following example core-site.xml file illustrates using resource principals for authentication to access Object Storage:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.oci.client.hostname</name>
<value>https://objectstorage.us-ashburn-1.oraclecloud.com</value>
</property>
<property>
<name>fs.oci.client.custom.authenticator</name>
<value>com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator</value>
</property>
</configuration>
For details, see the HDFS connector for Object Storage documentation on using resource principals for authentication.
Authentication with API Keys
When using authentication with API keys, the core-site.xml file can be updated in two ways: automated or manual configuration.
For automated configuration, you use the odsc command line tool. With an OCI configuration file, you can run
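a command of roughly this form (hedged; the option list below is the authoritative reference):

odsc core-site config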
By default, this command uses the OCI configuration file stored in ~/.oci/config, automatically populates the
core-site.xml file, and then saves it to ~/spark_conf_dir/core-site.xml.
The following command line options are available:
• -a, --authentication Authentication mode. Supports resource_principal and api_key (default).
• -c, --configuration Path to the OCI configuration file.
• -p, --profile Name of the profile.
• -r, --region Name of the region.
• -o, --overwrite Overwrite core-site.xml.
• -O, --output Output path for core-site.xml.
• -q, --quiet Suppress non-error output.
• -h, --help Show help message and exit.
To manually configure the core-site.xml file, you must specify these parameters:
• fs.oci.client.hostname: Address of Object Storage. For example, https://objectstorage.us-ashburn-1.oraclecloud.com. You must replace us-ashburn-1 with the region you are in.
• fs.oci.client.auth.tenantId: OCID of your tenancy.
• fs.oci.client.auth.userId: Your user OCID.
• fs.oci.client.auth.fingerprint: Fingerprint for the key pair.
• fs.oci.client.auth.pemfilepath: The fully qualified file name of the private key used for authentication.
The values of these parameters are found in the OCI configuration file.
Follow these set up instructions to submit Spark Jobs to Data Flow from your local machine.
Prerequisite
You have completed Local Development Environment Setup
Use the ADS CLI to set up a PySpark conda environment. Currently, the ADS CLI only supports fetching conda packs published by you. If you haven't already published a conda pack, you can create one using the ADS CLI. To install from your published environment source:
dependencies:
- pyspark
- pip
- pip:
- oracle-ads
name: pysparkenv
EOF
Prerequisites
1. Setup Visual Studio Code development environment by following steps from Local Development Environment
Setup
2. ads conda install <oci uri of pyspark conda environment>. Currently, we cannot access service packs directly. You can instead publish a PySpark service pack to your Object Storage and use the URI for the pack in OCI Object Storage.
Once the development environment is set up, you can write your code and run it from the terminal of Visual Studio Code.
core-site.xml is set up automatically when you install a PySpark conda pack.
compartment_id = "<compartment_id>"
logs_bucket_uri = "<logs_bucket_uri>"
Ensure that you set up the correct policies. For instance, for Data Flow to access the logs bucket, use a policy like:
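The exact statement depends on your tenancy; the form shown in the Data Flow documentation (with dataflow-logs as a placeholder bucket name) is roughly:

ALLOW SERVICE dataflow TO READ objects IN tenancy WHERE target.bucket.name='dataflow-logs'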
Submit your code to DataFlow for workloads that require larger resources.
For most Notebook users, local or OCI Notebook Sessions, the notebook extension is the most straightforward inte-
gration with dataflow. It’s a “Set It and Forget It” API with options to update ad-hoc. You can configure your dataflow
runs by running ads opctl configure in the terminal.
After setting up your dataflow config, you can return to the Notebook. Import ads and DataFlowConfig:
import ads
import oci  # needed below for oci.data_flow.models.ShapeConfig
from ads.jobs.utils import DataFlowConfig
%load_ext ads.jobs.extension
Define config. If you have not yet configured your dataflow setting, or would like to amend the defaults, you can modify
as shown below:
dataflow_config = DataFlowConfig()
dataflow_config.compartment_id = "ocid1.compartment.<your compartment ocid>"
dataflow_config.driver_shape = "VM.Standard.E4.Flex"
dataflow_config.driver_shape_config = oci.data_flow.models.ShapeConfig(ocpus=2, memory_in_gbs=32)
dataflow_config.executor_shape = "VM.Standard.E4.Flex"
dataflow_config.executor_shape_config = oci.data_flow.models.ShapeConfig(ocpus=4, memory_in_gbs=64)
Tip
Get more information about the dataflow extension by running %dataflow -h
Call the dataflow magic command in the first line of your cell to run it on dataflow.
This header will:
• save the cell as a file called script.py and store it in your dataflow_config.script_bucket
• pass everything after the -- notation to your script as arguments. For example, abc is a positional argument, and l and v are named arguments.
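A hedged illustration of such a header line (the flag names here are illustrative; %dataflow -h gives the authoritative list):

%%dataflow run -f script.py -c {dataflow_config} -w -o -- abc -l 5 -v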
Below is a full example:
@click.command()
@click.argument("app_name")
@click.option(
"--limit", "-l", help="max number of rows to print", default=10, required=False
)
@click.option("--verbose", "-v", help="print out result in verbose mode", is_flag=True)
def main(app_name, limit, verbose):
# Create a Spark session
spark = SparkSession.builder.appName(app_name).getOrCreate()
)
)
if verbose:
rows = query_result_df.toJSON().collect()
for i, row in enumerate(rows):
print(f"record {i}")
print(row)
if __name__ == "__main__":
main()
Prerequisites
1. Install ADS CLI
2. Configure Defaults
Tip
If, for some reason, you are unable to use CLI, instead skip to the Create, Run Data Flow Application Using
ADS Python SDK section below.
Sometimes your code is too complex to run in a single cell, and it's better to run it as a notebook or file. In that case, use the ADS Opctl CLI.
To submit your notebook to DataFlow using the ads CLI, run:
ads opctl run -s <folder where notebook is located> -e <notebook name> -b dataflow
Tip
You can avoid running cells that are not Data Flow environment compatible by tagging the cells and then providing the tag names to ignore. In the following example, cells that are tagged ignore and remove will be ignored:
--exclude-tag ignore --exclude-tag remove
Tip
You can run the notebook in your local pyspark environment before submitting to DataFlow using the same CLI with
-b local
You could submit a notebook using ADS SDK APIs. Here is an example to submit a notebook -
df = (
DataFlow()
.with_compartment_id("ocid1.compartment.oc1..aaaaaaaapvb3hearqum6wjvlcpzm5ptfxqa7xfftpth4h72xx46ygavkqteq")
.with_driver_shape("VM.Standard.E4.Flex")
.with_driver_shape_config(ocpus=2, memory_in_gbs=32)
.with_executor_shape("VM.Standard.E4.Flex")
.with_executor_shape_config(ocpus=4, memory_in_gbs=64)
.with_logs_bucket_uri("oci://mybucket@mytenancy/")
)
rt = (
DataFlowNotebookRuntime()
.with_notebook(
"<path to notebook>"
) # This could be local path or http path to notebook ipynb file
.with_script_bucket("<my-bucket>")
.with_exclude_tag(["ignore", "remove"]) # Cells to Ignore
)
job = Job(infrastructure=df, runtime=rt).create(overwrite=True)
df_run = job.run(wait=True)
To create a Data Flow application using the ADS Python API you need two components:
• DataFlow, a subclass of Infrastructure.
• DataFlowRuntime, a subclass of Runtime.
DataFlow stores properties specific to Data Flow service, such as compartment_id, logs_bucket_uri, and so on. You
can set them using the with_{property} functions:
• with_compartment_id
• with_configuration
• with_driver_shape
• with_driver_shape_config
• with_executor_shape
• with_executor_shape_config
• with_language
• with_logs_bucket_uri
• with_metastore_id (doc)
• with_num_executors
• with_spark_version
• with_warehouse_bucket_uri
For more details, see DataFlow class documentation.
DataFlowRuntime stores properties related to the script to be run, such as the path to the script and CLI arguments.
Likewise all properties can be set using with_{property}. The DataFlowRuntime properties are:
• with_script_uri
• with_script_bucket
• with_archive_uri (doc)
• with_archive_bucket
• with_custom_conda
For more details, see the runtime class documentation.
Since service configurations remain mostly unchanged across multiple experiments, a DataFlow object can be reused
and combined with various DataFlowRuntime parameters to create applications.
In the following “hello-world” example, DataFlow is populated with compartment_id, driver_shape,
driver_shape_config, executor_shape, executor_shape_config and spark_version. DataFlowRuntime
is populated with script_uri and script_bucket. The script_uri specifies the path to the script. It can be
local or remote (an Object Storage path). If the path is local, then script_bucket must be specified addition-
ally because Data Flow requires a script to be available in Object Storage. ADS performs the upload step for you,
as long as you give the bucket name or the Object Storage path prefix to upload the script. Either can be given
to script_bucket. For example, either with_script_bucket("<bucket_name>") or with_script_bucket(
"oci://<bucket_name>@<namespace>/<prefix>") is accepted. In the next example, the prefix is given for
script_bucket.
def main():
print("Hello World")
print("Spark version is", pyspark.__version__)
if __name__ == "__main__":
main()
"""
)
name = f"dataflow-app-{str(uuid4())}"
dataflow_configs = (
DataFlow()
df_run = df.run()
After the run completes, check the stdout log from the application by running:
print(df_run.logs.application.stdout)
Hello World
Spark version is 3.0.2
Note on Policy
Data Flow supports adding third-party libraries using a ZIP file, usually called archive.zip; see the Data Flow documentation about how to create ZIP files. Similar to scripts, you can specify an archive ZIP for a Data Flow application using with_archive_uri. In the next example, archive_uri is given as an Object Storage location. archive_uri can also be local, in which case you must also specify with_archive_bucket and follow the same rule as with_script_bucket.
)
)
if verbose:
rows = query_result_df.toJSON().collect()
for i, row in enumerate(rows):
print(f"record {i}")
print(row)
if __name__ == "__main__":
main()
'''
)
name = f"dataflow-app-{str(uuid4())}"
dataflow_configs = (
DataFlow()
.with_compartment_id("oci1.xxx.<compartment_ocid>")
You can save the application specification into a YAML file for future reuse. You could also use the json format.
print(df.to_yaml("sample-df.yaml"))
You can also load a Data Flow application directly from the YAML file saved in the previous example:
df2 = Job.from_yaml(uri="sample-df.yaml")
df_run2 = df2.create().run()
df2.delete()
df_run2.status
df_run3 = df3.run()
assert len(df.run_list()) == 2
When you run a Data Flow application, a DataFlowRun object is created. You can check the status, wait for a run to
finish, check its logs afterwards, or cancel a run in progress. For example:
df_run.status
df_run.wait()
df_run.logs.application.stderr
df_run.logs.executor.stdout
df_run.logs.executor.stderr
You can also examine head or tail of the log, or download it to a local path. For example,
log = df_run.logs.application.stdout
log.head(n=1)
log.tail(n=1)
log.download(<local-path>)
For the sample script, the log prints the first five rows of a sample dataframe in JSON, and it looks like:
record 0
{"city":"Berlin","zipcode":"10119","lat_long":"52.53453732241747,13.402556926822387"}
record 1
{"city":"Berlin","zipcode":"10437","lat_long":"52.54851279221664,13.404552826587466"}
record 2
{"city":"Berlin","zipcode":"10405","lat_long":"52.534996191586714,13.417578665333295"}
record 3
{"city":"Berlin","zipcode":"10777","lat_long":"52.498854933130026,13.34906453348717"}
record 4
{"city":"Berlin","zipcode":"10437","lat_long":"52.5431572633131,13.415091104515707"}
'record 0'
{"city":"Berlin","zipcode":"10437","lat_long":"52.5431572633131,13.415091104515707"}
A link to run the page in the OCI Console is given using the run_details_link property:
df_run.run_details_link
To list Data Flow applications, a compartment id must be given with any optional filtering criteria. For example, you
can filter by name of the application:
Job.dataflow_job(compartment_id=compartment_id, display_name=name)
13.3.3.1 YAML
You can create a Data Flow job directly from a YAML string. You can pass a YAML string into the Job.from_yaml()
function to build a Data Flow job:
kind: job
spec:
id: <dataflow_app_ocid>
infrastructure:
kind: infrastructure
spec:
compartmentId: <compartment_id>
driverShape: VM.Standard.E4.Flex
driverShapeConfig:
ocpus: 2
memory_in_gbs: 32
executorShape: VM.Standard.E4.Flex
executorShapeConfig:
ocpus: 4
memory_in_gbs: 64
id: <dataflow_app_ocid>
language: PYTHON
logsBucketUri: <logs_bucket_uri>
numExecutors: 1
sparkVersion: 2.4.4
type: dataFlow
name: dataflow_app_name
runtime:
kind: runtime
spec:
scriptBucket: bucket_name
scriptPathURI: oci://<bucket_name>@<namespace>/<prefix>
type: dataFlow
kind:
allowed:
- runtime
13.4 spark-defaults.conf
The spark-defaults.conf file is used to define the properties that are used by Spark. This file can be configured
manually or with the aid of the odsc command-line tool. The best practice is to use the odsc data-catalog config
command-line tool when you want to connect to Data Catalog. It gathers information about your environment and uses
that to build the file.
The odsc data-catalog config command-line tool uses the --metastore option to define the Data Catalog Meta-
store OCID. There are no required command-line options. Default values are used or values are taken from your note-
book session environment and OCI configuration file. Below is a discussion of common parameters that you may need
to override.
The --authentication option sets the authentication mode. It supports resource principal and API keys. The preferred authentication method is resource principal, which is set with --authentication resource_principal. If you want to use API keys, then use the option --authentication api_key. If --authentication is not specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the spark-defaults.conf file.
The Object Storage and Data Catalog are regional services. By default, the region is set to the region that your notebook
session is in. This information is taken from the environment variable NB_REGION. Use the --region option to override
this behavior.
The default location of the spark-defaults.conf file is in the ~/spark_conf_dir directory, as defined in the
SPARK_CONF_DIR environment variable. Use the --output option to define the directory where the file is to be
written.
The odsc data-catalog config command-line tool is ideal for setting up the spark-defaults.conf file as it
gathers information about your environment and uses that to build the file.
You will need to determine what settings are appropriate for your configuration. However, the following will work for
most configurations.
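A typical invocation (hedged; only the options discussed in this section are shown) looks like:

odsc data-catalog config --authentication resource_principal --metastore <metastore_id>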
If the option --authentication api_key is used, it will extract information from the OCI configuration file that is
stored in ~/.oci/config. Use the --config option to change the path and the --profile option to specify what
OCI configuration profile will be used. The default profile is DEFAULT.
A default Data Catalog Metastore OCID can be set using the --metastore option. This value can be overridden at
run-time.
The <metastore_id> must be replaced with the OCID for the Data Catalog Metastore that is to be used.
For details on the command-line options, use the command:
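This is presumably the usual help flag:

odsc data-catalog config --help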
13.4.2 Manual
The odsc command-line tool is the preferred method for configuring the spark-defaults.conf file. However, if
you are not in a notebook session or if you have special requirements, you may need to manually configure the file.
This section will guide you through the steps.
When a Data Science Conda environment is installed, it includes a template of the spark-defaults.conf file. The
following sections provide guidance to make the required changes.
These parameters define the Object Storage address that backs the Data Catalog entry. This is the location of the data
warehouse. You also need to define the address of the Data Catalog Metastore.
• spark.hadoop.fs.oci.client.hostname: Address of Object Storage for the data warehouse. For example,
https://objectstorage.us-ashburn-1.oraclecloud.com. Replace us-ashburn-1 with the region you
are in.
• spark.hadoop.oci.metastore.uris: The address of Data Catalog Metastore. For example, https://datacatalog.us-ashburn-1.oci.oraclecloud.com/. Replace us-ashburn-1 with the region you are in.
You can set a default metastore with the following parameter. This can be overridden at run time. Setting it is optional.
• spark.hadoop.oracle.dcat.metastore.id: The OCID of Data Catalog Metastore. For example, ocid1.datacatalogmetastore..<unique_id>
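Putting these parameters together, the relevant portion of spark-defaults.conf would look roughly like the following (the region and OCID values are placeholders):

spark.hadoop.fs.oci.client.hostname      https://objectstorage.us-ashburn-1.oraclecloud.com
spark.hadoop.oci.metastore.uris          https://datacatalog.us-ashburn-1.oci.oraclecloud.com/
spark.hadoop.oracle.dcat.metastore.id    <metastore_ocid>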
Depending on the authentication method that is to be used there are additional parameters that need to be set. See the
following sections for guidance.
This section demonstrates how to configure the spark-defaults.conf file so that you can connect with the Oracle
Cloud Infrastructure (OCI) Data Catalog Metastore. This connection is used to run a PySpark application using OCI
Data Flow and Data Science Jobs. The data will be stored in OCI Object Storage. Thus, you will work with data that is stored in Object Storage, the location and structure of that data will be managed by the Data Catalog Metastore, compute will be provided by Data Flow, and all of this will be run in a Job.
OCI Data Catalog is a metadata management service that helps data professionals discover data and support data
governance. The Data Catalog Metastore provides schema definitions for objects in structured and unstructured data
assets that reside in Object Storage. Use the metastore as a central metadata repository to manage data tables that are
backed by files in Object Storage.
OCI Data Flow is a fully managed Apache Spark service. This section demonstrates how to use PySpark to create
Spark applications.
Data Science Jobs allows you to run customized tasks outside of a notebook session. A Job is a template that describes
a task that you want to perform. In this section, that task is to run a PySpark application using Data Flow. Since the
Job is run outside of a notebook, command-line arguments can be passed to the Job such that it performs customized
activities. OCI Logging is used to capture events. You can also read and write data to Object Storage directly or with
the aid of Data Catalog.
Data Flow can access the Data Catalog Metastore to securely store and retrieve schema definitions for unstructured and
structured data assets in Object Storage. For integration with Data Flow, the metastore provides an invocation endpoint.
This endpoint is a Hive Metastore interface.
Apache Hive is a data warehousing framework that facilitates read, write, or manage operations on large datasets
residing in distributed file systems. The Data Catalog Metastore is backed by the Apache Hive Metastore. A Hive
Metastore is the central repository of metadata for a Hive cluster. It stores metadata for data structures such as databases,
tables, and partitions in a relational database, backed by files maintained in Object Storage. Apache Spark SQL makes
use of a Hive Metastore for this purpose.
13.5.1 Prerequisite
To access the data in the Data Catalog or work with Data Flow, there are a number of steps that need to be completed.
To configure Data Flow:
• DataFlow requires a bucket to store the logs, and a data warehouse bucket. Refer to the Data Flow documentation
for setting up storage.
• DataFlow requires policies to be set in IAM to access resources to manage and run applications. Refer to the
Data Flow documentation on how to setup policies.
• DataFlow natively supports conda packs published to OCI Object Storage. Ensure the Data Flow Resource has
read access to the bucket or path of your published conda pack, and that the spark version >= 3 when running
your Data Flow Application.
• The core-site.xml file needs to be configured.
To configure Data Catalog:
• Data Catalog requires policies to be set in IAM. Refer to the Data Catalog documentation on how to setup policies.
• The spark-defaults.conf file needs to be configured.
compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
driver_shape = "VM.Standard.E4.Flex"
driver_shape_config = {"ocpus":2, "memory_in_gbs":32}
executor_shape = "VM.Standard.E4.Flex"
executor_shape_config = {"ocpus":4, "memory_in_gbs":64}
spark_version = "3.2.1"
def main():
database_name = "employee_attrition"
table_name = "orcl_attrition"
print(f"Creating {database_name}")
spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")
""")
# Convert the filtered Apache Spark DataFrame into JSON format and write it out to stdout
# so that it can be captured in the log.
print('\\n'.join(query_result_df.toJSON().collect()))
if __name__ == '__main__':
main()
'''
dataflow_configs = DataFlow(
{
"compartment_id": compartment_id,
"driver_shape": driver_shape,
"driver_shape_config": driver_shape_config,
"executor_shape": executor_shape,
"executor_shape_config": executor_shape_config,
"logs_bucket_uri": log_bucket_uri,
"metastore_id": metastore_id,
"spark_version": spark_version
}
)
runtime_config = DataFlowRuntime(
{
"script_uri": pyspark_file_path,
"script_bucket": script_uri
}
)
database_name = "ODSC_DEMO"
table_name = "ODSC_PYSPARK_METASTORE_DEMO"
# Load the Employee Attrition data file from OCI Object Storage into a Spark DataFrame:
file_path = "oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.
˓→csv"
# explore data
spark_df = spark.sql(f"""
SELECT EducationField, SalaryLevel, JobRole FROM {database_name}.{table_name} limit 10
""")
spark_df.show()
This example demonstrates how to create a Data Flow application that is connected to the Data Catalog Metastore. It
creates a PySpark script, then a Data Flow application. This application can be run directly by Data Flow or as part of a Job.
This section runs Hive queries using Data Flow. When the Data Catalog is being used, the only change that needs to be made is to provide the metastore OCID.
A PySpark script is needed for the Data Flow application. The following code creates that script. The script will use
Spark to load a CSV file from a public Object Storage bucket. It will then create a database and write the file to Object
Storage. Finally, it will use Spark SQL to query the database and print the records in JSON format.
There is nothing in the PySpark script that is specific to using Data Catalog Metastore. The script treats the database
as a standard Hive database.
script = '''
from pyspark.sql import SparkSession
def main():
database_name = "employee_attrition"
table_name = "orcl_attrition"
print(f"Creating {database_name}")
spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")
""")
# Convert the filtered Apache Spark DataFrame into JSON format and write it out to stdout
# so that it can be captured in the log.
print('\\n'.join(query_result_df.toJSON().collect()))
if __name__ == '__main__':
main()
'''
To create a Data Flow application you will need DataFlow and DataFlowRuntime objects. A DataFlow object stores
the properties that are specific to the Data Flow service. These would be things such as the compartment OCID, the
URI to the Object Storage bucket for the logs, the type of hardware to be used, the version of Spark, and much more.
If you are using a Data Catalog Metastore to manage a database, the metastore OCID is stored in this object. The
DataFlowRuntime object stores properties related to the script to be run. This would be the bucket to be used for the
script, the location of the PySpark script, and any command-line arguments.
Update the script_bucket, log_bucket_uri, and metastore_id variables to match your tenancy's configuration.
# Update values
log_bucket_uri = "oci://<bucket_name>@<namespace>/<prefix>"
metastore_id = "<metastore_id>"
script_bucket = "oci://<bucket_name>@<namespace>/<prefix>"
compartment_id = os.environ.get("NB_SESSION_COMPARTMENT_OCID")
driver_shape = "VM.Standard.E4.Flex"
driver_shape_config = {"ocpus":2, "memory_in_gbs":32}
executor_shape = "VM.Standard.E4.Flex"
executor_shape_config = {"ocpus":4, "memory_in_gbs":64}
spark_version = "3.2.1"
In the following example, a DataFlow is created and populated with the information that it needs to define the Data
Flow service. Since we are connecting to the Data Catalog Metastore to work with a Hive database, the metastore OCID must be given.
from ads.jobs import DataFlow, DataFlowRun, DataFlowRuntime
dataflow_configs = DataFlow(
{"compartment_id": compartment_id,
"driver_shape": driver_shape,
"driver_shape_config": driver_shape_config,
"executor_shape": executor_shape,
"executor_shape_config": executor_shape_config,
"logs_bucket_uri": log_bucket_uri,
"metastore_id": metastore_id,
"spark_version": spark_version
}
)
In the following example, a DataFlowRuntime is created and populated with the URI to the PySpark script and the
URI for the script bucket. The script URI specifies the path to the script. It can be local or remote (an Object Storage
path). If the path is local, then a URI to the script bucket must also be specified. This is because Data Flow requires a
script to be in Object Storage. If the specified path to the PySpark script is on a local drive, ADS will upload it for you.
runtime_config = DataFlowRuntime(
    {
        "script_bucket": script_bucket,
        "script_uri": pyspark_file_path,
    }
)
13.5.3.3 Run
The recommended approach for running Data Flow applications is to use a Job. This will prevent your notebook from
being blocked.
A Job requires a name, infrastructure, and runtime settings. Update the following code to give the job a unique name.
The infrastructure takes a DataFlow object and the runtime parameter takes a DataFlowRuntime object.
# Update values
job_name = "<job_name>"
df_job = Job(name=job_name,
infrastructure=dataflow_configs,
runtime=runtime_config)
df_app = df_job.create()
df_run = df_app.run()
This section demonstrates how to make connections to the Data Catalog Metastore and Object Storage. It uses Spark
to load data from a public Object Storage file and creates a database. The metadata for the database is managed by the
Data Catalog Metastore and the data is copied to your data warehouse bucket. Finally, Spark is used to make a Spark
SQL query on the database.
Specify the bucket URI that will act as the data warehouse in the warehouse_uri variable; it should have the format oci://<bucket_name>@<namespace_name>/<prefix>. Update the metastore_id variable with the OCID of the Data Catalog Metastore.
Create a Spark session that connects to the Data Catalog Metastore and the Object Storage that will act as the data
warehouse.
warehouse_uri = "<warehouse_uri>"
metastore_id = "<metastore_id>"
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_uri) \
.config("spark.hadoop.oracle.dcat.metastore.id", metastore_id) \
.enableHiveSupport() \
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
Load a data file from Object Storage into a Spark DataFrame. Create a database in the Data Catalog Metastore and
then save the dataframe as a table. This will write the files to the location specified by the warehouse_uri variable.
database_name = "ODSC_DEMO"
table_name = "ODSC_PYSPARK_METASTORE_DEMO"
file_path = "oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.
˓→csv"
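# The prose above describes reading the CSV into a DataFrame, creating the database
# in the Data Catalog Metastore, and saving the DataFrame as a table backed by the
# warehouse bucket. A hedged sketch of those steps:
input_dataframe = spark.read.option("header", "true").csv(file_path)
spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE")
spark.sql(f"CREATE DATABASE {database_name}")
input_dataframe.write.mode("overwrite").saveAsTable(f"{database_name}.{table_name}")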
spark_df = spark.sql(f"""
SELECT EducationField, SalaryLevel, JobRole FROM {database_name}.{table_name} limit 10
""")
spark_df.show()
13.6 [Legacy]
13.6.1 Prerequisite
We provide simple PySpark or SparkSQL templates for you to get started with Data Flow. You can use data_flow.template() to generate a pre-written template. We support these templates:
• The standard_pyspark template is used for standard PySpark jobs.
• The sparksql template is used for SparkSQL jobs.
data_flow.template() returns the local path to the script you have generated.
The application creation process has two stages: preparation and creation. In the preparation stage, you prepare the configuration object necessary to create an application. You must provide values for these three parameters:
• display_name: The name you give your application.
• pyspark_file_path: The local path to your PySpark script.
• script_bucket: The bucket used to read/write the PySpark script in Object Storage.
ADS checks that the bucket exists, and that you can write to it from your notebook session. Optionally, you can change values for these parameters:
• compartment_id: The OCID of the compartment in which to create the application. If it's not provided, the same compartment as your dataflow object is used.
• driver_shape: The driver shape used to create the application. The default value is "VM.Standard2.4".
• executor_shape: The executor shape to create the application. The default value is "VM.Standard2.4".
• logs_bucket: The bucket used to store run logs in Object Storage. The default value is "dataflow-logs".
• num_executors: The number of executor VMs requested. The default value is 1.
Note: If you want to use a private bucket as the logs_bucket, ensure that you add a corresponding service policy using Identity: Policy Set Up (https://docs.cloud.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#policy_set_up).
Then you can use prepare_app() to create the configuration object necessary to create the application.
data_flow = DataFlow()
app_config = data_flow.prepare_app(
display_name="<app-display-name>",
script_bucket="<your-script-bucket>" ,
pyspark_file_path="<your-scirpt-path>"
)
After you have the application configured, you can create a application using create_app:
app = data_flow.create_app(app_config)
Your local script is uploaded to the script bucket in this application creation step. Object Storage supports file versioning
that creates an object version when the content changes, or the object is deleted. You can enable Object Versioning
in your bucket in the OCI Console to prevent overwriting of existing files in Object Storage.
You can create an application with a script file that exists in Object Storage by setting overwrite_script=True in
create_app. Similarly, you can set overwrite_archive=True to create an application with an archive file that
exists in Object Storage. By default, the overwrite_script and overwrite_archive options are set to false.
app.config
Next, you could get a URL link to the OCI Console Application Details page.
app.oci_link
As an alternative to creating applications in ADS, you can load existing applications created elsewhere. These appli-
cations must be Python applications. To load an existing application, you need the application's OCID. You can find the app_id in the OCI Console or by listing existing applications.
Optionally, you could assign a value to the parameter target_folder. This parameter is the directory you want
to store the local artifacts of this application in. If target_folder is not provided, then the local artifacts of this
application are stored in the dataflow_base_folder folder defined by the dataflow object instance.
From ADS you can list applications, which are returned as a list of dictionaries, with a function to provide the data as a Pandas dataframe. The default sort order is the most recent run first. For example, to list the most recent five applications, use this code:
After an application is created or loaded in your notebook session, the next logical step is to execute a run of that
application. The process of running (or creating) a run is similar to creating an application.
First, you configure the run using the prepare_run() method of the DataFlowApp object. You only need to provide
a value for the name of your run using run_display_name:
run_config = app.prepare_run(run_display_name="<run-display-name>")
You could use a compartment different from your application to create a run by specifying the compartment_id in
prepare_run. By default, it uses the same compartment as your application to create the run.
Optionally, you can specify the logs_bucket to store the logs of your run. By default, the run inherits the
logs_bucket from the parent application, but you can overwrite that option.
Every time the application launches a run, a local folder representing this run is created. This folder stores all the
information including the script, the run configuration, and any logs that are stored in the logs bucket.
Then, you can create a run using the run_config generated in the preparation stage. During this process, you
can monitor the run while the job is running. You can also pull logs into your local directories by setting save_log_to_local=True.
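A hedged sketch of that call (the keyword name comes from the save_log_to_local note that follows):

run = app.run(run_config, save_log_to_local=True)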
The DataFlowRun object has some useful attributes similar to the DataFlowApp object.
You can check the status of the run with:
run.status
You can get the configuration file that created this run. The run configuration and the PySpark script used in this run
are also saved in the corresponding run directory in your notebook environment.
run.config
You can get the run directory where the artifacts are stored in your notebook environment with:
run.local_dir
Similarly, you can get a clickable link to the OCI Console Run Details page with:
run.oci_link
After a run has completed, you can examine the logs using ADS. There are two types of logs, stdout and stderr.
# the path to the saved logs in the notebook environment if ``save_log_to_local`` was ``True`` when you create this run
run.log_stdout.local_path
If save_log_to_local is set to False during app.run(...), you can fetch logs by calling the fetch_log(...).save() method on the DataFlowRun object with the correct log type.
run.fetch_log("stderr").save()
Note: Due to a limitation of PySpark (specifically Python applications in Spark), both stdout and stderr are
merged into the stdout stream.
The integration with ADS supports the edit-run-edit cycle, so the local PySpark script can be edited, and is automatically synchronized to Object Storage each time the application is run.
Data Flow obtains the PySpark script from Object Storage, so the local files in the notebook session are not visible to Data Flow. The app.run(...) method compares the content hash of the local file with the remote copy on Object Storage. If any change is detected, the new local version is copied over to the remote. For the first run, the synchronization creates the remote file and generates a fully qualified URL with namespace that's required by Data Flow.
Synchronizing is the default setting in app.run(...). If you don't want the application to sync with the locally modified files, you need to include sync=False as an argument parameter in app.run(...).
Passing arguments to PySpark scripts is done with the arguments value in prepare_app. In addition to arguments, a script_parameter dictionary is supported that you can use to interpolate arguments. To just pass arguments, the script_parameter section may be ignored. However, any key-value pair defined in script_parameter can be referenced in arguments using the ${key} syntax, and the value of that key is passed as the argument value.
data_flow = DataFlow()
app_config = data_flow.prepare_app(
display_name,
script_bucket,
pyspark_file_path,
)
run_config = app.prepare_run(run_display_name="test-run")
run = app.run(run_config)
Note: The arguments in the format of ${arg} are replaced by the value provided in script parameters when passed
in, while arguments not in this format are passed into the script verbatim.
You can override the values of some or all script parameters in each run by passing different values to prepare_run().
Your PySpark applications might have custom dependencies in the form of Python wheels or virtual environments; see Adding Third-Party Libraries to Applications. Pass the archive file to your applications with the archive_path and archive_bucket values in prepare_app.
• archive_path: The local path to the archive file.
• archive_bucket: The bucket used to read and write the archive file in Object Storage; if not provided, archive_bucket defaults to the bucket used for the PySpark script.
Use prepare_app() to create the configuration object necessary to create the application.
data_flow = DataFlow()
app_config = data_flow.prepare_app(
display_name="<app-display-name>",
script_bucket="<your-script-bucket>",
pyspark_file_path="<your-scirpt-path>",
archive_path="<your-archive-path>",
archive_bucket="<your-archive-bucket>"
)
The behavior of the archive file is very similar to that of the PySpark script:
• When creating an application, the local archive file is uploaded to the specified Object Storage bucket.
• When creating a run, the latest local archive file is synchronized to the remote file in Object Storage. The sync parameter controls this behavior.
• When loading an existing application created with archive_uri, the archive file is obtained from Object Storage and saved in the local directory.
After the application has run and any stdout has been captured in the log file, the PySpark script likely produces some form of output. Usually a PySpark script batch processes something, for example, sampling data, aggregating data, or preprocessing data. You can load the resulting output with ADSDataset.open() using the ocis:// protocol handler.
The only way to get output from PySpark back into the notebook session is to create files in Object Storage that are read into the notebook, or to use the stdout stream.
The following is a simple example of a PySpark script producing output printed in a portable JSON-L format, though CSV works too. This method, while convenient as an example, is not recommended for large data.
from pyspark.sql import SparkSession

def main():
    # Create a Spark session
    spark = SparkSession.builder.getOrCreate()

    # load an example csv file from dataflow public storage into DataFrame
    original_df = spark\
        .read\
        .format("csv")\
        .option("header", "true")\
        .option("multiLine", "true")\
        .load("oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv")

    # register the DataFrame as a temporary view so it can be queried with Spark SQL
    original_df.createOrReplaceTempView("berlin")

    query_result_df = spark.sql("""
        SELECT
            city,
            zipcode,
            number_of_reviews,
            CONCAT(latitude, ',', longitude) AS lat_long
        FROM
            berlin"""
    )

    # write the result to stdout in JSON-L format so it is captured in the log
    print('\n'.join(query_result_df.toJSON().collect()))

if __name__ == '__main__':
    main()
After the run completes, the stdout stream (which contains the JSON-L formatted records) can be read back into a Pandas dataframe.
import io
import pandas as pd
# the PySpark script wrote to the log as jsonL, and we read the log back as a pandas dataframe
df = pd.read_json((str(run.log_stdout)), lines=True)
df.head()
13.6.5 Example Notebook: Develop Pyspark jobs locally - from local to remote work-
flows
This notebook provides Spark operations for customers by bridging the existing local Spark workflows with cloud-based capabilities. Data scientists can use their familiar local environments with JupyterLab, and work with remote data and remote clusters simply by selecting a kernel. The operations demonstrated are how to:
• Use the interactive spark environment and produce a spark script,
• Prepare and create an application,
• Prepare and create a run,
• List existing dataflow applications,
• Retrieve and display the logs,
The purpose of the dataflow module is to provide an efficient and convenient way for you to launch a Spark application,
and run Spark jobs. The interactive Spark kernel provides a simple and efficient way to edit and build your Spark script,
and easy access to read from an OCI filesystem.
import io
import matplotlib.pyplot as plt
import os
from os import path
import pandas as pd
import tempfile
import uuid
Load the Employee Attrition data file from OCI Object Storage into a Spark DataFrame:
emp_attrition = spark\
    .read\
    .format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("multiLine", "true")\
    .load("oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv")
Visualize how monthly income and age relate to one another in the context of years in industry:
fig, ax = plt.subplots()
plot = spark.sql("""
  SELECT
    Age,
    MonthlyIncome,
    YearsInIndustry
  FROM
    emp_attrition
""").toPandas().plot.scatter(x="Age", y="MonthlyIncome",
                             title='Age vs Monthly Income', ax=ax)
plot.set_xlabel("Age")
plot.set_ylabel("Monthly Income")
plot
The emp_attrition table contains columns including Age, Attrition, TravelForWork, SalaryLevel, JobFunction, CommuteLength, EducationalLevel, EducationField, Directs, and EmployeeNumber.
Select a few columns using Spark, and convert it into a Pandas dataframe:
df = spark.sql("""
SELECT
Age,
MonthlyIncome,
YearsInIndustry
FROM
emp_attrition """).limit(10).toPandas()
df
You can work with different compression formats within Data Flow. For example, snappy Parquet:
read_snappy_df.first()
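The dataframe read_snappy_df is not created in this excerpt. A minimal sketch of how it might be built, with a placeholder Object Storage path (Spark detects the snappy codec automatically when reading Parquet):
# Placeholder path; adjust to the location of your snappy-compressed Parquet data.
read_snappy_df = spark \
    .read \
    .format("parquet") \
    .load("oci://<bucket>@<namespace>/path/to/snappy_parquet/")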
Other compression formats that Data Flow supports include snappy Parquet and Gzip on both CSV and Parquet.
You might have a query from previous explorations that you want to run in Data Flow. Review the dataflow.ipynb notebook example that shows you how to submit a job to Data Flow.
dataflow_base_folder = tempfile.mkdtemp()
data_flow = DataFlow(dataflow_base_folder=dataflow_base_folder)
print("Data flow directory: {}".format(dataflow_base_folder))
script = '''
from pyspark.sql import SparkSession

def main():
    # ... build query_result_df with Spark SQL as shown in the earlier example ...
    print('\\n'.join(query_result_df.toJSON().collect()))

if __name__ == '__main__':
    main()
'''
app_config = data_flow.prepare_app(display_name=display_name,
                                   script_bucket=script_bucket,
                                   pyspark_file_path=pyspark_file_path,
                                   logs_bucket=logs_bucket)
app = data_flow.create_app(app_config)

run_display_name = "sample_Data_Flow_run"
run_config = app.prepare_run(run_display_name=run_display_name)
run = app.run(run_config)
run.status
'SUCCEEDED'
run.config
{'compartment_id': 'ocid1.compartment..<unique_ID>',
'script_bucket': 'test',
'pyspark_file_path': '/tmp/tmpe18x_qbr/example-0054ed.py',
'archive_path': None,
'archive_bucket': None,
'run_display_name': 'sample_Data_Flow_run',
'logs_bucket': 'dataflow-log',
'logs_bucket_uri': 'oci://dataflow-log@ociodscdev',
'driver_shape': 'VM.Standard2.4',
'executor_shape': 'VM.Standard2.4',
'num_executors': 1}
run.oci_link
if "adb_url" in globals():
output_dataframe = sc.read \
.format("jdbc") \
.option("url", adb_url) \
.option("dbtable", table_name) \
(continues on next page)
The database table is loaded into Spark so that you can perform operations to transform, model, and more. In the next
cell, the notebook prints the table demonstrating that it was successfully loaded into Spark from the ADB.
if "adb_url" in globals():
output_dataframe.show()
else:
print("Skipping as it appears that you do not have output_dataframe configured.")
Cleaning Up Artifacts
This example created a number of artifacts, such as the unzipped wallet file, a database table, and a running Spark cluster. Next, you remove these resources.
if wallet_path != "<wallet_path>":
    connection.update_repository(key="pyspark_adb", value=adb_creds)
    connection.import_wallet(wallet_path=wallet_path, key="pyspark_adb")
    conn = cx_Oracle.connect(user, password, tnsname)
    cursor = conn.cursor()
    cursor.execute(f"DROP TABLE {table_name}")
    cursor.close()
    conn.close()
else:
    print("Skipping as it appears that you do not have wallet_path specified.")
if "tns_path" in globals():
shutil.rmtree(tns_path)
sc.stop()
This notebook demonstrates how to use PySpark to process data in Object Storage, and save the results to an ADB. It
also demonstrates how to query data from an ADB using a local PySpark session.
import base64
import cx_Oracle
import oci
import os
import shutil
import tempfile
import zipfile
Introduction
It has become a common practice to store structured and semi-structured data using services such as Object Storage. This provides a scalable solution to store vast quantities of data that can be post-processed. However, using a relational database management system (RDBMS) such as the Oracle ADB provides advantages like ACID compliance, rapid relational joins, support for complex business logic, and more. It is important to be able to access information stored in Object Storage, process that information, and load it into an RDBMS. This notebook demonstrates how to use PySpark, a Python interface to Apache Spark, to perform these operations.
This notebook reads from a publicly accessible Object Storage location. However, an ADB needs to be configured with permissions to create a table, write to that table, and read from it. It also assumes that the credentials to access the database are stored in the Vault. This is the best practice because it prevents the credentials from being stored locally or in the notebook where they may be accessible to others. If you do not have credentials stored in the Vault, see the vault.ipynb example notebook for instructions on how to store them. Once the database credentials are stored in the Vault, you need the OCIDs for the Vault, encryption key, and the secret.
ADBs have an additional level of security that is needed to access them: a wallet file. You can obtain the wallet file from your account administrator or download it using the steps outlined in downloading a wallet (https://docs.oracle.com/en-us/iaas/Content/Database/Tasks/adbconnecting.htm#access). The wallet file is a ZIP file. This notebook unzips the wallet and updates the configuration settings so you don't have to.
The database connection also needs the TNS name of the database. Your database administrator can give you the TNS
name of the database that you have access to.
Setup the Required Variables
The required variables to set up are:
1. vault_id, key_id, secret_ocid: The OCIDs of the Vault, the encryption key, and the secret that stores the username and password required to connect to your ADB. Note that the secret is the credential needed to access the database. This notebook is designed so that any secret can be stored as long as it is in the form of a dictionary. To store your secret, just modify the dictionary; see the vault.ipynb example notebook for detailed steps to generate these OCIDs.
2. tnsname: A TNS name valid for the database.
3. wallet_path: The local path to your wallet ZIP file, see the autonomous_database.ipynb example notebook
for instructions on accessing the wallet file.
secret_ocid = "secret_ocid"
tnsname = "tnsname"
wallet_path = "wallet_path"
vault_id = "vault_id"
key_id = "key_id"
def setup_wallet(wallet_path):
    """
    Prepare ADB wallet file for use in PySpark.
    """
    temporary_directory = tempfile.mkdtemp()
    zip_file_path = os.path.join(temporary_directory, "wallet.zip")

    # Copy the wallet ZIP file into the temporary directory and extract its contents there.
    shutil.copy(wallet_path, zip_file_path)
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(temporary_directory)

    return temporary_directory
if wallet_path != "<wallet_path>":
    print("Setting up wallet")
    tns_path = setup_wallet(wallet_path)
else:
    print("Skipping as it appears that you do not have wallet_path specified.")
Setting up wallet
This notebook reads in a data file that is stored in an Oracle Object Storage file. This is defined with the file_path
variable. The SparkContext with the read.option().csv() methods is used to read in the CSV file from Object
Storage into a data frame.
file_path = "oci://hosted-ds-datasets@bigdatadatasciencelarge/synthetic/orcl_attrition.csv"
table_name = "ODSC_PYSPARK_ADB_DEMO"
FOURTEEN
BIG DATA SERVICE
To work with BDS in a notebook session or job, you must have a conda environment that supports the BDS module
in ADS along with support for PySpark. This section demonstrates how to modify a PySpark Data Science conda
environment to work with BDS. It also demonstrates how to publish this conda environment so that you can share it with team members and use it in jobs.
14.2.1 Create
The following are the recommended steps to create a conda environment to connect to BDS:
• Open a terminal window then run the following commands:
• odsc conda install -s pyspark30_p37_cpu_v5: Install the PySpark conda environment.
14.2.2 Publish
14.3 Connect
Notebook sessions require a conda environment that has the BDS module of ADS installed.
The preferred method to connect to a BDS cluster is to use the BDSSecretKeeper class. This allows you to store the
BDS credentials in the vault and not the notebook. It also provides a greater level of access control to the secrets and
allows for credential rotation without breaking connections from various sources.
import ads
import os

from ads.bds.auth import krbcontext
from ads.secrets.big_data_service import BDSSecretKeeper
from pyhive import hive

ads.set_auth('resource_principal')

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred['keytab_path']):
        cursor = hive.connect(host=cred["hive_host"],
                              port=cred["hive_port"],
                              auth='KERBEROS',
                              kerberos_service_name="hive").cursor()
BDS requires a Kerberos ticket to authenticate to the service. The preferred method is to use the vault and
BDSSecretKeeper because it is more secure, and prevents private information from being stored in a notebook. How-
ever, if this is not possible, you can use the refresh_ticket() method to manually create the Kerberos ticket. This
method requires the following parameters:
• kerb5_path: The path to the krb5.conf file. You can copy this file from the master node of the BDS cluster
located in /etc/krb5.conf.
• keytab_path: The path to the principal’s keytab file. You can download this file from the master node on the
BDS cluster.
• principal: The unique identity to which Kerberos can assign tickets.
import ads
import fsspec
import os

from ads.bds.auth import refresh_ticket
from pyhive import hive

ads.set_auth('resource_principal')

refresh_ticket(principal="<your_principal>", keytab_path="<your_local_keytab_file_path>",
               kerb5_path="<your_local_kerb5_config_file_path>")

cursor = hive.connect(host="<hive_host>", port="<hive_port>",
                      auth='KERBEROS', kerberos_service_name="hive").cursor()
14.3.2 Jobs
A job requires a conda environment that has the BDS module of ADS installed. It also requires secrets and configuration information that can be used to obtain a Kerberos ticket for authentication. The keytab and krb5.conf files must be available on the job instance and can be copied as part of the job. We recommend that you save them in the vault and then use BDSSecretKeeper to access them. This is secure because the vault provides access control and allows for key rotation without breaking existing jobs. You can use the notebook to load configuration parameters like hdfs_host, hdfs_port, hive_host, hive_port, and so on. The keytab and krb5.conf files are securely loaded from the vault and then saved in the job instance. The krbcontext() method is then used to create the Kerberos ticket. Once the ticket is created, you can query BDS.
This section demonstrates various methods to work with files on BDS’ HDFS, see the individual framework’s docu-
mentation for details.
A Kerberos ticket is needed to connect to the BDS cluster. This authentication ticket can be obtained with the refresh_ticket() method or with the use of the Vault and a BDSSecretKeeper object. This section demonstrates the use of the BDSSecretKeeper object because it is more secure and is the preferred method.
14.4.1 FSSpec
The fsspec or Filesystem Spec is an interface that allows access to local, remote, and embedded file systems. You use
it to access data stored in the BDS’ HDFS. This connection is made with the WebHDFS protocol.
The fsspec library must be able to access BDS so a Kerberos ticket must be generated. The secure and recommended
method to do this is to use BDSSecretKeeper that stores the BDS credentials in the vault not the notebook session.
This section outlines some common file operations, see the fsspec API Reference for complete details on the features
that are demonstrated and additional functionality.
Pandas and PyArrow can also use fsspec to perform file operations.
14.4.1.1 Connect
Credentials and configuration information is stored in the vault. This information is used to obtain a Kerberos ticket
and define the hdfs_config dictionary. This configuration dictionary is passed to the fsspec.filesystem() method to
make a connection to the BDS’ underlying HDFS storage.
import ads
import fsspec

from ads.secrets.big_data_service import BDSSecretKeeper
from ads.bds.auth import krbcontext

ads.set_auth("resource_principal")

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred['keytab_path']):
        hdfs_config = {
            "protocol": "webhdfs",
            "host": cred["hdfs_host"],
            "port": cred["hdfs_port"],
            "kerberos": "True"
        }

fs = fsspec.filesystem(**hdfs_config)
14.4.1.2 Delete
Delete files from HDFS using the .rm() method. It accepts a path of the files to delete.
fs.rm("/data/biketrips/2020??-tripdata.csv", recursive=True)
14.4.1.3 Download
Download files from HDFS to a local storage device using the .get() method. It takes the HDFS path of the files to
download, and the local path to store the files.
fs.get("/data/biketrips/20190[123456]-tripdata.csv", local_path="./first_half/", overwrite=True)
14.4.1.4 List
The .ls() method lists files. It returns the matching file names as a list.
fs.ls("/data/biketrips/2019??-tripdata.csv")
['201901-tripdata.csv',
'201902-tripdata.csv',
'201903-tripdata.csv',
'201904-tripdata.csv',
'201905-tripdata.csv',
'201906-tripdata.csv',
'201907-tripdata.csv',
'201908-tripdata.csv',
'201909-tripdata.csv',
'201910-tripdata.csv',
'201911-tripdata.csv',
'201912-tripdata.csv']
14.4.1.5 Upload
The .put() method is used to upload files from local storage to HDFS. The first parameter is the local path of the files
to upload. The second parameter is the HDFS path where the files are to be stored. .upload() is an alias of .put().
fs.put(
    lpath="./first_half/20200[456]-tripdata.csv",
    rpath="/data/biketrips/second_quarter/"
)
14.4.2 Ibis
Ibis is an open-source library by Cloudera that provides a Python framework to access data and perform analyt-
ical computations from different sources. Ibis allows access to the data using HDFS. You use the ibis.impala.
hdfs_connect() method to make a connection to HDFS, and it returns a handler. This handler has methods such as
.ls() to list, .get() to download, .put() to upload, and .rm() to delete files. These operations support globbing.
Ibis’ HDFS connector supports a variety of additional operations.
14.4.2.1 Connect
After obtaining a Kerberos ticket, the hdfs_connect() method allows access to the HDFS. It is a thin wrapper around
a fsspec file system. Depending on your system configuration, you may need to define the ibis.options.impala.
temp_db and ibis.options.impala.temp_hdfs_path options.
import ibis
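The connection call itself is not shown in this excerpt. The following is a minimal sketch, assuming the BDSSecretKeeper and krbcontext pattern shown earlier and an Ibis version that still provides ibis.impala.hdfs_connect(); the option values are placeholders:
from ads.secrets.big_data_service import BDSSecretKeeper
from ads.bds.auth import krbcontext

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred["keytab_path"]):
        # Optional, depending on your cluster configuration.
        ibis.options.impala.temp_db = "<temp_db>"
        ibis.options.impala.temp_hdfs_path = "<temp_hdfs_path>"

        # WebHDFS connection to the cluster; GSSAPI is the usual choice for Kerberos.
        hdfs = ibis.impala.hdfs_connect(
            host=cred["hdfs_host"],
            port=cred["hdfs_port"],
            protocol="webhdfs",
            auth_mechanism="GSSAPI",
        )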
14.4.2.2 Delete
Delete files from HDFS using the .rm() method. It accepts a path of the files to delete.
hdfs.rm("/data/biketrips/2020??-tripdata.csv", recursive=True)
14.4.2.3 Download
Download files from HDFS to a local storage device using the .get() method. It takes the HDFS path of the files to
download, and the local path to store the files.
hdfs.get("/data/biketrips/20190[123456]-tripdata.csv", local_path="./first_half/", overwrite=True)
14.4.2.4 List
The .ls() method lists files. It returns the matching file names as a list.
hdfs.ls("/data/biketrips/2019??-tripdata.csv")
['201901-tripdata.csv',
'201902-tripdata.csv',
'201903-tripdata.csv',
'201904-tripdata.csv',
'201905-tripdata.csv',
'201906-tripdata.csv',
'201907-tripdata.csv',
 ...]
14.4.2.5 Upload
Use the .put() method to upload files from local storage to HDFS. The first parameter is the HDFS path where the files
are to be stored. The second parameter is the local path of the files to upload.
hdfs.put(rpath="/data/biketrips/second_quarter/",
lpath="./first_half/20200[456]-tripdata.csv",
overwrite=True, recursive=True)
14.4.3 Pandas
Pandas allows access to BDS’ HDFS system through :ref: FSSpec. This section demonstrates some common operations.
14.4.3.1 Connect
import ads
import fsspec

from ads.secrets.big_data_service import BDSSecretKeeper
from ads.bds.auth import krbcontext

ads.set_auth("resource_principal")

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred['keytab_path']):
        hdfs_config = {
            "protocol": "webhdfs",
            "host": cred["hdfs_host"],
            "port": cred["hdfs_port"],
            "kerberos": "True"
        }

fs = fsspec.filesystem(**hdfs_config)
14.4.3.2 File Handle
You can use the fsspec .open() method to open a data file. It returns a file handle. That file handle, f, can be passed to any Pandas method that supports file handles. In the following example, a file on BDS' HDFS cluster is read into a Pandas dataframe.
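A minimal sketch of this pattern, using the fs filesystem object from the Connect section and one of the trip-data files listed earlier:
import pandas as pd

# Open a file on BDS' HDFS through fsspec and pass the file handle to Pandas.
with fs.open("/data/biketrips/201901-tripdata.csv", "r") as f:
    df = pd.read_csv(f)

df.head()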
14.4.3.3 URL
Pandas supports fsspec so you can perform file operations by specifying a protocol string. The WebHDFS protocol is used to access files on BDS' HDFS system. The protocol string has this format:
webhdfs://host:port/path/to/data
The host and port parameters can be passed in the protocol string as follows:
df = pd.read_csv(f"webhdfs://{hdfs_config['host']}:{hdfs_config['port']}/data/biketrips/
˓→201901-tripdata.csv",
storage_options={'kerberos': 'True'})
You can also pass the host and port parameters in the dictionary used by the storage_options parameter. The sample code for hdfs_config defines the host and port with the keys host and port respectively.
hdfs_config = {
"protocol": "webhdfs",
"host": cred["hdfs_host"],
"port": cred["hdfs_port"],
"kerberos": "True"
}
In this case, Pandas uses the following syntax to read a file on BDS’ HDFS cluster:
df = pd.read_csv(f"webhdfs:///data/biketrips/201901-tripdata.csv",
storage_options=hdfs_config)
14.4.4 PyArrow
PyArrow is a Python interface to Apache Arrow. Apache Arrow is an in-memory columnar analytical tool that is designed to process data at scale. PyArrow supports fsspec.filesystem() through the use of the filesystem parameter in many of its data operation methods.
14.4.4.1 Connect
import ads
import fsspec

from ads.secrets.big_data_service import BDSSecretKeeper
from ads.bds.auth import krbcontext

ads.set_auth("resource_principal")

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred['keytab_path']):
        hdfs_config = {
            "protocol": "webhdfs",
            "host": cred["hdfs_host"],
            "port": cred["hdfs_port"],
            "kerberos": "True"
        }

fs = fsspec.filesystem(**hdfs_config)
14.4.4.2 Filesystem
The following sample code shows several different PyArrow methods for working with BDS’ HDFS using the
filesystem parameter:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pandas as pd
import numpy as np

# A datetime index is assumed here; the original example does not show how idx is built.
idx = pd.date_range("2022-01-01", periods=1000, freq="H")

df = pd.DataFrame({
        'numeric_col': np.random.rand(len(idx)),
        'string_col': pd._testing.rands_array(8, len(idx))},
        index=idx
    )

df["dt"] = df.index
df["dt"] = df["dt"].dt.date

table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path="/path/on/BDS/HDFS", partition_cols=["dt"],
                    flavor="spark", filesystem=fs)
This section demonstrates how to perform standard SQL-based data management operations in BDS using various
frameworks, see the individual framework’s documentation for details.
A Kerberos ticket is needed to connect to the BDS cluster. You can obtain this authentication ticket with the refresh_ticket() method, or with the use of the vault and a BDSSecretKeeper object. This section demonstrates the use of the BDSSecretKeeper object because this is more secure and is the recommended method.
14.5.1 Ibis
Ibis is an open-source library by Cloudera that provides a Python framework to access data and perform analytical
computations from different sources. The Ibis project is designed to provide an abstraction over different dialects of
SQL. It enables the data scientist to interact with many different data systems. Some of these systems are Dask, MySQL,
Pandas, PostgreSQL, PySpark, and most importantly for use with BDS, Hadoop clusters.
14.5.1.1 Connect
After obtaining a Kerberos ticket, and depending on your system configuration, you may need to define the ibis.options.impala.temp_db and ibis.options.impala.temp_hdfs_path options. The ibis.impala.connect() method makes a connection to the Impala execution backend. The .sql() method allows you to run SQL commands on the data.
import ibis
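The connection call is not shown in this excerpt. A minimal sketch, assuming the BDSSecretKeeper credentials used throughout this chapter and placeholder Impala host and port values:
from ads.secrets.big_data_service import BDSSecretKeeper
from ads.bds.auth import krbcontext

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred["keytab_path"]):
        # Depending on your cluster, temp_db and temp_hdfs_path may also need to be set.
        client = ibis.impala.connect(
            host="<impala_host>",
            port=21050,
            auth_mechanism="GSSAPI",            # Kerberos authentication
            kerberos_service_name="impala",
        )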
14.5.1.2 Query
To query the data using Ibis, use an SQL DML command like SELECT. Pass the string to the .sql() method, and then call .execute() on the returned object. The output is a Pandas dataframe.
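For example, assuming the client object from the Connect section and a hypothetical table name:
# .sql() builds the query; .execute() runs it and returns a Pandas dataframe.
df = client.sql("SELECT * FROM <table_name> LIMIT 10").execute()
df.head()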
It is important to close sessions when you don’t need them anymore. This frees up resources in the system. Use the
.close() method to close sessions.
client.close()
14.5.2 Impala
Impala is a Python client for HiveServer2 implementations (i.e. Impala, Hive). Both Impala and PyHive clients are
HiveServer2 compliant so the connection syntax is very similar. The difference is that the Impala client uses the Impala
query engine and PyHive uses Hive. In practical terms, Hive is best suited for long-running batch queries and Impala
is better suited for real-time interactive querying, see more about the differences between Hive and Impala.
The Impala dbapi module is a Python DB-API interface.
14.5.2.1 Connect
After obtaining a Kerberos ticket, use the connect() method to make the connection. It returns a connection, and the
.cursor() method returns a cursor object. The cursor has the method .execute() that allows you to run Impala
SQL commands on the data.
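A minimal sketch using the impyla DB-API client, assuming a Kerberos ticket already exists and placeholder host and port values:
from impala.dbapi import connect

cursor = connect(
    host="<impala_host>",
    port="<impala_port>",
    auth_mechanism="GSSAPI",            # Kerberos authentication
    kerberos_service_name="impala",
).cursor()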
14.5.2.2 Create a Table
To create an Impala table and insert data, use the .execute() method on the cursor object, and pass in Impala SQL commands to perform these operations.
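For example, with a hypothetical table name:
# Create a small table and insert a couple of rows.
cursor.execute("CREATE TABLE IF NOT EXISTS sample_table (id INT, name STRING)")
cursor.execute("INSERT INTO sample_table VALUES (1, 'one'), (2, 'two')")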
14.5.2.3 Query
To query an Impala table, use an Impala SQL DML command like SELECT. Pass this string to the .execute() method
on the cursor object to create a record set in the cursor. You can obtain a Pandas dataframe with the as_pandas()
function.
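For example, continuing with the hypothetical table from the previous step:
from impala.util import as_pandas

cursor.execute("SELECT * FROM sample_table")
df = as_pandas(cursor)
df.head()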
To drop an Impala table, use an Impala SQL DDL command like DROP TABLE. Pass this string to the .execute()
method on the cursor object.
It is important to close sessions when you don’t need them anymore. This frees up resources in the system. Use the
.close() method on the cursor object to close a connection.
cursor.close()
14.5.3 PyHive
PyHive is a set of interfaces to Presto and Hive. It is based on the SQLAlchemy and Python DB-API interfaces for
Presto and Hive.
14.5.3.1 Connect
After obtaining a Kerberos ticket, call the hive.connect() method to make the connection. It returns a connection,
and the .cursor() method returns a cursor object. The cursor has the .execute() method that allows you to run
Hive SQL commands on the data.
import ads
import os

from ads.bds.auth import krbcontext
from ads.secrets.big_data_service import BDSSecretKeeper
from pyhive import hive

ads.set_auth('resource_principal')

with BDSSecretKeeper.load_secret("<secret_id>") as cred:
    with krbcontext(principal=cred["principal"], keytab_path=cred['keytab_path']):
        cursor = hive.connect(host=cred["hive_host"],
                              port=cred["hive_port"],
                              auth='KERBEROS',
                              kerberos_service_name="hive").cursor()
14.5.3.2 Create a Table
To create a Hive table and insert data, use the .execute() method on the cursor object and pass in Hive SQL commands to perform these operations.
14.5.3.3 Query
To query a Hive table, use a Hive SQL DML command like SELECT. Pass this string to the .execute() method on the
cursor object. This creates a record set in the cursor. You can access the actual records with methods like .fetchall(),
.fetchmany(), and .fetchone().
In the following example, the .fetchall() method is used in a pd.DataFrame() call to return all the records in a Pandas dataframe:
import pandas as pd
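# Minimal sketch with a hypothetical table name: fetch all rows from the cursor and
# build a Pandas dataframe, taking the column names from cursor.description.
cursor.execute("SELECT * FROM <table_name>")
df = pd.DataFrame(cursor.fetchall(), columns=[col[0] for col in cursor.description])
df.head()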
To drop a Hive table, use a Hive SQL DDL command like DROP TABLE. Pass this string to the .execute() method
on the cursor object.
It is important to close sessions when you don’t need them anymore. This frees up resources in the system. Use the
.close() method on the cursor object to close a connection.
cursor.close()
FIFTEEN
DATA SCIENCE JOBS
Oracle Cloud Infrastructure (OCI) Data Science jobs enable you to define and run repeatable machine learning tasks, such as data preparation, model training, hyperparameter optimization, batch inference, and so on, on a fully managed infrastructure.
15.1 Overview
Data Science jobs allow you to run customized tasks outside of a notebook session. You can have Compute on demand
and only pay for the Compute that you need. With jobs, you can run applications that perform tasks such as data
preparation, model training, hyperparameter tuning, and batch inference. When the task is complete the compute
automatically terminates. You can use the Logging service to capture output messages.
Using jobs, you can:
• Run machine learning (ML) or data science tasks outside of your JupyterLab notebook session.
• Operationalize discrete data science and machine learning tasks as reusable runnable operations.
• Automate your MLOps or CI/CD pipeline.
• Run batch workloads or workloads triggered by events or actions.
• Run batch, mini batch, or distributed batch job inference.
• In a JupyterLab notebook session, launch long running or computation intensive tasks in a Data Science job to keep your notebook free for you to continue your work.
Typically, an ML and data science project is a series of steps including:
• Access
• Explore
• Prepare
• Model
• Train
• Validate
• Deploy
• Test
After the steps are completed, you can automate the process of data exploration, model training, deploying, and testing
using jobs. A single change in the data preparation or model training to experiment with hyperparameter tunings can
be run as a job and independently tested.
Data Science jobs consist of two types of resources: job and job run.
15.1.1 Job
A job is a template that describes the task. It contains elements like the job artifact, which is immutable; it can't be modified after being registered as a Data Science job. A job contains information about the Compute shape, logging configuration, Block Storage, and other options. You can configure environment variables that are used at run time by the job run. You can also pass in CLI arguments. This allows a job run to be customized while using the same job as a template. You can override the environment variables and CLI parameters in job runs. Only the job artifact is immutable; the other settings can be changed.
15.1.2 Job Run
A job run is an instantiation of a job. In each job run, you can override some of the job configuration. The most
common configurations to change are the environment variables and CLI arguments. You can use the same job as a
template and launch multiple simultaneous job runs to parallelize a large task. You can also sequence jobs and keep
the state by writing state information to Object Storage.
For example, you could experiment with how different model classes perform on the same training data by using the
ADSTuner to perform hyperparameter tuning on each model class. You could do this in parallel by having a different
job run for each class of models. For a given job run, you could pass an environment variable that identifies the model
class that you want to use. Each model can write its results to the Logging service or Object Storage. Then you can
run a final sequential job that uses the best model class, and trains the final model on the entire dataset.
ADS jobs API calls separate the job configurations into infrastructure and runtime. Infrastructure specifies the con-
figurations of the OCI resources and service for running the job. Runtime specifies the source code and the software
environments for running the job. These two types of infrastructure are supported: Data Science job and Data Flow.
This section shows how you can use the ADS jobs APIs to run OCI Data Science jobs. You can use similar APIs to
Run a OCI DataFlow Application.
Before creating a job, ensure that you have policies configured for Data Science resources, see About Data Science
Policies.
15.2.1 Infrastructure
The Data Science job infrastructure is defined by a DataScienceJob instance. When creating a job, you specify
the compartment ID, project ID, subnet ID, Compute shape, Block Storage size, log group ID, and log ID in the
DataScienceJob instance. For example:
infrastructure = (
DataScienceJob()
.with_compartment_id("<compartment_ocid>")
.with_project_id("<project_ocid>")
.with_subnet_id("<subnet_ocid>")
.with_shape_name("VM.Standard.E3.Flex")
.with_shape_config_details(memory_in_gbs=16, ocpus=1) # Applicable only for the flexible shapes
.with_block_storage_size(50)
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
)
If you are using these API calls in a Data Science Notebook Session, and you want to use the same infrastructure
configurations as the notebook session, you can initialize the DataScienceJob with only the logging configurations:
infrastructure = (
DataScienceJob()
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
)
In some cases, you may want to override the shape and block storage size. For example, if you are testing your code in
a CPU notebook session, but want to run the job in a GPU VM:
infrastructure = (
DataScienceJob()
.with_shape_name("VM.GPU2.1")
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
)
15.2.2 Logs
In the preceding examples, both the log OCID and corresponding log group OCID are specified in the DataScienceJob
instance. If your administrator configured the permission for you to search for logging resources, you can skip speci-
fying the log group OCID because ADS automatically retrieves it.
If you specify only the log group OCID and no log OCID, a new Log resource is automatically created within the log
group to store the logs, see ADS Logging.
15.2.3 Runtime
A job can have different types of runtime depending on the source code you want to run:
• ScriptRuntime allows you to run Python, Bash, and Java scripts from a single source file (.zip or .tar.gz)
or code directory, see Run a Script and Run a ZIP file or folder.
• PythonRuntime allows you to run Python code with additional options, including setting a working directory,
adding python paths, and copying output files, see Run a ZIP file or folder.
• NotebookRuntime allows you to run a JupyterLab Python notebook, see Run a Notebook.
• GitPythonRuntime allows you to run source code from a Git repository, see Run from Git.
All of these runtime options allow you to configure a Data Science Conda Environment for running your code. For
example, to define a python script as a job runtime with a TensorFlow conda environment you could use:
runtime = (
ScriptRuntime()
.with_source("oci://bucket_name@namespace/path/to/script.py")
.with_service_conda("tensorflow26_p37_cpu_v2")
)
You can store your source code in a local file path or location supported by fsspec, including OCI Object Storage.
You can also use a custom conda environment published to OCI Object Storage by passing the uri to the
with_custom_conda() method, for example:
runtime = (
ScriptRuntime()
.with_source("oci://bucket_name@namespace/path/to/script.py")
.with_custom_conda("oci://bucket@namespace/conda_pack/pack_name")
)
For more details on custom conda environment, see Publishing a Conda Environment to an Object Storage Bucket in
Your Tenancy.
You can also configure the environment variables, command line arguments, and free form tags for runtime:
runtime = (
ScriptRuntime()
.with_source("oci://bucket_name@namespace/path/to/script.py")
.with_service_conda("tensorflow26_p37_cpu_v2")
.with_environment_variable(ENV="value")
.with_argument("argument", key="value")
.with_freeform_tag(tag_name="tag_value")
)
With the preceding arguments, the script is started as python script.py argument --key value.
With runtime and infrastructure, you can define a job and give it a name:
job = (
Job(name="<job_display_name>")
.with_infrastructure(infrastructure)
.with_runtime(runtime)
)
If the job name is not specified, a name is generated automatically based on the name of the job artifact and a time
stamp.
Alternatively, a job can also be defined with keyword arguments:
job = Job(
name="<job_display_name>",
infrastructure=infrastructure,
runtime=runtime
)
You can call the create() method of a job instance to create a job. After the job is created, you can call the run()
method to create and start a job run. The run() method returns a DataScienceJobRun. You can monitor the job run
output by calling the watch() method of the DataScienceJobRun instance:
# Create a job
job.create()
# Run a job, a job run will be created and started
job_run = job.run()
# Stream the job run outputs
job_run.watch()
2021-10-28 17:23:50 - Job Run IN_PROGRESS, Job run artifact execution in progress.
2021-10-28 17:23:50 - <Log Message>
2021-10-28 17:23:50 - <Log Message>
2021-10-28 17:23:50 - ...
When you run job.run(), the job is run with the default configuration. You may want to override this default configuration with custom variables. You can specify a custom job run display name, override command line arguments, add additional environment variables, or add free form tags, as in this example:
job_run = job.run(
name="<my_job_run_name>",
args="new_arg --new_key new_val",
env_var={"new_env": "new_val"},
freeform_tags={"new_tag": "new_tag_val"}
)
A job instance can be serialized to a YAML file by calling to_yaml(), which returns the YAML as a string. You
can easily share the YAML with others, and reload the configurations by calling from_yaml(). The to_yaml() and
from_yaml() methods also take an optional uri argument for saving and loading the YAML file. This argument can
be any URI to the file location supported by fsspec, including Object Storage. For example:
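# Sketch with placeholder Object Storage URIs.
# Save the job configuration as YAML to Object Storage.
job.to_yaml(uri="oci://bucket_name@namespace/path/to/job.yaml")

# Recreate the job from the saved YAML.
job = Job.from_yaml(uri="oci://bucket_name@namespace/path/to/job.yaml")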
Here is an example of a YAML file representing the job defined in the preceding examples:
kind: job
spec:
  name: <job_display_name>
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  runtime:
    kind: runtime
    type: script
    spec:
      conda:
        slug: tensorflow26_p37_cpu_v2
        type: service
      scriptPathURI: oci://bucket_name@namespace/path/to/script.py
kind:
  required: true
  type: string
  allowed:
    - job
spec:
  required: true
  type: dict
  schema:
    id:
      required: false
    infrastructure:
      required: false
    runtime:
      required: false
    name:
      required: false
      type: string

kind:
  required: true
  type: "string"
  allowed:
    - "infrastructure"
type:
  required: true
  type: "string"
  allowed:
    - "dataScienceJob"
spec:
  required: true
  type: "dict"
  schema:
    blockStorageSize:
The ADS ContainerRuntime class allows you to run a container image using OCI data science jobs.
To use the ContainerRuntime, you need to first push the image to OCI container registry. See Creating a Repository
and Pushing Images Using the Docker CLI for more details.
15.3.1 Python
To configure the ContainerRuntime, you must specify the container image. Similar to other runtimes, you can add environment variables. You can optionally specify the entrypoint and cmd for running the container (see Understand how CMD and ENTRYPOINT interact).
from ads.jobs import Job, DataScienceJob, ContainerRuntime
job = (
    Job()
    .with_infrastructure(
        DataScienceJob()
        .with_log_group_id("<log_group_ocid>")
        .with_log_id("<log_ocid>")
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.Standard.E3.Flex")
        .with_shape_config_details(memory_in_gbs=16, ocpus=1)
        .with_block_storage_size(50)
    )
    .with_runtime(
        ContainerRuntime()
        .with_image("<region>.ocir.io/<your_tenancy>/<your_image>")
        .with_environment_variable(GREETINGS="Welcome to OCI Data Science")
        .with_entrypoint(["/bin/sh", "-c"])
        .with_cmd("sleep 5 && echo $GREETINGS")
    )
)
15.3.2 YAML
You could use the following YAML to create the same job:
kind: job
spec:
  name: container-job
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  runtime:
    kind: runtime
    type: container
    spec:
      image: <region>.ocir.io/<your_tenancy>/<your_image>
      entrypoint:
        - /bin/sh
        - -c
      cmd: sleep 5 && echo $GREETINGS
      env:
        - name: GREETINGS
          value: Welcome to OCI Data Science
ContainerRuntime Schema
kind:
  required: true
  type: string
  allowed:
    - runtime
type:
  required: true
  type: string
  allowed:
    - container
spec:
  type: dict
  required: true
  schema:
    image:
      required: true
      type: string
    entrypoint:
      required: false
      type:
        - string
        - list
    cmd:
      required: false
      type:
        - string
        - list
    env:
      nullable: true
      required: false
      type: list
      schema:
        type: dict
        schema:
          name:
            type: string
          value:
            type:
The ADS GitPythonRuntime class allows you to run source code from a Git repository as a Data Science job. The next example shows how to run a PyTorch Neural Network Example to train a third order polynomial predicting y=sin(x).
15.4.1 Python
To configure the GitPythonRuntime, you must specify the source code url and entrypoint path. Similar to
PythonRuntime, you can specify a service conda environment, environment variables, and CLI arguments. In this
example, the pytorch19_p37_gpu_v1 service conda environment is used. Assuming you are running this example
in a Data Science notebook session, only the log ID and log group ID need to be configured for the DataScienceJob
object, see Data Science Jobs for more details about configuring the infrastructure.
job = (
Job()
.with_infrastructure(
DataScienceJob()
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
# The following infrastructure configurations are optional
# if you are in an OCI data science notebook session.
# The configurations of the notebook session will be used as defaults
.with_compartment_id("<compartment_ocid>")
.with_project_id("<project_ocid>")
.with_subnet_id("<subnet_ocid>")
.with_shape_name("VM.Standard.E3.Flex")
.with_shape_config_details(memory_in_gbs=16, ocpus=1) # Applicable only for the flexible shapes
.with_block_storage_size(50)
)
.with_runtime(
GitPythonRuntime()
.with_environment_variable(GREETINGS="Welcome to OCI Data Science")
.with_service_conda("pytorch19_p37_gpu_v1")
.with_source("https://github.com/pytorch/tutorials.git")
.with_entrypoint("beginner_source/examples_nn/polynomial_nn.py")
.with_output(
output_dir="~/Code/tutorials/beginner_source/examples_nn",
output_uri="oci://BUCKET_NAME@BUCKET_NAMESPACE/PREFIX"
)
)
)
The default branch from the Git repository is used unless you specify a different branch or commit in the .with_source() method.
For a public repository, we recommend the “http://” or “https://” URL. Authentication may be required for the SSH
URL even if the repository is public.
To use a private repository, you must first save an SSH key to an OCI Vault as a secret, and provide the secret_ocid
to the with_source() method, see Managing Secret with Vault. For example, you could use GitHub Deploy Key.
The entry point specifies how the source code is invoked. The .with_entrypoint() method has the following arguments:
• func: Optional. The function in the script specified by path to call. If you don’t specify it, then the script
specified by path is run as a Python script in a subprocess.
• path: Required. The relative path for the script, module, or file to start the job.
With the GitPythonRuntime class, you can save the output files from the job run to Object Storage using
with_output(). By default, the source code is cloned to the ~/Code directory. In the example, the files in the
example_nn directory are copied to the Object Storage specified by the output_uri parameter. The output_uri
parameter should have this format:
oci://BUCKET_NAME@BUCKET_NAMESPACE/PREFIX
The GitPythonRuntime also supports these additional configurations:
• The .with_python_path() method allows you to add additional Python paths to the runtime. By default, the
code directory checked out from Git is added to sys.path. Additional Python paths are appended before the
code directory is appended.
• The .with_argument() method allows you to pass arguments to invoke the script or function. For running
a script, the arguments are passed in as CLI arguments. For running a function, the list and dict JSON
serializable objects are supported and are passed into the function.
The GitPythonRuntime updates metadata in the free form tags of the job run after the job run finishes. The
following tags are added automatically:
• commit: The Git commit ID.
• method: The entry function or method.
• module: The entry script or module.
• outputs: The prefix of the output files in Object Storage.
• repo: The URL of the Git repository.
The new values overwrite any existing tags. If you want to skip the metadata update, set skip_metadata_update to
True when initializing the runtime:
runtime = GitPythonRuntime(skip_metadata_update=True)
15.4.2 YAML
You could create the preceding example job with the following YAML file:
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  name: git_example
  runtime:
    kind: runtime
    type: gitPython
    spec:
      entrypoint: beginner_source/examples_nn/polynomial_nn.py
      outputDir: ~/Code/tutorials/beginner_source/examples_nn
      outputUri: oci://BUCKET_NAME@BUCKET_NAMESPACE/PREFIX
      url: https://github.com/pytorch/tutorials.git
      conda:
        slug: pytorch19_p37_gpu_v1
        type: service
      env:
        - name: GREETINGS
          value: Welcome to OCI Data Science
In some cases, you may want to run an existing JupyterLab notebook as a job. You can do this using the NotebookRuntime() object.
The next example shows you how to run the TensorFlow 2 quick start for beginners notebook from the internet and save the results to OCI Object Storage. The notebook path points to the raw file link from GitHub. To run the following example, ensure that you have internet access to retrieve the notebook:
15.5.1 Python
job = (
    Job()
    .with_infrastructure(
        DataScienceJob()
        .with_log_group_id("<log_group_ocid>")
        .with_log_id("<log_ocid>")
        # The following infrastructure configurations are optional
        # if you are in an OCI data science notebook session.
        # The configurations of the notebook session will be used as defaults
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.Standard.E3.Flex")
        .with_shape_config_details(memory_in_gbs=16, ocpus=1) # Applicable only for the flexible shapes
        .with_block_storage_size(50)
    )
    .with_runtime(
        NotebookRuntime()
        .with_notebook(
            path="https://raw.githubusercontent.com/tensorflow/docs/master/site/en/tutorials/customization/basics.ipynb",
            encoding='utf-8'
        )
        .with_output("oci://bucket_name@namespace/path/to/dir")
    )
)
job.create()
run = job.run().watch()
After the notebook finishes running, the notebook with results are saved to oci://bucket_name@namespace/path/
to/dir. You can download the output by calling the download() method.
run.download("/path/to/local/dir")
The NotebookRuntime also allows you to use exclusion tags, which lets you exclude cells from a job run. For example,
you could use these tags to do exploratory data analysis, and then train and evaluate your model in a notebook. Then
you could use that same notebook to only build future models that are trained on a different dataset. So the job run only
has to execute the cells that are related to training the model, and not the exploratory data analysis or model evaluation.
You tag the cells in the notebook, and then specify the tags using the .with_exclude_tag() method. Cells with any
matching tags are excluded from the job run. For example, if you tagged cells with ignore and remove, you can pass
in a list of the two tags to the method and those cells are excluded from the code that is executed as part of the job run.
To tag cells in a notebook, see Adding tags using notebook interfaces.
job.with_runtime(
NotebookRuntime()
.with_notebook("path/to/notebook")
.with_exclude_tag(["ignore", "remove"])
)
15.5.2 YAML
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      jobInfrastructureType: STANDALONE
      jobType: DEFAULT
      logGroupId: <log_group_id>
      logId: <log.id>
  runtime:
    kind: runtime
    type: notebook
    spec:
      notebookPathURI: /path/to/notebook
      conda:
        slug: tensorflow26_p37_cpu_v1
        type: service
NotebookRuntime Schema
kind:
  required: true
  type: string
  allowed:
    - runtime
type:
  required: true
  type: string
  allowed:
    - notebook
spec:
  required: true
  type: dict
  schema:
    excludeTags:
      required: false
      type: list
    notebookPathURI:
      required: false
      type: string
    notebookEncoding:
      required: false
      type: string
    outputUri:
      required: false
      type: string
    args:
      nullable: true
      required: false
      type: list
      schema:
        type: string
    conda:
      nullable: false
      required: false
      type: dict
      schema:
        slug:
          required: true
          type: string
        type:
          required: true
          type: string
          allowed:
            - service
    env:
      nullable: true
      required: false
      type: list
      schema:
        type: dict
This example shows you how to create a job running “Hello World” Python scripts. Although Python scripts are used
here, you could also run Bash or Shell scripts. The Logging service log and log group are defined in the infrastructure.
The output of the script appears in the logs.
15.6.1 Python
Suppose you would like to run the following “Hello World” python script named job_script.py.
print("Hello World")
Next, you specify the desired infrastructure to run the job. If you are in a notebook session, ADS can automatically
fetch the infrastructure configurations and use them for the job. If you aren’t in a notebook session or you want to
customize the infrastructure, you can specify them using the methods from the DataScienceJob class:
from ads.jobs import Job, DataScienceJob, ScriptRuntime

job = Job()
job.with_infrastructure(
DataScienceJob()
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
# The following infrastructure configurations are optional
# if you are in an OCI data science notebook session.
# The configurations of the notebook session will be used as defaults
.with_compartment_id("<compartment_ocid>")
.with_project_id("<project_ocid>")
.with_subnet_id("<subnet_ocid>")
.with_shape_name("VM.Standard.E3.Flex")
.with_shape_config_details(memory_in_gbs=16, ocpus=1) # Applicable only for the flexible shapes
.with_block_storage_size(50)
)
In this example, it is a Python script so the ScriptRuntime() class is used to define the name of the script using the
.with_source() method:
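# Sketch: the conda slug is a placeholder; any service or published conda environment works here.
job.with_runtime(
    ScriptRuntime()
    .with_source("job_script.py")
    .with_service_conda("<service_conda_slug>")
)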
Finally, you create and run the job, which gives you access to the job_run.id:
job.create()
job_run = job.run()
Additionally, you can acquire the job run using the OCID:
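# Sketch: load an existing job run by its OCID (placeholder value).
from ads.jobs import DataScienceJobRun

job_run = DataScienceJobRun.from_ocid("<job_run_ocid>")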
The .watch() method is useful to monitor the progress of the job run:
job_run.watch()
After the job has been created and runs successfully, you can find the output of the script in the logs if you configured
logging.
15.6.2 YAML
You could also initialize a job directly from a YAML string. For example, to create a job identical to the preceding
example, you could simply run the following:
job = Job.from_string(f"""
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  name: <resource_name>
  runtime:
    kind: runtime
    type: python
    spec:
      scriptPathURI: job_script.py
""")
If the Python script that you want to run as a job requires CLI arguments, use the .with_argument() method to pass
the arguments to the job.
15.6.3.1 Python
Suppose you want to run the following python script named job_script_argument.py:
import sys
print("Hello " + str(sys.argv[1]) + " and " + str(sys.argv[2]))
job = Job()
job.with_infrastructure(
DataScienceJob()
.with_log_id("<log_id>")
.with_log_group_id("<log_group_id>")
)
# The CLI argument can be passed in using `with_argument` when defining the runtime
job.with_runtime(
ScriptRuntime()
.with_source("job_script_argument.py")
.with_argument("<first_argument>", "<second_argument>")
)
job.create()
job_run = job.run()
After the job run is created and run, you can use the .watch() method to monitor its progress:
job_run.watch()
15.6.3.2 YAML
You could create the preceding example job with the following YAML file:
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  runtime:
    kind: runtime
    type: python
    spec:
      args:
        - <first_argument>
        - <second_argument>
      scriptPathURI: job_script_argument.py
Similarly, if the script you want to run requires environment variables, you pass them in using the .with_environment_variable() method. The key-value pairs are accessed in the Python script using the os.environ dictionary.
15.6.4.1 Python
Suppose you want to run the following python script named job_script_env.py:
import os
import sys

print("Hello " + os.environ["KEY1"] + " and " + os.environ["KEY2"])
job = Job()
job.with_infrastructure(
DataScienceJob()
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
# The following infrastructure configurations are optional
# if you are in an OCI data science notebook session.
# The configurations of the notebook session will be used as defaults
.with_compartment_id("<compartment_ocid>")
.with_project_id("<project_ocid>")
.with_subnet_id("<subnet_ocid>")
.with_shape_name("VM.Standard.E3.Flex")
.with_shape_config_details(memory_in_gbs=16, ocpus=1)
.with_block_storage_size(50)
)
job.with_runtime(
    ScriptRuntime()
    .with_source("job_script_env.py")
    .with_environment_variable(KEY1="<first_value>", KEY2="<second_value>")
)

job.create()
job_run = job.run()
You can watch the progress of the job run using the .watch() method:
job_run.watch()
15.6.4.2 YAML
You could create the preceding example job with the following YAML file:
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  runtime:
    kind: runtime
    type: python
    spec:
      env:
        - name: KEY1
          value: <first_value>
        - name: KEY2
          value: <second_value>
      scriptPathURI: job_script_env.py
kind:
  required: true
  type: string
  allowed:
    - runtime
type:
  required: true
  type: string
  allowed:
15.7.1 ScriptRuntime
The ScriptRuntime class is designed for you to define job artifacts and configurations supported by OCI Data Science
jobs natively. It can be used with any script type that is supported by OCI Data Science jobs, including a ZIP or compressed tar file or folder. See Preparing Job Artifacts for more details. In the job run, the working directory is the user's home directory, for example /home/datascience.
15.7.1.1 Python
If you are in a notebook session, ADS can automatically fetch the infrastructure configurations, and use them in the
job. If you aren’t in a notebook session or you want to customize the infrastructure, you can specify them using the
methods in the DataScienceJob class.
With the ScriptRuntime, you can pass in a path to a ZIP file or directory. For a ZIP file, the path can be any URI
supported by fsspec, including OCI Object Storage.
You must specify the entrypoint, which is the relative path from the ZIP file or directory to the script starting your
program. Note that the entrypoint contains the name of the directory, since the directory itself is also zipped as the
job artifact.
job = (
Job()
.with_infrastructure(
DataScienceJob()
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
# The following infrastructure configurations are optional
# if you are in an OCI data science notebook session.
# The configurations of the notebook session will be used as defaults
.with_compartment_id("<compartment_ocid>")
.with_project_id("<project_ocid>")
.with_subnet_id("<subnet_ocid>")
.with_shape_name("VM.Standard.E3.Flex")
.with_shape_config_details(memory_in_gbs=16, ocpus=1)
.with_block_storage_size(50)
)
.with_runtime(
ScriptRuntime()
.with_source("path/to/zip_or_dir", entrypoint="zip_or_dir/main.py")
.with_service_conda("pytorch19_p37_cpu_v1")
)
)
15.7.1.2 YAML
You could use the following YAML example to create the same job with ScriptRuntime:
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  runtime:
    kind: runtime
    type: script
    spec:
      conda:
        slug: pytorch19_p37_cpu_v1
        type: service
      entrypoint: zip_or_dir/main.py
      scriptPathURI: path/to/zip_or_dir
15.7.2 PythonRuntime
The PythonRuntime class allows you to run Python code with ADS enhanced features like configuring the working
directory and Python path. It also allows you to copy the output files to OCI Object Storage. This is especially useful
for Python code involving multiple files and packages in the job artifact.
The PythonRuntime uses an ADS generated driver script as the entry point for the job run. It performs additional
operations before and after invoking your code. You can examine the driver script by downloading the job artifact from
the OCI Console.
15.7.2.1 Python
job = (
Job()
.with_infrastructure(
DataScienceJob()
.with_log_group_id("<log_group_ocid>")
.with_log_id("<log_ocid>")
# The following infrastructure configurations are optional
# if you are in an OCI data science notebook session.
# The configurations of the notebook session will be used as defaults
.with_compartment_id("<compartment_ocid>")
.with_project_id("<project_ocid>")
.with_subnet_id("<subnet_ocid>")
.with_shape_name("VM.Standard.E3.Flex")
.with_shape_config_details(memory_in_gbs=16, ocpus=1) # Applicable only for the flexible shapes
.with_block_storage_size(50)
)
.with_runtime(
PythonRuntime()
.with_service_conda("pytorch19_p37_cpu_v1")
# The job artifact directory is named "zip_or_dir"
.with_source("local/path/to/zip_or_dir", entrypoint="zip_or_dir/my_package/entry.py")
# Change the working directory to be inside the job artifact directory
# Working directory a relative path from the parent of the job artifact directory
# Working directory is also added to Python paths
.with_working_dir("zip_or_dir")
# Add an additional Python path
# The "my_python_packages" folder is under "zip_or_dir" (working directory)
.with_python_path("my_python_packages")
# Files in "output" directory will be copied to OCI object storage once the job␣
˓→finishes
15.7.2.2 YAML
You could use the following YAML to create the same job with PythonRuntime:
kind: job
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      compartmentId: <compartment_ocid>
      projectId: <project_ocid>
      subnetId: <subnet_ocid>
      shapeName: VM.Standard.E3.Flex
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      blockStorageSize: 50
  runtime:
    kind: runtime
    type: python
    spec:
      conda:
        slug: pytorch19_p37_cpu_v1
        type: service
      entrypoint: zip_or_dir/my_package/entry.py
      scriptPathURI: local/path/to/zip_or_dir
      workingDir: zip_or_dir
      pythonPath:
        - my_python_packages
      outputDir: output
      outputUri: oci://bucket_name@namespace/path/to/dir
kind:
  required: true
  type: string
  allowed:
    - runtime
type:
  required: true
  type: string
  allowed:
    - script
spec:
  required: true
  type: dict
  schema:
    args:
      nullable: true
      required: false
      type: list
      schema:
        type: string
    conda:
      nullable: false
      required: false
      type: dict
      schema:
        slug:
          required: true
          type: string
        type:
15.8.1 Prerequisite
ads opctl cancel -j <job ocid> --work-dir <Object storage working directory specified when the cluster was created>
15.9.1 watch
You can tail the logs generated by OCI Data Science Job Runs or OCI DataFlow Application Runs using the
watch subcommand.
ads opctl watch <job run ocid or dataflow application run ocid>
This command requires an API key or resource principal setup. The logs are streamed from the Logging service. If your
job is not attached to the Logging service, this option shows only the lifecycle state.
SIXTEEN
STORE CREDENTIALS
Services such as OCI Database and Streaming require users to provide credentials. These credentials must be safely
accessed at runtime. OCI Vault provides a mechanism for safe storage and access of secrets. SecretKeeper uses Vault
as a backend to store and retrieve the credentials. The data structure of the credentials varies from service to service.
There is a SecretKeeper specific to each data structure.
These classes are provided:
• ADBSecretKeeper: Stores credentials for the Oracle Autonomous Database, with or without the wallet file.
• AuthTokenSecretKeeper: Stores an Auth Token or Access Token string. This could be an Auth Token used to connect to Streaming, GitHub, or other systems that use Auth Token or Access Token strings.
• BDSSecretKeeper: Stores credentials for Oracle Big Data Service, with or without the keytab and kerb5 configuration files.
• MySQLDBSecretKeeper: Stores credentials for the MySQL database. This class will work with many databases
that authenticate with a username and password only.
• OracleDBSecretKeeper: Stores credentials for the Oracle Database.
import ads
from ads.secrets.auth_token import AuthTokenSecretKeeper

ocid_vault = "ocid1.vault..<unique_ID>"
ocid_master_key = "ocid1.key..<unique_ID>"
ocid_mycompartment = "ocid1.compartment..<unique_ID>"

authtoken2 = AuthTokenSecretKeeper(
    vault_id=ocid_vault,
    key_id=ocid_master_key,
    compartment_id=ocid_mycompartment,
    auth_token="<your_auth_token>"
).save(
    "my_xyz_auth_token2",
    "This is my key for git repo xyz",
    freeform_tags={"gitrepo": "xyz"}
)
print(authtoken2.secret_id)

'ocid1.vaultsecret..<unique_ID>'
import ads
from ads.secrets.auth_token import AuthTokenSecretKeeper

with AuthTokenSecretKeeper.load_secret(
    source="ocid1.vaultsecret..<unique_ID>"
) as authtoken:
    import os
    print(f"Credentials inside `authtoken` object: {authtoken}")
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.adb import ADBSecretKeeper

connection_parameters = {
    "user_name": "admin",
    "password": "<your_password>",
    "service_name": "service_high",
    "wallet_location": "/home/datascience/Wallet_--------.zip"
}

ocid_vault = "ocid1.vault..<unique_ID>"
ocid_master_key = "ocid1.key..<unique_ID>"
ocid_mycompartment = "ocid1.compartment..<unique_ID>"

adw_keeper = ADBSecretKeeper(vault_id=ocid_vault,
                             key_id=ocid_master_key,
                             compartment_id=ocid_mycompartment,
                             **connection_parameters)

adw_keeper.save("adw_employee_att2",
                "My DB credentials",
                freeform_tags={"schema": "emp"},
                save_wallet=True)
print(adw_keeper.secret_id)

'ocid1.vaultsecret..<unique_ID>'
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.adb import ADBSecretKeeper
print(df2.head(2))
JOBFUNCTION ATTRITION
0 Product Management No
1 Software Developer No
import ads
import fsspec
import os

from ads.secrets.big_data_service import BDSSecretKeeper

ads.set_auth('resource_principal')

principal = "<your_principal>"
hdfs_host = "<your_hdfs_host>"
hive_host = "<your_hive_host>"
hdfs_port = <your_hdfs_port>
hive_port = <your_hive_port>
keytab_path = "<your_keytab_path>"
kerb5_path = "<your_kerb5_path>"
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"

secret = BDSSecretKeeper(
    vault_id=vault_id,
    key_id=key_id,
    principal=principal,
    hdfs_host=hdfs_host,
    hive_host=hive_host,
    hdfs_port=hdfs_port,
    hive_port=hive_port,
    keytab_path=keytab_path,
    kerb5_path=kerb5_path
)

saved_secret = secret.save(name="your_bds_config_secret_name",
                           description="your bds credentials",
                           freeform_tags={"schema": "emp"},
                           defined_tags={},
                           save_files=True)
16.1.4 MySQL
import ads
from ads.secrets.mysqldb import MySQLDBSecretKeeper
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"
mysqldb_keeper = MySQLDBSecretKeeper(vault_id=vault_id,
key_id=key_id,
**connection_parameters)
'ocid1.vaultsecret..<unique_ID>'
import ads
from ads.secrets.mysqldb import MySQLDBSecretKeeper
ads.set_auth('resource_principal') # If using resource principal authentication
print(df2.head(2))
JOBFUNCTION ATTRITION
0 Product Management No
1 Software Developer No
import ads
from ads.secrets.oracledb import OracleDBSecretKeeper
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"
oracledb_keeper = OracleDBSecretKeeper(vault_id=vault_id,
key_id=key_id,
**connection_parameters)
'ocid1.vaultsecret..<unique_ID>'
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.oracledb import OracleDBSecretKeeper
print(df2.head(2))
JOBFUNCTION ATTRITION
0 Product Management No
1 Software Developer No
The AuthTokenSecretKeeper helps you to save the Auth Token or Access Token string to the OCI Vault service.
See API Documentation for more details
16.2.1.1 AuthTokenSecretKeeper
16.2.1.1.1 Save
The AuthTokenSecretKeeper.save API serializes and stores the credentials to Vault. It takes following parameters
-
• name (str): Name of the secret when saved in the vault.
• description (str): Description of the secret when saved in the vault.
• freeform_tags (dict, optional): Freeform tags to use when saving the secret in the OCI Console.
• defined_tags (dict, optional.): Save the tags under predefined tags in the OCI Console.
The secret has following information:
• auth_token
16.2.1.2 Examples
import ads
from ads.secrets.auth_token import AuthTokenSecretKeeper

ocid_vault = "ocid1.vault..<unique_ID>"
ocid_master_key = "ocid1.key..<unique_ID>"
ocid_mycompartment = "ocid1.compartment..<unique_ID>"

authtoken2 = AuthTokenSecretKeeper(
    vault_id=ocid_vault,
    key_id=ocid_master_key,
    compartment_id=ocid_mycompartment,
    auth_token="<your_auth_token>"
).save(
    "my_xyz_auth_token2",
    "This is my key for git repo xyz",
    freeform_tags={"gitrepo":"xyz"}
)
print(authtoken2.secret_id)
You can save the vault details in a file for later reference, or use them within your code, using the export_vault_details
API. The API currently enables you to export the information as a YAML file or a JSON file.
authtoken2.export_vault_details("my_db_vault_info.json", format="json")
authtoken2.export_vault_details("my_db_vault_info.yaml", format="yaml")
16.2.2.1 Load
The AuthTokenSecretKeeper.load_secret API deserializes and loads the credentials from Vault. You can call it
directly with the secret OCID, or use it as a context manager in a with statement. The context manager approach is
preferred because the secrets are only available within the code block, which reduces the risk that the variable will
be leaked.
authtoken = AuthTokenSecretKeeper.load_secret('ocid1.vaultsecret..<unique_ID>')
authtokendict = authtoken.to_dict()
print(authtokendict['auth_token'])
16.2.2.2 Examples
import ads
from ads.secrets.auth_token import AuthTokenSecretKeeper
with AuthTokenSecretKeeper.load_secret(
    source="ocid1.vaultsecret..<unique_ID>"
) as authtoken:
    import os
    print(f"Credentials inside `authtoken` object: {authtoken}")
To expose credentials through environment variables, set export_env=True. The following key is exported:
• auth_token
import ads
from ads.secrets.auth_token import AuthTokenSecretKeeper
import os

with AuthTokenSecretKeeper.load_secret(
    source="ocid1.vaultsecret..<unique_ID>",
    export_env=True
):
    print(os.environ.get("auth_token")) # Prints the auth token
You can avoid name collisions by setting the prefix string using export_prefix along with export_env=True. For
example, if you set the prefix to kafka, the exported key is:
• kafka.auth_token
import ads
from ads.secrets.auth_token import AuthTokenSecretKeeper
import os

with AuthTokenSecretKeeper.load_secret(
    source="ocid1.vaultsecret..<unique_ID>",
    export_env=True,
    export_prefix="kafka"
):
    print(os.environ.get("kafka.auth_token")) # Prints the auth token
16.3.1.1 ADBSecretKeeper
16.3.1.1.1 Save
The ADBSecretKeeper.save API serializes and stores the credentials to Vault using the following parameters:
• defined_tags (dict, optional): Default None. Save the tags under predefined tags in the OCI Console.
• description (str): Description of the secret when saved in Vault.
• freeform_tags (dict, optional): Default None. Free form tags to use for saving the secret in the OCI Console.
• name (str): Name of the secret when saved in Vault.
• save_wallet (bool, optional): Default False. If set to True, then the wallet file is serialized.
When stored without the wallet information, the secret content has following information:
• password
• service_name
• user_name
To store wallet file content, set save_wallet to True. The wallet content is stored by extracting all the files from the
wallet ZIP file, and then each file is stored in the vault as a secret. The list of OCIDs corresponding to each file along
with username, password, and service name is stored in a separate secret. The secret corresponding to each file content
has following information:
• filename
• content of the file
A meta secret is created to save the username, password, service name, and the secret ids of the files within the wallet
file. It has following attributes:
• user_name
• password
• wallet_file_name
• wallet_secret_ids
The wallet file is reconstructed when ADBSecretKeeper.load_secret is called using the OCID of the meta secret.
16.3.1.2 Examples
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.adb import ADBSecretKeeper

connection_parameters = {
    "user_name": "admin",
    "password": "<your_password>",
    "service_name": "service_high",
    "wallet_location": "/home/datascience/Wallet_--------.zip"
}

ocid_vault = "ocid1.vault..<unique_ID>"
ocid_master_key = "ocid1.key..<unique_ID>"
ocid_mycompartment = "ocid1.compartment..<unique_ID>"

adw_keeper = ADBSecretKeeper(vault_id=ocid_vault,
                             key_id=ocid_master_key,
                             compartment_id=ocid_mycompartment,
                             **connection_parameters)

adw_keeper.save("adw_employee_att2",
                "My DB credentials",
                freeform_tags={"schema": "emp"})
print(adw_keeper.secret_id)

'ocid1.vaultsecret..<unique_ID>'
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.adb import ADBSecretKeeper

connection_parameters = {
    "user_name": "admin",
    "password": "<your_password>",
    "service_name": "service_high",
    "wallet_location": "/home/datascience/Wallet_--------.zip"
}

ocid_vault = "ocid1.vault..<unique_ID>"
ocid_master_key = "ocid1.key..<unique_ID>"
ocid_mycompartment = "ocid1.compartment..<unique_ID>"

adw_keeper = ADBSecretKeeper(vault_id=ocid_vault,
                             key_id=ocid_master_key,
                             compartment_id=ocid_mycompartment,
                             **connection_parameters)

adw_keeper.save("adw_employee_att2",
                "My DB credentials",
                freeform_tags={"schema":"emp"},
                save_wallet=True)
print(adw_keeper.secret_id)

'ocid1.vaultsecret..<unique_ID>'
You can save the vault details in a file for later reference, or use them within your code, using the export_vault_details
API. The API currently enables you to export the information as a YAML file or a JSON file.
adw_keeper.export_vault_details("my_db_vault_info.json", format="json")
adw_keeper.export_vault_details("my_db_vault_info.yaml", format="yaml")
16.3.2.1 Load
The ADBSecretKeeper.load_secret API deserializes and loads the credentials from Vault. You can call it directly
with the secret OCID, or use it as a context manager in a with statement. The context manager approach is preferred
because the secrets are only available within the code block, which reduces the risk that the variable will be leaked.
adwsecretobj = ADBSecretKeeper.load_secret('ocid1.vaultsecret..<unique_ID>')
adwsecret = adwsecretobj.to_dict()
print(adwsecret['user_name'])
16.3.2.2 Examples
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.adb import ADBSecretKeeper

with ADBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>"
) as adw_creds2:
    print(adw_creds2["user_name"]) # Prints the user name
To expose credentials as environment variables, set export_env=True. The following keys are exported:
• user_name
• password
• service_name
import os
import ads

with ADBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>",
    export_env=True
):
    print(os.environ.get("user_name")) # Prints the user name
You can avoid name collisions by setting a prefix string using export_prefix along with export_env=True. For
example, if you set the prefix to myprocess, the exported keys are:
• myprocess.user_name
• myprocess.password
• myprocess.service_name
import os
import ads

with ADBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>",
    export_env=True,
    export_prefix="myprocess"
):
    print(os.environ.get("myprocess.user_name")) # Prints the user name
You can set the wallet file location when the wallet file is not part of the stored vault secret. To specify a local wallet
ZIP file, set the path to the ZIP file with wallet_location:
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.adb import ADBSecretKeeper

with ADBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>",
    wallet_location="path/to/my/local/wallet.zip"
) as adw_creds2:
    print(adw_creds2["wallet_location"]) # Prints `path/to/my/local/wallet.zip`
16.4.1.1 BDSSecretKeeper
You can also save the connection parameters, as well as the files needed to configure Kerberos authentication, into
the vault. This allows you to reuse them in different notebook sessions, machines, and jobs.
The BDSSecretKeeper constructor requires the following parameters:
• compartment_id (str): OCID of the compartment where the vault is located. This defaults to the compartment
of the notebook session when used in a Data Science notebook session.
• hdfs_host (str): The HDFS hostname from the bds cluster.
• hdfs_port (str): The HDFS port from the bds cluster.
• hive_host (str): The Hive hostname from the bds cluster.
• hive_port (str): The Hive port from the bds cluster.
• kerb5_path (str): The krb5.conf file path.
• key_id (str): OCID of the master key used for encrypting the secret.
• keytab_path (str): The path to the keytab file.
• principal (str): The unique identity to which Kerberos can assign tickets.
• vault_id: (str): The OCID of the vault.
16.4.1.1.1 Save
The BDSSecretKeeper.save API serializes and stores the credentials to Vault using the following parameters:
• defined_tags (dict, optional): Default None. Save the tags under predefined tags in the OCI Console.
• description (str) – Description of the secret when saved in Vault.
• freeform_tags (dict, optional): Default None. Free form tags to use for saving the secret in the OCI Console.
• name (str): Name of the secret when saved in Vault.
• save_files (bool, optional): Default True. If set to True, then the keytab and kerb5 config files are serialized
and saved.
16.4.1.2 Examples
import ads
import fsspec
import os

from ads.secrets.big_data_service import BDSSecretKeeper

ads.set_auth('resource_principal')

principal = "<your_principal>"
hdfs_host = "<your_hdfs_host>"
hive_host = "<your_hive_host>"
hdfs_port = <your_hdfs_port>
hive_port = <your_hive_port>
keytab_path = "<your_keytab_path>"
kerb5_path = "<your_kerb5_path>"
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"

secret = BDSSecretKeeper(
    vault_id=vault_id,
    key_id=key_id,
    principal=principal,
    hdfs_host=hdfs_host,
    hive_host=hive_host,
    hdfs_port=hdfs_port,
    hive_port=hive_port,
    keytab_path=keytab_path,
    kerb5_path=kerb5_path
)

saved_secret = secret.save(name="your_bds_config_secret_name",
                           description="your bds credentials",
                           freeform_tags={"schema": "emp"},
                           defined_tags={},
                           save_files=True)
import ads
import fsspec
import os

from ads.secrets.big_data_service import BDSSecretKeeper

ads.set_auth('resource_principal')

principal = "<your_principal>"
hdfs_host = "<your_hdfs_host>"
hive_host = "<your_hive_host>"
hdfs_port = <your_hdfs_port>
hive_port = <your_hive_port>
keytab_path = "<your_keytab_path>"
kerb5_path = "<your_kerb5_path>"
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"

bds_keeper = BDSSecretKeeper(
    vault_id=vault_id,
    key_id=key_id,
    principal=principal,
    hdfs_host=hdfs_host,
    hive_host=hive_host,
    hdfs_port=hdfs_port,
    hive_port=hive_port,
    keytab_path=keytab_path,
    kerb5_path=kerb5_path
)

saved_secret = bds_keeper.save(name="your_bds_config_secret_name",
                               description="your bds credentials",
                               freeform_tags={"schema": "emp"},
                               defined_tags={},
                               save_files=False)

print(saved_secret.secret_id)

'ocid1.vaultsecret..<unique_ID>'
16.4.2.1 Load
The BDSSecretKeeper.load_secret API deserializes and loads the credentials from Vault. You can call it directly
with the secret OCID, or use it as a context manager in a with statement. The context manager approach is preferred
because the secrets are only available within the code block, which reduces the risk that the variable will be leaked.
bdssecretobj = BDSSecretKeeper.load_secret('ocid1.vaultsecret..<unique_ID>')
bdssecret = bdssecretobj.to_dict()
print(bdssecret['hdfs_host'])
• source: Either the file that was exported from export_vault_details or the OCID of the secret
If the keytab and kerb5 configuration files were saved in the vault, then a keytab and kerb5 configuration file of the
same name is created by .load_secret(). By default, the keytab file is created in the keytab_path specified in
the secret. To update the location, set the directory path with key_dir. However, the kerb5 configuration file is always
saved in the ~/.bds_config/krb5.conf path.
Note that keytab and kerb5 configuration files are saved only when the content is saved into the vault.
After you load and save the configuration parameters files, you can call the krbcontext context manager to create a
Kerberos ticket.
16.4.2.2 Examples
With a Kerberos ticket in place, you can use the credentials stored in the secret to connect to Hive and query your data.
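The following is a minimal sketch of loading the saved secret, obtaining a Kerberos ticket with the ads.bds.auth.krbcontext helper, and creating the hive_cursor used below. The impyla client (impala.dbapi.connect) is assumed here; adjust the connection call to the Hive client available in your environment.

import ads
from ads.bds.auth import krbcontext
from ads.secrets.big_data_service import BDSSecretKeeper
from impala.dbapi import connect  # assumption: impyla is installed

ads.set_auth('resource_principal')

with BDSSecretKeeper.load_secret(saved_secret.secret_id) as cred:
    # Create a Kerberos ticket from the principal and keytab stored in the secret
    with krbcontext(principal=cred["principal"], keytab_path=cred["keytab_path"]):
        hive_cursor = connect(
            host=cred["hive_host"],
            port=int(cred["hive_port"]),
            auth_mechanism="GSSAPI",
            kerberos_service_name="hive",
        ).cursor()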
hive_cursor.execute("""
select *
from your_db.your_table
limit 10
""")
import pandas as pd
pd.DataFrame(hive_cursor.fetchall(), columns=[col[0] for col in hive_cursor.description])
bdssecretobj = BDSSecretKeeper.load_secret(saved_secret.secret_id)
bdssecret = bdssecretobj.to_dict()
print(bdssecret)
16.5 MySQL
16.5.1.1 MySQLDBSecretKeeper
16.5.1.1.1 Save
The MySQLDBSecretKeeper.save API serializes and stores the credentials to the vault using the following parame-
ters:
• defined_tags (dict, optional): Save the tags under predefined tags in the OCI Console.
• description (str): Description of the secret when saved in the vault.
• freeform_tags (dict, optional): Freeform tags to be used for saving the secret in the OCI Console.
• name (str): Name of the secret when saved in the vault.
The secret has the following information:
• database
• host
• password
• port
• user_name
16.5.1.2 Examples
import ads
from ads.secrets.mysqldb import MySQLDBSecretKeeper
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"
mysqldb_keeper = MySQLDBSecretKeeper(vault_id=vault_id,
key_id=key_id,
**connection_parameters)
'ocid1.vaultsecret..<unique_ID>'
You can save the vault details in a file for later reference, or use them within your code, using the export_vault_details
API. The API currently enables you to export the information as a YAML file or a JSON file.
mysqldb_keeper.export_vault_details("my_db_vault_info.json", format="json")
mysqldb_keeper.export_vault_details("my_db_vault_info.yaml", format="yaml")
16.5.2.1 Load
The MySQLDBSecretKeeper.load_secret() API deserializes and loads the credentials from the vault. You could
use this API in one of the following ways:
mysqldb_secretobj = MySQLDBSecretKeeper.load_secret('ocid1.vaultsecret..<unique_ID>')
mysqldb_secret = mysqldb_secretobj.to_dict()
print(mysqldb_secret['user_name'])
16.5.2.2 Examples
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.mysqldb import MySQLDBSecretKeeper

with MySQLDBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>"
) as mysqldb_creds:
    print(mysqldb_creds["user_name"]) # Prints the user name
To expose credentials as an environment variable, set export_env=True. The following keys are exported:
import os
import ads

with MySQLDBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>",
    export_env=True
):
    print(os.environ.get("user_name")) # Prints the user name
You can avoid name collisions by setting a prefix string using export_prefix along with export_env=True. For
example, if you set prefix as myprocess, then the exported keys are:
import os
import ads
16.6.1.1 OracleDBSecretKeeper
16.6.1.2 Save
The OracleDBSecretKeeper.save() API serializes and stores the credentials to Vault using the following parame-
ters:
• defined_tags (dict, optional): Save the tags under predefined tags in the OCI Console.
• description (str): Description of the secret when saved in the vault.
• freeform_tags (dict, optional): Freeform tags to use when saving the secret in the OCI Console.
• name (str): Name of the secret when saved in the vault.
The secret has the following information:
• dsn
• host
• password
• port
• service_name
• sid
• user_name
16.6.1.3 Examples
import ads
from ads.secrets.oracledb import OracleDBSecretKeeper
vault_id = "ocid1.vault..<unique_ID>"
key_id = "ocid1.key..<unique_ID>"
oracledb_keeper = OracleDBSecretKeeper(vault_id=vault_id,
key_id=key_id,
**connection_parameters)
'ocid1.vaultsecret..<unique_ID>'
You can save the vault details in a file for later reference, or use them within your code, using the export_vault_details
API. The API currently enables you to export the information as a YAML file or a JSON file.
oracledb_keeper.export_vault_details("my_db_vault_info.json", format="json")
oracledb_keeper.export_vault_details("my_db_vault_info.yaml", format="yaml")
16.6.2.1 Load
The OracleDBSecretKeeper.load_secret() API deserializes and loads the credentials from the vault. You could
use this API in one of the following ways:
oracledb_secretobj = OracleDBSecretKeeper.load_secret('ocid1.vaultsecret..<unique_ID>')
oracledb_secret = oracledb_secretobj.to_dict()
print(oracledb_secret['user_name'])
16.6.2.2 Examples
import ads
ads.set_auth('resource_principal') # If using resource principal authentication
from ads.secrets.oracledb import OracleDBSecretKeeper
with OracleDBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>"
) as oracledb_creds2:
    print(oracledb_creds2["user_name"]) # Prints the user name
To expose credentials as an environment variable, set export_env=True. The following keys are exported:
import os
import ads
with OracleDBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>",
    export_env=True
):
    print(os.environ.get("user_name")) # Prints the user name
You can avoid name collisions by setting a prefix string using export_prefix along with export_env=True. For
example, if you set prefix as myprocess, then the exported keys are:
import os
import ads
with OracleDBSecretKeeper.load_secret(
    "ocid1.vaultsecret..<unique_ID>",
    export_env=True,
    export_prefix="myprocess"
):
    print(os.environ.get("myprocess.user_name")) # Prints the user name
SEVENTEEN
CLASS DOCUMENTATION
17.1.1 Subpackages
17.1.1.1.1 Submodules
Examples
train(**kwargs)
Returns a fitted automl model and a fitted baseline model.
Parameters
kwargs (dict, optional) – kwargs passed to provider’s train method
Returns
• model (object of ads.common.model.ADSModel) – the trained automl model
• baseline (object of ads.common.model.ADSModel) – the baseline model to compare
Examples
ads.automl.driver.get_ml_task_type(X, y, classes)
Gets the ML task type and returns it.
Parameters
• X (Dataframe) – The training dataframe
• Y (Dataframe) – The testing dataframe
• Classes (List) – a list of classes
Returns
A particular task type like REGRESSION, MULTI_CLASS_CLASSIFICATION...
Return type
ml_task_type
class ads.automl.provider.AutoMLFeatureSelection(msg)
Bases: object
fit(X)
Fits the baseline estimator
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be predicted
on
Returns
Self – The fitted estimator
Return type
Estimator
transform(X)
Runs the Baselines transform function and returns the result
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be trans-
formed
Returns
X – The transformed Dataframe.
Return type
Dataframe or list-like
class ads.automl.provider.AutoMLPreprocessingTransformer(msg)
Bases: object
fit(X)
Fits the preprocessing Transformer
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be predicted
on
Returns
Self – The fitted estimator
Return type
Estimator
transform(X)
Runs the preprocessing transform function and returns the result
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be trans-
formed
Returns
X – The transformed Dataframe.
Return type
Dataframe or list-like
class ads.automl.provider.AutoMLProvider
Bases: ABC
Abstract Base Class defining the structure of an AutoML solution. The solution needs to implement train() and
get_transformer_pipeline().
property est
Returns the estimator.
The estimator can be a standard sklearn estimator or any object that implements methods from (BaseEstimator,
RegressorMixin) for regression or (BaseEstimator, ClassifierMixin) for classification.
Returns
est
Return type
An instance of estimator
abstract get_transformer_pipeline()
Returns a list of transformers representing the transformations done on data before model prediction.
This method is optional to implement, and is used only for visualizing transformations on data using
ADSModel#visualize_transforms().
Returns
transformers_list
Return type
list of transformers implementing fit and transform
setup(X_train, y_train, ml_task_type, X_valid=None, y_valid=None, class_names=None, client=None)
Setup arguments to the AutoML instance.
Parameters
• X_train (DataFrame) – Training features
• y_train (DataFrame) – Training labels
• ml_task_type – One of ml_task_type.{REGRESSION, BINARY_CLASSIFICATION, MULTI_CLASS_CLASSIFICATION, BINARY_TEXT_CLASSIFICATION, MULTI_CLASS_TEXT_CLASSIFICATION}
• X_valid (DataFrame) – Validation features
• y_valid (DataFrame) – Validation labels
• class_names (list) – Unique values in y_train
• client (object) – Dask client instance for distributed execution
abstract train(**kwargs)
Calls fit on estimator.
This method is expected to set the ‘est’ property.
Parameters
• kwargs (dict, optional) –
• method (kwargs to decide the estimator and arguments for the fit) –
class ads.automl.provider.BaselineAutoMLProvider(est)
Bases: AutoMLProvider
Generates a baseline model using the Zero Rule algorithm by default. For a classification predictive modeling
problem where a categorical value is predicted, the Zero Rule algorithm predicts the class value that has the most
observations in the training dataset.
Parameters
est (BaselineModel) – An estimator that supports the fit/predict/predict_proba interface. By
default, DummyClassifier/DummyRegressor are used as estimators
decide_estimator(**kwargs)
Decides which type of BaselineModel to generate.
Returns
Model – A baseline model generated for the particular ML task being performed
Return type
BaselineModel
get_transformer_pipeline()
Returns a list of transformers representing the transformations done on data before model prediction.
This method is used only for visualizing transformations on data using ADSModel#visualize_transforms().
Returns
transformers_list
Return type
list of transformers implementing fit and transform
train(**kwargs)
Calls fit on estimator.
This method is expected to set the ‘est’ property.
Parameters
• kwargs (dict, optional) –
• method (kwargs to decide the estimator and arguments for the fit) –
class ads.automl.provider.BaselineModel(est)
Bases: object
A BaselineModel object that supports fit/predict/predict_proba/transform interface. Labels (y) are encoded using
DataFrameLabelEncoder.
fit(X, y)
Fits the baseline estimator.
Parameters
• X (Dataframe or list-like) – A Dataframe or list-like object holding data to be pre-
dicted on
• Y (Dataframe, Series, or list-like) – A Dataframe, series, or list-like object hold-
ing the labels
Returns
estimator
Return type
The fitted estimator
predict(X)
Runs the Baselines predict function and returns the result.
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be predicted
on
Returns
List
Return type
A list of predictions performed on the input data.
predict_proba(X)
Runs the Baselines predict_proba function and returns the result.
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be predicted
on
Returns
List
Return type
A list of probabilities of being part of a class
transform(X)
Runs the Baselines transform function and returns the result.
Parameters
X (Dataframe or list-like) – A Dataframe or list-like object holding data to be trans-
formed
Returns
Dataframe or list-like
Return type
The transformed Dataframe. Currently, no transformation is performed by the default Base-
line Estimator.
class ads.automl.provider.OracleAutoMLProvider(n_jobs=-1, loglevel=None, logger_override=None,
model_n_jobs: int = 1)
Bases: AutoMLProvider, ABC
The Oracle AutoML Provider automatically provides a tuned ML pipeline that best models the given training
dataset and the prediction task at hand.
Parameters
• n_jobs (int) – Specifies the degree of parallelism for Oracle AutoML. -1 (default) means
that AutoML will use all available cores.
• loglevel (int) – The verbosity of output for Oracle AutoML. Can be specified using the
Python logging module (https://docs.python.org/3/library/logging.html#logging-levels).
• model_n_jobs ((optional, int). Defaults to 1.) – Specifies the model paral-
lelism used by AutoML. This will be passed to the underlying model it is training.
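A minimal sketch of how this provider is typically passed to the AutoML driver (train is assumed to be a training dataset split prepared with ADS):

import logging

from ads.automl.driver import AutoML
from ads.automl.provider import OracleAutoMLProvider

# `train` is assumed to be an ADS training dataset split,
# for example from ADSDatasetWithTarget.train_test_split().
ml_engine = OracleAutoMLProvider(n_jobs=-1, loglevel=logging.ERROR)
oracle_automl = AutoML(train, provider=ml_engine)
model, baseline = oracle_automl.train()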
get_transformer_pipeline()
Returns a list of transformers representing the transformations done on data before model prediction.
This method is used only for visualizing transformations on data using ADSModel#visualize_transforms().
Returns
transformers_list
Return type
list of transformers implementing fit and transform
print_summary(max_rows=None, sort_column='Mean Validation Score', ranking_table_only=False)
Prints a summary of the Oracle AutoML Pipeline in the last train() call.
Parameters
• max_rows (int) – Number of trials to print. Pass in None to print all trials
• sort_column (string) – Column to sort results by. Must be one of [‘Algorithm’, ‘#Sam-
ples’, ‘#Features’, ‘Mean Validation Score’, ‘Hyperparameters’, ‘All Validation Scores’,
‘CPU Time’]
• ranking_table_only (bool) – Table to be displayed. Pass in False to display the com-
plete table. Pass in True to display the ranking table only.
Parameters
ylabel (str,) – Label for the y-axis. Defaults to the scoring metric.
visualize_feature_selection_trials(ylabel=None)
Visualize the feature selection trials taken to arrive at optimal set of features. The orange line shows the
optimal number of features chosen by Feature Selection.
Parameters
ylabel (str,) – Label for the y-axis. Defaults to the scoring metric.
visualize_tuning_trials(ylabel=None)
Visualize (plot) the hyperparameter tuning trials taken to arrive at the optimal hyperparameters. Each trial
in the plot represents a particular hyperparameter combination.
Parameters
ylabel (str,) – Label for the y-axis. Defaults to the scoring metric.
17.1.1.2.1 Submodules
• ValueError – If an error occurs while getting model provenance metadata from the server.
rollback() → None
Rolls back the changes made to the model.
Returns
Nothing.
Return type
None
show_in_notebook(display_format: str = 'dataframe') → None
Shows model in dataframe or yaml format. Supported formats: dataframe and yaml. Defaults to dataframe
format.
Returns
Nothing.
Return type
None
to_dataframe() → DataFrame
Converts the model to dataframe format.
Returns
Pandas dataframe.
Return type
pandas.DataFrame
exception ads.catalog.model.ModelArtifactSizeError(max_artifact_size: str)
Bases: Exception
class ads.catalog.model.ModelCatalog(compartment_id: Optional[str] = None, ds_client_auth:
Optional[dict] = None, identity_client_auth: Optional[dict] = None,
timeout: Optional[int] = None, ds_client:
Optional[DataScienceClient] = None, identity_client:
Optional[IdentityClient] = None)
Bases: object
Allows you to list, load, update, download, upload, and delete models from the model catalog.
get_model(self, model_id)
Loads the model from the model catalog based on model_id.
list_models(self, project_id=None, include_deleted=False, datetime_format=utils.date_format,
\*\*kwargs)
Lists all models in a given compartment, or in the current project if project_id is specified.
list_model_deployment(self, model_id, config=None, tenant_id=None, limit=500, page=None,
\*\*kwargs)
Gets the list of model deployments by model Id across the compartments.
update_model(self, model_id, update_model_details=None, \*\*kwargs)
Updates a model with given model_id, using the provided update data.
delete_model(self, model, \*\*kwargs)
Deletes the model based on model_id.
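A minimal sketch of the calls summarized above (the OCIDs are placeholders):

from ads.catalog.model import ModelCatalog

mc = ModelCatalog(compartment_id="<compartment_ocid>")
models = mc.list_models()             # ModelSummaryList for the compartment
model = mc.get_model("<model_ocid>")  # load a single model by OCID
mc.download_model("<model_ocid>", target_dir="./my_model", force_overwrite=True)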
download_model(model_id: str, target_dir: str, force_overwrite: bool = False, install_libs: bool = False,
conflict_strategy='IGNORE', bucket_uri: Optional[str] = None, remove_existing_artifact:
Optional[bool] = True)
Downloads the model from model_dir to target_dir based on model_id.
Parameters
• model_id (str) – The OCID of the model to download.
• target_dir (str) – The target location of model after download.
• force_overwrite (bool) – Overwrite target_dir if exists.
• install_libs (bool, default: False) – Install the libraries specified in ds-
requirements.txt which are missing in the current environment.
• conflict_strategy (ConflictStrategy, default: IGNORE) – Determines how to
handle version conflicts between the current environment and requirements of model ar-
tifact. Valid values: “IGNORE”, “UPDATE” or ConflictStrategy. IGNORE: Use the in-
stalled version in case of conflict UPDATE: Force update dependency to the version re-
quired by model artifact in case of conflict
• bucket_uri ((str, optional). Defaults to None.) – The OCI Object Stor-
age URI where model artifacts will be copied to. The bucket_uri is only nec-
essary for downloading large artifacts with size is greater than 2GB. Example:
oci://<bucket_name>@<namespace>/prefix/.
• remove_existing_artifact ((bool, optional). Defaults to True.) – Whether artifacts
uploaded to object storage bucket need to be removed or not.
Returns
A ModelArtifact instance.
Return type
ModelArtifact
get_model(model_id)
Loads the model from the model catalog based on model_id.
Parameters
model_id (str, required) – The model ID.
Returns
The ads.catalog.Model with the matching ID.
Return type
ads.catalog.Model
list_model_deployment(model_id: str, config: Optional[dict] = None, tenant_id: Optional[str] = None,
limit: int = 500, page: Optional[str] = None, **kwargs)
Gets the list of model deployments by model Id across the compartments.
Parameters
• model_id (str) – The model ID.
• config (dict (optional)) – Configuration keys and values as per SDK and Tool Con-
figuration. The from_file() method can be used to load configuration from a file. Alter-
natively, a dict can be passed. You can validate_config the dict using validate_config().
Defaults to None.
• tenant_id (str (optional)) – The tenancy ID, which can be used to specify a differ-
ent tenancy (for cross-tenancy authorization) when searching for resources in a different
tenancy. Defaults to None.
• limit (int (optional)) – The maximum number of items to return. The value must be
between 1 and 1000. Defaults to 500.
• page (str (optional)) – The page at which to start retrieving results.
Return type
The list of model deployments.
list_models(project_id: Optional[str] = None, include_deleted: bool = False, datetime_format: str =
'%Y-%m-%d %H:%M:%S', **kwargs)
Lists all models in a given compartment, or in the current project if project_id is specified.
Parameters
• project_id (str) – The project_id of model.
• include_deleted (bool, optional, default=False) – Whether to include deleted
models in the returned list.
• datetime_format (str, optional, default: '%Y-%m-%d %H:%M:%S') – Change
format for date time fields.
Returns
A list of models.
Return type
ModelSummaryList
update_model(model_id, update_model_details=None, **kwargs) → Model
Updates a model with given model_id, using the provided update data.
Parameters
• model_id (str) – The model ID.
• update_model_details (UpdateModelDetails) – Contains the update model details
data to apply. Mandatory unless kwargs are supplied.
• kwargs (dict, optional) – Update model details can be supplied instead as kwargs.
Returns
The ads.catalog.Model with the matching ID.
Return type
Model
upload_model(model_artifact: ModelArtifact, provenance_metadata: Optional[ModelProvenance] = None,
project_id: Optional[str] = None, display_name: Optional[str] = None, description:
Optional[str] = None, freeform_tags: Optional[Dict[str, Dict[str, object]]] = None,
defined_tags: Optional[Dict[str, Dict[str, object]]] = None, bucket_uri: Optional[str] =
None, remove_existing_artifact: Optional[bool] = True, overwrite_existing_artifact:
Optional[bool] = True)
Uploads the model artifact to cloud storage.
Parameters
• model_artifact (Union[ModelArtifact, GenericModel]) – The model artifacts or
generic model instance.
Return type
A filtered ModelSummaryList
sort_by(columns, reverse=False)
Performs a multi-key sort on a particular set of columns and returns the sorted ModelSummaryList. Results
are listed in a descending order by default.
Parameters
• columns (List of string) – A list of columns which are provided to sort on
• reverse (Boolean (defaults to false)) – If you’d like to reverse the results (for
example, to get ascending instead of descending results)
Returns
ModelSummaryList
Return type
A sorted ModelSummaryList
exception ads.catalog.model.ModelWithActiveDeploymentError
Bases: Exception
class ads.catalog.notebook.NotebookCatalog(compartment_id=None)
Bases: object
create_notebook_session(display_name=None, project_id=None, shape=None,
block_storage_size_in_gbs=None, subnet_id=None, **kwargs)
Create a new notebook session with the supplied details.
Parameters
• display_name (str, required) – The value to assign to the display_name property of
this CreateNotebookSessionDetails.
• project_id (str, required) – The value to assign to the project_id property of this
CreateNotebookSessionDetails.
• shape (str, required) – The value to assign to the shape property of this Notebook-
SessionConfigurationDetails. Allowed values for this property are: “VM.Standard.E2.2”,
“VM.Standard.E2.4”, “VM.Standard.E2.8”, “VM.Standard2.1”, “VM.Standard2.2”,
“VM.Standard2.4”, “VM.Standard2.8”, “VM.Standard2.16”,”VM.Standard2.24”.
• block_storage_size_in_gbs (int, required) – Size of the block storage drive.
Limited to values between 50 (GB) and 1024 (1024GB = 1TB)
• subnet_id (str, required) – The OCID of the subnet resource where the notebook is
to be created.
• kwargs (dict, optional) – Additional kwargs passed to DataScience-
Client.create_notebook_session()
Returns
oci.data_science.models.NotebookSession
Return type
A new notebook record.
Raises
KeyError – If the resource was not found or you do not have authorization to access that resource.
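A minimal sketch of creating a notebook session with the parameters described above (all values are placeholders):

from ads.catalog.notebook import NotebookCatalog

nc = NotebookCatalog(compartment_id="<compartment_ocid>")
notebook = nc.create_notebook_session(
    display_name="my-notebook-session",
    project_id="<project_ocid>",
    shape="VM.Standard2.1",
    block_storage_size_in_gbs=50,
    subnet_id="<subnet_ocid>",
)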
delete_notebook_session(notebook, **kwargs)
Deletes the notebook based on notebook_id.
Parameters
notebook (str ID or oci.data_science.models.NotebookSession,required) –
The OCID of the notebook to delete as a string, or a Notebook Session instance
Returns
Bool
Return type
True if delete was successful, false otherwise
get_notebook_session(notebook_id)
Get the notebook based on notebook_id
Parameters
notebook_id (str, required) – The OCID of the notebook to get.
Returns
oci.data_science.models.NotebookSession
Return type
The oci.data_science.models.NotebookSession with the matching ID.
Raises
KeyError – If the resource was not found or you do not have authorization to access that resource.
list_notebook_session(include_deleted=False, datetime_format='%Y-%m-%d %H:%M:%S', **kwargs)
List all notebooks in a given compartment
Parameters
• include_deleted (bool, optional, default=False) – Whether to include deleted
notebooks in the returned list
• datetime_format (str, optional, default: '%Y-%m-%d %H:%M:%S') – Change
format for date time fields
Returns
NotebookSummaryList
Return type
A List of notebooks.
Raises
KeyError – If the resource was not found or you do not have authorization to access that resource.
update_notebook_session(notebook_id, update_notebook_details=None, **kwargs)
Updates a notebook with given notebook_id, using the provided update data
Parameters
• notebook_id (str) – notebook_id OCID to update
• update_notebook_details (oci.data_science.models.
UpdateNotebookSessionDetails) – contains the new notebook details data to
apply
• kwargs (dict, optional) – Update notebook session details can be supplied instead as
kwargs
Returns
oci.data_science.models.NotebookSession
Return type
The updated Notebook record
Raises
KeyError – If the resource was not found or you do not have authorization to access that resource.
class ads.catalog.notebook.NotebookSummaryList(notebook_list, response=None,
datetime_format='%Y-%m-%d %H:%M:%S')
Bases: SummaryList
filter(selection, instance=None)
Filter the notebook list according to a lambda filter function, or list comprehension.
Parameters
• selection (lambda function filtering notebook instances, or a
list-comprehension) – function of list filtering notebooks
• instance (list, optional) – list to filter, optional, defaults to self
Raises
ValueError – If selection passed is not correct. For example: selec-
tion=oci.data_science.models.NotebookSession.:
sort_by(columns, reverse=False)
Performs a multi-key sort on a particular set of columns and returns the sorted NotebookSummaryList
Results are listed in a descending order by default.
Parameters
• columns (List of string) – A list of columns which are provided to sort on
• reverse (Boolean (defaults to false)) – If you’d like to reverse the results (for
example, to get ascending instead of descending results)
Returns
NotebookSummaryList
Return type
A sorted NotebookSummaryList
to_dataframe(datetime_format=None)
Returns the model catalog summary as a pandas dataframe
Parameters
datetime_format (date_format) – A datetime format, like utils.date_format. Defaults to None.
Returns
Dataframe
Return type
The pandas DataFrame representation of the model catalog summary
17.1.1.3.1 Submodules
Returns
Contains keys - config, signer and client_kwargs.
• The config contains the configuration loaded from oci_config.
• The signer contains the signer object created from the api keys.
• client_kwargs contains the client_kwargs that was passed in as input parameter.
Return type
dict
Examples
ads.common.auth.default_signer(client_kwargs=None)
Prepares authentication and extra arguments necessary for creating clients for different OCI services based on
the default authentication setting for the session. Refer ads.set_auth API for further reference.
Parameters
client_kwargs (dict) – kwargs that are required to instantiate the Client if we need to override
the defaults.
Returns
Contains keys - config, signer and client_kwargs.
• The config contains the config loaded from the configuration loaded from the default location
if the default auth mode is API keys, otherwise it is empty dictionary.
• The signer contains the signer object created from default auth mode.
• client_kwargs contains the client_kwargs that was passed in as input parameter.
Return type
dict
Examples
ads.common.auth.resource_principal(client_kwargs=None)
Prepares authentication and extra arguments necessary for creating clients for different OCI services using Re-
source Principals.
Parameters
client_kwargs (dict) – kwargs that are required to instantiate the Client if we need to override
the defaults.
Returns
Contains keys - config, signer and client_kwargs.
• The config contains an empty dictionary.
• The signer contains the signer object created from the resource principal.
• client_kwargs contains the client_kwargs that was passed in as input parameter.
Return type
dict
Examples
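A minimal sketch of using these helpers, assuming the OCIClientFactory helper in ads.common.oci_client as the client factory:

import ads
from ads.common import auth as authutil
from ads.common import oci_client as oc

ads.set_auth("resource_principal")

# Build the auth dict from the session default (API keys or resource principal)
auth = authutil.default_signer()
# ...or request resource principal authentication explicitly
# auth = authutil.resource_principal()

# The returned dict (config, signer, client_kwargs) can be unpacked into a client factory
os_client = oc.OCIClientFactory(**auth).object_storage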
• kwargs – Additional keyword arguments that would be passed to the underlying Pandas read
API.
static build(X=None, y=None, name='', dataset_type=None, **kwargs)
Returns an ADSData object built from the (source, target) or (X,y)
Parameters
• X (Union[pandas.DataFrame, dask.DataFrame, numpy.ndarray, scipy.sparse.csr.csr_matrix]) – If str, URI for the dataset. The dataset could be read from a local or network file system, HDFS, S3, or GCS. Should be None if X_train, y_train, X_test, y_test are provided.
• y (Union[str, pandas.DataFrame, dask.DataFrame, pandas.Series, dask.Series, numpy.ndarray]) – If str, the name of the target in X; otherwise, a series of labels corresponding to X.
• name (str, optional) – Name to identify this data
• dataset_type (ADSDataset, optional) – When this value is available, would be used
to evaluate the ads task type
• kwargs – Additional keyword arguments that would be passed to the underlying Pandas
read API.
Returns
ads_data – A built ADSData object
Return type
ads.common.data.ADSData
Examples
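A minimal sketch of building an ADSData object from an in-memory pandas DataFrame (the column names are placeholders):

import pandas as pd
from ads.common.data import ADSData

df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4],
    "feature_b": [0.1, 0.2, 0.3, 0.4],
    "target": [0, 1, 0, 1],
})

# y may be a series of labels, or the name of the target column in X
ads_data = ADSData.build(X=df.drop(columns=["target"]), y=df["target"], name="example")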
Return type
Array
feature_names(X=None)
Examples
is_classifier()
Returns True if ADS believes that the model is a classifier
Returns
Boolean
Return type
True if the model is a classifier, False otherwise.
predict(X)
Runs the model's predict function on some data
Parameters
X (ADSData) – An ADSData object which holds the examples to be predicted on.
Returns
Usually a list or PandasSeries of predictions
Return type
Union[List, pandas.Series], depending on the estimator
predict_proba(X)
Runs the model's predict probabilities function on some data
Parameters
X (ADSData) – An ADSData object which holds the examples to be predicted on.
Returns
Usually a list or PandasSeries of predictions
Return type
Union[List, pandas.Series], depending on the estimator
prepare(target_dir=None, data_sample=None, X_sample=None, y_sample=None,
include_data_sample=False, force_overwrite=False, fn_artifact_files_included=False,
fn_name='model_api', inference_conda_env=None, data_science_env=False,
ignore_deployment_error=False, use_case_type=None, inference_python_version=None,
imputed_values={}, **kwargs)
Prepare model artifact directory to be published to model catalog
Parameters
• target_dir (str, default: model.name[:12]) – Target directory under which the
model artifact files need to be added
• data_sample (ADSData) – Note: This format is preferable to X_sample and y_sample.
A sample of the test data that will be provided to predict() API of scoring script Used to
generate schema_input.json and schema_output.json which defines the input and output
formats
• X_sample (pandas.DataFrame) – A sample of input data that will be provided to predict()
API of scoring script Used to generate schema.json which defines the input formats
Returns
Almost always a scalar score (usually a float).
Return type
float, depending on the estimator
show_in_notebook()
Describe the model by showing its properties
summary()
A summary of the ADSModel
transform(X)
Process some ADSData through the selected ADSModel transformers
Parameters
X (ADSData) – An ADSData object which holds the examples to be transformed.
visualize_transforms()
A graph of the ADSModel transformer pipeline. It is only supported in JupyterLabs Notebooks.
CUML = 'cuml'
EMCEE = 'emcee'
ENSEMBLE = 'ensemble'
FLAIR = 'flair'
GENSIM = 'gensim'
H20 = 'h2o'
KERAS = 'keras'
LIGHT_GBM = 'lightgbm'
MXNET = 'mxnet'
NLTK = 'nltk'
ORACLE_AUTOML = 'oracle_automl'
OTHER = 'other'
PROPHET = 'prophet'
PYMC3 = 'pymc3'
PYOD = 'pyod'
PYSTAN = 'pystan'
PYTORCH = 'pytorch'
SCIKIT_LEARN = 'scikit-learn'
SKTIME = 'sktime'
SPACY = 'spacy'
SPARK = 'spark'
STATSMODELS = 'statsmodels'
TENSORFLOW = 'tensorflow'
TRANSFORMERS = 'transformers'
WORD2VEC = 'word2vec'
XGBOOST = 'xgboost'
class ads.common.model_metadata.MetadataCustomCategory
Bases: str
OTHER = 'Other'
PERFORMANCE = 'Performance'
class ads.common.model_metadata.MetadataCustomKeys
Bases: str
CLIENT_LIBRARY = 'ClientLibrary'
CONDA_ENVIRONMENT = 'CondaEnvironment'
CONDA_ENVIRONMENT_PATH = 'CondaEnvironmentPath'
ENVIRONMENT_TYPE = 'EnvironmentType'
MODEL_ARTIFACTS = 'ModelArtifacts'
MODEL_FILE_NAME = 'ModelFileName'
MODEL_SERIALIZATION_FORMAT = 'ModelSerializationFormat'
SLUG_NAME = 'SlugName'
TRAINING_DATASET = 'TrainingDataset'
TRAINING_DATASET_NUMBER_OF_COLS = 'TrainingDatasetNumberOfCols'
TRAINING_DATASET_NUMBER_OF_ROWS = 'TrainingDatasetNumberOfRows'
TRAINING_DATASET_SIZE = 'TrainingDatasetSize'
VALIDATION_DATASET = 'ValidationDataset'
VALIDATION_DATASET_NUMBER_OF_COLS = 'ValidationDataSetNumberOfCols'
VALIDATION_DATASET_NUMBER_OF_ROWS = 'ValidationDatasetNumberOfRows'
VALIDATION_DATASET_SIZE = 'ValidationDatasetSize'
class ads.common.model_metadata.MetadataCustomPrintColumns
Bases: str
CATEGORY = 'Category'
DESCRIPTION = 'Description'
KEY = 'Key'
VALUE = 'Value'
ARTIFACT_TEST_RESULT = 'ArtifactTestResults'
FRAMEWORK = 'Framework'
FRAMEWORK_VERSION = 'FrameworkVersion'
HYPERPARAMETERS = 'Hyperparameters'
USE_CASE_TYPE = 'UseCaseType'
class ads.common.model_metadata.MetadataTaxonomyPrintColumns
Bases: str
KEY = 'Key'
VALUE = 'Value'
Examples
reset(self ) → None
Resets model metadata item.
to_dict(self ) → dict
Serializes model metadata item to dictionary.
to_yaml(self )
Serializes model metadata item to YAML.
size(self ) → int
Returns the size of the metadata in bytes.
update(self, value: str = '', description: str = '', category: str = '') → None
Updates metadata item information.
to_json(self ) → JSON
Serializes metadata item into a JSON.
to_json_file(self, file_path: str, storage_options: dict = None) → None
Saves the metadata item value to a local file or object storage.
validate(self ) → bool
Validates metadata item.
property category: str
reset() → None
Resets model metadata item.
Resets value, description and category to None.
Returns
Nothing.
Return type
None
update(value: str, description: str, category: str) → None
Updates metadata item.
Parameters
• value (str) – The value of model metadata item.
• description (str) – The description of model metadata item.
• category (str) – The category of model metadata item.
Returns
Nothing.
Return type
None
validate() → bool
Validates metadata item.
Returns
True if validation passed.
Return type
bool
Raises
• ValueError – If invalid category provided.
• MetadataValueTooLong – If value exceeds the length limit.
class ads.common.model_metadata.ModelMetadata
Bases: ABC
The base abstract class representing model metadata.
get(self, key: str) → ModelMetadataItem
Returns the model metadata item by provided key.
reset(self ) → None
Resets all model metadata items to empty values.
to_dataframe(self ) → pd.DataFrame
Returns the model metadata list in a data frame format.
size(self ) → int
Returns the size of the model metadata in bytes.
validate(self ) → bool
Validates metadata.
to_dict(self )
Serializes model metadata into a dictionary.
to_yaml(self )
Serializes model metadata into a YAML.
to_json(self )
Serializes model metadata into a JSON.
to_json_file(self, file_path: str, storage_options: dict = None) → None
Saves the metadata to a local file or object storage.
Initializes Model Metadata.
get(key: str) → ModelMetadataItem
Returns the model metadata item by provided key.
Parameters
key (str) – The key of model metadata item.
Returns
The model metadata item.
Return type
ModelMetadataItem
Raises
ValueError – If provided key is empty or metadata item not found.
property keys: Tuple[str]
Returns all registered metadata keys.
Returns
The list of metadata keys.
Return type
Tuple[str]
reset() → None
Resets all model metadata items to empty values.
Resets value, description and category to None for every metadata item.
size() → int
Returns the size of the model metadata in bytes.
Returns
The size of model metadata in bytes.
Return type
int
abstract to_dataframe() → DataFrame
Returns the model metadata list in a data frame format.
Returns
The model metadata in a dataframe format.
Return type
pandas.DataFrame
to_dict()
Serializes model metadata into a dictionary.
Returns
The model metadata in a dictionary representation.
Return type
Dict
to_json()
Serializes model metadata into a JSON.
Returns
The model metadata in a JSON representation.
Return type
JSON
to_json_file(file_path: str, storage_options: Optional[dict] = None) → None
Saves the metadata to a local file or object storage.
Parameters
• file_path (str) – The file path to store the data.
“oci://bucket_name@namespace/folder_name/” “oci://bucket_name@namespace/folder_name/metadata.json”
“path/to/local/folder” “path/to/local/folder/metadata.json”
• storage_options (dict. Default None) – Parameters passed on to the backend
filesystem class. Defaults to options set using DatasetFactory.set_default_storage().
Returns
Nothing.
Return type
None
Raises
• ValueError – When file path is empty.:
• TypeError – When file path not a string.:
Examples
>>> storage_options
{'log_requests': False,
'additional_user_agent': '',
'pass_phrase': None,
'user': '<user-id>',
'fingerprint': '05:15:2b:b1:46:8a:32:ec:e2:69:5b:32:01:**:**:**)',
'tenancy': '<tenancy-id>',
'region': 'us-ashburn-1',
'key_file': '/home/datascience/.oci/oci_api_key.pem'}
>>> metadata.to_json_file(file_path = 'oci://bucket_name@namespace/folder_name/
˓→metadata_taxonomy.json', storage_options=storage_options)
>>> metadata_item.to_json_file("path/to/local/folder/metadata_taxonomy.json")
to_yaml()
Serializes model metadata into a YAML.
Returns
The model metadata in a YAML representation.
Return type
Yaml
validate() → bool
Validates model metadata.
Returns
True if metadata is valid.
Return type
bool
validate_size() → bool
Validates model metadata size.
Validates the size of metadata. Throws an error if the size of the metadata exceeds expected value.
Returns
True if metadata size is valid.
Return type
bool
Raises
MetadataSizeTooLarge – If the size of the metadata exceeds expected value.
class ads.common.model_metadata.ModelMetadataItem
Bases: ABC
The base abstract class representing model metadata item.
to_dict(self ) → dict
Serializes model metadata item to dictionary.
to_yaml(self )
Serializes model metadata item to YAML.
size(self ) → int
Returns the size of the metadata in bytes.
to_json(self ) → JSON
Serializes metadata item to JSON.
to_json_file(self, file_path: str, storage_options: dict = None) → None
Saves the metadata item value to a local file or object storage.
validate(self ) → bool
Validates metadata item.
size() → int
Returns the size of the model metadata in bytes.
Returns
The size of model metadata in bytes.
Return type
int
to_dict() → dict
Serializes model metadata item to dictionary.
Returns
The dictionary representation of model metadata item.
Return type
dict
to_json()
Serializes metadata item into a JSON.
Returns
The metadata item in a JSON representation.
Return type
JSON
to_json_file(file_path: str, storage_options: Optional[dict] = None) → None
Saves the metadata item value to a local file or object storage.
Parameters
• file_path (str) – The file path to store the data.
“oci://bucket_name@namespace/folder_name/” “oci://bucket_name@namespace/folder_name/result.json”
“path/to/local/folder” “path/to/local/folder/result.json”
• storage_options (dict. Default None) – Parameters passed on to the backend
filesystem class. Defaults to options set using DatasetFactory.set_default_storage().
Returns
Nothing.
Return type
None
Raises
• ValueError – When file path is empty.:
• TypeError – When file path not a string.:
Examples
>>> storage_options
{'log_requests': False,
'additional_user_agent': '',
'pass_phrase': None,
'user': '<user-id>',
'fingerprint': '05:15:2b:b1:46:8a:32:ec:e2:69:5b:32:01:**:**:**)',
'tenancy': '<tenency-id>',
'region': 'us-ashburn-1',
'key_file': '/home/datascience/.oci/oci_api_key.pem'}
>>> metadata_item.to_json_file(file_path = 'oci://bucket_name@namespace/folder_
˓→name/file.json', storage_options=storage_options)
>>> metadata_item.to_json_file("path/to/local/folder/file.json")
to_yaml()
Serializes model metadata item to YAML.
Returns
The model metadata item in a YAML representation.
Return type
Yaml
abstract validate() → bool
Validates metadata item.
Returns
True if validation passed.
Return type
bool
class ads.common.model_metadata.ModelProvenanceMetadata(repo: Optional[str] = None, git_branch:
Optional[str] = None, git_commit:
Optional[str] = None, repository_url:
Optional[str] = None,
training_script_path: Optional[str] =
None, training_id: Optional[str] = None,
artifact_dir: Optional[str] = None)
Bases: DataClassSerializable
ModelProvenanceMetadata class.
Examples
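A minimal illustrative sketch of building provenance metadata from the constructor arguments shown above; all values are placeholders, and the to_dict() call assumes the serializer inherited from DataClassSerializable:
>>> provenance_metadata = ModelProvenanceMetadata(
...     repo="file:///home/datascience",
...     git_branch="main",
...     git_commit="<commit-hash>",
...     training_script_path="train.py",
...     training_id="<training-job-run-ocid>",
... )
>>> provenance_metadata.to_dict()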
class ads.common.model_metadata.ModelTaxonomyMetadata
Bases: ModelMetadata
Class that represents Model Taxonomy Metadata.
get(self, key: str) → ModelTaxonomyMetadataItem
Returns the model metadata item by provided key.
reset(self ) → None
Resets all model metadata items to empty values.
to_dataframe(self ) → pd.DataFrame
Returns the model metadata list in a data frame format.
size(self ) → int
Returns the size of the model metadata in bytes.
validate(self ) → bool
Validates metadata.
to_dict(self )
Serializes model metadata into a dictionary.
to_yaml(self )
Serializes model metadata into a YAML.
to_json(self )
Serializes model metadata into a JSON.
to_json_file(self, file_path: str, storage_options: dict = None) → None
Saves the metadata to a local file or object storage.
Examples
>>> metadata_taxonomy.reset()
>>> metadata_taxonomy.to_dataframe()
Key Value
--------------------------------------------
0 UseCaseType None
1 Framework None
2 FrameworkVersion None
3 Algorithm None
4 Hyperparameters None
>>> metadata_taxonomy
metadata:
- key: UseCaseType
category: None
description: None
value: None
reset() → None
Resets model metadata item.
Resets value to None.
Returns
Nothing.
Return type
None
update(value: str) → None
Updates metadata item value.
Parameters
value (str) – The value of model metadata item.
Returns
Nothing.
Return type
None
validate() → bool
Validates metadata item.
Returns
True if validation passed.
Return type
bool
Raises
ValueError – If invalid UseCaseType provided. If invalid Framework provided.
property value: str
class ads.common.model_metadata.UseCaseType
Bases: str
ANOMALY_DETECTION = 'anomaly_detection'
BINARY_CLASSIFICATION = 'binary_classification'
CLUSTERING = 'clustering'
DIMENSIONALITY_REDUCTION = 'dimensionality_reduction/representation'
IMAGE_CLASSIFICATION = 'image_classification'
MULTINOMIAL_CLASSIFICATION = 'multinomial_classification'
NER = 'ner'
OBJECT_LOCALIZATION = 'object_localization'
OTHER = 'other'
RECOMMENDER = 'recommender'
REGRESSION = 'regression'
SENTIMENT_ANALYSIS = 'sentiment_analysis'
TIME_SERIES_FORECASTING = 'time_series_forecasting'
TOPIC_MODELING = 'topic_modeling'
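The constants above are plain strings (the class subclasses str), so they can be passed anywhere a use case type string is expected, for example:
>>> UseCaseType.BINARY_CLASSIFICATION
'binary_classification'
>>> isinstance(UseCaseType.REGRESSION, str)
True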
The module that provides the decorator helping to add runtime dependencies in functions.
Examples
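An illustrative sketch of how the decorator might be applied; the module and install_from parameter names are assumptions based on this module's purpose, not a confirmed signature:
>>> from ads.common.decorator.runtime_dependency import (
...     OptionalDependency,
...     runtime_dependency,
... )
>>> @runtime_dependency(module="torch", install_from=OptionalDependency.PYTORCH)
... def load_model(path):
...     # torch is only needed when this function is actually called;
...     # the decorator points users at 'oracle-ads[torch]' if it is missing.
...     import torch
...     return torch.load(path)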
class ads.common.decorator.runtime_dependency.OptionalDependency
Bases: object
BDS = 'oracle-ads[bds]'
BOOSTED = 'oracle-ads[boosted]'
DATA = 'oracle-ads[data]'
GEO = 'oracle-ads[geo]'
LABS = 'oracle-ads[labs]'
MYSQL = 'oracle-ads[mysql]'
NOTEBOOK = 'oracle-ads[notebook]'
ONNX = 'oracle-ads[onnx]'
OPCTL = 'oracle-ads[opctl]'
OPTUNA = 'oracle-ads[optuna]'
PYTORCH = 'oracle-ads[torch]'
SPARK = 'oracle-ads[spark]'
TENSORFLOW = 'oracle-ads[tensorflow]'
TEXT = 'oracle-ads[text]'
VIZ = 'oracle-ads[viz]'
Examples
class ads.common.decorator.deprecate.TARGET_TYPE(value)
Bases: Enum
An enumeration.
ATTRIBUTE = 'Attribute'
CLASS = 'Class'
METHOD = 'Method'
The module that helps to minimize the number of errors of the model post-deployment process. The model provides a
simple testing harness to ensure that model artifacts are thoroughly tested before being saved to the model catalog.
Classes
ModelIntrospect
Class to introspect model artifacts.
Examples
class ads.common.model_introspect.Introspectable
Bases: ABC
Base class that represents an introspectable object.
exception ads.common.model_introspect.IntrospectionNotPassed
Bases: ValueError
class ads.common.model_introspect.ModelIntrospect(artifact: Introspectable)
Bases: object
Class to introspect model artifacts.
Parameters
• status (str) – Returns the current status of model introspection. The possible variants:
Passed, Not passed, Not tested.
• failures (int) – Returns the number of failures of introspection result.
run(self ) → None
Invokes model artifacts introspection.
to_dataframe(self ) → pd.DataFrame
Serializes model introspection result into a DataFrame.
Examples
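A minimal sketch, assuming model_artifact is an existing introspectable artifact (for example, one returned by prepare_generic_model()):
>>> introspector = ModelIntrospect(model_artifact)
>>> result_df = introspector.run()   # introspection result as a pandas DataFrame
>>> introspector.status              # e.g. 'Passed'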
Raises
• ValueError – If the model artifact object is not provided.
• TypeError – If the provided input parameter is not a ModelArtifact instance.
property failures: int
Calculates the number of failures.
Returns
The number of failures.
Return type
int
run() → DataFrame
Invokes introspection.
Returns
The introspection result in a DataFrame format.
Return type
pd.DataFrame
property status: str
Gets the current status of model introspection.
to_dataframe() → DataFrame
Serializes model introspection result into a DataFrame.
Returns
The model introspection result in a DataFrame representation.
Return type
pandas.DataFrame
class ads.common.model_introspect.PrintItem(key: str = '', case: str = '', result: str = '', message: str = '')
Bases: object
Class represents the model introspection print item.
case: str = ''
to_list() → List[str]
Converts instance to a list representation.
Returns
The instance in a list representation.
Return type
List[str]
class ads.common.model_introspect.TEST_STATUS
Bases: str
NOT_PASSED = 'Failed'
NOT_TESTED = 'Skipped'
PASSED = 'Passed'
class ads.common.model_export_util.ONNXTransformer
Bases: object
This is a transformer to convert X [pandas.DataFrame, pd.Series] data into ONNX readable dtypes and formats. It
is Serializable, so it can be reloaded at another time.
Examples
Return type
Str
transform(X: Union[DataFrame, Series, ndarray, list])
Transforms the data for the OnnxTransformer.
Parameters
X (Union[pandas.DataFrame, pandas.Series, np.ndarray, list]) – The
Dataframe for the training data
Returns
The transformed X data
Return type
Union[pandas.DataFrame, pandas.Series, np.ndarray, list]
ads.common.model_export_util.prepare_generic_model(model_path: str, fn_artifact_files_included: bool
= False, fn_name: str = 'model_api',
force_overwrite: bool = False, model: Any =
None, data_sample: ADSData = None,
use_case_type=None, X_sample: Union[list,
tuple, Series, ndarray, DataFrame] = None,
y_sample: Union[list, tuple, Series, ndarray,
DataFrame] = None, **kwargs) → ModelArtifact
Generates template files to aid model deployment. The model can be accompanied by other artifacts, all of which can be dumped at model_path. The following files are generated: func.yaml, func.py, requirements.txt, and score.py.
Parameters
• model_path (str) – Path where the artifacts must be saved. The serialized model object
and any other associated files/objects must be saved in the model_path directory
• fn_artifact_files_included (bool) – Default is False, if turned off, function artifacts
are not generated.
• fn_name (str) – Optional parameter to specify the function name.
• force_overwrite (bool) – Optional parameter to specify whether the model_artifact should overwrite the existing model_path (if it exists).
• model ((Any, optional). Defaults to None.) – This is an optional model object
which is only used to extract taxonomy metadata. Supported models: automl, keras, light-
gbm, pytorch, sklearn, tensorflow, and xgboost. If the model is not under supported frame-
works, then extracting taxonomy metadata will be skipped. The alternative way is using
artifact.populate_metadata(model=model, usecase_type=UseCaseType.REGRESSION).
• data_sample (ADSData) – A sample of the test data that will be provided to the predict() API of the scoring script. Used to generate schema_input and schema_output.
• use_case_type (str) – The use case type of the model.
• X_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame, dask.dataframe.core.Series, dask.dataframe.core.DataFrame]) – A sample of input data that will be provided to the predict() API of the scoring script. Used to generate the input schema.
• y_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame, dask.dataframe.core.Series, dask.dataframe.core.DataFrame]) – A sample of output data that is expected to be returned by the predict() API of the scoring script, corresponding to X_sample. Used to generate the output schema.
• **kwargs –
• data_science_env (bool, default: False) – If set to True, the datascience environ-
ment represented by the slug in the training conda environment will be used.
• inference_conda_env (str, default: None) – Conda environment to
use within the model deployment service for inferencing. For example,
oci://bucketname@namespace/path/to/conda/env
• ignore_deployment_error (bool, default: False) – If set to True, the prepare
method will ignore all the errors that may impact model deployment.
• underlying_model (str, default: 'UNKNOWN') – Underlying Model Type, could be
“automl”, “sklearn”, “h2o”, “lightgbm”, “xgboost”, “torch”, “mxnet”, “tensorflow”, “keras”,
“pyod”, etc.
• model_libs (dict, default: {}) – Model required libraries where the key is the library
names and the value is the library versions. For example, {numpy: 1.21.1}.
• progress (int, default: None) – The maximum number of progress bar steps.
• inference_python_version (str, default: None) – If provided, it is added to the generated runtime YAML.
• max_col_num ((int, optional). Defaults to utils.DATA_SCHEMA_MAX_COL_NUM.) – The maximum number of columns of data for which the schema is automatically generated.
Examples
>>> # illustrative: prepare a generic artifact, then save it to the model catalog
>>> model_artifact = prepare_generic_model(
...     model_path="/tmp/model-artifact",
...     force_overwrite=True,
... )
>>> ocimodel = model_artifact.save(
...     display_name="LRModel_01",
...     description="My Logistic Regression Model",
...     ignore_pending_changes=True,
...     timeout=100,
...     ignore_introspection=True,
... )
>>> print(f"The OCID of the model is: {ocimodel.id}")
Returns
model_artifact – A generic model artifact
Return type
ads.model_artifact.model_artifact
Parameters
• model (ads.Model) – A model to be serialized
• target_dir (str, optional) – directory to output the serialized model
• X (Union[pandas.DataFrame, pandas.Series]) – The X data
• y (Union[list, pandas.DataFrame, pandas.Series]) – The Y data
• model_type (str, optional) – A string corresponding to the model type
Returns
model_kwargs – A dictionary of model kwargs for the serialized model
Return type
Dict
Parameters
• path (str) – Target folder where the artifacts are placed.
• fn_attributes (dict) – dictionary specifying all the function attributes as described in
https://github.com/fnproject/docs/blob/master/fn/develop/func-file.md
• artifact_type_generic (bool) – default is False. This attribute decides which template
to pick for score.py. If True, it is assumed that the code to load is provided by the user.
ads.common.function.fn_util.get_function_config() → dict
Returns dictionary loaded from func_conf.yaml
ads.common.function.fn_util.prepare_fn_attributes(func_name: str, schema_version=20180708,
version=None, python_runtime=None,
entry_point=None, memory=None) → dict
Workaround for collections.namedtuple; default values are not supported.
ads.common.function.fn_util.write_score(path, **kwargs)
exception ads.common.utils.FileOverwriteError
Bases: Exception
class ads.common.utils.JsonConverter(*, skipkeys=False, ensure_ascii=True, check_circular=True,
allow_nan=True, sort_keys=False, indent=None, separators=None,
default=None)
Bases: JSONEncoder
Constructor for JSONEncoder, with sensible defaults.
If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys
is True, such items are simply skipped.
If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped.
If ensure_ascii is false, the output can contain non-ASCII characters.
If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references
during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such
check takes place.
If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON spec-
ification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a
ValueError to encode such floats.
If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to
ensure that JSON serializations can be compared on a day-to-day basis.
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with
that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.
If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a
JSON encodable version of the object or raise a TypeError.
default(obj)
Converts an object to JSON based on its type
Parameters
obj (Object) – An object being converted to JSON. Supported types are pandas Timestamp, Series, DataFrame, Categorical, and numpy ndarray.
Returns
A JSON representation of the object.
Return type
JSON
ads.common.utils.camel_to_snake(name: str) → str
Converts the camel case string to the snake representation.
Parameters
name (str) – The name to convert.
Returns
The name converted to the snake representation.
Return type
str
ads.common.utils.copy_file(uri_src: str, uri_dst: str, force_overwrite: Optional[bool] = False, auth:
Optional[Dict] = None, chunk_size: Optional[int] = 8192,
progressbar_description: Optional[str] = 'Copying `{uri_src}` to `{uri_dst}`') →
str
Copies file from uri_src to uri_dst. If uri_dst specifies a directory, the file will be copied into uri_dst using the
base filename from uri_src. Returns the path to the newly created file.
Parameters
• uri_src (str) – The URI of the source file, which can be local path or OCI object storage
URI.
• uri_dst (str) – The URI of the destination file, which can be local path or OCI object
storage URI.
• force_overwrite ((bool, optional). Defaults to False.) – Whether to over-
write existing files or not.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
ads.common.utils.ellipsis_strings(raw, n=24)
Takes a sequence (a string, list of strings, tuple of strings, or pd.Series of strings) and truncates each element with an ellipsis at position n.
ads.common.utils.extract_lib_dependencies_from_model(model) → dict
Extract a dictionary of library dependencies for a model
Parameters
model –
Returns
Dict
Return type
A dictionary of library dependencies.
ads.common.utils.first_not_none(itr)
Returns the first non-None result from an iterable; similar to any(), but returns the value rather than True/False.
ads.common.utils.flatten(d, parent_key='')
Flattens nested dictionaries to a single layer dictionary
Parameters
• d (dict) – The dictionary that needs to be flattened
• parent_key (str) – Keys in the dictionary that are nested
Returns
a_dict – a single layer dictionary
Return type
dict
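For illustration; the exact key-join separator is an implementation detail of flatten and a dot is assumed here:
>>> nested = {"model": {"framework": "sklearn", "params": {"C": 1.0}}}
>>> flatten(nested)   # single-layer dict, e.g. {'model.framework': 'sklearn', 'model.params.C': 1.0}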
ads.common.utils.folder_size(path: str) → int
Recursively calculates the size of the path folder.
Parameters
path (str) – Path to the folder.
Returns
The size of the folder in bytes.
Return type
int
ads.common.utils.generate_requirement_file(requirements: dict, file_path: str, file_name: str =
'requirements.txt')
Generate requirements file at file_path.
Parameters
• requirements (dict) – Key is the library name and value is the version
• file_path (str) – Directory to save requirements.txt
• file_name (str) – Optional parameter to specify the file name
ads.common.utils.get_base_modules(model)
Get the base modules from an ADS model
ads.common.utils.get_bootstrap_styles()
Returns HTML bootstrap style information
ads.common.utils.get_compute_accelerator_ncores()
ads.common.utils.get_cpu_count()
Returns the number of CPUs available on this machine
ads.common.utils.get_dataframe_styles(max_width=75)
Styles used for dataframe, example usage:
df.style.set_table_styles(utils.get_dataframe_styles()).set_table_attributes('class=table').render()
Returns
styles – A list of dataframe table styler styles.
Return type
array
ads.common.utils.get_files(directory: str)
List out all the file names under this directory.
Parameters
directory (str) – The directory to list out all the files from.
Returns
List of the files in the directory.
Return type
List
ads.common.utils.get_oci_config()
Returns the OCI config location, and the OCI config profile.
ads.common.utils.get_progress_bar(max_progress, description='Initializing')
Returns an instance of ProgressBar appropriate to the runtime environment.
ads.common.utils.get_random_name_for_resource() → str
Returns a randomly generated, easy-to-remember name. It consists of one adjective and one animal word, followed by a UTC timestamp (joined with '-'). This is the ADS default resource name generated for models, jobs, jobruns, model deployments, and pipelines.
Returns
Randomly generated easy to remember name for oci resources - models, jobs, jobruns, model de-
ployments, pipelines. Example: polite-panther-2022-08-17-21:15.46; strange-spider-2022-08-
17-23:55.02
Return type
str
ads.common.utils.get_sqlalchemy_engine(connection_url, *args, **kwargs)
The SQLAlchemy docs recommend using a single engine per connection_url; this function takes care of that.
Parameters
connection_url (string) – The URL to connect to
Returns
engine – The engine on which SQLAlchemy commands can be run
Return type
SQLAlchemy engine
Examples
display(HTML(utils.horizontal_scrollable_div(my_html)))
Parameters
html (str) – Your HTML to wrap.
Returns
Wrapped HTML.
Return type
type
ads.common.utils.human_size(num_bytes: int, precision: Optional[int] = 2) → str
Converts bytes size to a string representing its value in B, KB, MB and GB.
Parameters
• num_bytes (int) – The size in bytes.
• precision ((int, optional). Defaults to 2.) – The precision of converting the
bytes value.
Returns
A string representing the size in B, KB, MB and GB.
Return type
str
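For example (the exact string formatting may differ slightly):
>>> human_size(3_500_000)               # roughly '3.34 MB'
>>> human_size(3_500_000, precision=0)  # same value with no decimal places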
ads.common.utils.inject_and_copy_kwargs(kwargs, **args)
Takes in a dictionary and returns a copy with the args injected
Examples
Parameters
• kwargs (dict) – The original kwargs.
• **args (type) – A series of arguments, foo=42, bar=12 etc
Returns
d – new dictionary object that you can use in place of kwargs
Return type
dict
BINARY_TEXT_CLASSIFICATION = 4
MULTI_CLASS_CLASSIFICATION = 3
MULTI_CLASS_TEXT_CLASSIFICATION = 5
REGRESSION = 1
UNSUPPORTED = 6
ads.common.utils.numeric_pandas_dtypes()
Returns a list of the “numeric” pandas data types
ads.common.utils.oci_config_file()
Returns the OCI config file location
ads.common.utils.oci_config_location()
Returns oci configuration file location.
ads.common.utils.oci_config_profile()
Returns the OCI config profile location.
ads.common.utils.oci_key_location()
Returns the OCI key location
ads.common.utils.oci_key_profile()
Returns key profile value specified in oci configuration file.
ads.common.utils.print_user_message(msg, display_type='tip', see_also_links=None, title='Tip')
This method is deprecated and will be removed in future releases. Prints in html formatted block one of
tip|info|warn type.
Parameters
• msg (str or list) – The actual message to display. If display_type is 'module', msg can be a list of [module name, module package name], for example ["automl", "ads[ml]"].
• display_type (str (default 'tip')) – The type of user message.
• see_also_links (list of tuples in the form of [('display_name', 'url')])
–
• title (str (default 'tip')) – The title of user message.
ads.common.utils.random_valid_ocid(prefix='ocid1.dataflowapplication.oc1.iad')
Generates a random valid ocid.
Parameters
prefix (str) – A prefix, corresponding to a region location.
Returns
ocid – a valid ocid with the given prefix.
Return type
str
ads.common.utils.remove_file(file_path: str, auth: Optional[Dict] = None) → None
Removes a file.
Parameters
• file_path (str) – The path of the source file, which can be local path or OCI object storage
URI.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
Returns
Nothing.
Return type
None
ads.common.utils.replace_spaces(lst)
Replace all spaces with underscores for strings in the list.
Requires that the list contains strings for each element.
lst: list of strings
ads.common.utils.set_oci_config(oci_config_location, oci_config_profile)
Parameters
• oci_config_location – location of the config file, for example, ~/.oci/config
• oci_config_profile – The profile to load from the config file. Defaults to “DEFAULT”
ads.common.utils.snake_to_camel(name: str, capitalized_first_token: Optional[bool] = False) → str
Converts the snake case string to the camel representation.
Parameters
• name (str) – The name to convert.
• capitalized_first_token ((bool, optional). Defaults to False.) – Whether the first token needs to be capitalized or not.
Returns
The name converted to the camel representation.
Return type
str
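For illustration (the outputs in the comments are assumptions about the casing behavior):
>>> snake_to_camel("data_science_model")                                 # 'dataScienceModel'
>>> snake_to_camel("data_science_model", capitalized_first_token=True)  # 'DataScienceModel'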
ads.common.utils.split_data(X, y, random_state=42, test_size=0.3)
Splits data using Sklearn based on the input type of the data.
Parameters
• X (a Pandas Dataframe) – The data points.
• y (a Pandas Dataframe) – The labels.
• random_state (int) – A random state for reproducibility.
• test_size (float) – The proportion of the data that should be included in the test dataset.
ads.common.utils.to_dataframe(data: Union[list, tuple, Series, ndarray, DataFrame])
Convert to pandas DataFrame.
Parameters
data (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame]) – Convert data
to pandas DataFrame.
Returns
pandas DataFrame.
Return type
pd.DataFrame
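For example:
>>> import numpy as np
>>> to_dataframe(np.array([[1, 2], [3, 4]]))   # 2x2 pandas DataFrame
>>> to_dataframe([10, 20, 30])                 # single-column pandas DataFrame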
ads.common.utils.truncate_series_top_n(series, n=24)
Takes a series that can be interpreted as a dict (index=key), sorts it by value, keeps the top-n values, and returns a new series.
ads.common.utils.wrap_lines(li, heading='')
Wraps the elements of an iterable into a multi-line string of fixed width
class ads.common.model_metadata_mixin.MetadataMixin
Bases: object
MetadataMixin class which populates the custom metadata, taxonomy metadata, input/output schema and prove-
nance metadata.
populate_metadata(use_case_type: Optional[str] = None, data_sample: Optional[ADSData] = None,
X_sample: Optional[Union[list, tuple, DataFrame, Series, ndarray]] = None,
y_sample: Optional[Union[list, tuple, DataFrame, Series, ndarray]] = None,
training_script_path: Optional[str] = None, training_id: Optional[str] = None,
ignore_pending_changes: bool = True, max_col_num: int = 2000)
Populates input schema and output schema. If the schema exceeds the limit of 32 KB, it is saved as JSON files to the artifact directory.
Parameters
• use_case_type ((str, optional). Defaults to None.) – The use case type of the
model.
• data_sample ((ADSData, optional). Defaults to None.) – A sample of the data
that will be used to generate input_schema and output_schema.
• X_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame].
Defaults to None.) – A sample of input data that will be used to generate input
schema.
• y_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame].
Defaults to None.) – A sample of output data that will be used to generate output
schema.
• training_script_path (str. Defaults to None.) – Training script path.
• training_id ((str, optional). Defaults to None.) – The training model OCID.
• ignore_pending_changes (bool. Defaults to True.) – Ignore the pending changes in git.
• max_col_num ((int, optional). Defaults to utils.
DATA_SCHEMA_MAX_COL_NUM.) – The maximum number of columns allowed in
auto generated schema.
Returns
Nothing.
Return type
None
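A hedged sketch of populating metadata on a model serializer that includes this mixin; model, X_test, and y_test are assumed to already exist:
>>> model.populate_metadata(
...     use_case_type=UseCaseType.BINARY_CLASSIFICATION,
...     X_sample=X_test,
...     y_sample=y_test,
... )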
17.1.1.4.1 Submodules
exception ads.bds.auth.KRB5KinitError
Bases: Exception
Raised when the kinit -kt command fails to generate a cached ticket from the keytab file and the krb5 config file.
ads.bds.auth.has_kerberos_ticket()
Whether a Kerberos cached ticket exists.
ads.bds.auth.init_ccache_with_keytab(principal: str, keytab_file: str) → None
Initialize credential cache using keytab file.
Parameters
• principal (str) – The unique identity to which Kerberos can assign tickets.
• keytab_file (str) – Path to your keytab file.
Returns
Nothing.
Return type
None
ads.bds.auth.krbcontext(principal: str, keytab_path: str, kerb5_path: str = '~/.bds_config/krb5.conf') → None
A context manager for Kerberos-related actions. It provides a Kerberos context that you can put code inside. It initializes the credential cache automatically with the keytab if no cached ticket exists; otherwise, it does nothing.
Parameters
• principal (str) – The unique identity to which Kerberos can assign tickets.
• keytab_path (str) – Path to your keytab file.
Examples
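A minimal sketch of wrapping code that requires a valid Kerberos ticket; the principal and keytab path are placeholders:
>>> from ads.bds.auth import krbcontext
>>> with krbcontext(principal="user@EXAMPLE.COM", keytab_path="~/path/to/user.keytab"):
...     # code that needs a Kerberos ticket, e.g. connecting to HDFS or Hive
...     pass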
Examples
17.1.1.5.1 Submodules
class ads.data_labeling.interface.loader.Loader
Bases: ABC
Data Loader Interface.
abstract load(**kwargs) → Any
class ads.data_labeling.interface.parser.Parser
Bases: ABC
Data Parser Interface.
abstract parse() → Any
class ads.data_labeling.interface.reader.Reader
Bases: ABC
Data Reader Interface.
info() → Serializable
top_left
Top left corner of this bounding box.
Type
Tuple[float, float]
bottom_left
Bottom left corner of this bounding box.
Type
Tuple[float, float]
bottom_right
Bottom right corner of this bounding box.
Type
Tuple[float, float]
top_right
Top right corner of this bounding box.
Type
Tuple[float, float]
Examples
labels: List[str]
Examples
items: List[BoundingBoxItem ]
Returns
The list of YOLO formatted bounding boxes.
Return type
List[Tuple[int, float, float, float, float]]
Raises
• ValueError – When categories list not provided. When categories list not matched with
the labels.
• TypeError – When categories list has a wrong format.
class ads.data_labeling.constants.AnnotationType
Bases: object
AnnotationType class which contains all the annotation types that data labeling service supports.
BOUNDING_BOX = 'BOUNDING_BOX'
ENTITY_EXTRACTION = 'ENTITY_EXTRACTION'
MULTI_LABEL = 'MULTI_LABEL'
SINGLE_LABEL = 'SINGLE_LABEL'
class ads.data_labeling.constants.DatasetType
Bases: object
DatasetType class which contains all the dataset types that data labeling service supports.
DOCUMENT = 'DOCUMENT'
IMAGE = 'IMAGE'
TEXT = 'TEXT'
class ads.data_labeling.constants.Formats
Bases: object
Common formats class which contains all the common formats that are supported to convert to.
SPACY = 'spacy'
YOLO = 'yolo'
Examples
Returns
pandas dataframe which contains the dataset information.
Return type
pandas.DataFrame
Raises
Exception – If pagination.list_call_get_all_results() fails
Type
str
dataset_type
Type of the dataset. Currently supports TEXT, IMAGE, and DOCUMENT.
Type
str
annotation_type: str = ''
to_dataframe() → DataFrame
Converts the metadata to dataframe format.
Returns
The metadata in Pandas dataframe format.
Return type
pandas.DataFrame
to_dict() → Dict
Converts to dictionary representation.
Returns
The metadata in dictionary type.
Return type
Dict
length: int = 0
offset: int = 0
to_spacy() → tuple
Converts one NERItem to the spacy format.
Returns
NERItem in the spacy format
Return type
Tuple
class ads.data_labeling.ner.NERItems(items: ~typing.List[~ads.data_labeling.ner.NERItem] = <factory>)
Bases: object
NERItems class consists of a list of NERItem.
items
List of NERItem.
Type
List[NERItem]
items: List[NERItem ]
to_spacy() → List[tuple]
Converts NERItems to the spacy format.
Returns
List of NERItems in the Spacy format.
Return type
List[tuple]
exception ads.data_labeling.ner.WrongEntityFormatLabelIsEmpty
Bases: ValueError
exception ads.data_labeling.ner.WrongEntityFormatLabelNotString
Bases: ValueError
exception ads.data_labeling.ner.WrongEntityFormatLengthIsNegative
Bases: ValueError
exception ads.data_labeling.ner.WrongEntityFormatLengthNotInteger
Bases: ValueError
exception ads.data_labeling.ner.WrongEntityFormatOffsetIsNegative
Bases: ValueError
exception ads.data_labeling.ner.WrongEntityFormatOffsetNotInteger
Bases: ValueError
to_dict() → Dict
Convert the Record instance to a dictionary.
Returns
Dictionary representation of the Record instance.
Return type
Dict
to_tuple() → Tuple[str, Any, Union[Tuple, str, List[BoundingBoxItem], List[NERItem]]]
Convert the Record instance to a tuple.
Returns
Tuple representation of the Record instance.
Return type
Tuple
class ads.data_labeling.mixin.data_labeling.DataLabelingAccessMixin
Bases: object
Mixin class for labeled text data.
static read_labeled_data(path: Optional[str] = None, dataset_id: Optional[str] = None,
compartment_id: Optional[str] = None, auth: Optional[Dict] = None,
materialize: bool = False, encoding: str = 'utf-8', include_unlabeled: bool =
False, format: Optional[str] = None, chunksize: Optional[int] = None)
Loads the dataset generated by data labeling service from either the export file or the Data Labeling Service.
Parameters
• path ((str, optional). Defaults to None) – The export file path, can be either
local or object storage path.
• dataset_id ((str, optional). Defaults to None) – The dataset OCID.
• compartment_id (str. Defaults to the compartment_id from the env
variable.) – The compartment OCID of the dataset.
• auth ((dict, optional). Defaults to None) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• materialize ((bool, optional). Defaults to False) – Whether the content of
the dataset file should be loaded or it should return the file path to the content. By default
the content will not be loaded.
• encoding ((str, optional). Defaults to 'utf-8') – Encoding of files. Only used
for “TEXT” dataset.
• include_unlabeled ((bool, optional). Default to False) – Whether to load
the unlabeled records or not.
• format ((str, optional). Defaults to None) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type.
– When None, it outputs List[NERItem] or List[BoundingBoxItem],
– When “spacy”, it outputs List[Tuple],
– When “yolo”, it outputs List[List[Tuple]].
• chunksize ((int, optional). Defaults to None) – The amount of records that
should be read in one iteration. The result will be returned in a generator format.
Returns
pd.DataFrame if chunksize is not specified. Generator[pd.DataFrame] if chunksize is specified.
Return type
Union[Generator[pd.DataFrame, Any, Any], pd.DataFrame]
Examples
>>> df = pd.DataFrame.ads.read_labeled_data_from_dls(dataset_id="your_dataset_ocid",
...                                                   compartment_id="your_compartment_id",
...                                                   auth=authutil.api_keys(),
...                                                   materialize=False)
Path Content Annotations
--------------------------------------------------------------------
0 path/to/the/content/file yes
1 path/to/the/content/file no
Examples
Examples
class ads.data_labeling.parser.export_metadata_parser.MetadataParser
Bases: Parser
MetadataParser class which parses the metadata from the record.
EXPECTED_KEYS = ['id', 'compartmentId', 'displayName', 'labelsSet',
'annotationFormat', 'datasetSourceDetails', 'datasetFormatDetails']
class ads.data_labeling.parser.export_record_parser.BoundingBoxRecordParser(dataset_source_path:
str, format:
Optional[str] =
None, categories:
Op-
tional[List[str]]
= None)
Bases: RecordParser
BoundingBoxRecordParser class which parses the label of BoundingBox label data.
Initiates a RecordParser instance.
Parameters
• dataset_source_path (str) – Dataset source path.
• format ((str, optional). Defaults to None.) – Output format of annotations.
• categories ((List[str], optional). Defaults to None.) – The list of object cat-
egories in proper order for model training. Example: [‘cat’,’dog’,’horse’]
Returns
RecordParser instance.
Return type
RecordParser
class ads.data_labeling.parser.export_record_parser.EntityType
Bases: object
Entity type class for supporting multiple types of entities.
GENERIC = 'GENERIC'
IMAGEOBJECTSELECTION = 'IMAGEOBJECTSELECTION'
TEXTSELECTION = 'TEXTSELECTION'
class ads.data_labeling.parser.export_record_parser.MultiLabelRecordParser(dataset_source_path:
str, format:
Optional[str] =
None, categories:
Op-
tional[List[str]] =
None)
Bases: RecordParser
MultiLabelRecordParser class which parses the label of Multiple label data.
Initiates a RecordParser instance.
Parameters
• dataset_source_path (str) – Dataset source path.
• format ((str, optional). Defaults to None.) – Output format of annotations.
• categories ((List[str], optional). Defaults to None.) – The list of object cat-
egories in proper order for model training. Example: [‘cat’,’dog’,’horse’]
Returns
RecordParser instance.
Return type
RecordParser
class ads.data_labeling.parser.export_record_parser.NERRecordParser(dataset_source_path: str,
format: Optional[str] =
None, categories:
Optional[List[str]] =
None)
Bases: RecordParser
NERRecordParser class which parses the label of NER label data.
Initiates a RecordParser instance.
Parameters
• dataset_source_path (str) – Dataset source path.
• format ((str, optional). Defaults to None.) – Output format of annotations.
• categories ((List[str], optional). Defaults to None.) – The list of object cat-
egories in proper order for model training. Example: [‘cat’,’dog’,’horse’]
Returns
RecordParser instance.
Return type
RecordParser
class ads.data_labeling.parser.export_record_parser.RecordParser(dataset_source_path: str,
format: Optional[str] = None,
categories: Optional[List[str]]
= None)
Bases: Parser
RecordParser class which parses the labels from the record.
Examples
>>> labels.append(bounding_box_labels)
The module containing classes to read labeled datasets. Allows reading labeled datasets from exports or from the cloud.
Classes
LabeledDatasetReader
The LabeledDatasetReader class to read labeled dataset.
ExportReader
The ExportReader class to read labeled dataset from the export.
DLSDatasetReader
The DLSDatasetReader class to read labeled dataset from the cloud.
Examples
>>> ds_reader.read()
Path Content Annotations
-----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
2 path/to/the/content/file3 file content no
>>> next(ds_reader.read(iterator=True))
("path/to/the/content/file1", "file content", "yes")
>>> next(ds_reader.read(chunksize=2))
Path Content Annotations
----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
Raises
• ValueError – When dataset_id is empty or not a string.
• TypeError – When dataset_id is not a string.
info() → Metadata
Gets the labeled dataset metadata.
Returns
The labeled dataset metadata.
Return type
Metadata
read(format: Optional[str] = None) → Generator[Tuple, Any, Any]
Reads the labeled dataset records.
Parameters
format ((str, optional). Defaults to None.) – Output format of annotations. Can
be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type.
When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs
List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
Returns
The labeled dataset records.
Return type
Generator[Tuple, Any, Any]
class ads.data_labeling.reader.dataset_reader.ExportReader(path: str, auth: Optional[Dict] = None,
encoding='utf-8', materialize: bool =
False, include_unlabeled: bool =
False)
Bases: Reader
The ExportReader class to read labeled dataset from the export.
info(self ) → Metadata
Gets the labeled dataset metadata.
read(self ) → Generator[Tuple, Any, Any]
Reads the labeled dataset.
Initializes the labeled dataset export reader instance.
Parameters
• path (str) – The metadata file path, can be either local or object storage path.
• auth ((dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files. The en-
coding is used to extract the metadata information of the labeled dataset and also to extract
the content of the text dataset records.
• materialize ((bool, optional). Defaults to False.) – Whether the content of
dataset files should be loaded/materialized or not. By default the content will not be materi-
alized.
Examples
>>> ds_reader.info()
------------------------------------------------------------------------
annotation_type SINGLE_LABEL
compartment_id TEST_COMPARTMENT
dataset_id TEST_DATASET
dataset_name test_dataset_name
dataset_type TEXT
labels ['yes', 'no']
records_path path/to/records
source_path path/to/dataset
>>> ds_reader.read()
Path Content Annotations
-----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
2 path/to/the/content/file3 file content no
>>> next(ds_reader.read(iterator=True))
("path/to/the/content/file1", "file content", "yes")
>>> next(ds_reader.read(chunksize=2))
Path Content Annotations
----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
• auth ((dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
• materialize ((bool, optional). Defaults to False.) – Whether the content of
the dataset file should be loaded or it should return the file path to the content. By default
the content will not be loaded.
Returns
The LabeledDatasetReader instance.
Return type
LabeledDatasetReader
classmethod from_export(path: str, auth: Optional[dict] = None, encoding: str = 'utf-8', materialize:
bool = False, include_unlabeled: bool = False) → LabeledDatasetReader
Constructs Labeled Dataset Reader instance.
Parameters
• path (str) – The metadata file path, can be either local or object storage path.
• auth ((dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
• materialize ((bool, optional). Defaults to False.) – Whether the content of
the dataset file should be loaded or it should return the file path to the content. By default
the content will not be loaded.
Returns
The LabeledDatasetReader instance.
Return type
LabeledDatasetReader
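An illustrative sketch; the export metadata path is a placeholder:
>>> from ads.data_labeling.reader.dataset_reader import LabeledDatasetReader
>>> ds_reader = LabeledDatasetReader.from_export(
...     path="oci://bucket_name@namespace/dataset_export/metadata.jsonl",
...     materialize=True,
... )
>>> ds_reader.info()
>>> df = ds_reader.read()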
info() → Serializable
Gets the labeled dataset metadata.
Returns
The labeled dataset metadata.
Return type
Metadata
read(iterator: bool = False, format: Optional[str] = None, chunksize: Optional[int] = None) →
Union[Generator[Any, Any, Any], DataFrame]
Reads the labeled dataset records.
Parameters
• iterator ((bool, optional). Defaults to False.) – True if the result should be represented as a Generator. False if the result should be represented as a Pandas DataFrame.
• format ((str, optional). Defaults to None.) – Output format of annotations.
Can be None, “spacy” or “yolo”.
Examples
Parameters
• path (str) – object storage path or local path for a file.
• auth ((dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• encoding ((str, optional). Defaults to 'utf-8'.) – Encoding of files. Only used
for “TEXT” dataset.
Examples
read() → Metadata
Reads the content from the metadata file.
Returns
The metadata of the labeled dataset.
Return type
Metadata
class ads.data_labeling.reader.metadata_reader.MetadataReader(reader: Reader)
Bases: object
MetadataReader class which reads and extracts the labeled dataset metadata.
Examples
Examples
>>> import os
>>> import oci
>>> from ads.data_labeling import RecordReader
>>> from ads.common import auth as authutil
>>> file_path = "/path/to/your_record.jsonl"
>>> dataset_type = "IMAGE"
>>> annotation_type = "BOUNDING_BOX"
>>> record_reader = RecordReader.from_export_file(file_path, dataset_type, annotation_type, "image_file_path", authutil.api_keys())
>>> next(record_reader.read())
• auth ((dict, optional). Defaults to None.) – The default authentication is set us-
ing ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
• materialize ((bool, optional). Defaults to False.) – Whether the content of
the dataset file should be loaded or it should return the file path to the content. By default
the content will not be loaded.
• format ((str, optional). Defaults to None.) – Output format of annotations.
Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection
type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it
outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
• categories ((List[str], optional). Defaults to None.) – The list of object
categories in proper order for model training. Example: [‘cat’,’dog’,’horse’]
Returns
The RecordReader instance.
Return type
RecordReader
classmethod from_export_file(path: str, dataset_type: str, annotation_type: str, dataset_source_path:
str, auth: Optional[Dict] = None, include_unlabeled: bool = False,
encoding: str = 'utf-8', materialize: bool = False, format: Optional[str]
= None, categories: Optional[List[str]] = None,
includes_metadata=False) → RecordReader
Initiates a RecordReader instance.
Parameters
• path (str) – Record file path.
• dataset_type (str) – Dataset type. Currently supports TEXT, IMAGE and DOCU-
MENT.
• annotation_type (str) – Annotation Type. Currently TEXT supports SIN-
GLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION. IMAGE supports SIN-
GLE_LABEL, MULTI_LABEL and BOUNDING_BOX. DOCUMENT supports SIN-
GLE_LABEL and MULTI_LABEL.
• dataset_source_path (str) – Dataset source path.
• auth ((dict, optional). Default None) – The default authentication is set using
ads.set_auth API. If you need to override the default, use the ads.common.auth.api_keys
or ads.common.auth.resource_principal to create appropriate authentication signer and
kwargs required to instantiate IdentityClient object.
• include_unlabeled ((bool, optional). Default to False.) – Whether to load
the unlabeled records or not.
• encoding ((str, optional). Defaults to "utf-8".) – Encoding for text files.
Used only to extract the content of the text dataset contents.
• materialize ((bool, optional). Defaults to False.) – Whether to materialize
the content by loader.
• format ((str, optional). Defaults to None.) – Output format of annotations.
Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection
Examples
class ads.data_labeling.visualizer.image_visualizer.ImageLabeledDataFormatter
Bases: object
The ImageLabeledDataFormatter class to render image items in a notebook session.
static render_item(item: LabeledImageItem, options: Optional[Dict] = None, path: Optional[str] =
None) → None
Renders image dataset.
Parameters
• item (LabeledImageItem) – Item to render.
• options (Optional[dict]) – Render options.
• path (str) – Path to save the image with annotations to local directory.
Returns
Nothing.
Return type
None
Raises
• ValueError – If items are not provided. If the path is not valid.
• TypeError – If items are provided in the wrong format.
class ads.data_labeling.visualizer.image_visualizer.LabeledImageItem(img: ImageFile, boxes:
List[BoundingBoxItem])
Bases: object
Data class representing Image Item.
img
the labeled image object.
Type
ImageFile
boxes
a list of BoundingBoxItem
Type
List[BoundingBoxItem]
boxes: List[BoundingBoxItem ]
img: ImageFile
colors
The multiple specified colors.
Type
Optional[dict]
colors: Optional[dict]
default_color: str
Examples
Examples
>>> display(HTML(result))
txt: str
default_color: str
Raises
• ValueError – If items are not provided.
• TypeError – If items are provided in the wrong format.
ads.data_labeling.visualizer.text_visualizer.render(items: List[LabeledTextItem], options:
Optional[Dict] = None) → str
Renders NER dataset to Html format.
Parameters
• items (List[LabeledTextItem]) – The list of NER items to render.
• options (dict, optional) – The options for rendering.
Returns
Html string.
Return type
str
Examples
>>> display(HTML(result))
17.1.1.6.1 Subpackages
17.1.1.6.2 Submodules
• kwargs (dict, optional) – Name-value pairs that are to be added to the list of connection
parameters. For example, database_name=”mydb”, database_type=”oracle”, username =
“root”, password = “pwd”.
Return type
A Connector object.
connect()
class ads.database.connection.OracleConnector(oracle_connection_config)
Bases: object
ads.database.connection.get_repository(key: str, repository_path: Optional[str] = None) → dict
Get all values from local database store.
Parameters
• key (str) – The key to find the database directory.
• repository_path (str, optional) – The path to local database store, default to
~/.database unless specified otherwise.
Return type
A dictionary of all values in the store.
ads.database.connection.import_wallet(wallet_path: str, key: str, repository_path: Optional[str] = None)
→ None
Saves a wallet to the local database store. Unzips the wallet zip file, updates sqlnet.ora, and stores the wallet files.
Parameters
• wallet_path (str) – The local path to the downloaded wallet zip file.
• key (str) – The key to find the database directory.
• repository_path (str, optional) – The local database store, default to ~/.database
unless specified otherwise.
ads.database.connection.update_repository(value: dict, key: str, replace: bool = True, repository_path:
Optional[str] = None) → dict
Saves value into local database store.
Parameters
• value (dict) – The values to store locally.
• key (str) – The key to find the local database directory.
• replace (bool, default to True) – If set to false, updates the stored value.
• repository_path (str: str, optional) – The local database store, default to
~/.database unless specified otherwise.
Return type
A dictionary of all values in the repository for the given key.
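A hedged sketch of storing and retrieving connection values with the functions above; the key and credential values are placeholders:
>>> from ads.database import connection
>>> creds = {"database_name": "mydb", "username": "admin", "password": "<password>"}
>>> connection.update_repository(value=creds, key="my_database_key")
>>> connection.get_repository(key="my_database_key")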
17.1.1.7.1 Submodules
class ads.dataflow.dataflow.DataFlow(compartment_id=None,
dataflow_base_folder='/home/datascience/dataflow', os_auth=None,
df_auth=None)
Bases: object
create_app(app_config: dict, overwrite_script=False, overwrite_archive=False) → object
Create a new dataflow application with the supplied app config. app_config contains parameters needed to
create a new application, according to oci.data_flow.models.CreateApplicationDetails.
Parameters
• app_config (dict) – the config file that contains all necessary parameters used to create
a dataflow app
• overwrite_script (bool) – whether to overwrite the existing pyscript script on Object
Storage
• overwrite_archive (bool) – whether to overwrite the existing archive file on Object
Storage
Returns
df_app – New dataflow application.
Return type
oci.dataflow.models.Application
get_app(app_id: str)
Get the Project based on app_id.
Parameters
app_id (str, required) – The OCID of the dataflow app to get.
Returns
app – The oci.dataflow.models.Application with the matching ID.
Return type
oci.dataflow.models.Application
list_apps(include_deleted: bool = False, compartment_id: Optional[str] = None, datetime_format: str =
'%Y-%m-%d %H:%M:%S', **kwargs) → object
List all apps in a given compartment, or in the current notebook session’s compartment.
Parameters
• include_deleted (bool, optional, default=False) – Whether to include deleted
apps in the returned list.
• compartment_id (str, optional, default: NB_SESSION_COMPARTMENT_OCID) –
The compartment specified to list apps.
• datetime_format (str, optional, default: '%Y-%m-%d %H:%M:%S') – Change
format for date time fields.
Returns
dsl – List of Dataflow applications.
Return type
List
load_app(app_id: str, target_folder: Optional[str] = None) → object
Load an existing dataflow application based on application id. The existing dataflow application can be
created either from dataflow service or the dataflow integration of ADS.
Parameters
• app_id (str, required) – The OCID of the dataflow app to load.
• target_folder (str, optional,) – the folder to store the local artifacts of this appli-
cation. If not specified, the target_folder will use the dataflow_base_folder by default.
Returns
dfa – A dataflow application of type ads.dataflow.dataflow.DataFlowApp
Return type
ads.dataflow.dataflow.DataFlowApp
prepare_app(display_name: str, script_bucket: str, pyspark_file_path: str, spark_version: str = '2.4.4',
compartment_id: Optional[str] = None, archive_path: Optional[str] = None, archive_bucket:
Optional[str] = None, logs_bucket: str = 'dataflow-logs', driver_shape: str =
'VM.Standard2.4', executor_shape: str = 'VM.Standard2.4', num_executors: int = 1,
arguments: list = [], script_parameters: dict = []) → dict
Check if the parameters provided by users to create an application are valid and then prepare
app_configuration for creating an app or saving for future reuse.
Parameters
• display_name (str, required) – A user-friendly name. This name is not necessarily
unique.
• script_bucket (str, required) – bucket in object storage to upload the pyspark file
• pyspark_file_path (str, required) – path to the pyspark file
• spark_version (str) – Allowed values are “2.4.4”, “3.0.2”.
• compartment_id (str) – OCID of the compartment to create a dataflow app. If not pro-
vided, compartment_id will use the same as the notebook session.
• archive_path (str, optional) – path to the archive file
• archive_bucket (str, optional) – bucket in object storage to upload the archive file
• logs_bucket (str, default is 'dataflow-logs') – bucket in object storage to put
run logs
• driver_shape (str) – The value to assign to the driver_shape property of this
CreateApplicationDetails. Allowed values for this property are: “VM.Standard2.1”,
“VM.Standard2.2”, “VM.Standard2.4”, “VM.Standard2.8”, “VM.Standard2.16”,
“VM.Standard2.24”.
• executor_shape (str) – The value to assign to the executor_shape property of this
CreateApplicationDetails. Allowed values for this property are: “VM.Standard2.1”,
“VM.Standard2.2”, “VM.Standard2.4”, “VM.Standard2.8”, “VM.Standard2.16”,
“VM.Standard2.24”.
• num_executors (int) – The number of executor VMs requested.
• arguments (list of str) – The values passed into the command line string to run the
application
• script_parameters (dict) – The value of the parameters passed to the running appli-
cation as command line arguments for the pyspark script.
Returns
app_configuration
Return type
dictionary containing all the validated params for CreateApplicationDetails.
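An illustrative sketch of preparing and creating an application with the methods above; the bucket and file paths are placeholders:
>>> data_flow = DataFlow()
>>> app_config = data_flow.prepare_app(
...     display_name="sample-pyspark-app",
...     script_bucket="<script-bucket>",
...     pyspark_file_path="local/path/to/example.py",
... )
>>> app = data_flow.create_app(app_config)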
template(job_type: str = 'standard_pyspark', script_str: str = '', file_dir: Optional[str] = None, file_name:
Optional[str] = None) → str
Populates a prewritten PySpark or SparkSQL Python script with user-provided additional lines and saves it in a local directory.
Parameters
• job_type (str, default is 'standard_pyspark') – Currently supports two types,
‘standard_pyspark’ or ‘sparksql’
• script_str (str, optional, default is '') – code provided by user to write in the
python script
• file_dir (str, optional) – Directory to save the python script in local directory
• file_name (str, optional) – name of the python script to save to the local directory
Returns
script_path – Path to the template generated python file in local directory
Return type
str
class ads.dataflow.dataflow.DataFlowApp(app_config, app_response, app_dir, oci_link, **kwargs)
Bases: DataFlow
property config: dict
Retrieve the app_config file used to create the data flow app
Returns
app_config – dictionary containing all the validated params for this DataFlowApp
Return type
Dict
get_run(run_id: str)
Get the Run based on run_id
Parameters
run_id (str, required) – The OCID of the dataflow run to get.
Returns
df_run – The oci.dataflow.models.Run with the matching ID.
Return type
oci.dataflow.models.Run
list_runs(include_failed: bool = False, datetime_format: str = '%Y-%m-%d %H:%M:%S', **kwargs) →
object
List all runs of a dataflow app
Parameters
• include_failed (bool, optional, default=False) – Whether to include failed
runs in the returned list
• datetime_format (str, optional, default: '%Y-%m-%d %H:%M:%S') – Change
format for date time fields
Returns
df_runs – List of Data flow runs.
Return type
List
property oci_link: object
Retrieve the oci link of the data flow app
Returns
oci_link – a link to the app page in an oci console.
Return type
str
prepare_run(run_display_name: str, compartment_id: Optional[str] = None, logs_bucket: str = '',
driver_shape: str = 'VM.Standard2.4', executor_shape: str = 'VM.Standard2.4',
num_executors: int = 1, **kwargs) → dict
Check if the parameters provided by users to create a run are valid and then prepare run_config for creating
run details.
Parameters
• run_display_name (str) – A user-friendly name. This name is not necessarily unique.
• compartment_id (str) – OCID of the compartment to create a dataflow run. If not pro-
vided, compartment_id will use the same as the dataflow app.
• logs_bucket (str) – bucket in object storage to put run logs, if not provided, will use the
same logs_bucket as defined in app_config
• driver_shape (str) – The value to assign to the driver_shape property of this
CreateApplicationDetails. Allowed values for this property are: “VM.Standard2.1”,
“VM.Standard2.2”, “VM.Standard2.4”, “VM.Standard2.8”, “VM.Standard2.16”,
“VM.Standard2.24”.
• executor_shape (str) – The value to assign to the executor_shape property of this
CreateApplicationDetails. Allowed values for this property are: “VM.Standard2.1”,
“VM.Standard2.2”, “VM.Standard2.4”, “VM.Standard2.8”, “VM.Standard2.16”,
“VM.Standard2.24”.
• num_executors (int) – The number of executor VMs requested.
Returns
run_config – Dictionary containing all the validated params for CreateRunDetails.
Return type
Dict
run(run_config: dict, save_log_to_local: bool = False, copy_script_to_object_storage: bool = True,
copy_archive_to_object_storage: bool = True, pyspark_file_path: Optional[str] = None, archive_path:
Optional[str] = None, wait: bool = True) → object
Create a new dataflow run with the supplied run config. run_config contains parameters needed to create a
new run, according to oci.data_flow.models.CreateRunDetails.
Parameters
• run_config (dict, required) – The config file that contains all necessary parameters
used to create a dataflow run
• save_log_to_local (bool, optional) – A boolean value that defaults to false. If set
to true, it saves the log files to local dir
• copy_script_to_object_storage (bool, optional) – A boolean value that defaults
to true. Local script will be copied to object storage
• copy_archive_to_object_storage (bool, optional) – A boolean value that de-
faults to true. Local archive file will be copied to object storage
• pyspark_file_path (str, optional) – The pyspark file path used for creating the
dataflow app. if pyspark_file_path isn’t specified then reuse the path that the app was cre-
ated with.
• archive_path (str, optional) – The archive file path used for creating the dataflow
app. if archive_path isn’t specified then reuse the path that the app was created with.
• wait (bool, optional) – A boolean value that defaults to true. When True, the return
will be ads.dataflow.dataflow.DataFlowRun in terminal state. When False, the return will
be a ads.dataflow.dataflow.RunObserver.
Returns
df_run – Either a new Data Flow run or a run observer.
Return type
Variable
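A hedged sketch of preparing and launching a run on an existing DataFlowApp instance; app is assumed to be the application created earlier:
>>> run_config = app.prepare_run(run_display_name="sample-run")
>>> df_run = app.run(run_config, save_log_to_local=True)
>>> df_run.status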
class ads.dataflow.dataflow.DataFlowLog(text, oci_path, log_local_dir)
Bases: object
head(n: int = 10)
Show the first n lines of the log as the output of the notebook cell
Parameters
n (int, default is 10) – the number of lines from head of the log file
Return type
None
property local_dir
Get the local directory where the log file is saved.
Returns
local_dir – Path to the local directory where the log file is saved.
Return type
str
property local_path
Get the path of the log file in local directory
Returns
local_path – Path of the log file in local directory
Return type
str
property oci_path
Get the path of the log file in object storage
Returns
oci_path – Path of the log file in object storage
Return type
str
save(log_dir=None)
Save the log file to a local directory.
Parameters
log_dir (str, optional) – The path to the local directory in which to save the log file. If not set, the log is saved to the _local_dir by default.
Return type
None
show_all()
Show all content of the log as the output of the notebook cell
Return type
None
tail(n: int = 10)
Show the last n lines of the log as the output of the notebook cell
Parameters
n (int, default is 10) – the number of lines from tail of the log file
Return type
None
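A brief, hedged sketch of how these helpers combine; log is assumed to be a DataFlowLog obtained from a run, for example via the log_stdout property of a DataFlowRun documented below.
>>> log = df_run.log_stdout                  # DataFlowLog instance (df_run is a placeholder run object)
>>> log.head(n=5)                            # first 5 lines in the notebook cell
>>> log.tail(n=20)                           # last 20 lines
>>> log.save(log_dir="/tmp/dataflow_logs")   # persist locally; defaults to _local_dir when omitted
>>> log.oci_path                             # path of the log file in object storage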
class ads.dataflow.dataflow.DataFlowRun(run_config, run_response, save_log_to_local, local_dir,
**kwargs)
Bases: DataFlow
LOG_OUTPUTS = ['stdout', 'stderr']
Return type
DataFlowLog
property local_dir: str
Retrieve the local directory of the data flow run
Returns
local_dir – the local path to the Data Flow run
Return type
str
property log_stderr: object
Retrieve the stderr of the data flow run
Returns
log_error – a clickable link that opens the stderr log in another tab in a JupyterLab notebook environment
Return type
ads.dataflow.dataflow.DataFlowLog
property log_stdout: object
Retrieve the stdout of the data flow run
Returns
log_out – a clickable link that opens the stdout log in another tab in a JupyterLab notebook
environment
Return type
ads.dataflow.dataflow.DataFlowLog
property oci_link: object
Retrieve the oci link of the data flow run
Returns
oci_link – link to the run page in an oci console
Return type
str
property status: str
Retrieve the status of the data flow run
Returns
status – String that describes the status of the run
Return type
str
update_config(param_dict) → None
Modify the run_config file used to create the data flow run
Parameters
param_dict (Dict) – Dictionary containing the key value pairs of the run_config parameters
and the updated values.
Return type
None
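A minimal sketch of inspecting a run through the properties above; df_run is assumed to come from DataFlow.run().
>>> df_run.status                                # status string describing the run
>>> df_run.oci_link                              # link to the run page in the OCI console
>>> df_run.local_dir                             # local path for the run's artifacts
>>> df_run.update_config({"num_executors": 2})   # modify the stored run_config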
v3_0_2 = '3.0.2'
17.1.1.8.1 Submodules
Examples
>>> ds = DatasetFactory.open("iris.csv")
>>> ds_with_target = ds.set_target('class')
>>> ds_with_pos_class = ds.set_positive_class('setosa')
Parameters
• fix_imbalance (bool, defaults to True.) – Fix imbalance between classes in
dataset. Used only for classification datasets.
• correlation_threshold (float, defaults to 0.7. It must be between 0
and 1, inclusive.) – The correlation threshold where columns with correlation higher
than the threshold will be considered as strongly co-correlated and recommended to be
taken care of.
• frac (float, defaults to 1.0. Range -> (0, 1].) – What fraction of the data
should be used in the calculation?
• correlation_methods (Union[list, str], defaults to 'pearson'.) –
– ‘pearson’: Use Pearson’s Correlation between continuous features,
– ’cramers v’: Use Cramer’s V correlations between categorical features,
– ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
Examples
convert_to_text_classification(text_column: str)
Builds a new dataset with the given text column as the only feature besides target.
Parameters
text_column (str) – Feature name to use for text classification task
Returns
ds – Dataset with one text feature and a classification target
Return type
TextClassificationDataset
Examples
down_sample(sampler=None)
Fixes an imbalanced dataset by down-sampling.
Parameters
sampler (An instance of SamplerMixin) – Should implement fit_resample(X,y)
method. If None, does random down sampling.
Returns
down_sampled_ds – A down-sampled dataset.
Return type
ClassificationDataset
Examples
>>> ds = DatasetFactory.open("some_data.csv")
>>> ds_balanced_small = ds.down_sample()
up_sample(sampler='default')
Fixes imbalanced dataset by up-sampling
Parameters
Examples
>>> ds = DatasetFactory.open("some_data.csv")
>>> ds_balanced_large = ds.up_sample()
class ads.dataset.correlation_plot.BokehHeatMap(ds)
Bases: object
Generate a HeatMap or horizontal bar plot to compare features.
debug()
Return True if in debug mode, otherwise False.
flatten_corr_matrix(corr_matrix)
Flatten a correlation matrix into a pandas Dataframe.
Parameters
corr_matrix (Pandas Dataframe) – The correlation matrix to be flattened.
Returns
corr_flatten – The flattened correlation matrix.
Return type
Pandas DataFrame
generate_heatmap(corr_matrix, title: str, msg: str, correlation_threshold: float)
Generate a heatmap from a correlation matrix.
Parameters
• corr_matrix (Pandas Dataframe) – The dataframe to be used for heatmap generation.
• title (str) – title of the heatmap.
• msg (str) – An additional msg to include in the plot.
• correlation_threshold (float) – A float between 0 and 1 which is used for excluding
correlations which are not intense enough from the plot.
Returns
tab – A matplotlib Panel object which includes a plotted heatmap
Return type
matplotlib Panel
generate_target_heatmap(corr_matrix, title: str, correlation_target: str, msg: str, correlation_threshold:
float)
Generate a heatmap from a correlation matrix and its targets.
Parameters
• corr_matrix (Pandas Dataframe) – The dataframe to be used for heatmap generation.
• title (str) – title of the heatmap.
• correlation_target (str) – The target column name for computing correlations
against.
• msg (str) – An additional msg to include in the plot.
• correlation_threshold (float) – A float between 0 and 1 which is used for excluding
correlations which are not intense enough from the plot.
Returns
tab – A matplotlib Panel object which includes a plotted heatmap.
Return type
matplotlib Panel
plot_correlation_heatmap(ds, plot_type: str = 'heatmap', correlation_target: str = None,
correlation_threshold=-1, correlation_methods: str = 'pearson', **kwargs)
Plots a correlation heatmap.
Parameters
• ds (Pandas Slice) – A data slice or file
• plot_type (str Defaults to "heatmap") – The type of plot - “bar” is another option.
• correlation_target (str, Defaults to None) – the target column for correlation
calculations.
• correlation_threshold (float, Defaults to -1) – the threshold for computing
correlation heatmap elements.
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_same_size = ds.assign_column('target', lambda x: x > 15 if x is not None else None)
>>> ds_bigger = ds.assign_column('new_col', np.arange(ds.shape[0]))
astype(types)
Convert data type of features.
Parameters
types (dict) – key is the existing feature name value is the data type to which the values of
the feature should be converted. Valid data types: All numpy datatypes (Example: np.float64,
np.int64, . . . ) or one of categorical, continuous, ordinal or datetime.
Returns
updated_dataset – an ADSDataset with new data types
Return type
ADSDataset
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_reformatted = ds.astype({"target": "categorical"})
Examples
>>> ds = DatasetFactory.open("classification_data.csv")
>>> def f1(df):
...     return df.sum(axis=0)
>>> sum_ds = ds.call(f1)
compute()
corr(correlation_methods: Union[list, str] = 'pearson', frac: float = 1.0, sample_size: float = 1.0,
nan_threshold: float = 0.8, overwrite: Optional[bool] = None, force_recompute: bool = False)
Compute pairwise correlation of numeric and categorical columns, output a matrix or a list of matrices
computed using the correlation methods passed in.
Parameters
• correlation_methods (Union[list, str], default to 'pearson') –
– ‘pearson’: Use Pearson’s Correlation between continuous features,
– ’cramers v’: Use Cramer’s V correlations between categorical features,
– ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
– ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers
v’].
• frac – Is deprecated and replaced by sample_size.
• sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What
fraction of the data should be used in the calculation?
• nan_threshold (float, default to 0.8, Range -> [0, 1]) – Only compute a
correlation when the proportion of the values, in a column, is less than or equal to
nan_threshold.
• overwrite – Is deprecated and replaced by force_recompute.
• force_recompute (bool, default to be False) –
– If False, it calculates the correlation matrix if there is no cached correlation matrix.
Otherwise, it returns the cached correlation matrix.
– If True, it calculates the correlation matrix regardless whether there is cached result or
not.
Returns
correlation – The pairwise correlations as a matrix (DataFrame) or list of matrices
Return type
Union[list, pandas.DataFrame]
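A short sketch of calling corr with the parameters documented above; the file name is a placeholder.
>>> ds = DatasetFactory.open("data.csv")
>>> pearson_only = ds.corr()                          # default: Pearson on numeric columns
>>> all_corrs = ds.corr(correlation_methods="all",    # Pearson, Cramer's V, correlation ratio
...                     sample_size=0.5,              # use half of the rows
...                     force_recompute=True)         # ignore any cached matrix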
property ddf
drop_columns(columns)
Return new dataset with specified columns removed.
Parameters
columns (str or list) – columns to drop.
Returns
dataset – a dataset with specified columns dropped.
Return type
same type as the caller
Raises
ValidationError – If any of the feature names is not found in the dataset.
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_smaller = ds.drop_columns(['col1', 'col2'])
merge(data, **kwargs)
Merges this dataset with another ADSDataset or pandas dataframe.
Parameters
• data (Union[ADSDataset, pandas.DataFrame]) – Data to merge.
• kwargs (dict, optional) – additional keyword arguments that would be passed to the underlying dataframe's merge API.
Examples
rename_columns(columns)
Returns a new dataset with altered column names.
dict values must be unique (1-to-1). Labels not contained in a dict will be left as-is. Extra labels listed don’t
throw an error.
Parameters
columns (dict-like or function or list of str) – dict to rename columns selectively, or list of names to rename all columns, or a function like str.upper
Returns
dataset – A dataset with specified columns renamed.
Return type
same type as the caller
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_renamed = ds.rename_columns({'col1': 'target'})
sample(frac=None, random_state=42)
Returns random sample of dataset.
Parameters
• frac (float, optional) – Fraction of axis items to return.
• random_state (int or np.random.RandomState) – If int we create a new RandomState
with this as the seed Otherwise we draw from the passed RandomState
Returns
sampled_dataset – An ADSDataset which was randomly sampled.
Return type
ADSDataset
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_sample = ds.sample()
set_description(description)
Sets description for the dataset.
Give your dataset a description.
Parameters
description (str) – Description of the dataset.
Examples
>>> ds = DatasetFactory.open("data1.csv")
>>> ds_renamed = ds.set_description("dataset1 is from data1.csv")
set_name(name)
Sets name for the dataset.
This name will be used to filter the datasets returned by ds.list() API. Calling this API is optional. By
default name of the dataset is set to empty.
Parameters
name (str) – Name of the dataset.
Examples
>>> ds = DatasetFactory.open("data1.csv")
>>> ds_renamed = ds.set_name("dataset1")
Examples
>>> ds = DatasetFactory.open("classification_data.csv")
>>> ds_with_target = ds.set_target("target_class")
show_corr(frac: float = 1.0, sample_size: float = 1.0, nan_threshold: float = 0.8, overwrite: Optional[bool]
= None, force_recompute: bool = False, correlation_target: Optional[str] = None, plot_type: str
= 'heatmap', correlation_threshold: float = -1, correlation_methods='pearson', **kwargs)
Show heatmap or barplot of pairwise correlation of numeric and categorical columns, output three tabs
which are heatmap or barplot of correlation matrix of numeric columns vs numeric columns using pearson
correlation method, categorical columns vs categorical columns using Cramer’s V method, and numeric vs
categorical columns, excluding NA/null values and columns which have more than 80% of NA/null values.
By default, only ‘pearson’ correlation is calculated and shown in the first tab. Set correlation_methods=’all’
to show all correlation charts.
Parameters
• frac (Is superseded by sample_size) –
• sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What
fraction of the data should be used in the calculation?
• nan_threshold (float, defaults to 0.8, Range -> [0, 1]) – In the default case, the correlation is only calculated for columns which have at most 80% missing values.
• overwrite – Is deprecated and replaced by force_recompute.
• force_recompute (bool, default to be False.) –
– If False, it calculates the correlation matrix if there is no cached correlation matrix.
Otherwise, it returns the cached correlation matrix.
– If True, it calculates the correlation matrix regardless whether there is cached result or
not.
Parameters
• correlation_threshold (int, default -1) – The correlation threshold to select,
which only show features that have larger or equal correlation values than the threshold.
• selected_index (int, str, default 0) – The displayed output is stacked into an accordion widget; use selected_index to force the display to open a specific element, using the (zero offset) index or any prefix string of the name (e.g., 'corr' for correlations).
• sample_size (int, default 0) – The size (in rows) to sample for visualizations.
• visualize_features (bool, default False) – For the "Features" section, controls whether feature visualizations are shown. If not, only a summary of the numeric statistics is shown. The numeric statistics are always shown for wide (>64 features) datasets.
• correlation_methods (Union[list, str], default to 'pearson') –
– ‘pearson’: Use Pearson’s Correlation between continuous features,
– ’cramers v’: Use Cramer’s V correlations between categorical features,
– ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
– ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers
v’].
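Putting these parameters together, a hedged sketch of show_corr on an opened dataset:
>>> ds = DatasetFactory.open("data.csv")
>>> ds.show_corr()                                    # Pearson heatmap only, by default
>>> ds.show_corr(correlation_methods="all",           # all three correlation tabs
...              plot_type="bar",
...              correlation_threshold=0.5)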
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_uri = ds.snapshot()
Notes
{'name': 'Test',
'namespace': 'Test',
'doc': 'Descriptive text',
'type': 'record',
'fields': [
{'name': 'a', 'type': 'int'},
]}
where the “name” field is required, but “namespace” and “doc” are optional descriptors; “type” must always
be “record”. The list of fields should have an entry for every key of the input records, and the types are like
the primitive, complex or logical types of the Avro spec (https://avro.apache.org/docs/1.8.2/spec.html).
Examples
>>> ds = DatasetFactory.open("data.avro")
>>> ds.to_avro("my/path.avro")
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> [ds_link] = ds.to_csv("my/path.csv")
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_dask = ds.to_dask()
Notes
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_as_h2o = ds.to_h2o()
Notes
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds.to_hdf(path="my/path.h5", key="df")
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds.to_json("my/path.json")
Parameters
• filter (str, optional) – The query string to filter the dataframe, for example ds.to_pandas(filter="age > 50 and location == 'san francisco'"). See also https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
• frac (float, optional) – fraction of original data to return.
• include_transformer_pipeline (bool, default: False) – If True, (dataframe,
transformer_pipeline) is returned as a tuple
Returns
• dataframe (pandas.DataFrame) – if include_transformer_pipeline is False.
• (data, transformer_pipeline) (tuple of pandas.DataFrame and
dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds_as_df = ds.to_pandas()
Notes
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> ds.to_parquet("my/path")
• filter (str, optional) – The query string to filter the dataframe, for example ds.to_xgb(filter="age > 50 and location == 'san francisco'"). See also https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
• frac (float, optional) – fraction of original data to return.
• include_transformer_pipeline (bool, default: False) – If True, (dataframe,
transformer_pipeline) is returned as a tuple.
Returns
• dataframe (xgboost.DMatrix) – if include_transformer_pipeline is False.
• (data, transformer_pipeline) (tuple of xgboost.DMatrix and
dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> xgb_dmat = ds.to_xgb()
Notes
class ads.dataset.dataset_browser.DatasetBrowser
Bases: ABC
static GitHub(user: str, repo: str, branch: str = 'master')
Returns a GitHubDataset
static filesystem(folder: str)
Returns a LocalFilesystemDataset.
filter_list(L, filter_pattern) → List[str]
Filters a list of dataset names.
static list(filter_pattern='*') → List[str]
Return a list of dataset browser strings.
abstract open(**kwargs)
Return new dataset for the given name.
Parameters
name (str) – the name of the dataset to open.
Returns
ds
Return type
Dataset
Examples
ds_browser = DatasetBrowser("sklearn")
ds = ds_browser.open("iris")
static seaborn()
Returns a SeabornDataset.
static sklearn()
Returns a SklearnDataset.
static web(index_url: str)
Returns a WebDataset.
class ads.dataset.dataset_browser.GitHubDatasets(user: str, repo: str, branch: str)
Bases: DatasetBrowser
list(filter_pattern: str = '.*') → List[str]
Return a list of dataset browser strings.
open(name: str, **kwargs)
Return new dataset for the given name.
Parameters
name (str) – the name of the dataset to open.
Returns
ds
Return type
Dataset
Examples
ds_browser = DatasetBrowser("sklearn")
ds = ds_browser.open("iris")
class ads.dataset.dataset_browser.LocalFilesystemDatasets(folder: str)
Bases: DatasetBrowser
list(filter_pattern: str = '.*') → List[str]
Return a list of dataset browser strings.
open(name: str, **kwargs)
Return new dataset for the given name.
Parameters
name (str) – the name of the dataset to open.
Returns
ds
Return type
Dataset
Examples
ds_browser = DatasetBrowser("sklearn")
ds = ds_browser.open("iris")
class ads.dataset.dataset_browser.SeabornDatasets
Bases: DatasetBrowser
list(filter_pattern: str = '.*') → List[str]
Return a list of dataset browser strings.
open(name: str, **kwargs)
Return new dataset for the given name.
Parameters
name (str) – the name of the dataset to open.
Returns
ds
Return type
Dataset
Examples
ds_browser = DatasetBrowser("sklearn")
ds = ds_browser.open("iris")
class ads.dataset.dataset_browser.SklearnDatasets
Bases: DatasetBrowser
list(filter_pattern: str = '.*') → List[str]
Return a list of dataset browser strings.
open(name: str, **kwargs)
Return new dataset for the given name.
Parameters
name (str) – the name of the dataset to open.
Returns
ds
Return type
Dataset
Examples
ds_browser = DatasetBrowser("sklearn")
ds = ds_browser.open("iris")
sklearn_datasets = ['breast_cancer', 'diabetes', 'iris', 'wine', 'digits']
Examples
ds_browser = DatasetBrowser("sklearn")
ds = ds_browser.open("iris")
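A hedged end-to-end sketch of the browser workflow implied by the static constructors and the sklearn_datasets attribute above:
>>> browser = DatasetBrowser.sklearn()     # or .seaborn(), .GitHub(user, repo), .filesystem(folder)
>>> browser.list()                         # e.g. ['breast_cancer', 'diabetes', 'iris', 'wine', 'digits']
>>> iris_ds = browser.open("iris")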
Parameters
• correlation_threshold (float, defaults to 0.7. It must be between 0
and 1, inclusive) – the correlation threshold where columns with correlation higher
than the threshold will be considered as strongly co-correlated and recommended to be
taken care of.
• frac (Is superseded by sample_size) –
• sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What
fraction of the data should be used in the calculation?
• correlation_methods (Union[list, str], defaults to 'pearson') –
– ‘pearson’: Use Pearson’s Correlation between continuous features,
– ’cramers v’: Use Cramer’s V correlations between categorical features,
– ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
– ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers
v’].
Returns
transformed_dataset
Return type
ADSDatasetWithTarget
Examples
Parameters
• correlation_methods (Union[list, str], default to 'pearson') –
– ‘pearson’: Use Pearson’s Correlation between continuous features,
– ’cramers v’: Use Cramer’s V correlations between categorical features,
– ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
– ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers
v’].
• correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive) – The correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.
get_transformed_dataset()
Return the transformed dataset with the recommendations applied.
This method should be called after applying the recommendations using the Recommendation#show_in_notebook() API.
rename_columns(columns)
Returns a dataset with columns renamed.
select_best_features(score_func=None, k=12)
Return new dataset containing only the top k features.
Parameters
• k (int, default 12) – The top ‘k’ features to select.
• score_func (function) – Scoring function to use to rank the features. This scoring
function should take a 2d array X(features) and an array like y(target) and return a numeric
score for each feature in the same order as X.
Notes
Examples
>>> ds = DatasetBrowser("sklearn").open("iris")
>>> ds_small = ds.select_best_features(k=2)
• Identifying constant and primary key columns, which have no predictive quality,
• Imputation, to fill in missing values in noisy data:
– For continuous variables, fill with mean if less than 40% is missing, else drop,
– For categorical variables, fill with most frequent if less than 40% is missing, else drop,
• Identifying strongly co-correlated columns that tend to produce less generalizable models,
• Automatically balancing dataset for classification problems using up or down sampling.
Parameters
• correlation_methods (Union[list, str], default to 'pearson') –
– ‘pearson’: Use Pearson’s Correlation between continuous features,
– ’cramers v’: Use Cramer’s V correlations between categorical features,
– ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
– ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers
v’]
• print_code (bool, Defaults to True) – Print Python code for the suggested actions.
• correlation_threshold (float. Defaults to 0.7. It must be between 0 and 1, inclusive) – The correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.
• frac (Is superseded by sample_size) –
• sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What
fraction of the data should be used in the calculation?
• overwrite – Is deprecated and replaced by force_recompute.
• force_recompute (bool, default to be False) –
– If False, it calculates the correlation matrix if there is no cached correlation matrix.
Otherwise, it returns the cached correlation matrix.
– If True, it calculates the correlation matrix regardless whether there is cached result or
not.
Returns
suggestion dataframe
Return type
pandas.DataFrame
Examples
train_test_split(test_size=0.1, random_state=42)
Splits dataset to train and test data.
Parameters
• test_size (Union[float, int], optional, default=0.1) –
• random_state (Union[int, RandomState], optional, default=42) –
– If int, random_state is the seed used by the random number generator;
– If RandomState instance, random_state is the random number generator;
– If None, the random number generator is the RandomState instance used by np.random.
Returns
train_data, test_data – tuple of ADSData instances
Return type
tuple
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> train, test = ds.train_test_split()
Examples
>>> ds = DatasetFactory.open("data.csv")
>>> train, valid, test = ds.train_validation_test_split()
type_of_target()
Return the target type for the dataset.
Returns
target_type – an object of TypedFeature
Return type
TypedFeature
Examples
>>> ds = ds.set_target('target_class')
>>> assert(ds.type_of_target() == 'categorical')
visualize_transforms()
Render a representation of the dataset’s transform DAG.
class ads.dataset.factory.CustomFormatReaders
Bases: object
DEFAULT_SQL_ARRAYSIZE = 50000
DEFAULT_SQL_CHUNKSIZE = 12007
DEFAULT_SQL_CTU = False
DEFAULT_SQL_MIL = 128
Parameters
• path (str) – The connection URL that gets passed to sqlalchemy's create_engine method.
• table (str) – Either the name of a table to select * from, or a SQL query to be run.
• kwargs –
Returns
pd.DataFrame
static read_tsv(path: str, **kwargs) → DataFrame
Examples
>>> DatasetFactory.download("oci://Bucket/prefix/to/data/*.csv",
... "/home/datascience/data/")
Examples
>>> df = pd.DataFrame(data)
>>> ds = from_dataframe(df)
Example
>>> DatasetFactory.list_snapshots(snapshot_dir="oci://my_bucket/snapshots_dir",
... name="ads_iris_")
Returns a list of all snapshots (recursively) saved to the object storage bucket "my_bucket" with prefix "/snapshots_dir/ads_iris_**", sorted by time created.
static open(source, target=None, format='infer', reader_fn: Callable = None, name: str = None,
description='', npartitions: int = None, type_discovery=True, html_table_index=None,
column_names='infer', sample_max_rows=10000, positive_class=None,
transformer_pipeline=None, types={}, **kwargs)
Examples
>>> ds = DatasetFactory.open("oci://bucket@namespace/path/to/data.tsv",
... column_names=["col1", "col2", "col3"], header=0)
>>> ds = DatasetFactory.open("oci://bucket@namespace/path/to/data.csv",
... storage_options={"config": "~/.oci/config",
... "profile": "USER_2"}, delimiter = ';')
class ads.dataset.feature_engineering_transformer.FeatureEngineeringTransformer(feature_metadata=None)
Bases: TransformerMixin
fit(X, y=None)
class ads.dataset.helper.DatasetDefaults
Bases: object
sampling_confidence_interval = 1.0
sampling_confidence_level = 95
exception ads.dataset.helper.DatasetLoadException(exc_msg)
Bases: BaseException
class ads.dataset.helper.ElaboratedPath(source: Union[str, List[str]], format: Optional[str] = None,
name: Optional[str] = None, **kwargs)
Bases: object
The ElaboratedPath class unifies all of the operations and information related to a path or path list. An elaborated path can accept any of the following as a valid source: a single path, a glob pattern path, a directory, a list of paths (note: all of these paths must be from the same filesystem AND have the same format), or a sqlalchemy connection url.
Parameters
• source –
• format –
• kwargs –
By the end of this method, this class needs to have paths, format, and name ready
property format: str
Find sample size for a population using Cochran’s Sample Size Formula.
With default values for confidence_level (percentage, default: 95%) and confidence_interval (margin of
error, percentage, default: 1%)
SUPPORTED CONFIDENCE LEVELS: 50%, 68%, 90%, 95%, and 99% ONLY - this is because the Z-score is
table based, and I’m only providing Z for common confidence levels.
ads.dataset.helper.concatenate(X, y)
ads.dataset.helper.convert_to_html(plot)
ads.dataset.helper.down_sample(df, target)
Fixes imbalanced dataset by down-sampling
Parameters
• df (pandas.DataFrame) –
• target (name of the target column in df ) –
Returns
downsampled_df
Return type
pandas.DataFrame
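A small, self-contained sketch of this helper on an imbalanced pandas DataFrame; the column name target and the data are illustrative, while the import path follows the entry above.
>>> import pandas as pd
>>> from ads.dataset.helper import down_sample
>>> df = pd.DataFrame({"x": range(10),
...                    "target": [0] * 8 + [1] * 2})   # 8:2 class imbalance
>>> balanced = down_sample(df, "target")               # majority class is down-sampled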
ads.dataset.helper.fix_column_names(X)
ads.dataset.helper.get_dtype(feature_type, dtype)
ads.dataset.helper.get_feature_type(name, series)
ads.dataset.helper.is_text_data(df, target=None)
ads.dataset.helper.map_types(types)
ads.dataset.helper.parse_apache_log_datetime(x)
ads.dataset.helper.parse_apache_log_str(x)
Returns the string delimited by two characters.
Source: https://mmas.github.io/read-apache-access-log-pandas
Example
>>> parse_apache_log_str('[my string]')
'my string'
ads.dataset.helper.rename_duplicate_cols(original_cols)
class ads.dataset.label_encoder.DataFrameLabelEncoder
Bases: TransformerMixin
Label encoder for pandas.DataFrame and dask.dataframe.core.DataFrame.
fit(X)
Fits a DataFrameLabelEncoder.
transform(X)
Transforms a dataset using the DataFrameLabelEncoder.
class ads.dataset.pipeline.TransformerPipeline(steps)
Bases: Pipeline
add(transformer)
Add transformer to data transformation pipeline
Parameters
transformer (Union[TransformerMixin, tuple(str, TransformerMixin)]) – if
tuple, (name, transformer implementing transform)
steps: List[Any]
visualize()
show_in_notebook()
class ads.dataset.recommendation_transformer.RecommendationTransformer(feature_metadata=None, correlation=None, target=None, is_balanced=False, target_type=None, feature_ranking=None, len=0, fix_imbalance=True, auto_transform=True, correlation_threshold=0.7)
Bases: TransformerMixin
fit(X)
transformer_log(action)
local wrapper to both log and record in the actions_performed array
– box_plot - discrete feature vs continuous feature. Draw a box plot to show distributions
with respect to categories,
– scatter - continuous feature vs continuous feature. Draw a scatter plot with possibility of
several semantic groupings.
• yscale (str, optional) – One of {“linear”, “log”, “symlog”, “logit”}. The y axis scale
type to apply. Can be used when either x or y is an ordinal feature.
• verbose (bool, default True) – Displays Note/Tips if True
plot_gis_scatter(lon='longitude', lat='latitude', ax=None)
Supports plotting Choropleth maps
Parameters
• df (pandas dataframe) – The dataframe to plot
• x (str) – The name of the feature to plot, usually the longitude
• y (str) – The name of the feature to plot, usually the latitude
summary(feature_name=None)
Display list of features & their datatypes. Shows the column name and the feature’s meta_data if given a
specific feature name.
Parameters
feature_name (str) – The name of the feature
Returns
a dictionary that contains requested information
Return type
dict
timeseries(date_col)
Supports any plotting operations where x=datetime.
Parameters
date_col (str) – The name of the feature to plot
Returns
a plotting object that contains a date column and dataframe
Return type
func
show_in_notebook(feature_names=None)
Plot target distribution or target versus feature relation.
Parameters
feature_names (list, Optional) – Plot target against a list of features. Display target
distribution if feature_names is not provided.
17.1.1.9.1 Submodules
class ads.evaluations.evaluation_plot.EvaluationPlot
Bases: object
EvaluationPlot holds data and methods for plots and is used to output them
baseline(bool)
whether to plot the null model or zero information model
baseline_kwargs(dict)
keyword arguments for the baseline plot
color_wheel(dict)
color information used by the plot
font_sz(dict)
dictionary of font sizes used in the plots
perfect(bool)
determines whether a “perfect” classifier curve is displayed
perfect_kwargs(dict)
parameters for the perfect classifier for precision/recall curves
prob_type(str)
model type, i.e. classification or regression
get_legend_labels(legend_labels)
Renders the legend labels on the plot
plot(evaluation, plots, num_classes, perfect, baseline, legend_labels)
Generates the evaluation plot
baseline = None
font_sz = {'l': 14, 'm': 12, 's': 10, 'xl': 16, 'xs': 8}
classmethod get_legend_labels(legend_labels)
Gets the legend labels, resolves any conflicts such as length, and renders the labels for the plot
Parameters
(dict) (legend_labels) – key/value dictionary containing legend label data
Return type
Nothing
Examples
Type
ads.common.data.ADSData
Positive_Class_names
Class attribute listing the ways to represent positive classes
Type
list
add_metrics(func, names)
Adds the listed metrics to the evaluator it is called on
del_metrics(names)
Removes listed metrics from the evaluator object it is called on
add_models(models, show_full_name)
Adds the listed models to the evaluator object
del_models(names)
Removes the listed models from the evaluator object
show_in_notebook(plots, use_training_data, perfect, baseline, legend_labels)
Visualize evaluation plots in the notebook
calculate_cost(tn_weight, fp_weight, fn_weight, tp_weight, use_training_data)
Returns a cost associated with the input weights
Creates an ads evaluator object.
Parameters
• test_data (ads.common.data.ADSData instance) – Test data to evaluate model on.
The object can be built using ADSData.build().
• models (list[ads.common.model.ADSModel]) – The object can be built using
ADSModel.from_estimator(). Maximum length of the list is 3
• training_data (ads.common.data.ADSData instance, optional) – Training data
to evaluate model on and compare metrics against test data. The object can be built using
ADSData.build()
• positive_class (str or int, optional) – The class to report metrics for in a binary dataset. If the target classes are True and False, positive_class will be set to True by default. If the dataset is multiclass or multilabel, this will be ignored.
• legend_labels (dict, optional) – List of legend labels. Defaults to None. If legend_labels is not specified, class names will be used for plots.
• show_full_name (bool, optional) – Show the name of the evaluator object. Defaults
to False.
Examples
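Based on the constructor parameters above, a hedged sketch of building an evaluator; test, train, model1, and model2 are placeholders assumed to have been created with ADSData.build() and ADSModel.from_estimator().
>>> evaluator = ADSEvaluator(test, models=[model1, model2],
...                          training_data=train, positive_class=1)
>>> evaluator.metrics              # HTML table comparing the models
>>> evaluator.show_in_notebook()   # render the evaluation plots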
property precision
add_metrics(funcs, names)
Adds the listed metrics to the evaluator object it is called on.
Parameters
• funcs (list) – The list of metric functions to be added. Each function will be provided y_true and y_pred, the true and predicted values for each model.
• names (list[str]) – The list of metric names corresponding to the functions.
Return type
Nothing
Examples
add_models(models, show_full_name=False)
Adds the listed models to the evaluator object it is called on.
Parameters
• models (list[ADSModel]) – The list of models to be added
• show_full_name (bool, optional) – Whether to show the full model name. Defaults
to False. ** NOT USED **
Return type
Nothing
Examples
Return type
pandas.DataFrame
Examples
del_metrics(names)
Removes the listed metrics from the evaluator object it is called on.
Parameters
names (list[str]) – The list of names of metrics to be deleted. Names can be found by
calling evaluator.test_evaluations.index.
Returns
None
Return type
None
Examples
del_models(names)
Removes the listed models from the evaluator object it is called on.
Parameters
names (list[str]) – The list of model names to be deleted. Names are the model names by default, and are assigned internally when conflicts exist. Actual names can be found using evaluator.test_evaluations.columns.
Return type
Nothing
Examples
>>> model3.rename("model3")
>>> evaluator = ADSEvaluator(test, [model1, model2, model3])
>>> evaluator.del_models(["model3"])
property metrics
Returns evaluation metrics
Returns
HTML representation of a table comparing relevant metrics.
Return type
metrics
Examples
property raw_metrics
Returns the raw metric numbers
Parameters
• metrics (list, optional) – Request metrics to pull. Defaults to all.
• use_training_data (bool, optional) – Use training data to pull metrics. Defaults to
False
Returns
The requested raw metrics for each model. If metrics is None return all.
Return type
dict
Examples
Examples
Type
array-like object holding the true values for the model
y_pred
Type
array-like object holding the predicted values for the model
model_name(str)
Type
the name of the model
classes(list)
Type
list of target classes
positive_class(str)
Type
label for positive outcome from model
y_score
Type
array-like object holding the scores for true values for the model
metrics(dict)
Type
dictionary object holding model data
get_metrics()
Gets the metrics information in a dataframe based on the number of classes
safe_metrics_call(scoring_functions, *args)
Applies sklearn scoring functions to parameters in args
get_metrics()
Gets the metrics information in a dataframe based on the number of classes
Parameters
self ((ModelEvaluator instance)) – The ModelEvaluator instance with the metrics.
Returns
Pandas dataframe containing the metrics
Return type
pandas.DataFrame
safe_metrics_call(scoring_functions, *args)
Applies the sklearn function in scoring_functions to parameters in args.
Parameters
• scoring_functions ((dict)) – Scoring functions dictionary
• args ((keyword arguments)) – Arguments passed to the sklearn function from metrics
Returns
Nothing
Raises
Exception – If an error is encountered applying the sklearn function fn to arguments.
17.1.1.10.1 Submodules
17.1.1.11.1 Submodules
The module that helps to manage feature types. Provides functionalities to register, unregister, list feature types.
Classes
FeatureTypeManager
Feature Types Manager class that manages feature types.
Examples
>>> FeatureTypeManager.warning_registered()
Feature Type Warning Handler
----------------------------------------------------------------------
0 continuous zeros zeros_handler
1 continuous high_cardinality high_cardinality_handler
>>> FeatureTypeManager.validator_registered()
    Feature Type    Validator    Condition    Handler
    -------------------------------------------------------------------------
>>> FeatureTypeManager.feature_type_unregister(NewType)
>>> FeatureTypeManager.feature_type_reset()
>>> FeatureTypeManager.feature_type_object('continuous')
Continuous
class ads.feature_engineering.feature_type_manager.FeatureTypeManager
Bases: object
Examples
>>> FeatureTypeManager.warning_registered()
Feature Type Warning Handler
----------------------------------------------------------------------
0 continuous zeros zeros_handler
1 continuous high_cardinality high_cardinality_handler
>>> FeatureTypeManager.validator_registered()
    Feature Type    Validator    Condition    Handler
    -------------------------------------------------------------------------
>>> FeatureTypeManager.feature_type_unregister(NewType)
>>> FeatureTypeManager.feature_type_reset()
>>> FeatureTypeManager.feature_type_object('continuous')
Continuous
Return type
None
classmethod feature_type_unregister(feature_type: Union[FeatureType, str]) → None
Unregisters a feature type.
Parameters
feature_type ((FeatureType | str)) – The FeatureType subclass or a str indicating
feature type.
Returns
Nothing.
Return type
None
Raises
TypeError – In attempt to unregister a default feature type.
classmethod is_type_registered(feature_type: Union[FeatureType, str]) → bool
Checks if provided feature type registered in the system.
Parameters
feature_type (Union[FeatureType, str]) – The FeatureType subclass or a str indicat-
ing feature type.
Returns
True if provided feature type registered, False otherwise.
Return type
bool
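A quick, hedged illustration of the registration check, using the default 'continuous' type referenced in the examples above; the second type name is a made-up placeholder.
>>> FeatureTypeManager.is_type_registered("continuous")
True
>>> FeatureTypeManager.is_type_registered("not_a_registered_type")
False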
classmethod validator_registered() → DataFrame
Lists registered validators for registered feature types.
Returns
The list of registered validators for registered feature types in a DataFrame format.
Return type
pd.DataFrame
Examples
>>> FeatureTypeManager.validator_registered()
    Feature Type      Validator          Condition    Handler
    -------------------------------------------------------------------------
0   phone_number      is_phone_number    ()           default_handler
2   credit_card       is_credit_card     ()           default_handler
Returns
The list of registered warnings for registered feature types in a DataFrame format.
Return type
pd.DataFrame
Examples
>>> FeatureTypeManager.warning_registered()
Feature Type Warning Handler
----------------------------------------------------------------------
0 continuous zeros zeros_handler
1 continuous high_cardinality high_cardinality_handler
The ADS accessor for the Pandas DataFrame. The accessor will be initialized with the pandas object the user is
interacting with.
Examples
class ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor(pandas_obj)
Bases: ADSFeatureTypesMixin, EDAMixin, DBAccessMixin, DataLabelingAccessMixin
ADS accessor for the Pandas DataFrame.
columns
The column labels of the DataFrame.
Type
List[str]
tags(self ) → Dict[str, str]
Gets the dictionary of user defined tags for the dataframe.
default_type(self ) → Dict[str, str]
Gets the map of columns and associated default feature type names.
feature_type(self ) → Dict[str, List[str]]
Gets the list of registered feature types.
feature_type_description(self ) → pd.DataFrame
Gets the list of registered feature types in a DataFrame format.
sync(self, src: Union[pd.DataFrame, pd.Series]) → pd.DataFrame
Syncs feature types of current DataFrame with that from src.
feature_select(self, include: List[Union[FeatureType, str]] = None, exclude: List[Union[FeatureType,
str]] = None) → pd.DataFrame
Gets the list of registered feature types in a DataFrame format.
help(self, prop: str = None) → None
Provides docstrings for available methods and properties.
Examples
Examples
>>> df.ads.feature_type_description()
Column Feature Type Description
-------------------------------------------------------------------
0 City string Type representing string values.
1 Phone Number string Type representing string values.
info() → Any
Gets information about the dataframe.
Returns
The information about the dataframe.
Return type
Any
model_schema(max_col_num: int = 2000)
Generates schema from the dataframe.
Parameters
max_col_num (int, optional. Defaults to 2000) – The maximum number of columns of the data for which a schema is auto-generated.
Examples
Returns
data schema.
Return type
ads.feature_engineering.schema.Schema
Raises
ads.feature_engineering.schema.DataSizeTooWide – If the number of columns of
input data exceeds max_col_num.
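A short sketch of generating a schema through the accessor; the CSV path is a placeholder and df is an ordinary pandas DataFrame.
>>> import pandas as pd
>>> df = pd.read_csv("data.csv")                   # placeholder path
>>> schema = df.ads.model_schema(max_col_num=2000)
>>> schema                                         # ads.feature_engineering.schema.Schema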
The ADS accessor for the Pandas Series. The accessor will be initialized with the pandas object the user is interacting
with.
Examples
Examples
Return type
List[str]
Examples
Examples
Examples
class ads.feature_engineering.accessor.series_accessor.ADSSeriesValidator(feature_type_list:
List[FeatureType],
series: Series)
Bases: object
Helper class to invoke registered validators on a series level.
Initializes ADS series validator.
Parameters
• feature_type_list (List[FeatureType]) – The list of feature types.
• series (pd.Series) – The pandas series.
This exploratory data analysis (EDA) Mixin is used in the ADS accessor for the Pandas Dataframe. The series of
purpose-driven methods enable the data scientist to complete analysis on the dataframe.
From the accessor we have access to the pandas object the user is interacting with as well as corresponding lists of
feature types per column.
class ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin
Bases: object
correlation_ratio() → DataFrame
Generate a Correlation Ratio data frame for all categorical-continuous variable pairs.
Returns
Correlation Ratio correlation data frame with the following 3 columns:
1. Column 1 (name of the categorical column)
2. Column 2 (name of the continuous column)
3. Value (correlation value)
Return type
pandas.DataFrame
Note: Pairs will be replicated. For example for variables x and y, we would have (x,y), (y,x) both with
same correlation value. We will also have (x,x) and (y,y) with value 1.0.
correlation_ratio_plot() → Axes
Generate a heatmap of the Correlation Ratio correlation for all categorical-continuous variable pairs.
Returns
Correlation Ratio correlation plot object that can be updated by the customer
Return type
Plot object
cramersv() → DataFrame
Generate a Cramer’s V correlation data frame for all categorical variable pairs.
Gives a warning for dropped non-categorical columns.
Returns
Cramer’s V correlation data frame with the following 3 columns:
1. Column 1 (name of the first categorical column)
2. Column 2 (name of the second categorical column)
3. Value (correlation value)
Return type
pandas.DataFrame
Note: Pairs will be replicated. For example for variables x and y, we would have (x,y), (y,x) both with
same correlation value. We will also have (x,x) and (y,y) with value 1.0.
cramersv_plot() → Axes
Generate a heatmap of the Cramer’s V correlation for all categorical variable pairs.
Gives a warning for dropped non-categorical columns.
Returns
Cramer’s V correlation plot object that can be updated by the customer
Return type
Plot object
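A hedged sketch of the categorical correlation helpers on a DataFrame accessed through .ads (df is assumed to contain categorical columns):
>>> corr_df = df.ads.cramersv()      # pairwise Cramer's V as a DataFrame
>>> ax = df.ads.cramersv_plot()      # heatmap; non-categorical columns are dropped with a warning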
feature_count() → DataFrame
Counts the number of columns for each feature type and for each primary feature type. The Primary column reports the number of columns that have that feature type assigned as their primary type.
Returns
The number of columns for each feature type The number of columns for each primary feature
Return type
Dataframe with
Examples
>>> df.ads.feature_type
{'PassengerId': ['ordinal', 'category'],
'Survived': ['ordinal'],
'Pclass': ['ordinal'],
'Name': ['category'],
'Sex': ['category']}
>>> df.ads.feature_count()
Feature Type Count Primary
0 category 3 2
1 ordinal 3 3
feature_plot() → DataFrame
For every column in the dataframe, generate a summary plot based on the most relevant feature type.
Returns
Dataframe with 2 columns: 1. Column - feature name 2. Plot - plot object
Return type
pandas.DataFrame
feature_stat() → DataFrame
Summary statistics for the DataFrame provided.
This returns feature stats on each column using the FeatureType summary method.
Examples
>>> df = pd.read_csv('~/advanced-ds/tests/vor_datasets/vor_titanic.csv')
>>> df.ads.feature_stat().head()
Column Metric Value
0 PassengerId count 891.000
1 PassengerId mean 446.000
2 PassengerId standard deviation 257.354
3 PassengerId sample minimum 1.000
4 PassengerId lower quartile 223.500
Returns
Dataframe with 3 columns: name, metric, value
Return type
pandas.DataFrame
pearson() → DataFrame
Generate a Pearson correlation data frame for all continuous variable pairs.
Gives a warning for dropped non-numerical columns.
Returns
Pearson correlation data frame with the following 3 columns:
1. Column 1 (name of the first continuous column)
2. Column 2 (name of the second continuous column)
3. Value (correlation value)
Return type
pandas.DataFrame
Note: Pairs will be replicated. For example for variables x and y, we’d have (x,y), (y,x) both with same
correlation value. We’ll also have (x,x) and (y,y) with value 1.0.
pearson_plot() → Axes
Generate a heatmap of the Pearson correlation for all continuous variable pairs.
Returns
Pearson correlation plot object that can be updated by the customer
Return type
Plot object
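Similarly, a hedged sketch of the continuous-variable helpers on a DataFrame accessed through .ads:
>>> pearson_df = df.ads.pearson()    # pairwise Pearson correlation for continuous columns
>>> ax = df.ads.pearson_plot()       # heatmap returned for further customization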
warning() → DataFrame
Generates a data frame that lists feature specific warnings.
Returns
The list of feature specific warnings.
Return type
pandas.DataFrame
Examples
>>> df.ads.warning()
    Column    Feature Type    Warning    Message    Metric    Value
    ----------------------------------------------------------------------
This exploratory data analysis (EDA) Mixin is used in the ADS accessor for the Pandas Series. The series of purpose-
driven methods enable the data scientist to complete univariate analysis.
From the accessor we have access to the pandas object the user is interacting with as well as corresponding list of
feature types.
class ads.feature_engineering.accessor.mixin.eda_mixin_series.EDAMixinSeries
Bases: object
feature_plot() → Axes
For the series generate a summary plot based on the most relevant feature type.
Returns
Plot object for the series based on the most relevant feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
feature_stat() → DataFrame
Summary statistics for the Series provided.
This returns feature stats on the series using the FeatureType summary method.
Examples
>>> df = pd.read_csv('~/advanced-ds/tests/vor_datasets/vor_titanic.csv')
>>> df['Cabin'].ads.feature_stat()
Metric Value
0 count 891
1 unique 147
2 missing 687
Returns
Dataframe with 2 columns and rows for different metric values
Return type
pandas.DataFrame
warning() → DataFrame
Generates a data frame that lists feature specific warnings.
Returns
The list of feature specific warnings.
Return type
pandas.DataFrame
Examples
>>> df["Age"].ads.warning()
Feature Type Warning Message Metric Value
---------------------------------------------------------------------------
0 continuous Zeros Age has 38 zeros Count 38
1 continuous Zeros Age has 12.2% zeros Percentage 12.2%
The module that represents the ADS Feature Types Mixin class that extends Pandas Series and Dataframe accessors.
Classes
ADSFeatureTypesMixin
ADS Feature Types Mixin class that extends Pandas Series and Dataframe accessors.
class ads.feature_engineering.accessor.mixin.feature_types_mixin.ADSFeatureTypesMixin
Bases: object
ADS Feature Types Mixin class that extends Pandas Series and DataFrame accessors.
warning_registered(cls) → pd.DataFrame
Lists registered warnings for registered feature types.
validator_registered(cls) → pd.DataFrame
Lists registered validators for registered feature types.
help(self, prop: str = None) → None
Help method that prints either a table of available properties or, given a property, returns its docstring.
help(prop: Optional[str] = None) → None
Help method that prints either a table of available properties or, given an individual property, returns its
docstring.
Parameters
prop (str) – The Name of property.
Returns
Nothing.
Return type
None
validator_registered() → DataFrame
Lists registered validators for registered feature types.
Returns
The list of registered validators for registered feature types
Return type
pandas.DataFrame
Examples
>>> df.ads.validator_registered()
    Column         Feature Type     Validator          Condition    Handler
    --------------------------------------------------------------------------
>>> df['PhoneNumber'].ads.validator_registered()
    Feature Type     Validator          Condition    Handler
    --------------------------------------------------------------------------
0   phone_number     is_phone_number    ()           default_handler
warning_registered() → DataFrame
Lists registered warnings for all registered feature types.
Returns
The list of registered warnings for registered feature types.
Return type
pandas.DataFrame
Examples
>>> df.ads.warning_registered()
Column Feature Type Warning Handler
-------------------------------------------------------------------------
0 Age continuous zeros zeros_handler
1 Age continuous high_cardinality high_cardinality_handler
>>> df["Age"].ads.warning_registered()
Feature Type Warning Handler
---------------------------------------------------------------
0 continuous zeros zeros_handler
1 continuous high_cardinality high_cardinality_handler
class ads.feature_engineering.adsstring.common_regex_mixin.CommonRegexMixin
Bases: object
property address
property credit_card
property date
property email
property ip
property link
property phone_number_US
property price
property ssn
property time
property zip_code
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
feature_plot(x: pd.Series) → plt.Axes
Shows the location of the given address on a map, based on zip code.
Example
Examples
Returns
Domain based on the Address feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the Address feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
name = 'feature_type'
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
class ads.feature_engineering.feature_type.base.Name
Bases: object
class ads.feature_engineering.feature_type.base.Tag(name: str)
Bases: object
Class for free form tags. Name must be specified.
Initialize a tag instance.
Parameters
name (str) – The name of the tag.
Examples
Examples
Returns
Domain based on the Boolean feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Parameters
x (pandas.Series) – The feature being evaluated.
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
Examples
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Type
str
name
The feature type name.
Type
str
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
feature_plot(x: pd.Series) → plt.Axes
Shows the counts of observations in each categorical bin using bar chart.
description = 'Type representing discrete unordered values.'
Examples
>>> cat = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
...                  'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='category')
>>> cat.ads.feature_type = ['category']
>>> cat.ads.feature_domain()
constraints:
- expression: $x in ['S', 'C', 'Q', '']
language: python
stats:
count: 22
missing: 3
unique: 3
values: Category
Returns
Domain based on the Category feature type.
Return type
ads.feature_engineering.schema.Domain
Returns
Plot object for the series based on the Category feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
>>> cat = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
...                  'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='category')
>>> cat.ads.feature_type = ['category']
>>> cat.ads.feature_plot()
Examples
>>> cat = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
...                  'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='category')
>>> cat.ads.feature_type = ['category']
>>> cat.ads.feature_stat()
Metric Value
0 count 22
1 unique 3
2 missing 3
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Returns
Domain based on the Constant feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the Constant feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Examples
Returns
Domain based on the Continuous feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the Continuous feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
feature_plot(x: pd.Series) → plt.Axes
Shows the counts of observations in each credit card type using bar chart.
Examples
Examples
>>> visa = [
...     "4532640527811543",
...     None,
...     "4556929308150929",
...     "4539944650919740",
...     "4485348152450846",
...     "4556593717607190",
... ]
>>> mastercard = [
...     "5334180299390324",
...     "5111466404826446",
...     "5273114895302717",
...     "5430972152222336",
...     "5536426859893306",
... ]
>>> amex = [
...     "371025944923273",
...     "374745112042294",
...     "340984902710890",
...     "375767928645325",
...     "370720852891659",
... ]
Returns
Domain based on the CreditCard feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
>>> visa = [
...     "4532640527811543",
...     None,
...     "4556929308150929",
...     "4539944650919740",
...     "4485348152450846",
...     "4556593717607190",
... ]
>>> mastercard = [
...     "5334180299390324",
...     "5111466404826446",
...     "5273114895302717",
...     "5430972152222336",
...     "5536426859893306",
... ]
>>> amex = [
...     "371025944923273",
...     "374745112042294",
...     "340984902710890",
...     "375767928645325",
...     "370720852891659",
... ]
>>> creditcard_list = visa + mastercard + amex
>>> creditcard_series = pd.Series(creditcard_list, name='card')
>>> creditcard_series.ads.feature_type = ['credit_card']
>>> creditcard_series.ads.feature_plot()
Returns
Plot object for the series based on the CreditCard feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
>>> visa = [
...     "4532640527811543",
...     None,
...     "4556929308150929",
...     "4539944650919740",
...     "4485348152450846",
...     "4556593717607190",
... ]
>>> mastercard = [
...     "5334180299390324",
...     "5111466404826446",
...     "5273114895302717",
...     "5430972152222336",
...     "5536426859893306",
... ]
>>> amex = [
...     "371025944923273",
...     "374745112042294",
...     "340984902710890",
...     "375767928645325",
...     "370720852891659",
... ]
>>> creditcard_list = visa + mastercard + amex
>>> creditcard_series = pd.Series(creditcard_list, name='card')
>>> creditcard_series.ads.feature_type = ['credit_card']
>>> creditcard_series.ads.feature_stat()
    Metric        Value
0   count         16
1   unique        15
2   missing       1
3   count_Amex    5
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
feature_plot(x: pd.Series) → plt.Axes
Shows distributions of datetime datasets using histograms.
Example
Examples
Returns
Domain based on the DateTime feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the DateTime feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Returns
The logical list indicating if the data matches requirements.
Return type
pandas.Series
Examples
Returns
Domain based on the Discrete feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the Discrete feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
classmethod feature_domain()
Returns
Nothing.
Return type
None
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Example
Examples
Returns
Domain based on the GIS feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the GIS feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Examples
Returns
Domain based on the Integer feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the Integer feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Example
Examples
Returns
Domain based on the IpAddress feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Example
Examples
Returns
Domain based on the IpAddressV4 feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
name
The feature type name.
Type
str
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
Example
Examples
Returns
Domain based on the IpAddressV6 feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Example
Examples
Returns
Domain based on the LatLong feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the LatLong feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
description
The feature type description.
Type
str
name
The feature type name.
Type
str
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
description = 'Type representing object.'
classmethod feature_domain()
Returns
Nothing.
Return type
None
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
name
The feature type name.
Type
str
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
feature_plot(x: pd.Series) → plt.Axes
Shows the counts of observations in each categorical bin using bar chart.
description = 'Type representing ordered values.'
Examples
Returns
Domain based on the Ordinal feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
The bar chart plot object for the series based on the Ordinal feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
class ads.feature_engineering.feature_type.phone_number.PhoneNumber
Bases: String
Type representing phone numbers.
description
The feature type description.
Type
str
name
The feature type name.
Type
str
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_stat(x: pd.Series) → pd.DataFrame
Generates feature statistics.
Examples
Examples
Returns
Domain based on the PhoneNumber feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Example
Examples
>>> string = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='string')
>>> string.ads.feature_type = ['string']
>>> string.ads.feature_domain()
constraints: []
stats:
count: 22
missing: 3
unique: 3
values: String
Returns
Domain based on the String feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
>>> string = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='string')
>>> string.ads.feature_type = ['string']
>>> string.ads.feature_plot()
Returns
Plot object for the series based on the String feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
>>> string = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='string')
>>> string.ads.feature_type = ['string']
>>> string.ads.feature_stat()
Metric Value
0 count 22
1 unique 3
2 missing 3
Returns
Summary statistics of the Series or Dataframe provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
warning
Provides functionality to register warnings and invoke them.
Type
FeatureWarning
validator
Provides functionality to register validators and invoke them.
feature_plot(x: pd.Series) → plt.Axes
Shows distributions of datasets using wordcloud.
description = 'Type representing text values.'
classmethod feature_domain()
Returns
Nothing.
Return type
None
static feature_plot(x: Series) → Axes
Shows distributions of datasets using wordcloud.
Examples
>>> text = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S',
'S', 'S', 'S', 'Q', 'S', 'S', '', np.NaN, None], name='text')
>>> text.ads.feature_type = ['text']
>>> text.ads.feature_plot()
Returns
Plot object for the series based on the Text feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
classmethod feature_domain()
Returns
Nothing.
Return type
None
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
Example
Examples
Returns
Domain based on the ZipCode feature type.
Return type
ads.feature_engineering.schema.Domain
Examples
Returns
Plot object for the series based on the ZipCode feature type.
Return type
matplotlib.axes._subplots.AxesSubplot
Examples
Returns
Summary statistics of the Series provided.
Return type
pandas.DataFrame
validator =
<ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
object>
warning =
<ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning object>
The module that helps to register custom validators for the feature types and to extend registered validators with dispatching based on the specific arguments.
Classes
FeatureValidator
The Feature Validator class to manage custom validators.
FeatureValidatorMethod
The Feature Validator Method class. Extends methods which require dispatching based on the specific arguments.
class ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidator
Bases: object
The Feature Validator class to manage custom validators.
register(self, name: str, handler: Callable, condition: Union[Tuple, Dict[str, Any]] = None, replace: bool
= False) → None
Registers new validator.
unregister(self, name: str, condition: Union[Tuple, Dict[str, Any]] = None) → None
Unregisters validator.
registered(self ) → pd.DataFrame
Gets the list of registered validators.
Examples
... print("universal_phone_number_validator")
... return data
>>> PhoneNumber.validator.is_phone_number(series)
phone_number_validator
0 +1-202-555-0141
1 +1-202-555-0142
>>> PhoneNumber.validator.registered()
    Validator          Condition    Handler
---------------------------------------------------------------
0   is_phone_number    ()           phone_number_validator
>>> series.ads.validator.is_phone_number()
phone_number_validator
0 +1-202-555-0141
1 +1-202-555-0142
registered() → DataFrame
Gets the list of registered validators.
Returns
The list of registered validators.
Return type
pd.DataFrame
unregister(name: str, condition: Optional[Union[Tuple, Dict[str, Any]]] = None) → None
Unregisters validator.
Parameters
• name (str) – The name of the validator to be unregistered.
• condition (Union[Tuple, Dict[str, Any]]) – The condition for the validator to be
unregistered.
Returns
Nothing.
Return type
None
Raises
• TypeError – The name of the validator is not a string.
• ValidatorNotFound – The validator not found.
• ValidatorWithConditionNotFound – The validator with the provided condition not found.
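As a sketch of how register(), registered() and unregister() fit together (assuming the PhoneNumber feature type; the handler name and logic below are illustrative only, not part of the default validators):
>>> import pandas as pd
>>> from ads.feature_engineering.feature_type.phone_number import PhoneNumber
>>> def is_us_phone_number_handler(data: pd.Series) -> pd.Series:
...     # illustrative validator: flag values that look like +1 numbers
...     return data.str.startswith("+1", na=False)
>>> PhoneNumber.validator.register(name="is_us_phone_number",
...                                handler=is_us_phone_number_handler)
>>> series = pd.Series(["+1-202-555-0141", "+44-20-7946-0958"], name="phone")
>>> series.ads.feature_type = ["phone_number"]
>>> series.ads.validator.is_us_phone_number()
>>> PhoneNumber.validator.unregister(name="is_us_phone_number")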
class ads.feature_engineering.feature_type.handler.feature_validator.FeatureValidatorMethod(handler:
Callable)
Bases: object
The Feature Validator Method class.
Extends methods which require dispatching based on the specific arguments.
register(self, condition: Union[Tuple, Dict[str, Any]], handler: Callable) → None
Registers new handler.
unregister(self, condition: Union[Tuple, Dict[str, Any]]) → None
Unregisters existing handler.
registered(self ) → pd.DataFrame
Gets the list of registered handlers.
Initializes the Feature Validator Method.
Parameters
handler (Callable) – The handler that will be called by default if a suitable one is not found.
register(condition: Union[Tuple, Dict[str, Any]], handler: Callable) → None
Registers new handler.
Parameters
• condition (Union[Tuple, Dict[str, Any]]) – The condition which will be used to
register a new handler.
• handler (Callable) – The handler to be registered.
Returns
Nothing.
Return type
None
Raises
ValueError – If condition not provided or provided in the wrong format. If handler not
provided or has wrong format.
registered() → DataFrame
Gets the list of registered handlers.
Returns
The list of registered handlers.
Return type
pd.DataFrame
unregister(condition: Union[Tuple, Dict[str, Any]]) → None
Unregisters existing handler.
Parameters
condition (Union[Tuple, Dict[str, Any]]) – The condition which will be used to
unregister a handler.
Returns
Nothing.
Return type
None
Raises
ValueError – If condition not provided or provided in the wrong format. If condition not
registered.
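A minimal sketch of conditional dispatch, assuming the PhoneNumber feature type and a hypothetical country_code condition (mirroring the truncated validator example above):
>>> import pandas as pd
>>> from ads.feature_engineering.feature_type.phone_number import PhoneNumber
>>> def universal_phone_number_validator(data: pd.Series, country_code) -> pd.Series:
...     # illustrative handler used when a country_code argument is supplied
...     print("universal_phone_number_validator")
...     return data
>>> PhoneNumber.validator.is_phone_number.register(
...     condition=("country_code",),
...     handler=universal_phone_number_validator)
>>> series = pd.Series(["+1-202-555-0141", "+1-202-555-0142"], name="phone")
>>> PhoneNumber.validator.is_phone_number(series, country_code="+1")
>>> PhoneNumber.validator.is_phone_number.registered()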
exception ads.feature_engineering.feature_type.handler.feature_validator.ValidatorAlreadyExists(name: str)
Bases: ValueError
exception ads.feature_engineering.feature_type.handler.feature_validator.ValidatorNotFound(name: str)
Bases: ValueError
exception ads.feature_engineering.feature_type.handler.feature_validator.ValidatorWithConditionAlreadyExists(name: str)
Bases: ValueError
exception ads.feature_engineering.feature_type.handler.feature_validator.ValidatorWithConditionNotFound(name: str)
Bases: ValueError
exception ads.feature_engineering.feature_type.handler.feature_validator.WrongHandlerMethodSignature(handler_name: str, condition: str, handler_signature: str)
Bases: ValueError
The module that helps to register custom warnings for the feature types.
Classes
FeatureWarning
The Feature Warning class. Provides functionality to register warning handlers and invoke them.
Examples
>>> warning.zeros_percentage(data_series)
Warning Message Metric Value
----------------------------------------------------------------
0 Zeros Age has 38 zeros Count 38
>>> warning.zeros_count(data_series)
Warning Message Metric Value
----------------------------------------------------------------
1 Zeros Age has 12.2% zeros Percentage 12.2%
>>> warning(data_series)
Warning Message Metric Value
----------------------------------------------------------------
0 Zeros Age has 38 zeros Count 38
1 Zeros Age has 12.2% zeros Percentage 12.2%
>>> warning.unregister('zeros_count')
>>> warning(data_series)
Warning Message Metric Value
----------------------------------------------------------------
0 Zeros Age has 12.2% zeros Percentage 12.2%
class ads.feature_engineering.feature_type.handler.feature_warning.FeatureWarning
Bases: object
The Feature Warning class.
Provides functionality to register warning handlers and invoke them.
register(self, name: str, handler: Callable) → None
Registers a new warning for the feature type.
unregister(self, name: str) → None
Unregisters warning.
registered(self ) → pd.DataFrame
Gets the list of registered warnings.
Examples
>>> warning.zeros_percentage(data_series)
Warning Message Metric Value
----------------------------------------------------------------
0 Zeros Age has 38 zeros Count 38
>>> warning.zeros_count(data_series)
Warning Message Metric Value
----------------------------------------------------------------
1 Zeros Age has 12.2% zeros Percentage 12.2%
>>> warning(data_series)
Warning Message Metric Value
----------------------------------------------------------------
0 Zeros Age has 38 zeros Count 38
1 Zeros Age has 12.2% zeros Percentage 12.2%
>>> warning.unregister('zeros_count')
>>> warning(data_series)
Warning Message Metric Value
----------------------------------------------------------------
0 Zeros Age has 12.2% zeros Percentage 12.2%
Examples
Returns
Nothing.
Return type
None
Raises
• ValueError – If warning name is not provided or empty.
• WarningNotFound – If warning not found.
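A minimal sketch of registering a custom warning handler, assuming the Continuous feature type is importable from ads.feature_engineering.feature_type.continuous; the zeros handler below is illustrative and returns the documented four-column frame:
>>> import pandas as pd
>>> from ads.feature_engineering.feature_type.continuous import Continuous
>>> def zeros_count_handler(x: pd.Series) -> pd.DataFrame:
...     # illustrative handler: report how many zeros the column contains
...     n_zeros = int((x == 0).sum())
...     return pd.DataFrame(
...         [["Zeros", f"{x.name} has {n_zeros} zeros", "Count", n_zeros]],
...         columns=["Warning", "Message", "Metric", "Value"])
>>> Continuous.warning.register(name="zeros_count", handler=zeros_count_handler)
>>> Continuous.warning.registered()
>>> Continuous.warning.unregister(name="zeros_count")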
The module with all default warnings provided to the user. These are registered to the relevant feature types directly in the feature type files themselves.
ads.feature_engineering.feature_type.handler.warnings.high_cardinality_handler(s: Series) →
DataFrame
Warning if the number of unique values (including NaN) in the series is greater than or equal to 15.
Parameters
s (pd.Series) – Pandas series - column of some feature type.
Returns
DataFrame with 4 columns ('Warning', 'Message', 'Metric', 'Value') and 1 row, which lists the count
of unique values.
Return type
pd.DataFrame
ads.feature_engineering.feature_type.handler.warnings.missing_values_handler(s: Series) →
DataFrame
Warning for more than 5 percent missing values (NaNs) in the series.
Parameters
s (pd.Series) – Pandas series - column of some feature type.
Returns
DataFrame with 4 columns ('Warning', 'Message', 'Metric', 'Value') and 2 rows, where the first row
is the count of missing values and the second is the percentage of missing values.
Return type
pd.DataFrame
ads.feature_engineering.feature_type.handler.warnings.skew_handler(s: Series) → DataFrame
Warning if absolute value of skew is greater than 1.
Parameters
s (pd.Series) – Pandas series - column of some feature type, expects continuous values.
Returns
DataFrame with 4 columns ('Warning', 'Message', 'Metric', 'Value') and 1 row, which lists the skew
value of that column.
Return type
pd.DataFrame
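These default handlers can also be invoked directly on a column; a minimal sketch (the series content is illustrative):
>>> import pandas as pd
>>> from ads.feature_engineering.feature_type.handler import warnings as ft_warnings
>>> age = pd.Series([25, 0, 31, None, 47, None, 52, 0], name="Age")
>>> ft_warnings.missing_values_handler(age)
>>> ft_warnings.high_cardinality_handler(age)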
17.1.1.12.1 Submodules
Note: If the range [low, high] is not divisible by 𝑞, high will be replaced with the maximum of 𝑘𝑞 + low that is
not greater than high, where 𝑘 is an integer.
Parameters
• low (float) – Lower endpoint of the range of the distribution. low is included in the range.
• high (float) – Upper endpoint of the range of the distribution. high is included in the range.
• step (float) – A discretization step.
class ads.hpo.distributions.Distribution(dist)
Bases: object
Defines the abstract base class for hyperparameter search distributions
get_distribution()
Returns the distribution
• high – Upper endpoint of the range of the distribution. high is included in the range.
• step – A step for spacing between values.
class ads.hpo.distributions.IntUniformDistribution(low: float, high: float, step: float = 1)
Bases: Distribution
A uniform distribution on integers.
Note: If the range [low, high] is not divisible by step, high will be replaced with the maximum of 𝑘 × step +
low that is not greater than high, where 𝑘 is an integer.
Parameters
• low – Lower endpoint of the range of the distribution. low is included in the range.
• high – Upper endpoint of the range of the distribution. high is included in the range.
• step – A step for spacing between values.
Return type
str (DistributionEncode)
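A minimal sketch of a user-defined search space built from these distribution classes and passed to ADSTuner. LogUniformDistribution is assumed to be available from the same module, and the parameter names and ranges are illustrative:
from sklearn.linear_model import SGDClassifier
from ads.hpo.search_cv import ADSTuner
from ads.hpo.distributions import IntUniformDistribution, LogUniformDistribution

# dictionary of hyperparameter name -> distribution, used as the strategy
search_space = {
    'max_iter': IntUniformDistribution(100, 1000, step=100),
    'alpha': LogUniformDistribution(low=1e-05, high=1e-01),
}
tuner = ADSTuner(SGDClassifier(), strategy=search_space, scoring='f1_weighted')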
tuner = ADSTuner(
SVC(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
property best_index
returns: Index which corresponds to the best candidate parameter setting. :rtype: int
property best_params
returns: Parameters of the best trial. :rtype: Dict[str, Any]
property best_score
returns: Mean cross-validated score of the best estimator. :rtype: float
best_scores(n: int = 5, reverse: bool = True)
Return the best scores from the study
Parameters
• n (int) – The maximum number of results to show. Defaults to 5. If None or negative
return all.
• reverse (bool) – Whether to reverse the sort order so results are in descending order.
Defaults to True
Returns
List of the best scores
Return type
list[float or int]
Raises
ValueError –
get_status()
return the status of the current tuning process.
Alias for the property status.
Returns
The status of the process
Return type
Status
Example:
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
tuner.get_status()
halt()
Halt the current running tuning process.
Returns
Nothing
Return type
None
Raises
InvalidStateTransition –
Example:
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
tuner.halt()
is_completed()
Returns
True if the ADSTuner instance has completed; False otherwise.
Return type
bool
is_halted()
Returns
True if the ADSTuner instance is halted; False otherwise.
Return type
bool
is_running()
Returns
True if the ADSTuner instance is running; False otherwise.
Return type
bool
is_terminated()
Returns
True if the ADSTuner instance has been terminated; False otherwise.
Return type
bool
property n_trials
returns: Number of completed trials. Alias for trial_count. :rtype: int
static optimizer(study_name, pruner, sampler, storage, load_if_exists, objective_func, global_start,
global_stop, **kwargs)
Static method for running ADSTuner tuning process
Parameters
• study_name (str) – The name of the study.
• pruner – The pruning method for pruning trials.
• sampler – The sampling method used for tuning.
• storage (str) – Storage endpoint.
• load_if_exists (bool) – Load existing study if it exists.
• objective_func – The objective function to be maximized.
• global_start (multiprocessing.Value) – The global start time.
• global_stop (multiprocessing.Value) – The global stop time.
• kwargs (dict) – Keyword/value pairs passed into the optimize process
Raises
Exception – Raised for any exceptions thrown by the underlying optimization process
Returns
Nothing
Return type
None
plot_best_scores(best=True, inferior=True, time_interval=1, fig_size=(800, 500))
Plot optimization history of all trials in a study.
Parameters
• best – controls whether to plot the lines for the best scores so far.
• inferior – controls whether to plot the dots for the actual objective scores.
• time_interval – how often (in seconds) the plot refreshes to check for new trial results.
• fig_size (tuple) – width and height of the figure.
Returns
Nothing.
Return type
None
plot_contour_scores(params=None, time_interval=1, fig_size=(800, 500))
Contour plot of the scores.
Parameters
• params (Optional[List[str]]) – Parameter list to visualize. Defaults to all.
• time_interval (float) – Time interval for the plot. Defaults to 1.
• fig_size (tuple[int, int]) – Figure size. Defaults to (800, 500).
Returns
Nothing.
Return type
None
plot_edf_scores(time_interval=1, fig_size=(800, 500))
Plot the EDF (empirical distribution function) of the scores.
Only completed trials are used.
Parameters
• time_interval (float) – Time interval for the plot. Defaults to 1.
• fig_size (tuple[int, int]) – Figure size. Defaults to (800, 500).
Returns
Nothing.
Return type
None
plot_intermediate_scores(time_interval=1, fig_size=(800, 500))
Plot intermediate values of all trials in a study.
Parameters
• time_interval (float) – Time interval for the plot. Defaults to 1.
• fig_size (tuple[int, int]) – Figure size. Defaults to (800, 500).
Returns
Nothing.
Return type
None
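A minimal sketch of the plot_* helpers after (or during) a tuning run, continuing the tuner from the earlier examples; the parameter names passed to the contour plot are illustrative:
from ads.hpo.stopping_criterion import NTrials

tuner.tune(X=X, y=y, exit_criterion=[NTrials(10)], synchronous=True)
tuner.plot_best_scores()           # optimization history of all trials
tuner.plot_intermediate_scores()   # intermediate values reported by trials
tuner.plot_contour_scores(params=['alpha', 'max_iter'])
tuner.plot_edf_scores()            # empirical distribution of completed-trial scores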
plot_parallel_coordinate_scores(params=None, time_interval=1, fig_size=(800, 500))
Plot the high-dimensional parameter relationships in a study.
Note that if a parameter contains missing values, a trial with missing values is not plotted.
Parameters
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
tuner.halt()
tuner.resume()
property score_remaining
returns: The difference between the best score and the optimal score. :rtype: float
Raises
ExitCriterionError – Error is raised if there is no score-based criteria for tuning.
property scoring_name
returns: Scoring name. :rtype: str
search_space(strategy=None, overwrite=False)
Returns the search space. If strategy is not passed in, return the existing search space. When strategy is
passed in, overwrite the existing search space if overwrite is set True, otherwise, only update the existing
search space.
Parameters
• strategy (Union[str, dict], optional) – perfunctory, detailed or a dictionary/mapping of the hyperparameters and their distributions. If perfunctory, picks a few relatively more important hyperparameters to tune. If detailed, extends to a larger search space. If a dict, a user-defined search space: a dictionary where keys are parameters and values are distributions. Distributions are assumed to implement the ads distribution interface.
• overwrite (bool, optional) – Ignored when strategy is None. Otherwise, search space
is overwritten if overwrite is set True and updated if it is False.
Returns
A mapping of the hyperparameters and their distributions.
Return type
dict
Example:
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
tuner.search_space()
property sklearn_steps
returns: Search space which corresponds to the best candidate parameter setting. :rtype: int
property status
returns: The status of the current tuning process. :rtype: Status
terminate()
Terminate the current tuning process.
Returns
Nothing
Return type
None
Example:
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
tuner.terminate()
property time_elapsed
Return the time in seconds that the HPO process has been searching.
Returns
The number of seconds the HPO process has been searching.
Return type
int
property time_remaining
Returns the number of seconds remaining in the study.
Returns
Number of seconds remaining in the budget. 0 if complete/terminated.
Return type
int
Raises
ExitCriterionError – Error is raised if time has not been included in the budget.
property time_since_resume
Return the seconds since the process has been resumed from a halt.
Returns
The number of seconds since the process was last resumed.
Return type
int
Raises
NoRestartError –
property trial_count
returns: Number of completed trials. Alias for n_trials. :rtype: int
property trials
returns: Trial data up to this point. :rtype: pandas.DataFrame
trials_export(file_uri, metadata=None, script_dict={'model': None, 'scoring': None})
Export the metadata as well as the files needed to reconstruct the ADSTuner object to object storage. Data
is not stored. To resume the same ADSTuner object from object storage and continue tuning from previous
trials, you have to provide the dataset.
Parameters
• file_uri (str) – Object storage path, ‘oci://bucketname@namespace/filepath/on/objectstorage’.
For example, oci://test_bucket@ociodsccust/tuner/test.zip
• metadata (str, optional) – User defined metadata
• script_dict (dict, optional) – Script paths for model and scoring. This is only rec-
ommended for unsupported models and user-defined scoring functions. You can store the
model and scoring function in a dictionary with keys model and scoring and the respec-
tive paths as values. The model and scoring scripts must import necessary libraries for the
script to run. The model and scoring variables must be set to your model and scoring
function.
Returns
Nothing
Return type
None
Example:
'scoring': '/home/datascience/advanced-ds/notebooks/scratch/ADSTunerV2/customized_scoring.py'}
Example:
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)], synchronous=True)
tuner.trials_export('oci://<bucket_name>@<namespace>/tuner/test.zip')
Examples
property trials_remaining
returns: The number of trials remaining in the budget. :rtype: int
Raises
ExitCriterionError – Raised if the current tuner does not include a trials-based exit con-
dition.
tune(X=None, y=None, exit_criterion=[], loglevel=None, synchronous=False)
Run hyperparameter tuning until one of the exit_criterion is met. The default is to run 50 trials.
Parameters
• X (TwoDimArrayLikeType, Union[List[List[float]], np.ndarray, pd.
DataFrame, spmatrix, ADSData]) – Training data.
• y (Union[OneDimArrayLikeType, TwoDimArrayLikeType], optional) –
• OneDimArrayLikeType (Union[List[float], np.ndarray, pd.Series]) –
• TwoDimArrayLikeType (Union[List[List[float]], np.ndarray, pd.
DataFrame, spmatrix, ADSData]) – Target.
• exit_criterion (list, optional) – A list of ads stopping criterion. Can be
ScoreValue(), NTrials(), TimeBudget(). For example, [ScoreValue(0.96), NTrials(40),
TimeBudget(10)]. It will exit when any of the stopping criterion is satisfied in the
exit_criterion list. By default, the run will stop after 50 trials.
• loglevel (int, optional) – Log level.
tuner = ADSTuner(
SVC(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
wait()
Wait for the current tuning process to finish running.
Returns
Nothing
Return type
None
Example:
tuner = ADSTuner(
SGDClassifier(),
strategy='detailed',
scoring='f1_weighted',
random_state=42
)
tuner.search_space({'max_iter': 100})
X, y = load_iris(return_X_y=True)
tuner.tune(X=X, y=y, exit_criterion=[TimeBudget(1)])
tuner.wait()
exception ads.hpo.search_cv.DuplicatedStudyError
Bases: Exception
DuplicatedStudyError is raised when a new tuner process is created with a study name that already exists in
storage.
exception ads.hpo.search_cv.ExitCriterionError
Bases: Exception
ExitCriterionError is raised when an attempt is made to check exit status for a different exit type than the tuner
was initialized with. For example, if an HPO study has an exit criteria based on the number of trials and a request
is made for the time remaining, which is a different exit criterion, an exception is raised.
exception ads.hpo.search_cv.InvalidStateTransition
Bases: Exception
Invalid State Transition is raised when an invalid transition request is made, such as calling halt without a running
process.
exception ads.hpo.search_cv.NoRestartError
Bases: Exception
NoRestartError is raised when an attempt is made to check how many seconds have transpired since the HPO
process was last resumed from a halt. This can happen if the process has been terminated or it was never halted
and then resumed to begin with.
class ads.hpo.search_cv.State(value)
Bases: Enum
An enumeration.
COMPLETED = 5
HALTED = 3
INITIATED = 1
RUNNING = 2
TERMINATED = 4
17.1.1.12.4 ads.hpo.stopping_criterion
Returns
ScoreValue object
Return type
ScoreValue
class ads.hpo.stopping_criterion.TimeBudget(seconds: float)
Bases: object
Exit based on the number of seconds.
Parameters
seconds (float) – Time limit, in seconds. If None there is no time limit.
Returns
TimeBudget object
Return type
TimeBudget
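A minimal sketch combining the exit criteria, continuing the tuner from the earlier examples; the threshold values are illustrative:
from ads.hpo.stopping_criterion import NTrials, ScoreValue, TimeBudget

# stop as soon as any criterion is satisfied: 40 trials, score >= 0.96, or 60 seconds
tuner.tune(
    X=X, y=y,
    exit_criterion=[NTrials(40), ScoreValue(0.96), TimeBudget(60)],
    synchronous=True,
)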
17.1.1.13.1 Submodules
Example
If you are in an OCI notebook session and you would like to use the same infrastructure configurations, the in-
frastructure configuration can be simplified. Here is another example of creating and running a jupyter notebook
as a job:
See also:
https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/jobs/index.html
Initializes a job.
The infrastructure and runtime can be configured when initializing the job,
or by calling with_infrastructure() and with_runtime().
The infrastructure should be a subclass of ADS job Infrastructure, e.g., DataScienceJob, DataFlow. The runtime
should be a subclass of ADS job Runtime, e.g., PythonRuntime, ScriptRuntime.
Parameters
• name (str, optional) – The name of the job, by default None. If it is None, a default name may be generated by the infrastructure, depending on the implementation of the infrastructure. For an OCI Data Science job, the default name contains the job artifact name and a timestamp. If there is no artifact, a randomly generated, easy-to-remember name with a timestamp is used instead, for example 'strange-spider-2022-08-17-23:55.02'.
• infrastructure (Infrastructure, optional) – Job infrastructure, by default None
• runtime (Runtime, optional) – Job runtime, by default None.
create(**kwargs) → Job
Creates the job on the infrastructure.
Returns
The job instance (self)
Return type
Job
static dataflow_job(compartment_id: Optional[str] = None, **kwargs) → List[Job]
List data flow jobs under a given compartment.
Parameters
• compartment_id (str) – compartment id
• kwargs – additional keyword arguments
Returns
list of Job instances
Return type
List[Job]
static datascience_job(compartment_id: Optional[str] = None, **kwargs) → List[DataScienceJob]
Lists the existing data science jobs in the compartment.
Parameters
compartment_id (str) – The compartment ID for listing the jobs. This is optional if running
in an OCI notebook session. The jobs in the same compartment of the notebook session will
be returned.
Returns
A list of Job objects.
Return type
list
delete() → None
Deletes the job from the infrastructure.
download(to_dir: str, output_uri=None, **storage_options)
Downloads files from remote output URI to local.
Parameters
• to_dir (str) – Local directory to which the files will be downloaded.
• output_uri ((str, optional). Default is None.) – The remote URI from which
the files will be downloaded. Defaults to None. If output_uri is not specified, this method
will try to get the output_uri from the runtime.
Returns
A list of job run instances, the actual object type depends on the infrastructure.
Return type
list
property runtime: Runtime
The job runtime.
Returns
The job runtime
Return type
Runtime
status() → str
Status of the job
Returns
Status of the job
Return type
str
to_dict() → dict
Serialize the job specifications to a dictionary.
Returns
A dictionary containing job specifications.
Return type
dict
with_infrastructure(infrastructure) → Job
Sets the infrastructure for the job.
Parameters
infrastructure (Infrastructure) – Job infrastructure.
Returns
The job instance (self)
Return type
Job
with_name(name: str) → Job
Sets the job name.
Parameters
name (str) – Job name.
Returns
The job instance (self)
Return type
Job
with_runtime(runtime) → Job
Sets the runtime for the job.
Parameters
runtime (Runtime) – Job runtime.
Returns
The job instance (self)
Return type
Job
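A minimal sketch of assembling and running a job with these builders. The OCIDs, shape, conda slug and script path are placeholders; with_compartment_id, with_project_id, PythonRuntime.with_source and Job.run are assumed from the wider ads.jobs API, since this excerpt does not list them:
from ads.jobs import Job, DataScienceJob, PythonRuntime

job = (
    Job(name="my-training-job")
    .with_infrastructure(
        DataScienceJob()
        .with_compartment_id("<compartment_ocid>")      # placeholder OCID
        .with_project_id("<project_ocid>")              # placeholder OCID
        .with_shape_name("VM.Standard.E3.Flex")
        .with_shape_config_details(memory_in_gbs=16, ocpus=1)
    )
    .with_runtime(
        PythonRuntime()
        .with_service_conda("generalml_p38_cpu_v1")     # placeholder service conda slug
        .with_source("local/path/to/train.py")          # assumed source setter
    )
)
job.create()
run = job.run()
run.watch()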
CONST_CONDA_REGION = 'region'
CONST_CONDA_SLUG = 'slug'
CONST_CONDA_TYPE = 'type'
CONST_CONDA_TYPE_CUSTOM = 'published'
CONST_CONDA_TYPE_SERVICE = 'service'
CONST_CONDA_URI = 'uri'
• region (str, optional) – The region of the bucket storing the custom conda pack,
by default None. If region is not specified, ADS will use the region from your authen-
tication credentials, * For API Key, config[“region”] is used. * For Resource Principal,
signer.region is used.
This is required if the conda pack is stored in a different region.
Returns
The runtime instance.
Return type
self
See also:
https://docs.oracle.com/en-us/iaas/data-science/using/conda_publishs_object.htm
with_service_conda(slug: str)
Specifies the service conda pack for running the job
Parameters
slug (str) – The slug name of the service conda pack
Returns
The runtime instance.
Return type
self
class ads.jobs.builders.runtimes.python_runtime.DataFlowNotebookRuntime(spec: Optional[Dict]
= None, **kwargs)
Bases: DataFlowRuntime, NotebookRuntime
Initialize the object with specifications.
User can either pass in the specification as a dictionary or through keyword arguments.
Parameters
• spec (dict, optional) – Object specification, by default None
• kwargs (dict) – Specification as keyword arguments. If spec contains the same key as the
one in kwargs, the value from kwargs will be used.
convert(overwrite=False)
CONST_ARCHIVE_BUCKET = 'archiveBucket'
CONST_ARCHIVE_URI = 'archiveUri'
CONST_CONDA_AUTH_TYPE = 'condaAuthType'
CONST_CONFIGURATION = 'configuration'
CONST_SCRIPT_BUCKET = 'scriptBucket'
CONST_SCRIPT_PATH = 'scriptPathURI'
https://docs.oracle.com/en-us/iaas/data-science/using/conda_publishs_object.htm
with_script_bucket(bucket) → DataFlowRuntime
Set the object storage bucket to save the script, in case the script URI given is local.
Parameters
bucket (str) – name of the bucket
Returns
runtime instance itself
Return type
DataFlowRuntime
with_script_uri(path) → DataFlowRuntime
Set script uri.
Parameters
uri (str) – uri to the script
Returns
runtime instance itself
Return type
DataFlowRuntime
with_service_conda(slug: str)
Specifies the service conda pack for running the job
Parameters
slug (str) – The slug name of the service conda pack
Returns
The runtime instance.
Return type
self
class ads.jobs.builders.runtimes.python_runtime.GitPythonRuntime(spec: Optional[Dict] = None,
**kwargs)
Bases: CondaRuntime, _PythonRuntimeMixin
Represents a job runtime with source code from git repository
Initialize the object with specifications.
User can either pass in the specification as a dictionary or through keyword arguments.
Parameters
• spec (dict, optional) – Object specification, by default None
• kwargs (dict) – Specification as keyword arguments. If spec contains the same key as the
one in kwargs, the value from kwargs will be used.
CONST_BRANCH = 'branch'
CONST_COMMIT = 'commit'
CONST_GIT_SSH_SECRET_ID = 'gitSecretId'
CONST_GIT_URL = 'url'
CONST_SKIP_METADATA = 'skipMetadataUpdate'
This update step also requires resource principals to have the permission to update the job run.
Returns
True if the metadata update will be skipped. Otherwise False.
Return type
bool
property ssh_secret_ocid
The OCID of the OCI Vault secret storing the Git SSH key.
property url: str
URL of the Git repository.
with_argument(*args, **kwargs)
Specifies the arguments for running the script/function.
When running a python script, the arguments will be the command line arguments. For example,
with_argument("arg1", "arg2", key1="val1", key2="val2") will generate the command line arguments:
"arg1 arg2 --key1 val1 --key2 val2".
When running a function, the arguments will be passed into the function. Arguments can also be a list, dict
or any JSON-serializable object. For example, with_argument("arg1", "arg2", key1=["val1a", "val1b"],
key2="val2") will be passed in as your_function("arg1", "arg2", key1=["val1a", "val1b"], key2="val2").
Returns
The runtime instance.
Return type
self
with_source(url: str, branch: Optional[str] = None, commit: Optional[str] = None, secret_ocid:
Optional[str] = None)
Specifies the Git repository and branch/commit for the job source code.
Parameters
• url (str) – URL of the Git repository.
• branch (str, optional) – Git branch name, by default None, the default branch will be
used.
• commit (str, optional) – Git commit ID (SHA1 hash), by default None, the most recent
commit will be used.
• secret_ocid (str) – The secret OCID storing the SSH key content for checking out the
Git repository.
Returns
The runtime instance.
Return type
self
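A minimal sketch of a Git-based runtime. The repository URL, branch, entrypoint and conda slug are placeholders, and with_entrypoint is assumed to be provided by _PythonRuntimeMixin, which this excerpt does not list:
from ads.jobs import GitPythonRuntime

runtime = (
    GitPythonRuntime()
    .with_service_conda("generalml_p38_cpu_v1")                 # placeholder conda slug
    .with_source("https://github.com/<org>/<repo>.git", branch="main")
    .with_entrypoint("src/train.py")                            # assumed entrypoint setter
    .with_argument("arg1", key1="val1")                         # becomes "arg1 --key1 val1"
)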
class ads.jobs.builders.runtimes.python_runtime.NotebookRuntime(spec: Optional[Dict] = None,
**kwargs)
Bases: CondaRuntime
Represents a job runtime with Jupyter notebook
Initialize the object with specifications.
User can either pass in the specification as a dictionary or through keyword arguments.
Parameters
• spec (dict, optional) – Object specification, by default None
• kwargs (dict) – Specification as keyword arguments. If spec contains the same key as the
one in kwargs, the value from kwargs will be used.
CONST_EXCLUDE_TAG = 'excludeTags'
CONST_NOTEBOOK_ENCODING = 'notebookEncoding'
CONST_NOTEBOOK_PATH = 'notebookPathURI'
CONST_OUTPUT_URI = 'outputURI'
Returns
The runtime instance.
Return type
self
class ads.jobs.builders.runtimes.python_runtime.PythonRuntime(spec: Optional[Dict] = None,
**kwargs)
Bases: ScriptRuntime, _PythonRuntimeMixin
Represents a job runtime using ADS driver script to run Python code
Initialize the object with specifications.
User can either pass in the specification as a dictionary or through keyword arguments.
Parameters
• spec (dict, optional) – Object specification, by default None
• kwargs (dict) – Specification as keyword arguments. If spec contains the same key as the
one in kwargs, the value from kwargs will be used.
CONST_WORKING_DIR = 'workingDir'
with_working_dir(working_dir: str)
Specifies the working directory in the job run. By default, the working directory will be the directory containing the user code (the job artifact directory). This can be changed by specifying a relative path to the job artifact directory.
Parameters
working_dir (str) – The path of the working directory. This can be a relative path from
the job artifact directory.
Returns
The runtime instance.
Return type
self
property working_dir: str
The working directory for the job run.
class ads.jobs.builders.runtimes.python_runtime.ScriptRuntime(spec: Optional[Dict] = None,
**kwargs)
Bases: CondaRuntime
Represents job runtime with scripts and conda pack
Initialize the object with specifications.
User can either pass in the specification as a dictionary or through keyword arguments.
Parameters
• spec (dict, optional) – Object specification, by default None
• kwargs (dict) – Specification as keyword arguments. If spec contains the same key as the
one in kwargs, the value from kwargs will be used.
CONST_ENTRYPOINT = 'entrypoint'
CONST_SCRIPT_PATH = 'scriptPathURI'
Returns
The runtime instance.
Return type
self
CONST_COMPARTMENT_ID = 'compartment_id'
CONST_CONFIG = 'configuration'
CONST_DRIVER_SHAPE = 'driver_shape'
CONST_DRIVER_SHAPE_CONFIG = 'driver_shape_config'
CONST_EXECUTE = 'execute'
CONST_EXECUTOR_SHAPE = 'executor_shape'
CONST_EXECUTOR_SHAPE_CONFIG = 'executor_shape_config'
CONST_ID = 'id'
CONST_LANGUAGE = 'language'
CONST_MEMORY_IN_GBS = 'memory_in_gbs'
CONST_METASTORE_ID = 'metastore_id'
CONST_NUM_EXECUTORS = 'num_executors'
CONST_OCPUS = 'ocpus'
CONST_SPARK_VERSION = 'spark_version'
CONST_WAREHOUSE_BUCKET_URI = 'warehouse_bucket_uri'
Parameters
id (str) – compartment id
Returns
the Data Flow instance itself
Return type
DataFlow
with_configuration(configs: dict) → DataFlow
Set configuration for a Data Flow job.
Parameters
configs (dict) – dictionary of configurations
Returns
the Data Flow instance itself
Return type
DataFlow
with_driver_shape(shape: str) → DataFlow
Set driver shape for a Data Flow job.
Parameters
shape (str) – driver shape
Returns
the Data Flow instance itself
Return type
DataFlow
with_driver_shape_config(memory_in_gbs: float, ocpus: float, **kwargs: Dict[str, Any]) → DataFlow
Sets the driver shape config details of Data Flow job infrastructure. Specify only when a flex shape is
selected. For example VM.Standard.E3.Flex allows the memory_in_gbs and cpu count to be specified.
Parameters
• memory_in_gbs (float) – The size of the memory in GBs.
• ocpus (float) – The OCPUs count.
• kwargs – Additional keyword arguments.
Returns
the Data Flow instance itself.
Return type
DataFlow
with_execute(exec: str) → DataFlow
Set command for spark-submit.
Parameters
exec (str) – str of commands
Returns
the Data Flow instance itself
Return type
DataFlow
Return type
DataFlow
with_metastore_id(id: str) → DataFlow
Set Hive metastore id for a Data Flow job.
Parameters
id (str) – metastore id
Returns
the Data Flow instance itself
Return type
DataFlow
with_num_executors(n: int) → DataFlow
Set number of executors for a Data Flow job.
Parameters
n (int) – number of executors
Returns
the Data Flow instance itself
Return type
DataFlow
with_spark_version(ver: str) → DataFlow
Set the Spark version for a Data Flow job. Currently supported versions are 2.4.4, 3.0.2, and 3.2.1. Documentation: https://docs.oracle.com/en-us/iaas/data-flow/using/dfs_getting_started.htm#before_you_begin
Parameters
ver (str) – spark version
Returns
the Data Flow instance itself
Return type
DataFlow
with_warehouse_bucket_uri(uri: str) → DataFlow
Set warehouse bucket uri for a Data Flow job.
Parameters
uri (str) – uri to warehouse bucket
Returns
the Data Flow instance itself
Return type
DataFlow
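A minimal sketch of a Data Flow job built with these setters. Bucket names, OCIDs, shapes and the script path are placeholders; with_executor_shape and with_executor_shape_config are assumed to mirror the driver-shape setters, since only their constants appear in this excerpt:
from ads.jobs import Job, DataFlow, DataFlowRuntime

df_job = (
    Job(name="my-dataflow-app")
    .with_infrastructure(
        DataFlow()
        .with_compartment_id("<compartment_ocid>")          # placeholder OCID
        .with_driver_shape("VM.Standard.E3.Flex")
        .with_driver_shape_config(memory_in_gbs=16, ocpus=1)
        .with_executor_shape("VM.Standard.E3.Flex")          # assumed setter
        .with_executor_shape_config(memory_in_gbs=16, ocpus=1)
        .with_num_executors(2)
        .with_spark_version("3.2.1")
    )
    .with_runtime(
        DataFlowRuntime()
        .with_script_uri("local/path/to/spark_job.py")       # placeholder script
        .with_script_bucket("<bucket_name>")                 # placeholder bucket
    )
)
df_job.create()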
class ads.jobs.builders.infrastructure.dataflow.DataFlowApp(config: Optional[dict] = None, signer:
Optional[Signer] = None,
client_kwargs: Optional[dict] =
None, **kwargs)
Bases: OCIModelMixin, Application
Initializes a service/resource with OCI client as a property. If config or signer is specified, it will be
used to initialize the OCI client. If neither of them is specified, the client will be initialized with
ads.common.auth.default_signer. If both of them are specified, both of them will be passed into the OCI client,
and the authentication will be determined by the OCI Python SDK.
Parameters
• config (dict, optional) – OCI API key config dictionary, by default None.
• signer (oci.signer.Signer, optional) – OCI authentication signer, by default None.
• client_kwargs (dict, optional) – Additional keyword arguments for initializing the
OCI client.
property driver
property executor
Parameters
• config (dict, optional) – OCI API key config dictionary, by default None.
• signer (oci.signer.Signer, optional) – OCI authentication signer, by default None.
• client_kwargs (dict, optional) – Additional keyword arguments for initializing the
OCI client.
property run_details_link
Link to run details page in OCI console
Returns
html display
Return type
DisplayHandle
property status: str
Show status (lifecycle state) of a run.
Returns
status of the run
Return type
str
to_yaml() → str
Serializes the object into YAML string.
Returns
YAML stored in a string.
Return type
str
wait(interval: int = 3) → DataFlowRun
Wait for a run to terminate.
Parameters
interval (int, optional) – interval to wait before probing again
Returns
a DataFlowRun instance
Return type
DataFlowRun
watch(interval: int = 3) → DataFlowRun
This is an alias of wait() method. It waits for a run to terminate.
Parameters
interval (int, optional) – interval to wait before probing again
Returns
a DataFlowRun instance
Return type
DataFlowRun
job_properties = {
"display_name": "my_job",
"job_infrastructure_configuration_details": {"shape_name": "VM.MY_SHAPE"}
}
job = DSCJob(**job_properties)
The properties can also be OCI REST API payload, in which the keys are in camel format.
job_payload = {
"projectId": "<project_ocid>",
"compartmentId": "<compartment_ocid>",
"displayName": "<job_name>",
"jobConfigurationDetails": {
"jobType": "DEFAULT",
"commandLineArguments": "pos_arg1 pos_arg2 --key1 val1 --key2 val2",
"environmentVariables": {
"KEY1": "VALUE1",
"KEY2": "VALUE2",
# User specifies conda env via env var
"CONDA_ENV_TYPE" : "service",
"CONDA_ENV_SLUG" : "mlcpuv1"
}
},
"jobInfrastructureConfigurationDetails": {
"jobInfrastructureType": "STANDALONE",
"shapeName": "VM.Standard.E3.Flex",
"jobShapeConfigDetails": {
"memoryInGBs": 16,
"ocpus": 1
},
"blockStorageSizeInGBs": "100",
"subnetId": "<subnet_ocid>"
}
}
job = DSCJob(**job_payload)
run(**kwargs) → DataScienceJobRun
Runs the job
Parameters
• **kwargs – Keyword arguments for initializing a Data Science Job Run. The keys can be any keys supported by OCI JobConfigurationDetails and JobRun, including: hyperparameter_values: dict(str, str), environment_variables: dict(str, str), command_line_arguments: str, maximum_runtime_in_minutes: int, display_name: str.
If display_name is not specified, it will be generated as "<JOB_NAME>-run-<TIMESTAMP>".
Returns
An instance of DSCJobRun, which can be used to monitor the job run.
Return type
DSCJobRun
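A minimal sketch of overriding run-level configuration through these keyword arguments, continuing the job from the earlier sketches; the values are illustrative:
run = job.run(
    display_name="my-job-run",
    environment_variables={"GREETING": "Hello"},
    maximum_runtime_in_minutes=60,
)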
run_list(**kwargs) → list[DataScienceJobRun]
Lists the runs of this job.
Parameters
**kwargs – Keyword arguments to be passed into the OCI list_job_runs() for filtering the job
runs.
Returns
A list of DSCJobRun objects
Return type
list
update() → DSCJob
Updates the Data Science Job.
upload_artifact(artifact_path: Optional[str] = None) → DSCJob
Uploads the job artifact to OCI
Parameters
artifact_path (str, optional) – Local path to the job artifact file to be uploaded, by
default None. If artifact_path is None, the path in self.artifact will be used.
Returns
The DSCJob instance (self), which allows chaining additional method calls.
Return type
DSCJob
ads.jobs.builders.infrastructure.dsc_job.DSCJobRun
alias of DataScienceJobRun
class ads.jobs.builders.infrastructure.dsc_job.DataScienceJob(spec: Optional[Dict] = None,
**kwargs)
Bases: Infrastructure
Represents the OCI Data Science Job infrastructure.
Initializes a data science job infrastructure
Parameters
• spec (dict, optional) – Object specification, by default None
• kwargs (dict) – Specification as keyword arguments. If spec contains the same key as the
one in kwargs, the value from kwargs will be used.
CONST_BLOCK_STORAGE = 'blockStorageSize'
CONST_COMPARTMENT_ID = 'compartmentId'
CONST_DISPLAY_NAME = 'displayName'
CONST_JOB_INFRA = 'jobInfrastructureType'
CONST_JOB_TYPE = 'jobType'
CONST_LOG_GROUP_ID = 'logGroupId'
CONST_LOG_ID = 'logId'
CONST_MEMORY_IN_GBS = 'memoryInGBs'
CONST_OCPUS = 'ocpus'
CONST_PROJECT_ID = 'projectId'
CONST_SHAPE_CONFIG_DETAILS = 'shapeConfigDetails'
CONST_SHAPE_NAME = 'shapeName'
CONST_SUBNET_ID = 'subnetId'
Returns
An instance of DataScienceJob
Return type
DataScienceJob
classmethod from_id(job_id: str) → DataScienceJob
Gets an existing job using Job OCID
Parameters
job_id (str) – Job OCID
Returns
An instance of DataScienceJob
Return type
DataScienceJob
classmethod instance_shapes(compartment_id: Optional[str] = None) → list
Lists the supported shapes for running jobs in a compartment.
Parameters
compartment_id (str, optional) – The compartment ID for running the jobs, by default
None. This is optional in an OCI Data Science notebook session. If this is not specified, the
compartment ID of the notebook session will be used.
Returns
A list of dictionaries containing the information of the supported shapes.
Return type
list
property job_id: Optional[str]
The OCID of the job
property job_infrastructure_type: Optional[str]
Job infrastructure type
property job_type: Optional[str]
Job type
classmethod list_jobs(compartment_id: Optional[str] = None, **kwargs) → List[DataScienceJob]
Lists all jobs in a compartment.
Parameters
• compartment_id (str, optional) – The compartment ID for running the jobs, by de-
fault None. This is optional in an OCI Data Science notebook session. If this is not specified,
the compartment ID of the notebook session will be used.
• **kwargs – Keyword arguments to be passed into OCI list_jobs API for filtering the jobs.
Returns
A list of DataScienceJob objects.
Return type
List[DataScienceJob]
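A minimal sketch of these lookup helpers; the OCIDs are placeholders:
existing = DataScienceJob.from_id("<job_ocid>")
shapes = DataScienceJob.instance_shapes(compartment_id="<compartment_ocid>")
jobs = DataScienceJob.list_jobs(compartment_id="<compartment_ocid>")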
property log_group_id: str
Log group OCID of the data science job
Returns
Log group OCID
Return type
str
property log_id: str
Log OCID for the data science job.
Returns
Log OCID
Return type
str
property name: str
Display name of the job
payload_attribute_map = {'blockStorageSize':
'job_infrastructure_configuration_details.block_storage_size_in_gbs',
'compartmentId': 'compartment_id', 'displayName': 'display_name',
'jobInfrastructureType':
'job_infrastructure_configuration_details.job_infrastructure_type', 'jobType':
'job_configuration_details.job_type', 'logGroupId':
'job_log_configuration_details.log_group_id', 'logId':
'job_log_configuration_details.log_id', 'projectId': 'project_id',
'shapeConfigDetails':
'job_infrastructure_configuration_details.job_shape_config_details', 'shapeName':
'job_infrastructure_configuration_details.shape_name', 'subnetId':
'job_infrastructure_configuration_details.subnet_id'}
Returns
A list of job runs.
Return type
List[DSCJobRun]
property shape_config_details: Dict
The details for the job run shape configuration.
shape_config_details_attribute_map = {'memoryInGBs': 'memory_in_gbs', 'ocpus':
'ocpus'}
static standardize_spec(spec)
Return type
DataScienceJob
with_shape_config_details(memory_in_gbs: float, ocpus: float, **kwargs: Dict[str, Any]) →
DataScienceJob
Sets the details for the job run shape configuration. Specify only when a flex shape is selected. For example
VM.Standard.E3.Flex allows the memory_in_gbs and cpu count to be specified.
Parameters
• memory_in_gbs (float) – The size of the memory in GBs.
• ocpus (float) – The OCPUs count.
• kwargs – Additional keyword arguments.
Returns
The DataScienceJob instance (self)
Return type
DataScienceJob
with_shape_name(shape_name: str) → DataScienceJob
Sets the shape name for running the job
Parameters
shape_name (str) – Shape name
Returns
The DataScienceJob instance (self)
Return type
DataScienceJob
with_subnet_id(subnet_id: str) → DataScienceJob
Sets the subnet ID
Parameters
subnet_id (str) – Subnet ID
Returns
The DataScienceJob instance (self)
Return type
DataScienceJob
class ads.jobs.builders.infrastructure.dsc_job.DataScienceJobRun(config: Optional[dict] = None,
signer: Optional[Signer] =
None, client_kwargs:
Optional[dict] = None,
**kwargs)
Bases: OCIDataScienceMixin, JobRun, RunInstance
Represents a Data Science Job run
Initializes a service/resource with OCI client as a property. If config or signer is specified, it will be
used to initialize the OCI client. If neither of them is specified, the client will be initialized with
ads.common.auth.default_signer. If both of them are specified, both of them will be passed into the OCI client,
and the authentication will be determined by OCI Python SDK.
Parameters
• config (dict, optional) – OCI API key config dictionary, by default None.
• signer (oci.signer.Signer, optional) – OCI authentication signer, by default None.
• client_kwargs (dict, optional) – Additional keyword arguments for initializing the
OCI client.
cancel() → DataScienceJobRun
Cancels a job run. This method will wait for the job run to be canceled before returning.
Returns
The job run instance.
Return type
self
create() → DataScienceJobRun
Creates a job run
download(to_dir)
Downloads files from job run output URI to local.
Parameters
to_dir (str) – Local directory to which the files will be downloaded.
Returns
The job run instance (self)
Return type
DataScienceJobRun
property job
The job instance of this run.
Returns
An ADS Job instance
Return type
Job
property log_group_id: str
The log group ID from OCI logging service containing the logs from the job run.
property log_id: str
The log ID from OCI logging service containing the logs from the job run.
property logging: OCILog
The OCILog object containing the logs from the job run
logs(limit: Optional[int] = None) → list
Gets the logs of the job run.
Parameters
limit (int, optional) – Limit the number of logs to be returned. Defaults to None. All
logs will be returned.
Returns
A list of log records. Each log record is a dictionary with the following keys: id, time, mes-
sage.
Return type
list
property status: str
Lifecycle status
Returns
Status in a string.
Return type
str
to_yaml() → str
Serializes the object into YAML string.
Returns
YAML stored in a string.
Return type
str
watch(interval: float = 3) → DataScienceJobRun
Watches the job run until it finishes. Before the job starts running, this method will output the job run status.
Once the job starts running, the logs will be streamed until the job succeeds, fails, or is cancelled.
Parameters
interval (int) – Time interval in seconds between each request to update the logs. Defaults
to 3 (seconds).
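A minimal sketch of monitoring a run with these members, continuing the job from the earlier sketches:
run = job.run()
run.watch(interval=10)        # stream logs until the run finishes
records = run.logs(limit=50)  # most recent 50 log records
print(run.status)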
17.1.1.14.1 Submodules
• model_file_name ((str, optional). Defaults to None.) – The file name of the serialized
model.
• reload ((bool, optional). Defaults to False.) – Determines whether to reload the model into the environment.
Returns
A ModelArtifact instance.
Return type
ModelArtifact
Raises
ValueError – If artifact_dir not provided.
classmethod from_uri(uri: str, artifact_dir: str, model_file_name: Optional[str] = None, force_overwrite:
Optional[bool] = False, auth: Optional[Dict] = None)
Constructs a ModelArtifact object from the existing model artifacts.
Parameters
• uri (str) – The URI of the source artifact folder or archive. Can be a local path or an OCI object
storage URI.
• artifact_dir (str) – The local artifact folder to store the files needed for deployment.
• model_file_name ((str, optional). Defaults to None) – The file name of the serialized
model.
• force_overwrite ((bool, optional). Defaults to False.) – Whether to over-
write existing files or not.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set using
the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
Returns
A ModelArtifact instance
Return type
ModelArtifact
Raises
ValueError – If uri is equal to artifact_dir, or if uri does not exist.
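A hedged sketch of constructing a ModelArtifact from an existing artifact archive in Object Storage; the import path is an assumption and the URI and directories are placeholders.
>>> from ads.model.artifact import ModelArtifact   # import path assumed
>>> artifact = ModelArtifact.from_uri(
...     uri="oci://<bucket_name>@<namespace>/prefix/model_artifact.zip",   # placeholder Object Storage URI
...     artifact_dir="./model_artifact",
...     force_overwrite=True,
... )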
prepare_runtime_yaml(inference_conda_env: str, inference_python_version: Optional[str] = None,
training_conda_env: Optional[str] = None, training_python_version:
Optional[str] = None, force_overwrite: bool = False, namespace: str =
'id19sfcrra6z', bucketname: str = 'service-conda-packs') → None
Generates a runtime.yaml file and saves it to the artifact directory.
Parameters
• inference_conda_env ((str, optional). Defaults to None.) – The Object Stor-
age path of the conda pack to be used in deployment. Can be either the slug or the Object
Storage path of the conda pack. Slugs can only be passed in if the conda pack is a service
pack.
• inference_python_version ((str, optional). Defaults to None.) – The
python version which will be used in deployment.
Type
str
artifact_dir
Artifact directory to store the files needed for deployment.
Type
str
auth
Default authentication is set using the ads.set_auth API. To override the default, use the
ads.common.auth.api_keys or ads.common.auth.resource_principal to create an authentication signer to
instantiate an IdentityClient object.
Type
Dict
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
Any model object generated by sklearn framework
Type
Callable
framework
The framework of the model.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, ..., \*\*kwargs)
Loads model from model catalog.
from_model_deployment(model_deployment_id, ..., \*\*kwargs)
Loads model from model deployment.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
prepare_save_deploy(..., \*\*kwargs)
Shortcut for prepare, save and deploy steps.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
GenericModel Constructor.
Parameters
• estimator ((Callable).) – Trained model.
• artifact_dir ((str, optional). Defaults to None.) – Artifact directory to store
the files needed for deployment.
• properties ((ModelProperties, optional). Defaults to None.) – ModelProp-
erties object required to save and deploy model.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set using
the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
• serialize ((bool, optional). Defaults to True.) – Whether to serialize the
model to pkl file by default. If False, you need to serialize the model manually, save it under
artifact_dir and update the score.py manually.
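A minimal sketch of wrapping a trained scikit-learn estimator in a GenericModel, assuming GenericModel is importable from ads.model.generic_model; the estimator and artifact directory are illustrative.
>>> from ads.model.generic_model import GenericModel
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(n_samples=100, n_features=4, random_state=42)
>>> estimator = LogisticRegression().fit(X, y)
>>> model = GenericModel(estimator=estimator, artifact_dir="./generic_model_artifact")
>>> model.summary_status()   # shows which prepare/save/deploy steps are still pending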
classmethod delete(model_id: Optional[str] = None, delete_associated_model_deployment:
Optional[bool] = False, delete_model_artifact: Optional[bool] = False, artifact_dir:
Optional[str] = None, **kwargs: Dict) → None
Deletes a model from Model Catalog.
Parameters
• model_id ((str, optional). Defaults to None.) – The model OCID to be
deleted. If the method is called at the instance level, self.model_id is used.
• delete_associated_model_deployment ((bool, optional). Defaults to False.) –
Whether associated model deployments need to be deleted or not.
• delete_model_artifact ((bool, optional). Defaults to False.) – Whether associated
model artifacts need to be deleted or not.
• artifact_dir ((str, optional). Defaults to None.) – The local path to the model artifacts
folder. If the method is called at the instance level, self.artifact_dir is used by default.
• kwargs –
auth: (Dict, optional). Defaults to None.
The default authentication is set using the ads.set_auth API. If you need to override the
default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create
an appropriate authentication signer and the kwargs required to instantiate an IdentityClient
object.
• artifact_dir ((str, optional). Defaults to None.) – The artifact directory to store the files
needed for deployment. Will be created if not exists.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set using
the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
• force_overwrite ((bool, optional). Defaults to False.) – Whether to over-
write existing files or not.
• properties ((ModelProperties, optional). Defaults to None.) – Model-
Properties object required to save and deploy model.
Returns
An instance of GenericModel class.
Return type
GenericModel
Raises
ValueError – If model_file_name is not provided.
classmethod from_model_catalog(model_id: str, model_file_name: Optional[str] = None, artifact_dir:
Optional[str] = None, auth: Optional[Dict] = None, force_overwrite:
Optional[bool] = False, properties:
Optional[Union[ModelProperties, Dict]] = None, bucket_uri:
Optional[str] = None, remove_existing_artifact: Optional[bool] =
True, **kwargs) → GenericModel
Loads model from model catalog.
Parameters
• model_id (str) – The model OCID.
• model_file_name ((str, optional). Defaults to None.) – The name of the serialized model.
• artifact_dir ((str, optional). Defaults to None.) – The artifact directory to store the files
needed for deployment. Will be created if not exists.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set using
the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
• force_overwrite ((bool, optional). Defaults to False.) – Whether to over-
write existing files or not.
• properties ((ModelProperties, optional). Defaults to None.) – Model-
Properties object required to save and deploy model.
• bucket_uri ((str, optional). Defaults to None.) – The OCI Object Storage URI where
model artifacts will be copied to. The bucket_uri is only necessary for downloading large
artifacts whose size is greater than 2 GB. Example:
oci://<bucket_name>@<namespace>/prefix/.
• remove_existing_artifact ((bool, optional). Defaults to True.) – Whether artifacts
uploaded to the Object Storage bucket need to be removed or not.
• kwargs –
compartment_id
[(str, optional)] Compartment OCID. If not specified, the value will be taken from the
environment variables.
timeout
[(int, optional). Defaults to 10 seconds.] The connection timeout in seconds for the
client.
Returns
An instance of GenericModel class.
Return type
GenericModel
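A sketch of loading a cataloged model back into a GenericModel; the model OCID and directories are placeholders.
>>> from ads.model.generic_model import GenericModel
>>> model = GenericModel.from_model_catalog(
...     model_id="ocid1.datasciencemodel.oc1..<unique_id>",   # placeholder model OCID
...     model_file_name="model.pkl",
...     artifact_dir="./downloaded_artifact",
...     force_overwrite=True,
... )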
classmethod from_model_deployment(model_deployment_id: str, model_file_name: Optional[str] =
None, artifact_dir: Optional[str] = None, auth: Optional[Dict] =
None, force_overwrite: Optional[bool] = False, properties:
Optional[Union[ModelProperties, Dict]] = None, bucket_uri:
Optional[str] = None, remove_existing_artifact: Optional[bool] =
True, **kwargs) → GenericModel
Loads model from model deployment.
Parameters
• model_deployment_id (str) – The model deployment OCID.
• model_file_name ((str, optional). Defaults to None.) – The name of the serialized model.
• artifact_dir ((str, optional). Defaults to None.) – The artifact directory to store the files
needed for deployment. Will be created if not exists.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set using
the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
• force_overwrite ((bool, optional). Defaults to False.) – Whether to over-
write existing files or not.
• properties ((ModelProperties, optional). Defaults to None.) – Model-
Properties object required to save and deploy model.
• bucket_uri ((str, optional). Defaults to None.) – The OCI Object Storage URI where
model artifacts will be copied to. The bucket_uri is only necessary for downloading large
artifacts whose size is greater than 2 GB. Example:
oci://<bucket_name>@<namespace>/prefix/.
• remove_existing_artifact ((bool, optional). Defaults to True.) – Whether artifacts
uploaded to the Object Storage bucket need to be removed or not.
• kwargs –
compartment_id
[(str, optional)] Compartment OCID. If not specified, the value will be taken from the
environment variables.
timeout
[(int, optional). Defaults to 10 seconds.] The connection timeout in seconds for the
client.
Returns
An instance of GenericModel class.
Return type
GenericModel
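The deployment variant follows the same pattern, keyed by the model deployment OCID (placeholder below).
>>> model = GenericModel.from_model_deployment(
...     model_deployment_id="ocid1.datasciencemodeldeployment.oc1..<unique_id>",   # placeholder OCID
...     model_file_name="model.pkl",
...     artifact_dir="./downloaded_artifact",
...     force_overwrite=True,
... )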
get_data_serializer(data: any, data_type: Optional[str] = None)
The data_serializer_class class is set in init and used here. Frameworks should subclass the
InputDataSerializer class, then set that as self.data_serializer_class. Frameworks should avoid
overwriting this method whenever possible.
Parameters
• data ((Any)) – data to be passed to model for prediction.
• data_type (str) – Type of the data.
Returns
Serialized data.
Return type
data
introspect() → DataFrame
Conducts introspection.
Returns
A pandas DataFrame which contains the introspection results.
Return type
pandas.DataFrame
predict(data: Optional[Any] = None, **kwargs) → Dict[str, Any]
Returns prediction of input data run against the model deployment endpoint.
Examples
Parameters
• data (Any) – Data for the prediction. For ONNX models and the local serialization
method, data can be any of the data types that each framework supports.
• kwargs – content_type: str, used to indicate the media type of the resource. image:
PIL.Image Object or uri for the image.
A valid string path for the image file can be a local path, http(s), oci, s3, or gs.
storage_options: dict
Passed to fsspec.open for a particular storage connection. Please see fsspec (https://
filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.open) for more details.
Returns
Dictionary with the predicted values.
Return type
Dict[str, Any]
Raises
• NotActiveDeploymentError – If model deployment process was not started or not fin-
ished yet.
• ValueError – If data is empty or not JSON serializable.
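A hedged sketch of calling the deployment endpoint; it assumes the GenericModel from the constructor sketch above has already been prepared, saved, and deployed, and that the response key produced by the default score.py template is "prediction".
>>> response = model.predict(data=X[:2])   # requires an active model deployment
>>> response["prediction"]                 # key name assumed from the default score.py template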
example, use_case_type=UseCaseType.BINARY_CLASSIFICATION or
use_case_type=”binary_classification”. Check with UseCaseType class to see all
supported types.
• X_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame].
Defaults to None.) – A sample of input data that will be used to generate input
schema.
• y_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame].
Defaults to None.) – A sample of output data that will be used to generate output
schema.
• training_script_path (str. Defaults to None.) – Training script path.
• training_id ((str, optional). Defaults to value from environment
variables.) – The training OCID for model. Can be notebook session or job OCID.
• ignore_pending_changes (bool. Defaults to False.) – Whether to ignore the
pending changes in Git.
• max_col_num ((int, optional). Defaults to utils.
DATA_SCHEMA_MAX_COL_NUM.) – Do not generate the input schema if the input has
more than this number of features (columns).
• kwargs –
impute_values: (dict, optional).
The dictionary where the key is the column index (or column name, for a pandas
DataFrame) and the value is the impute value for the corresponding column.
Raises
• FileExistsError – If files already exist but force_overwrite is False.
• ValueError – If inference_python_version is not provided, but also cannot be found
through manifest file.
Returns
An instance of GenericModel class.
Return type
GenericModel
Return type
ModelDeployment
Raises
• FileExistsError – If files already exist but force_overwrite is False.
• ValueError – If inference_python_version is not provided, but also cannot be found
through manifest file.
reload() → GenericModel
Reloads the model artifact files: score.py and the runtime.yaml.
Returns
An instance of GenericModel class.
Return type
GenericModel
reload_runtime_info() → None
Reloads the model artifact file: runtime.yaml.
Returns
Nothing.
Return type
None
save(display_name: Optional[str] = None, description: Optional[str] = None, freeform_tags: Optional[dict]
= None, defined_tags: Optional[dict] = None, ignore_introspection: Optional[bool] = False,
bucket_uri: Optional[str] = None, overwrite_existing_artifact: Optional[bool] = True,
remove_existing_artifact: Optional[bool] = True, **kwargs) → str
Saves model artifacts to the model catalog.
Parameters
• display_name ((str, optional). Defaults to None.) – The name of the model.
If a display_name is not provided in kwargs, a randomly generated, easy-to-remember name
with a timestamp will be generated, like ‘strange-spider-2022-08-17-23:55.02’.
• description ((str, optional). Defaults to None.) – The description of the
model.
• freeform_tags (Dict(str, str), Defaults to None.) – Freeform tags for the
model.
• defined_tags ((Dict(str, dict(str, object)), optional). Defaults to
None.) – Defined tags for the model.
• ignore_introspection ((bool, optional). Defaults to False.) – Determines
whether to ignore the result of model introspection or not. If set to True, the save will
ignore all model introspection errors.
• bucket_uri ((str, optional). Defaults to None.) – The OCI Object Storage URI where
model artifacts will be copied to. The bucket_uri is only necessary for uploading large
artifacts whose size is greater than 2 GB. Example:
oci://<bucket_name>@<namespace>/prefix/.
• overwrite_existing_artifact ((bool, optional). Defaults to True.) – Overwrite target
bucket artifact if exists.
• remove_existing_artifact ((bool, optional). Defaults to True.) – Whether artifacts
uploaded to the Object Storage bucket need to be removed or not.
• kwargs –
project_id: (str, optional).
Project OCID. If not specified, the value will be taken either from the environment vari-
ables or model properties.
compartment_id
[(str, optional).] Compartment OCID. If not specified, the value will be taken either from
the environment variables or model properties.
timeout: (int, optional). Defaults to 10 seconds.
The connection timeout in seconds for the client.
Raises
RuntimeInfoInconsistencyError – When .runtime_info is not synced with the runtime.yaml file.
Returns
The model id.
Return type
str
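A sketch of saving the prepared model to the model catalog; the display name, description, and tags are illustrative.
>>> model_id = model.save(
...     display_name="generic-model-demo",
...     description="Example artifact saved from ADS",
...     freeform_tags={"project": "demo"},   # illustrative tags
...     ignore_introspection=False,
... )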
serialize_model(as_onnx: bool = False, initial_types: Optional[List[Tuple]] = None, force_overwrite:
bool = False, X_sample: Optional[any] = None)
Serializes and saves the model using ONNX or a model-specific method.
Parameters
• as_onnx ((boolean, optional)) – If set to True, convert the model into ONNX format.
• initial_types ((List[Tuple], optional)) – A Python list where each element is a tuple
of a variable name and a data type.
• force_overwrite ((boolean, optional)) – If set to True, overwrite the serialized model
if it exists.
• X_sample ((any, optional). Defaults to None.) – Contains model inputs such
that model(X_sample) is a valid invocation of the model; used to validate the model input type.
Returns
Nothing
Return type
None
summary_status() → DataFrame
A summary table of the current status.
Returns
The summary table of the current status.
Return type
pd.DataFrame
verify(data: Optional[Any] = None, reload_artifacts: bool = True, **kwargs) → Dict[str, Any]
Test if deployment works in local environment.
Examples
Parameters
• data (Any) – Data used to test if deployment works in local environment.
• reload_artifacts (bool. Defaults to True.) – Whether to reload artifacts or not.
• kwargs – content_type: str, used to indicate the media type of the resource. image:
PIL.Image Object or uri for the image.
A valid string path for the image file can be a local path, http(s), oci, s3, or gs.
storage_options: dict
Passed to fsspec.open for a particular storage connection. Please see fsspec (https://
filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.open) for more details.
Returns
A dictionary which contains prediction results.
Return type
Dict
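A sketch of local verification, which exercises score.py against the artifact directory before any deployment exists; the data shape depends on your model (here, the sample from the constructor sketch above).
>>> result = model.verify(data=X[:2])   # runs load_model()/predict() from score.py locally
>>> result                              # dictionary of prediction results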
class ads.model.generic_model.ModelState(value)
Bases: Enum
An enumeration.
AVAILABLE = 'Available'
DONE = 'Done'
class ads.model.generic_model.SummaryStatus
Bases: object
SummaryStatus class, which tracks the status of the model frameworks.
update_action(detail: str, action: str) → None
Updates the action of the summary status table of the corresponding detail.
Parameters
• detail ((str)) – Value of the detail in the Details column. Used to locate which row to
update.
• action ((str)) – New action to be updated for the row specified by detail.
Returns
Nothing.
Return type
None
update_status(detail: str, status: str) → None
Updates the status of the summary status table of the corresponding detail.
Parameters
• detail ((str)) – Value of the detail in the Details column. Used to locate which row to
update.
• status ((str)) – New status to be updated for the row specified by detail.
Returns
Nothing.
Return type
None
Bases: BaseProperties
Represents properties required to save and deploy model.
bucket_uri: str = None
model_deployment: ModelDeploymentDetails
model_provenance: ModelProvenanceDetails
save()
Save the RuntimeInfo object into runtime.yaml file under the artifact directory.
Returns
Nothing.
Return type
None
class ads.model.extractor.model_info_extractor_factory.ModelInfoExtractorFactory
Bases: object
Class that extracts Model Taxonomy Metadata for all supported frameworks.
static extract_info(model)
Extracts model taxonomy metadata.
Parameters
model ([ADS model, sklearn, xgboost, lightgbm, keras, oracle_automl]) –
The model object
Returns
A dictionary with keys of Framework, FrameworkVersion, Algorithm, Hyperparameters of
the model
Return type
ModelTaxonomyMetadata
Examples
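A hedged sketch, since the original example block was not preserved here: extracting taxonomy metadata from a scikit-learn estimator.
>>> from ads.model.extractor.model_info_extractor_factory import ModelInfoExtractorFactory
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = load_iris(return_X_y=True)
>>> clf = RandomForestClassifier(n_estimators=10).fit(X, y)
>>> metadata = ModelInfoExtractorFactory.extract_info(clf)   # Framework, FrameworkVersion, Algorithm, Hyperparameters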
class ads.model.extractor.automl_extractor.AutoMLExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from AutoML models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.xgboost_extractor.XgboostExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from XGBoost models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
framework(self ) → str
Returns the framework of the model.
algorithm(self ) → object
Returns the algorithm of the model.
version(self ) → str
Returns the framework version of the model.
hyperparameter(self ) → dict
Returns the hyperparameter of the model.
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.lightgbm_extractor.LightgbmExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from LightGBM models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
framework(self ) → str
Returns the framework of the model.
algorithm(self ) → object
Returns the algorithm of the model.
version(self ) → str
Returns the framework version of the model.
hyperparameter(self ) → dict
Returns the hyperparameter of the model.
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.model_info_extractor.ModelInfoExtractor
Bases: ABC
The base abstract class to extract model metadata.
framework(self ) → str
Returns the framework of the model.
algorithm(self ) → object
Returns the algorithm of the model.
version(self ) → str
Returns the framework version of the model.
hyperparameter(self ) → dict
Returns the hyperparameter of the model.
info(self ) → dict
Returns the model taxonomy metadata information.
abstract algorithm()
The abstract method to extract the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
abstract framework()
The abstract method to extract the framework of the model.
Returns
The framework of the model.
Return type
str
abstract hyperparameter()
The abstract method to extract the hyperparameters of the model.
Returns
The hyperparameter of the model.
Return type
dict
info()
Extracts the taxonomy metadata of the model.
Returns
The taxonomy metadata of the model.
Return type
dict
abstract version()
The abstract method to extract the framework version of the model.
Returns
The framework version of the model.
Return type
str
ads.model.extractor.model_info_extractor.normalize_hyperparameter(data: Dict) → dict
Converts all the fields to strings to make sure the result is JSON serializable.
Parameters
data (([Dict])) – The hyperparameters returned by the model.
Returns
Normalized (json serializable) dictionary.
Return type
Dict
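A small sketch of normalizing a hyperparameter dictionary; the keys and values are illustrative.
>>> from ads.model.extractor.model_info_extractor import normalize_hyperparameter
>>> normalize_hyperparameter({"max_depth": 5, "subsample": 0.8, "missing": float("nan")})   # all values coerced to strings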
class ads.model.extractor.sklearn_extractor.SklearnExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from sklearn models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
framework(self ) → str
Returns the framework of the model.
algorithm(self ) → object
Returns the algorithm of the model.
version(self ) → str
Returns the framework version of the model.
hyperparameter(self ) → dict
Returns the hyperparameter of the model.
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.keras_extractor.KerasExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from Keras models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.tensorflow_extractor.TensorflowExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from TensorFlow models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
framework(self ) → str
Returns the framework of the model.
algorithm(self ) → object
Returns the algorithm of the model.
version(self ) → str
Returns the framework version of the model.
hyperparameter(self ) → dict
Returns the hyperparameter of the model.
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.pytorch_extractor.PyTorchExtractor(model)
Bases: ModelInfoExtractor
Class that extracts model metadata from PyTorch models.
model
The model to extract metadata from.
Type
object
estimator
The estimator to extract metadata from.
Type
object
framework(self ) → str
Returns the framework of the model.
algorithm(self ) → object
Returns the algorithm of the model.
version(self ) → str
Returns the framework version of the model.
hyperparameter(self ) → dict
Returns the hyperparameter of the model.
property algorithm
Extracts the algorithm of the model.
Returns
The algorithm of the model.
Return type
object
property framework
Extracts the framework of the model.
Returns
The framework of the model.
Return type
str
property hyperparameter
Extracts the hyperparameters of the model.
Returns
The hyperparameters of the model.
Return type
dict
property version
Extracts the framework version of the model.
Returns
The framework version of the model.
Return type
str
class ads.model.extractor.pytorch_extractor.PytorchExtractor(model)
Bases: PyTorchExtractor
17.1.1.15.1 Submodules
Examples
Type
DataScienceClient
ds_composite_client
composite data science client
Type
DataScienceCompositeClient
deploy(model_deployment_details, \*\*kwargs)
Deploy the model specified by model_deployment_details.
get_model_deployment(model_deployment_id: str)
Get the ModelDeployment specified by model_deployment_id.
get_model_deployment_state(model_deployment_id)
Get the state of the current deployment specified by id.
delete(model_deployment_id, \*\*kwargs)
Remove the model deployment specified by the id or Model Deployment Object
list_deployments(status)
Lists the model deployments associated with the current compartment and data science client
show_deployments(status)
Shows the deployments filtered by status in a DataFrame
Initializes model deployer.
Parameters
config (dict, optional) – ADS auth dictionary for OCI authentication. This can be gener-
ated by calling ads.common.auth.api_keys() or ads.common.auth.resource_principal(). If this is
None, ads.common.default_signer(client_kwargs) will be used.
delete(model_deployment_id, wait_for_completion: bool = True, max_wait_time: int = 1200, poll_interval:
int = 30) → ModelDeployment
Deletes the model deployment specified by OCID.
Parameters
• model_deployment_id (str) – Model deployment OCID.
• wait_for_completion (bool) – Wait for deletion to complete. Defaults to True.
• max_wait_time (int) – Maximum amount of time to wait in seconds (Defaults to 1200).
Negative implies infinite wait time.
• poll_interval (int) – Poll interval in seconds (Defaults to 30).
Return type
A ModelDeployment instance that was deleted
deploy(properties: Optional[Union[ModelDeploymentProperties, Dict]] = None, wait_for_completion: bool
= True, max_wait_time: int = 1200, poll_interval: int = 30, **kwargs) → ModelDeployment
Deploys a model.
Parameters
• properties (ModelDeploymentProperties or dict) – Properties to deploy the
model. Properties can be None when kwargs are used for specifying properties.
• wait_for_completion (bool) – Flag set for whether to wait for deployment to complete
before proceeding. Optional, defaults to True.
• max_wait_time (int) – Maximum amount of time to wait in seconds. Optional, defaults
to 1200. Negative value implies infinite wait time.
• poll_interval (int) – Poll interval in seconds. Optional, defaults to 30.
• kwargs – Keyword arguments for initializing ModelDeploymentProperties. See ModelDe-
ploymentProperties() for details.
Returns
A ModelDeployment instance.
Return type
ModelDeployment
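A sketch of deploying a cataloged model through the deployer; the import path is assumed (ads.model.deployment.model_deployer), and the OCID and instance shape are placeholders. The keyword arguments are forwarded to ModelDeploymentProperties.
>>> from ads.model.deployment.model_deployer import ModelDeployer   # import path assumed
>>> deployer = ModelDeployer()
>>> deployment = deployer.deploy(
...     model_id="ocid1.datasciencemodel.oc1..<unique_id>",   # placeholder model OCID
...     display_name="demo-deployment",
...     instance_shape="VM.Standard2.1",                      # illustrative shape
...     instance_count=1,
...     wait_for_completion=True,
... )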
deploy_from_model_uri(model_uri: str, properties: Optional[Union[ModelDeploymentProperties, Dict]]
= None, wait_for_completion: bool = True, max_wait_time: int = 1200,
poll_interval: int = 30, **kwargs) → ModelDeployment
Deploys a model.
Parameters
• model_uri (str) – URI to model files; can be local or in cloud storage.
• properties (ModelDeploymentProperties or dict) – Properties to deploy the
model. Properties can be None when kwargs are used for specifying properties.
• wait_for_completion (bool) – Flag set for whether to wait for deployment to complete
before proceeding. Defaults to True
• max_wait_time (int) – Maximum amount of time to wait in seconds (Defaults to 1200).
Negative implies infinite wait time.
• poll_interval (int) – Poll interval in seconds (Defaults to 30).
• kwargs – Keyword arguments for initializing ModelDeploymentProperties
Returns
A ModelDeployment instance
Return type
ModelDeployment
get_model_deployment(model_deployment_id: str) → ModelDeployment
Gets a ModelDeployment by OCID.
Parameters
model_deployment_id (str) – Model deployment OCID
Returns
A ModelDeployment instance
Return type
ModelDeployment
get_model_deployment_state(model_deployment_id: str) → State
Gets the state of a deployment specified by OCID
Parameters
model_deployment_id (str) – Model deployment OCID
Returns
The state of the deployment
Return type
str
list_deployments(status=None, compartment_id=None, **kwargs) → list
Lists the model deployments associated with current compartment and data science client
Parameters
• status (str) – Status of deployment. Defaults to None.
• compartment_id (str) – Target compartment to list deployments from. Defaults to the
compartment set in the environment variable “NB_SESSION_COMPARTMENT_OCID”.
If “NB_SESSION_COMPARTMENT_OCID” is not set, the root compartment ID will be
used. A ValueError will be raised if the root compartment ID cannot be determined.
• kwargs – The values are passed to oci.data_science.DataScienceClient.list_model_deployments.
Returns
A list of ModelDeployment objects.
Return type
list
Raises
ValueError – If compartment_id is not specified and cannot be determined from the envi-
ronment.
show_deployments(status=None, compartment_id=None) → DataFrame
Returns the model deployments associated with the current compartment and data science client
as a DataFrame that can be easily visualized.
Parameters
• status (str) – Status of deployment. Defaults to None.
• compartment_id (str) – Target compartment to list deployments from. Defaults to the
compartment set in the environment variable “NB_SESSION_COMPARTMENT_OCID”.
If “NB_SESSION_COMPARTMENT_OCID” is not set, the root compartment ID will be
used. A ValueError will be raised if the root compartment ID cannot be determined.
Returns
A pandas DataFrame containing information about the ModelDeployments.
Return type
DataFrame
Raises
ValueError – If compartment_id is not specified and cannot be determined from the envi-
ronment.
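A sketch of listing and inspecting deployments with the deployer from the sketch above; the status filter value and OCID are illustrative placeholders.
>>> deployments = deployer.list_deployments(status="ACTIVE")   # list of ModelDeployment objects
>>> deployer.show_deployments(status="ACTIVE")                 # the same information as a pandas DataFrame
>>> deployer.get_model_deployment_state(
...     "ocid1.datasciencemodeldeployment.oc1..<unique_id>")   # placeholder deployment OCID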
exception ads.model.deployment.model_deployment.LogNotConfiguredError
Bases: Exception
class ads.model.deployment.model_deployment.ModelDeployment(properties: Op-
tional[Union[ModelDeploymentProperties,
Dict]] = None, config: Optional[Dict]
= None, workflow_req_id:
Optional[str] = None,
model_deployment_id: Optional[str]
= None, model_deployment_url: str =
'', **kwargs)
Bases: object
A class used to represent a Model Deployment.
config
Deployment configuration parameters
Type
(dict)
properties
ModelDeploymentProperties object
Type
(ModelDeploymentProperties)
workflow_state_progress
Workflow request id
Type
(str)
workflow_steps
The number of steps in the workflow
Type
(int)
url
The model deployment url endpoint
Type
(str)
ds_client
The data science client used by model deployment
Type
(DataScienceClient)
ds_composite_client
The composite data science client used by the model deployment
Type
(DataScienceCompositeClient)
workflow_req_id
Workflow request id
Type
(str)
model_deployment_id
model deployment id
Type
(str)
state
Returns the deployment state of the current Model Deployment object
Type
(State)
deploy(wait_for_completion, \*\*kwargs)
Deploy the current Model Deployment object
delete(wait_for_completion, \*\*kwargs)
Deletes the current Model Deployment object
update(wait_for_completion, \*\*kwargs)
Updates a model deployment
list_workflow_logs()
Returns a list of the steps involved in deploying a model
Initializes a ModelDeployment object.
Parameters
• properties ((Union[ModelDeploymentProperties, Dict], optional).
Defaults to None.) – Object containing deployment properties. The properties
can be None when kwargs are used for specifying properties.
• config ((Dict, optional). Defaults to None.) – ADS auth dictionary for
OCI authentication. This can be generated by calling ads.common.auth.api_keys()
or ads.common.auth.resource_principal(). If this is None then the
ads.common.default_signer(client_kwargs) will be used.
• workflow_req_id ((str, optional). Defaults to None.) – Workflow request id.
Return type
A pandas DataFrame containing logs.
property state: State
Returns the deployment state of the current Model Deployment object
property status: State
Returns the deployment state of the current Model Deployment object
update(properties: Optional[Union[ModelDeploymentProperties, dict]] = None, wait_for_completion: bool
= True, max_wait_time: int = 1200, poll_interval: int = 30, **kwargs)
Updates a model deployment
You can update model_deployment_configuration_details and change instance_shape and model_id when
the model deployment is in the ACTIVE lifecycle state. The bandwidth_mbps or instance_count can only
be updated while the model deployment is in the INACTIVE state. Changes to the bandwidth_mbps or
instance_count will take effect the next time the ActivateModelDeployment action is invoked on the model
deployment resource.
Parameters
• properties (ModelDeploymentProperties or dict) – The properties for updating
the deployment.
• wait_for_completion (bool) – Flag set for whether to wait for deployment to complete
before proceeding. Defaults to True.
• max_wait_time (int) – Maximum amount of time to wait in seconds (Defaults to 1200).
Negative implies infinite wait time.
• poll_interval (int) – Poll interval in seconds (Defaults to 30).
• kwargs – dict
Returns
The instance of ModelDeployment.
Return type
ModelDeployment
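A sketch of updating an existing deployment, where deployment is a ModelDeployment instance such as the one returned by the deployer sketch earlier; per the note above, instance_count changes only apply while the deployment is INACTIVE. The OCID is a placeholder.
>>> from ads.model.deployment.model_deployment_properties import ModelDeploymentProperties
>>> props = ModelDeploymentProperties(
...     model_id="ocid1.datasciencemodel.oc1..<unique_id>",   # placeholder model OCID
...     instance_count=2,                                     # only updatable while the deployment is INACTIVE
... )
>>> deployment = deployment.update(properties=props, wait_for_completion=True)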
class ads.model.deployment.model_deployment.ModelDeploymentLogType
Bases: object
ACCESS = 'access'
PREDICT = 'predict'
class ads.model.deployment.model_deployment_properties.ModelDeploymentProperties(model_id:
Optional[str] = None, model_uri: Optional[str] = None, oci_model_deployment:
Optional[Union[ModelDeployment, CreateModelDeploymentDetails,
UpdateModelDeploymentDetails, Dict]] = None, config: Optional[dict] = None, **kwargs)
Bases: OCIDataScienceMixin, ModelDeployment
Represents the details for a model deployment
swagger_types
The property names and the corresponding types of OCI ModelDeployment model.
Type
dict
model_id
The model artifact OCID in model catalog.
Type
str
model_uri
URI to model files; can be local or in cloud storage.
Type
str
with_prop(property_name, value)
Set the model deployment details property_name attribute to value
with_instance_configuration(config)
Set the configuration of VM instance.
with_access_log(log_group_id, log_id)
Configure the access log with the OCI logging service
with_predict_log(log_group_id, log_id)
Configure the predict log with the OCI logging service
build()
Return an instance of CreateModelDeploymentDetails for creating the deployment.
Initialize a ModelDeploymentProperties object by specifying one of the following:
Parameters
• model_id ((str, optional). Defaults to None.) – Model Artifact OCID. The
model_id must be specified either explicitly or as an attribute of the OCI object.
• model_uri ((str, optional). Defaults to None.) – URI to model files; can be local
or in cloud storage.
• oci_model_deployment ((Union[ModelDeployment,
CreateModelDeploymentDetails, UpdateModelDeploymentDetails, Dict],
optional). Defaults to None.) – An OCI model or Dict containing model deployment
details. The OCI model can be an instance of either ModelDeployment,
CreateModelDeploymentDetails, or UpdateModelDeploymentDetails.
• config ((Dict, optional). Defaults to None.) – ADS auth dic-
tionary for OCI authentication. This can be generated by calling
ads.common.auth.api_keys() or ads.common.auth.resource_principal(). If this is None,
ads.common.default_signer(client_kwargs) will be used.
• kwargs – Users can also initialize the object by using keyword ar-
guments. The following keyword arguments are supported by
oci.data_science.models.data_science_models.ModelDeployment:
– display_name,
– description,
– project_id,
– compartment_id,
– model_deployment_configuration_details,
– category_log_details,
– freeform_tags,
– defined_tags.
If display_name is not specified, a randomly generated, easy-to-remember name will be
generated, like ‘strange-spider-2022-08-17-23:55.02’.
ModelDeploymentProperties also supports the following additional keyword arguments:
– instance_shape,
– instance_count,
– bandwidth_mbps,
– access_log_group_id,
– access_log_id,
– predict_log_group_id,
– predict_log_id,
– memory_in_gbs,
– ocpus.
These additional arguments will be saved into appropriate properties in the OCI model.
Raises
ValueError – model_id is None AND not specified in
oci_model_deployment.model_deployment_configuration_details.model_configuration_details.
build() → CreateModelDeploymentDetails
Converts the deployment properties to an OCI CreateModelDeploymentDetails object. Converts a model URI
into a model OCID if the user passed in a URI.
Returns
A CreateModelDeploymentDetails instance ready for OCI API.
Return type
CreateModelDeploymentDetails
sub_properties = ['instance_shape', 'instance_count', 'bandwidth_mbps',
'access_log_group_id', 'access_log_id', 'predict_log_group_id', 'predict_log_id',
'memory_in_gbs', 'ocpus']
to_oci_model(oci_model)
Convert properties into an OCI data model
Parameters
oci_model (class) – The class of the OCI data model, e.g.,
oci.data_science.models.CreateModelDeploymentDetails
to_update_deployment() → UpdateModelDeploymentDetails
Converts the deployment properties to OCI UpdateModelDeploymentDetails object.
Returns
An UpdateModelDeploymentDetails instance ready for OCI API.
Return type
UpdateModelDeploymentDetails
with_access_log(log_group_id: str, log_id: str)
Adds access log config
Parameters
• group_id (str) – Log group ID of OCI logging service
• log_id (str) – Log ID of OCI logging service
Returns
self
Return type
ModelDeploymentProperties
with_category_log(log_type: str, group_id: str, log_id: str)
Adds category log configuration
Parameters
• log_type (str) – The type of logging to be configured. Must be “access” or “predict”
• group_id (str) – Log group ID of the OCI logging service
• log_id (str) – Log ID of the OCI logging service
Returns
self
Return type
ModelDeploymentProperties
with_prop(property_name: str, value: Any)
Sets model deployment’s property_name attribute to value
Parameters
• property_name (str) – Name of a model deployment property.
• value – New value for property attribute.
Returns
self
Return type
ModelDeploymentProperties
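A sketch chaining the builder methods above into a CreateModelDeploymentDetails payload; all OCIDs and the shape are placeholders, and the with_* builders are assumed to return self, as documented for with_prop and with_access_log.
>>> from ads.model.deployment.model_deployment_properties import ModelDeploymentProperties
>>> props = (
...     ModelDeploymentProperties(model_id="ocid1.datasciencemodel.oc1..<unique_id>")   # placeholder OCID
...     .with_prop("display_name", "demo-deployment")
...     .with_prop("instance_shape", "VM.Standard2.1")                                  # illustrative shape
...     .with_access_log("<access_log_group_ocid>", "<access_log_ocid>")                # placeholders
...     .with_predict_log("<predict_log_group_ocid>", "<predict_log_ocid>")
... )
>>> create_details = props.build()   # CreateModelDeploymentDetails ready for the OCI API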
17.1.1.16.1 Submodules
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
A trained automl estimator/model using oracle automl.
Type
Callable
framework
“oracle_automl”, the framework name of the estimator.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model. Default to “model.pkl”.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from model catalog.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
>>> automl_model.verify(...)
>>> automl_model.save()
>>> model_deployment = automl_model.deploy(wait_for_completion=False)
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
A trained lightgbm estimator/model using Lightgbm.
Type
Callable
framework
“lightgbm”, the framework name of the model.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from model catalog.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
>>> lightgbm_model.reload()
>>> lightgbm_model.verify(X_test)
>>> lightgbm_model.save()
>>> model_deployment = lightgbm_model.deploy(wait_for_completion=False)
>>> lightgbm_model.predict(X_test)
Initiates a LightGBMModel instance. This class wraps the LightGBM model as an estimator. Its primary purpose
is to hold the trained model and perform serialization.
Parameters
• estimator – Any model object generated by the LightGBM framework.
• artifact_dir (str) – Directory in which to generate artifacts.
• properties ((ModelProperties, optional). Defaults to None.) – ModelProp-
erties object required to save and deploy model.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set using
the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
Returns
LightGBMModel instance.
Return type
LightGBMModel
Raises
TypeError – If the input model is not a LightGBM model or is not supported for serialization.
Examples
>>> lightgbm_model.prepare(inference_conda_env="generalml_p37_cpu_v1")
>>> lightgbm_model.verify(X_test)
>>> lightgbm_model.save()
>>> model_deployment = lightgbm_model.deploy()
>>> lightgbm_model.predict(X_test)
>>> lightgbm_model.delete_deployment()
auth
Default authentication is set using the ads.set_auth API. To override the default, use the
ads.common.auth.api_keys or ads.common.auth.resource_principal to create an authentication signer to
instantiate an IdentityClient object.
Type
Dict
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
A trained pytorch estimator/model using Pytorch.
Type
Callable
framework
“pytorch”, the framework name of the model.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from model catalog.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
>>> torch_model.reload()
>>> torch_model.verify(...)
>>> torch_model.save()
>>> model_deployment = torch_model.deploy(wait_for_completion=False)
>>> torch_model.predict(...)
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
A trained sklearn estimator/model using scikit-learn.
Type
Callable
framework
“scikit-learn”, the framework name of the model.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from model catalog.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
>>> sklearn_model.reload()
>>> sklearn_model.verify(X_test)
>>> sklearn_model.save()
>>> model_deployment = sklearn_model.deploy(wait_for_completion=False)
>>> sklearn_model.predict(X_test)
Returns
SklearnModel instance.
Return type
SklearnModel
Examples
>>> sklearn_model.prepare(inference_conda_env="dataexpl_p37_cpu_v3")
>>> sklearn_model.verify(X_test)
>>> sklearn_model.save()
>>> model_deployment = sklearn_model.deploy()
>>> sklearn_model.predict(X_test)
>>> sklearn_model.delete_deployment()
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
A trained tensorflow estimator/model using Tensorflow.
Type
Callable
framework
“tensorflow”, the framework name of the model.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from model catalog.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
>>> tf_model.verify(x_test[:1])
>>> tf_model.save()
>>> model_deployment = tf_model.deploy(wait_for_completion=False)
>>> tf_model.predict(x_test[:1])
Returns
Nothing.
Return type
None
to_onnx(path: str = None, input_signature=None, X_sample: Optional[Union[Dict, str, List, Tuple, ndarray,
Series, DataFrame]] = None, opset_version=None)
Exports the given TensorFlow model into ONNX format.
Parameters
• path (str, default to None) – Path to save the serialized model.
• input_signature (a tuple or a list of tf.TensorSpec objects. default
to None.) – Define the shape/dtype of the input so that model(input_signature) is a valid
invocation of the model.
• X_sample (Union[list, tuple, pd.Series, np.ndarray, pd.DataFrame].
Defaults to None.) – A sample of input data that will be used to generate input schema
and detect input_signature.
• opset_version (int. Defaults to None.) – The opset to be used for the ONNX
model.
Returns
Nothing
Return type
None
Raises
ValueError – If path is not provided
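A sketch of exporting the TensorFlow model wrapper (tf_model from the examples above) to ONNX; the path and input signature shape are illustrative and depend on your model.
>>> import tensorflow as tf
>>> tf_model.to_onnx(
...     path="./tf_model_artifact/model.onnx",
...     input_signature=[tf.TensorSpec([None, 28, 28], tf.float32)],   # illustrative shape/dtype
... )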
Type
Dict
ds_client
The data science client used by model deployment.
Type
DataScienceClient
estimator
A trained xgboost estimator/model using Xgboost.
Type
Callable
framework
“xgboost”, the framework name of the model.
Type
str
hyperparameter
The hyperparameters of the estimator.
Type
dict
metadata_custom
The model custom metadata.
Type
ModelCustomMetadata
metadata_provenance
The model provenance metadata.
Type
ModelProvenanceMetadata
metadata_taxonomy
The model taxonomy metadata.
Type
ModelTaxonomyMetadata
model_artifact
This is built by calling prepare.
Type
ModelArtifact
model_deployment
A ModelDeployment instance.
Type
ModelDeployment
model_file_name
Name of the serialized model.
Type
str
model_id
The model ID.
Type
str
properties
ModelProperties object required to save and deploy model.
Type
ModelProperties
runtime_info
A RuntimeInfo instance.
Type
RuntimeInfo
schema_input
Schema describes the structure of the input data.
Type
Schema
schema_output
Schema describes the structure of the output data.
Type
Schema
serialize
Whether to serialize the model to pkl file by default. If False, you need to serialize the model manually,
save it under artifact_dir and update the score.py manually.
Type
bool
version
The framework version of the model.
Type
str
delete_deployment(...)
Deletes the current model deployment.
deploy(..., \*\*kwargs)
Deploys a model.
from_model_artifact(uri, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from the specified folder, or zip/tar archive.
from_model_catalog(model_id, model_file_name, artifact_dir, ..., \*\*kwargs)
Loads model from model catalog.
introspect(...)
Runs model introspection.
predict(data, ...)
Returns prediction of input data run against the model deployment endpoint.
prepare(..., \*\*kwargs)
Prepare and save the score.py, serialized model and runtime.yaml file.
reload(...)
Reloads the model artifact files: score.py and the runtime.yaml.
save(..., \*\*kwargs)
Saves model artifacts to the model catalog.
summary_status(...)
Gets a summary table of the current status.
verify(data, ...)
Tests if deployment works in local environment.
Examples
>>> xgboost_model.reload()
>>> xgboost_model.verify(X_test)
>>> xgboost_model.save()
>>> model_deployment = xgboost_model.deploy(wait_for_completion=False)
>>> xgboost_model.predict(X_test)
Initiates an XGBoostModel instance. This class wraps the XGBoost model as an estimator. Its primary purpose is
to hold the trained model and do serialization.
Parameters
• estimator (Callable) – A trained XGBoost estimator/model object.
• artifact_dir (str) – artifact directory to store the files needed for deployment.
• properties ((ModelProperties, optional). Defaults to None.) – ModelProp-
erties object required to save and deploy model.
• auth ((Dict, optional). Defaults to None.) – The default authentication is set us-
ing the ads.set_auth API. If you need to override the default, use ads.common.auth.api_keys
or ads.common.auth.resource_principal to create an appropriate authentication signer and
the kwargs required to instantiate an IdentityClient object.
Returns
XGBoostModel instance.
Return type
XGBoostModel
Examples
>>> xgboost_model.prepare(inference_conda_env="generalml_p37_cpu_v1")
>>> xgboost_model.verify(X_test)
>>> xgboost_model.save()
>>> model_deployment = xgboost_model.deploy()
>>> xgboost_model.predict(X_test)
>>> xgboost_model.delete_deployment()
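A hedged end-to-end sketch of constructing an XGBoostModel from a freshly trained estimator. The iris data, train/test split, and temporary artifact directory are illustrative assumptions; the conda pack slug follows the example above.
>>> import tempfile
>>> import xgboost as xgb
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from ads.model.framework.xgboost_model import XGBoostModel
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> estimator = xgb.XGBClassifier(n_estimators=10).fit(X_train, y_train)
>>> xgboost_model = XGBoostModel(estimator=estimator, artifact_dir=tempfile.mkdtemp())
>>> xgboost_model.prepare(inference_conda_env="generalml_p37_cpu_v1")
>>> xgboost_model.verify(X_test)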
17.1.1.17.1 Submodules
class ads.model.runtime.env_info.EnvInfo
Bases: ABC
Env Info Base class.
classmethod from_path(env_path: str) → EnvInfo
Initiate an object from a conda pack path.
Parameters
env_path (str) – conda pack path.
Returns
An EnvInfo instance.
Return type
EnvInfo
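A small hedged sketch of from_path, assuming InferenceEnvInfo (the concrete subclass referenced by ModelDeploymentDetails below) and a placeholder conda pack path:
>>> from ads.model.runtime.env_info import InferenceEnvInfo
>>> env_info = InferenceEnvInfo.from_path(
...     "oci://<bucket-name>@<namespace>/<conda-pack-prefix>/<slug>"
... )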
class ads.model.runtime.env_info.PACK_TYPE(value)
Bases: Enum
Conda Pack Type
SERVICE_PACK = 'data_science'
USER_CUSTOM_PACK = 'published'
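The enum maps the raw strings stored in runtime.yaml to pack types, for example:
>>> from ads.model.runtime.env_info import PACK_TYPE
>>> PACK_TYPE("data_science")
<PACK_TYPE.SERVICE_PACK: 'data_science'>
>>> PACK_TYPE.USER_CUSTOM_PACK.value
'published'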
class ads.model.runtime.model_deployment_details.ModelDeploymentDetails(inference_conda_env:
~ads.model.runtime.env_info.InferenceEnvInfo = <factory>)
Bases: DataClassSerializable
ModelDeploymentDetails class.
inference_conda_env: InferenceEnvInfo
training_code: TrainingCode
training_conda_env: TrainingEnvInfo
model_deployment: ModelDeploymentDetails
model_provenance: ModelProvenanceDetails
save()
Save the RuntimeInfo object into runtime.yaml file under the artifact directory.
Returns
Nothing.
Return type
None
Raises
DocumentError – Raised when the validation schema is missing, has the wrong format, or
contains errors.
Returns
validation result.
Return type
bool
ads.model.runtime.utils.get_service_packs(namespace: str, bucketname: str) → Tuple[Dict, Dict]
Get the service pack path mapping and service pack slug mapping. Note: deprecated packs are also included.
Parameters
• namespace (str) – namespace of the service pack.
• bucketname (str) – bucketname of the service pack.
Returns
Service pack path mapping (service pack path -> (slug, python version)) and the service pack slug
mapping (service pack slug -> (pack path, python version)).
Return type
(Dict, Dict)
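A hedged usage sketch; the namespace, bucket name, and lookup keys are placeholders following the document's own conventions:
>>> from ads.model.runtime.utils import get_service_packs
>>> service_pack_path_mapping, service_pack_slug_mapping = get_service_packs(
...     namespace="<namespace>",
...     bucketname="<bucket-name>",
... )
>>> slug, python_version = service_pack_path_mapping["<service-pack-path>"]
>>> pack_path, python_version = service_pack_slug_mapping["<slug>"]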
17.1.1.18.1 Submodules
17.1.1.19.1 Submodules
class ads.secrets.secrets.Secret
Bases: object
Base class
serialize(self) → dict
Serializes attributes as a dictionary. Returns a dictionary with the keys that are serializable.
to_dict(self) → dict
Returns a dictionary with the keys that have repr set to True and whose values are not None or empty.
export_dict() → dict
Returns a dictionary with the keys that have repr set to True.
export_options() → list
Returns a list of attributes with the fields that have repr set to True.
export_dict() → dict
Serializes attributes as dictionary.
Returns
returns a dictionary of key/value pairs where the value of the attribute is not None and the field
does not have repr=False
Return type
dict
export_options() → list
Returns list of attributes that have repr=True.
Returns
returns a list of fields that do not have repr=False
Return type
list
serialize() → dict
Serializes attributes as dictionary. An attribute can be marked as not serializable by using metadata field
of the field constructor provided by the dataclasses module.
Returns
returns a dictionary of key/value pairs where the value of the attribute is not None and not empty
and the field does not have metadata = {"serializable": False}. Refer to the dataclass Python
documentation for more details about metadata.
Return type
dict
to_dict() → dict
Serializes attributes as a dictionary. Returns only non-empty attributes.
Returns
returns a dictionary of key/value pairs where the value of the attribute is neither None nor empty
Return type
dict
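A hedged sketch of how these methods interact, assuming Secret subclasses behave as dataclasses, as the field/metadata description above implies. MyAPISecret is a hypothetical subclass used only for illustration.
>>> from dataclasses import dataclass, field
>>> from ads.secrets.secrets import Secret
>>> @dataclass
... class MyAPISecret(Secret):   # hypothetical subclass, for illustration only
...     api_key: str
...     note: str = field(default="", metadata={"serializable": False})
...
>>> secret = MyAPISecret(api_key="<your api key>", note="kept out of the vault")
>>> secret.serialize()        # skips fields marked metadata={"serializable": False}
>>> secret.to_dict()          # keeps non-empty fields that have repr=True
>>> secret.export_options()   # lists the fields that have repr=True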
class ads.secrets.secrets.SecretKeeper(content: Optional[bytes] = None, encoded: Optional[str] = None,
secret_id: Optional[str] = None, export_prefix: str = '',
export_env: bool = False, **kwargs)
Bases: Vault, ContextDecorator
SecretKeeper defines APIs required to serialize and deserialize secrets. Services such as Database, Streaming,
and Git require users to provide credentials. These credentials need to be safely accessed at runtime. OCI Vault
provides a mechanism for safe storage and access. SecretKeeper uses OCI Vault as a backend to store and retrieve
the credentials.
The exact data structure of the credentials varies from service to service.
Parameters
• vault_id ((str, optional). Default None) – ocid of the vault
• key_id ((str, optional). Default None) – ocid of the key that is used for encrypting
the content
• compartment_id ((str, optional). Default None) – ocid of the compartment where
the vault resides. When available in the environment variable NB_SESSION_COMPARTMENT_OCID,
defaults to that value.
• export_prefix (str, Default "") – Prefix to the environment variable that is exported.
• auth (dict, optional) – By default authentication will follow what is configured
using ads.set_auth API. Accepts dict returned from ads.common.auth.api_keys() or
ads.common.auth.resource_principal().
• storage_options (dict, optional) – storage_options dict as required by the fsspec library
• kwargs – key word arguments accepted by the constructor of the class from which this
method is invoked.
Returns
• dict – When called from within a with block, returns a dictionary containing the secret
• ads.secrets.SecretKeeper – When called without using the with statement.
Examples
... export_prefix="mykafka",
... export_env=True
... ) as apisecret:
... import os
... print("Credentials inside environment variable:",
... os.environ.get('mykafka.api_key'))
... print("Credentials inside `apisecret` object: ", apisecret)
Credentials inside environment variable: <your api key>
Credentials inside `apisecret` object: {'api_key': 'your api key'}
required_keys = ['secret_id']
service_name: str
user_name: str
wallet_secret_ids: list
Examples
>>> connection_parameters={
... "user_name":"admin",
... "password":"<your password>",
... "service_name":"service_name_{high|low|med}",
... "wallet_location":"/home/datascience/Wallet_xxxx.zip"
... }
>>> adw_keeper = ADBSecretKeeper(vault_id=vault_id, key_id=key_id, **connection_parameters)
... wallet_location='/home/datascience/Wallet_xxxxxx.zip')
>>> pd.DataFrame.ads.read_sql("select * from ATTRITION_DATA", connection_parameters=myadw_creds.to_dict()).head(2)
Parameters
• user_name ((str, optional). Default None) – user_name of the database
• password ((str, optional). Default None) – password for connecting to the
database
• service_name ((str, optional). Default None) – service name of the ADB in-
stance
• wallet_location ((str, optional). Default None) – full path to the wallet zip file
used for connecting to ADB instance.
• wallet_dir ((str, optional). Default None) – local directory where the extracted
wallet content is saved
• repository_path ((str, optional). Default None.) – Path to credentials reposi-
tory. For more details refer ads.database.connection
• repository_key ((str, optional). Default None.) – Configuration key for loading
the right configuration from repository. For more details refer ads.database.connection
• kwargs – vault_id: str. OCID of the vault where the secret is stored. Required for saving
secret. key_id: str. OCID of the key used for encrypting the secret. Required for saving
secret. compartment_id: str. OCID of the compartment where the vault is located. Required
for saving secret. auth: dict. Dictionary returned from ads.common.auth.api_keys() or
ads.common.auth.resource_principal(). By default, will follow what is set in ads.set_auth.
Use this attribute to override the default.
decode() → ADBSecretKeeper
Converts the content in self.secret to ADBSecret and stores it in self.data.
If the wallet_location is passed through the constructor, it is retained; we do not want to override what the
user has passed in. If the wallet_location was not passed, but the secret has wallet_secret_ids, then the
wallet zip file is generated in the location specified by wallet_dir in the constructor.
Returns
Returns self object
Return type
ADBSecretKeeper
encode(serialize_wallet: bool = False) → ADBSecretKeeper
Prepares content to save in the vault. The user_name, password, and service_name, and the individual files
inside the wallet zip file, are base64 encoded and stored in self.secret.
Parameters
serialize_wallet (bool, optional) – When set to True, loads the wallet zip file and
encodes the content of each file in the zip file.
Returns
Returns self object
Return type
ADBSecretKeeper
save(name: str, description: str, freeform_tags: Optional[dict] = None, defined_tags: Optional[dict] = None,
save_wallet: bool = False) → ADBSecretKeeper
Saves credentials to Vault and returns self.
Parameters
• name (str) – Name of the secret when saved in the Vault.
• description (str) – Description of the secret when saved in the Vault.
• freeform_tags ((dict, optional). Default is None) – freeform_tags to be used
for saving the secret in OCI console.
• defined_tags ((dict, optional). Default is None) – Save the tags under pre-
defined tags in OCI console.
• save_wallet ((bool, optional). Default is False) – If set to True, saves the
contents of the wallet file as separate secret.
Returns
Returns self object
Return type
ADBSecretKeeper
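A hedged end-to-end sketch of storing ADB credentials with the constructor and save() described above. The vault and key OCIDs, secret name, and wallet path are placeholders, and loading the secret back via ADBSecretKeeper.load_secret is an assumption not documented in this section.
>>> import ads
>>> from ads.secrets.adb import ADBSecretKeeper
>>> ads.set_auth("resource_principal")
>>> adw_keeper = ADBSecretKeeper(
...     vault_id="<vault_ocid>",
...     key_id="<key_ocid>",
...     user_name="admin",
...     password="<your password>",
...     service_name="service_name_high",
...     wallet_location="/home/datascience/Wallet_xxxx.zip",
... )
>>> adw_keeper.save(
...     name="adw_creds",
...     description="ADW credentials",
...     save_wallet=True,           # also stores the wallet contents as separate secrets
... )
>>> adw_keeper.secret_id            # assumed to hold the OCID of the stored secret after save()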
class ads.secrets.mysqldb.MySQLDBSecret(user_name: str, password: str, host: str, port: str, database:
Optional[str] = None)
Bases: Secret
Dataclass representing the attributes managed and serialized by MySQLDBSecretKeeper
database: str = None
host: str
password: str
port: str
user_name: str
Examples
>>> connection_parameters={
... "user_name":"<your user name>",
... "password":"<your password>",
... "host":"<db host>",
... "port":"<db port>",
... "database":"<database>",
... }
>>> mysqldb_keeper = MySQLDBSecretKeeper(vault_id=vault_id, key_id=key_id, **connection_parameters)
Parameters
• user_name ((str, optional). Default None) – user_name of the database
• password ((str, optional). Default None) – password for connecting to the
database
• host ((str, optional). Default None) – Database host name
• port ((str, optional). Default 1521) – Port number
• database ((str, optional). Default None) – database name
• repository_path ((str, optional). Default None.) – Path to credentials reposi-
tory. For more details refer ads.database.connection
• repository_key ((str, optional). Default None.) – Configuration key for loading
the right configuration from repository. For more details refer ads.database.connection
• kwargs – vault_id: str. OCID of the vault where the secret is stored. Required for saving
secret. key_id: str. OCID of the key used for encrypting the secret. Required for saving
secret. compartment_id: str. OCID of the compartment where the vault is located. Re-
quired for saving secret. auth: dict. Dictionary returned from ads.common.auth.api_keys() or
ads.common.auth.resource_principal(). By default, will follow what is set in ads.set_auth.
Use this attribute to override the default.
decode() → MySQLDBSecretKeeper
Converts the content in self.encoded to MySQLDBSecret and stores it in self.data.
Returns
Returns self object
Return type
MySQLDBSecretKeeper
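A hedged sketch of saving MySQL credentials; the OCIDs and connection values are placeholders, and save() is assumed to be inherited from SecretKeeper with the name/description signature shown elsewhere in this chapter.
>>> from ads.secrets.mysqldb import MySQLDBSecretKeeper
>>> mysqldb_keeper = MySQLDBSecretKeeper(
...     vault_id="<vault_ocid>",
...     key_id="<key_ocid>",
...     user_name="<your user name>",
...     password="<your password>",
...     host="<db host>",
...     port="3306",
...     database="<database>",
... )
>>> mysqldb_keeper.save(name="mysql_creds", description="MySQL credentials")   # assumed inherited save()
>>> mysqldb_keeper.secret_id                                                   # assumed to hold the secret OCID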
host: str
password: str
port: str
user_name: str
Examples
>>> connection_parameters={
... "user_name":"<your user name>",
... "password":"<your password>",
... "service_name":"service_name",
... "host":"<db host>",
... "port":"<db port>",
... }
>>> oracledb_keeper = OracleDBSecretKeeper(vault_id=vault_id, key_id=key_id, **connection_parameters)
Parameters
• user_name ((str, optional). Default None) – user_name of the database
• password ((str, optional). Default None) – password for connecting to the
database
• service_name ((str, optional). Default None) – service name of the Oracle DB
instance
• sid ((str, optional). Default None) – Provide sid if service name is not available.
• host ((str, optional). Default None) – Database host name
• port ((str, optional). Default 1521) – Port number
• dsn ((str, optional). Default None) – dsn string for connecting with oracledb. Re-
fer cx_Oracle documentation
• repository_path ((str, optional). Default None.) – Path to credentials reposi-
tory. For more details refer ads.database.connection
• repository_key ((str, optional). Default None.) – Configuration key for loading
the right configuration from repository. For more details refer ads.database.connection
• kwargs – vault_id: str. OCID of the vault where the secret is stored. Required for saving
secret. key_id: str. OCID of the key used for encrypting the secret. Required for saving
secret. compartment_id: str. OCID of the compartment where the vault is located. Re-
quired for saving secret. auth: dict. Dictionary returned from ads.common.auth.api_keys() or
ads.common.auth.resource_principal(). By default, will follow what is set in ads.set_auth.
Use this attribute to override the default.
decode() → OracleDBSecretKeeper
Converts the content in self.encoded to OracleDBSecret and stores it in self.data.
Returns
Returns self object
Return type
OracleDBSecretKeeper
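A hedged sketch using the OracleDBSecretKeeper constructor parameters listed above; all values are placeholders and save() is assumed to be inherited from SecretKeeper.
>>> from ads.secrets.oracledb import OracleDBSecretKeeper
>>> oracledb_keeper = OracleDBSecretKeeper(
...     vault_id="<vault_ocid>",
...     key_id="<key_ocid>",
...     user_name="<your user name>",
...     password="<your password>",
...     service_name="<service name>",
...     host="<db host>",
...     port="1521",
... )
>>> oracledb_keeper.save(name="oracledb_creds", description="Oracle DB credentials")   # assumed inherited save()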
keytab_path
Path to the keytab file.
Type
str
keytab_content
Content of the keytab file.
Type
dict
secret_id
secret id where the BDSSecret is stored.
Type
str
hdfs_host: str
hdfs_port: str
hive_host: str
hive_port: str
principal: str
secret_id: str
hdfs_host
hdfs host name from the bds cluster.
Type
str
hive_host
hive host name from the bds cluster.
Type
str
hdfs_port
hdfs port from the bds cluster.
Type
str
hive_port
hive port from the bds cluster.
Type
str
kerb5_path
krb5.conf file path.
Type
str
kerb5_content
Content of the krb5.conf.
Type
dict
keytab_path
Path to the keytab file.
Type
str
keytab_content
Content of the keytab file.
Type
dict
secret_id
secret id where the BDSSecret is stored.
Type
str
kwargs
vault_id
Type
str. OCID of the vault where the secret is stored. Required for saving secret.
key_id
Type
str. OCID of the key used for encrypting the secret. Required for saving secret.
compartment_id
Type
str. OCID of the compartment where the vault is located. Required for saving secret.
auth
Type
dict. Dictionary returned from ads.common.auth.api_keys() or
ads.common.auth.resource_principal(). By default, will follow what is set in ads.set_auth.
Use this attribute to override the default.
Parameters
• principal (str) – The unique identity to which Kerberos can assign tickets.
• hdfs_host (str) – hdfs host name from the bds cluster.
• hive_host (str) – hive host name from the bds cluster.
• hdfs_port (str) – hdfs port from the bds cluster.
• hive_port (str) – hive port from the bds cluster.
• kerb5_path (str) – krb5.conf file path.
• kerb5_content (dict) – Content of the krb5.conf.
• keytab_path (str) – Path to the keytab file.
• keytab_content (dict) – Content of the keytab file.
• keytab_dir ((str, optional).) – Default None. Local directory where the extracted
keytab content is saved.
• secret_id (str) – secret id where the BDSSecret is stored.
kwargs
vault_id: str. OCID of the vault where the secret is stored. Required for saving secret. key_id: str. OCID
of the key used for encrypting the secret. Required for saving secret. compartment_id: str. OCID of the
compartment where the vault is located. Required for saving secret. auth: dict. Dictionary returned from
ads.common.auth.api_keys() or ads.common.auth.resource_principal(). By default, will follow what is set in
ads.set_auth. Use this attribute to override the default.
decode(save_files: bool = True) → ads.secrets.bds.BDSSecretKeeper
Converts the content in self.secret to BDSSecret and stores it in self.data.
If the keytab_path and kerb5_path are passed through the constructor, they are retained; we do not want to
override what the user has passed in. If the keytab_path and kerb5_path are not passed, but the secret has a
secret_id, then the keytab file is generated in the location specified by keytab_path in the constructor.
Returns
Returns self object
Return type
BDSSecretKeeper
encode(serialize: bool = True) → ads.secrets.bds.BDSSecretKeeper
Prepares content to save in vault. The port, host name and the keytab and krb5.config files are base64
encoded and stored in self.secret
Parameters
serialize (bool, optional) – When set to True, loads the keytab and krb5.config file
and encodes the content of both files.
Returns
Returns self object
Return type
BDSSecretKeeper
save(name: str, description: str, freeform_tags: dict = None, defined_tags: dict = None, save_files: bool =
True) → ads.secrets.bds.BDSSecretKeeper
Saves credentials to Vault and returns self.
Parameters
• name (str) – Name of the secret when saved in the Vault.
• description (str) – Description of the secret when saved in the Vault.
• freeform_tags ((dict, optional). Default is None) – freeform_tags to be used
for saving the secret in OCI console.
• defined_tags ((dict, optional). Default is None) – Save the tags under pre-
defined tags in OCI console.
• save_files ((bool, optional). Default is True) – If set to True, saves the con-
tents of the keytab and krb5 files as separate secrets.
Returns
Returns self object
Return type
BDSSecretKeeper
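A hedged sketch of storing BDS cluster credentials with the parameters and save() signature documented above; the OCIDs, host names, ports, and file paths are placeholders.
>>> from ads.secrets.big_data_service import BDSSecretKeeper
>>> bds_keeper = BDSSecretKeeper(
...     vault_id="<vault_ocid>",
...     key_id="<key_ocid>",
...     principal="<user>@<realm>",
...     hdfs_host="<hdfs host>",
...     hive_host="<hive host>",
...     hdfs_port="<hdfs port>",
...     hive_port="<hive port>",
...     keytab_path="<local path to keytab>",
...     kerb5_path="<local path to krb5.conf>",
... )
>>> bds_keeper.save(
...     name="bds_creds",
...     description="BDS cluster credentials",
...     save_files=True,   # also stores the keytab and krb5 file contents as separate secrets
... )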
Examples
... freeform_tags={"gitrepo":"xyz"})
... export_prefix="mygitrepo",
... export_env=True
... ) as authtoken:
... import os
... print("Credentials inside environment variable:", os.environ.get('mygitrepo.auth_token'))
Parameters
• auth_token ((str, optional). Default None) – auth token string that needs to be
stored in the vault
• kwargs – vault_id: str. OCID of the vault where the secret is stored. Required for saving
secret. key_id: str. OCID of the key used for encrypting the secret. Required for saving
secret. compartment_id: str. OCID of the compartment where the vault is located. Re-
quired for saving secret. auth: dict. Dictionary returned from ads.common.auth.api_keys() or
ads.common.auth.resource_principal(). By default, will follow what is set in ads.set_auth.
Use this attribute to override the default.
decode() → AuthTokenSecretKeeper
Converts the content in self.encoded to AuthToken and stores it in self.data.
Returns
Returns the self object after decoding self.encoded and updating self.data.
Return type
AuthTokenSecretKeeper
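A hedged sketch of storing an auth token; the vault and key OCIDs and the secret name are placeholders, and save() is assumed to be inherited from SecretKeeper.
>>> import ads
>>> from ads.secrets.auth_token import AuthTokenSecretKeeper
>>> ads.set_auth("resource_principal")
>>> token_keeper = AuthTokenSecretKeeper(
...     auth_token="<your auth token>",
...     vault_id="<vault_ocid>",
...     key_id="<key_ocid>",
... )
>>> token_keeper.save(name="git_auth_token", description="Token for cloning private repos")
>>> token_keeper.secret_id   # assumed to hold the OCID of the stored secret after save()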
17.1.1.20.1 Submodules
class ads.text_dataset.backends.Base
Bases: object
Base class for backends.
convert_to_text(fhandler: OpenFile, dst_path: str, fname: Optional[str] = None, storage_options:
Optional[Dict] = None) → str
Convert input file to a text file
Parameters
• fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
• dst_path (str) – local folder or cloud storage prefix to save converted text files
• fname (str, optional) – filename for converted output, relative to dirname or prefix,
by default None
• storage_options (dict, optional) – storage options for cloud storage
Returns
path to saved output
Return type
str
get_metadata(fhandler: OpenFile) → Dict
Get metadata of a file.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Returns
dictionary of metadata
Return type
dict
read_line(fhandler: OpenFile) → Generator[Union[str, List[str]], None, None]
Read lines from a file.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Yields
Generator – a generator that yields lines
read_text(fhandler: OpenFile) → Generator[Union[str, List[str]], None, None]
Read entire file into a string.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Yields
Generator – a generator that yields text in the file
class ads.text_dataset.backends.PDFPlumber
Bases: Base
convert_to_text(fhandler, dst_path, fname=None, storage_options=None)
Convert input file to a text file
Parameters
• fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
• dst_path (str) – local folder or cloud storage prefix to save converted text files
• fname (str, optional) – filename for converted output, relative to dirname or prefix,
by default None
• storage_options (dict, optional) – storage options for cloud storage
Returns
path to saved output
Return type
str
get_metadata(fhandler)
Get metadata of a file.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Returns
dictionary of metadata
Return type
dict
read_line(fhandler)
Read lines from a file.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Yields
Generator – a generator that yields lines
read_text(fhandler)
Read entire file into a string.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Yields
Generator – a generator that yields text in the file
class ads.text_dataset.backends.Tika
Bases: Base
convert_to_text(fhandler, dst_path, fname=None, storage_options=None)
Convert input file to a text file
Parameters
get_metadata(fhandler)
Get metadata of a file.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Returns
dictionary of metadata
Return type
dict
read_line(fhandler)
Read lines from a file.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Yields
Generator – a generator that yields lines
read_text(fhandler)
Read entire file into a string.
Parameters
fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec
Yields
Generator – a generator that yields text in the file
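A hedged sketch of using a backend directly with an fsspec OpenFile, as the method signatures above describe; the PDF path is a placeholder.
>>> import fsspec
>>> from ads.text_dataset.backends import PDFPlumber
>>> backend = PDFPlumber()
>>> fhandler = fsspec.open("<path-to>/report.pdf", mode="rb")        # fsspec.core.OpenFile
>>> metadata = backend.get_metadata(fhandler)                        # dict of PDF metadata
>>> text = next(backend.read_text(fhandler))                         # whole document as one string
>>> txt_path = backend.convert_to_text(fhandler, dst_path="./extracted")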
Examples
>>> data_gen = textfactory.format('pdf').option(Options.FILE_NAME).backend('pdfplumber').read_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> textfactory.format('docx').convert_to_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.docx',
...     './extracted',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> textfactory.format('docx').convert_to_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.docx',
...     'oci://<bucket-name>@<namespace>/<out_path>',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> meta_gen = textfactory.format('docx').metadata_schema(
...     'oci://<bucket-name>@<namespace>/papers/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> df = textfactory.format('pdf').engine('pandas').option(Options.FILE_METADATA, {'extract': ['Author']}).read_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
...     total_files=10,
... )
>>> df = textfactory.format('txt').engine('cudf').read_line(
...     'oci://<bucket-name>@<namespace>/<path>/*.log',
...     udf=r'^\[(\S+)\s(\S+)\s(\d+)\s(\d+\:\d+\:\d+)\s(\d+)]\s(\S+)\s(\S+)\s(\S+)\s(\S+)',
...     n_lines_per_file=10,
... )
None
backend(backend: Union[str, Base]) → None
Set backend used for extracting text from files.
Parameters
backend ((str | ads.text_dataset.backends.Base)) – backend for extracting text from raw files.
Return type
None
convert_to_text(src_path: str, dst_path: str, encoding: str = 'utf-8', storage_options: Optional[Dict] =
None) → None
Convert files to plain text files.
Parameters
• src_path (str) – path to source data file(s). can use glob pattern
• dst_path (str) – local folder or cloud storage (e.g., OCI object storage) prefix to save
converted text files
• encoding (str, optional) – encoding for files, by default utf-8
• storage_options (Dict, optional) – storage options for cloud storage, by default
None
Return type
None
engine(eng: str) → None
Set engine for dataloader. Can be pandas or cudf.
Parameters
eng (str) – name of engine
Return type
None
Raises
NotSupportedError – raises error if engine passed in is not supported.
metadata_all(path: str, storage_options: Optional[Dict] = None, encoding: str = 'utf-8') →
Generator[Dict[str, Any], None, None]
Get metadata of all files that matches the given path. Return a generator.
Parameters
• path (str) – path to data files. can use glob pattern.
• storage_options (Dict, optional) – storage options for cloud storage, by default
None
• encoding (str, optional) – encoding of files, by default ‘utf-8’
Returns
generator of extracted metadata from files.
Return type
Generator
Return type
(Generator | DataFrame)
read_text(path: str, udf: Union[str, Callable] = None, total_files: int = None, storage_options: Dict =
None, df_args: Dict = None, encoding: str = 'utf-8') → Union[Generator[Union[str, List[str]],
None, None], DataFrame]
Read each file into a text string. If path matches multiple files, each file corresponds to one record.
Parameters
• path (str) – path to data files. can have glob pattern.
• udf ((callable | str), optional) – user defined function for processing each line,
can be a callable or regex, by default None
• total_files (int, optional) – max number of files to read, by default None
• df_args (dict, optional) – arguments passed to dataframe engine (e.g. pandas), by
default None
• storage_options (dict, optional) – storage options for cloud storage, by default
None
• encoding (str, optional) – encoding of files, by default ‘utf-8’
Returns
returns either a data generator or a dataframe.
Return type
(Generator | DataFrame)
with_processor(processor_type: str) → None
Set file processor.
Parameters
processor_type (str) – type of processor, which corresponds to format of the file.
Return type
None
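Putting these DataLoader methods together, a hedged sketch mirroring the examples above; textfactory is assumed to be an alias for TextDatasetFactory, and the bucket, namespace, and path values are placeholders.
>>> import os
>>> import oci
>>> from ads.text_dataset.dataset import TextDatasetFactory as textfactory
>>> from ads.text_dataset.options import Options
>>> df = textfactory.format('pdf').engine('pandas').option(Options.FILE_NAME).read_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
...     total_files=10,
... )
>>> gen = textfactory.format('pdf').backend('pdfplumber').read_line(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> next(gen)   # first extracted line of the first matching file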
class ads.text_dataset.dataset.TextDatasetFactory
Bases: object
A class that generates a dataloader given a file format.
static format(format_name: str) → DataLoader
Instantiates the DataLoader class and seeds it with the right kind of FileProcessor, e.g. PDFProcessor for pdf.
The FileProcessorFactory returns the processor based on the format type.
Parameters
format_name (str) – name of format
Returns
a DataLoader object.
Return type
ads.text_dataset.dataset.DataLoader
Examples
static get_processor(format)
class ads.text_dataset.options.OptionFactory
Bases: object
static option_handler(option: Options) → OptionHandler
class ads.text_dataset.options.Options(value)
Bases: Enum
An enumeration.
FILE_METADATA = 2
FILE_NAME = 1
17.1.1.21.1 Submodules
17.1.2 Submodules
ads.getLogger(name='ads')
ads.hello()
Imports Pandas, sets the documentation mode, and prints a fancy “Hello”.
ads.set_auth(auth='api_key', oci_config_location='~/.oci/config', profile='DEFAULT')
Enable/disable resource principal identity or keypair identity in a notebook session.
Parameters
• auth ({'api_key', 'resource_principal'}, default 'api_key') – Enable/disable re-
source principal identity or keypair identity in a notebook session
• oci_config_location (str, default oci.config.DEFAULT_LOCATION, which
is '~/.oci/config') – config file location
• profile (str, default 'DEFAULT') – profile name for api keys config file
ads.set_debug_mode(mode=True)
Enable/disable printing stack traces on notebook.
Parameters
mode (bool (default True)) – Enable/disable print stack traces on notebook
ads.set_documentation_mode(mode=False)
This method is deprecated and will be removed in future releases. Enable/disable printing user tips on notebook.
Parameters
mode (bool (default False)) – Enable/disable print user tips on notebook
ads.set_expert_mode()
This method is deprecated and will be removed in future releases. Enables the debug and documentation mode
for expert users all in one method.
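A short hedged sketch of the functions above; the profile name is a placeholder and switching to resource principal applies inside a notebook session or job run:
>>> import ads
>>> ads.set_auth(auth="api_key", oci_config_location="~/.oci/config", profile="DEFAULT")
>>> ads.set_auth(auth="resource_principal")   # e.g., inside a notebook session or job run
>>> ads.set_debug_mode(True)                  # print full stack traces while debugging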
Installation
python3 -m pip install oracle-ads
Source Code
https://github.com/oracle/accelerated-data-science
EIGHTEEN
ADDITIONAL DOCUMENTATION
NINETEEN
EXAMPLES
import ads
import oci
import pandas as pd
ads.set_auth(
auth="api_key", oci_config_location=oci.config.DEFAULT_LOCATION, profile="DEFAULT"
)
bucket_name = "<bucket_name>"
path = "<path>"
namespace = "<namespace>"
df = pd.read_csv(
f"oci://{bucket_name}@{namespace}/{path}", storage_options=ads.auth.default_signer()
)
import ads
import pandas as pd
connection_parameters = {
"user_name": "<user_name>",
"password": "<password>",
"service_name": "<tns_name>",
"wallet_location": "<file_path>",
}
df = pd.DataFrame.ads.read_sql(
"""
SELECT *
FROM SH.SALES
WHERE ROWNUM <= :max_rows
""",
    bind_variables={"max_rows": 100},
    connection_parameters=connection_parameters,
)
TWENTY
CONTRIBUTING
This project welcomes contributions from the community. Before submitting a pull request, please review our
contribution guide CONTRIBUTING.md.
Find Getting Started instructions for developers in README-development.md.
TWENTYONE
SECURITY
Consult the security guide SECURITY.md for our responsible security vulnerability disclosure process.
TWENTYTWO
LICENSE
Copyright (c) 2020, 2022 Oracle and/or its affiliates. Licensed under the Universal Permissive License v1.0
a ads.data_labeling.reader.metadata_reader, 541
ads, 848 ads.data_labeling.reader.record_reader, 543
ads.automl, 460 ads.data_labeling.record, 526
ads.automl.driver, 453 ads.data_labeling.visualizer.image_visualizer,
ads.automl.provider, 454 546
ads.bds, 518 ads.data_labeling.visualizer.text_visualizer,
ads.bds.auth, 516 549
ads.catalog, 473 ads.database, 553
ads.catalog.model, 460 ads.database.connection, 551
ads.catalog.notebook, 467 ads.dataflow, 561
ads.catalog.project, 469 ads.dataflow.dataflow, 553
ads.catalog.summary, 472 ads.dataflow.dataflowsummary, 561
ads.common, 515 ads.dataset, 600
ads.common.auth, 473 ads.dataset.classification_dataset, 561
ads.common.card_identifier, 473 ads.dataset.correlation, 564
ads.common.data, 475 ads.dataset.correlation_plot, 564
ads.common.decorator.deprecate, 499 ads.dataset.dataframe_transformer, 567
ads.common.decorator.runtime_dependency, 497 ads.dataset.dataset, 567
ads.common.function.fn_util, 506 ads.dataset.dataset_browser, 579
ads.common.model, 477 ads.dataset.dataset_with_target, 582
ads.common.model_export_util, 502 ads.dataset.exception, 587
ads.common.model_introspect, 499 ads.dataset.factory, 587
ads.common.model_metadata, 480 ads.dataset.feature_engineering_transformer,
ads.common.model_metadata_mixin, 515 592
ads.common.utils, 506 ads.dataset.feature_selection, 592
ads.config, 847 ads.dataset.forecasting_dataset, 593
ads.data_labeling, 551 ads.dataset.helper, 593
ads.data_labeling.boundingbox, 518 ads.dataset.label_encoder, 596
ads.data_labeling.constants, 521 ads.dataset.pipeline, 596
ads.data_labeling.data_labeling_service, 521 ads.dataset.plot, 596
ads.data_labeling.interface.loader, 518 ads.dataset.progress, 597
ads.data_labeling.interface.parser, 518 ads.dataset.recommendation, 597
ads.data_labeling.interface.reader, 518 ads.dataset.recommendation_transformer, 597
ads.data_labeling.metadata, 523 ads.dataset.regression_dataset, 598
ads.data_labeling.mixin.data_labeling, 527 ads.dataset.sampled_dataset, 598
ads.data_labeling.ner, 525 ads.dataset.target, 599
ads.dataset.timeseries, 600
ads.data_labeling.parser.export_metadata_parser,
529 ads.evaluations, 610
ads.data_labeling.parser.export_record_parser, ads.evaluations.evaluation_plot, 600
530 ads.evaluations.evaluator, 602
ads.data_labeling.reader.dataset_reader, 534 ads.evaluations.statistical_metrics, 608
ads.data_labeling.reader.jsonl_reader, 540 ads.feature_engineering, 691
861
ADS Documentation, Release 2.6.8
ads.feature_engineering.accessor.dataframe_accessor, 666
615 ads.feature_engineering.feature_type.object,
ads.feature_engineering.accessor.mixin.correlation, 669
623 ads.feature_engineering.feature_type.ordinal,
ads.feature_engineering.accessor.mixin.eda_mixin, 670
623 ads.feature_engineering.feature_type.phone_number,
672
ads.feature_engineering.accessor.mixin.eda_mixin_series,
626 ads.feature_engineering.feature_type.string,
675
ads.feature_engineering.accessor.mixin.feature_types_mixin,
627 ads.feature_engineering.feature_type.text,
ads.feature_engineering.accessor.series_accessor, 677
620 ads.feature_engineering.feature_type.unknown,
ads.feature_engineering.adsstring.common_regex_mixin, 679
629 ads.feature_engineering.feature_type.zip_code,
ads.feature_engineering.adsstring.oci_language, 680
630 ads.feature_engineering.feature_type_manager,
ads.feature_engineering.adsstring.string, 630 611
ads.feature_engineering.exceptions, 610 ads.hpo, 706
ads.feature_engineering.feature_type.address, ads.hpo.distributions, 691
630 ads.hpo.search_cv, 694
ads.feature_engineering.feature_type.base, ads.hpo.stopping_criterion, 705
633 ads.jobs, 740
ads.feature_engineering.feature_type.boolean, ads.jobs.ads_job, 706
634 ads.jobs.builders.infrastructure.dataflow,
ads.feature_engineering.feature_type.category, 721
636 ads.jobs.builders.infrastructure.dsc_job, 730
ads.feature_engineering.feature_type.constant, ads.jobs.builders.runtimes.python_runtime,
639 712
ads.model, 772
ads.feature_engineering.feature_type.continuous,
641 ads.model.artifact, 740
ads.model.deployment, 785
ads.feature_engineering.feature_type.creditcard,
643 ads.model.deployment.model_deployer, 772
ads.feature_engineering.feature_type.datetime, ads.model.deployment.model_deployment, 776
647 ads.model.deployment.model_deployment_properties,
ads.feature_engineering.feature_type.discrete, 780
650 ads.model.extractor.automl_extractor, 763
ads.feature_engineering.feature_type.document, ads.model.extractor.keras_extractor, 768
652 ads.model.extractor.lightgbm_extractor, 765
ads.feature_engineering.feature_type.gis, 653 ads.model.extractor.model_info_extractor, 766
ads.feature_engineering.feature_type.handler.feature_validator,
ads.model.extractor.model_info_extractor_factory,
682 762
ads.model.extractor.pytorch_extractor, 770
ads.feature_engineering.feature_type.handler.feature_warning,
687 ads.model.extractor.sklearn_extractor, 767
ads.model.extractor.tensorflow_extractor, 769
ads.feature_engineering.feature_type.handler.warnings,
690 ads.model.extractor.xgboost_extractor, 764
ads.feature_engineering.feature_type.integer, ads.model.framework, 814
657 ads.model.framework.automl_model, 785
ads.model.framework.lightgbm_model, 789
ads.feature_engineering.feature_type.ip_address,
659 ads.model.framework.pytorch_model, 794
ads.model.framework.sklearn_model, 799
ads.feature_engineering.feature_type.ip_address_v4,
661 ads.model.framework.tensorflow_model, 804
ads.model.framework.xgboost_model, 809
ads.feature_engineering.feature_type.ip_address_v6,
663 ads.model.generic_model, 742
ads.feature_engineering.feature_type.lat_long, ads.model.model_properties, 760
ads.model.runtime, 818
ads.model.runtime.env_info, 814
ads.model.runtime.model_deployment_details,
816
ads.model.runtime.model_provenance_details,
816
ads.model.runtime.runtime_info, 761
ads.model.runtime.utils, 817
ads.secrets, 836
ads.secrets.adb, 822
ads.secrets.auth_token, 834
ads.secrets.big_data_service, 830
ads.secrets.mysqldb, 826
ads.secrets.oracledb, 828
ads.secrets.secrets, 818
ads.text_dataset, 846
ads.text_dataset.backends, 836
ads.text_dataset.dataset, 838
ads.text_dataset.extractor, 843
ads.text_dataset.options, 845
ads.vault, 847
ads.vault.vault, 846
A module, 472
ads.common
ACCESS (ads.model.deployment.model_deployment.ModelDeploymentLogType
attribute), 780 module, 515
ads.common.auth
access_log (ads.model.deployment.model_deployment.ModelDeployment
property), 778 module, 473
activate() (ads.catalog.model.Model method), 460, ads.common.card_identifier
461 module, 473
ADBSecret (class in ads.secrets.adb), 822 ads.common.data
ADBSecretKeeper (class in ads.secrets.adb), 822 module, 475
add() (ads.common.model_metadata.ModelCustomMetadata ads.common.decorator.deprecate
method), 483, 484 module, 499
add() (ads.dataset.pipeline.TransformerPipeline ads.common.decorator.runtime_dependency
method), 596 module, 497
add_metrics() (ads.evaluations.evaluator.ADSEvaluator ads.common.function.fn_util
method), 603, 604 module, 506
add_models() (ads.evaluations.evaluator.ADSEvaluator ads.common.model
method), 603, 605 module, 477
ads.common.model_export_util
address (ads.feature_engineering.adsstring.common_regex_mixin.CommonRegexMixin
property), 629 module, 502
ads.common.model_introspect
Address (class in ads.feature_engineering.feature_type.address),
630 module, 499
ads ads.common.model_metadata
module, 848 module, 480
ads.automl ads.common.model_metadata_mixin
module, 460 module, 515
ads.automl.driver ads.common.utils
module, 453 module, 506
ads.automl.provider ads.config
module, 454 module, 847
ads.bds ads.data_labeling
module, 518 module, 551
ads.bds.auth ads.data_labeling.boundingbox
module, 516 module, 518
ads.catalog ads.data_labeling.constants
module, 473 module, 521
ads.catalog.model ads.data_labeling.data_labeling_service
module, 460 module, 521
ads.catalog.notebook ads.data_labeling.interface.loader
module, 467 module, 518
ads.catalog.project ads.data_labeling.interface.parser
module, 469 module, 518
ads.catalog.summary ads.data_labeling.interface.reader
865
ADS Documentation, Release 2.6.8
866 Index
ADS Documentation, Release 2.6.8
Index 867
ADS Documentation, Release 2.6.8
868 Index
ADS Documentation, Release 2.6.8
algorithm() (ads.model.extractor.lightgbm_extractor.LightgbmExtractor
(ads.common.model_metadata.ModelProvenanceMetadata
method), 765 method), 493
algorithm() (ads.model.extractor.model_info_extractor.ModelInfoExtractor
assign_column() (ads.dataset.dataset.ADSDataset
method), 766 method), 567
algorithm() (ads.model.extractor.pytorch_extractor.PyTorchExtractor
astype() (ads.dataset.dataset.ADSDataset method), 568
method), 771 ATTRIBUTE (ads.common.decorator.deprecate.TARGET_TYPE
algorithm() (ads.model.extractor.sklearn_extractor.SklearnExtractor
attribute), 499
method), 767 attribute_map (ads.jobs.builders.infrastructure.dataflow.DataFlow
algorithm() (ads.model.extractor.tensorflow_extractor.TensorflowExtractor
attribute), 721
method), 770 attribute_map (ads.jobs.builders.infrastructure.dsc_job.DataScienceJob
algorithm() (ads.model.extractor.xgboost_extractor.XgboostExtractor
attribute), 733
method), 764 attribute_map (ads.jobs.builders.runtimes.python_runtime.CondaRuntim
annotation (ads.data_labeling.record.Record at- attribute), 712
tribute), 526 attribute_map (ads.jobs.builders.runtimes.python_runtime.DataFlowRun
annotation_type (ads.data_labeling.metadata.Metadata attribute), 714
attribute), 523, 524 attribute_map (ads.jobs.builders.runtimes.python_runtime.GitPythonRun
AnnotationType (class in ads.data_labeling.constants), attribute), 716
521 attribute_map (ads.jobs.builders.runtimes.python_runtime.NotebookRunt
ANOMALY_DETECTION (ads.common.model_metadata.UseCaseType attribute), 718
attribute), 496 attribute_map (ads.jobs.builders.runtimes.python_runtime.PythonRuntim
api_keys() (in module ads.common.auth), 473 attribute), 719
application (ads.jobs.builders.infrastructure.dataflow.DataFlowLogs
attribute_map (ads.jobs.builders.runtimes.python_runtime.ScriptRuntime
property), 727 attribute), 720
archive_bucket (ads.jobs.builders.runtimes.python_runtime.DataFlowRuntime
auth (ads.model.framework.automl_model.AutoMLModel
property), 714 attribute), 785
archive_uri (ads.jobs.builders.runtimes.python_runtime.DataFlowRuntime
auth (ads.model.framework.lightgbm_model.LightGBMModel
property), 714 attribute), 789
AritfactFolderStructureError, 740 auth (ads.model.framework.pytorch_model.PyTorchModel
artifact (ads.jobs.builders.infrastructure.dsc_job.DSCJob attribute), 794
property), 731 auth (ads.model.framework.sklearn_model.SklearnModel
artifact_dir (ads.common.model_metadata.ModelProvenanceMetadata attribute), 799
attribute), 493 auth (ads.model.framework.tensorflow_model.TensorFlowModel
artifact_dir (ads.model.framework.automl_model.AutoMLModel attribute), 804
attribute), 785 auth (ads.model.framework.xgboost_model.XGBoostModel
artifact_dir (ads.model.framework.lightgbm_model.LightGBMModel attribute), 809
attribute), 789 auth (ads.model.generic_model.GenericModel at-
artifact_dir (ads.model.framework.pytorch_model.PyTorchModel tribute), 743
attribute), 794 auth (ads.secrets.big_data_service.BDSSecretKeeper at-
artifact_dir (ads.model.framework.sklearn_model.SklearnModel tribute), 833
attribute), 799 auth_token (ads.secrets.auth_token.AuthToken at-
artifact_dir (ads.model.framework.tensorflow_model.TensorFlowModeltribute), 834
attribute), 804 AuthToken (class in ads.secrets.auth_token), 834
artifact_dir (ads.model.framework.xgboost_model.XGBoostModel
AuthTokenSecretKeeper (class in
attribute), 809 ads.secrets.auth_token), 834
artifact_dir (ads.model.generic_model.GenericModel auto_transform() (ads.dataset.classification_dataset.BinaryTextClassific
attribute), 743 method), 562
artifact_directory (ads.model.runtime.model_provenance_details.TrainingCode
auto_transform() (ads.dataset.classification_dataset.ClassificationDatas
attribute), 816 method), 562
ARTIFACT_TEST_RESULT auto_transform() (ads.dataset.classification_dataset.MultiClassTextClas
(ads.common.model_metadata.MetadataTaxonomyKeys method), 564
attribute), 482 auto_transform() (ads.dataset.dataset_with_target.ADSDatasetWithTarg
ArtifactNestedFolderError, 740 method), 582
ArtifactRequiredFilesError, 740 AutoML (class in ads.automl.driver), 453
assert_path_not_dirty() AutoMLExtractor (class in
Index 869
ADS Documentation, Release 2.6.8
870 Index
ADS Documentation, Release 2.6.8
Index 871
ADS Documentation, Release 2.6.8
872 Index
ADS Documentation, Release 2.6.8
641 (ads.catalog.notebook.NotebookCatalog
convert() (ads.jobs.builders.runtimes.python_runtime.DataFlowNotebookRuntime
method), 467
method), 713 create_project() (ads.catalog.project.ProjectCatalog
convert() (ads.jobs.builders.runtimes.python_runtime.DataFlowRuntime method), 469
method), 714 create_secret() (ads.vault.vault.Vault method), 846
convert_columns() (in module ads.dataset.helper), credit_card (ads.feature_engineering.adsstring.common_regex_mixin.Co
594 property), 629
convert_dataframe_schema() CreditCard (class in ads.feature_engineering.feature_type.creditcard),
(ads.common.model.ADSModel static method), 643
477 CUML (ads.common.model_metadata.Framework at-
convert_to_html() (in module ads.dataset.helper), tribute), 480
594 CustomFormatReaders (class in ads.dataset.factory),
convert_to_text() (ads.text_dataset.backends.Base 587
method), 836
convert_to_text() (ads.text_dataset.backends.PDFPlumber D
method), 837 DATA (ads.common.decorator.runtime_dependency.OptionalDependency
convert_to_text() (ads.text_dataset.backends.Tika attribute), 497
method), 837 database (ads.secrets.mysqldb.MySQLDBSecret at-
convert_to_text() (ads.text_dataset.dataset.DataLoader tribute), 826
method), 840 DataFlow (class in ads.dataflow.dataflow), 553
convert_to_text() (ads.text_dataset.extractor.FileProcessor DataFlow (class in ads.jobs.builders.infrastructure.dataflow),
method), 843 721
convert_to_text_classification() dataflow_job() (ads.jobs.ads_job.Job static method),
(ads.dataset.classification_dataset.ClassificationDataset 708
method), 563 DataFlowApp (class in ads.dataflow.dataflow), 555
copy_file() (in module ads.common.utils), 507 DataFlowApp (class in
copy_from_uri() (in module ads.common.utils), 508 ads.jobs.builders.infrastructure.dataflow),
corr() (ads.dataset.dataset.ADSDataset method), 569 726
correlation_ratio() DataFlowLog (class in ads.dataflow.dataflow), 557
(ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin
DataFlowLogs (class in
method), 623 ads.jobs.builders.infrastructure.dataflow),
correlation_ratio_plot() 727
(ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin
DataFlowNotebookRuntime (class in
method), 624 ads.jobs.builders.runtimes.python_runtime),
cramersv() (ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin713
method), 624 DataFlowRun (class in ads.dataflow.dataflow), 558
cramersv_plot() (ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin
DataFlowRun (class in
method), 624 ads.jobs.builders.infrastructure.dataflow),
create() (ads.jobs.ads_job.Job method), 708 727
create() (ads.jobs.builders.infrastructure.dataflow.DataFlowDataFlowRuntime (class in
method), 722 ads.jobs.builders.runtimes.python_runtime),
create() (ads.jobs.builders.infrastructure.dataflow.DataFlowApp 713
method), 727 DataFrameLabelEncoder (class in
create() (ads.jobs.builders.infrastructure.dataflow.DataFlowRun ads.dataset.label_encoder), 596
method), 728 DataFrameTransformer (class in
create() (ads.jobs.builders.infrastructure.dsc_job.DataScienceJob ads.dataset.dataframe_transformer), 567
method), 733 DataLabeling (class in
create() (ads.jobs.builders.infrastructure.dsc_job.DataScienceJobRun ads.data_labeling.data_labeling_service),
method), 739 521
create() (ads.jobs.builders.infrastructure.dsc_job.DSCJobDataLabelingAccessMixin (class in
method), 731 ads.data_labeling.mixin.data_labeling), 527
create_app() (ads.dataflow.dataflow.DataFlow DataLoader (class in ads.text_dataset.dataset), 838
method), 553 datascience_job() (ads.jobs.ads_job.Job static
create_notebook_session() method), 708
Index 873
ADS Documentation, Release 2.6.8
874 Index
ADS Documentation, Release 2.6.8
Index 875
ADS Documentation, Release 2.6.8
description (ads.common.model_metadata.ModelCustomMetadataItem
DIMENSIONALITY_REDUCTION
property), 487 (ads.common.model_metadata.UseCaseType
description (ads.feature_engineering.feature_type.address.Address attribute), 496
attribute), 630, 631 Discrete (class in ads.feature_engineering.feature_type.discrete),
description (ads.feature_engineering.feature_type.base.FeatureType650
attribute), 633 DiscreteUniformDistribution (class in
description (ads.feature_engineering.feature_type.boolean.Booleanads.hpo.distributions), 691
attribute), 634, 635 Distribution (class in ads.hpo.distributions), 691
description (ads.feature_engineering.feature_type.category.Category
DistributionEncode (class in ads.hpo.distributions),
attribute), 636, 637 691
description (ads.feature_engineering.feature_type.constant.Constant
DLSDatasetReader (class in
attribute), 639 ads.data_labeling.reader.dataset_reader),
description (ads.feature_engineering.feature_type.continuous.Continuous
535
attribute), 641 DLSMetadataReader (class in
description (ads.feature_engineering.feature_type.creditcard.CreditCard
ads.data_labeling.reader.metadata_reader),
attribute), 643, 644 541
description (ads.feature_engineering.feature_type.datetime.DateTime
DOCUMENT (ads.data_labeling.constants.DatasetType at-
attribute), 647, 648 tribute), 521
description (ads.feature_engineering.feature_type.discrete.Discrete
Document (class in ads.feature_engineering.feature_type.document),
attribute), 650 652
description (ads.feature_engineering.feature_type.document.Document
DONE (ads.model.generic_model.ModelState attribute),
attribute), 652 759
description (ads.feature_engineering.feature_type.gis.GISdouble_overlay_plots
attribute), 653, 654 (ads.evaluations.evaluation_plot.EvaluationPlot
description (ads.feature_engineering.feature_type.integer.Integer attribute), 601
attribute), 657 down_sample() (ads.dataset.classification_dataset.ClassificationDataset
description (ads.feature_engineering.feature_type.ip_address.IpAddress
method), 563
attribute), 659, 660 down_sample() (in module ads.dataset.helper), 594
description (ads.feature_engineering.feature_type.ip_address_v4.IpAddressV4
download() (ads.dataset.factory.DatasetFactory static
attribute), 661, 662 method), 588
description (ads.feature_engineering.feature_type.ip_address_v6.IpAddressV6
download() (ads.jobs.ads_job.Job method), 708
attribute), 663, 664 download() (ads.jobs.builders.infrastructure.dsc_job.DataScienceJobRun
description (ads.feature_engineering.feature_type.lat_long.LatLongmethod), 739
attribute), 666, 667 download_artifact()
description (ads.feature_engineering.feature_type.object.Object (ads.jobs.builders.infrastructure.dsc_job.DSCJob
attribute), 669, 670 method), 731
description (ads.feature_engineering.feature_type.ordinal.Ordinal
download_from_web() (in module ads.common.utils),
attribute), 670, 671 508
description (ads.feature_engineering.feature_type.phone_number.PhoneNumber
download_model() (ads.catalog.model.ModelCatalog
attribute), 673 method), 462, 463
description (ads.feature_engineering.feature_type.string.String
driver (ads.jobs.builders.infrastructure.dataflow.DataFlowLogs
attribute), 675 property), 727
description (ads.feature_engineering.feature_type.text.Text
drop_columns() (ads.dataset.dataset.ADSDataset
attribute), 677, 678 method), 569
description (ads.feature_engineering.feature_type.unknown.Unknown
ds_client (ads.model.deployment.model_deployer.ModelDeployer
attribute), 679 attribute), 772
description (ads.feature_engineering.feature_type.zip_code.ZipCode
ds_client (ads.model.deployment.model_deployment.ModelDeployment
attribute), 680 attribute), 777
detect_encoding() (ads.text_dataset.backends.Tika ds_client (ads.model.framework.automl_model.AutoMLModel
method), 838 attribute), 785
df (ads.catalog.project.ProjectSummaryList attribute), ds_client (ads.model.framework.lightgbm_model.LightGBMModel
471 attribute), 789
df_read_functions (ads.dataset.dataset.ADSDataset ds_client (ads.model.framework.pytorch_model.PyTorchModel
attribute), 569 attribute), 795
876 Index
ADS Documentation, Release 2.6.8
Index 877
ADS Documentation, Release 2.6.8
878 Index
ADS Documentation, Release 2.6.8
feature_plot() (ads.feature_engineering.feature_type.gis.GIS
feature_stat() (ads.feature_engineering.feature_type.creditcard.CreditC
static method), 655 static method), 646
feature_plot() (ads.feature_engineering.feature_type.integer.Integer
feature_stat() (ads.feature_engineering.feature_type.datetime.DateTime
method), 657 method), 648
feature_plot() (ads.feature_engineering.feature_type.integer.Integer
feature_stat() (ads.feature_engineering.feature_type.datetime.DateTime
static method), 658 static method), 649
feature_plot() (ads.feature_engineering.feature_type.lat_long.LatLong
feature_stat() (ads.feature_engineering.feature_type.discrete.Discrete
method), 666 method), 650
feature_plot() (ads.feature_engineering.feature_type.lat_long.LatLong
feature_stat() (ads.feature_engineering.feature_type.discrete.Discrete
static method), 668 static method), 651
feature_plot() (ads.feature_engineering.feature_type.ordinal.Ordinal
feature_stat() (ads.feature_engineering.feature_type.gis.GIS
method), 671 method), 653
feature_plot() (ads.feature_engineering.feature_type.ordinal.Ordinal
feature_stat() (ads.feature_engineering.feature_type.gis.GIS
static method), 671 static method), 655
feature_plot() (ads.feature_engineering.feature_type.string.String
feature_stat() (ads.feature_engineering.feature_type.integer.Integer
method), 675 method), 657
feature_plot() (ads.feature_engineering.feature_type.string.String
feature_stat() (ads.feature_engineering.feature_type.integer.Integer
static method), 676 static method), 658
feature_plot() (ads.feature_engineering.feature_type.text.Text
feature_stat() (ads.feature_engineering.feature_type.ip_address.IpAddr
method), 678 method), 659
feature_plot() (ads.feature_engineering.feature_type.text.Text
feature_stat() (ads.feature_engineering.feature_type.ip_address.IpAddr
static method), 678 static method), 660
feature_plot() (ads.feature_engineering.feature_type.zip_code.ZipCode
feature_stat() (ads.feature_engineering.feature_type.ip_address_v4.IpA
method), 680 method), 661
feature_plot() (ads.feature_engineering.feature_type.zip_code.ZipCode
feature_stat() (ads.feature_engineering.feature_type.ip_address_v4.IpA
static method), 681 static method), 662
feature_select() (ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor
feature_stat() (ads.feature_engineering.feature_type.ip_address_v6.IpA
method), 616, 617 method), 664
feature_stat() (ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin
feature_stat() (ads.feature_engineering.feature_type.ip_address_v6.IpA
method), 625 static method), 665
feature_stat() (ads.feature_engineering.accessor.mixin.eda_mixin_series.EDAMixinSeries
feature_stat() (ads.feature_engineering.feature_type.lat_long.LatLong
method), 627 method), 666
feature_stat() (ads.feature_engineering.feature_type.address.Address
feature_stat() (ads.feature_engineering.feature_type.lat_long.LatLong
method), 631 static method), 668
feature_stat() (ads.feature_engineering.feature_type.address.Address
feature_stat() (ads.feature_engineering.feature_type.ordinal.Ordinal
static method), 632 method), 671
feature_stat() (ads.feature_engineering.feature_type.boolean.Boolean
feature_stat() (ads.feature_engineering.feature_type.ordinal.Ordinal
method), 634 static method), 672
feature_stat() (ads.feature_engineering.feature_type.boolean.Boolean
feature_stat() (ads.feature_engineering.feature_type.phone_number.Ph
static method), 635 method), 673
feature_stat() (ads.feature_engineering.feature_type.category.Category
feature_stat() (ads.feature_engineering.feature_type.phone_number.Ph
method), 637 static method), 674
feature_stat() (ads.feature_engineering.feature_type.category.Category
feature_stat() (ads.feature_engineering.feature_type.string.String
static method), 638 method), 675
feature_stat() (ads.feature_engineering.feature_type.constant.Constant
feature_stat() (ads.feature_engineering.feature_type.string.String
method), 639 static method), 676
feature_stat() (ads.feature_engineering.feature_type.constant.Constant
feature_stat() (ads.feature_engineering.feature_type.zip_code.ZipCode
static method), 640 method), 680
feature_stat() (ads.feature_engineering.feature_type.continuous.Continuous
feature_stat() (ads.feature_engineering.feature_type.zip_code.ZipCode
method), 641 static method), 681
feature_stat() (ads.feature_engineering.feature_type.continuous.Continuous
feature_type (ads.feature_engineering.accessor.dataframe_accessor.ADS
static method), 642 attribute), 616
feature_stat() (ads.feature_engineering.feature_type.creditcard.CreditCard
feature_type (ads.feature_engineering.accessor.dataframe_accessor.ADS
method), 644 property), 617
Index 879
ADS Documentation, Release 2.6.8
feature_type (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor attribute), 621
feature_type (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor property), 621
feature_type_description (ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor attribute), 616
feature_type_description (ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor property), 617
feature_type_description (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor attribute), 621
feature_type_description (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor property), 622
feature_type_object() (ads.feature_engineering.feature_type_manager.FeatureTypeManager class method), 613
feature_type_object() (ads.feature_engineering.feature_type_manager.FeatureTypeManager method), 612
feature_type_register() (ads.feature_engineering.feature_type_manager.FeatureTypeManager class method), 613
feature_type_register() (ads.feature_engineering.feature_type_manager.FeatureTypeManager method), 612
feature_type_registered() (ads.feature_engineering.feature_type_manager.FeatureTypeManager class method), 613
feature_type_registered() (ads.feature_engineering.feature_type_manager.FeatureTypeManager method), 612
feature_type_reset() (ads.feature_engineering.feature_type_manager.FeatureTypeManager class method), 613
feature_type_reset() (ads.feature_engineering.feature_type_manager.FeatureTypeManager method), 612
feature_type_unregister() (ads.feature_engineering.feature_type_manager.FeatureTypeManager class method), 614
feature_type_unregister() (ads.feature_engineering.feature_type_manager.FeatureTypeManager method), 612
FeatureBaseType (class in ads.feature_engineering.feature_type.base), 633
FeatureBaseTypeMeta (class in ads.feature_engineering.feature_type.base), 633
FeatureEngineeringTransformer (class in ads.dataset.feature_engineering_transformer), 592
FeatureImportance (class in ads.dataset.feature_selection), 592
FeatureType (class in ads.feature_engineering.feature_type.base), 633
FeatureTypeManager (class in ads.feature_engineering.feature_type_manager), 611
FeatureValidator (class in ads.feature_engineering.feature_type.handler.feature_validator), 682
FeatureValidatorMethod (class in ads.feature_engineering.feature_type.handler.feature_validator), 685
FeatureWarning (class in ads.feature_engineering.feature_type.handler.feature_warning), 688
fetch_log() (ads.dataflow.dataflow.DataFlowRun method), 558
fetch_training_code_details() (ads.common.model_metadata.ModelProvenanceMetadata class method), 493
FILE_METADATA (ads.text_dataset.options.Options attribute), 845
FILE_NAME (ads.text_dataset.options.Options attribute), 845
FileOption (class in ads.text_dataset.options), 845
FileOverwriteError, 506
FileProcessor (class in ads.text_dataset.extractor), 843
FileProcessorFactory (class in ads.text_dataset.extractor), 844
filesystem() (ads.dataset.dataset_browser.DatasetBrowser static method), 579
filter() (ads.catalog.model.ModelSummaryList method), 466
filter() (ads.catalog.notebook.NotebookSummaryList method), 469
filter() (ads.catalog.project.ProjectSummaryList method), 472
filter() (ads.catalog.summary.SummaryList method), 472
filter() (ads.dataflow.dataflowsummary.SummaryList method), 561
filter_list() (ads.dataset.dataset_browser.DatasetBrowser method), 579
first_not_none() (in module ads.common.utils), 509
fit() (ads.automl.provider.AutoMLFeatureSelection method), 454
fit() (ads.automl.provider.AutoMLPreprocessingTransformer method), 455
fit() (ads.automl.provider.BaselineModel method), 457
fit() (ads.common.model_export_util.ONNXTransformer
git_commit (ads.common.model_metadata.ModelProvenanceMetadata attribute), 493
GitHub() (ads.dataset.dataset_browser.DatasetBrowser static method), 579
GitHubDatasets (class in ads.dataset.dataset_browser), 580
GitPythonRuntime (class in ads.jobs.builders.runtimes.python_runtime), 716
H
H20 (ads.common.model_metadata.Framework attribute), 480
halt() (ads.hpo.search_cv.ADSTuner method), 696
HALTED (ads.hpo.search_cv.State attribute), 705
handle() (ads.text_dataset.options.FileOption method), 845
handle() (ads.text_dataset.options.MetadataOption method), 845
handle() (ads.text_dataset.options.OptionHandler method), 845
has_kerberos_ticket() (in module ads.bds.auth), 516
hdfs_host (ads.secrets.big_data_service.BDSSecret attribute), 830, 831
hdfs_host (ads.secrets.big_data_service.BDSSecretKeeper attribute), 831
hdfs_port (ads.secrets.big_data_service.BDSSecret attribute), 830, 831
hdfs_port (ads.secrets.big_data_service.BDSSecretKeeper attribute), 832
head() (ads.dataflow.dataflow.DataFlowLog method), 557
hello() (in module ads), 848
help() (ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor method), 616
help() (ads.feature_engineering.accessor.mixin.feature_types_mixin.ADSFeatureTypesMixin method), 628
help() (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor method), 620
high_cardinality_handler() (in module ads.feature_engineering.feature_type.handler.warnings), 690
highlight_text() (in module ads.common.utils), 511
hive_host (ads.secrets.big_data_service.BDSSecret attribute), 830, 831
hive_host (ads.secrets.big_data_service.BDSSecretKeeper attribute), 832
hive_port (ads.secrets.big_data_service.BDSSecret attribute), 830, 831
hive_port (ads.secrets.big_data_service.BDSSecretKeeper attribute), 832
horizontal_scrollable_div() (in module ads.common.utils), 511
host (ads.secrets.mysqldb.MySQLDBSecret attribute), 826
host (ads.secrets.oracledb.OracleDBSecret attribute), 828
human_size() (in module ads.common.utils), 511
hyperparameter (ads.model.extractor.automl_extractor.AutoMLExtractor property), 763
hyperparameter (ads.model.extractor.keras_extractor.KerasExtractor property), 769
hyperparameter (ads.model.extractor.lightgbm_extractor.LightgbmExtractor property), 765
hyperparameter (ads.model.extractor.pytorch_extractor.PyTorchExtractor property), 771
hyperparameter (ads.model.extractor.sklearn_extractor.SklearnExtractor property), 768
hyperparameter (ads.model.extractor.tensorflow_extractor.TensorflowExtractor property), 770
hyperparameter (ads.model.extractor.xgboost_extractor.XgboostExtractor property), 764
hyperparameter (ads.model.framework.automl_model.AutoMLModel attribute), 786
hyperparameter (ads.model.framework.lightgbm_model.LightGBMModel attribute), 790
hyperparameter (ads.model.framework.pytorch_model.PyTorchModel attribute), 795
hyperparameter (ads.model.framework.sklearn_model.SklearnModel attribute), 800
hyperparameter (ads.model.framework.tensorflow_model.TensorFlowModel attribute), 805
hyperparameter (ads.model.framework.xgboost_model.XGBoostModel attribute), 810
hyperparameter (ads.model.generic_model.GenericModel attribute), 743
hyperparameter() (ads.model.extractor.lightgbm_extractor.LightgbmExtractor method), 765
hyperparameter() (ads.model.extractor.model_info_extractor.ModelInfoExtractor method), 766
hyperparameter() (ads.model.extractor.pytorch_extractor.PyTorchExtractor method), 771
hyperparameter() (ads.model.extractor.sklearn_extractor.SklearnExtractor method), 768
hyperparameter() (ads.model.extractor.tensorflow_extractor.TensorflowExtractor method), 770
hyperparameter() (ads.model.extractor.xgboost_extractor.XgboostExtractor method), 764
HYPERPARAMETERS (ads.common.model_metadata.MetadataTaxonomyKeys attribute), 482
I
id (ads.jobs.ads_job.Job property), 709
identify_issue_network() (ads.common.card_identifier.card_identify method), 473
ads.model.extractor.sklearn_extractor, 767
ads.model.extractor.tensorflow_extractor, 769
ads.model.extractor.xgboost_extractor, 764
ads.model.framework, 814
ads.model.framework.automl_model, 785
ads.model.framework.lightgbm_model, 789
ads.model.framework.pytorch_model, 794
ads.model.framework.sklearn_model, 799
ads.model.framework.tensorflow_model, 804
ads.model.framework.xgboost_model, 809
ads.model.generic_model, 742
ads.model.model_properties, 760
ads.model.runtime, 818
ads.model.runtime.env_info, 814
ads.model.runtime.model_deployment_details, 816
ads.model.runtime.model_provenance_details, 816
ads.model.runtime.runtime_info, 761, 817
ads.model.runtime.utils, 817
ads.secrets, 836
ads.secrets.adb, 822
ads.secrets.auth_token, 834
ads.secrets.big_data_service, 830
ads.secrets.mysqldb, 826
ads.secrets.oracledb, 828
ads.secrets.secrets, 818
ads.text_dataset, 846
ads.text_dataset.backends, 836
ads.text_dataset.dataset, 838
ads.text_dataset.extractor, 843
ads.text_dataset.options, 845
ads.vault, 847
ads.vault.vault, 846
MULTI_CLASS_CLASSIFICATION (ads.common.utils.ml_task_types attribute), 512
MULTI_CLASS_TEXT_CLASSIFICATION (ads.common.utils.ml_task_types attribute), 512
MULTI_LABEL (ads.data_labeling.constants.AnnotationType attribute), 521
MultiClassClassificationDataset (class in ads.dataset.classification_dataset), 564
MultiClassTextClassificationDataset (class in ads.dataset.classification_dataset), 564
MultiLabelRecordParser (class in ads.data_labeling.parser.export_record_parser), 530
MULTINOMIAL_CLASSIFICATION (ads.common.model_metadata.UseCaseType attribute), 496
MXNET (ads.common.model_metadata.Framework attribute), 481
MYSQL (ads.common.decorator.runtime_dependency.OptionalDependency attribute), 497
MySQLDBSecret (class in ads.secrets.mysqldb), 826
MySQLDBSecretKeeper (class in ads.secrets.mysqldb), 826
N
n_trials (ads.hpo.search_cv.ADSTuner property), 697
name (ads.dataset.helper.ElaboratedPath property), 593
name (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor attribute), 620
name (ads.feature_engineering.feature_type.address.Address attribute), 630
name (ads.feature_engineering.feature_type.base.FeatureType attribute), 633
name (ads.feature_engineering.feature_type.boolean.Boolean attribute), 634
name (ads.feature_engineering.feature_type.category.Category attribute), 637
name (ads.feature_engineering.feature_type.constant.Constant attribute), 639
name (ads.feature_engineering.feature_type.continuous.Continuous attribute), 641
name (ads.feature_engineering.feature_type.creditcard.CreditCard attribute), 643
name (ads.feature_engineering.feature_type.datetime.DateTime attribute), 647
name (ads.feature_engineering.feature_type.discrete.Discrete attribute), 650
name (ads.feature_engineering.feature_type.document.Document attribute), 652
name (ads.feature_engineering.feature_type.gis.GIS attribute), 653
name (ads.feature_engineering.feature_type.integer.Integer attribute), 657
name (ads.feature_engineering.feature_type.ip_address.IpAddress attribute), 659
name (ads.feature_engineering.feature_type.ip_address_v4.IpAddressV4 attribute), 661
name (ads.feature_engineering.feature_type.ip_address_v6.IpAddressV6 attribute), 663
name (ads.feature_engineering.feature_type.lat_long.LatLong attribute), 666
name (ads.feature_engineering.feature_type.object.Object attribute), 670
name (ads.feature_engineering.feature_type.ordinal.Ordinal attribute), 670
name (ads.feature_engineering.feature_type.phone_number.PhoneNumber attribute), 673
project_ocid (ads.model.runtime.model_provenance_details.ModelProvenanceDetails attribute), 816
ProjectCatalog (class in ads.catalog.project), 469
ProjectSummaryList (class in ads.catalog.project), 471
properties (ads.model.deployment.model_deployment.ModelDeployment attribute), 776
properties (ads.model.framework.automl_model.AutoMLModel attribute), 787
properties (ads.model.framework.lightgbm_model.LightGBMModel attribute), 791
properties (ads.model.framework.pytorch_model.PyTorchModel attribute), 796
properties (ads.model.framework.sklearn_model.SklearnModel attribute), 801
properties (ads.model.framework.tensorflow_model.TensorFlowModel attribute), 806
properties (ads.model.framework.xgboost_model.XGBoostModel attribute), 811
properties (ads.model.generic_model.GenericModel attribute), 744
PROPHET (ads.common.model_metadata.Framework attribute), 481
PYMC3 (ads.common.model_metadata.Framework attribute), 481
PYOD (ads.common.model_metadata.Framework attribute), 481
PYSTAN (ads.common.model_metadata.Framework attribute), 481
PythonRuntime (class in ads.jobs.builders.runtimes.python_runtime), 719
PYTORCH (ads.common.decorator.runtime_dependency.OptionalDependency attribute), 498
PYTORCH (ads.common.model_metadata.Framework attribute), 481
PyTorchExtractor (class in ads.model.extractor.pytorch_extractor), 770
PytorchExtractor (class in ads.model.extractor.pytorch_extractor), 771
PyTorchModel (class in ads.model.framework.pytorch_model), 794
R
random_valid_ocid() (in module ads.common.utils), 513
raw_metrics (ads.evaluations.evaluator.ADSEvaluator property), 607
read() (ads.data_labeling.interface.reader.Reader method), 518
read() (ads.data_labeling.reader.dataset_reader.DLSDatasetReader method), 535, 536
read() (ads.data_labeling.reader.dataset_reader.ExportReader method), 536, 537
read() (ads.data_labeling.reader.dataset_reader.LabeledDatasetReader method), 537, 539
read() (ads.data_labeling.reader.jsonl_reader.JsonlReader method), 540
read() (ads.data_labeling.reader.metadata_reader.DLSMetadataReader method), 541
read() (ads.data_labeling.reader.metadata_reader.ExportMetadataReader method), 542
read() (ads.data_labeling.reader.metadata_reader.MetadataReader method), 543
read() (ads.data_labeling.reader.record_reader.RecordReader method), 546
read_arff() (ads.dataset.factory.CustomFormatReaders static method), 587
read_avro() (ads.dataset.factory.CustomFormatReaders static method), 587
read_html() (ads.dataset.factory.CustomFormatReaders static method), 587
read_json() (ads.dataset.factory.CustomFormatReaders static method), 587
read_labeled_data() (ads.data_labeling.mixin.data_labeling.DataLabelingAccessMixin static method), 527
read_libsvm() (ads.dataset.factory.CustomFormatReaders static method), 588
read_line() (ads.text_dataset.backends.Base method), 836
read_line() (ads.text_dataset.backends.PDFPlumber method), 837
read_line() (ads.text_dataset.backends.Tika method), 838
read_line() (ads.text_dataset.dataset.DataLoader method), 841
read_line() (ads.text_dataset.extractor.FileProcessor method), 843
read_log() (ads.dataset.factory.CustomFormatReaders static method), 588
read_sql() (ads.dataset.factory.CustomFormatReaders class method), 588
read_text() (ads.text_dataset.backends.Base method), 836
read_text() (ads.text_dataset.backends.PDFPlumber method), 837
read_text() (ads.text_dataset.backends.Tika method), 838
read_text() (ads.text_dataset.dataset.DataLoader method), 842
read_text() (ads.text_dataset.extractor.FileProcessor method), 844
read_tsv() (ads.dataset.factory.CustomFormatReaders static method), 588
read_xml() (ads.dataset.factory.CustomFormatReaders static method), 588
ReadDatasetError, 543
schema_input (ads.model.framework.pytorch_model.PyTorchModel attribute), 796
schema_input (ads.model.framework.sklearn_model.SklearnModel attribute), 801
schema_input (ads.model.framework.tensorflow_model.TensorFlowModel attribute), 806
schema_input (ads.model.framework.xgboost_model.XGBoostModel attribute), 811
schema_input (ads.model.generic_model.GenericModel attribute), 744
schema_output (ads.model.framework.automl_model.AutoMLModel attribute), 787
schema_output (ads.model.framework.lightgbm_model.LightGBMModel attribute), 791
schema_output (ads.model.framework.pytorch_model.PyTorchModel attribute), 796
schema_output (ads.model.framework.sklearn_model.SklearnModel attribute), 801
schema_output (ads.model.framework.tensorflow_model.TensorFlowModel attribute), 806
schema_output (ads.model.framework.xgboost_model.XGBoostModel attribute), 811
schema_output (ads.model.generic_model.GenericModel attribute), 744
SchemaValidator (class in ads.model.runtime.utils), 817
SCIKIT_LEARN (ads.common.model_metadata.Framework attribute), 481
score() (ads.common.model.ADSModel method), 479
score_remaining (ads.hpo.search_cv.ADSTuner property), 699
ScoreValue (class in ads.hpo.stopping_criterion), 705
scoring_name (ads.hpo.search_cv.ADSTuner property), 700
script_bucket (ads.jobs.builders.runtimes.python_runtime.DataFlowRuntime property), 714
script_uri (ads.jobs.builders.runtimes.python_runtime.DataFlowRuntime property), 714
script_uri (ads.jobs.builders.runtimes.python_runtime.ScriptRuntime property), 720
ScriptRuntime (class in ads.jobs.builders.runtimes.python_runtime), 719
seaborn() (ads.dataset.dataset_browser.DatasetBrowser static method), 580
SeabornDatasets (class in ads.dataset.dataset_browser), 581
search_space() (ads.hpo.search_cv.ADSTuner method), 700
Secret (class in ads.secrets.secrets), 818
secret_id (ads.secrets.big_data_service.BDSSecret attribute), 831
secret_id (ads.secrets.big_data_service.BDSSecretKeeper attribute), 832
SecretKeeper (class in ads.secrets.secrets), 819
select_best_features() (ads.dataset.classification_dataset.BinaryTextClassificationDataset method), 562
select_best_features() (ads.dataset.classification_dataset.MultiClassTextClassificationDataset method), 564
select_best_features() (ads.dataset.dataset_with_target.ADSDatasetWithTarget method), 584
select_best_features() (ads.dataset.forecasting_dataset.ForecastingDataset method), 593
select_best_plot() (ads.dataset.plot.Plotting method), 596
selected_model_name() (ads.automl.provider.OracleAutoMLProvider method), 459
selected_score_label() (ads.automl.provider.OracleAutoMLProvider method), 459
SENTIMENT_ANALYSIS (ads.common.model_metadata.UseCaseType attribute), 497
serialize (ads.model.framework.automl_model.AutoMLModel attribute), 787
serialize (ads.model.framework.lightgbm_model.LightGBMModel attribute), 791
serialize (ads.model.framework.pytorch_model.PyTorchModel attribute), 796
serialize (ads.model.framework.sklearn_model.SklearnModel attribute), 801
serialize (ads.model.framework.tensorflow_model.TensorFlowModel attribute), 806
serialize (ads.model.framework.xgboost_model.XGBoostModel attribute), 811
serialize (ads.model.generic_model.GenericModel attribute), 744
serialize() (ads.secrets.secrets.Secret method), 818, 819
serialize_model() (ads.model.framework.automl_model.AutoMLModel method), 789
serialize_model() (ads.model.framework.lightgbm_model.LightGBMModel method), 793
serialize_model() (ads.model.framework.pytorch_model.PyTorchModel method), 798
serialize_model() (ads.model.framework.sklearn_model.SklearnModel method), 803
serialize_model() (ads.model.framework.tensorflow_model.TensorFlowModel method), 808
serialize_model() (ads.model.framework.xgboost_model.XGBoostModel method), 813
serialize_model() (ads.model.generic_model.GenericModel method), 758
serialize_model() (in module
T
Tag (class in ads.feature_engineering.feature_type.base), 633
tags (ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor attribute), 616
tags (ads.feature_engineering.accessor.dataframe_accessor.ADSDataFrameAccessor property), 619
tags (ads.feature_engineering.accessor.series_accessor.ADSSeriesAccessor attribute), 620
tail() (ads.dataflow.dataflow.DataFlowLog method), 558
TARGET_TYPE (class in ads.common.decorator.deprecate), 499
TargetVariable (class in ads.dataset.target), 599
template() (ads.dataflow.dataflow.DataFlow method), 555
tenancy_ocid (ads.model.runtime.model_provenance_details.ModelProvenanceDetails attribute), 816
TENSORFLOW (ads.common.decorator.runtime_dependency.OptionalDependency attribute), 498
TENSORFLOW (ads.common.model_metadata.Framework attribute), 481
TensorflowExtractor (class in ads.model.extractor.tensorflow_extractor), 769
TensorFlowModel (class in ads.model.framework.tensorflow_model), 804
TERMINAL_STATES (ads.jobs.builders.infrastructure.dsc_job.DataScienceJobRun attribute), 739
terminate() (ads.hpo.search_cv.ADSTuner method), 700
TERMINATED (ads.hpo.search_cv.State attribute), 705
TERMINATED_STATES (ads.jobs.builders.infrastructure.dataflow.DataFlowRun attribute), 728
test_data (ads.evaluations.evaluator.ADSEvaluator attribute), 602
TEST_STATUS (class in ads.common.model_introspect), 501
TEXT (ads.common.decorator.runtime_dependency.OptionalDependency attribute), 498
TEXT (ads.data_labeling.constants.DatasetType attribute), 521
Text (class in ads.feature_engineering.feature_type.text), 677
TextDatasetFactory (class in ads.text_dataset.dataset), 842
TextLabeledDataFormatter (class in ads.data_labeling.visualizer.text_visualizer), 550
TEXTSELECTION (ads.data_labeling.parser.export_record_parser.EntityType attribute), 530
Tika (class in ads.text_dataset.backends), 837
time (ads.feature_engineering.adsstring.common_regex_mixin.CommonRegexMixin property), 630
time_elapsed (ads.hpo.search_cv.ADSTuner property), 701
time_remaining (ads.hpo.search_cv.ADSTuner property), 701
TIME_SERIES_FORECASTING (ads.common.model_metadata.UseCaseType attribute), 497
time_since_resume (ads.hpo.search_cv.ADSTuner property), 701
TimeBudget (class in ads.hpo.stopping_criterion), 706
Timeseries (class in ads.dataset.timeseries), 600
timeseries() (ads.dataset.sampled_dataset.PandasDataset method), 599
to_avro() (ads.dataset.dataset.ADSDataset method), 574
to_csv() (ads.dataset.dataset.ADSDataset method), 575
to_dask() (ads.dataset.dataset.ADSDataset method), 575
to_dask_dataframe() (ads.dataset.dataset.ADSDataset method), 576
to_dataframe() (ads.catalog.model.Model method), 460, 462
to_dataframe() (ads.catalog.summary.SummaryList method), 472
to_dataframe() (ads.common.model_introspect.ModelIntrospect method), 500, 501
to_dataframe() (ads.common.model_metadata.ModelCustomMetadata method), 483, 486
to_dataframe() (ads.common.model_metadata.ModelMetadata method), 488, 489
to_dataframe() (ads.common.model_metadata.ModelTaxonomyMetadata method), 494, 495
to_dataframe() (ads.data_labeling.metadata.Metadata method), 524
to_dataframe() (ads.dataflow.dataflowsummary.SummaryList method), 561
to_dataframe() (in module ads.common.utils), 514
to_dict() (ads.common.model_metadata.ModelCustomMetadata method), 483
to_dict() (ads.common.model_metadata.ModelCustomMetadataItem method), 487
to_dict() (ads.common.model_metadata.ModelMetadata method), 488, 489
to_dict() (ads.common.model_metadata.ModelMetadataItem method), 491
to_dict() (ads.common.model_metadata.ModelTaxonomyMetadata method), 494
to_dict() (ads.common.model_metadata.ModelTaxonomyMetadataItem method), 495
to_dict() (ads.data_labeling.metadata.Metadata method), 524
to_dict() (ads.data_labeling.record.Record method),
zip_code (ads.feature_engineering.adsstring.common_regex_mixin.CommonRegexMixin property), 630
ZipCode (class in ads.feature_engineering.feature_type.zip_code), 680