Data Science Day 1
[Timeline: 1700s, astronomer Tobias Mayer makes a quantitative argument that more data is better and is considered the first unofficial data scientist; 1952, IBM pioneer Arthur Samuel coins the term “machine learning”; 1997, chess grandmaster Garry Kasparov is easily defeated by an IBM supercomputer in 19 moves; 2008, Dr. DJ Patil and Jeff Hammerbacher coin the term “data science”; today, data experts learn how to build machine learning models.]
In the 1300s, William of Ockham, a philosopher and friar, argued that scientists should prefer simpler theories
over more complex ones. The principle that bears his name, known as “Ockham's razor,” can be applied to
machine learning by looking for the simplest solution.
In the 1750s, astronomer Tobias Mayer made a quantitative argument that more data is better. While studying
the motions of the moon, he collected nine times as many data points as necessary, claiming this made his
observations more accurate. Because of this, he is often considered the first true “data scientist.”
In 1952, Arthur Samuel, an IBM pioneer in computing, gaming, and AI, coined the term “machine learning.” He
designed a checkers-playing program and discovered that the more the computer played, the more winning
strategies it learned from experience.
In 1962, mathematician John W. Tukey predicted the effect of modern-day electronic computing on data analysis
as an empirical science. However, Tukey’s predictions occurred decades before the explosion of big data and the
ability to perform complex and large-scale analyses.
In 1997, an IBM supercomputer called Deep Blue defeated chess grandmaster Garry Kasparov in only 19 moves,
and Kasparov resigned after the match. The highly advanced supercomputer could calculate as many as 100 billion to
200 billion positions in the three minutes traditionally allotted to a player per move in standard chess.
In 2008, Dr. DJ Patil of LinkedIn and Jeff Hammerbacher of Facebook coined the term “data science” to describe
an emerging field of study focused on teasing out the hidden value in data collected from the retail and business
sectors.
Organizations have many types of unique, and often unstructured, data from many
different sources: things like equipment sensors, mobile apps, social media, customer
interactions via voice and text, videos, images, documents, and more.
Organizations want to use all data to produce new insights and new data products. They
want to improve their business operations by creating better customer experiences,
anticipating service demand, and preventing avoidable equipment outages.
The next generation of business problems, or scenarios, requires being able to use all data, and that calls for the
capabilities provided by data science, machine learning, and AI to understand and use that data.
Importance of Data Science and AI
[Diagram: the “all data” opportunity. Next-gen scenarios must be able to use and understand data of increasing complexity, from structured DB/app data through semi-structured to unstructured data; examples include orders, retention details, weather, DNA, phone call scripts, medical forms, and social data.]
What is Oracle AI?
Unified cloud services for AI and machine learning (ML)
[Diagram layers, top to bottom:
Applications
AI Services: Oracle Digital Assistant, OCI Language, OCI Speech, OCI Vision, OCI Anomaly Detection, OCI Forecasting
ML Services: Oracle Cloud Infrastructure Data Science, Machine Learning in Oracle Database, OCI Data Labeling
Data]
“Oracle AI” is this portfolio of cloud services for helping organizations take advantage of all data for the next
generation of scenarios.
The foundation of all of this is data. Obviously, AI and machine learning work on data and require data.
The top layer of this diagram is applications, and this loosely refers to all the ways AI is consumed. That could be an
application, a business process, or an analytics system.
Between the application and data layers you see two groups here, the AI services (on top) and the machine learning
services (on the bottom). The difference between the two groups is that machine learning services are used primarily
by data scientists to build, train, deploy, and manage machine learning models.
Data scientists can work with familiar open-source frameworks in OCI Data Science, and that’s the cloud service that’s
the focus of this course. Data scientists and database specialists can take advantage of machine learning algorithms
built in to Oracle Database. An important service that supports both the machine learning and AI services is OCI Data
Labeling, because when you’re building machine learning models that work on images, text, or speech, you need
labeled data to train the models.
The AI services contain prebuilt machine learning models for specific uses. Some of the AI services are pretrained, and
some are trained by the customer with their own data. All are used by simply calling the API for the service, passing in
data to be processed; the service returns a result.
OCI Services That Support AI and ML
[Diagram: the AI services are supported by Streaming, Data Integration, Data Catalog, GoldenGate and Oracle Data Integrator, Big Data Service, Data Flow, Object Storage, and Autonomous Database, all running on the cloud infrastructure.]
• Business analytics and graph analytics, and many forms of data integration and data
management, all running on the basic cloud infrastructure.
What is Oracle Cloud Infrastructure Data Science?
[Diagram: a JupyterLab notebook interface and model explanation tools running on managed infrastructure: CPU, GPU, storage, and network.]
Core Principles of Oracle Cloud Infrastructure Data Science
Accelerated
Collaborative
Enterprise-Grade
Accelerated
Allow data scientists to work how they want, and provide access to automated workflows, the best of open-source
libraries, and a streamlined approach to building models
The first principle is about accelerating the work of the individual data scientist. Data scientists coming out of
universities today have been trained using open-source tools and that's what they're most comfortable with. But using
open-source tools on a laptop means managing lots of libraries from different sources and being limited to the compute
power on the laptop.
OCI Data Science provides data scientists with open-source libraries along with easy access to a range of compute
power without having to manage any infrastructure. It also includes Oracle's own library to help streamline many
aspects of their work.
Collaborative
Enable data science teams to work together, with ways to share and reproduce models in a structured, secure way for
enterprise-grade results
The second principle is collaboration. It goes beyond individual data scientist productivity to enable data science
teams to work together.
This is done through the sharing of assets, reducing duplicative work, and supporting reproducibility and auditability
of models for collaboration and risk management.
Enterprise-Grade
Provide a fully managed platform built to meet the needs of the modern enterprise
The third principle is about being enterprise-grade. That means it's integrated with all the OCI
security and access protocols. The underlying infrastructure is fully managed. The customer doesn't
have to think about provisioning compute and storage, and the service handles all the maintenance,
patching, and upgrades so users can focus on solving business problems with data science.
It serves data scientists throughout the full machine learning life cycle, with support for Python and open-source libraries. Users work in a familiar JupyterLab notebook interface where they write Python code.
[Diagram: key concepts of the service: notebook sessions (JupyterLab, pre-installed libraries, conda environments), the Accelerated Data Science (ADS) SDK, models, the model catalog, model deployments, jobs, and projects.]
Projects
• Projects are containers that enable data science teams to organize their work. They represent
collaborative workspaces for organizing and documenting data science assets, such as notebook sessions
and models. Note that a tenancy can have as many projects as needed without limits.
Notebook Sessions
• Notebook sessions are where data scientists work. Notebook sessions provide a JupyterLab environment
with pre-installed open-source libraries and the ability to add others. Notebook sessions are interactive
coding environments for building and training models. Notebook sessions run in managed infrastructure
and the user can select CPUs or GPUs, the compute shape, and amount of storage without having to do
any manual provisioning of environments.
Conda Environments
• Conda is an open-source environment and package management system and was
created for Python programs. It is used in the service to quickly install, run, and update
packages and their dependencies. Conda easily creates, saves, loads, and switches
between environments in your notebook sessions.
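As a brief, hypothetical illustration (the environment name and packages below are examples, not part of the service), a conda environment can be described declaratively in an environment.yml file:

```yaml
# Hypothetical environment definition; name, channel, and versions are illustrative.
name: my-ds-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
```

Running `conda env create -f environment.yml` builds the environment, and `conda activate my-ds-env` switches the session to it.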
OCI Console
Provides easy-to-use browser-based interface; enables access to notebook sessions and all service
features
The most common method is the OCI Console. The OCI Console provides an easy-to-use, browser-
based interface that enables access to Notebook sessions and all the features of the service.
Language SDKs
Provides programming language SDKs for Java, Python, .NET, Go, Ruby, and
TypeScript/JavaScript
OCI also provides programming language SDKs for Java, Python, TypeScript/JavaScript, .NET, Go,
and Ruby. These SDKs enable the user to write code to manage Data Science resources. We’ll
provide some examples of how the Python SDK can be used to deploy models and create jobs.
REST API
Provides access to service functionality; requires programming expertise
The REST API provides access to service functionality but requires programming expertise. An API
reference is provided in the product documentation.
CLI
Provides quick access and full functionality without the need for scripting
The Command Line Interface provides both quick access and full functionality without the
need for scripting.
Where to Find Data Science: OCI Region Availability
Oracle is frequently adding new regions, so visit Oracle.com/cloud to get the latest information on cloud regions.
Commercial regions
Government regions
Dedicated regions
• Oracle Cloud Infrastructure ADS SDK covers the end-to-end life cycle of machine learning models, from data acquisition to model evaluation.
Accelerated Data Science (ADS) SDK is an Oracle Cloud Infrastructure Data Science and Machine Learning
SDK (software development kit). It covers the end-to-end life cycle of machine learning models, from data
acquisition to model evaluation.
Ways to Access ADS SDK
[Diagram: ADS SDK can be accessed through conda environments, a local environment installation, or a container (modules, libraries, interpreter, programs).]
There are many helpful features of ADS SDK. Some of them include:
ADS supports loading data from multiple sources. Here is a list of data sources you
can load data from:
• Local storage
• Object storage
• Oracle Database
• Other cloud providers such as S3, Google Cloud Service, Azure
• MongoDB and NoSQL DB instance
• OCI Big Data service / HDFS
• HTTP(S) sources
• Elasticsearch instances
• Blob
Data Visualization
• Data visualization is an important part in performing exploratory data analysis (EDA) to help
gain a better understanding of the data set you are working with.
• ADS has a method show_in_notebook() that automatically creates visualizations for a data
set.
• It provides basic information about a data set including:
• Predictive data type (i.e., binary classification, multi-class classification, regression)
• Number of columns and rows of the data set
• Summary visualization of each feature
• Correlation map of the features
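To make the idea concrete, here is a minimal, plain-Python sketch of the kind of summary information show_in_notebook() reports (the helper names summarize and pearson are illustrative, not part of ADS):

```python
# Illustrative sketch of an EDA summary: row/column counts, per-feature
# means, and a pairwise correlation. ADS renders this as rich visualizations.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def summarize(dataset):
    """dataset: dict mapping column name -> list of numeric values."""
    n_rows = len(next(iter(dataset.values())))
    return {
        "n_rows": n_rows,
        "n_columns": len(dataset),
        "means": {col: mean(vals) for col, vals in dataset.items()},
    }

data = {"age": [25, 32, 47, 51], "income": [30, 42, 61, 70]}
print(summarize(data))
print(round(pearson(data["age"], data["income"]), 3))
```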
Feature Engineering
• Feature engineering is the process of transforming existing features into new ones.
• Why is this helpful? Transforming existing features into new ones can improve the quality of machine
learning models. ADS has built-in tools to simplify the process of data transformation and feature
engineering.
• ADS has the functionality to turn a data set into an ADSDataset object. Any operation that can be performed
for a Pandas dataframe can also be applied to an ADSDataset. This makes it easy to apply data
transformations.
• ADS has a built-in automatic data transformation tool that provides recommended transformations for a data set.
• ADS has built-in functions that support categorical encoding, null value handling, and imputation.
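As a concrete sketch of two transformations mentioned above, here is plain Python (no ADS; the helper names are illustrative) for mean imputation of nulls and one-hot categorical encoding:

```python
# Illustrative sketch of feature engineering: mean imputation and
# one-hot encoding, written in plain Python for clarity.
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def one_hot(values):
    """Encode a categorical column as a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

print(impute_mean([25, None, 47]))       # the None is filled with the mean, 36
print(one_hot(["red", "blue", "red"]))   # two indicator columns, blue and red
```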
Model Training
Allows for comparison between your trained ML models using standard metrics
• Evaluation is where you compare the different machine learning models you have trained with industry-
standard metrics and try to understand the trade-offs between them.
• ADS has an evaluation class that provides a collection of tools, metrics and charts to help with model
evaluation.
• The ADS evaluation class supports evaluation for binary classification, multi-class classification, and regression.
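For a concrete sense of the metrics involved, here is a plain-Python sketch of common binary-classification metrics (ADS computes these for you; the helper functions below are illustrative, not part of the ADS evaluation class):

```python
# Illustrative sketch: confusion-matrix counts and the accuracy, precision,
# and recall metrics derived from them for a binary classifier.
def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(metrics(y_true, y_pred))
```

Comparing these numbers across two trained models is exactly the kind of trade-off analysis the text describes.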
Model Interpretation and Explainability
Global
Explain the general behavior of an ML model
Model Deployment
ADS has a model deployment module, ads.model.deployment, which allows you to deploy models with OCI
Data Science’s managed resource model deployments
You can use ADS to deploy a model artifact saved in the Data Science model catalog or the URI of a directory in
the local block storage or object storage.
Model deployments integrate with the OCI Logging service. You can use it to store the access and prediction
logs from model deployments. ADS provides APIs to make interacting with the Logging service simple.
• Oracle Cloud Infrastructure concepts for Data Science: compartments, user groups (a group of users), dynamic groups, and policies.
Compartments
A logical grouping of resources that can be accessed only by certain groups that have received administrator
permission
Tip: When configuring tenancies, decide how you will organize your Data Science resources, then create
compartments for those resources through the Identity Console.
Compartments allow you to organize and control access to your cloud resources.
A compartment is a logical grouping of resources that can be accessed only by certain groups that have been
given permission by an administrator.
When configuring your tenancy, the first step is to make a plan of how you will organize your data science
resources going forward.
Once you’ve made a plan, you can create one or more compartments for Data Science resources through the Identity
Console. To do that, go to Identity > Compartments and click “Create Compartment.” Enter a name and
description, and then click Create Compartment.
See https://docs.oracle.com/en-us/iaas/Content/GSG/Concepts/settinguptenancy.htm#Setting_Up_Your_Tenancy for more details.
Creating a Compartment
Cloud Console
Individual users are grouped in OCI and granted access to Data Science resources within compartments.
Tip: When configuring groups, decide how users will access resources in the compartments.
Dynamic Groups
Dynamic groups are a special type of group that contains resources (such as data science notebook sessions, job runs,
and model deployments) that match rules that you define.
These matching rules allow group membership to change dynamically as resources that match those rules are created
or deleted. These resources act as "principal" actors and can make API calls to services according to policies that you
write for the dynamic group.
For example, using the resource principal of a Data Science Notebook Session, you could make a call to the Object
Storage API to read data from a bucket, if the dynamic group of the notebook session has a policy which enables
object storage access.
Dynamic groups have matching rules, where <compartment-ocid> is replaced by the identifier of the
compartment created for Data Science.
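For example, a matching rule that places every Data Science notebook session in a given compartment into the dynamic group could look like the following (the compartment OCID placeholder is kept as-is):

```
ALL {resource.type = 'datasciencenotebooksession', resource.compartment.id = '<compartment-ocid>'}
```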
• <group-name> - This will be filled in with the name of the user group or dynamic group
• <verb> - This will define the level of access
• <resource-type> - This will specify the type of resource or resource family to be accessed
• <compartment-name> - This will be filled in with the name of the compartment
Policy Basics
Verbs define the level of access that would be permitted to the resource/resource family.
Verbs include (from least to most permissive) inspect, read, use, and manage.
• Resource type in the policy defines which specific resource you are writing the policy for. For example, Data Science
includes resources such as data-science-models or data-science-jobs.
• You can write a policy for an individual resource type; however, to make writing policies for related resource easier,
there are aggregate resource types which contain a family of related resources. The aggregate resource type for data
science is data-science-family.
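The placeholder structure described above can be sketched in Python (the group and compartment names below are hypothetical examples; only the verb ordering and statement shape come from the text):

```python
# Illustrative sketch: assemble an OCI policy statement from the
# placeholders described in the text. Group/compartment names are examples.
VERBS = ["inspect", "read", "use", "manage"]  # least to most permissive

def policy(group_name, verb, resource_type, compartment_name):
    # Guard against typos: the verb must be one of the four access levels.
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return (f"Allow group {group_name} to {verb} "
            f"{resource_type} in compartment {compartment_name}")

print(policy("data-scientists", "manage", "data-science-family", "ds-compartment"))
```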
Required Data Science Policies
To allow data scientists to manage all Data Science resources in a specific compartment:
Allow group <user-group-name> to manage data-science-family in compartment <compartment-name>
To allow Data Science resources, such as a notebook session, in a dynamic group to manage all Data Science
resources:
Allow dynamic group <dynamic-group-name> to manage data-science-family in compartment <compartment-
name>
These are the most critical of the required Data Science policies, not just policy examples.
More Required Data Science Policies
The following policies are required to enable users access to metrics and logging for data science resources:
To use custom networking in Data Science you will need the following policies:
They have matching rules, where <compartment-ocid> is replaced by the identifier of the
compartment created for Data Science.
They define what Data Science principals, such as users and resources, have access to in OCI.
(policy definition)
They are individual users that are grouped in OCI by administrators and granted access to Data
Science resources within compartments (user groups definition)
They are a logical grouping of resources that can be accessed only by certain groups that have
received administrator permission (compartment definition)