OCI Adoption Data Strategy
Date: Feb 2024 | Revision: V1
The following is intended to outline our general product direction. It is intended for information purposes only, and
may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and
should not be relied upon in making purchasing decisions.
The development, release, timing, and pricing of any features or functionality described for Oracle’s products may
change and remains at the sole discretion of Oracle Corporation.
Overview
In today's digital age, the emergence of big data has accelerated a transformative shift in how organizations perceive,
collect, process, and leverage data. This modern big data landscape has expanded far beyond traditional structured
data. Organizations now collect data from a multitude of sources such as enterprise systems, social media, IoT
devices, and digital assets. This variety of data types, known as "big data," includes structured, semi-structured, and unstructured data. The volume and velocity of big data also bring both challenges and opportunities.
Modern enterprises are challenged to develop scalable infrastructure, advanced analytics, and data management
strategies to harness the valuable insights in these datasets. To solve these challenges, organizations are heavily
dependent on advanced technologies such as cloud computing, distributed data management, data processing and
machine learning frameworks for managing, analyzing, and deriving actionable insights from this big data. As the
data landscape continues to evolve, modern enterprises' ability to efficiently navigate, interpret, and utilize big data
will be a defining factor in determining their competitiveness and innovation. Consequently, a successful data
strategy must accommodate this variety, velocity and volume of data to ensure seamless integration and processing.
• Volume: petabytes of data, transactions, tables, and files
• Velocity: batch, stream, and real/near-real-time processing
• Variety: structured, semi-structured, and unstructured data
• Veracity: data quality, data reliability, and trustworthiness
• Value: descriptive, prescriptive, and predictive insights
These 5 dimensions help highlight the unique challenges posed by modern big data and how it differs from
traditional data processing.
• Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density,
unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a web
page or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes
of data. For others, it may be hundreds of petabytes.
A well-defined data strategy empowers organizations to harness this potential by establishing clear guidelines for
data lifecycle (collection, storage, processing, analysis, and governance). It aligns business objectives with data
initiatives, ensuring that data-driven insights drive key decisions and foster innovation. Moreover, a robust data
strategy enhances operational efficiency, streamlines processes, and enables personalized customer experiences, all
of which are essential for staying competitive in a market that demands agility and responsiveness. By treating data
as a strategic asset, enterprises can adapt to changes swiftly, mitigate risks effectively, and uncover hidden
opportunities, ultimately paving the way for long-term growth and success.
Data Strategy
Vision
Data strategy starts with the enterprise data vision. The vision outlines and provides strategic direction for an enterprise's ambitions and objectives regarding its use of data and information. It highlights the potential of data to achieve competitive advantages, improve decision making, and elevate user experiences for customers, employees, and partners. The lack of a well-defined vision can result in inconsistent data management approaches and, in some cases, a disjointed strategy, leading to a fragmented data landscape that hinders effective decision making, operational efficiency, and the realization of strategic goals.
Data Lifecycle
The core of the data strategy framework helps companies shape and build their data plan around six key stages of the
data journey. This ensures data is a top priority and is guided through each step to help achieve bigger business
goals. This method helps companies understand the specific details of the data journey needed to treat data as a
valuable asset. It also gives them the freedom to change their approach to fit the special conditions of their data
environment. The data life cycle refers to the stages through which data passes from its creation or acquisition to its
eventual use and disposal. It includes various processes and activities involved in managing and leveraging data
effectively throughout its lifecycle. While specific stages may vary depending on the context, a typical modern data
life cycle includes the following key phases. The data strategy is formulated around these 6 data lifecycle phases as
defined below.
• Data Generation & Sources: This stage involves the creation or capturing of raw data from various sources
such as sensors, devices, applications, or user interactions.
• Data Ingestion: Raw data is ingested into a system for storage and processing. This may involve data
cleansing, transformation, and normalization to ensure data consistency and quality.
• Data Storage: Processed data is stored in appropriate repositories, such as databases, data warehouses, or
data lakes. Modern architectures may involve distributed or cloud-based storage solutions.
• Data Processing and Analysis: In this stage, data is processed, analyzed, and transformed using various
tools and techniques. This phase aims to derive meaningful insights and patterns from the data.
• Data Visualization and Reporting: Insights obtained from data analysis are presented through data
visualization tools or reports, making it easier for stakeholders to understand and interpret the findings.
• Data Sharing and Collaboration: Data is shared within the organization or with external partners to facilitate
collaboration and decision-making. Security measures are crucial to protect sensitive information during this
stage.
People
The people pillar is critical for a successful data strategy within an enterprise as it enables and aligns the human component with the technological aspects. A well-defined people strategy creates a culture of data-driven decision making by promoting data literacy and skill development across the organization. It helps define roles and responsibilities and identifies and supports the data champions and leaders who are responsible for driving the organization's data strategy implementation and adoption. It also ensures effective communication between teams across technical and business stakeholders to enable cross-functional collaboration, which is essential for the success of the data strategy.
The following areas outline how an enterprise can define the people pillar to enable its data strategy:
Center of Excellence
• Create leadership positions, roles, and a dedicated organization under the Chief Data Officer (CDO) to oversee the successful implementation of the data strategy.
• Create a data and analytics center of excellence that provides data and analysis services to business units and departments. Its objective is to empower the business to get answers using data-driven solutions.
• Ensure that data strategy implementation teams have representation at the senior management level.
• Promote collaboration between technical and business stakeholders.
Roles and Responsibilities
• Define clear roles and responsibilities for data related positions like data engineers, data scientists, data
analysts and data stewards.
• Develop job descriptions that outline the required skills and qualifications for each role.
• Identify and establish teams for critical roles and responsibilities for data engineering, data science, data
governance, data quality and other data strategy related teams.
Key Performance Indicators (KPIs)
• Identify KPIs that are aligned with your data strategy's objectives and goals.
• Define specific KPI targets that are achievable and quantifiable.
• Validate identified KPIs for data quality, data utilization and analytics adoption.
Training & Education Services
• Develop training programs to improve data related technical and business skills across the organization.
• Conduct workshops and provide educational resources for employees to learn about topics such as data engineering, data science, data analysis, data governance, data security, and data privacy.
• Create a documentation portal, help desk, and support system to address employee queries related to data tools and processes.
Communication
• Outline and create a communication plan to disseminate the importance of the data strategy across the
organization.
• Share success stories, case studies and updates on data strategy implementation and achievements.
• Create clear process for stakeholders to provide feedback.
Process
The process pillar defines and provides the guidelines and operational framework for how data is collected, processed, stored, secured, governed, kept compliant, and shared in an organization. It addresses key areas such as data integration with applications, data quality maintenance, security protocols, and data operations. A well-defined process strategy is critical for the successful enablement of a data strategy for any organization. It provides clarity of responsibilities and streamlines workflows and team collaboration for a unified approach to leveraging data as a strategic asset. A strong process strategy improves data-driven decision making, minimizes data breach and data quality risks, and promotes compliance with regulatory requirements.
The following outlines how an enterprise can define the process strategy to enable its data strategy:
Data integration and quality:
• Clearly define data quality requirements and metrics for each data type, including validity, accuracy, completeness, consistency, and timeliness.
• Implement data validation rules and processes to identify and correct errors and inconsistencies in data.
• Implement data profiling and cleansing routines to maintain data integrity.
• Set up regular data quality audits and reporting mechanisms.
Security:
• Define data access controls and permissions based on roles and responsibilities.
• Implement encryption mechanisms for data at rest and during transit.
• Develop a process for monitoring and detecting unauthorized data access or breaches.
• Establish protocols for handling and reporting data breaches in compliance with regulations.
Operations:
• Create data lifecycle management procedures, including data creation, storage, retention, and disposal.
• Implement data governance frameworks to manage data ownership, stewardship, and accountability.
• Define processes for data documentation, metadata management, and cataloging.
• Establish change management processes to handle updates and modifications to data-related systems.
Technology
The Technology pillar is integral to the success of an enterprise data strategy, as it bridges the gap between the human and operational aspects. Policies and principles guide ethical data use, aligning with organizational goals, while robust standards ensure consistency and interoperability. The technology platform addresses scalability and compatibility and drives seamless integration and innovation adoption. Governance within the strategy enforces data quality, security,
and compliance, fostering transparency and accountability. The emphasis on organizational structure ensures
collaboration, breaking down silos between technical and business teams. Technology strategy not only addresses
immediate data needs but also incorporates emerging technologies, preparing the organization for future
advancements. This integrated approach ensures that technology aligns harmoniously with people and processes,
creating a culture conducive to data-driven decision-making and organizational success.
Key principles of the Technology pillar: The following components outline its key aspects.
Policies and Principles:
• Establish clear policies that outline how data should be handled, stored, and shared within the organization.
• Define principles that guide decision-making related to data, ensuring alignment with organizational goals
and compliance with regulations.
• Develop policies for data quality, security, and privacy to maintain the integrity and confidentiality of
information.
Data Architecture Standards:
• Choose appropriate technologies and platforms that support the organization's data strategy goals.
• Assess the scalability, flexibility, and compatibility of technology solutions to accommodate future data
growth and evolving business needs.
• Invest in data management tools, databases, and analytics platforms that align with the organization's data
strategy objectives.
Governance:
• Implement a robust data governance framework that includes roles, responsibilities, and processes for data
stewardship and management.
• Establish data ownership and accountability to ensure that data-related decisions are made by the
appropriate stakeholders.
• Regularly audit and monitor data processes to enforce compliance with policies and standards, and to
identify areas for improvement.
Organizational Structure:
• Design an organizational structure that supports the data strategy, including dedicated teams responsible for
data management, analytics, and governance.
• Ensure collaboration between IT and business units to bridge the gap between technical and business
requirements.
• Foster a data-driven culture by promoting awareness and understanding of the importance of data
throughout the organization.
Data Governance
Data governance in modern data landscape refers to the systematic and comprehensive management of data assets
within an organization. With the increasing volume, variety and velocity of data, effective data governance becomes
crucial to ensure data quality, security, compliance, and usability. It involves establishing policies, processes, and
controls to manage data throughout its lifecycle from creation and acquisition to archival or deletion. Data
governance aims to optimize the value of data while mitigating risks and ensuring accountability across the
organization.
• Data Ownership: Clearly define and assign responsibilities for data ownership, ensuring that individuals or
teams are accountable for the accuracy and integrity of specific data sets.
• Data Quality Management: Implement processes to maintain high data quality standards, including data
profiling, cleansing and validation to ensure reliable and trustworthy information.
• Data Security and Privacy: Establish protocols for safeguarding sensitive data, complying with relevant
regulations, and protecting against unauthorized access or breaches.
• Metadata Management: Create and maintain comprehensive metadata to provide context and understanding of data and to support data discovery, analysis, and usage.
Data Security
Data security in the modern landscape is the comprehensive protection of digital information from unauthorized
access, disclosure, alteration, and destruction. As organizations increasingly rely on digital assets, the importance of
data security has grown exponentially. It involves implementing a robust set of technologies, processes, and policies
to safeguard sensitive information, maintaining confidentiality, integrity, and availability. Data security measures are
essential not only for regulatory compliance but also for preserving trust with customers, partners, and stakeholders.
This includes encryption, access controls, monitoring, and regular audits to identify and address potential
vulnerabilities and threats.
Access Control: Implement strict access controls to ensure that only authorized individuals or systems have the
appropriate permissions to access and manipulate specific data.
Data Encryption: Apply strong encryption algorithms to protect data both in transit and at rest so that unauthorized actors cannot decrypt sensitive information even if they gain access.
Regular Audits and Monitoring: Conduct regular security audits and monitoring to detect and respond promptly to
any suspicious activities, breaches, or vulnerabilities in the data infrastructure.
Data Classification: Classify data based on its sensitivity and importance, and apply security measures accordingly,
focusing resources on protecting the most critical and confidential information.
Employee Training and Awareness: Educate employees about security best practices, including password hygiene,
social engineering awareness, and the responsible handling of sensitive data to reduce the risk of human-related
security breaches.
Incident Response Plan: Develop and regularly test an incident response plan to ensure a swift and effective
response in the event of a data breach to minimize potential damages and downtime.
Regular Updates and Patch Management: Keep software, operating systems, and security tools up-to-date to
address known vulnerabilities and ensure that the organization's defense mechanisms remain effective against
emerging threats.
• Informed Decision Making: A robust data strategy equips stakeholders at all levels with accurate, timely, and relevant information. It promotes a culture where decisions are driven by data insights rather than intuition. This leads to more informed, strategic, and effective decision-making across the organization.
Modern cloud data architectures such as Data Lake and Data Mesh have evolved to address the complexities of
managing, storing, and processing vast amounts of data in distributed environments. Each architecture represents
distinct approaches to organizing and utilizing data.
Data Lake architecture involves a centralized repository where raw and unstructured data from various sources is stored in its native format, offering flexibility and scalability for diverse analytics. This architecture focuses on a central location for data storage, data processing, and data management to enable better governance, security, and control over data.
Data Mesh adopts a decentralized architecture by distributing data ownership and governance across an organization. Data Mesh treats data as a product and assigns different data domains to subject matter experts. This architectural approach helps reduce data sprawl and improves accountability by decentralizing ownership.
A data lake's multi-layered centralized architecture represents a framework for integrating, storing, processing, managing, and analyzing huge volumes of data. Its structure comprises several layers: the foundational storage layer, storing raw, unstructured, and structured data in its native format; the ingestion layer, responsible for data collection from varied sources and its seamless integration into the lake; a metadata layer for cataloging and organizing data to ensure its discoverability and usability; a processing and compute layer that utilizes distributed computing frameworks for analytics, machine learning, and data processing; a security layer ensuring robust access controls and encryption mechanisms for data protection; and finally, a governance layer focusing on data quality, compliance, and lifecycle management. Each layer plays a crucial role in enabling flexibility, scalability, and efficient utilization of the
data lake for deriving actionable insights and business value.
• Scalability: Data lakes are designed to scale horizontally to petabytes or even exabytes of data. This
scalability ensures that as data volume grows, the data lake can expand to store and process the increased
volume efficiently.
• Flexibility: They can store data in various formats, including structured data (like databases and CSV files),
semi-structured data (like JSON and XML files), and unstructured data (like text and multimedia files). This
flexibility allows organizations to store data as-is, without needing to convert it into a specific format before
storage.
• Cost-effectiveness: Data lakes are built on technologies such as object storage that provide cost-effective storage solutions. The ability to run on commodity hardware or low-cost cloud storage services helps
organizations manage large volumes of data economically.
• Accessibility: Data lakes are designed to make data accessible to a wide range of users and applications by
supporting various data access patterns (batch, real-time, streaming) and providing interfaces for different
types of analytics and data science tools. This ensures that users can retrieve and analyze data efficiently.
• Security and Governance: It is critical to ensure the security and governance of data in a data lake. This
includes implementing access controls, data encryption and regular auditing to protect sensitive information.
Data governance policies are also essential to manage data quality, lineage, and lifecycle, ensuring that the
data lake does not become a "data swamp."
• Integration and Processing Capabilities: Data lakes often include built-in or easily integrable tools for data
ingestion, processing, and analysis. This can include support for ETL (Extract, Transform, Load) processes,
real-time data processing, and machine learning model training and inference.
• Data Catalog and Search: To manage the massive amounts of data within a lake, data cataloging and search
tools are essential. Metadata management tools help catalog the data, making it easier for users to find and
understand the data they need.
• Multi-tenancy and Resource Management: Data lakes support multi-tenancy, allowing multiple
departments or teams within an organization to share the infrastructure while maintaining isolation of their
data and workloads.
• Complexity in Data Management: Data lakes can become complex due to the large volume and variety of data they accommodate. In the absence of proper data organization and governance, it may become challenging to manage, catalog, and ensure the quality of the stored data.
• Data Quality and Consistency: Ensuring data quality and consistency is a significant challenge in a data lake since it supports structured, semi-structured, and unstructured raw data. Ingesting data without proper validation and cleansing can quickly erode the reliability of downstream analytics.
• Unified Storage: Data lakes enable the storage of structured and unstructured data in a single, unified
environment.
• Scalability: They can scale horizontally, accommodating growing volumes of data without compromising
performance.
• Data Variety: Data lakes can handle diverse data types, including text, images, videos, and more.
• Advanced Analytics: Enterprises can leverage advanced analytics, machine learning, and AI to extract
meaningful patterns and insights.
• Cost Efficiency: Data lakes offer cost-effective storage and processing, as organizations pay only for the
resources they use.
• Real-time Processing: Data lakes support real-time data processing, facilitating timely decision-making.
• Data Governance: With proper governance and security measures, data lakes ensure data quality, privacy,
and compliance.
Data mesh is a modern, decentralized approach to data architecture and organizational design emerging in response
to the complexities of handling large-scale distributed data across various domains in an organization. It shifts from
the traditional centralized data management systems like data warehouses and lakes to a more distributed model.
In a data mesh framework, data is treated as a product, with ownership and responsibility for data quality, data governance, and data lifecycle management distributed across different cross-functional teams known as 'data product teams.' These teams are typically aligned with specific business domains and are responsible for handling the
data relevant to their area of expertise.
• Decentralized Data Architecture: Data is managed and controlled by the business domain. For instance, the
sales domain would manage sales-related data. This decentralization promotes a deeper understanding and
better quality of the data, as domain experts are directly involved in its management. It also enables faster
decision making and innovation within domains as they don't have to rely on a centralized team for data-
related requests.
• Data as a Product: This principle highlights treating data as autonomous, self-serve products with defined ownership and governance. This approach encourages decentralized data management and promotes a scalable and agile data ecosystem. It includes defining clear data owners who are responsible for data quality and for ensuring that the data meets the needs of its consumers. A data product should be user-centric: designed to be easily discoverable, understandable, and usable by different stakeholders.
• Self-Serve Data Infrastructure as a Platform: The creation of a standardized self-serve data platform
normalizes data access and allows domain teams to independently publish, discover, and utilize data
products. This self-serve platform architecture provides essential tools and services, removing the complexity
of data management and enabling domain teams to focus on deriving insights and value from their data. It
supports the organization's agility by allowing domain experts to handle data tasks without expert technical
knowledge in data engineering.
• Technical Complexity: Implementing a data mesh involves complex technical challenges. It requires
establishing standardized data infrastructure, data pipelines, and APIs across diverse domain teams. Ensuring interoperability, scalability, and security across these decentralized systems can be technically demanding.
• Cultural Shift: Organizations must move from centralized data teams to a distributed model where domain
teams are responsible for their data products. This change involves new roles, responsibilities, and a mindset
shift towards data as a product, which can be difficult to achieve. This shift to data mesh architecture requires
a significant cultural and organizational change.
• Data Ownership and Governance: Establishing clear data ownership and governance in a decentralized
environment can be challenging. Each domain team becomes responsible for the lifecycle, quality, and
security of its data products, requiring robust governance frameworks to ensure consistency, compliance,
and interoperability across domains.
• Data Discoverability and Accessibility: Ensuring that data products are easily discoverable, accessible, and
usable across the organization is a key challenge. Implementing effective metadata management, data
catalogs, and search tools is crucial to avoid silos and ensure that data can be shared and reused across
domains.
• Data Quality and Consistency: Maintaining high data quality and consistency across decentralized data
products is challenging. Each domain team must implement data quality measures, which can vary in
complexity and effectiveness, potentially leading to inconsistencies in data quality standards across the
organization.
• Inter-domain Collaboration: Data mesh architecture requires effective collaboration and coordination
across domain teams. This includes establishing cross-domain standards, shared best practices, and
mechanisms for feedback and continuous improvement.
• Cost and Resource Allocation: Transitioning to a data mesh architecture can incur significant upfront costs
and ongoing operational expenses. Allocating resources effectively, ensuring cost transparency, and
demonstrating the return on investment can be challenging.
• Skills and Training: The data mesh model requires a range of new skills and competencies, including
domain-driven design, data product management, and decentralized data architecture. Building these
capabilities within domain teams and providing ongoing training can be resource intensive.
Scalable Architecture: Data Mesh is designed to scale horizontally, which makes it well suited for organizations with growing and diverse data needs. It allows for the creation of independent, self-serve data products, enabling teams to
scale their data capabilities in alignment with their specific requirements.
Improved Data Discoverability: Data Mesh promotes the creation of discoverable, self-serve data products. By implementing a data product catalog and metadata infrastructure, organizations can enhance data discoverability,
making it easier for teams to find, understand, and use the available data assets.
Enhanced Data Quality and Consistency: Because of the decentralized nature of data ownership, individual domains are responsible for the quality of their data. This localized focus can lead to improved data quality within specific business
contexts. Data Mesh also encourages the establishment of domain-oriented data quality metrics and practices.
Cross-Functional Collaboration: Data Mesh promotes collaboration across different business units and domains by
encouraging the formation of cross-functional, domain-oriented teams. This collaborative ecosystem fosters better communication, knowledge sharing, and a holistic understanding of the organization's data landscape.
Flexibility and Adaptability: Data Mesh architecture is designed to be flexible and adaptable to changes in business
requirements. As new domains emerge or existing ones evolve, the decentralized structure allows organizations to
quickly respond and iterate, promoting a more agile and resilient data ecosystem.
• SaaS and On-Premises Integration: Quickly connect to thousands of SaaS or on-premises applications seamlessly through 50+ native app adapters or technology adapters, with support for service orchestration and rich integration patterns for real-time and batch processing.
• Process Automation: Bring agility to your business with an easy, visual, low-code platform that simplifies day-to-day tasks by getting employees, customers, and partners the services they need to work anywhere, anytime, and on any device, with support for dynamic case management.
• Visual Application Design: Rapidly create and host engaging business applications with a visual
development environment right from the comfort of your browser.
• Integration Insight: The Service gives you the information you need -- out of the box, with powerful
dashboards that require no coding, configuration, or modification. Get up and running fast with a solution
that delivers deep insight into your business.
• Stream Analytics: Stream processing for anomaly detection, reacting to Fast Data. Super-scalable with
Apache Spark and Kafka.
Application integration use cases
• Connect any ERP, HCM, or CX app faster: Limit training and accelerate automation with pre-built adapters,
integrations, and templates. Easily embed real-time dashboards and extensions in select Oracle SaaS
applications. Only Oracle Integration gives you event-based triggers in Oracle Cloud ERP, HCM, and CX with
connectivity for any SaaS, custom, or on-premises applications. Eliminate synchronization errors and
delays that can come with polling or other traditional methods, increasing the reliability and resilience of
application interactions.
• Visually orchestrate workflows: Templates make it easy to automate end-to-end ERP, HCM, and CX
processes such as request-to-receipt, recruit-to-hire, and lead to invoice. Quickly orchestrate human, digital
assistant, and robotic process automation (RPA) activities across applications, no matter the implementation
details. Oracle provides solutions to modernize end-to-end processes across applications for many industries
including Financial Services, Manufacturing, Retail, Utilities, and Pet Healthcare.
Simplifying the Complexity of Data Integration: Oracle Cloud Infrastructure Data Integration is an Oracle-managed service that provides extract, transform, and load (ETL) capabilities to target AI and analytics projects on Oracle Cloud Infrastructure. ETL developers can load a data mart in minutes without coding, quickly discover and connect to popular databases and applications, and design and maintain complex ETL data flows effortlessly to load a data warehouse. Data engineers can easily automate Apache Spark ETL data flows, prepare datasets quickly for data science projects, and stand up new data lake services using cloud-native and hybrid connectivity. A minimal SDK sketch follows the feature list below.
• Seamless Data Integration: Oracle Cloud Infrastructure Data Integration enables the seamless movement of
data between various sources and targets, allowing organizations to efficiently transfer and integrate data
across their infrastructure.
• Data integration for big data, data lakes, and data science: Ingest data faster and more reliably into data
lakes for data science and analytics. Create high-quality models more quickly.
• Data integration for data marts and data warehousing: Load and transform transactional data at scale.
Create an organized view from large data volumes.
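As a concrete illustration of programmatic access, the following sketch uses the OCI Python SDK to enumerate Data Integration workspaces in a compartment. The compartment OCID is a hypothetical placeholder, and the exact response shape may vary by SDK version.

```python
# Minimal sketch: list OCI Data Integration workspaces with the OCI Python SDK.
# Assumes a valid ~/.oci/config profile; the compartment OCID below is a placeholder.
import oci

config = oci.config.from_file()
di_client = oci.data_integration.DataIntegrationClient(config)

compartment_id = "ocid1.compartment.oc1..example"  # hypothetical OCID
response = di_client.list_workspaces(compartment_id=compartment_id)

# Depending on SDK version the payload may be a plain list or a collection with .items.
workspaces = getattr(response.data, "items", response.data)
for ws in workspaces:
    print(ws.display_name, ws.lifecycle_state)
```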
Enterprise data is typically distributed across the enterprise in heterogeneous databases. To share data between different data sources, organizations need a means to move data from one system to another in real time and with near-zero downtime. Oracle GoldenGate is Oracle's solution to replicate and integrate data: it can load, distribute, and filter transactions within the enterprise in real time and enable migrations between different databases with near-zero downtime.
Oracle GoldenGate meets almost any data movement requirements you might have. Some of the most common use
cases are described in this section. You can use Oracle GoldenGate to meet the following business requirements:
Business Continuity and High Availability: Business Continuity is the ability of an enterprise to provide its functions
and services without any lapse in its operations. High Availability is the highest possible level of fault tolerance. To
achieve business continuity, systems are designed with multiple servers, multiple storage, and multiple data centers
to provide high enough availability to support the true continuity of the business. To establish and maintain such an
environment, data needs to be moved between these multiple servers and data centers, which is easily done using
Oracle GoldenGate.
Initial Load and Database Migration: Initial load is a process of extracting data records from a source database and
loading those records onto a target database. Initial load is a data migration process that is performed only once.
Oracle GoldenGate allows you to perform initial load data migrations without taking your systems offline.
Data Integration: Data integration involves combining data from several disparate sources, which are stored using
various technologies, and provide a unified view of the data. Oracle GoldenGate provides real-time data integration.
• Elastic and scalable platform: Data engineers can easily set up and operate big data pipelines. Oracle
handles all infrastructure and platform management for event streaming, including provisioning, scaling, and
security patching.
• Deploy streaming apps at scale: With the help of consumer groups, Streaming can provide state
management for thousands of consumers. This helps developers easily build applications at scale.
• Oracle Cloud Infrastructure integrations: Native integrations with Oracle Cloud Infrastructure services
include Object Storage for long-term storage, Monitoring for observability, Resource Manager for deploying
at scale, and Tagging for easier cost tracking/account management.
• Kafka Connect Harness: The Kafka Connect Harness provides out-of-the-box integrations with hundreds of
data sources and sinks, including GoldenGate, Integration Cloud, Database, and compatible third-party
offerings.
• High throughput message bus: Streaming service is ideal for microservices and other applications that require high-throughput, low-latency data movement and strict ordering guarantees; a minimal publishing sketch follows this list.
• Real-time analytics engine: Feed data at scale from websites or mobile apps to a data warehouse,
monitoring system, or analytics engine. Real-time actions help ensure that developers can take action before
data goes stale.
• Integration with Oracle Database and SaaS applications: Use Streaming to ingest application and
infrastructure logs from Oracle SaaS applications, such as E-Business Suite, PeopleSoft, and Change Data
Capture (CDC) logs from Oracle Database. Leverage Streaming’s Kafka connectors for Oracle Integration
Cloud, then transport them to downstream systems, such as Object Storage, for long-term retention.
• Data-in-motion analytics on streaming data: OCI Streaming is directly integrated with OCI GoldenGate
Stream Analytics, OCI GoldenGate, and Oracle GoldenGate for ingesting event-driven, streaming Kafka
messages and publishing enriched and transformed messages. OCI GoldenGate Stream Analytics is a complete application that models, processes, analyzes, and acts on data in real time, whether it is flowing from business transactions, loading data warehouses, or otherwise in motion. Users easily build no-code data pipelines.
Processing discovers outliers and anomalies, applies insight from ML models, and then alerts or
automatically takes the next best action.
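To make the message bus capability concrete, the sketch below publishes a couple of JSON events to an OCI stream with the OCI Python SDK; the stream OCID and messages endpoint are hypothetical placeholders.

```python
# Minimal sketch: publish events to an OCI stream using the OCI Python SDK.
# The stream OCID and messages endpoint below are hypothetical placeholders.
import base64
import oci

config = oci.config.from_file()
stream_ocid = "ocid1.stream.oc1..example"  # placeholder
messages_endpoint = "https://cell-1.streaming.us-ashburn-1.oci.oraclecloud.com"  # placeholder

client = oci.streaming.StreamClient(config, service_endpoint=messages_endpoint)

def b64(text: str) -> str:
    """Streaming requires keys and values to be base64-encoded strings."""
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")

entries = [
    oci.streaming.models.PutMessagesDetailsEntry(key=b64("order-1001"), value=b64('{"status": "CREATED"}')),
    oci.streaming.models.PutMessagesDetailsEntry(key=b64("order-1002"), value=b64('{"status": "SHIPPED"}')),
]

result = client.put_messages(
    stream_ocid, oci.streaming.models.PutMessagesDetails(messages=entries)
).data
print("Failed messages:", result.failures)
```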
OCI Object Storage is a fundamental component for establishing robust data lakes in the Oracle Cloud Infrastructure (OCI) environment. Built for scalability, cost-effectiveness, and seamless integration, OCI Object Storage offers a comprehensive solution for storing and managing diverse datasets within data lakes.
Key features of OCI Object Storage for data lake storage include:
• Redundancy across fault domains: With OCI Object Storage, stored objects are automatically replicated
across fault domains or across availability domains. Customers can combine replication with lifecycle
management policies to automatically populate, archive, and delete objects.
• Data integrity monitoring: OCI Object Storage automatically and actively monitors the integrity of data
using checksums. If corrupted data is detected, it’s flagged for remedy without human intervention.
• Automatic self-healing: When a data integrity issue is identified, corrupt data is automatically ‘healed’ from
redundant copies. Any loss of data redundancy is managed automatically with creation of a new copy of the
data. With Object Storage, there’s no need for concern about accessing down-level data. Object Storage
always serves the most recent copy of data written to the system.
• Backed by 99.9% SLA: Customers rely on Object Storage, which is backed by a 99.9% Availability SLA.
Oracle also offers manageability and performance SLAs for many cloud services that are not available from
other cloud platform vendors. A complete listing of availability, manageability, and performance SLAs for
Oracle Cloud services is available here.
• Harness business value: Object Storage is increasingly used as a data lake, where businesses store their
digital assets for processing by analytical frameworks and pipelines in order to harness business insight.
• Integrated data protection: Enterprises store data and backups on OCI Object Storage, which runs on
redundant hardware for built-in durability. Data integrity is actively monitored, with any corrupt data detected
and healed by automatically recreating a copy of the data.
• End-to-end visibility: OCI Object Storage provides a dedicated (non-shared) storage ‘namespace’ or
container unique to each customer for all stored buckets and objects. This encapsulation provides end-to-
end visibility and reduces the risk of exposed buckets. Customers can define access to meet exact
organizational requirements and avoid the open bucket vulnerabilities associated with AWS S3’s shared
global namespace.
• Long-term, low-cost storage: For longer-term data storage needs like compliance and audit mandates and
log data, OCI Archive Storage uses the same APIs as Object Storage for easy setup and integration but at
one-tenth the cost. Data is monitored for integrity, automatically healed, and encrypted at rest.
• Encryption by default: All Object Storage data at rest is encrypted by default using 256-bit AES encryption.
By default, the Object Storage service manages encryption keys. Alternatively, customers can supply and manage their own encryption keys using Oracle Cloud Infrastructure (OCI) Vault.
• Continuous threat assessment: Oracle Cloud Guard continuously monitors data to detect anomalous
events, then automatically intervenes when it detects suspect user behavior. For example, machine learning-powered security services can revoke user permissions when suspicious patterns are detected.
• Greater control reduces risk: With OCI Object Storage, managerial controls provide complete control over
the tenancy to prevent common vulnerabilities that can lead to data leaks. Many serious data leaks have
taken place due to open Amazon S3 buckets, which publicly exposed sensitive information, including
usernames and passwords, medical records, and credit reports.
• Oracle Cloud Infrastructure Identity and Access Management: Using easy-to-define policies organized by
logical groups of users and resources, OCI Identity and Access Management controls not only who has access
to OCI resources but also which ones and the access type. Customers can manage identities and grant access
using existing organizational hierarchies and federated directory services, including Microsoft, Okta, and
other SAML directory providers.
Oracle Object Storage use cases
• Scalable Data Repository: OCI Object Storage is an ideal solution for serving as a scalable and centralized
repository for storing vast amounts of structured and unstructured data in a data lake. It accommodates the
growing volume of data generated by various sources, providing a reliable and scalable foundation for data lake storage; a minimal upload sketch follows.
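The following sketch shows how raw files could be landed in a data lake bucket with the OCI Python SDK; the bucket and object names are hypothetical placeholders.

```python
# Minimal sketch: land a raw file into a data lake bucket on OCI Object Storage.
# Bucket and object names below are hypothetical placeholders.
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)

namespace = object_storage.get_namespace().data        # tenancy namespace
bucket_name = "raw-data-lake"                           # placeholder bucket
object_name = "iot/2024/02/sensor-readings.json"        # placeholder object path

# Upload the raw file as-is; the data lake keeps data in its native format.
with open("sensor-readings.json", "rb") as f:
    object_storage.put_object(namespace, bucket_name, object_name, f)

# Verify the upload by listing objects under the prefix.
listing = object_storage.list_objects(namespace, bucket_name, prefix="iot/2024/02/").data
print([obj.name for obj in listing.objects])
```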
Oracle Autonomous Database provides an easy-to-use, fully autonomous database that scales elastically and delivers
fast query performance. As a service, Autonomous Database does not require database administration.
With Autonomous Database you do not need to configure or manage any hardware or install any software.
Autonomous Database handles provisioning the database, backing up the database, patching and upgrading the
database, and growing or shrinking the database. Autonomous Database is a completely elastic service.
At any time you can scale, increase or decrease, either the compute or the storage capacity. When you make resource
changes for your Autonomous Database instance, the resources automatically shrink or grow without requiring any
downtime or service interruptions.
Autonomous Database is built upon Oracle Database, so that the applications and tools that support Oracle Database
also support Autonomous Database. These tools and applications connect to Autonomous Database using standard
SQL*Net connections. The tools and applications can either be in your data center or in a public cloud. Oracle
Analytics Cloud and other Oracle Cloud services provide support for Autonomous Database connections.
Autonomous Database provides the foundation for a data lakehouse—a modern, open architecture that enables you
to store, analyze, and understand all your data. The data lakehouse combines the power and richness of data
warehouses with the breadth, flexibility, and low cost of popular open source data lake technologies. Access your data
lake through Autonomous Database using the world's most powerful and open SQL processing engine.
Data lakes are a key part of current data management architectures with data stored across object store offerings
from Oracle, Amazon, Azure, Google, and other vendors.
• Data processing engine for ETL: This allows you to reduce the data warehousing workload.
• Storing data that may not be appropriate for a data warehouse: This includes log files, sensor data, IoT
data, and so on. These source data tend to be voluminous with low information density. Storing this data in
an object store might be more appropriate than in a data warehouse, while the information derived from the
data is ideal for SQL analytics.
Autonomous Database supports integrating with data lakes not just on Oracle Cloud Infrastructure, but also on
Amazon, Azure, Google, and more. You have the option of loading data into the database or querying the data
directly in the source object store. Both approaches use the same tooling and APIs to access the data. Loading data
into Autonomous Database will typically offer significant query performance gains when compared to querying object
storage directly. However, querying the object store directly avoids the need to load data and allows for an agile
approach to extending analytics to new data sources. Once those new sources are deemed to have proven value, you
have the option to load the data into Autonomous Database.
Autonomous Database supports multiple cloud services and object stores, including Oracle Cloud Infrastructure,
Azure, AWS, Google, and others. The first step in accessing these stores is to ensure that security policies are in place.
For example, you must specify authorization rules to read and/or write files in a bucket on object storage. Each cloud
has its own process for specifying role based access control.
After you have integrated Autonomous Database with the data lake, you can use the full breadth of Oracle SQL for
querying data across both the database and object storage. The location of data is completely transparent to the
application. The application simply connects to Autonomous Database and then uses all of the Oracle SQL query
language to query across your data sets.
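As a hedged illustration of this pattern, the sketch below uses the python-oracledb driver to define an external table over object storage with the DBMS_CLOUD package and then query it with standard SQL. The connection alias, credential name, bucket URI, and column list are hypothetical placeholders, and the sketch assumes a cloud credential has already been created in the database.

```python
# Minimal sketch: query object storage data through Autonomous Database from Python.
# The DSN, credential name, and object storage URI are hypothetical placeholders.
import oracledb

conn = oracledb.connect(
    user="analytics_user",
    password="********",
    dsn="myadb_low",  # placeholder TNS alias from the wallet / tnsnames.ora
)
cur = conn.cursor()

# Expose CSV files in object storage as an external table via DBMS_CLOUD
# (assumes a cloud credential named OBJ_STORE_CRED already exists).
cur.execute("""
    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name      => 'SALES_EXT',
        credential_name => 'OBJ_STORE_CRED',
        file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/raw-data-lake/o/sales/*.csv',
        format          => '{"type":"csv", "skipheaders":"1"}',
        column_list     => 'region VARCHAR2(64), amount NUMBER'
      );
    END;""")

# Query the external table alongside in-database tables with standard Oracle SQL.
for region, total in cur.execute("SELECT region, SUM(amount) FROM sales_ext GROUP BY region"):
    print(region, total)

conn.close()
```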
Integrating the various types of data allows business analysts to apply Autonomous Database's built-in analytics
across all data and you do not need to deploy specialized analytic engines. Using Autonomous Database set up as a
data lakehouse eliminates costly data replication, security challenges and administration overhead. Most importantly,
it allows cross-domain analytics.
• Managed infrastructure: OCI Data Flow handles infrastructure provisioning, network setup, and teardown
when Spark jobs are complete. Storage and security are also managed, which means less work is required for
creating and managing Spark applications for big data analysis.
• Easier cluster management: With OCI Data Flow, there are no clusters to install, patch, or upgrade, which
saves time and operational costs for projects.
• Simplified capacity planning: OCI Data Flow runs each Spark job in private dedicated resources, eliminating
the need for upfront capacity planning.
• Advanced streaming support capabilities: Spark Streaming with zero management, automatic fault-
tolerance, and automatic patching.
• Enable continuous processing: With Spark Streaming support, you gain capabilities for continuous retrieval
and continuous availability of processed data. OCI Data Flow handles the heavy lifting of stream processing
with Spark, along with the ability to perform machine learning on streaming data using MLLib. OCI Data Flow
supports Oracle Cloud Infrastructure (OCI) Object Storage and any Kafka-compatible streaming source,
including Oracle Cloud Infrastructure (OCI) Streaming as data sources and sinks.
• Automatic fault tolerance: Spark handles late-arriving data due to outages and can catch up backlogged
data over time with watermarking—a Spark feature that maintains, stores, and then aggregates late data—
without needing to manually restart the job. OCI Data Flow automatically restarts your application when
possible and your application can simply continue from the last checkpoint.
• Cloud native authentication: OCI Data Flow streaming applications can use cloud native authentication via
resource principals so applications can run longer than 24 hours.
• Cloud native security and governance: Leverage unmatched security from Oracle Cloud Infrastructure.
Authentication, isolation, and all other critical points are addressed. Protect business-critical data with the
highest levels of security.
OCI Data Flow key benefits
Accelerate workflows with NVIDIA RAPIDS: NVIDIA RAPIDS Accelerator for Apache Spark in OCI Data Flow is
supported to help accelerate data science, machine learning, and AI workflows.
ETL offload: Data Flow manages ETL offload by overseeing Spark jobs, optimizing cost, and freeing up capacity; a minimal PySpark sketch follows below.
Active archive: Data Flow's output management capabilities optimize the ability to query data using Spark.
Unpredictable workloads: Resources can be automatically shifted to handle unpredictable jobs and lower costs. A
dashboard provides a view of usage and budget for future planning purposes.
Machine learning model training: Spark and machine learning developers can use Spark’s machine learning library
and run models more efficiently using Data Flow.
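The following is a minimal PySpark sketch of the kind of batch ETL application that could be submitted as an OCI Data Flow run; the oci:// bucket paths are hypothetical placeholders, and Data Flow supplies the Spark runtime and object storage connectivity at execution time.

```python
# Minimal sketch: a batch ETL job suitable for submission as an OCI Data Flow application.
# The oci://bucket@namespace/ paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-etl").getOrCreate()

# Read raw JSON landed in the data lake bucket.
raw = spark.read.json("oci://raw-data-lake@mytenancy/iot/2024/02/")

# Light transformation: drop bad records and aggregate readings per device.
curated = (
    raw.filter(F.col("reading").isNotNull())
       .groupBy("device_id")
       .agg(F.avg("reading").alias("avg_reading"),
            F.count("*").alias("samples"))
)

# Write curated results to a separate curated zone as Parquet.
curated.write.mode("overwrite").parquet("oci://curated-data-lake@mytenancy/iot/daily_summary/")

spark.stop()
```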
• Data preparation & enrichment: Use self-service data preparation to ingest, profile, repair, and extend
datasets, local or remote, greatly saving time and reducing errors. Data quality insights provides a quick view
of data to identify anomalies and help with corrections. The custom reference knowledge capability enables
Oracle Analytics to identify more business-specific information and make relevant data enrichment
recommendations. Build visual dataflows to transform, merge, and enrich data and save results in Oracle
Analytics storage, a connected relational database (e.g. Snowflake or MySQL), or Oracle Essbase.
• Machine learning: With the volume, variety, and sources of data constantly growing, machine learning (ML)
helps users discover unseen patterns or insights from data. ML built into Oracle Analytics removes human
bias and enables users to easily interpret possible outcomes and opportunities. Integrate OCI AI Services for
use directly in analytics projects. Everyone—from clickers to coders—can use embedded ML to build custom,
business-specific models for better decision-making. Business users do not need special technical or
programming skills to use ML. In addition, data scientists, engineers, and developers can accelerate model
building, training, and publishing by using the Oracle Autonomous Database environment as a high
performance computing platform with your choice of language, including Python, R, and SQL.
• Open Data source connectivity: Unify data across the organization and from multiple data sources for a
complete and consistent view. Oracle Analytics offers more than 35 out-of-the-box native data connection
choices, including JDBC (Java Database Connectivity). Securely create, manage, and share data connections
with individuals, teams, or the entire organization. Access data wherever it is located: public cloud, private
cloud, on-premises, data lakes, databases, or personal datasets, such as spreadsheets or text-based extracts.
• Data Visualization: Visually explore data to create and share compelling stories using Oracle Analytics.
Discover the signals in data that can turn complex relationships into engaging, meaningful, and easy-to-
understand communications. Accelerate the data analytics process and make decisions with actionable
information. A code-free, drag-and-drop interface enables anyone in the organization to build interactive
data visualizations without specialized skills.
• Enterprise data modeling: Gain trusted and consistent information across the enterprise with a business
representation of data using a shared semantic model, without compromising governance. Users access data
through nontechnical business terms, predefined hierarchies, consistent calculations, and metrics. Create seamless views across data sources and visually explore them using native queries that deliver high performance.
Easily configure the balance of live and cached connections to ensure high-performance data access.
Support multiple data visualization tools, such as Microsoft Power BI, and retain a consistent and trusted view
of enterprise metrics.
Unlock the power of generative AI models equipped with advanced language comprehension for building the next
generation of enterprise applications. Oracle Cloud Infrastructure (OCI) Generative AI is a fully managed service
available via API to seamlessly integrate these versatile language models into a wide range of use cases, including
writing assistance, summarization, and chat.
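As an illustration, the sketch below calls the OCI Generative AI chat API for a summarization prompt. The compartment OCID, service endpoint, and model name are hypothetical placeholders, and the class names follow the OCI Python SDK's generative_ai_inference module, which may vary by SDK version.

```python
# Minimal sketch: call the OCI Generative AI chat API for summarization.
# Compartment OCID, endpoint, and model name are hypothetical placeholders;
# class names follow the OCI Python SDK's generative_ai_inference module.
import oci

config = oci.config.from_file()
endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"  # placeholder region
client = oci.generative_ai_inference.GenerativeAiInferenceClient(config, service_endpoint=endpoint)

chat_request = oci.generative_ai_inference.models.CohereChatRequest(
    message="Summarize the key pillars of an enterprise data strategy in three bullets.",
    max_tokens=300,
    temperature=0.3,
)

chat_details = oci.generative_ai_inference.models.ChatDetails(
    compartment_id="ocid1.compartment.oc1..example",  # placeholder
    serving_mode=oci.generative_ai_inference.models.OnDemandServingMode(
        model_id="cohere.command-r-16k"),             # placeholder model name
    chat_request=chat_request,
)

response = client.chat(chat_details)
print(response.data.chat_response.text)
```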
The OCI Generative AI Agents service provides an agent type that combines the power of large language models
(LLMs) and retrieval-augmented generation (RAG) with enterprise data, making it possible for users to easily query
diverse enterprise data sources. Users can access and understand up-to-date information through a chat interface
and in the future direct the agent to take actions based on findings.
OCI Language is a cloud-based AI service for performing sophisticated text analysis at scale. Use this service to build
intelligent applications by leveraging REST APIs and SDKs to process unstructured text for sentiment analysis, entity
recognition, translation, and more.
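A minimal sketch of the kind of text analysis described above, using the OCI Python SDK's ai_language client; the sample text is illustrative only.

```python
# Minimal sketch: sentiment and entity analysis with OCI Language.
# The sample text is illustrative; results depend on the service's pretrained models.
import oci

config = oci.config.from_file()
client = oci.ai_language.AIServiceLanguageClient(config)

docs = [
    oci.ai_language.models.TextDocument(
        key="doc-1",
        text="Customers love the new mobile ordering experience, but delivery delays remain a concern.",
        language_code="en",
    )
]

sentiment = client.batch_detect_language_sentiments(
    oci.ai_language.models.BatchDetectLanguageSentimentsDetails(documents=docs)
).data
entities = client.batch_detect_language_entities(
    oci.ai_language.models.BatchDetectLanguageEntitiesDetails(documents=docs)
).data

print(sentiment.documents[0].aspects)                    # aspect-level sentiment
print([e.text for e in entities.documents[0].entities])  # detected entities
```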
OCI Speech is an AI service that applies automatic speech recognition technology to transform audio-based content
to text. Developers can easily make API calls to integrate OCI Speech’s pretrained models into their applications. OCI
Speech can be used for accurate, text-normalized, time-stamped transcription via the console and REST APIs as well
as CLIs or SDKs. You can also use OCI Speech in an OCI Data Science notebook session. With OCI Speech, you can
filter profanities, get confidence scores for both single words and complete transcriptions, and more.
OCI Vision is an AI service for performing deep-learning–based image analysis at scale. With prebuilt models available
out of the box, developers can easily build image recognition and text recognition into their applications without
machine learning (ML) expertise. For industry-specific use cases, developers can automatically train custom vision
models with their own data. These models can be used to detect visual anomalies in manufacturing, organize digital
media assets, and tag items in images to count products or shipments.
OCI Document Understanding is an AI service that enables developers to extract text, tables, and other key data from
document files through APIs and command line interface tools. With OCI Document Understanding, you can
automate tedious business processing tasks with prebuilt AI models and customize document extraction to fit your
industry-specific needs.
Data preparation
• Flexible data access: Data scientists can access and use any data source in any cloud or on-premises. This
provides more potential data features that lead to better models.
• Data labeling: Oracle Cloud Infrastructure (OCI) Data Labeling is a service for building labeled datasets to
more accurately train AI and machine learning models. With OCI Data Labeling, developers and data
scientists assemble data, create and browse datasets, and apply labels to data records.
• Data preparation at scale with Spark: Submit interactive Spark queries to your OCI Data Flow Spark cluster.
Or, use Oracle Accelerated Data Science SDK to easily develop a Spark application and then run it at scale on
OCI Data Flow, all from within the Data Science environment.
• Feature store: Define feature engineering pipelines and build features with fully managed execution. Version
and document both features and feature pipelines. Share, govern, and control access to features. Consume
features for both batch and real-time inference scenarios.
Model Building
• JupyterLab interface: Built-in, cloud-hosted JupyterLab notebook environments enable data science teams
to build and train models using a familiar user interface.
• Open source machine learning frameworks: OCI Data Science provides familiarity and versatility for data scientists, with hundreds of popular open source tools and frameworks, such as TensorFlow and PyTorch, plus the ability to add frameworks of choice. A strategic partnership between OCI and Anaconda enables OCI users to
download and install packages directly from the Anaconda repository at no cost—making secure open source
more accessible than ever.
• Oracle Accelerated Data Science (ADS) library: Oracle Accelerated Data Science SDK is a user-friendly
Python toolkit that supports the data scientist through their entire end-to-end data science workflow.
Model training
• Powerful hardware, including graphics processing units (GPUs): With NVIDIA GPUs, data scientists can build and train deep learning models in less time, with speedups of 5 to 10 times compared to CPUs.
• Jobs: Use Jobs to run repeatable data science tasks in batch mode. Scale up your model training with support
for bare metal NVIDIA GPUs and distributed training.
• In-console editing of job artifacts: Easily create, edit, and run Data Science job artifacts directly from the
OCI Console using the Code Editor. Comes with Git integration, autoversioning, personalization, and more.
Governance and model management
• Model Catalog: Data scientists use the model catalog to preserve and share completed machine learning models. The catalog stores the artifacts and captures metadata around the taxonomy and context of the model.
• Managed model deployment: Deploy machine learning models as HTTP endpoints for serving model predictions on new data in real time. Simply click to deploy from the model catalog, and OCI Data Science handles all infrastructure operations, including compute provisioning and load balancing; a signed-invocation sketch follows this list.
• ML pipelines: Operationalize and automate your model development, training, and deployment workflows
with a fully managed service to author, debug, track, manage, and execute ML pipelines.
• ML monitoring: Continuously monitor models in production for data and concept drift. Enables data
scientists, site reliability engineers, and DevOps engineers to receive alerts and quickly assess model
retraining needs.
• ML applications: Originally designed for Oracle’s own SaaS applications to embed AI features, ML
applications are now available to automate the entire MLOps lifecycle, including development, provisioning,
and ongoing maintenance and fleet management, for ISVs with hundreds of models for each of their
thousands of customers.
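The sketch below illustrates the signed-invocation pattern for a managed model deployment endpoint referenced above; the endpoint URL and payload shape are hypothetical placeholders that depend on the deployed model.

```python
# Minimal sketch: invoke an OCI Data Science model deployment endpoint with a signed request.
# The endpoint URL and payload below are hypothetical placeholders.
import oci
import requests

config = oci.config.from_file()
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# Placeholder model deployment invocation endpoint.
endpoint = ("https://modeldeployment.us-ashburn-1.oci.customer-oci.com/"
            "ocid1.datasciencemodeldeployment.oc1..example/predict")

payload = {"features": [[5.1, 3.5, 1.4, 0.2]]}  # shape depends on the deployed model

response = requests.post(endpoint, json=payload, auth=signer)
response.raise_for_status()
print(response.json())
```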
AI Quick Actions
• No-code access: Leverage LLMs, such as Llama 2 and Mistral 7B, with one click via seamless integration with
Data Science notebooks.
• Deployment: Access support for model deployment using Text Generation Inference (Hugging Face), vLLM
(UC Berkeley), and NVIDIA Triton serving with public examples for
o Llama 2 with 7 billion parameters and 13 billion parameters using NVIDIA A10 GPUs
o Llama 2 with 70 billion parameters using NVIDIA A100 and A10 GPUs via GPTQ quantization
o Mistral 7B
o Jina Embeddings models using the NVIDIA A100 GPU
• Fine-tuning: Users can access moderation controls, endpoint model swap with zero downtime, and
endpoints deactivation and activation capabilities. Leverage distributed training with PyTorch, Hugging Face
Accelerate, and DeepSpeed for fine-tuning of LLMs to achieve optimal performance. Enable effortless
checkpointing and storage of fine-tuned weights with mounting for object storage and file system as a
service. Additionally, service-provided conda environments eliminate the requirement for custom Docker environments and enable sharing with less slowdown.