DW Lecture Unit 1
Architecture is the proper arrangement of the elements. We build a data warehouse with software and
hardware components. To suit the requirements of our organization, we arrange these building blocks
in different ways; we may also want to strengthen one part or another with extra tools and services.
All of this depends on our circumstances.
The figure shows the essential elements of a typical data warehouse. The Source Data component
appears on the left. The Data Staging component serves as the next building block. In the middle is
the Data Storage component, which holds the data warehouse's data. This component not only stores
and manages the data; it also keeps track of the data by means of the metadata repository. The
Information Delivery component, shown on the right, consists of all the different ways of making the
information from the data warehouse available to the users.
After we have extracted data from the various operational systems and external sources, we have to
prepare it for storing in the data warehouse. The extracted data coming from several different
sources needs to be changed, converted, and made ready in a format that is suitable for querying and
analysis.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ the
appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation presents
even greater challenges. We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may involve correcting misspellings,
providing default values for missing data elements, or eliminating duplicates when we bring in the
same data from various source systems.
Standardization of data elements forms a large part of data transformation. Data transformation also
involves many forms of combining pieces of data from different sources: we combine data from a
single source record or related data elements from many source records.
In addition, data transformation involves purging source data that is not useful and separating out
source records into new combinations. Sorting and merging of data take place on a large scale in the
data staging area. When the data transformation function ends, we have a collection of integrated
data that is cleaned, standardized, and summarized.
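To make these staging steps concrete, here is a minimal Python sketch of the cleaning, standardization, and de-duplication logic. The field names (cust_id, cust_name, state) and the spelling-fix table are assumptions made for the example, not part of the lecture.

```python
# Minimal data-transformation sketch for the staging area.
# Field names (cust_id, cust_name, state) are illustrative assumptions.

DEFAULT_STATE = "UNKNOWN"          # default value for a missing data element
SPELLING_FIXES = {"Calfornia": "California", "Texs": "Texas"}

def clean(record):
    """Correct misspellings and supply defaults for missing elements."""
    state = record.get("state") or DEFAULT_STATE
    state = SPELLING_FIXES.get(state, state)
    return {**record, "state": state}

def standardize(record):
    """Bring data elements from different sources into one common format."""
    return {**record,
            "cust_name": record["cust_name"].strip().title(),
            "state": record["state"].upper()}

def transform(source_records):
    """Clean, standardize, and eliminate duplicates across source systems."""
    seen, result = set(), []
    for rec in source_records:
        rec = standardize(clean(rec))
        if rec["cust_id"] not in seen:        # same customer from two systems
            seen.add(rec["cust_id"])
            result.append(rec)
    return result

# Records extracted from two different operational systems
extracted = [
    {"cust_id": 101, "cust_name": " alice smith ", "state": "Calfornia"},
    {"cust_id": 101, "cust_name": "Alice Smith",   "state": "California"},
    {"cust_id": 102, "cust_name": "bob jones",     "state": None},
]
print(transform(extracted))
```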
3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete
the structure and construction of the data warehouse and go live for the first time, we do the initial
loading of the data into the data warehouse storage. The initial load moves high volumes of data and
uses up a substantial amount of time. After that, ongoing incremental loads keep the warehouse
current with new data.
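A compact sketch of the two loading modes, using an in-memory SQLite table as a stand-in for the warehouse storage; the sales_fact table and its columns are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse storage in this sketch.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    """One-time bulk load of high volumes of historical data."""
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    dw.commit()

def incremental_load(rows):
    """Periodic load that appends only the new transactions."""
    dw.executemany("INSERT OR IGNORE INTO sales_fact VALUES (?, ?)", rows)
    dw.commit()

initial_load([(1, 120.0), (2, 75.5)])
incremental_load([(3, 42.0)])
print(dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # -> 3
```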
Data storage for the data warehouse is a separate repository. The data repositories for the operational
systems generally contain only the current data. Also, these data repositories hold the data structured
in a highly normalized form for fast and efficient processing.
The information delivery component enables the process of subscribing for data warehouse
information and having it delivered to one or more destinations according to some customer-specified
scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures, the
data about the records and addresses, the information about the indexes, and so on.
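As a hypothetical illustration of what such a repository might hold, the sketch below stores data-dictionary-style entries (logical structure, storage details, indexes, source) for one assumed table; the names are invented for the example.

```python
# Hypothetical metadata repository entry, mirroring what a data dictionary holds.
metadata_repository = {
    "sales_fact": {
        "logical_structure": {"sale_id": "INTEGER", "amount": "REAL", "date_key": "INTEGER"},
        "storage": {"tablespace": "DW_DATA", "row_count": 1_250_000},
        "indexes": ["idx_sales_date_key"],
        "source": "orders table of the operational order-entry system",
    }
}

# A query or reporting tool can consult the repository before building a query.
print(metadata_repository["sales_fact"]["indexes"])
```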
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to particular selected subjects. Data in a data warehouse should be fairly current,
but not necessarily up to the minute, although developments in the data warehouse industry have
made standard and incremental data loads more achievable. Data marts are smaller than data
warehouses and usually contain data for a single department or group of users. The current trend in
data warehousing is to develop a data warehouse with several smaller related data marts for
particular kinds of queries and reports.
Management and Control Component
The management and control component coordinates the services and functions within the data
warehouse. It controls the data transformation and the data transfer into the data warehouse
storage. It also moderates the delivery of information to the users. It works with the database
management system and ensures that data is correctly stored in the repositories. It monitors the
movement of data into the staging area and from there into the data warehouse storage itself.
Database vs. Data Warehouse
1. Database: Used for Online Transaction Processing (OLTP), but can also be used for other purposes
   such as data warehousing; it records the current data from the clients.
   Data Warehouse: Used for Online Analytical Processing (OLAP); it reads historical data to support
   business decisions.
2. Database: The tables and joins are complicated since they are normalized for the RDBMS; this
   reduces redundant data and saves storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized; this minimizes the
   response time for analytical queries.
3. Database: Data is dynamic.
   Data Warehouse: Data is largely static.
4. Database: Entity-relationship modeling techniques are used for the database design.
   Data Warehouse: Data modeling techniques are used for the data warehouse design.
5. Database: Performance is low for analytical queries.
   Data Warehouse: Performance is high for analytical queries.
6. Database: The database is where data is stored and managed for fast and efficient access.
   Data Warehouse: The data warehouse is where application data is handled for analysis and
   reporting purposes.
The operational database is the source of data for the data warehouse. It contains detailed
information used to run the day-to-day operations of the business. The data changes frequently as
updates are made, and it reflects the current value of the last transactions.
Operational database management systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.
Data warehouse systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of both
these databases are different.
Operational Database vs. Data Warehouse
1. Operational Database: Operational systems are designed to support high-volume transaction
   processing.
   Data Warehouse: Data warehousing systems are typically designed to support high-volume
   analytical processing (i.e., OLAP).
2. Operational Database: Operational systems are usually concerned with current data.
   Data Warehouse: Data warehousing systems are usually concerned with historical data.
3. Operational Database: Data within operational systems is updated regularly according to need.
   Data Warehouse: Data is non-volatile; new data may be added regularly, but once added it is
   rarely changed.
4. Operational Database: Optimized for a simple set of transactions, generally adding or retrieving a
   single row at a time per table.
   Data Warehouse: Optimized for bulk loads and for complex, unpredictable queries that access
   many rows per table.
5. Operational Database: Operational systems are generally process-oriented.
   Data Warehouse: Data warehousing systems are generally subject-oriented.
6. Operational Database: Usually optimized to perform fast inserts and updates of relatively small
   volumes of data.
   Data Warehouse: Usually optimized to perform fast retrievals of relatively large volumes of data.
7. Operational Database: Relational databases are created for Online Transaction Processing (OLTP).
   Data Warehouse: Data warehouses are designed for Online Analytical Processing (OLAP).
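To illustrate the contrast summarized above, the following sketch runs a typical OLTP-style single-row lookup and a typical OLAP-style aggregation against the same small, assumed SQLite table; the orders schema is an example, not a prescribed design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
               [(1, "Alice", 120.0, 2023), (2, "Bob", 75.5, 2023), (3, "Alice", 42.0, 2024)])

# OLTP-style access: touch a single row in a single table.
print(db.execute("SELECT amount FROM orders WHERE order_id = ?", (2,)).fetchone())

# OLAP-style access: scan many rows and summarize them.
print(db.execute("SELECT year, SUM(amount) FROM orders GROUP BY year").fetchall())
```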
OLAP System
OLAP deals with historical or archival data. Historical data are data that are archived over a long
period. For example, if we collect the last 10 years' information about flight reservations, the data can
give us meaningful information, such as trends in reservations. This may provide useful information
like the peak time of travel and what kind of people are traveling in the various classes
(Economy/Business).
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. Whereas an OLTP system handles many concurrent users and queries touching
only an individual record or limited groups of records at a time, an OLAP system must have the
capability to operate on millions of records to answer a single query.
Data contents
   OLTP: Manages current data that is typically too detailed to be used directly for decision making.
   OLAP: Manages large amounts of historical data, provides facilities for summarization and
   aggregation, and stores and manages data at different levels of granularity. This makes the data
   easier to use for informed decision making.
Database design
   OLTP: Usually uses an entity-relationship (ER) data model and an application-oriented database
   design.
   OLAP: Typically uses either a star or a snowflake model (a star schema sketch follows this table)
   and a subject-oriented database design.
View
   OLTP: Focuses mainly on the current data within an enterprise or department, without referring
   to historical data or data in different organizations.
   OLAP: Often spans multiple versions of a database schema, due to the evolutionary process of an
   organization. OLAP systems also deal with data that originates from different organizations,
   integrating information from many data stores.
Volume of data
   OLTP: Not very large.
   OLAP: Very large; because of their volume, OLAP data are stored on multiple storage media.
Access patterns
   OLTP: Access patterns consist mainly of short, atomic transactions. Such a system requires
   concurrency control and recovery techniques.
   OLAP: Accesses are mostly read-only operations, because the data warehouse stores historical
   data.
Inserts and updates
   OLTP: Short and fast inserts and updates initiated by end users.
   OLAP: Periodic long-running batch jobs refresh the data.
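Since the table above mentions the star and snowflake models, here is a minimal star schema sketch: one fact table whose foreign keys point at surrounding dimension tables, written as SQLite DDL issued from Python. The table and column names are illustrative assumptions.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
-- Dimension tables: descriptive, de-normalized context for analysis.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: numeric measures plus foreign keys pointing at each dimension.
CREATE TABLE sales_fact (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")
```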
From the architecture point of view, there are three data warehouse models: the virtual warehouse,
the data mart, and the enterprise warehouse.
Virtual Warehouse: A virtual warehouse is created based on a set of views defined over an
operational RDBMS. This warehouse type is relatively easy to build but requires excess capacity on
the underlying operational database system. The users directly access operational data via
middleware tools. This architecture is feasible only if queries are posed infrequently, and it is usually
used as a temporary solution until a permanent data warehouse is developed.
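A minimal sketch of this idea, assuming a small operational SQLite table: the "virtual warehouse" is nothing more than analytical views defined over the operational data, so every query consumes capacity on the operational system itself.

```python
import sqlite3

op_db = sqlite3.connect(":memory:")          # stands in for the operational RDBMS
op_db.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
op_db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                  [(1, "EAST", 120.0), (2, "WEST", 75.5), (3, "EAST", 42.0)])

# The "virtual warehouse" is just a set of analytical views over operational data;
# no data is copied, so each query runs against the operational server.
op_db.execute("""
CREATE VIEW v_sales_by_region AS
SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
FROM orders GROUP BY region
""")
print(op_db.execute("SELECT * FROM v_sales_by_region").fetchall())
```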
Data Mart: The data mart contains a subset of the organisation-wide data that is of value to a
small group of users, e.g., marketing or customer service. This is usually a precursor (and/or a
successor) of the actual data warehouse; it differs in that its scope is confined to a specific group
of users.
Depending on the source of data, data marts can be categorized into the following two classes:
1. Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a
particular department or geographic area.
2. Dependent data marts are sourced directly from enterprise data warehouses.
Enterprise warehouse: This warehouse type holds all the information about subjects spanning the
entire organisation. For a medium- to large-size company, usually several years are needed to
design and build the enterprise warehouse.
The differences between the virtual and the enterprise DWs are shown in Figure 1.4. Data marts
can also be created as successors of an enterprise data warehouse. In this case, the DW consists of
an enterprise warehouse and (several) data marts.
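As a small, hypothetical illustration of a dependent data mart sourced from the enterprise warehouse, the sketch below copies only a marketing-relevant subset of an assumed sales_fact table into its own mart table.

```python
import sqlite3

dw = sqlite3.connect(":memory:")             # stands in for the enterprise warehouse
dw.execute("CREATE TABLE sales_fact (region TEXT, product TEXT, revenue REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
               [("EAST", "A", 120.0), ("WEST", "B", 75.5), ("EAST", "B", 42.0)])

# Dependent data mart: a subject-specific subset sourced directly from the warehouse.
dw.execute("""
CREATE TABLE marketing_mart AS
SELECT region, SUM(revenue) AS revenue
FROM sales_fact
GROUP BY region
""")
print(dw.execute("SELECT * FROM marketing_mart").fetchall())
```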
Lecture Topic 4: AUTONOMOUS DATA WAREHOUSE
Autonomous Data Warehouse (ADW) is a cloud-based database service provided by Oracle. It is
part of Oracle's Autonomous Database offerings, which also include Autonomous Transaction
Processing (ATP). ADW is designed to simplify database management, reduce operational costs, and
improve performance by leveraging automation and cloud technologies. Oracle ADW is often
compared with Snowflake, another cloud data warehousing platform, on the following points:
1. Vendor:
• Oracle ADW is a product of Oracle Corporation, a well-established database vendor with a
long history in the industry.
• Snowflake is a cloud-native data warehousing platform developed by Snowflake
Computing, a newer entrant to the market.
2. Architecture:
• Oracle ADW is built on Oracle Database technology and is part of the Oracle
Cloud Infrastructure. It utilizes Oracle's Autonomous Database technology, which
includes self-driving, self-securing, and self-repairing capabilities.
• Snowflake is built as a multi-cloud, multi-cluster, and multi-region data
warehouse service. It has a unique architecture that separates storage and compute
resources, providing elasticity and scalability.
3. Automation:
• Both ADW and Snowflake emphasize automation. ADW, as part of the Oracle
Autonomous Database family, is designed to automate various database
management tasks, including provisioning, patching, and tuning.
• Snowflake also offers automation features, such as automatic scaling of compute
resources based on demand and automatic performance optimization.
4. Scalability:
• ADW provides the ability to scale computing resources up or down based on
workload demands, allowing for flexibility in resource allocation.
• Snowflake's architecture allows for independent scaling of compute and storage,
providing the ability to scale resources independently, and it automatically
handles the distribution of data across clusters.
5. Performance:
• Both ADW and Snowflake aim to provide high-performance data
warehousing. ADW includes features like automatic indexing and in-memory
processing.
• Snowflake is known for its ease of scaling, enabling users to achieve
high performance by adding or removing compute resources as needed.
6. Multi-Cloud Support:
• Snowflake is designed to work seamlessly across multiple cloud providers, such
as AWS, Azure, and Google Cloud Platform, providing customers with flexibility
in choosing their preferred cloud infrastructure.
• Oracle ADW is part of the Oracle Cloud Infrastructure and is primarily hosted on
Oracle's cloud.
7. Pricing Model:
• Both ADW and Snowflake offer consumption-based pricing models. Snowflake's
pricing is based on the amount of storage used and the amount of compute resources
consumed.
• Oracle ADW follows a similar model, charging users based on the resources
they consume.
Key Features of a Modern Data Warehouse
1. Cloud-Native Architecture:
• Modern data warehouses are often built on cloud platforms, such as AWS, Azure, or
Google Cloud, to take advantage of scalable and flexible computing resources, as
well as the ability to pay for resources on a consumption basis.
2. Data Lakes Integration:
• Integration with data lakes allows for the storage and analysis of both structured
and unstructured data. This integration supports diverse data types and enables
more comprehensive analytics.
3. Scalability:
• Modern data warehouses are designed to scale horizontally and vertically, allowing
organizations to easily add or remove resources based on data volume and
processing needs.
4. Automated Data Management:
• Automation is a key aspect, covering various tasks such as data ingestion, data
transformation, and data quality checks. Automated processes reduce manual
effort, enhance efficiency, and improve overall system reliability.
5. Data Virtualization:
• Data virtualization enables users to access and analyze data without physically
moving it. This can be particularly useful for integrating data from multiple
sources and providing a unified view without the need for extensive data
movement.
6. Advanced Analytics and Machine Learning:
• Modern data warehouses often incorporate advanced analytics and machine
learning capabilities directly within the platform. This allows organizations to
derive insights from data and build predictive models without having to move the
data to external systems.
7. Real-Time Data Processing:
• The ability to handle real-time data processing and analytics is a crucial aspect of a
modern data warehouse. This is especially important for organizations that require
up-to-the-minute insights for decision-making.
8. Security and Compliance:
• Security features are a priority, including robust authentication, encryption, and
compliance with regulatory standards. Modern data warehouses often provide
fine-grained access controls to ensure data privacy and security.
9. Cost Management:
• Cost-effective solutions are a focus, with modern data warehouses allowing
organizations to pay for the resources they consume. This pay-as-you-go model is often
more cost-efficient than traditional on-premises solutions.
10. Integration with BI Tools and Visualization:
• Seamless integration with business intelligence (BI) tools and visualization
platforms is essential to empower users to easily analyze and visualize data stored in the
warehouse.
11. Flexible Data Models:
• Modern data warehouses support flexible data models, including both relational and
non-relational data. This flexibility accommodates diverse data types and structures.
12. Data Governance:
• Robust data governance features are included to ensure data quality, lineage, and
compliance with regulatory requirements. This includes metadata management,
data cataloging, and lineage tracking.