Lecture 2 - Datawarehouse
A producer wants to know….
Transactional Processing
• Focus on routine processing:
data insertion, modification,
deletion, and transmission
Analytical Processing
• Focus on reporting,
analysis, transformation,
and decision support
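The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, and the values are hypothetical and not part of the lecture.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Transactional processing: routine inserts, updates and deletes of single rows.
conn.execute("INSERT INTO sales (region, amount) VALUES ('North', 100.0)")
conn.execute("INSERT INTO sales (region, amount) VALUES ('South', 250.0)")
conn.execute("INSERT INTO sales (region, amount) VALUES ('North', 50.0)")
conn.execute("UPDATE sales SET amount = 120.0 WHERE id = 1")
conn.execute("DELETE FROM sales WHERE id = 3")
conn.commit()

# Analytical processing: a report that summarizes many rows for decision support.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(report)
```

In practice the two workloads usually run on separate systems; a single in-memory database is used here only to keep the sketch short.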
Data warehouse
What is a Data Warehouse?
Data Warehouse: A single, complete and consistent store of
data obtained from a variety of different sources, made
available to end users in a way they can understand and use
in a business context.
Data Warehouse?
• A data warehouse is a database designed
to enable business intelligence activities.
• It exists to help users understand and
enhance their organization's performance.
• Data warehouses separate analysis
workload from transaction workload and
enable an organization to consolidate
data from several sources.
• This helps in:
• Maintaining historical records
• Analysing the data to gain a better
understanding of the business and to improve
the business
Data warehousing
• The Difference…
– A DWH constitutes the entire information base for all
time (historical data).
– A database holds real-time information.
– A DWH supports data mining and business
intelligence.
– A database is used to run the business.
– A DWH is used to decide how to run the business.
Data warehousing is …
• Subject Oriented: Data that gives information about a
particular subject instead of about a company's ongoing
operations.
• Integrated: Data that is gathered into the data
warehouse from a variety of sources and merged into a
coherent whole.
• Time-variant: All data in the data warehouse is
identified with a particular time period.
• Non-volatile: Data is stable in a data warehouse. More
data is added but data is never removed. This enables
management to gain a consistent picture of the business.
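The time-variant and non-volatile properties can be sketched as an append-only load, where every row is stamped with the period it describes. The `warehouse` list, the `load_snapshot` helper, and the sample rows are hypothetical.

```python
from datetime import date

# Hypothetical warehouse table kept as a list of rows.
warehouse = []

def load_snapshot(rows, as_of):
    """Append a dated snapshot; existing rows are never updated or deleted."""
    for row in rows:
        warehouse.append({**row, "as_of": as_of})

# Two loads for the same customer: the older row stays in place (non-volatile)
# and each row is identified with a time period (time-variant).
load_snapshot([{"customer": "C1", "city": "Accra"}], date(2023, 1, 1))
load_snapshot([{"customer": "C1", "city": "Kumasi"}], date(2023, 2, 1))

history = [
    (row["as_of"].isoformat(), row["city"])
    for row in warehouse
    if row["customer"] == "C1"
]
print(history)
```

Because nothing is overwritten, the full history of the customer remains available for analysis.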
Advantages of data warehouses
A DWH improves the decision-making process as a
result of the following:
• It provides business users with a “customer-
centric” view of the company’s heterogeneous data
by helping to integrate data from customer-related
business systems.
• It provides added value to the company’s
customers by allowing them to access better
information when data warehousing is coupled with
internet technology.
• It consolidates data about individual customers and
provides a repository of all customer contacts for
segmentation modeling, customer retention
planning, and cross sales analysis.
• It removes barriers among functional areas by
offering a way to reconcile views from multiple
areas, thus providing a look at activities that cross
functional lines.
• It reports on trends across multidivisional
operating units, including trends or relationships in
areas such as merchandising, production, and
planning.
Disadvantages of data warehouses
• Data warehouses are not the optimal environment for
unstructured data.
• Because data must be extracted, transformed and loaded
into the warehouse, there is an element of latency in data
warehouse data.
• Over their life, data warehouses can have high costs.
Maintenance costs are high.
• Duplicate, expensive functionality may be developed. Or,
functionality may be developed in the data warehouse that,
in retrospect, should have been developed in the
operational systems and vice versa.
Data Warehousing tools
• Amazon Redshift
• Microsoft Azure
• Talend Open Studio
• Google BigQuery
• Micro Focus Vertica
• Teradata
• Amazon DynamoDB
• PostgreSQL
• Amazon RDS
• Amazon S3
• SAP HANA
• MarkLogic
• MariaDB
• Db2 Warehouse
• Exadata
• BI360 Data Warehouse
• Cloudera
Data Marts
• A data mart is a scaled down version of a data warehouse
that focuses on a particular subject area.
• A data mart is a subset of an organizational data store,
usually oriented to a specific purpose or major data
subject, that may be distributed to support business needs.
• Data marts are analytical data stores designed to focus on
specific business functions for a specific community within
an organization.
• A data mart is usually designed to support the unique
business requirements of a specified department or
business process.
• It is often implemented as the first step in proving the
usefulness of the technologies to solve business problems.
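One way to picture a data mart as a subject-focused subset of the warehouse is a view over a warehouse table. This is a minimal sketch with hypothetical table, view, and column names; a real data mart is often a separate physical store rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding data for the whole organization.
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("marketing", "A", 10.0), ("finance", "B", 20.0), ("marketing", "C", 30.0)],
)

# Hypothetical data mart: only the rows and columns one department needs.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE dept = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT product, amount FROM marketing_mart ORDER BY product"
).fetchall()
print(mart_rows)
```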
Data Warehouse Architecture
From the Data Warehouse to Data Marts
[Figure: from data (bottom) to information (top). Levels:
organizationally structured (data warehouse), departmentally
structured, individually structured. Side scale, from more to
less: history, normalized, detailed.]
To summarize ...
Data warehouse
Types of Data Mart
Dependent Data Mart
Business analytics
Business analytics (BA) is the practice of iterative, methodical exploration of
an organization’s data with emphasis on statistical analysis.
Data mining
Data mining techniques are a blend of statistics, mathematics,
artificial intelligence, and machine learning.
Levels of Analytical Processing
How do advertising
activities affect sales of
different products bought
by different types of
customers, in different
regions? (synthesizing)
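A question like this combines several dimensions (product, customer type, region) with measures (advertising spend, sales). As a rough sketch, with a hypothetical `sales` table and made-up rows, it amounts to grouping by all three dimensions at once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table: three dimensions and two measures.
conn.execute(
    "CREATE TABLE sales ("
    "product TEXT, customer_type TEXT, region TEXT, ad_spend REAL, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("soap", "retail", "North", 10.0, 100.0),
        ("soap", "retail", "North", 20.0, 180.0),
        ("soap", "wholesale", "South", 5.0, 40.0),
    ],
)

# Summarize advertising spend and revenue for every combination of dimensions.
summary = conn.execute(
    "SELECT product, customer_type, region, SUM(ad_spend), SUM(revenue) "
    "FROM sales "
    "GROUP BY product, customer_type, region "
    "ORDER BY product, customer_type, region"
).fetchall()
print(summary)
```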
OLTP (Online Transaction Processing)
Strengths of OLAP
• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and
outliers
ETL
• ETL is a process that extracts the data from different
source systems, then transforms the data (like applying
calculations, concatenations, etc.) and finally loads the
data into the Data Warehouse system.
• ETL stands for Extract, Transform and Load.
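The three steps can be sketched as three small functions. This is a minimal illustration, not a production pipeline: the CSV source, the `fact_sales` table, and all column names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be a file or an operational system.
SOURCE_CSV = "order_id,quantity,unit_price\n1,2,5.0\n2,1,3.5\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: apply a calculation (order total = quantity * unit price)."""
    return [
        (int(r["order_id"]), int(r["quantity"]) * float(r["unit_price"]))
        for r in rows
    ]

def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT order_id, total FROM fact_sales ORDER BY order_id").fetchall()
print(loaded)
```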
Why do you need ETL?
• Transactional databases cannot answer complex business
questions that can be answered once the data has been prepared by ETL.
• ETL provides a method of moving the data from various sources
into a data warehouse.
• As data sources change, rerunning the ETL process keeps the Data
Warehouse up to date.
• ETL allows verification of data transformation, aggregation and
calculation rules.
• The ETL process allows sample data comparison between the source
and the target system.
• The ETL process can perform complex transformations and requires
an extra (staging) area to store the data.
• ETL helps to migrate data into a Data Warehouse.
• ETL converts data of various formats and types to adhere to one
consistent system.
Extraction
• In this step of ETL architecture, data is extracted
from the source system into the staging area.
• Transformations, if any, are done in the staging area
so that the performance of the source system is not
degraded.
• Also, if corrupted data is copied directly from the
source into Data warehouse database, rollback
will be a challenge.
• Staging area gives an opportunity to validate
extracted data before it moves into the Data
warehouse.
Some validations are done during Extraction
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Data type check
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place or not
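Some of these checks can be sketched as a single function that runs on the staged rows before they move on. The function name `validate_extract`, the `id` and `amount` fields, and the sample rows are hypothetical.

```python
def validate_extract(source_rows, staged_rows, key="id"):
    """Return a list of problems found in the staged rows (empty if none)."""
    problems = []

    # Reconcile record counts with the source data.
    if len(staged_rows) != len(source_rows):
        problems.append("row count mismatch")

    seen = set()
    for row in staged_rows:
        # Check whether the key is in place.
        if row.get(key) is None:
            problems.append("missing key")
            continue
        # Detect duplicate records.
        if row[key] in seen:
            problems.append(f"duplicate key {row[key]}")
        seen.add(row[key])
        # Data type check on a numeric field.
        if not isinstance(row.get("amount"), (int, float)):
            problems.append(f"bad amount type for key {row[key]}")
    return problems

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]
staged = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},  # duplicate of the first row
    {"id": 2, "amount": "5"},   # amount arrived as text
]
problems = validate_extract(source, staged)
print(problems)
```

Returning the problems instead of raising an error lets the staging job decide whether to stop or to load only the clean rows.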
Transformation
• Data transformation is done in
the staging area.
• Data extracted from source
server is raw and not usable in its
original form.
• Therefore, it needs to be
cleansed, mapped and
transformed.
• It is one of the important ETL concepts where you apply a
set of functions on extracted data.
• Data that does not require any transformation is called
direct move or pass-through data.
Data Integration Issues
In the transformation step, you can perform customized operations on data.
For instance, the user may want a sum-of-sales revenue figure that is not in
the database, or the first name and the last name in a table may be in
different columns; they can be concatenated before loading.
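Both operations mentioned here, deriving a total that is not stored and concatenating name columns, can be sketched in a few lines. The rows and field names are hypothetical.

```python
# Hypothetical extracted rows sitting in the staging area.
rows = [
    {"first_name": "John", "last_name": "Doe", "revenue": 100.0},
    {"first_name": "Ama", "last_name": "Mensah", "revenue": 50.0},
]

def add_full_name(row):
    """Concatenate first and last name into one column before loading."""
    out = dict(row)
    out["full_name"] = f"{row['first_name']} {row['last_name']}"
    return out

transformed = [add_full_name(r) for r in rows]

# Derived value that is not stored in the source: the sum of sales revenue.
total_revenue = sum(r["revenue"] for r in rows)

print([r["full_name"] for r in transformed], total_revenue)
```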
Data Integrity Problems
• Different spelling of the same person like Jon, John, etc.
• There are multiple ways to denote company name like
Google, Google Inc.
• Use of different names such as Accra, Acra.
• There may be a case that different account numbers are
generated by various applications for the same customer.
• Some required fields remain blank.
• Invalid product data is collected at the POS, as manual
entry can lead to mistakes.
Validations are done during this stage
• Transposing rows and columns
• Using lookups to merge data
• Using any complex data validation (e.g., if the first two
columns in a row are empty, then the row is automatically
rejected from processing)
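A lookup table and a rejection rule can address several of the points above. The sketch below is an assumption-laden illustration: the `CANONICAL` mapping, the helper names, and the sample rows are hypothetical, and a real mapping would need to be agreed with the business.

```python
# Hypothetical lookup from known variants to one canonical spelling.
CANONICAL = {
    "jon": "John",
    "john": "John",
    "google": "Google",
    "google inc.": "Google",
    "acra": "Accra",
    "accra": "Accra",
}

def canonical(value):
    """Map a known variant to its canonical form; leave unknown values as they are."""
    cleaned = value.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)

def clean(rows):
    """Reject rows whose first two columns are empty; normalize the rest."""
    kept, rejected = [], []
    for row in rows:
        if not row[0] and not row[1]:
            rejected.append(row)
            continue
        kept.append(tuple(canonical(v) if isinstance(v, str) else v for v in row))
    return kept, rejected

rows = [
    ("Jon", "Google Inc.", "Acra"),
    ("", "", "Accra"),
    ("John", "Google", "Accra"),
]
kept, rejected = clean(rows)
print(kept, rejected)
```

Rejected rows are kept in a separate list rather than dropped, so they can be reviewed and corrected at the source.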
ETL Tools
• MarkLogic:
https://www.marklogic.com/product/getting-started/
• Oracle: https://www.oracle.com/index.html
• Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
• Talend Data Integration Studio:
https://www.talend.com/products/talend-open-studio/
• MapForce
ETL Best Practices
• Do not try to cleanse all the data
• Do not skip cleansing altogether; always plan to clean something
• Determine the cost of cleansing the data
• To speed up query processing, have auxiliary views
and indexes
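The last point can be sketched with sqlite3: an index on a frequently filtered column and a view that packages a common summary. Table, index, and view names are hypothetical. Note that a SQLite view is not materialized, so it simplifies queries rather than storing precomputed results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse fact table.
conn.execute("CREATE TABLE fact_sales (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("North", "2023-01-01", 100.0),
        ("North", "2023-01-02", 50.0),
        ("South", "2023-01-01", 250.0),
    ],
)

# Auxiliary index on a column that queries filter on.
conn.execute("CREATE INDEX idx_sales_region ON fact_sales (region)")

# Auxiliary view that packages a frequently needed summary.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM fact_sales GROUP BY region"
)

# Ask SQLite how it would run a filtered query and check that the index is used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM fact_sales WHERE region = 'North'"
).fetchall()
uses_index = any("idx_sales_region" in str(step) for step in plan)

totals = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(uses_index, totals)
```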
Assignment