Lecture 2 - Datawarehouse

The document discusses data warehousing and data marts. It defines a data warehouse as a single consistent store of historical data from different sources for analysis. It also defines data marts as focused subsets of a data warehouse tailored for specific business units or functions. The document outlines the advantages of data warehouses for decision making and describes common data warehousing tools and architectures.

Uploaded by

arthurquamena

Lecture 2

OMIS 612: Business Intelligence


Data Warehouse

Dr. Awuni Emmanuel


Overview
Data Warehousing
Data Mart
OLTP and OLAP
ETL and ETL tools

A producer wants to know…
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
How do we find and manage this data?
Data Management
A special database system called a data warehouse or data mart is often used to store enterprise data.
• The purpose of a data warehouse is to organize large amounts of stable data for ease of analysis and retrieval.

Traditional (operational) databases facilitate data management and transaction processing. They have two limitations for data analysis and decision support.
• Performance
– They are transaction oriented (data insert, update, move, etc.)
– Not optimized for complex data analysis
– Usually do not hold historical data
• Heterogeneity
– Individual databases usually manage data in very different ways, even in the same organization (not to mention external data sources, which may be dramatically different).
Data Processing types
Types of Information Processing

Transactional Processing
• Focus on routine processing: data insertion, modification, deletion, and transmission

Analytical Processing
• Focus on reporting, analysis, transformation, and decision support
What is a Data Warehouse?

Data Warehouse: A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context.
Data Warehouse?
• A data warehouse is a database designed
to enable business intelligence activities.
• It exists to help users understand and
enhance their organization's performance.
• Data warehouses separate analysis
workload from transaction workload and
enable an organization to consolidate
data from several sources.
• This helps in:
• Maintaining historical records
• Analysing the data to gain a better
understanding of the business and to improve
the business
Data warehousing

• Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database.
• Common ways of accessing a data warehouse include queries, analysis and reporting.
• Because data warehousing creates one database in the end, there is no fixed limit on the number of sources, provided that the system can handle the volume.
• The final result is homogeneous data, which can be more easily manipulated.
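
As a rough sketch of what combining varied sources into homogeneous data can look like, the Python below maps two source extracts onto one shared layout. The two sources (a CRM export and a billing export), their field names and the sample values are all invented for illustration; a real warehouse would use its own schema and tooling.

```python
# Hypothetical source extracts: the same kind of record in different shapes.
crm_rows = [{"CustomerName": "Ama Mensah", "City": "Accra", "SalesGHS": "1200.50"}]
billing_rows = [{"name": "kofi boateng", "town": "Kumasi", "sales_pesewas": 98000}]


def from_crm(row):
    """Map a CRM row onto the shared warehouse layout."""
    return {
        "customer": row["CustomerName"].title(),
        "city": row["City"],
        "sales_ghs": float(row["SalesGHS"]),
        "source": "crm",
    }


def from_billing(row):
    """Map a billing row onto the same layout, converting pesewas to cedis."""
    return {
        "customer": row["name"].title(),
        "city": row["town"],
        "sales_ghs": row["sales_pesewas"] / 100,
        "source": "billing",
    }


# One homogeneous collection, whatever the number of sources.
warehouse = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Adding a third source would only need one more mapping function; everything downstream keeps reading the same four fields.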
Data Warehouse Architecture
Database and Data Warehousing

• The difference:
– The DWH constitutes the entire information base for all time (historical data).
– A database holds real-time (current) information.
– The DWH supports data mining and business intelligence.
– A database is used to run the business.
– The DWH is used to decide how to run the business.
Data warehousing is …
• Subject Oriented: Data that gives information about a
particular subject instead of about a company's ongoing
operations.
• Integrated: Data that is gathered into the data
warehouse from a variety of sources and merged into a
coherent whole.
• Time-variant: All data in the data warehouse is
identified with a particular time period.
• Non-volatile: Data is stable in a data warehouse. More
data is added but data is never removed. This enables
management to gain a consistent picture of the business.
Advantages of data warehouses
DWH Improves the decision-making process as a
result of the following:
• It provides business users with a “customer-
centric” view of the company’s heterogeneous data
by helping to integrate data from customer-related
business systems.
• It provides added value to the company’s
customers by allowing them to access better
information when data warehousing is coupled with
internet technology.
Advantages of data warehouses
• It consolidates data about individual customers and
provides a repository of all customer contacts for
segmentation modeling, customer retention
planning, and cross sales analysis.
• It removes barriers among functional areas by
offering a way to reconcile views from multiple
areas, thus providing a look at activities that cross
functional lines.
• It reports on the trends across multidivisional
operating units, including trends or relationships in
areas such as merchandising, production and
planning etc.
Disadvantages of data warehouses
• Data warehouses are not the optimal environment for
unstructured data.
• Because data must be extracted, transformed and loaded
into the warehouse, there is an element of latency in data
warehouse data.
• Over their life, data warehouses can have high costs.
Maintenance costs are high.
• Duplicate, expensive functionality may be developed. Or,
functionality may be developed in the data warehouse that,
in retrospect, should have been developed in the
operational systems and vice versa.
Data Warehousing tools
• Amazon Redshift
• Microsoft Azure
• Talend Open Studio
• Google BigQuery
• Micro Focus Vertica
• Teradata
• Amazon DynamoDB
• PostgreSQL
• Amazon RDS
• Amazon S3
• SAP HANA
• MarkLogic
• MariaDB
• Db2 Warehouse
• Exadata
• BI360 Data Warehouse
• Cloudera
Data Marts
• A data mart is a scaled down version of a data warehouse
that focuses on a particular subject area.
• A data mart is a subset of an organizational data store,
usually oriented to a specific purpose or major data
subject, that may be distributed to support business needs.
• Data marts are analytical data stores designed to focus on
specific business functions for a specific community within
an organization.
• Usually designed to support the unique business
requirements of a specified department or business process
• Implemented as the first step in proving the usefulness of
the technologies to solve business problems.
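Data Warehouse Architecture
From the Data Warehouse to Data Marts

[Figure: layers running from "Data" at the bottom to "Information" at the top. The organizationally structured data warehouse is at the base, with departmentally structured and then individually structured layers above it. A scale labelled "History, Normalized, Detailed" runs from "More" at the warehouse level to "Less" at the individual level.]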
To summarize ...

• OLTP (operational/transactional) systems are used to “run” a business
• The Data Warehouse helps to “optimize” the business
Data Marts

Reasons for creating a data mart


• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data
warehouse
• Potential users are more clearly defined than in a
full Data warehouse
Characteristics of the Departmental Data Mart
• Small
• Flexible
• Customized by Department
• OLAP
• Source is the departmentally structured data warehouse

[Figure labels: Data mart, Data warehouse]
Types of Data Mart
Dependent Data Mart

• A dependent data mart is built from a central data warehouse: all your business data is first combined into a single data warehouse, giving you the typical benefits of centralization.
• If one or more physical data marts are needed, build them as dependent data marts to ensure consistency and integration across all data storage systems.
Types of Data Mart
Independent Data Mart
• An independent data mart can be created without using the central data warehouse.
• It is mostly recommended for smaller units or groups within an organization.
• As the name suggests, this kind of data mart is related neither to the enterprise data warehouse nor to any other data mart.
• It takes in data separately, and the analyses are also executed independently.
Types of Data Mart
Hybrid Data Mart
• By using a hybrid data mart, you can combine data
from several operational source systems in addition
to a data warehouse.
• As the name indicates, a hybrid data mart is a
mixture of dependent and independent data marts.
It’s suitable for businesses that have multiple
databases and need a quick turnaround.
• A hybrid data mart needs slight data cleaning,
supports huge storage structures, and is flexible as
it combines the benefits of both dependent and
independent data marts.
Analysis Tools
Operational reporting
• Structured and fixed format reports
• Based on simple and direct queries
• Usually involves simple descriptive analysis and transformation of data, such as calculating, sorting, filtering, grouping, and formatting

Ad hoc query and reporting

OLAP (Online Analytical Processing)
• A multi-dimensional analysis and reporting application for aggregated data
• Great for discovering details from large quantities of data

Business analytics
• Business analytics (BA) is the practice of iterative, methodical exploration of an organization’s data with emphasis on statistical analysis.

Data mining
• Data mining techniques are a blend of statistics, mathematics, artificial intelligence and machine learning.
Levels of Analytical Processing

• What is the reason for a decrease of total sales this year? (reasoning)
• How do advertising activities affect sales of different products bought by different types of customers, in different regions? (synthesizing)
• Should we invest more in our e-business? (fuzzy question → needs high-level analysis for decision making)
OLTP (Online Transaction Processing)

• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries).
• OLTP systems are tuned for known transactions and workloads, while the workload in a data warehouse is not known a priori
– e.g., average amount spent on phone calls between 9AM-5PM in Accra during the month of December
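
To make the example question concrete, here is a minimal sketch of it as an ad hoc SQL aggregate, run from Python against an in-memory SQLite database. The `calls` table, its columns and the sample rows are assumptions made up for this illustration, not part of the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Accra", "2023-12-04 09:30:00", 2.00),
        ("Accra", "2023-12-04 16:45:00", 4.00),
        ("Accra", "2023-12-04 20:10:00", 9.00),   # outside 9AM-5PM
        ("Kumasi", "2023-12-05 10:00:00", 7.00),  # different city
        ("Accra", "2023-11-30 11:00:00", 5.00),   # different month
    ],
)

# Average amount spent on calls between 9AM and 5PM in Accra during December.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Accra'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H:%M', started_at) >= '09:00'
      AND strftime('%H:%M', started_at) < '17:00'
    """
).fetchone()[0]
```

The query has to scan and filter many rows on several attributes at once (place, month, time of day), which is quite different from the short, predictable lookups and updates an OLTP system is tuned for.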
OLAP (Online Analytical Processing)
• Online Analytical Processing is a category of software tools which provide analysis of data for business decisions.
• OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.
• OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.
• The main benefit of OLAP is the consistency of information and calculations.
• It is easy to apply security restrictions on users and objects to comply with regulations and protect sensitive data.
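
The three basic OLAP operations can be sketched in a few lines of plain Python. This is only an illustration on a tiny, invented fact table (the dimension names and figures are assumptions); real OLAP tools work on pre-aggregated cubes at a much larger scale.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
facts = [
    (2023, "Q1", "Accra", "Phone", 100),
    (2023, "Q1", "Kumasi", "Phone", 60),
    (2023, "Q2", "Accra", "Laptop", 200),
    (2023, "Q2", "Accra", "Phone", 40),
]
DIMS = ("year", "quarter", "region", "product")


def aggregate(rows, by):
    """Sum sales grouped by the named dimensions."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


by_year = aggregate(facts, ["year"])                 # roll-up: quarters consolidated into years
by_quarter = aggregate(facts, ["year", "quarter"])   # drill-down: back to the finer level
accra_only = [r for r in facts if r[2] == "Accra"]   # slice: fix one value of one dimension
accra_phones = [r for r in accra_only if r[3] == "Phone"]  # dice: restrict several dimensions
```

Roll-up and drill-down are the same aggregation at coarser or finer levels, while slicing and dicing only narrow which rows feed the aggregation.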
OLAP
• Online analytical processing refers to such end user activities as
DSS modelling using spreadsheets and graphics that are done
online.
• OLAP involves many different data items in complex
relationships.
• Objective of OLAP is to analyze complex relationships and look for
patterns, trends and exceptions.
OLAP
Multi-dimensional queries:
• A dimension is a particular way (or an attribute) of describing and categorizing data
• Such queries are usually arithmetic aggregation operations (sum, average, etc.) on records grouped by multiple dimensions (attributes) at different aggregation levels.
OLAP is a function/operation that is optimized to answer queries that are multi-dimensional in nature

Example
• "What is the total sales amount grouped by product line (dimension 1), location (dimension 2), time (dimension 3) and … (other dimensions)?"
• "Which segment of business provides the most revenue growth?"

[Slide annotations, partly lost in extraction: "Descriptive and", "operational report", "analysis", "More open and"]
OLAP vs. Transactional Report

• Transactional report: shows the transactional data line by line.
• OLAP result view: a pivot table or crosstab is usually used (aggregated data).
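
A small sketch of the difference, assuming an invented list of line-by-line sales records: the same rows are first kept as a transactional listing and then aggregated into a crosstab of products by regions.

```python
from collections import defaultdict

# Hypothetical transactional report: one line per sale (product, region, amount).
transactions = [
    ("Phone", "Accra", 100),
    ("Phone", "Kumasi", 60),
    ("Laptop", "Accra", 200),
    ("Phone", "Accra", 40),
]

# Build the crosstab: rows are products, columns are regions, cells are totals.
crosstab = defaultdict(lambda: defaultdict(int))
for product, region, amount in transactions:
    crosstab[product][region] += amount

# Render it as fixed-width text, one line per product.
regions = sorted({region for _, region, _ in transactions})
lines = ["product   " + "".join(f"{r:>8}" for r in regions)]
for product in sorted(crosstab):
    cells = "".join(f"{crosstab[product][r]:>8}" for r in regions)
    lines.append(f"{product:<10}" + cells)
report = "\n".join(lines)
```

Printing `report` gives one row per product and one column per region, which is the aggregated view a pivot table provides; the individual sales lines are no longer visible.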
Strengths of OLAP
• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and
outliers
ETL
• ETL is a process that extracts the data from different
source systems, then transforms the data (like applying
calculations, concatenations, etc.) and finally loads the
data into the Data Warehouse system.
• ETL stands for Extract, Transform and Load.
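
The three steps can be sketched end to end in a few lines. This is a minimal illustration only: the CSV text stands in for a source system, the `sales` table stands in for the warehouse, and all names and values are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or a database.
SOURCE_CSV = "customer,quantity,unit_price\nAma Mensah,2,10.5\nKofi Boateng,1,99\n"


def extract(text):
    """Extract: read raw rows (all values are strings) from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and calculate revenue per row."""
    return [
        (r["customer"], int(r["quantity"]) * float(r["unit_price"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT customer, revenue FROM sales ORDER BY customer"
).fetchall()
```

Keeping the three steps as separate functions mirrors the process description: each step can be checked on its own before the next one runs.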
Why do you need ETL?
• Transactional databases cannot answer complex business questions; these can be answered once ETL has moved the data into a data warehouse.
• ETL provides a method of moving the data from various sources into a data warehouse.
• As data sources change, scheduled ETL runs bring the data warehouse up to date.
• It allows verification of data transformation, aggregation and calculation rules.
• The ETL process allows sample data comparison between the source and the target system.
• The ETL process can perform complex transformations, and requires an extra (staging) area to store the data.
• ETL helps to migrate data into a data warehouse.
• It converts the various formats and types so that they adhere to one consistent system.
Extraction
• In this step of ETL architecture, data is extracted
from the source system into the staging area.
• Transformations, if any, are done in the staging area so that the performance of the source system is not degraded.
• Also, if corrupted data is copied directly from the
source into Data warehouse database, rollback
will be a challenge.
• Staging area gives an opportunity to validate
extracted data before it moves into the Data
warehouse.
Some validations are done during Extraction
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Data type check
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place or not
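
Several of these checks can be sketched as one pass over the staged rows. The row layout, the `validate` helper and the sample data below are hypothetical; the point is only to show record reconciliation, a data type check, duplicate removal and a key check working together.

```python
# Hypothetical staged rows and the record count reported by the source system.
source_count = 4
staged = [
    {"id": 1, "amount": "10.50"},
    {"id": 2, "amount": "abc"},       # fails the data type check
    {"id": 1, "amount": "10.50"},     # duplicate of the first row
    {"id": None, "amount": "3.00"},   # key is missing
]


def validate(rows, expected_count):
    """Return (clean_rows, problems) for a staged extract."""
    problems = []
    # Reconcile the number of staged records with the source.
    if len(rows) != expected_count:
        problems.append(f"count mismatch: staged {len(rows)}, source {expected_count}")
    clean, seen = [], set()
    for row in rows:
        if row["id"] is None:
            problems.append("missing key")
            continue
        if row["id"] in seen:
            problems.append(f"duplicate key {row['id']}")
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            problems.append(f"bad amount for key {row['id']}")
            continue
        seen.add(row["id"])
        clean.append({"id": row["id"], "amount": amount})
    return clean, problems


clean_rows, problems = validate(staged, source_count)
```

Only `clean_rows` would move on from the staging area; `problems` is what gets reported back for correction at the source.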
Transformation
• Data transformation is done in the staging area.
• Data extracted from the source server is raw and not usable in its original form.
• Therefore, it needs to be cleansed, mapped and transformed.
• It is one of the important ETL concepts, where you apply a set of functions to the extracted data.
• Data that does not require any transformation is called direct move or pass-through data.
Data Integration Issues
In the transformation step, you can perform customized operations on data. For instance, the user may want sum-of-sales revenue, which is not stored in the database, or the first name and the last name may be held in different columns of a table, in which case they can be concatenated before loading.
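
Both customized operations mentioned here can be sketched together. The staged rows and field names below are assumptions for illustration: the names are concatenated, and a sum-of-sales figure that does not exist in the source is derived per customer.

```python
from collections import defaultdict

# Hypothetical staged sales rows; names sit in two separate columns.
rows = [
    {"first_name": "Ama", "last_name": "Mensah", "revenue": 120.0},
    {"first_name": "Ama", "last_name": "Mensah", "revenue": 30.0},
    {"first_name": "Kofi", "last_name": "Boateng", "revenue": 99.0},
]

sum_of_sales = defaultdict(float)
for row in rows:
    full_name = f"{row['first_name']} {row['last_name']}"  # concatenate before loading
    sum_of_sales[full_name] += row["revenue"]              # derived measure

# Rows in the shape that would be loaded into the warehouse.
to_load = [
    {"customer": name, "sum_of_sales": total}
    for name, total in sorted(sum_of_sales.items())
]
```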
Data Integrity Problems
• Different spellings of the same person's name, like Jon, John, etc.
• Multiple ways to denote a company name, like Google, Google Inc.
• Use of different names, such as Accra, Acra.
• Different account numbers may be generated by various applications for the same customer.
• In some data, required fields remain blank.
• An invalid product may be recorded at the POS, as manual entry can lead to mistakes.
Operations used to deal with them include:
• Transposing rows and columns
• Using lookups to merge data
• Using complex data validation (e.g., if the first two columns in a row are empty, the row is automatically rejected from processing)
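
A lookup-based clean-up of such variants, together with the row-rejection rule, might look roughly like the sketch below. The lookup tables, the three-column row layout and the account codes are hypothetical.

```python
# Hypothetical lookup tables mapping known variants to one standard value.
CITY_LOOKUP = {"acra": "Accra", "accra": "Accra"}
COMPANY_LOOKUP = {"google": "Google", "google inc.": "Google"}


def standardize(value, lookup):
    """Return the standard spelling, or the trimmed value if it is unknown."""
    return lookup.get(value.strip().lower(), value.strip())


def process(rows):
    """Split rows of (company, city, account) into accepted and rejected."""
    accepted, rejected = [], []
    for row in rows:
        # Reject the row if its first two columns are both empty.
        if not row[0].strip() and not row[1].strip():
            rejected.append(row)
            continue
        accepted.append(
            (standardize(row[0], COMPANY_LOOKUP), standardize(row[1], CITY_LOOKUP), row[2])
        )
    return accepted, rejected


accepted, rejected = process([
    ("Google Inc.", "Acra", "A-100"),
    ("google", "Accra", "A-101"),
    ("", " ", "A-102"),
])
```

A lookup table only fixes the variants someone has already listed, so unknown spellings pass through unchanged and still need review.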
Validations are done during this stage

• Filtering – Select only certain columns to load
• Using rules and lookup tables for data standardization
• Character set conversion and encoding handling
• Conversion of units of measurement, such as date/time conversion, currency conversions, numerical conversions, etc.
• Data threshold validation check. For example, age cannot be more than two digits.
• Data flow validation from the staging area to the intermediate tables.
Validations are done during this stage

• Required fields should not be left blank.
• Cleaning (for example, mapping NULL to 0, or Gender Male to "M" and Female to "F", etc.)
• Splitting a column into multiple columns and merging multiple columns into a single column.
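
A few of these rules can be sketched as a single cleaning function applied to each staged row. The field names and the `clean` helper are assumptions; the rules shown are the required-field check, the two-digit age threshold, the gender mapping, NULL to 0, and splitting one column into two.

```python
GENDER_CODES = {"male": "M", "female": "F"}


def clean(row):
    """Apply the cleaning rules to one staged row; raise ValueError if it is invalid."""
    if not row.get("full_name"):
        raise ValueError("required field full_name is blank")
    age = int(row["age"])
    if not 0 <= age <= 99:  # threshold check: age has at most two digits
        raise ValueError(f"age out of range: {age}")
    first, _, last = row["full_name"].partition(" ")  # split one column into two
    return {
        "first_name": first,
        "last_name": last,
        "gender": GENDER_CODES[row["gender"].strip().lower()],  # Male -> "M", Female -> "F"
        "age": age,
        "discount": row["discount"] if row["discount"] is not None else 0,  # NULL -> 0
    }


cleaned = clean(
    {"full_name": "Ama Mensah", "gender": "Female", "age": "34", "discount": None}
)
```

Rows that raise `ValueError` would be held back in the staging area rather than loaded.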
Loading
• Loading data into the target data warehouse is the
last step of the ETL process.
• In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (often overnight). Hence, the load process should be optimized for performance.
• In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity.
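
One possible way to restart from the point of failure is to load in batches and record a checkpoint in the same transaction as each batch. The sketch below uses SQLite from Python; the `facts` and `checkpoint` tables, the `load_batches` function and its `fail_on` switch (used only to simulate a failure) are all hypothetical.

```python
import sqlite3


def load_batches(conn, batches, fail_on=None):
    """Load batches in order, committing each batch together with its checkpoint."""
    conn.execute("CREATE TABLE IF NOT EXISTS facts (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint (last_batch INTEGER)")
    last = conn.execute("SELECT MAX(last_batch) FROM checkpoint").fetchone()[0]
    start = 0 if last is None else last + 1  # resume after the last committed batch
    for number in range(start, len(batches)):
        try:
            conn.executemany("INSERT INTO facts VALUES (?, ?)", batches[number])
            if number == fail_on:
                raise RuntimeError(f"simulated failure in batch {number}")
            conn.execute("INSERT INTO checkpoint VALUES (?)", (number,))
            conn.commit()    # rows and checkpoint become permanent together
        except Exception:
            conn.rollback()  # discard the partially loaded batch
            raise


batches = [[(1, 10.0), (2, 20.0)], [(3, 30.0)], [(4, 40.0)]]
conn = sqlite3.connect(":memory:")
try:
    load_batches(conn, batches, fail_on=1)  # first run fails part-way through batch 1
except RuntimeError:
    pass
load_batches(conn, batches)                 # rerun resumes at batch 1
loaded_ids = [r[0] for r in conn.execute("SELECT id FROM facts ORDER BY id")]
```

Because a failed batch is rolled back before the error is re-raised, the rerun can insert the same rows again without creating duplicates.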
Load verification

• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check the combined values and calculated measures.
• Run data checks on the dimension tables as well as the history tables.
• Check the BI reports on the loaded fact and dimension tables.
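
Some of these checks can be expressed as queries that count offending rows, so that any non-zero count flags a problem. The tables (`dim_product`, `fact_sales`), columns and sample rows below are invented for this sketch; one fact row deliberately has a null key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER, name TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, quantity INTEGER,
                             unit_price REAL, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Phone'), (2, 'Laptop');
    INSERT INTO fact_sales VALUES (1, 2, 50.0, 100.0), (2, 1, 900.0, 900.0),
                                  (NULL, 1, 10.0, 10.0);
""")


def count(sql):
    """Run a COUNT(*) query and return the number of offending rows."""
    return conn.execute(sql).fetchone()[0]


checks = {
    # Key field data is neither missing nor null.
    "null_keys": count("SELECT COUNT(*) FROM fact_sales WHERE product_key IS NULL"),
    # Every keyed fact row points at an existing dimension row.
    "orphan_facts": count("""SELECT COUNT(*) FROM fact_sales f
                             LEFT JOIN dim_product d ON d.product_key = f.product_key
                             WHERE f.product_key IS NOT NULL AND d.product_key IS NULL"""),
    # The calculated measure agrees with its inputs.
    "bad_revenue": count("""SELECT COUNT(*) FROM fact_sales
                            WHERE ABS(revenue - quantity * unit_price) > 0.005"""),
}
failed = [name for name, bad_rows in checks.items() if bad_rows]
```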
Some ETL Tools

• MarkLogic:
https://www.marklogic.com/product/getting-started/
• Oracle: https://www.oracle.com/index.html
• Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
• Talend Data Integration Studio:
https://www.talend.com/products/talend-open-studio/
• MapForce
ETL Best Practices
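• Never try to cleanse all the data: cleansing everything takes too long and costs too much.
• Never cleanse nothing: always plan to cleanse something, because cleaner and more reliable data is a main reason for building the data warehouse.
• Determine the cost of cleansing the data before you start.
• To speed up query processing, create auxiliary views and indexes.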
Assignment

What is the difference between Hadoop and a data warehouse?
Next Week

Multi-Dimensional Data Modeling
