Data Warehouse
• The Difference…
– A DWH constitutes the entire information base for all time.
– A database constitutes real-time information.
– A DWH supports data mining and business intelligence.
– A database is used to run the business.
– A DWH is about how to run the business.
Data Warehousing -- a process
Need for Data Warehousing
• Integrated, company-wide view of high-quality information (from disparate databases)
• Separation of operational and informational systems and data (for improved performance)
Issues with Company-Wide View
Inconsistent key structures
Synonyms
Free-form vs. structured fields
Inconsistent data values
Missing data
Figure: Examples of heterogeneous data
Separating Operational and Informational Systems
Operational system – a system that is used to run a business in real time, based on current data; also called a system of record
Informational system – a system designed to support decision making, based on historical point-in-time data
Data Warehouse Architectures
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Logical Data Mart and Real-Time Data Warehouse
• Three-Layer Architecture
Figure 9-2 Independent data mart data warehousing architecture
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
Figure 9-3 Dependent data mart with operational data store: a three-level architecture
• ODS provides option for obtaining current data
• Single ETL for the enterprise data warehouse (EDW)
• Simpler data access; dependent data marts loaded from the EDW
Figure 9-4 Logical data mart and real-time data warehouse architecture
• ODS and data warehouse are one and the same
• Near real-time ETL for the data warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse
• Easier to create new data marts
Figure 9-5 Three-layer data architecture for a data warehouse
Event = a database action (create/update/delete) that results from a transaction
Status = the state of the data before and after an event
DATA CHARACTERISTICS: STATUS VS. EVENT DATA
Figure 9-7 Transient operational data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content.
DATA CHARACTERISTICS: STATUS VS. EVENT DATA
Figure 9-8 Periodic warehouse data
With periodic data, records are never physically altered or deleted once they have been added to the store, so history is preserved.
Other Data Warehouse Changes
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes
• Descriptive attributes become more refined
• Descriptive data are related to one another
• New sources of data
OLTP - ONLINE TRANSACTION PROCESSING
• Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
• OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
– e.g., the average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December (see the sketch below)
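As a minimal sketch, such an ad hoc multidimensional query could be expressed with pandas; the call-detail table, column names, and values below are invented for illustration:

import pandas as pd

# Hypothetical call-detail records.
calls = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Pune"],
    "start_hour": [10, 14, 11, 20],
    "month": [12, 12, 12, 12],
    "amount": [42.0, 18.5, 30.0, 12.0],
})

# Average amount spent on calls between 9 AM and 5 PM in Pune in December.
mask = (
    (calls["city"] == "Pune")
    & calls["start_hour"].between(9, 17)
    & (calls["month"] == 12)
)
print(calls.loc[mask, "amount"].mean())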
OLTP vs Data Warehouse

OLTP                                             | Warehouse (DSS)
Application oriented                             | Subject oriented
Used to run business                             | Used to analyze business
Detailed data                                    | Summarized and refined
Current, up to date                              | Snapshot data
Isolated data                                    | Integrated data
Clerical user                                    | Knowledge user (manager)
Few records accessed at a time (tens)            | Large volumes accessed at a time (millions)
Read/update access                               | Mostly read (batch update)
No data redundancy                               | Redundancy present
Database size 100 MB - 100 GB                    | Database size 100 GB - few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users                               | Hundreds of users
Managed in entirety                              | Managed by subsets
To summarize ...
• OLTP systems are used to "run" a business
[Figure: basic data warehouse architecture - data from multiple sources is integrated into the warehouse, described by metadata; clients use it for query & analysis]
• Data are selected from various sources, then integrated and stored in a single, consistent format.
• Data warehouses contain current detailed data, historical detailed data, lightly and highly summarized data, and metadata.
• Current and historical data are voluminous because they are stored at the highest level of detail.
• Lightly and highly summarized data save processing time when users request them and are readily accessible.
• Metadata are "data about data", important for designing, constructing, retrieving, and controlling the warehouse data.
Technical metadata include where the data come from, how the data were changed, how the data are organized, how the data are stored, who owns the data, who is responsible for the data and how to contact them, who can access the data, and the date of last update.
Business metadata include what data are available, where the data are, what the data mean, how to access the data, predefined reports and queries, and how current the data are.
Business advantages
• It provides business users with a "customer-centric" view of the company's heterogeneous data by helping to integrate data from sales, service, manufacturing and distribution, and other customer-related business systems.
• It provides added value to the company's customers by allowing them to access better information when data warehousing is coupled with internet technology.
• It consolidates data about individual customers and provides a repository of all customer contacts for segmentation modeling, customer retention planning, and cross-sales analysis.
• It removes barriers among functional areas by offering a way to reconcile views from multiple areas, thus providing a look at activities that cross functional lines.
• It reports on trends across multidivisional, multinational operating units, including trends or relationships in areas such as merchandising and production planning.
Strategic uses of data warehousing

Industry | Functional areas of use                    | Strategic use
Airline  | Operations; marketing                      | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking  | Product development; operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
[Figure: levels of information structure - individually structured, departmentally structured (data mart), and organizationally structured (data warehouse) data]
Characteristics of the Departmental Data Mart
• Small
• Flexible
• Customized by department
• OLAP
• Source is a departmentally structured data warehouse
Data Mining
• Data Mining is the process of extracting information from the
company's various databases and re-organizing it for purposes
other than what the databases were originally intended for.
• It provides a means of extracting previously unknown, predictive
information from the base of accessible data in data warehouses.
• Data mining process is different for different organizations
depending upon the nature of the data and organization.
• Data mining tools use sophisticated, automated algorithms to
discover hidden patterns, correlations, and relationships among
organizational data.
• Data mining tools are used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
• For example, in targeted marketing, data mining can use data on past promotional mailings to identify the targets most likely to maximize the return on the company's investment in future mailings.
Functions
• Classification: infers the defining characteristics of a certain group (see the sketch after this list)
• Clustering: identifies groups of items that share a particular characteristic
• Association: identifies relationships between events that occur at one time
• Sequencing: similar to association, except that the relationship exists over a period of time
• Forecasting: estimates future values based on patterns within large sets of data
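To make the first two functions concrete, here is a minimal scikit-learn sketch; the customer data, feature meanings, and labels are invented for illustration:

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy customer data: [age, annual_spend]; 1 = frequent buyer.
X = [[25, 300], [40, 1200], [35, 900], [22, 150], [55, 2000], [48, 1700]]
y = [0, 1, 1, 0, 1, 1]

# Classification: learn the defining characteristics of each group.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[30, 800]]))

# Clustering: find groups sharing a characteristic, with no labels given.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)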
Characteristics
• Data mining tools are needed to extract the buried information “ore”.
• The “miner” is often an end user, empowered by “data drills” and other
power query tools to ask ad hoc questions and get answers quickly, with
little or no programming skill.
• The data mining environment usually has a client/server architecture.
• Because of the large amounts of data, it is sometimes necessary to use
parallel processing for data mining.
• Data mining tools are easily combined with spreadsheets and other end
user software development tools, enabling the mined data to be analyzed
and processed quickly and easily.
• Data mining yields five types of information: associations, sequences, classifications, clusters, and forecasts.
• "Striking it rich" often involves finding unexpected, valuable results.
Common data mining applications

Application         | Description
Market segmentation | Identifies the common characteristics of customers who buy the same products from the company
Oil and gas         | Analyzes seismic data for signs of underground deposits; prioritizes drilling locations; simulates underground flows to improve recovery
[Figure: from data to knowledge - data sources (databases, purchased data) feed data organization and storage; the resulting information is used directly by end users and for decision-making tasks (CRM, DSS, EIS); organizational knowledge accumulates in a knowledge base]
• Businesses run on information and the knowledge of
how to put that information to use.
Both the data and the information, at various times during the process, and the
knowledge derived at the end of the process, may need to be presented to
users.
Data Warehouse for Decision Support
• Puts information technology to work to help the knowledge worker make faster and better decisions
• Used to manage and control the business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad hoc
• Used by managers and end users to understand the business and make judgments
Business intelligence and data warehousing
Business Intelligence
• One ultimate use of the data gathered and processed in the
data life cycle is for business intelligence.
• Business intelligence generally involves the creation or use of a data warehouse and/or data mart for storage of data, and the use of front-end analytical tools such as Oracle's Sales Analyzer and Financial Analyzer or MicroStrategy's Web.
• Such tools can be employed by end users to access data, ask
queries, request ad hoc (special) reports, examine scenarios,
create CRM activities, devise pricing strategies, and much
more.
How business intelligence works
• The process starts with raw data, which are usually kept in corporate databases. For example, a national retail chain that sells everything from grills and patio furniture to plastic utensils had data about inventory, customer information, past promotions, and sales numbers in various databases.
• Though all this information may be scattered across multiple systems, and may seem unrelated, business intelligence software can bring it together. This is done by using a data warehouse.
• In the data warehouse (or mart), tables can be linked and data cubes formed. For instance, inventory information is linked to sales numbers and customer databases, allowing for deep analysis of information.
• Using the business intelligence software, the user can ask queries, request ad hoc reports, or conduct any other analysis.
• For example, deep analysis can be carried out by performing multilayer queries. Because all the databases are linked, one can search for which products a store has too much of and determine which of those products commonly sell with popular items, based on previous sales. After planning a promotion to move the excess stock along with the popular products (by bundling them together, for example), one can dig deeper to see where this promotion would be most popular (and most profitable). The results of the request can be reports, predictions, alerts, and/or graphical presentations, which can be disseminated to decision makers to help them in their decision-making tasks. A minimal sketch of such a linked-table query follows.
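A minimal sketch of a linked-table, multilayer query in pandas; the tables, columns, and thresholds are invented stand-ins for the retail example above:

import pandas as pd

# Hypothetical warehouse tables.
inventory = pd.DataFrame({"product": ["grill", "fork"], "on_hand": [5000, 200]})
sales = pd.DataFrame({
    "product": ["grill", "grill", "fork"],
    "store": ["north", "south", "north"],
    "units": [20, 35, 400],
})

# Link the tables and form a cube-like pivot (product x store).
linked = sales.merge(inventory, on="product")
cube = linked.pivot_table(index="product", columns="store",
                          values="units", aggfunc="sum", fill_value=0)

# Multilayer query: products with high stock but weak total sales.
totals = sales.groupby("product")["units"].sum()
overstocked = inventory[(inventory["on_hand"] > 1000)
                        & (inventory["product"].map(totals) < 100)]
print(cube)
print(overstocked)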
More advanced applications of business
intelligence include outputs such as
• financial modeling
• budgeting
• resource allocation
• and competitive intelligence.
ELT and ETL
ELT
Extract, load, transform (ELT) is an alternative to
extract, transform, load (ETL) used with data lake implementations. In
contrast to ETL, in ELT models the data is not transformed on entry to
the data lake, but stored in its original raw format. This enables faster
loading times. However, ELT requires sufficient processing power
within the data processing engine to carry out the transformation on
demand, to return the results in a timely manner. Since the data is not
processed on entry to the data lake, the query and schema do not
need to be defined a priori (although often the schema will be
available during load since many data sources are extracts from
databases or similar structured data systems and hence have an
associated schema). ELT is a data pipeline model.
The five critical differences of ETL vs ELT:
1. ETL is the Extract, Transform, and Load process for data. ELT is the Extract, Load, and Transform process for data.
2. In ETL, data moves from the data source to staging and then into the data warehouse. In ELT, data moves directly from the source into the target system.
3. ELT leverages the data warehouse to do basic transformations. There is no need for data staging.
4. ETL can help with data privacy and compliance by cleaning sensitive and secure data even before loading it into the data warehouse.
5. ETL can perform sophisticated data transformations and can be more cost-effective than ELT.
ETL: Transformations happen within a staging area, outside the data warehouse.
ELT: Transformations happen inside the data system itself; no staging area is required.
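The ordering difference can be shown in a few lines of Python; extract() and transform() below are hypothetical stand-ins for real connectors, not any specific library's API:

# Hypothetical helpers standing in for real connectors.
def extract(source):
    return [{"id": 1, "amount": "42.50"}]        # raw rows (strings)

def transform(rows):
    return [{**r, "amount": float(r["amount"])} for r in rows]

def etl(source, warehouse):
    staged = transform(extract(source))          # transform in staging first
    warehouse.extend(staged)                     # load only cleaned data

def elt(source, lake):
    lake.extend(extract(source))                 # load raw data immediately
    return transform(lake)                       # transform on demand, at query time

warehouse, lake = [], []
etl("orders_db", warehouse)
print(warehouse, elt("orders_db", lake))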
Star Schema:
A star schema is a type of multidimensional model used for data warehouses. A star schema contains fact tables and dimension tables and uses fewer foreign-key joins. The schema forms a star shape, with a fact table at the center and dimension tables around it (a code sketch follows the list below).
Characteristics of Star Schema:
1. Every dimension in a star schema is represented by exactly one dimension table.
2. Each dimension table contains a set of attributes.
3. Each dimension table is joined to the fact table using a foreign key.
4. The dimension tables are not joined to each other.
5. The fact table contains keys and measures.
6. The star schema is easy to understand and provides optimal disk usage.
7. The dimension tables are not normalized. For instance, a Country_ID attribute would not have a separate Country lookup table, as an OLTP design would.
8. The schema is widely supported by BI tools.
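As a sketch, a star schema can be created with Python's sqlite3 module; the table and column names are illustrative, not a prescribed design:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
                          category TEXT, country TEXT);  -- denormalized, no Country lookup
CREATE TABLE fact_sales (                                -- keys and measures only
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER,
    revenue    REAL
);
""")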
Snowflake Schema
A snowflake schema is also a type of multidimensional model used for data warehouses. A snowflake schema contains fact tables, dimension tables, and sub-dimension tables, and the schema forms a snowflake shape with them.
Characteristics of Snowflake Schema (see the sketch after this list):
• The main benefit of the snowflake schema is that it uses smaller disk space.
• It is easier to implement changes when a dimension is added to the schema.
• Query performance is reduced because queries involve multiple tables.
• The primary challenge of the snowflake schema is that it requires more maintenance effort because of the larger number of lookup tables.
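Continuing the hypothetical sqlite3 sketch from the star schema, snowflaking normalizes an attribute out of the product dimension into its own lookup (sub-dimension) table:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
                          category TEXT,
                          country_id INTEGER REFERENCES dim_country(country_id));
""")
# Queries now need an extra join (product -> country): smaller disk
# footprint, but reduced query performance, as noted above.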
Difference

Star Schema                                   | Snowflake Schema
Contains fact tables and dimension tables.    | Contains fact tables, dimension tables, and sub-dimension tables.
Takes less time for the execution of queries. | Takes more time than the star schema for the execution of queries.
Normalization is not used.                    | Both normalization and denormalization are used.
Query complexity is low.                      | Query complexity is higher than in the star schema.
https://www.youtube.com/watch?v=JOArz7wggkQ
Once data has been collected, the analyst selects and trains statistical models,
using historical data. Although it may be tempting to think that big data
makes predictive models more accurate, statistical theorems show that, after
a certain point, feeding more data into a predictive analytics model
does not improve accuracy. The old saying "All models are wrong, but some
are useful" is often mentioned in terms of relying solely on predictive models
to determine future action.
In many use cases, including weather predictions, multiple models are run
simultaneously and results are aggregated to create one final prediction. This
approach is known as ensemble modeling. As additional data becomes
available, the statistical analysis will either be validated or revised.
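A minimal sketch of ensemble modeling: run several models and aggregate their outputs into one final prediction. The three "models" here are hypothetical stand-ins that return point forecasts:

import statistics

def model_a(x): return 2.0 * x
def model_b(x): return 2.1 * x - 0.5
def model_c(x): return 1.9 * x + 0.7

def ensemble_predict(x):
    # Aggregate (here, simply average) the individual model outputs.
    return statistics.mean(m(x) for m in (model_a, model_b, model_c))

print(ensemble_predict(10))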
Applications of predictive modeling
Predictive modeling is often associated with meteorology and weather
forecasting, but it has many applications in business.
•One of the most common uses of predictive modeling is in online advertising
and marketing. Modelers use web surfers' historical data, running it through
algorithms to determine what kinds of products users might be interested in
and what they are likely to click on.
•Bayesian spam filters use predictive modeling to identify the probability that
a given message is spam. In fraud detection, predictive modeling is used to
identify outliers in a data set that point toward fraudulent activity. And in
customer relationship management (CRM), predictive modeling is used to
target messaging to customers who are most likely to make a purchase. Other
applications include capacity planning, change management, disaster
recovery (DR), engineering, physical and digital security management and city
planning.
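As a sketch of the spam-filter idea, a naive Bayes classifier can estimate the probability that a message is spam; the corpus and labels below are invented and far too small for real use:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

msgs = ["win cash now", "cheap pills win", "meeting at noon", "lunch tomorrow?"]
labels = [1, 1, 0, 0]  # 1 = spam

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(msgs), labels)

# Estimated probability that a new message is spam.
print(clf.predict_proba(vec.transform(["win a cheap prize now"]))[:, 1])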
Modeling methods
Analyzing representative portions of the available information -- sampling --
can help speed development time on models and enable them to be deployed
more quickly.
Once data scientists gather this sample data, they must select the right
model. Linear regressions are among the simplest types of predictive models.
Linear models essentially take two variables that are correlated -- one
independent and the other dependent -- and plot one on the x-axis and one
on the y-axis. The model applies a best fit line to the resulting data points.
Data scientists can use this to predict future occurrences of the dependent
variable.
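A minimal linear-regression sketch matching this description; the x and y values are made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Correlated pair: independent variable (x-axis), dependent variable (y-axis).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(x, y)   # fit the best-fit line
print(model.predict([[6.0]]))          # predict a future value of the dependent variable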
Some of the most popular methods include:
•Decision trees. Decision tree algorithms take data (mined, open source, internal) and graph it out in branches to display the possible outcomes of various decisions. Decision trees classify response variables and predict response variables based on past decisions, can be used with incomplete data sets, and are easily explainable and accessible for novice data scientists.
•Time series analysis. This is a technique for the prediction of events through a sequence of time. You can predict future events by analyzing past trends and extrapolating from there.
•Logistic regression. This is a statistical analysis method that classifies records by estimating the probability of a categorical outcome (see the sketch after this list). As more data is brought in, the algorithm's ability to sort and classify it improves, and therefore predictions can be made.
•The most complex area of predictive modeling is the neural network. This
type of machine learning model independently reviews large volumes of
labeled data in search of correlations between variables in the data. It can
detect even subtle correlations that only emerge after reviewing millions of
data points. The algorithm can then make inferences about unlabeled data
files that are similar in type to the data set it trained on. Neural networks
form the basis of many of today's examples of artificial intelligence (AI),
including image recognition, smart assistants and natural language generation
(NLG).
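A minimal sketch of the logistic-regression method named above; the feature meaning and data are invented for illustration:

from sklearn.linear_model import LogisticRegression

# Feature: hours of product usage; label: 1 = renewed subscription.
X = [[0.5], [1.0], [1.5], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.0]]))  # class probabilities for a new record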
Common algorithms for predictive modeling
Random Forest. An algorithm that combines unrelated decision trees and uses classification and regression to organize and label vast amounts of data (see the sketch after this list).
Gradient boosted model. An algorithm that uses several decision trees, similar to Random Forest, but they are more closely related: each tree corrects the flaws of the previous one and builds a more accurate picture.
K-Means. Groups data points in a similar fashion to a clustering model and is popular with personalized retail offers. It can create personalized offers when dealing with a large group by seeking out similarities.
Prophet. A forecasting procedure that is especially effective when dealing with capacity planning. This algorithm deals with time series data and is relatively flexible.
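A minimal sketch contrasting the first two algorithms, on synthetic data generated so the example is self-contained:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Random Forest: many decorrelated trees whose votes are aggregated.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gradient boosting: trees built sequentially, each correcting the previous one.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.score(X, y), gb.score(X, y))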
Predictive modeling tools
Before deploying a predictive modeling tool, it is crucial for your organization to ask questions and sort out the following: who will be running the software, what the use case will be for these tools, what other tools your predictive analytics will interact with, and the budget.
Different tools have different data literacy requirements, are effective in different use cases, are best used with similar software, and can be expensive. Once your organization has clarity on these issues, comparing tools becomes easier.
Sisense. A business intelligence software aimed at a variety of companies that offers a range of business analytics features and requires minimal IT background.
Oracle Crystal Ball. A spreadsheet-based application aimed at engineers, strategic planners, and scientists across industries that can be used for predictive modeling, forecasting, simulation, and optimization.
IBM SPSS Predictive Analytics Enterprise. A business intelligence platform that supports open source integration and features descriptive and predictive analysis as well as data preparation.
SAS Advanced Analytics. A program that offers algorithms that identify the likelihood of future outcomes and can be used for data mining, forecasting, and econometrics.
Predictive modeling considerations
One of the most frequently overlooked challenges of predictive modeling is acquiring the amount
of data needed and sorting out the right data to use when developing algorithms. By some
estimates, data scientists spend about 80% of their time on this step. Data collection is important
but limited in usefulness if this data is not properly managed and cleaned.
Once the data has been sorted, organizations must be careful to avoid overfitting. Over-testing on
training data can result in a model that appears very accurate but has memorized the key points
in the data set rather than learned how to generalize.
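A common way to detect this kind of memorization is to hold out test data and compare train and test scores; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", model.score(X_tr, y_tr))  # typically near 1.0
print("test accuracy:", model.score(X_te, y_te))   # noticeably lower if overfit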
While predictive modeling is often considered to be primarily a mathematical problem, users
must plan for the technical and organizational barriers that might prevent them from getting the
data they need. Often, systems that store useful data are not connected directly to centralized
data warehouses. Also, some lines of business may feel that the data they manage is their asset,
and they may not share it freely with data science teams.
Another potential stumbling block for predictive modeling initiatives is making sure projects
address real business challenges. Sometimes, data scientists discover correlations that seem
interesting at the time and build algorithms to investigate the correlation further. However, just
because they find something that is statistically significant doesn't mean it presents an insight the
business can use. Predictive modeling initiatives need to have a solid foundation of business
relevance.