

Unit 1: Data Warehouse and Data Mining (Ms. Prachi)

Syllabus:

UNIT-I

Data Warehouse: Need for data warehouse, Definition, Goals of data warehouse, Challenges faced
during warehouse construction, Advantages, Types of warehouse: Data Mart, Virtual Warehouse and
Enterprise Warehouse. Components of warehouse: Fact data, Dimension data, Fact table and
Dimension table, Designing fact tables. Pre-requisite phases: Extract, Transform and Load process.
Warehouse schema for multidimensional data: star, snowflake and galaxy schemas

------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------

DATA

Data is a collection of facts, such as numbers, words, measurements, observations, or just descriptions of things. Data is:

• Raw
• Unorganized facts
• Unprocessed
• Not clean

INFORMATION

Information is data that has been processed, organized, structured, or presented in a given context so as to make it useful.

WAREHOUSE
Definition:
A large building where raw materials or manufactured goods may be stored prior to their distribution for sale.


What is Data warehouse?


Definition

A data warehouse (DW) is a repository of information collected from multiple heterogeneous sources (flat files, database tables, records, reports, online transactions, etc.) and managed to provide meaningful and useful decisions.
Definition of a data warehouse according to William H. Inmon, a leading architect in the construction of data warehouse systems:
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision making process”

Characteristics / Features of data warehouse

1. Subject oriented
Data warehouse is organized around major subject areas like
▪ customer
▪ supplier
▪ product
▪ sales
▪ policy
▪ claims

It does not focus on day-to-day transactions; rather, it helps decision makers extract useful and valuable information on their subject.

2. Integrated
Integrates multiple heterogeneous sources such as databases, flat files, and online transaction records. To make the data consistent, data cleaning and data integration techniques are applied to naming conventions, attributes, etc.

3. Time variant
Historical data is stored in the data warehouse to provide information from a historical perspective, e.g., data from the past 5-10 years is collected over time for time-series analysis.


4. Non-volatile
Data in the warehouse is permanent and cannot be erased or deleted when new data is inserted. It does not
require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations
in data accessing: initial loading of data and access of data.
Advantages / Need of Data warehouse

• Improved Business intelligence: Data Warehouse helps to integrate many sources of data to
make business decisions based on the complete data
• Time to access: quick to access critical information as all data is in centralized location
• Analysis & Reporting: Data warehouse helps to reduce total turnaround time for analysis and
reporting
• Historical intelligence: Data warehouse stores a large amount of historical data. This helps
users to analyze different time periods and trends to make future predictions
1. Enables Historical Insight No business can survive without a large and accurate
storehouse of historical data, from sales and inventory data to personnel and intellectual
property records. If a business executive suddenly needs to know the sales of a key
product 24 months ago, the rich historical data provided by a data warehouse make this
possible. Also important, a data warehouse can add context to this historical data by
listing all the key performance trends that surround this retrospective research. This kind
of efficiency cannot be matched by a legacy database.
2. Enhances Conformity And Quality Of Data Your business generates data in myriad
different forms, including structured and unstructured data, data from social media, and
data from sales campaigns. A data warehouse converts this data into the consistent
formats required by your analytics platforms. Moreover, by ensuring this conformity, a data
warehouse ensures that the data produced by different business divisions is at the same
quality and standard – allowing a more efficient feed for analytics.
3. Boosts Efficiency It’s very time consuming for a business user or a data scientist to have
to gather data from multiple sources. It’s far more advantageous for this data to be
gathered in one place, hence the benefit of a data warehouse. Additionally, if for instance
your data scientist needs data to run a fast report, they don’t need to get the assistance
from tech support to perform this task. A data warehouse makes this data readily
available – in the correct format – improving efficiency of the entire process.
4. Increase The Power And Speed Of Data Analytics Business intelligence and data
analytics are the opposite of instinct and intuition. BI and analytics require high quality,
standardized data – on time and available for rapid data mining. A data warehouse
enables this power and speed, allowing competitive advantage in key business sectors,
ranging from CRM to HR to sales success to quarterly reporting.
5. Drives Revenue A tech pundit opined that “data is the new oil,” referring to the high dollar
value of data in today’s world. Creating more standardized and better quality data is the
key strength of a data warehouse, and this key strength translates clearly to significant
revenue gains. The data warehouse formula works like this: Better business intelligence
helps with better decisions, and in turn better decisions create a higher return on
investment across any sector of your business. Most important, these revenue gains
build on themselves over time, as better decisions strengthen the business. In short, a
high quality, fully scalable data warehouse can be seen as less of a cost and more of an
investment – one that adds exponential value like few other investments that businesses
make.
6. Scalability The top key word in the cloud era is “scalable” and a data warehouse is a
critical component in driving this scale. A topflight data warehouse is itself scalable, and
also enables greater scalability in the business overall. That is, today’s sophisticated data
warehouses are built to scale, handling ever more queries as the business grows (though
this will require more supporting hardware). Additionally, the efficiency in data flow
enabled by a data warehouse greatly boosts a business’s growth – this growth is the
core of business scalability.
7. Interoperates With On-Premise And Cloud Unlike the legacy databases of yesteryear,
today’s data warehouses are built with multicloud and hybrid cloud in mind. Many data
warehouses are now fully cloud-based, and even those that are built for on-premises deployment
typically will interoperate well with the cloud-based portion of a company’s infrastructure.
As an additional important side point: this cloud-based focus also means that mobile
users are better able to access the data warehouse – this is beneficial for sales reps in
particular.
8. Data Security A number of key advances in data warehousing have enhanced security,
which enhances the overall security of company data. Among these advances are
techniques like a “slave read only” set up, which blocks malicious SQL code, and
encrypted columns, which protect confidential data. Some businesses set up custom
user groups on their data warehouses, which can include or exclude various data pools,
and even give permission on a row by row basis.
9. Much Higher Query Performance And Insight The constant business intelligence queries
that are part of today’s business can put a major strain on an analytics infrastructure,
from the legacy databases to the data marts. Having a data warehouse to more
effectively handle queries removes some of the pressure on the system. Furthermore,
since a data warehouse is specifically geared to handle massive levels of data and
myriad complex queries, it’s the high functioning core of any business’s data analytics
practice.
10. Provides Major Competitive Advantage This is absolutely the bottom line benefit of a
data warehouse: it allows a business to more effectively strategize and execute against
other vendors in its sector. With the quality, speed and historical context provided by a
data warehouse, the greater insight in data mining can drive decisions that create more
sales, more targeted products, and faster response times. In short, a data warehouse
improves business decision making, which in turn gives any business a key competitive
advantage.

How organizations are using the data warehouse?

Many organizations use this information to support business decision-making activities, including:
(1) Increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending);
(2) Repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic region in order to fine-tune production strategies;
(3) Analyzing operations and looking for sources of profit;
(4) Managing customer relationships.

Goals of Data Warehousing

1. Must assist in decision making process


2. Meet requirement of business community
3. Provide easy access to information
4. Present information consistently and accurately
5. Must be adaptive and resilient to change
6. Provide secured access

Data Warehouse Applications


As discussed before, a data warehouse helps business executives to organize, analyze, and use their
data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-
loop" feedback system for the enterprise management. Data warehouses are widely used in the
following fields −

● Financial services
● Banking services
● Consumer goods
● Retail sectors
● Controlled manufacturing
● Airline: In airline systems, it is used for operational purposes such as crew assignment, analysis of route profitability, frequent-flyer program promotions, etc.
● Banking: It is widely used in the banking sector to manage the resources available on desk effectively. Some banks also use it for market research and for performance analysis of products and operations.
● Healthcare: The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patients' treatment reports, and share data with tie-in insurance companies, medical aid services, etc.
● Public sector: In the public sector, the data warehouse is used for intelligence gathering. It helps government agencies maintain and analyze tax records and health-policy records for every individual.
● Investment and Insurance sector: In this sector, warehouses are primarily used to analyze data patterns and customer trends, and to track market movements.
● Retail chain: In retail chains, the data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns, and promotions, and is used for determining pricing policy.
● Telecommunication: A data warehouse is used in this sector for product promotions, sales decisions, and distribution decisions.
● Hospitality industry: This industry utilizes warehouse services to design and estimate advertising and promotion campaigns that target clients based on their feedback and travel patterns.

Types of Data Warehouse Applications


Information processing, analytical processing, and data mining are the three types of data warehouse
applications that are discussed below −
● Information Processing − A data warehouse allows to process the data stored in it. The data can
be processed by means of querying, basic statistical analysis, reporting using crosstabs, tables,
charts, or graphs.
● Analytical Processing − A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-
dice, drill down, drill up, and pivoting.
● Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.

Challenges to construct the data warehouse


Here are some of the difficulties of Implementing Data Warehouses:
1. Implementing a data warehouse is generally a massive effort that must
be planned and executed according to established methods.
2. Construction, administration, and quality control are the significant
operational issues which arise with data warehousing.
3. Some of the important and challenging consideration while
implementing data warehouse are: the design, construction and
implementation of the warehouse.
4. The building of an enterprise-wide warehouse in a large organization is
a major undertaking.
5. Manual Data Processing can risk the correctness of the data being
entered.
6. The administration of a data warehouse is an intensive enterprise,
proportional to the complexity and size of the warehouse.
7. The complex nature of the administration should be understood by an
organization that attempts to administer a data warehouse.
8. There must be a flexibility to accept and integrate analytics to
streamline the business intelligence process.
9. To handle the evolutions, acquisition component and the warehouse’s
schema should be updated.
10. A significant issue in data warehousing is the quality control of data.
The major concerns are: quality and consistency of data.
11. Consistency remains a significant issue for the database
administrator.
12. Melding data from heterogeneous and disparate sources is a major
challenge, given differences in naming, domain definitions, and
identification numbers.
13. The data warehouse administrator must consider the possible
interactions with elements of warehouse, every time when a source
database changes.
14. There should be accuracy of data. The efficiency and working of a
warehouse is only as good as the data that supports its operation.
15. Usage projections should be estimated conservatively prior to
construction of the data warehouse and should be revised continually
to reflect current requirements.
16. The warehouse should be designed to accommodate the addition and
attrition of data sources; this also avoids a major redesign.
17. Sources and source data will evolve, and the warehouse must
accommodate such changes.
18. Another continual challenge is fitting the available source data
into the data model of the warehouse. This is because requirements
and capabilities of the warehouse will change over time as there will be
a continual rapid change in technology.
19. Administration of a data warehouse requires far broader skills
than traditional database administration.
20. In a large organization, managing the data warehouse, designing the
management function, and selecting the management team for the
warehouse are some of the major tasks.

Some best practices for implementing a Data Warehouse:


1. The data warehouse must be built incrementally.
2. User expectations about the completed project should be managed.
3. It is important to be politically aware.
4. There should be built-in adaptability.
5. Developing a business/supplier relationship is the best practice.

Types of data warehouse


Data Mart, Virtual Warehouse and Enterprise Warehouse.

Enterprise warehouse

An enterprise data warehouse (EDW) is a relational data warehouse containing a company’s business data, including information about its customers. An EDW enables data analytics,
which can inform actionable insights. Like all data warehouses, EDWs collect and aggregate
data from multiple sources, acting as a repository for most or all organizational data to
facilitate broad access and analysis. It typically contains detailed data as well as summarized
data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or
beyond. An enterprise data warehouse may be implemented on traditional mainframes,
computer superservers, or parallel architecture platforms. It requires extensive business
modeling and may take years to design and build.
Nearly every department within a business can benefit from data-driven insights.
Here are a few business needs that EDWs address.
1. Real-time access to data for action
EDWs make data viewable and actionable in real time by favoring an extract-load-transform (ELT) approach over the once common extract-transform-load (ETL) paradigm, in which data was cleansed, transformed, or enriched on an external server prior to being loaded into the data warehouse. With an ELT approach, raw data is extracted from its source and loaded, relatively unchanged, into the data warehouse, making it much faster to access and analyze (a small SQL sketch of this pattern appears after this list).
2. Holistic understanding of customer
EDWs enable a complete view of a business’s customer, helping improve
campaign performance, minimize churn, and ultimately grow revenue. An EDW
also facilitates predictive analytics, where teams use scenario modeling and data-
driven forecasting to inform business and marketing decisions.
3. Tracking and ensuring data compliance
EDWs enable data customers to audit and vet data sources directly and find errors
quickly. A modern EDW can also enable compliance with the EU’s General Data
Protection Regulation (GDPR) without implementing an involved process to check
multiple data locations.
4. Empowering users with limited technical knowledge
An EDW benefits non-technical employees in job functions beyond marketing,
finance, and the supply chain. For example, architects and store designers can
improve the customer experience within new stores by tapping into data from IoT
devices placed in existing locations to understand which parts of the retail footprint
are most or least engaging.
5. Consolidating data to a single, reliable repository
Modern data warehousing technology enables companies to store data across
different regions and cloud providers. Users can query an EDW as though it were a
global unified data set.
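As a rough illustration of the ELT pattern mentioned in point 1 above, the sketch below loads raw data essentially unchanged and then performs the cleansing with SQL inside the warehouse itself. The table and column names (raw_sales, sales_clean, etc.) are hypothetical and used only for illustration.

```sql
-- Hypothetical ELT sketch: raw data lands in the warehouse unchanged,
-- and the transformation is done inside the warehouse with plain SQL.

-- 1. Load step: the raw extract is stored with minimal changes.
CREATE TABLE raw_sales (
    sale_id      VARCHAR(40),
    sale_date    VARCHAR(20),   -- still a string, exactly as extracted
    customer_id  VARCHAR(40),
    amount_text  VARCHAR(20)    -- still a string, e.g. '1,234.50'
);

-- 2. Transform step: cleansing and typing happen inside the warehouse.
CREATE TABLE sales_clean AS
SELECT CAST(sale_id AS INTEGER)                               AS sale_id,
       CAST(sale_date AS DATE)                                AS sale_date,
       customer_id,
       CAST(REPLACE(amount_text, ',', '') AS DECIMAL(12,2))   AS amount
FROM   raw_sales
WHERE  sale_id IS NOT NULL;
```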

Data Mart
1. A DATA MART is focused on a single functional area of an organization and contains a subset of data stored in a Data
Warehouse.
2. Designed for use by a specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR or finance
3. Collects information from a small number of sources
4. Small in size and more flexible compared to warehouse


A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The
scope is confined to specific selected subjects. For example, a marketing data mart may confine its
subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data
marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows
based. The implementation cycle of a data mart is more likely to be measured in weeks rather than
months or years. However, it may involve complex integration in the long run if its design and planning
were not enterprise-wide. Depending on the source of data, data marts can be categorized as
independent or dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally within a
particular department or geographic area. Dependent data marts are sourced directly from enterprise
data warehouses.

Virtual Data warehouse


A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
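As a minimal sketch of this idea, the view below exposes summarized operational data to analysts without physically copying it into a warehouse. The table, column, and view names are assumptions made for this illustration.

```sql
-- A "virtual warehouse" is essentially a layer of views defined directly
-- over operational tables; no data is copied. Names are illustrative.
CREATE VIEW v_monthly_sales AS
SELECT EXTRACT(YEAR  FROM o.order_date) AS order_year,
       EXTRACT(MONTH FROM o.order_date) AS order_month,
       c.region,
       SUM(o.order_amount)              AS total_amount
FROM   orders    o
JOIN   customers c ON c.customer_id = o.customer_id
GROUP  BY EXTRACT(YEAR FROM o.order_date),
          EXTRACT(MONTH FROM o.order_date),
          c.region;

-- Analysts then query the view as if it were a warehouse table:
-- SELECT * FROM v_monthly_sales WHERE region = 'West';
```

Note that every such query runs against the live operational database, which is exactly the capacity concern raised above.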

This model creates a virtual view of databases, allowing the creation of a “virtual warehouse” as opposed to a physical warehouse. In a virtual warehouse, you have a logical description of all the databases and their structures; individuals who want to get information from those databases do not have to know anything about them.

This approach creates a single “virtual database” from all the resources. The data
resources can be local or remote. In this type of data warehouse, the data is not
moved from the sources; instead, the users are given direct access to the data. Direct access to the data is sometimes through simple SQL queries, view definitions, or data-access middleware.
With this approach, it is possible to access remote data sources including major
RDBMSs. The virtual data warehouse scheme lets a client application access data
distributed across multiple data sources through a single SQL statement, a single
interface. All data sources are accessed as though they were local; users and their applications do not even need to know the physical location of the data.

There is a great benefit in starting with a virtual warehouse since many organizations
do not want to replicate information in the physical data warehouse. Some
organizations decide to provide both by creating a data warehouse containing
summary-level data with access to legacy data for transaction details.

A virtual database is easy and fast to build, but it is not without problems. Since the queries must compete with the production data transactions, their performance can be considerably degraded. Since there is no metadata, no summary data, and no history, all the queries must be repeated, creating an additional burden on the system. Above all, there is no cleansing and refreshing process, which causes the queries to become very complex.

Pre-requisite Phases:
Extract, Transform and load process

Extraction, Transformation, and Loading(ETL)

Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and
utilities include the following functions:

Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources.

Data cleaning, which detects errors in the data and rectifies them when possible.

Data transformation, which converts data from legacy or host format to warehouse
format.

Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.

Refresh, which propagates the updates from the data sources to the warehouse.

What is ETL?
ETL is a process that extracts the data from different source systems, then
transforms the data (like applying calculations, concatenations, etc.) and finally
loads the data into the Data Warehouse system. Full form of ETL is Extract,
Transform and Load.
It’s tempting to think that creating a Data warehouse is simply extracting data from
multiple sources and loading it into the database of a Data warehouse. This is far from
the truth and requires a complex ETL process. The ETL process requires active
inputs from various stakeholders including developers, analysts, testers, top
executives and is technically challenging.
In order to maintain its value as a tool for decision-makers, Data warehouse
system needs to change with business changes. ETL is a recurring activity (daily,
weekly, monthly) of a Data warehouse system and needs to be agile, automated,
and well documented.

Why do you need ETL?


There are many reasons for adopting ETL in the organization:
 It helps companies to analyze their business data for taking critical business
decisions.
 Transactional databases cannot answer complex business questions that can be
answered with ETL.
 A Data Warehouse provides a common data repository
 ETL provides a method of moving the data from various sources into a data
warehouse.
 As data sources change, the Data Warehouse will automatically update.
 A well-designed and documented ETL system is almost essential to the success of a
Data Warehouse project.
 Allow verification of data transformation, aggregation and calculations rules.
 ETL process allows sample data comparison between the source and the target
system.
 ETL process can perform complex transformations and requires an extra area to
store the data.
 ETL helps to Migrate data into a Data Warehouse. Convert to the various formats
and types to adhere to one consistent system.
 ETL is a predefined process for accessing and manipulating source data into the
target database.
 ETL in data warehouse offers deep historical context for the business.
 It helps to improve productivity because it codifies and reuses without a need for
technical skills.

ETL Process in Data Warehouses


ETL is a three-step process: Extraction, Transformation, and Loading.

Step 1) Extraction
In this step of ETL architecture, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the Data warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data warehouse.
A data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems, and communication protocols. Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others.
Hence one needs a logical data map before data is extracted and loaded physically. This
data map describes the relationship between sources and target data.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction- without update notification.
3. Partial Extraction- with update notification
Irrespective of the method used, extraction should not affect the performance and response
time of the source systems. These source systems are live production databases. Any
slowdown or locking could affect the company’s bottom line.
Some validations are done during Extraction:
 Reconcile records with the source data
 Make sure that no spam/unwanted data is loaded
 Data type check
 Remove all types of duplicate/fragmented data
 Check whether all the keys are in place or not
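A hedged sketch of a partial (incremental) extraction into a staging area is shown below. The source table, the last_updated column, and the etl_runs control table are assumptions made only for illustration, not features of any particular product.

```sql
-- Illustrative incremental ("partial") extraction into a staging table.

-- The staging table mirrors the source structure (structure only, no rows).
CREATE TABLE stg_orders AS
SELECT * FROM source_orders WHERE 1 = 0;

-- Extract only the rows changed since the previous successful run.
INSERT INTO stg_orders
SELECT s.*
FROM   source_orders s
WHERE  s.last_updated > (SELECT MAX(run_completed_at) FROM etl_runs);

-- Simple validation: make sure no duplicate business keys were extracted.
SELECT order_id, COUNT(*)
FROM   stg_orders
GROUP  BY order_id
HAVING COUNT(*) > 1;
```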

Step 2) Transformation
Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes data such that insightful BI reports can be generated.
It is one of the important ETL concepts where you apply a set of functions on extracted data. Data that does not require any transformation is called direct move or pass-through data.
In the transformation step, you can perform customized operations on data. For instance, the user may want sum-of-sales revenue, which is not in the database; or the first name and the last name in a table may be in different columns, and it is possible to concatenate them before loading.
Data Integration Issues
Following are common data integrity problems:
1. Different spelling of the same person like Jon, John, etc.
2. There are multiple ways to denote company name like Google, Google Inc.
3. Use of different names like Cleaveland, Cleveland.
4. There may be a case that different account numbers are generated by various
applications for the same customer.
5. In some cases, required fields remain blank.
6. Invalid product codes collected at the POS, as manual entry can lead to mistakes.
Validations are done during this stage
 Filtering – Select only certain columns to load
 Using rules and lookup tables for Data standardization
 Character Set Conversion and encoding handling
 Conversion of Units of Measurements like Date Time Conversion, currency
conversions, numerical conversions, etc.
 Data threshold validation check. For example, age cannot be more than two
digits.
 Data flow validation from the staging area to the intermediate tables.
 Required fields should not be left blank.
 Cleaning (for example, mapping NULL to 0, or Gender “Male” to “M” and “Female” to “F”, etc.)
 Splitting a column into multiple columns and merging multiple columns into a single column
 Transposing rows and columns
 Using lookups to merge data
 Using any complex data validation (e.g., if the first two columns in a row are empty then automatically reject the row from processing)
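The sketch below applies a few of the transformations listed above (merging name columns, standardizing gender codes, mapping NULL to 0, and a simple age threshold check) to a hypothetical staging table; all table and column names are illustrative.

```sql
-- Illustrative transformation step applied to staged customer data.
INSERT INTO dim_customer (customer_key, full_name, gender, age, revenue)
SELECT customer_id,
       first_name || ' ' || last_name         AS full_name,  -- merge columns
       CASE WHEN gender = 'Male'   THEN 'M'
            WHEN gender = 'Female' THEN 'F'
            ELSE 'U' END                       AS gender,     -- standardize codes
       age,
       COALESCE(revenue, 0)                    AS revenue     -- map NULL to 0
FROM   stg_customer
WHERE  age BETWEEN 0 AND 99;                   -- threshold validation
```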

Step 3) Loading
Loading data into the target data warehouse database is the last step of the ETL process.
In a typical Data warehouse, a huge volume of data needs to be loaded in a relatively short
period (nights). Hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the
point of failure without loss of data integrity. Data Warehouse admins need to monitor,
resume, or cancel loads as per prevailing server performance.
Types of Loading:
 Initial Load — populating all the Data Warehouse tables
 Incremental Load — applying ongoing changes as when needed periodically.
 Full Refresh —erasing the contents of one or more tables and reloading with
fresh data.
Load verification
 Ensure that the key field data is neither missing nor null.
 Test modeling views based on the target tables.
 Check combined values and calculated measures.
 Data checks in dimension table as well as history table.
 Check the BI reports on the loaded fact and dimension table.
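A minimal sketch of an incremental load, a full refresh, and a basic load verification query is given below, again using hypothetical table names.

```sql
-- Incremental load: apply only the new or changed rows to the fact table.
INSERT INTO fact_sales (time_key, item_key, branch_key, dollars_sold, units_sold)
SELECT time_key, item_key, branch_key, dollars_sold, units_sold
FROM   stg_sales;

-- Full refresh: erase the table contents and reload everything from staging.
TRUNCATE TABLE fact_sales;
INSERT INTO fact_sales
SELECT time_key, item_key, branch_key, dollars_sold, units_sold
FROM   stg_sales_full;

-- Load verification: the key fields should never be missing or NULL.
SELECT COUNT(*) AS bad_rows
FROM   fact_sales
WHERE  time_key IS NULL OR item_key IS NULL OR branch_key IS NULL;
```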

ETL Tools
Many Data Warehousing tools are available in the market. Here are some of the most prominent ones:
1. MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier and
faster using an array of enterprise features. It can query different types of data like
documents, relationships, and metadata.
https://www.marklogic.com/product/getting-started/

2. Oracle:
Oracle is the industry-leading database. It offers a wide range of choice of Data
Warehouse solutions for both on-premises and in the cloud. It helps to optimize
customer experiences by increasing operational efficiency.
https://www.oracle.com/index.html
3. Amazon Redshift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool to analyze
all types of data using standard SQL and existing BI tools. It also allows running complex
queries against petabytes of structured data.
https://aws.amazon.com/redshift/

Best practices ETL process


Following are the best practices for ETL Process steps:
Never try to cleanse all the data:
Every organization would like to have all its data clean, but most are not ready to pay for
it or to wait. Cleaning it all would simply take too long, so it is better not to try to
cleanse all the data.
Never cleanse Anything:
Always plan to clean something because the biggest reason for building the Data
Warehouse is to offer cleaner and more reliable data.
Determine the cost of cleansing the data:
Before cleansing all the dirty data, it is important for you to determine the cleansing cost
for every dirty data element.
To speed up query processing, have auxiliary views and indexes:
To reduce storage costs, store summarized data on disks or tapes. Also, a trade-off
between the volume of data to be stored and its detailed usage is required. Trade off at
the level of granularity of the data to decrease storage costs.
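As a small illustration of the last point, a summary (aggregate) table plus an index can absorb many reporting queries instead of the detailed fact table. The table names and the integer YYYYMMDD-style time key below are assumptions made for this sketch.

```sql
-- Illustrative auxiliary summary table and index to speed up queries.
CREATE TABLE sales_by_month AS
SELECT time_key / 100   AS year_month,   -- assumes an integer YYYYMMDD key
       item_key,
       SUM(dollars_sold) AS dollars_sold,
       SUM(units_sold)   AS units_sold
FROM   fact_sales
GROUP  BY time_key / 100, item_key;

CREATE INDEX idx_sales_by_month ON sales_by_month (year_month, item_key);
```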

Data Warehousing: A Multitiered Architecture


Data warehouses often adopt a three-tier architecture
1. The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or
other external sources (e.g., customer profile information provided by external consultants). These
tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data
from different sources into a unified format), as well as load and refresh functions to update the data
warehouse (see Section 4.1.6). The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client programs to generate
SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about
the data warehouse and its contents. The metadata repository is further described in Section 4.1.7.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP
(ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to
standard relational operations); or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-
purpose server that directly implements multidimensional data and operations). OLAP servers are
discussed in Section 4.4.4.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
Meta Data Repository
• Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects
A metadata repository must contain the following:
1. A description of the data warehouse structure, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.

2. Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
3. The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
4. Mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging
rules, and security
5. Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and
scheduling of refresh, update, and replication cycles.
6. Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

Components of Data Warehouse Architecture and their tasks :


1. Operational Source –
 An operational source is a data source consisting of Operational Data and External
Data.
 Data can come from Relational DBMS like Informix, Oracle.
2. Load Manager –
 The Load Manager performs all operations associated with the extraction of
loading data in the data warehouse.
 These tasks include the simple transformation of data to prepare data for entry
into the warehouse.
3. Warehouse Manager –
 The warehouse manager is responsible for the warehouse management process.
 The operations performed by the warehouse manager are the analysis,
aggregation, backup and collection of data, de-normalization of the data.
4. Query Manager –
 Query Manager performs all the tasks associated with the management of user
queries.
 The complexity of the query manager is determined by the end-user access
operations tool and the features provided by the database.
5. Detailed Data –
 It is used to store all the detailed data in the database schema.
 Detailed data is loaded into the data warehouse to complement the data
collected.
6. Summarized Data –
 Summarized Data is a part of the data warehouse that stores predefined
aggregations
 These aggregations are generated by the warehouse manager.
7. Archive and Backup Data –
 The Detailed and Summarized Data are stored for the purpose of archiving and
backup.
 The data is relocated to storage archives such as magnetic tapes or optical disks.
8. Metadata –
 Metadata is basically data about data.
 It is used for extraction and loading process, warehouse, management process,
and query management process.
9. End User Access Tools –
 End-User Access Tools consist of Analysis, Reporting, and mining.
 By using end-user access tools users can link with the warehouse.

Difference between Operational Database and Data Warehouse

Online Transaction Processing (OLTP)

The major task of online operational database systems is to perform online transaction and query processing. They cover
most day-to-day operations such as transactions, purchasing, inventory, etc.

Online Analytical Processing (OLAP)


These systems help users and knowledge workers in analysis and decision making.

Components of Warehouse:
Data cubes allow the data in the data warehouse to be modeled in multiple dimensions. Data in the data warehouse is multidimensional, having measures and dimensions. Measures are the values which can be quantified and aggregated, e.g., with SUM() or AVG().
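A rough SQL sketch of this idea: the GROUP BY CUBE clause, supported by many warehouse engines, computes the measure SUM(dollars_sold) over every combination of the chosen dimensions, which is essentially the data cube. The fact table and its columns are illustrative.

```sql
-- Measures (SUM, AVG, ...) aggregated over combinations of dimensions.
-- CUBE produces every grouping: (item, location), (item), (location),
-- and the grand total.
SELECT item_key,
       location_key,
       SUM(dollars_sold) AS total_dollars_sold
FROM   fact_sales
GROUP  BY CUBE (item_key, location_key);
```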

Fact data/facts
A fact is a quantitative piece of information, such as a sale or a download, by which one can analyze the relationships between dimensions. Facts are stored in fact tables and have foreign key relationships with a number of dimension tables.
Dimensions
Dimensions provide descriptions of the objects in the fact table. They give the context needed to analyze the numerical content of the facts. A dimension table stores data about the ways in which the data in the fact table can be analyzed.
Dimensions are the entities with respect to which an organization wants to keep records.
Example
Suppose there is an electronics store which has created a sales data warehouse. It wants to maintain records based on the dimensions item, branch, and location.
These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension, e.g., the dimension table for item is given below.
Item dimension table attributes: Item_Id, Item_name, Brand, Type, Cost

Fact Table
It is a collection of facts, measurements, and metrics. Facts for the sales data warehouse include dollars_sold, units_sold, and amount_budget. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
Let us see a simple 2-D cube (figure omitted): rupees_sold in thousands is the context given to the numerical data. (Figure omitted: the same data extended with a 4th dimension, supplier.)
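As an illustrative query against such a schema (the table and column names below are assumptions based on this example), units and dollars sold per month, item, and branch can be obtained by joining the fact table to its dimension tables through their foreign keys:

```sql
-- Analyzing facts along dimensions: join the fact table to its
-- dimension tables via their keys (names are illustrative).
SELECT t.month,
       i.item_name,
       b.branch_name,
       SUM(f.units_sold)   AS units_sold,
       SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact   f
JOIN   time_dim     t ON t.time_key   = f.time_key
JOIN   item_dim     i ON i.item_key   = f.item_key
JOIN   branch_dim   b ON b.branch_key = f.branch_key
GROUP  BY t.month, i.item_name, b.branch_name;
```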
Difference between Fact Table and Dimension Table
A fact table’s record is a combination of attributes from different dimension tables. The fact table helps the user to investigate the business dimensions, which supports decision making to improve the business.
On the other hand, dimension tables help the fact table to gather the dimensions along which the measures are to be taken.
The main difference between a fact table and a dimension table is that the dimension table contains the attributes along which the measures in the fact table are taken.

Difference between Fact Table and Dimension Table:

1. A fact table contains the measures defined on the attributes of the dimension tables, while a dimension table contains the attributes on which the fact table calculates its metrics.
2. A fact table has fewer attributes than a dimension table, while a dimension table has more attributes than a fact table.
3. A fact table has more records than a dimension table, while a dimension table has fewer records than a fact table.
4. A fact table forms a vertical table, while a dimension table forms a horizontal table.
5. The attribute format of a fact table is numerical and text, while the attribute format of a dimension table is text.
6. A fact table is created after the dimension tables, while a dimension table is created before the fact table.
7. The number of fact tables in a schema is less than the number of dimension tables, while the number of dimension tables is more than the number of fact tables.
8. A fact table is used for analysis and decision making, while the main task of a dimension table is to store information about a business and its processes.
Warehouse Schema for multidimensional data: star, snowflake and galaxy schemas
The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model
is appropriate for online transaction processing. A data warehouse, however, requires a concise,
subject-oriented schema that facilitates online data analysis. The most popular data model for a data
warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake
schema, or a fact constellation schema. Let’s look at each of these.

Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph
resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact
table.

Example 4.1 Star schema. A star schema for AllElectronics sales is shown in Figure 4.6. Sales are
considered along four dimensions: time, item, branch, and location. The schema contains a central
fact table for sales that contains keys to each of the four dimensions, along with two measures:
dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g., time key
and item key) are system-generated identifiers. Notice that in the star schema, each dimension is
represented by only one table, and each table contains a set of attributes. For example, the location
dimension table contains the attribute set {location key, street, city, province or state, country}. This
constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are both cities in
the state of Illinois, USA. Entries for such cities in the location dimension table will create redundancy
among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL,
USA). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a
lattice (partial order).
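A minimal DDL sketch of the star schema in Example 4.1 is given below. The table names follow the text, but the exact column definitions (especially for the time and branch dimensions) are assumptions made for illustration.

```sql
-- Star schema sketch: one central fact table, one table per dimension.
CREATE TABLE time_dim (
    time_key  INTEGER PRIMARY KEY,
    day       INTEGER,
    month     INTEGER,
    quarter   INTEGER,
    year      INTEGER
);

CREATE TABLE item_dim (
    item_key      INTEGER PRIMARY KEY,
    item_name     VARCHAR(100),
    brand         VARCHAR(50),
    type          VARCHAR(50),
    supplier_type VARCHAR(50)      -- denormalized supplier attribute
);

CREATE TABLE branch_dim (
    branch_key  INTEGER PRIMARY KEY,
    branch_name VARCHAR(100),
    branch_type VARCHAR(50)
);

CREATE TABLE location_dim (
    location_key      INTEGER PRIMARY KEY,
    street            VARCHAR(100),
    city              VARCHAR(50),
    province_or_state VARCHAR(50),
    country           VARCHAR(50)
);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim (time_key),
    item_key     INTEGER REFERENCES item_dim (item_key),
    branch_key   INTEGER REFERENCES branch_dim (branch_key),
    location_key INTEGER REFERENCES location_dim (location_key),
    dollars_sold DECIMAL(12,2),
    units_sold   INTEGER
);
```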

Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The resulting
schema graph forms a shape similar to a snowflake. The major difference between the snowflake and
star schema models is that the dimension tables of the snowflake model may be kept in normalized
form to reduce redundancies. Such a table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical magnitude of the fact table. Furthermore,
the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to
execute a query. Consequently, the system performance may be adversely impacted. Hence,
although the snowflake schema reduces redundancy, it is not as popular as the star schema in data
warehouse design

Example 4.2 Snowflake schema. A snowflake schema for AllElectronics sales is given in Figure 4.7.
Here, the sales fact table is identical to that of the star schema in Figure 4.6. The main difference
between the two schemas is in the definition of dimension tables. The single dimension table for item
in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables.
For example, the item dimension table now contains the attributes item key, item name, brand, type,
and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key
and supplier type information. Similarly, the single dimension table for location in the star schema can
be normalized into two new tables: location and city. The city key in the new location table links to the
city dimension. Notice that, when desirable, further normalization can be performed on province or
state and country in the snowflake schema shown in Figure 4.7.
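A sketch of the normalization described in Example 4.2: the item dimension keeps only a supplier key, which references a separate supplier table. The definitions below are assumptions based on the text (the item table is named differently here only to avoid clashing with the star-schema sketch above).

```sql
-- Snowflake variant: supplier attributes move out of the item dimension
-- into their own table, and item keeps only a foreign key to it.
CREATE TABLE supplier_dim (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_type VARCHAR(50)
);

CREATE TABLE item_dim_snowflake (
    item_key     INTEGER PRIMARY KEY,
    item_name    VARCHAR(100),
    brand        VARCHAR(50),
    type         VARCHAR(50),
    supplier_key INTEGER REFERENCES supplier_dim (supplier_key)
);

-- Queries on item attributes now need one extra join to reach supplier data:
-- SELECT i.item_name, s.supplier_type
-- FROM   item_dim_snowflake i
-- JOIN   supplier_dim s ON s.supplier_key = i.supplier_key;
```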

GALAXY
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and
hence is called a galaxy schema or a fact constellation.
Example 4.3 Fact constellation. A fact constellation schema is shown in Figure 4.8. This schema
specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star
schema (Figure 4.6). The shipping table has five dimensions, or keys—item key, time key, shipper
key, from location, and to location—and two measures—dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact
tables. For example, the dimensions tables for time, item, and location are shared between the sales
and shipping fact tables.
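A brief sketch of the additional shipping fact table in Example 4.3 is shown below. It reuses the time, item, and location dimension tables from the star-schema sketch above, which is exactly what makes this a fact constellation; the shipper dimension and the column definitions are assumptions based on the text.

```sql
-- Galaxy / fact constellation sketch: a second fact table (shipping)
-- shares the time, item, and location dimensions with sales_fact.
CREATE TABLE shipper_dim (
    shipper_key  INTEGER PRIMARY KEY,
    shipper_name VARCHAR(100)
);

CREATE TABLE shipping_fact (
    item_key      INTEGER REFERENCES item_dim (item_key),
    time_key      INTEGER REFERENCES time_dim (time_key),
    shipper_key   INTEGER REFERENCES shipper_dim (shipper_key),
    from_location INTEGER REFERENCES location_dim (location_key),
    to_location   INTEGER REFERENCES location_dim (location_key),
    dollars_cost  DECIMAL(12,2),
    units_shipped INTEGER
);
```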
