DM - MOD - 2 Part - I
Module - 2
● A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It holds historical data derived from transaction
data from one or more sources.
● A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
● A Data Warehouse is a group of data specific to the entire organization, not only to a
particular group of users.
● "Data Warehouse is a subject-oriented, integrated, and time-variant store of information
in support of management's decisions."
Advanced Data Mining 221ECS001 Module -2 Part -1
Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view of a particular
subject, such as customer, product, or sales, rather than of the organization's ongoing
operations as a whole. This is done by excluding data that are not useful for the subject and
including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and
online transaction records. Data cleaning and integration must be performed during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among the
different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts
with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, populated with data transformed from the
source operational RDBMS. Operational updates do not occur in the data warehouse, i.e.,
individual update, insert, and delete transactions are not performed on it. It usually requires
only two operations in data accessing: initial loading of data and access to data. Therefore, the
DW does not require transaction processing, recovery, and concurrency control capabilities, which
allows for a substantial speedup of data retrieval. Non-volatile means that once data has entered
the warehouse, it should not change.
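The time-variant and non-volatile properties can be illustrated with a minimal Python sketch. The class names `OperationalStore` and `Warehouse` and the sample customer data are hypothetical and chosen only for illustration; a real warehouse would be a database, not an in-memory list.

```python
from datetime import date


class OperationalStore:
    """Transaction system: keeps only the most current value per key."""

    def __init__(self):
        self.current = {}

    def update(self, key, value):
        self.current[key] = value  # the previous value is overwritten


class Warehouse:
    """Warehouse: loads are appended with a load date and never modified."""

    def __init__(self):
        self._rows = []

    def load(self, key, value, load_date):
        self._rows.append((key, value, load_date))  # append only, no update or delete

    def history(self, key):
        return [(value, day) for k, value, day in self._rows if k == key]


ops = OperationalStore()
dw = Warehouse()
# Hypothetical example: one customer moves from one city to another.
for city, day in [("Kochi", date(2024, 1, 1)), ("Chennai", date(2024, 6, 1))]:
    ops.update("customer-1", city)
    dw.load("customer-1", city, day)

print(ops.current["customer-1"])  # only the latest city survives
print(dw.history("customer-1"))   # both cities, each with its load date
```

The operational store answers "where does the customer live now?", while the warehouse can also answer "where did the customer live six months ago?".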
1) Business users: Business users require a data warehouse to view summarized data from the
past. Since these users are often non-technical, the data may be presented to them in a simple
form.
2) Storing historical data: A data warehouse is required to store time-variant data from the
past, which can then be used for various purposes.
3) Making strategic decisions: Some strategies depend on the data in the data
warehouse, so data warehouses contribute to making strategic decisions.
4) Data consistency and quality: By bringing data from different sources into a
common place, the organization can enforce uniformity and consistency in the data.
5) Quick response time: Data warehouses have to be ready for somewhat unexpected loads and
types of queries, which demands a significant degree of flexibility and quick response times.
4. Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from
lots of users.
6. Data warehousing provides the capabilities to analyze a large amount of historical data.
Data Integration
One of the main functions of a data warehouse is to integrate data from various sources. This can
include transactional systems, such as point-of-sale systems or customer relationship
management systems, as well as external data sources, such as market research or social media
data.
Another function of a data warehouse is to clean and transform the data. This can include
removing duplicates, correcting errors, and standardizing data formats. This is important because
it ensures that the data is accurate and consistent, making it easier to analyze.
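A minimal sketch of the cleaning and transformation step described above: it removes duplicates and standardizes name, email, and date formats. The record layout, the two accepted date formats, and the choice of email as the duplicate key are all assumptions made for this example.

```python
from datetime import datetime


def _parse_date(text):
    # Assumption: sources use either DD/MM/YYYY or YYYY-MM-DD dates.
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean_records(records):
    """Standardize formats and drop duplicates, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for rec in records:
        name = " ".join(rec["name"].split()).title()  # normalize spacing and case
        email = rec["email"].strip().lower()
        joined = _parse_date(rec["joined"])           # standardize to ISO dates
        if email in seen:                             # assumption: email identifies a customer
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email, "joined": joined})
    return cleaned


# Hypothetical raw rows from two source systems with inconsistent formatting.
raw = [
    {"name": "  asha   menon", "email": "Asha@Example.com ", "joined": "05/03/2023"},
    {"name": "Asha Menon", "email": "asha@example.com", "joined": "2023-03-05"},
    {"name": "ravi kumar", "email": "ravi@example.com", "joined": "2022-11-20"},
]
cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```

The first two raw rows describe the same customer in different formats, so only one of them survives cleaning.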
Data Consolidation
A data warehouse also consolidates data from various sources into a single, unified view. This
can include combining data from different transactional systems, such as sales and inventory
data, or combining data from different external sources, such as market research and social
media data.
Data Analysis
One of the main benefits of a data warehouse is its ability to support data analysis. This can
include running queries, creating reports, and building data visualizations. This can help
organizations gain insights into their data, identify trends and patterns, and make informed
business decisions.
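The consolidation and analysis functions can be sketched together with Python's built-in `sqlite3` module. The `sales` and `inventory` tables and their contents are hypothetical stand-ins for two source systems; the query combines them into one view per product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, sale_month TEXT, quantity INTEGER, unit_price REAL);
    CREATE TABLE inventory (product TEXT, on_hand INTEGER);
    INSERT INTO sales VALUES
        ('pen', '2024-01', 10, 2.0),
        ('pen', '2024-02', 15, 2.0),
        ('notebook', '2024-01', 4, 5.0);
    INSERT INTO inventory VALUES ('pen', 100), ('notebook', 20);
""")

# Consolidate sales and inventory data, then summarize per product.
report = conn.execute("""
    SELECT s.product,
           SUM(s.quantity)                AS units_sold,
           SUM(s.quantity * s.unit_price) AS revenue,
           i.on_hand
    FROM sales AS s
    JOIN inventory AS i ON i.product = s.product
    GROUP BY s.product, i.on_hand
    ORDER BY revenue DESC
""").fetchall()

for row in report:
    print(row)
```

Each result row holds the product, the units sold, the revenue, and the stock on hand, which is the kind of combined figure a report or dashboard would be built on.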
Real-Life Examples
Retail Industry
A retail company can use a data warehouse to store and analyze data from its point-of-sale
systems, inventory systems, and customer relationship management systems. This can help the
company gain insights into customer purchasing habits, track inventory levels, and identify
which products are selling well. This information can be used to make informed decisions about
promotions, marketing, and product development.
Healthcare Industry
A healthcare organization can use a data warehouse to store and analyze data from its electronic
health records (EHR) systems and clinical systems. This can help the organization track patient
outcomes, identify trends in disease rates, and monitor the effectiveness of different treatments.
This information can be used to improve patient care and make informed decisions about
resource allocation.
Finance Industry
A financial institution can use a data warehouse to store and analyze data from its transactional
systems, such as trading systems and customer account systems. This can help the institution
track financial performance, identify potential fraud, and monitor compliance with regulations.
This information can be used to make informed decisions about risk management and investment
strategy.
A data warehouse is a centralized repository where an organization can store substantial amounts
of data from multiple source systems and locations. It is a complex information system
that contains historical and cumulative data from multiple sources. There are three approaches to
constructing Data Warehouse layers: single-tier, two-tier, and three-tier. These architectures
are explained below.
● Single-tier architecture
The objective of a single layer is to minimize the amount of data stored by
removing data redundancy. This architecture is not frequently used in practice.
● Two-tier architecture
Two-layer architecture physically separates the available sources from the data
warehouse. This architecture is not expandable and does not
support a large number of end-users. It also has connectivity problems because of
network limitations.
The Data Warehouse is based on an RDBMS server which is a central information repository that
is surrounded by some key Data Warehousing components to make the entire environment
functional, manageable and accessible. There are mainly five Data Warehouse Components:
● Metadata
The name metadata suggests some high-level technological data warehousing
concept. However, it is quite simple. Metadata is data about data, and it defines the data
warehouse. It is used for building, maintaining, and managing the data warehouse.
In the data warehouse architecture, metadata plays an important role as it specifies the
source, usage, values, and features of data warehouse data. It also defines how data can
be changed and processed. It is closely connected to the data warehouse.
Metadata can be classified into the following categories:
1. Technical Metadata: This kind of metadata contains information about the
warehouse that is used by data warehouse designers and administrators.
2. Business Metadata: This kind of metadata contains detail that gives end-users a
way to understand the information stored in the data warehouse.
● Query Tools
Reporting tools:
Reporting tools can be further divided into production reporting tools and desktop report
writers.
1. Report writers: This kind of reporting tool is designed for end-users to perform
their own analysis.
2. Production reporting: This kind of tool allows organizations to generate regular
operational reports. It also supports high-volume batch jobs such as printing and
calculating. Some popular reporting tool vendors are Brio, Business Objects, Oracle,
PowerSoft, and SAS Institute.
● Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented as
an alternative to a large data warehouse because it takes less time and money to build.
However, there is no standard definition of a data mart; it differs from person to person.
Simply put, a data mart is a subsidiary of a data warehouse. The data mart is a
partition of the data created for a specific group of users. Data marts can be
created in the same database as the data warehouse or in a physically separate database.
Designing a data warehouse solution involves several steps that need to be followed to ensure that the end
product is effective and meets the requirements of the business. The typical steps are listed below.
1. Requirements Gathering: As a data warehouse impacts all verticals, departments, and teams of a
company, it is essential to identify the expectations of DWH end-users. The design should meet
the present and future business needs, including security and compliance.
2. Preliminary Analysis: This step includes data source analysis, like determining the number of data
sources, data quality, data volume and more. Data warehouse consultants identify potential users
and their locations to align the project with department goals. They also collaborate with all
stakeholders to understand their vision and expectations.
3. Conceptualization: It includes determining the core and advanced functionality of the data
warehouse system. This stage begins with determining the components required in the DWH
based on the chosen deployment option (on-premise or cloud). For cloud deployment, deciding
between public, private, hybrid, and multi-cloud is essential to select the optimal architecture
option. The focus should be identifying how the chosen architecture will meet business goals and
solve problems. Usually, a solution architect and business analyst collaborate with you for this
step.
4. Project Planning: The data contained within a data warehouse determines its reliability. So, the
DWH project’s scope should be related to business objectives. The project deliverables, timelines,
resources, and budget are decided along the same lines, focusing on the findings of the preceding
stages. This stage also includes planning for disaster recovery in case of system failure.
5. Technologies Selection: This stage involves selecting technologies for your data warehouse
components like databases & data lakes. You should focus on your data security strategy and
existing analytics infrastructure, while selecting technology and tools for your DWH project.
6. System Analysis: It is essential to comprehensively analyze data sources, including their
relationship, access rules, and the quality, volume, complexity, sensitivity, type, and structure of
their data.
7. Data Governance: This stage involves setting up a data governance framework for your data
warehouse system. So, you must determine the criteria for data quality. Also, create the policies
and rules for data cleaning, data access, data usage, and data security for your DWH solution and
its users. They could include policies concerning data backup, and data encryption.
8. Data Modelling: This is probably the most complex part of designing a data warehouse, as it is
the process of visualizing data distribution within your DWH. It includes identifying data
sets/entities, creating relationships between them, determining key attributes of every data
set/entity and mapping them. It involves designing data models for the data warehouse and data
marts. A data mart is a storage area within a data warehouse that houses the data for a particular
business function. Creating data marts enhances query performance by accelerating analytics
for a specific business area. The design of data models typically starts at the data
mart level and branches out to the data warehouse. The popular data models include:
Star Schema: It has a fact table in the center, surrounded by many associated dimension
tables.
Snowflake Schema: It is an extension of the star schema in which dimension tables are
normalized into additional tables.
Galaxy Schema: It contains two or more fact tables, each surrounded by dimension tables, some
of which may be shared between them.
9. Experienced system analysts work on this step of DWH design which also includes converting
logical data models into database tables, indexes, keys, and columns.
10. ETL/ELT Processes Design: ETL (Extract, Transform, Load) is the process of pulling out data
from your data sources, cleaning and organizing the data, and feeding it into your data warehouse.
In contrast, ELT (Extract, Load, Transform) extracts the data and loads it into the DWH first,
and then processes it there for structure and quality. Depending on your DWH components and
architecture, data engineers will choose between the ETL and ELT processes and design them for
data flow control and data integration.
11. OLAP Cubes: Online Analytical Processing (OLAP) cubes help with data analysis and reporting
in the data warehouse or data mart. Your data warehouse design may or may not require them.
12. Front-end Visualization design: Users interact with the front-end of any software, so your data
warehouse must be user-friendly with intuitive and interactive features. Popular visualization tools
like Power BI and Tableau help provide unique front-end experiences. The solution architect can
customize the front end to meet your ad-hoc reporting requirements.
13. Rolling out the data warehouse: Once you have the final design of your data warehouse, it is
time to develop and launch it.
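The difference between ETL and ELT described in step 10 can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. The source rows, table names, and cleaning rules (trim, upper-case, drop rows without a customer) are hypothetical; the point is only where the transformation runs.

```python
import sqlite3

# Hypothetical extract from a source system; all values arrive as text.
SOURCE_ROWS = [
    {"customer": " Asha ", "amount": "120.50"},
    {"customer": "RAVI", "amount": "80"},
    {"customer": "", "amount": "15"},  # invalid: no customer name
]


def extract():
    return list(SOURCE_ROWS)


def transform(rows):
    """ETL transform: clean in application code before loading."""
    return [
        (r["customer"].strip().upper(), float(r["amount"]))
        for r in rows
        if r["customer"].strip()
    ]


def run_etl(conn):
    # Extract -> Transform -> Load: only cleaned rows reach the warehouse.
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transform(extract()))


def run_elt(conn):
    # Extract -> Load -> Transform: raw rows are loaded, then cleaned with SQL.
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (:customer, :amount)", extract())
    conn.execute("""
        CREATE TABLE sales_elt AS
        SELECT UPPER(TRIM(customer)) AS customer, CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE TRIM(customer) <> ''
    """)


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT customer, amount FROM sales_etl ORDER BY customer").fetchall()
elt_rows = conn.execute("SELECT customer, amount FROM sales_elt ORDER BY customer").fetchall()
print(etl_rows)
print(elt_rows)
```

Both pipelines end with the same cleaned rows. The ELT variant additionally keeps the raw table `sales_raw` inside the warehouse, so the transformation can be re-run or changed later without going back to the source.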
1. Top-down approach:
The essential components are discussed below:
1. External Sources – An external source is a source from which data is collected, irrespective of
the type of data. Data can be structured, semi-structured, or unstructured.
2. Stage Area – Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before it is loaded into the data warehouse. For
this purpose, it is recommended to use an ETL tool.
3. E (Extract): Data is extracted from the external data source.
4. T (Transform): Data is transformed into the standard format.
5. L (Load): Data is loaded into the data warehouse after transforming it into the standard
format.
6. Data Warehouse – After cleansing, the data is stored in the data warehouse as a central
repository. It stores the metadata along with the data itself, and the data marts draw their data from it.
Note that the data warehouse stores the data in its purest form in this top-down approach.
7. Data Marts – A data mart is also part of the storage component. It stores the information of a
particular function of an organization which is handled by a single authority. There can be
as many data marts in an organization as there are functions. We can also say that
a data mart contains a subset of the data stored in the data warehouse.
8. Data Mining – Data mining is the practice of analyzing the large volumes of data present in
data warehouses. It is used to find hidden patterns in a database or data
warehouse with the help of data mining algorithms.
This approach is defined by Inmon: the data warehouse is a central repository for the complete
organization, and data marts are created from it after the complete data warehouse has been created.
2. Bottom-up approach:
Data marts
A data mart is a simple form of a data warehouse that is focused on a single subject or line of
business, such as sales, finance, or marketing. Given their focus, data marts draw data from
fewer sources than data warehouses. Data mart sources can include internal operational systems,
a central data warehouse, and external data.
Data marts can be established in three ways: using a dependent approach where the mart(s) are
created from an existing data warehouse, an independent approach where data is extracted and
processed from its sources and loaded directly into the mart, and a hybrid approach where data from
an existing data warehouse is combined with data from other sources.
1. Dependent
● Also known as top-down approach, dependent data marts draw data directly from a single,
existing enterprise data warehouse. This offers centralization in that the data warehouse stores
the granular data and is the single point of reference for all dependent repositories. Also, note
in the data mart example below how data pipelines are shifting from ETL to ELT (Extract,
Load, and Transform), streaming and API.
● The marts are partitioned segments of the data warehouse and you extract well-defined
subsets of the data warehouse data as needed for analysis. These subsets can be a logical view
where virtual tables are logically separated, but not physically separated from the data
warehouse, or the subsets can be stored in physically separate repositories from the data
warehouse.
2. Independent
● Independent data marts are stand-alone repositories which do not rely on your data warehouse
or other marts. Instead, the data necessary for the specific subject or business function is
extracted from the appropriate internal and/or external sources, transformed, and then loaded
to the mart. Independent data marts are relatively easy to set up and are well-suited for
short-term projects or to support small groups in your organization.
3. Hybrid
● Hybrid data marts combine data from both your data warehouse and your operational source
systems such as SaaS applications, SQL databases and flat files. The benefit of this approach
is that it gives you both access to cleansed data from the warehouse and the ability to quickly
add new sources on an ad hoc basis such as when a new geographic region is added.
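For the dependent approach, the two options mentioned above (a logical view versus a physically separate subset) can be sketched with Python's built-in `sqlite3` module. The `warehouse_facts` table, the mart names, and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_facts (
        department TEXT, region TEXT, metric TEXT, value REAL
    );
    INSERT INTO warehouse_facts VALUES
        ('sales',   'north', 'revenue', 1200.0),
        ('sales',   'south', 'revenue',  950.0),
        ('finance', 'north', 'expenses', 400.0);

    -- Logical mart: a view, so no data is copied out of the warehouse.
    CREATE VIEW sales_mart AS
        SELECT region, metric, value
        FROM warehouse_facts
        WHERE department = 'sales';

    -- Physical mart: a separate table populated from the warehouse.
    CREATE TABLE finance_mart AS
        SELECT region, metric, value
        FROM warehouse_facts
        WHERE department = 'finance';
""")

sales_rows = conn.execute("SELECT * FROM sales_mart ORDER BY region").fetchall()
finance_rows = conn.execute("SELECT * FROM finance_mart").fetchall()
print(sales_rows)
print(finance_rows)
```

A view always reflects the current warehouse contents, whereas a physically separate mart has to be refreshed when the warehouse changes.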
Companies organize data marts in a multidimensional schema as a blueprint to address the
needs of the people using the databases for analytical tasks. The three main types of schema are
star, snowflake, and vault.
● Star
There is no dependency between dimension tables, so a star schema requires fewer joins
when writing queries. This structure makes querying easier, so star schemas are highly
efficient for analysts who want to access and navigate large data sets.
● Snowflake
A snowflake schema is a logical extension of a star schema, building out the blueprint
with additional dimension tables. The dimension tables are normalized to protect data
integrity and minimize data redundancy.
While this method requires less space to store dimension tables, it is a complex structure
that can be difficult to maintain. The main benefit of using snowflake schema is the low
demand for disk space, but the caveat is a negative impact on performance due to the
additional tables.
● Vault
Data vault eliminates star schema's need for cleansing and streamlines the addition of
new data sources without any disruption to existing schema.
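The trade-off between star and snowflake schemas can be made concrete with a small sketch using Python's built-in `sqlite3` module. All table and column names are hypothetical. The same question (total sales per category) needs one join in the star layout and two joins in the snowflake layout, because the category attribute has been normalized into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: one denormalized product dimension.
    CREATE TABLE dim_product_star (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    -- Snowflake: the category attribute is normalized into its own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_product_snow (
        product_id INTEGER PRIMARY KEY, name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    -- Fact table shared by both layouts in this sketch.
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

    INSERT INTO dim_category VALUES (1, 'stationery'), (2, 'electronics');
    INSERT INTO dim_product_star VALUES (1, 'pen', 'stationery'), (2, 'lamp', 'electronics');
    INSERT INTO dim_product_snow VALUES (1, 'pen', 1), (2, 'lamp', 2);
    INSERT INTO fact_sales VALUES (1, 3.0), (1, 2.0), (2, 40.0);
""")

STAR_QUERY = """
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product_star AS p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
"""
SNOWFLAKE_QUERY = """
    SELECT c.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product_snow AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
"""
star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(star_result)
print(snowflake_result)
```

Both queries return the same totals. The snowflake layout stores each category name once, at the cost of the extra join.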
Metadata
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of the
book.
Role of Metadata
Categories of Metadata
● Business Metadata − It has the data ownership information, business definition, and
changing policies.
● Technical Metadata − It includes database system names, table and column names and
sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
● Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history
of the data's migration and the transformations applied to it.
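The three categories can be illustrated with a short sketch using Python's built-in `sqlite3` module. Technical metadata is read from the database catalog with SQLite's `PRAGMA table_info`; the business and operational entries are hypothetical examples of what a metadata catalog might record, since the database engine itself does not store them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        joined_on TEXT
    )
""")

# Technical metadata: column names, data types, and key flags from the catalog.
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
technical = [
    {"column": name, "type": col_type, "primary_key": bool(pk)}
    for _cid, name, col_type, _notnull, _default, pk
    in conn.execute("PRAGMA table_info(customer)")
]

# Business metadata (hypothetical): ownership and business definition.
business = {
    "owner": "Sales department",
    "definition": "One row per registered customer",
}

# Operational metadata (hypothetical): currency and lineage of the data.
operational = {
    "status": "active",  # active, archived, or purged
    "lineage": ["extracted from CRM", "names standardized"],
}

for entry in technical:
    print(entry)
```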