Advanced Data Mining 221ECS001 Module -2 Part -1

Module - 2

Data Warehouse and OLAP Technology for Data mining


Data warehouses and its Characteristics - Data warehouse Architecture and its components, Data
Warehouse Design Process, Data Warehouse and DBMS, Data marts, Metadata
Data Cube and OLAP, Extraction - Transformation - Loading - Schemas for Multidimensional
Database : Stars, Snowflakes and Fact Constellations, OLAP Cube - OLAP operations - OLAP
Server Architecture - Data Warehouse Implementation - From Data Warehousing to Data
Mining , Trends in data warehousing.

Data warehouses and its Characteristics

● A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction
data from one or more sources.
● A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
● A Data Warehouse is a group of data specific to the entire organization, not only to a
particular group of users.
● "Data Warehouse is a subject-oriented, integrated, and time-variant store of information
in support of management's decisions."

Subject-Oriented

A data warehouse focuses on the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular
subject, such as customer, product, or sales, instead of the organization's overall ongoing
operations. This is done by excluding data that are not useful concerning the subject and
including all data needed by the users to understand the subject.

Integrated

A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and
online transaction records. Data cleaning and integration must be performed during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among the
different data sources.
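
As a minimal sketch of this integration step, the Python below maps records from two hypothetical sources onto one naming convention and one set of attribute types. The source names, field names, and the `unify` helper are assumptions made for illustration only.

```python
from datetime import datetime

# Hypothetical rows from two source systems describing customers with
# different field names, ID types, and date formats.
crm_rows = [{"CustID": "17", "FullName": " Asha Rao ", "JoinDate": "2023-01-15"}]
billing_rows = [{"customer_no": 42, "name": "R. Menon", "joined": "15/02/2023"}]

# Mapping from the unified warehouse field names to each source's field names.
CRM_MAP = {"customer_id": "CustID", "customer_name": "FullName", "join_date": "JoinDate"}
BILLING_MAP = {"customer_id": "customer_no", "customer_name": "name", "join_date": "joined"}


def unify(row, field_map, date_format):
    """Rename source fields and coerce values to the warehouse convention."""
    return {
        "customer_id": int(row[field_map["customer_id"]]),
        "customer_name": row[field_map["customer_name"]].strip(),
        "join_date": datetime.strptime(row[field_map["join_date"]], date_format).date(),
    }


integrated = (
    [unify(r, CRM_MAP, "%Y-%m-%d") for r in crm_rows]
    + [unify(r, BILLING_MAP, "%d/%m/%Y") for r in billing_rows]
)
```

After this step every record has the same keys and types, regardless of which source it came from.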

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, or 12 months ago, or even older data, from a data warehouse. This contrasts
with a transaction system, where often only the most current data is kept.

Non-Volatile

The data warehouse is a physically separate data store, populated with data transformed from the
source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed on it. It usually requires only two
data-access operations: the initial loading of data and read access to data. Therefore, the DW
does not require transaction processing, recovery, and concurrency capabilities, which allows for
a substantial speedup of data retrieval. Non-volatile means that once data has entered the
warehouse, it should not change.
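
The time-variant and non-volatile properties can be illustrated with a toy, append-only store. This is only a sketch: the `AppendOnlyStore` class and its method names are hypothetical, and a real warehouse would enforce these properties in the database rather than in application code.

```python
from datetime import date


class AppendOnlyStore:
    """Toy warehouse table: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot_date, rows):
        # Each load is stamped with its snapshot date, so history accumulates.
        for row in rows:
            self._rows.append({"snapshot_date": snapshot_date, **row})

    def query(self, as_of):
        # Return the most recent snapshot taken on or before `as_of`.
        dates = [r["snapshot_date"] for r in self._rows if r["snapshot_date"] <= as_of]
        if not dates:
            return []
        latest = max(dates)
        return [r for r in self._rows if r["snapshot_date"] == latest]


store = AppendOnlyStore()
store.load(date(2024, 1, 31), [{"product": "pen", "stock": 120}])
store.load(date(2024, 2, 29), [{"product": "pen", "stock": 95}])

january_view = store.query(as_of=date(2024, 2, 15))  # still sees the January snapshot
current_view = store.query(as_of=date(2024, 3, 1))   # sees the February snapshot
```

Because the February load does not overwrite the January rows, both points in time remain queryable.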

Goals of Data Warehousing

● To help reporting as well as analysis

● Maintain the organization's historical information

● Be the foundation for decision making.



Need for Data Warehouse

Data Warehouse is needed for the following reasons:

1) Business users: Business users require a data warehouse to view summarized data from the
past. Since these users are often non-technical, the data may be presented to them in a simple
form.

2) Store historical data: A data warehouse is required to store time-variant data from the
past, which can then be used for various purposes.

3) Make strategic decisions: Some strategies depend on the data in the data
warehouse. So, data warehouses contribute to making strategic decisions.

4) Data consistency and quality: By bringing data from different sources to a
common place, the organization can enforce uniformity and consistency in the data.

5) Fast response time: Data warehouses have to be ready for somewhat unexpected loads and
types of queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.

2. Data warehouses are designed to handle enormous amounts of data.

3. The structure of data warehouses is easier for end-users to navigate, understand, and
query.

4. Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.

5. Data warehousing is an efficient method to manage demand for lots of information from
lots of users.

6. Data warehousing provides the capabilities to analyze a large amount of historical data.

Functions of a Data Warehouse

Data Integration

One of the main functions of a data warehouse is to integrate data from various sources. This can
include transactional systems, such as point-of-sale systems or customer relationship
management systems, as well as external data sources, such as market research or social media
data.

Data Cleaning and Transformation

Another function of a data warehouse is to clean and transform the data. This can include
removing duplicates, correcting errors, and standardizing data formats. This is important because
it ensures that the data is accurate and consistent, making it easier to analyze.

Data Consolidation

A data warehouse also consolidates data from various sources into a single, unified view. This
can include combining data from different transactional systems, such as sales and inventory
data, or combining data from different external sources, such as market research and social
media data.

Data Analysis

One of the main benefits of a data warehouse is its ability to support data analysis. This can
include running queries, creating reports, and building data visualizations. This can help
organizations gain insights into their data, identify trends and patterns, and make informed
business decisions.

Real-Life Examples

Retail Industry

A retail company can use a data warehouse to store and analyze data from its point-of-sale
systems, inventory systems, and customer relationship management systems. This can help the
company gain insights into customer purchasing habits, track inventory levels, and identify
which products are selling well. This information can be used to make informed decisions about
promotions, marketing, and product development.

Healthcare Industry

A healthcare organization can use a data warehouse to store and analyze data from its electronic
health records (EHR) systems and clinical systems. This can help the organization track patient
outcomes, identify trends in disease rates, and monitor the effectiveness of different treatments.
This information can be used to improve patient care and make informed decisions about
resource allocation.

Finance Industry

A financial institution can use a data warehouse to store and analyze data from its transactional
systems, such as trading systems and customer account systems. This can help the institution
track financial performance, identify potential fraud, and monitor compliance with regulations.
This information can be used to make informed decisions about risk management and investment
strategy.

Data warehouse Architecture and its components

Data Warehouse Architecture

A data warehouse is a centralized repository where an organization can store substantial amounts
of data from multiple source systems and locations. It is a complex information system
that contains historical and cumulative data from multiple sources. There are three approaches to
constructing Data Warehouse layers: single-tier, two-tier, and three-tier. Each is
explained below.

● Single-tier architecture
The objective of a single layer is to minimize the amount of data stored by removing
data redundancy. This architecture is not frequently used in practice.

● Two-tier architecture
A two-layer architecture physically separates the available sources from the data
warehouse. This architecture is not expandable and does not support a large number of
end-users. It also has connectivity problems because of network limitations.

● Three-Tier Data Warehouse Architecture


This is the most widely used data warehouse architecture.
It consists of the Top, Middle and Bottom Tier.
1. Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually
a relational database system. Data is cleansed, transformed, and loaded into this layer
using back-end tools.
2. Middle Tier: The middle tier in a data warehouse is an OLAP server, which is
implemented using either the ROLAP or the MOLAP model. For a user, this application tier
presents an abstracted view of the database. This layer also acts as a mediator between
the end-user and the database.
3. Top Tier: The top tier is a front-end client layer. It holds the tools and APIs that you
use to connect to the data warehouse and get data out of it. These can be query tools,
reporting tools, managed query tools, analysis tools, and data mining tools.
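
The division of labor between the three tiers can be sketched in a few lines of Python. This is an illustrative toy, not a real OLAP server: the in-memory SQLite database stands in for the bottom tier, and the `total_by` function and the `sales` table are hypothetical names for a ROLAP-style middle tier.

```python
import sqlite3

# Bottom tier: a relational database holding cleansed, loaded data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "pen", 10.0), ("North", "book", 30.0), ("South", "pen", 5.0)],
)

# Middle tier: a ROLAP-style layer that turns an abstract request
# ("total amount by <dimension>") into SQL against the relational store.
ALLOWED_DIMENSIONS = {"region", "product"}


def total_by(dimension):
    # Only whitelisted column names are interpolated into the SQL text.
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return dict(conn.execute(sql).fetchall())


# Top tier: a front-end tool talks only to the middle tier, never to the tables.
report = total_by("region")
```

The front end never sees table or column layouts, which is the "abstracted view of the database" that the middle tier provides.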

Data Warehouse Components



The Data Warehouse is based on an RDBMS server, a central information repository that
is surrounded by some key Data Warehousing components to make the entire environment
functional, manageable and accessible. The main Data Warehouse components are:

● Data Warehouse Database


The central database is the foundation of the data warehousing environment. This
database is implemented on RDBMS technology. However, this kind of
implementation is constrained by the fact that a traditional RDBMS is
optimized for transactional database processing and not for data warehousing. For
instance, ad-hoc queries, multi-table joins, and aggregates are resource intensive and slow
down performance.
Hence, alternative approaches to the database are used, as listed below:
● In a data warehouse, relational databases are deployed in parallel to allow for
scalability. Parallel relational databases also allow shared memory or shared
nothing model on various multiprocessor configurations or massively parallel
processors.
● New index structures are used to bypass relational table scan and improve speed.
● Use of multidimensional databases (MDDBs) to overcome any limitations which
are placed because of the relational Data Warehouse Models. Example: Essbase
from Oracle.

● Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)


The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and all the changes needed to transform data into a unified
format in the data warehouse. They are also called Extract, Transform and Load (ETL)
Tools.
Their functionality includes:
● Anonymizing data as per regulatory stipulations.
● Preventing unwanted data in operational databases from being loaded into the data
warehouse.
● Searching for and replacing common names and definitions for data arriving from different
sources.
● Calculating summaries and derived data.
● Populating missing data with defaults.
● De-duplicating repeated data arriving from multiple data sources.
These Extract, Transform, and Load tools may generate cron jobs, background jobs,
COBOL programs, shell scripts, etc. that regularly update data in the data warehouse. These
tools also help maintain the metadata.
ETL tools have to deal with the challenges of database and data heterogeneity.
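
Several of the ETL functions listed above can be sketched as a small transform step. The record layout, the `transform` and `summarize` helpers, and the default value are hypothetical; in particular, the truncated SHA-256 hash is only a stand-in for whatever anonymization a given regulation actually requires.

```python
import hashlib
from collections import defaultdict

# Hypothetical extracted rows, including a duplicate and a missing value.
raw_orders = [
    {"order_id": 1, "email": "a@example.com", "region": "North", "amount": 10.0},
    {"order_id": 1, "email": "a@example.com", "region": "North", "amount": 10.0},
    {"order_id": 2, "email": "b@example.com", "region": None, "amount": 25.0},
    {"order_id": 3, "email": "c@example.com", "region": "North", "amount": 5.0},
]


def anonymize(email):
    # One-way hash so the address itself is not stored in the warehouse.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def transform(rows, default_region="UNKNOWN"):
    seen, cleaned = set(), []
    for row in rows:
        if row["order_id"] in seen:  # de-duplicate on the business key
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": row["order_id"],
            "customer_key": anonymize(row["email"]),
            "region": row["region"] or default_region,  # fill missing values
            "amount": row["amount"],
        })
    return cleaned


def summarize(rows):
    # Derived data: total amount per region.
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


clean_orders = transform(raw_orders)
region_totals = summarize(clean_orders)
```

In practice a scheduler such as cron would run a job like this at regular intervals, with the load step writing `clean_orders` and `region_totals` into warehouse tables.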

● Metadata
The name metadata suggests some high-level technological data warehousing
concept. However, it is quite simple. Metadata is data about data that defines the data
warehouse. It is used for building, maintaining and managing the data warehouse.
In the Data Warehouse Architecture, metadata plays an important role as it specifies the
source, usage, values, and features of data warehouse data. It also defines how data can
be changed and processed. It is closely connected to the data warehouse.
Metadata can be classified into the following categories:
1. Technical Metadata: This kind of metadata contains information about the
warehouse that is used by data warehouse designers and administrators.
2. Business Metadata: This kind of metadata contains detail that gives end-users a
way to understand the information stored in the data warehouse.

● Query Tools

One of the primary objectives of data warehousing is to provide information to businesses to
make strategic decisions. Query tools allow users to interact with the data warehouse
system.

These tools fall into four different categories:


1. Query and reporting tools
2. Application Development tools
3. Data mining tools
4. OLAP tools
1. Query and reporting tools
Query and reporting tools can be further divided into
● Reporting tools
● Managed query tools

Reporting tools:
Reporting tools can be further divided into production reporting tools and desktop report
writers.
1. Report writers: This kind of reporting tool is designed for end-users to perform their
own analysis.
2. Production reporting: This kind of tool allows organizations to generate regular
operational reports. It also supports high-volume batch jobs like printing and
calculating. Some popular reporting tool vendors are Brio, Business Objects, Oracle,
PowerSoft, and SAS Institute.

Managed query tools:


This kind of access tool shields end users from the complexities of SQL and
database structure by inserting a metalayer between users and the database.
2. Application development tools
Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an
organization. In such cases, custom reports are developed using Application development
tools.
3. Data mining tools
Data mining is the process of discovering meaningful new correlations, patterns, and trends
by mining large amounts of data. Data mining tools are used to automate this
process.
4. OLAP tools
These tools are based on the concepts of a multidimensional database. They allow users to
analyze the data using elaborate and complex multidimensional views.

● Data warehouse Bus Architecture


The data warehouse bus determines the flow of data in your warehouse. The data flow in a
data warehouse can be categorized as inflow, upflow, downflow, outflow, and meta
flow. While designing a data bus, one needs to consider the dimensions and facts shared
across data marts.

● Data Marts
A data mart is an access layer that is used to get data out to the users. It is presented as
an alternative to a large data warehouse, as it takes less time and money to build.
However, there is no standard definition of a data mart; it differs from person to person.
Put simply, a data mart is a subsidiary of a data warehouse. It is used to
partition data for a specific group of users. Data marts can be
created in the same database as the data warehouse or in a physically separate database.

Data Warehouse Design Process

Designing a data warehouse solution involves several steps that need to be followed to ensure that the end
product is effective and meets the requirements of the business. Below are the typical steps to explain how
to design a data warehouse.



1. Requirements Gathering: As a data warehouse impacts all verticals, departments, and teams of a
company, it is essential to identify the expectations of DWH end-users. The design should meet
the present and future business needs, including security and compliance.
2. Preliminary Analysis: This step includes data source analysis, like determining the number of data
sources, data quality, data volume and more. Data warehouse consultants identify potential users
and their locations to align the project with department goals. They also collaborate with all
stakeholders to understand their vision and expectations.
3. Conceptualization: It includes determining the core and advanced functionality of the data
warehouse system. This stage begins with determining the components required in the DWH
based on the chosen deployment option (on-premise or cloud). For cloud deployment, deciding
between public, private, hybrid, and multi-cloud is essential to select the optimal architecture
option. The focus should be identifying how the chosen architecture will meet business goals and
solve problems. Usually, a solution architect and business analyst collaborate with you for this
step.
4. Project Planning: The data contained within a data warehouse determines its reliability. So, the
DWH project’s scope should be related to business objectives. The project deliverables, timelines,
resources, and budget are decided along the same lines, focusing on the findings of the preceding
stages. This stage also includes planning for disaster recovery in case of system failure.
5. Technologies Selection: This stage involves selecting technologies for your data warehouse
components like databases & data lakes. You should focus on your data security strategy and
existing analytics infrastructure, while selecting technology and tools for your DWH project.
6. System Analysis: It is essential to comprehensively analyze data sources, including their
relationship, access rules, and the quality, volume, complexity, sensitivity, type, and structure of
their data.
7. Data Governance: This stage involves setting up a data governance framework for your data
warehouse system. So, you must determine the criteria for data quality. Also, create the policies
and rules for data cleaning, data access, data usage, and data security for your DWH solution and
its users. They could include policies concerning data backup, and data encryption.
8. Data Modelling: This is probably the most complex part of designing a data warehouse, as it is
the process of visualizing data distribution within your DWH. It includes identifying data
sets/entities, creating relationships between them, determining key attributes of every data
set/entity and mapping them. It involves designing data models for the data warehouse and data
marts. A data mart is a storage area within a data warehouse that houses the data for a particular
business function. Creating data marts enhances query performance by accelerating the data
analytics speed for a specific business area. The design of data models typically starts at the data
mart level and branches out to the data warehouse. The popular data models include:

● Star Schema: It has a fact table in the center, surrounded by its associated dimension
tables.
● Snowflake Schema: It is an extension of the star schema in which dimension tables are
normalized into additional related tables.
● Galaxy Schema: It contains two or more fact tables that share dimension tables. It is
also known as a fact constellation.

Experienced system analysts work on this step of DWH design, which also includes converting
logical data models into database tables, indexes, keys, and columns.
9. ETL/ELT Processes Design: ETL (Extract, Transform, Load) is the process of pulling out data
from your data sources, cleaning and organizing the data, and feeding it into your data warehouse.
In contrast, ELT (Extract, Load, Transform) extracts the data and loads it into the DWH first,
and then processes it there for structure and quality. Depending on your DWH components and
architecture, data engineers will choose between the ETL and ELT processes and design them for
data flow control and data integration.
10. OLAP Cubes: Online Analytical Processing (OLAP) cubes help with data analysis and reporting
in the data warehouse or data mart. Your data warehouse design may or may not require them.
11. Front-end Visualization Design: Users interact with the front-end of any software, so your data
warehouse must be user-friendly with intuitive and interactive features. Popular visualization tools
like Power BI and Tableau help provide unique front-end experiences. The solution architect can
customize the front end to meet your ad-hoc reporting requirements.
12. Rolling out the data warehouse: Once you have the final design of your data warehouse, it is
time to develop and launch it.
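
The difference between ETL and ELT described in the design steps above is mainly a matter of where the transformation runs. The sketch below contrasts the two using an in-memory SQLite database as a stand-in warehouse; the table names, sample rows, and the `run_etl` and `run_elt` functions are hypothetical.

```python
import sqlite3

# Hypothetical extracted rows: untrimmed names and amounts stored as text.
source_rows = [(" alice ", "12.50"), ("BOB", "7.25")]


def run_etl(rows):
    """ETL: clean in application code, then load the finished rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in rows]
    conn.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)
    return conn.execute(
        "SELECT customer, amount FROM payments ORDER BY customer"
    ).fetchall()


def run_elt(rows):
    """ELT: load raw rows first, then transform inside the warehouse with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_payments (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE payments AS "
        "SELECT LOWER(TRIM(customer)) AS customer, CAST(amount AS REAL) AS amount "
        "FROM raw_payments"
    )
    return conn.execute(
        "SELECT customer, amount FROM payments ORDER BY customer"
    ).fetchall()


etl_result = run_etl(source_rows)
elt_result = run_elt(source_rows)
```

Both paths end with the same cleaned table. With ELT the raw rows also remain available in the warehouse (`raw_payments` here), at the cost of doing the transformation work on the warehouse engine.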

A data warehouse is a heterogeneous collection of different data sources organized under a unified
schema. There are two approaches to constructing a data warehouse, the top-down approach and the
bottom-up approach, which are explained below.

1. Top-down approach:
The essential components are discussed below:

1. External Sources – An external source is a source from which data is collected, irrespective of
the type of data. Data can be structured, semi-structured, or unstructured.
2. Staging Area – Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before it is loaded into the data warehouse. For
this purpose, an ETL tool is recommended.
3. E (Extract): Data is extracted from the external data source.
4. T (Transform): Data is transformed into the standard format.
5. L (Load): Data is loaded into the data warehouse after being transformed into the standard
format.
6. Data Warehouse – After cleansing, the data is stored in the data warehouse as a central
repository. In the top-down approach, the data warehouse stores the metadata together with the
data in its purest (most detailed) form, and the data marts hold subsets of that data.
7. Data Marts – A data mart is also part of the storage component. It stores the information of a
particular function of an organization that is handled by a single authority. An organization can
have as many data marts as it has functions. We can also say that
a data mart contains a subset of the data stored in the data warehouse.
8. Data Mining – Data mining is the practice of analyzing the large volumes of data present in a
data warehouse. It is used to find hidden patterns in the database or data
warehouse with the help of data mining algorithms.

The top-down approach is defined by Inmon: the data warehouse is a central repository for the
complete organization, and data marts
are created from it after the complete data warehouse has been created.

2. Bottom-up approach:

1. First, the data is extracted from external sources.
2. Then, the data goes through the staging area (as explained above) and is loaded into data
marts instead of the data warehouse. The data marts are created first and provide reporting
capability. Each data mart addresses a single business area.
3. These data marts are then integrated into the data warehouse.

Data Warehouse and DBMS



Data marts
A data mart is a simple form of a data warehouse that is focused on a single subject or line of
business, such as sales, finance, or marketing. Given their focus, data marts draw data from
fewer sources than data warehouses. Data mart sources can include internal operational systems,
a central data warehouse, and external data.
Data marts can be established in three ways: using a dependent approach where the mart(s) are
created from an existing data warehouse, an independent approach where data is extracted and
processed from its sources and loaded directly into the mart, and a hybrid approach where data from
an existing data warehouse is combined with data from other sources.

1. Dependent
● Also known as top-down approach, dependent data marts draw data directly from a single,
existing enterprise data warehouse. This offers centralization in that the data warehouse stores
the granular data and is the single point of reference for all dependent repositories. Also, note
in the data mart example below how data pipelines are shifting from ETL to ELT (Extract,
Load, and Transform), streaming and API.
● The marts are partitioned segments of the data warehouse and you extract well-defined
subsets of the data warehouse data as needed for analysis. These subsets can be a logical view
where virtual tables are logically separated, but not physically separated from the data
warehouse, or the subsets can be stored in physically separate repositories from the data
warehouse.
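
The two ways of carving a dependent mart out of the warehouse, as a logical view or as a physically separate copy, can be sketched with SQLite. The table and view names are hypothetical, and a second table in the same in-memory database stands in for a physically separate repository.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("marketing", "ads", 100.0), ("finance", "audit", 40.0), ("marketing", "events", 60.0)],
)

# Logical mart: a view, so no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW marketing_mart_view AS "
    "SELECT item, amount FROM warehouse_sales WHERE dept = 'marketing'"
)

# Physical mart: a separate table materialized from the warehouse.
conn.execute(
    "CREATE TABLE marketing_mart_table AS "
    "SELECT item, amount FROM warehouse_sales WHERE dept = 'marketing'"
)

# A later warehouse load is visible through the view, but not in the copy
# until the physical mart is refreshed.
conn.execute("INSERT INTO warehouse_sales VALUES ('marketing', 'print', 20.0)")

view_rows = conn.execute("SELECT COUNT(*) FROM marketing_mart_view").fetchone()[0]
table_rows = conn.execute("SELECT COUNT(*) FROM marketing_mart_table").fetchone()[0]
```

The view always reflects the current warehouse contents, while the materialized copy trades freshness for isolation from the warehouse's query load.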

2. Independent
● Independent data marts are stand-alone repositories which do not rely on your data warehouse
or other marts. Instead, the data necessary for the specific subject or business function is
extracted from the appropriate internal and/or external sources, transformed, and then loaded
to the mart. Independent data marts are relatively easy to set up and are well-suited for
short-term projects or to support small groups in your organization.

3. Hybrid
● Hybrid data marts combine data from both your data warehouse and your operational source
systems such as SaaS applications, SQL databases and flat files. The benefit of this approach
is that it gives you both access to cleansed data from the warehouse and the ability to quickly
add new sources on an ad hoc basis such as when a new geographic region is added.

Companies organize data marts in a multidimensional schema that serves as a blueprint for
addressing the needs of the people using the databases for analytical tasks. The three main
types of schema are star, snowflake, and vault.

● Star

Star schema is a logical formation of tables in a multidimensional database that resembles
a star shape. In this blueprint, one fact table (a metric set that relates to a specific
business event or process) resides at the center of the star, surrounded by several
associated dimension tables.

There is no dependency between dimension tables, so a star schema requires fewer joins
when writing queries. This structure makes querying easier, so star schemas are highly
efficient for analysts who want to access and navigate large data sets.
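
A minimal star schema might look like the sketch below, built in an in-memory SQLite database. The table names, columns, and sample rows are hypothetical; the point is that each dimension is one join away from the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per product or store.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);

-- Fact table: measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'novel', 'books');
INSERT INTO dim_store VALUES (1, 'Kochi'), (2, 'Chennai');
INSERT INTO fact_sales VALUES (1, 1, 10, 50.0), (2, 1, 2, 30.0), (1, 2, 4, 20.0);
""")

# Revenue by category needs only one join: fact table to product dimension.
revenue_by_category = dict(conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
""").fetchall())
```

Here `category` is stored directly in `dim_product`, which is the denormalization that keeps star-schema queries short.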

● Snowflake

A snowflake schema is a logical extension of a star schema, building out the blueprint
with additional dimension tables. The dimension tables are normalized to protect data
integrity and minimize data redundancy.

While this method requires less space to store dimension tables, it is a complex structure
that can be difficult to maintain. The main benefit of using snowflake schema is the low
demand for disk space, but the caveat is a negative impact on performance due to the
additional tables.
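
The trade-off can be seen in a small SQLite sketch in which the product dimension is normalized: the category name moves into its own table, so it is stored once per category rather than once per product, but the same question now needs an extra join. All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is split out of the product dimension.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue    REAL
);

INSERT INTO dim_category VALUES (1, 'stationery'), (2, 'books');
INSERT INTO dim_product VALUES (1, 'pen', 1), (2, 'pencil', 1), (3, 'novel', 2);
INSERT INTO fact_sales VALUES (1, 50.0), (2, 15.0), (3, 30.0);
""")

# Revenue by category now takes two joins instead of one.
snowflake_revenue = dict(conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
""").fetchall())
```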

● Vault

Data vault is a modern database modeling technique that enables IT professionals to
design agile enterprise data warehouses. This approach enforces a layered structure and
has been developed specifically to combat issues with agility, flexibility, and scalability
that arise when using the other schema models.

Data vault eliminates star schema's need for cleansing and streamlines the addition of
new data sources without any disruption to existing schema.

Metadata

Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of the
book.

● Metadata is the road-map to a data warehouse.


● Metadata in a data warehouse defines the warehouse objects.
● Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.

Role of Metadata

Several Examples of Metadata:


File metadata: This includes information about a file, such as its name, size, type, and creation
date.
Image metadata: This includes information about an image, such as its resolution, color depth,
and camera settings.
Music metadata: This includes information about a piece of music, such as its title, artist, album,
and genre.
Video metadata: This includes information about a video, such as its length, resolution, and
frame rate.

Categories of Metadata

● Business Metadata − It has the data ownership information, business definition, and
changing policies.
● Technical Metadata − It includes database system names, table and column names and
sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
● Operational Metadata − It includes currency of data and data lineage. Currency of data
indicates whether the data is active, archived, or purged. Lineage of data means the history
of the data's migration and the transformations applied to it.
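
As a rough illustration, the three categories can be kept side by side in one catalog entry per column. The `ColumnMetadata` class, its field names, and the `locate` helper are hypothetical; real warehouses usually keep this information in dedicated catalog tables or tools.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    # Technical metadata
    table: str
    column: str
    data_type: str
    is_primary_key: bool = False
    # Business metadata
    business_definition: str = ""
    owner: str = ""
    # Operational metadata
    status: str = "active"  # currency of data: active, archived, or purged
    lineage: list = field(default_factory=list)  # ordered transformation history


catalog = {}


def register(meta):
    catalog[(meta.table, meta.column)] = meta


def locate(term):
    """Directory lookup: find columns whose business definition mentions a term."""
    return [
        key for key, meta in catalog.items()
        if term.lower() in meta.business_definition.lower()
    ]


register(ColumnMetadata(
    table="fact_sales",
    column="revenue",
    data_type="REAL",
    business_definition="Net revenue per sale, after discounts",
    owner="Finance",
    lineage=["extracted from POS export", "converted to a single currency"],
))
register(ColumnMetadata(
    table="dim_store",
    column="city",
    data_type="TEXT",
    business_definition="City where the store is located",
    owner="Operations",
))

matches = locate("revenue")
```

The `locate` function plays the directory role mentioned earlier: a decision support tool can find where a business concept lives without knowing the table layout in advance.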
