Unit 1

Data Warehousing

Overview
In today’s rapidly changing corporate environment, organizations are turning to cloud-based technologies for convenient data collection, reporting, and analysis. This is where data warehousing comes in as a core component of business intelligence that enables businesses to enhance their performance. It is therefore important to understand what a data warehouse is and why it is evolving in the global marketplace.

The concept of data warehouses first came into use in the 1980s, when IBM researchers Paul Murphy and Barry Devlin developed the business data warehouse. American computer scientist Bill Inmon is considered the "father" of the data warehouse due to his authorship of several works, such as the Corporate Information Factory, on the building, usage, and maintenance of the data warehouse.

The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data. This data
helps analysts to take informed decisions in an organization.

An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on some data, such as a product, a supplier, or any consumer data; the executive will then have no data available to analyze, because the previous data has been overwritten by transactions.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools support interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

What Is a Data Warehouse

A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies an organization's reporting and analysis process. It also serves as the single version of truth for any company for decision making and forecasting.

Data warehouses serve as a central repository for storing and analyzing information in order to make better-informed decisions. An organization's data warehouse receives data, typically on a regular basis, from a variety of sources, including transactional systems, relational databases, and other sources.

A data warehouse can be defined as a collection of organizational data and information extracted from
operational sources and external data sources. The data is periodically pulled from various internal
applications like sales, marketing, and finance; customer-interface applications; as well as external partner
systems. This data is then made available for decision-makers to access and analyze. So what is a data warehouse? For a start, it is a comprehensive repository of current and historical information that is designed to enhance an organization's performance.

Understanding a Data Warehouse

A data warehouse is a database that is kept separate from the organization's operational database.

There is no frequent updating done in a data warehouse.

It possesses consolidated historical data, which helps the organization to analyze its business.

A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.

Data warehouse systems help in the integration of a diversity of application systems.

A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases


A data warehouse is kept separate from operational databases for the following reasons:
1. An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.
2. Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
3. An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
4. An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehousing Components


As per the data warehouse architecture, a data warehouse has four components: ETL tools (Extract, Transform, Load), a central database, access tools, and metadata. These components of the data warehouse are specifically designed for speed, so you can get results faster and flawlessly analyze data on the go.

Let us now see the components of the data warehouse in detail.


1. (Warehouse) database
The warehouse database is the first of the components of a data warehouse.
Central database- It keeps all business data in the data warehouse while making it easier to report on. There are various database types in which you can store the specific data types in the warehouse. These database types include:

Analytics database- These databases help manage and sustain analytics of data storage.
Cloud-based database- Here, the databases are hosted on and retrieved from the cloud, so that you do not have to acquire hardware to set up a data warehouse.
Typical relational databases- These are row-based databases used on a routine basis.
2. ETL (Extraction, Transformation, Loading) Tools
ETL tools are the central components of a data warehouse and help extract data from various sources. This data is then transformed into a suitable arrangement and is later loaded into the data warehouse. They allow you to extract data, fill in missing data, highlight data distribution from the central repository to BI (Business Intelligence) applications, and more.

Simply put, data is pulled from sources and altered to align it for fast analytical consumption. This is carried out by a variety of data integration strategies, including ETL, ELT, bulk-load processing, real-time data replication, data quality checks, data transformation, and more.
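
As a minimal sketch of this extract-transform-load flow in plain SQL (the staging and warehouse tables here are hypothetical; dedicated ETL tools generate or orchestrate similar statements):

-- Extract: land raw rows from the source system in a staging table.
CREATE TABLE stg_sales (
    order_id     INT,
    order_date   DATE,
    customer_id  INT,
    amount_text  VARCHAR(20)   -- raw, untyped value as delivered by the source
);

-- Target fact table in the warehouse.
CREATE TABLE fact_sales (
    order_id     INT,
    order_date   DATE,
    customer_id  INT,
    amount       DECIMAL(12, 2)
);

-- Transform + Load: clean, type-cast, and fill in missing values
-- while moving rows into the warehouse.
INSERT INTO fact_sales (order_id, order_date, customer_id, amount)
SELECT
    order_id,
    order_date,
    COALESCE(customer_id, -1),            -- fill in a missing customer key
    CAST(amount_text AS DECIMAL(12, 2))   -- standardize the data type
FROM stg_sales
WHERE order_date IS NOT NULL;             -- reject unusable rows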
3. Metadata
Metadata is termed 'data about your data.' It is one of the major components of a data warehouse. Metadata tells you everything about the usage, values, source, and other features of the data sets in the warehouse. It describes how to access the data, where it resides, and how it is structured. This technical metadata is complemented by business metadata, which adds business context to the technical data and gives users a simple understanding of the information in the warehouse.
Metadata offers users interactive and easy access, which helps them understand the content and find the data. Metadata management is carried out through a repository and accompanying software. The software runs on a workstation and maps the source data to the target database; it also generates code and controls the movement of operational data to the warehouse.
4. Access Tools
Data warehouses use a group of databases as their primary base. However, organizations cannot work with these databases directly without access tools unless a database administrator is available. To cope with changing conditions, it therefore becomes necessary to use access tools as major components of the data warehouse, including data mining tools, application development tools, OLAP tools, query/reporting tools, and more.

Data mining tools-
They streamline the process of detecting patterns and links in vast volumes of data with the use of statistical modeling methods.

OLAP tools-
These tools aid in building a multidimensional data warehouse while allowing business data analysis from various viewpoints.

Application development tools-
They help develop customized reports.

Query/reporting tools-
With these tools, corporate report production is quickly done through spreadsheets and innovative visuals.

5. Reporting layer
The reporting layer in a data warehouse gives customers access to the BI database architecture and interface. Its prime aim is to serve as a dashboard for data visualization, fetching needed information, and creating reports.

The query and reporting tools include-


Dashboard tools-
These software applications display complex business information and metrics, allowing quick understanding.

Data mining tools-
They enable customers to run detailed statistical and numerical calculations that identify patterns and analyze trends in the data.

6. Additional components
A few data warehouses contain some additional components. They are:
Logical data marts-
A logical data mart is an altered view of the main data warehouse, but it does not exist physically as an independent data element.
Operational data store-
An operational data store is an integrated database of operational data whose sources include legacy systems. It contains current and near-term information.
Dependent data marts-
A dependent data mart is a physical database that fetches all of its information from the data warehouse.

Building a Data Warehouse


A data warehouse is a heterogeneous collection of different data sources organized under a unified schema. Builders should take a broad view of the anticipated use of the warehouse while constructing it, since during the design phase there is no way to anticipate all possible queries or analyses. Some characteristics of a data warehouse are:
Subject oriented
Integrated
Time variant
Non-volatile
Building a Data Warehouse - The steps needed for building any data warehouse are as follows:

1. Extract the (transactional) data from different data sources: For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft has come up with an excellent tool, which is available free of cost when you purchase Microsoft SQL Server.

2. Transform the transactional data: Companies store their data in various DBMSs, such as MS Access, MS SQL Server, Oracle, and Sybase. They also save data in spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building the data warehouse.

3. Load the (transformed) data into the dimensional database: After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together, or it may split one field into several columns (a sketch of such a transformation appears at the end of this section). Transformation of the data can be performed at two stages: while loading the data into the dimensional model, or while extracting the data from its origins.

4. Purchase a front-end reporting tool: Top-notch analytical tools are available in the market from several major vendors. Microsoft has also released its own cost-effective tool, Data Analyzer.

Data for the warehouse is acquired from multiple, heterogeneous sources, for example databases. To ensure consistency, the data must be reformatted within the warehouse.
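
For instance, the column-level transformations mentioned in step 3 (combining several columns, or splitting one field into several) might look like the following sketch; the table and column names are hypothetical, and the split assumes a "City, State" format:

-- Combine two source columns into one warehouse column,
-- and split one source field into two columns.
INSERT INTO dim_customer (customer_id, full_name, city, state)
SELECT
    customer_id,
    first_name || ' ' || last_name AS full_name,  -- combine columns
    TRIM(SUBSTRING(city_state FROM 1 FOR POSITION(',' IN city_state) - 1)) AS city,
    TRIM(SUBSTRING(city_state FROM POSITION(',' IN city_state) + 1))       AS state
FROM src_customers;   -- assumes src_customers and dim_customer already exist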

Warehouse Databases
Relational database
Relational databases are generally made up of a set of tables with data that fits into a predefined category. Each table has at least one data category in a column, and each row has a certain data instance for the categories defined in the columns, thus forming a matrix between data and categories. The standard user and application-program interface for a relational database is Structured Query Language (SQL). Relational databases are easily extended: a new data category can be added after the original database creation without requiring much modification.
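As a small illustration of these points (the table is hypothetical), a relational table holds categories as columns and instances as rows, is queried with SQL, and can gain a new category after creation:

-- Each column is a data category; each row is one data instance.
CREATE TABLE product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    unit_price   DECIMAL(10, 2)
);

-- Extend the schema later: add a new category without touching existing rows.
ALTER TABLE product ADD COLUMN category VARCHAR(50);

-- Standard SQL is the query interface.
SELECT product_name, unit_price FROM product WHERE unit_price > 100;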
Distributed database
A distributed database is a database in which some parts of the database are stored in multiple physical
locations and in which processing is dispersed or replicated among different points in a network.
The distributed database is categorized into two forms: homogeneous and heterogeneous. All the physical locations in a homogeneous distributed database system have the same underlying hardware and run the same operating systems and database applications. In a heterogeneous distributed database, the hardware, operating systems, or database applications may be different at each location.
Cloud database
A cloud database is built for a virtual environment; it can be in a hybrid cloud, public cloud or private
cloud. Cloud databases provide benefits such as the ability to pay for storage capacity and bandwidth on a
per-user basis and they provide scalability on demand, along with high availability.
A cloud database also gives enterprises the opportunity to support business applications in a software-as-
a-service deployment.
NoSQL database
NoSQL databases should be used when there is a large set of distributed data. NoSQL databases are
effective for big data performance issues that relational databases aren’t built to solve. They are most
effective when an organization analyzes large chunks of unstructured data or data that are stored on
multiple virtual servers in the cloud.
Object-oriented database
An object-oriented database is organized around objects rather than actions and data rather than logic. For
example, a multimedia record in a relational database can be a definable data object, as opposed to an
alphanumeric value.
Graph Database
This type of database uses graph theory to store, map and query data. Graph databases are collections of
nodes and edges where each node represents an entity and each edge represents a connection between
nodes.
Mapping the Data Warehouse to a Multiprocessor Architecture
Mapping a data warehouse to a multiprocessor architecture is essential for leveraging parallel processing
capabilities to handle large volumes of data efficiently. Multiprocessor architectures can significantly
enhance the performance of data warehousing operations, such as data loading, querying, and processing.
This involves distributing the workload across multiple processors to achieve scalability, high
availability, and faster response times.
Multiprocessor Architectures
Multiprocessor systems can be categorized into two main types:
1. Symmetric Multiprocessing (SMP):
- All processors share a single, unified memory space and are controlled by a single operating system.
- Processors communicate through shared memory, making it easier to balance the load dynamically.
- Common in many commercial database systems due to its simplicity and ease of management.
2. Massively Parallel Processing (MPP):
- Each processor has its own memory and operates independently, with processors connected by a high-speed interconnect.
- Designed to handle very large data sets and complex queries by distributing the data and workload across multiple nodes.
- Suitable for large-scale data warehouses where scalability and performance are critical.

Mapping Strategies
Mapping a data warehouse to a multiprocessor architecture involves several strategies to distribute data
and processing tasks efficiently:
1. Data Partitioning (see the partitioning sketch after this list):
- Horizontal Partitioning: Divide tables into smaller subsets (partitions) based on rows. Each partition can be processed by a different processor, improving parallel query execution.
- Range Partitioning: Distribute data based on a range of values (e.g., date ranges).
- Hash Partitioning: Distribute data based on a hash function applied to a key column, ensuring even distribution.
- Vertical Partitioning: Split tables into subsets of columns. Each subset can be processed independently, which is useful for wide tables with many columns.
2. Parallel Query Execution:
- Intra-Query Parallelism: Break down a single query into smaller tasks that can be executed concurrently across multiple processors. This involves parallel scans, joins, aggregations, and sorts.
- Inter-Query Parallelism: Execute multiple queries simultaneously across different processors, improving overall throughput.
3. ETL Parallelization:
- Distribute ETL (Extract, Transform, Load) processes across multiple processors to speed up data loading and transformation.
- Use parallel ETL tools and frameworks that support distributed processing (e.g., Apache Spark, Talend).
4. Load Balancing and Resource Management:
- Ensure even distribution of workloads across processors to prevent bottlenecks and maximize resource utilization.
- Implement dynamic load balancing techniques to adjust workloads based on processor availability and performance.
5. Replication and Redundancy:
- Use data replication techniques to ensure high availability and fault tolerance. Data can be replicated across multiple processors or nodes to prevent data loss and maintain availability during failures.
- Implement failover mechanisms to switch to backup processors or nodes in case of hardware or software failures.
6. Index and Materialized View Optimization:
- Create indexes and materialized views that are optimized for parallel processing. Ensure that indexes are distributed across processors to improve query performance.
- Use partitioned indexes and materialized views to enhance parallel query execution.
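
As a concrete sketch of range and hash partitioning, here is PostgreSQL-style declarative partitioning (syntax differs across DBMSs, and the table names are hypothetical):

-- Range partitioning: each date range can live on, and be scanned by,
-- a different processor or node.
CREATE TABLE fact_sales (
    sale_id    BIGINT,
    sale_date  DATE NOT NULL,
    store_id   INT,
    amount     NUMERIC(12, 2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE fact_sales_2023 PARTITION OF fact_sales
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE fact_sales_2024 PARTITION OF fact_sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Hash partitioning: a hash of the key spreads rows evenly across partitions.
CREATE TABLE fact_orders (
    order_id    BIGINT,
    customer_id INT NOT NULL
) PARTITION BY HASH (customer_id);

CREATE TABLE fact_orders_p0 PARTITION OF fact_orders
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...partitions p1..p3 are created the same way with REMAINDER 1..3.

-- A query constrained to one range touches only that partition
-- (partition pruning), and multi-partition scans can run in parallel.
SELECT store_id, SUM(amount)
FROM fact_sales
WHERE sale_date >= DATE '2024-01-01'
GROUP BY store_id;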

Implementation Considerations
1. Hardware Configuration:
- Choose appropriate hardware that supports the chosen multiprocessor architecture. Consider factors such as the number of processors, memory capacity, storage systems, and interconnect speed.
- Ensure that the hardware supports the required level of parallelism and scalability.
2. Database Management System (DBMS):
- Use a DBMS that supports multiprocessor architectures and parallel processing. Many modern DBMSs, such as Oracle, Microsoft SQL Server, and IBM Db2, have built-in support for parallel processing and partitioning.
- Configure the DBMS to take advantage of the multiprocessor architecture, including setting parameters for parallel query execution and partitioning.
3. Data Distribution Strategy:
- Carefully design the data distribution strategy to ensure even data distribution and minimize data movement across processors.
- Consider the nature of the queries and workload patterns when designing the partitioning and distribution strategy.
4. Monitoring and Optimization:
- Continuously monitor the performance of the data warehouse to identify bottlenecks and optimize parallel processing.
- Use performance monitoring tools to track query performance, resource utilization, and system health.
5. Scalability and Maintenance:
- Design the data warehouse architecture to scale horizontally by adding more processors or nodes as data volume and workload increase.
- Implement maintenance procedures to ensure data consistency, optimize performance, and handle hardware upgrades or failures.
Mapping a data warehouse to a multiprocessor architecture involves careful planning and implementation of
data partitioning, parallel query execution, ETL parallelization, and load balancing strategies. By leveraging
the capabilities of multiprocessor systems, organizations can achieve significant improvements in data
processing speed, query performance, and overall scalability. Proper hardware configuration, DBMS support,
and continuous monitoring are essential to maximize the benefits of a multiprocessor architecture for data
warehousing.

Difference between Database System and Data Warehouse


Database System: A database system is used for the traditional way of storing and retrieving data. The major task of a database system is to perform query processing. These systems are generally referred to as online transaction processing (OLTP) systems, and they are used for the day-to-day operations of an organization.
Data Warehouse: A data warehouse is the place where a huge amount of data is stored. It is meant for users or knowledge workers in the role of data analysis and decision making. These systems are supposed to organize and present data in different formats and forms in order to serve the needs of specific users for specific purposes. These systems are referred to as online analytical processing (OLAP) systems.

Difference between Database System and Data Warehouse:


Database System | Data Warehouse
It supports operational processes. | It supports analysis and performance reporting.
Capture and maintain the data. | Explore the data.
Current data. | Multiple years of history.
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated by scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake.
Application oriented. | Subject oriented.
Primitive and highly detailed. | Summarized and consolidated.
Flat relational. | Multidimensional.

Multi Dimensional Data Model


The multidimensional data model is a method used for organizing data in the database, with good arrangement and assembly of the contents of the database.
Unlike relational databases, which allow customers to access data only in the form of queries, the multidimensional data model allows customers to ask analytical questions associated with market or business trends. It lets users rapidly receive answers to their requests by creating and examining the data comparatively quickly.
OLAP (online analytical processing) and data warehousing use multidimensional databases.
The model is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow us to model and view the data from many dimensions and perspectives. A data cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.

Working on a Multidimensional Data Model


The multidimensional data model works on the basis of pre-decided steps. Every project for building a multidimensional data model should follow these stages:
Stage 1: Assembling data from the client: In the first stage, the correct data is collected from the client. Mostly, software professionals clarify for the client the range of data that can be gained with the selected technology, and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the data is recognized and classified into the respective sections it belongs to, which also makes the model problem-free to apply step by step.
Stage 3: Noticing the different dimensions: The third stage is the basis on which the design of the system rests. In this stage, the main factors are recognized according to the user's point of view. These factors are known as "dimensions."
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the factors recognized in the previous step are used to identify the related qualities. These qualities are known as "attributes" in the database.
Stage 5: Finding the actuality of the factors listed previously and their qualities: In the fifth stage, the facts are separated and differentiated from the factors collected so far. These facts play a significant role in the arrangement of a multidimensional data model.
Stage 6: Building the schema to place the data, with respect to the information collected in the steps above: In the sixth stage, a schema is built on the basis of the data collected previously.

Data Cubes
Data that is grouped or combined in multidimensional matrices is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently queried.

A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.

For example, store XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items, and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases, because not every cell in each dimension may have corresponding data in the database.

If a query contains constants at levels even lower than those provided in a data cube, it is not clear how to make the best use of the precomputed results stored in the data cube.

The model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, like sales or transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such as Rs_sold) and keys to each of the related dimension tables.

Dimensions are the entities with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing the relationships between dimensions.

Example: In a 2-D representation, we can look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
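
Several SQL dialects (e.g., PostgreSQL, Oracle, SQL Server) can compute such a cube directly with GROUP BY CUBE; here is a hedged sketch over a hypothetical sales table:

-- Aggregate dollars_sold over every combination of the three dimensions:
-- (item, city, quarter), (item, city), (item, quarter), ..., and the grand total.
SELECT item, city, quarter, SUM(dollars_sold) AS dollars_sold
FROM sales
GROUP BY CUBE (item, city, quarter);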
Stars
A star schema is a type of data modeling technique used in data warehousing to represent data
in a structured and intuitive way. In a star schema, data is organized into a central fact table
that contains the measures of interest, surrounded by dimension tables that describe the
attributes of the measures.
The fact table in a star schema contains the measures or metrics that are of interest to the user
or organization. For example, in a sales data warehouse, the fact table might contain sales
revenue, units sold, and profit margins. Each record in the fact table represents a specific event
or transaction, such as a sale or order.
The dimension tables in a star schema contain the descriptive attributes of the measures in the
fact table. These attributes are used to slice and dice the data in the fact table, allowing users to
analyze the data from different perspectives. For example, in a sales data warehouse, the
dimension tables might include product, customer, time, and location.
In a star schema, each dimension table is joined to the fact table through a foreign key
relationship. This allows users to query the data in the fact table using attributes from the
dimension tables. For example, a user might want to see sales revenue by product category, or
by region and time period.
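
A hedged example of such a query, assuming a hypothetical star schema with a fact_sales table and product, customer, and time dimensions:

-- Sales revenue by product category, region, and quarter:
-- join the fact table to each dimension, then slice by their attributes.
SELECT
    p.product_category,
    c.region,
    t.quarter,
    SUM(f.sales_revenue) AS total_revenue
FROM fact_sales   AS f
JOIN dim_product  AS p ON f.product_id  = p.product_id
JOIN dim_customer AS c ON f.customer_id = c.customer_id
JOIN dim_time     AS t ON f.time_id     = t.time_id
GROUP BY p.product_category, c.region, t.quarter;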
The star schema is a popular data modeling technique in data warehousing because it is easy to
understand and query. The simple structure of the star schema allows for fast query response
times and efficient use of database resources. Additionally, the star schema can be easily
extended by adding new dimension tables or measures to the fact table, making it a scalable
and flexible solution for data warehousing.
The star schema is the fundamental schema among the data mart schemas, and it is the simplest. This schema is widely used to develop or build data warehouses and dimensional data marts. It includes one or more fact tables indexing any number of dimension tables. The star schema is also the basis of the snowflake schema. It is efficient for handling basic queries.
It is said to be a star because its physical model resembles the star shape, having a fact table at its center and the dimension tables at its periphery representing the star's points. Below is an example to demonstrate the star schema:
In this demonstration, SALES is a fact table having the attributes Product ID, Order ID, Customer ID, Employee ID, Total, Quantity, and Discount, which reference the dimension tables. The Employee dimension table contains the attributes Emp ID, Emp Name, Title, Department, and Region. The Product dimension table contains the attributes Product ID, Product Name, Product Category, and Unit Price. The Customer dimension table contains the attributes Customer ID, Customer Name, Address, City, and Zip. The Time dimension table contains the attributes Order ID, Order Date, Year, Quarter, and Month.
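
In DDL, the schema just described might be sketched as follows (key and type choices are assumptions based on the attributes listed above):

CREATE TABLE dim_employee (
    emp_id     INT PRIMARY KEY,
    emp_name   VARCHAR(100),
    title      VARCHAR(50),
    department VARCHAR(50),
    region     VARCHAR(50)
);

CREATE TABLE dim_product (
    product_id       INT PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50),
    unit_price       DECIMAL(10, 2)
);

CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    address       VARCHAR(200),
    city          VARCHAR(50),
    zip           VARCHAR(10)
);

-- Keyed by Order ID, mirroring the attribute list above.
CREATE TABLE dim_time (
    order_id   INT PRIMARY KEY,
    order_date DATE,
    year       INT,
    quarter    INT,
    month      INT
);

-- The central fact table references every dimension through a foreign key.
CREATE TABLE sales (
    product_id  INT REFERENCES dim_product(product_id),
    order_id    INT REFERENCES dim_time(order_id),
    customer_id INT REFERENCES dim_customer(customer_id),
    emp_id      INT REFERENCES dim_employee(emp_id),
    total       DECIMAL(12, 2),
    quantity    INT,
    discount    DECIMAL(5, 2)
);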
Model of Star Schema:
In a star schema, business process data that holds the quantitative data about a business is distributed across fact tables, and dimensions, which are descriptive characteristics related to the fact data. Sales price, sale quantity, distance, speed, weight, and weight measurements are a few examples of fact data in a star schema.
A star schema having multiple dimensions is often termed a centipede schema. A star schema that has dimensions with only a few attributes is easy to handle.
Advantages of Star Schema:
1. Simpler Queries -
The join logic of a star schema is quite simple in comparison to the join logic needed to fetch data from a transactional schema that is highly normalized.
2. Simplified Business Reporting Logic -
In comparison to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period comparisons.
3. Feeding Cubes -
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.
Disadvantages of Star Schema -
1. Data integrity is not enforced well, since the schema is in a highly denormalized state.
2. It is not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas don't reinforce many-to-many relationships within business entities - at least not frequently.

Features:
Central fact table: The star schema revolves around a central fact table that contains the
numerical data being analyzed. This table contains foreign keys to link to dimension tables.
Dimension tables: Dimension tables are tables that contain descriptive attributes about the
data being analyzed. These attributes provide context to the numerical data in the fact table.
Each dimension table is linked to the fact table through a foreign key.
Denormalized structure: A star schema is denormalized, which means that redundancy is
allowed in the schema design to improve query performance. This is because it is easier and
faster to join a small number of tables than a large number of tables.
Simple queries: Star schema is designed to make queries simple and fast. Queries can be
written in a straightforward manner by joining the fact table with the appropriate dimension
tables.
Aggregated data: The numerical data in the fact table is usually aggregated at different levels
of granularity, such as daily, weekly, or monthly. This allows for analysis at different levels of
detail.
Fast performance: Star schema is designed for fast query performance. This is because the
schema is denormalized and data is pre-aggregated, making queries faster and more efficient.
Easy to understand: The star schema is easy to understand and interpret, even for non-
technical users. This is because the schema is designed to provide context to the numerical data
through the use of dimension tables.
Snowflakes
The snowflake schema is a variant of the star schema. Here, the centralized fact table is
connected to multiple dimensions. In the snowflake schema, dimensions are present in
a normalized form in multiple related tables. The snowflake structure materializes when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables and does not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data warehousing to
represent data in a structured way that is optimized for querying large amounts of data
efficiently. In a snowflake schema, the dimension tables are normalized into multiple related
tables, creating a hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the schema, surrounded by
the dimension tables. However, each dimension table is further broken down into multiple
related tables, creating a hierarchical structure that resembles a snowflake.
For example, in a sales data warehouse, the product dimension table might be normalized into
multiple related tables, such as product category, product subcategory, and product details.
Each of these tables would be related to the product dimension table through a foreign
key relationship.
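
A sketch of that normalization in DDL (hypothetical names), followed by the extra joins it implies:

-- Snowflaked product dimension: the category hierarchy is split out
-- into its own normalized tables.
CREATE TABLE dim_product_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product_subcategory (
    subcategory_id   INT PRIMARY KEY,
    subcategory_name VARCHAR(50),
    category_id      INT REFERENCES dim_product_category(category_id)
);

CREATE TABLE dim_product (
    product_id     INT PRIMARY KEY,
    product_name   VARCHAR(100),
    subcategory_id INT REFERENCES dim_product_subcategory(subcategory_id)
);

-- Reaching the category name now costs two extra joins versus a star schema.
SELECT c.category_name, COUNT(*) AS product_count
FROM dim_product p
JOIN dim_product_subcategory s ON p.subcategory_id = s.subcategory_id
JOIN dim_product_category    c ON s.category_id    = c.category_id
GROUP BY c.category_name;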
Example:
The Employee dimension table now contains the attributes EmployeeID, EmployeeName, DepartmentID, Region, and Territory. The DepartmentID attribute links the Employee table with the Department dimension table. The Department dimension is used to provide detail about each department, such as the Name and Location of the department. The Customer dimension table now contains the attributes CustomerID, CustomerName, Address, and CityID. The CityID attribute links the Customer dimension table with the City dimension table. The City dimension table has details about each city, such as city name, Zipcode, State, and Country.
What is Snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension table. In other words, a dimension table is said to be snowflaked if the low-cardinality attributes of the dimension have been divided into separate normalized tables. These tables are then joined to the original dimension table with referential constraints (foreign key constraints).
Generally, snowflaking is not recommended for dimension tables, as it hampers the understandability and performance of the dimensional model, since more tables must be joined to satisfy queries.
Difference Between Snowflake and Star Schema
The main difference between star schema and snowflake schema is that the dimension table of
the snowflake schema is maintained in the normalized form to reduce redundancy. The
advantage here is that such tables (normalized) are easy to maintain and save storage space.
However, it also means that more joins will be needed to execute the query. This will
adversely impact system performance.
However, the snowflake schema can also be more complex to query than a star schema
because it requires more table joins. This can result in slower query response times and higher
resource usage in the database. Additionally, the snowflake schema can be more difficult to
understand and maintain because of the increased complexity of the schema design.
The decision to use a snowflake schema versus a star schema in a data warehousing project
will depend on the specific requirements of the project and the trade-offs between query
performance, schema complexity, and data integrity.
Characteristics of Snowflake Schema
- The snowflake schema uses small disk space.
- It is easy to implement dimensions that are added to the schema.
- There are multiple tables, so performance is reduced.
- The dimension table consists of two or more sets of attributes that define information at different grains.
- The sets of attributes of the same dimension table are populated by different source systems.
Features of the Snowflake Schema
- Normalization: The snowflake schema is a normalized design, which means that data is organized into multiple related tables. This reduces data redundancy and improves data consistency.
- Hierarchical Structure: The snowflake schema has a hierarchical structure that is organized around a central fact table. The fact table contains the measures or metrics of interest, and the dimension tables contain the attributes that provide context to the measures.
- Multiple Levels: The snowflake schema can have multiple levels of dimension tables, each related to the central fact table. This allows for more granular analysis of data and enables users to drill down into specific subsets of data.
- Joins: The snowflake schema typically requires more complex SQL queries that involve multiple table joins. This can impact performance, especially when dealing with large data sets.
- Scalability: The snowflake schema is scalable and can handle large volumes of data. However, the complexity of the schema can make it difficult to manage and maintain.
Advantages of Snowflake Schema
- It provides structured data, which reduces the problem of data integrity.
- It uses small disk space because the data is highly structured.
Disadvantages of Snowflake Schema
- Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant.
- Avoid snowflaking or normalization of a dimension table, unless required and appropriate.
- Do not snowflake hierarchies of a dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflaked.
- Multiple hierarchies that can belong to the same dimension should be designed at the lowest possible detail.

Fact Constellation (Galaxy Schema):


The fact constellation schema, also known as a galaxy schema, is a more complex design that
involves multiple fact tables sharing dimension tables. It is used when there are multiple fact
tables with different measures and each fact table is related to several common dimension tables.

A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a galaxy schema.

The fact constellation schema describes a logical structure of a data warehouse or data mart. It can be designed with a collection of de-normalized fact, shared, and conformed dimension tables.

The fact constellation schema is a sophisticated design in which it is difficult to summarize information. It can be implemented between aggregate fact tables, or by decomposing a complex fact table into independent simplex fact tables.

Example: Consider the fact constellation schema described below.

This schema defines two fact tables, sales and shipping. Sales are treated along four dimensions, namely time, item, branch, and location. The schema contains a fact table for sales that includes keys to each of the four dimensions, along with two measures: Rupee_sold and units_sold. The shipping table has five dimensions, or keys - item_key, time_key, shipper_key, from_location, and to_location - and two measures: Rupee_cost and units_shipped.
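
In DDL, the two fact tables might be sketched as follows (types are assumptions; the shared dimension tables such as dim_time and dim_item are assumed to exist and are indicated in comments):

CREATE TABLE sales (
    time_key     INT,              -- key into the shared time dimension
    item_key     INT,              -- key into the shared item dimension
    branch_key   INT,              -- key into the branch dimension
    location_key INT,              -- key into the shared location dimension
    rupee_sold   DECIMAL(12, 2),   -- measure
    units_sold   INT               -- measure
);

CREATE TABLE shipping (
    item_key      INT,             -- shared with sales
    time_key      INT,             -- shared with sales
    shipper_key   INT,             -- key into the shipper dimension
    from_location INT,             -- key into the shared location dimension
    to_location   INT,             -- key into the shared location dimension
    rupee_cost    DECIMAL(12, 2),  -- measure
    units_shipped INT              -- measure
);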

The primary disadvantage of the fact constellation schema is that it is a more challenging design, because many variants for specific kinds of aggregation must be considered and selected.

Concept Hierarchy in Data Mining

In data mining, a concept hierarchy refers to the organization of data into a tree-like structure in which each level of the hierarchy represents a concept that is more general than the level below it. This hierarchical organization of data allows for more efficient and effective data analysis, as well as the ability to drill down to more specific levels of detail when needed.
Concept hierarchies are used to organize and classify data in a way that makes it more understandable and easier to analyze. The main idea is that the same data can have different levels of granularity or detail, and that by organizing the data in a hierarchical fashion, it is easier to understand and analyze.
Explanation:
Consider a concept hierarchy for the dimension location, from which the user can easily retrieve the data. To make it easy to evaluate, the data is represented in a tree-like structure. The top of the tree is the main dimension, location, which further splits into various sub-nodes. The root node, location, splits into two country nodes, i.e., USA and India. These countries are further split into more sub-nodes that represent the province or state, i.e., New York, Illinois, Gujarat, and UP. Thus the concept hierarchy in this example organizes the data into a tree-like structure in which each level is more general than the level below it.
The hierarchical structure represents the abstraction levels of the dimension location, which consists of several footprints of the dimension, such as street, city, province or state, and country.
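
Rolling up along such a hierarchy maps naturally onto SQL's GROUP BY ROLLUP; here is a hedged sketch over a hypothetical location dimension and fact table:

-- Aggregate at every level of the location hierarchy:
-- (country, province_state, city), (country, province_state), (country), grand total.
SELECT l.country, l.province_state, l.city, SUM(f.units_sold) AS units_sold
FROM fact_sales f
JOIN dim_location l ON f.location_key = l.location_key
GROUP BY ROLLUP (l.country, l.province_state, l.city);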
Types of Concept Hierarchies

1. Schema Hierarchy: A schema hierarchy is a type of concept hierarchy that is used to organize the schema of a database in a logical and meaningful way, grouping similar objects together. A schema hierarchy can be used to organize different types of data, such as tables, attributes, and relationships. This can be useful in data warehousing, where data from multiple sources needs to be integrated into a single database.
2. Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of concept hierarchy that is
based on set theory, where each set in the hierarchy is defined in terms of its membership
in other sets. Set-grouping hierarchy can be used for data cleaning, data pre-processing and
data integration. This type of hierarchy can be used to identify and remove outliers, noise,
or inconsistencies from the data and to integrate data from multiple sources.
3. Operation-Derived Hierarchy: An Operation-Derived Hierarchy is a type of concept
hierarchy that is used to organize data by applying a series of operations or transformations
to the data. The operations are applied in a top-down fashion, with each level of the
hierarchy representing a more general or abstract view of the data than the level below it.
This type of hierarchy is typically used in data mining tasks such as clustering and
dimensionality reduction. The operations applied can be mathematical or statistical
operations such as aggregation and normalization.
4. Rule-based Hierarchy: A rule-based hierarchy is a type of concept hierarchy that is used to organize data by applying a set of rules or conditions to the data (see the sketch after this list). This type of hierarchy is useful in data mining tasks such as classification, decision-making, and data exploration. It allows the assignment of a class label or decision to each data point based on its characteristics, and identifies patterns and relationships between different attributes of the data.
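
As a small illustration of a rule-based hierarchy, a set of conditions can assign each record to a more general concept; the table, column, and band boundaries here are hypothetical:

-- Derive a coarser concept ("price_band") from a detailed attribute by rule.
SELECT
    product_name,
    unit_price,
    CASE
        WHEN unit_price < 100  THEN 'budget'
        WHEN unit_price < 1000 THEN 'mid-range'
        ELSE 'premium'
    END AS price_band
FROM dim_product;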

Need of Concept Hierarchy in Data Mining

There are several reasons why a concept hierarchy is useful in data mining:
1. Improved Data Analysis: A concept hierarchy can help to organize and simplify data,
making it more manageable and easier to analyze. By grouping similar concepts together, a
concept hierarchy can help to identify patterns and trends in the data that would otherwise
be difficult to spot. This can be particularly useful in uncovering hidden or unexpected
insights that can inform business decisions or the development of new products or services.
2. Improved Data Visualization and Exploration: A concept hierarchy can help to improve
data visualization and data exploration by organizing data into a tree-like structure,
allowing users to easily navigate and understand large and complex data sets. This can be
particularly useful in creating interactive dashboards and reports that allow users to easily
drill down to more specific levels of detail when needed.
3. Improved Algorithm Performance: The use of a concept hierarchy can also help to
improve the performance of data mining algorithms. By organizing data into a hierarchical
structure, algorithms can more easily process and analyze the data, resulting in faster and
more accurate results.
4. Data Cleaning and Pre-processing: A concept hierarchy can also be used in data cleaning
and pre-processing, to identify and remove outliers and noise from the data.
5. Domain Knowledge: A concept hierarchy can also be used to represent the domain
knowledge in a more structured way, which can help in a better understanding of the data
and the problem domain.
