
Data Mining Unit-2 notes

A data warehouse is a centralized repository for storing and analyzing data from various sources to support business intelligence and decision-making. Key characteristics include being subject-oriented, integrated, non-volatile, and time-variant, while its architecture consists of a bottom tier (data storage), middle tier (OLAP server), and top tier (client interface). Data warehouses facilitate complex analytics and reporting, with applications across various industries such as banking, finance, and healthcare.

Uploaded by

kumariritu020503

Data Mining Unit-2

What Is a Data Warehouse

A data warehouse is a central repository for storing, analyzing, and interpreting information in order to
support better-informed decisions. An organization's data warehouse receives data from a variety of
sources, typically on a regular basis, including transactional systems, relational databases, and other sources.

A data warehouse is a type of data management system that facilitates and supports business intelligence
(BI) activities, specifically analysis. Data warehouses are primarily designed to facilitate searches and
analyses and usually contain large amounts of historical data.

A data warehouse can be defined as a collection of organizational data and information extracted from
operational sources and external data sources. The data is periodically pulled from various internal
applications like sales, marketing, and finance; customer-interface applications; as well as external partner
systems. This data is then made available for decision-makers to access and analyze. In short, a data
warehouse is a comprehensive repository of current and historical information that is designed to
enhance an organization’s performance.

Key Characteristics of Data Warehouse

The main characteristics of a data warehouse are as follows:

• Subject-Oriented

A data warehouse is subject-oriented: it organizes information by subject rather than around the overall
processes of a business. Such subjects may be sales, promotion, inventory, etc. For example, if you want
to analyze your company’s sales data, you need to build a data warehouse that concentrates on sales. Such
a warehouse would provide valuable information like ‘who was your best customer last year?’ or ‘who is
likely to be your best customer in the coming year?’

• Integrated

A data warehouse is developed by integrating data from varied sources into a consistent format. The data
must be stored in the warehouse in a consistent and universally acceptable manner in terms of naming,
format, and coding. This facilitates effective data analysis.

• Non-Volatile

Data once entered into a data warehouse remains unchanged. All data is read-only, and previous data is
not erased when current data is entered. This helps you to analyze what has happened and when.

• Time-Variant

The data stored in a data warehouse is documented with an element of time, either explicitly or implicitly.
For example, the primary key of a warehouse record contains an element of time, such as the day, week,
or month.
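
The non-volatile and time-variant characteristics can be illustrated with a minimal Python sketch. The class name SalesHistory, the customer IDs, and the sales figures are hypothetical and chosen only for illustration; a real warehouse would enforce these rules in its database, not in application code.

```python
from datetime import date


class SalesHistory:
    """Hypothetical append-only store keyed by (customer_id, snapshot_date)."""

    def __init__(self):
        self._rows = {}

    def load(self, customer_id, snapshot_date, total_sales):
        # Time-variant: the key always carries an element of time.
        key = (customer_id, snapshot_date)
        # Non-volatile: an existing row is never overwritten or erased.
        if key in self._rows:
            raise ValueError(f"row {key} is already loaded and is read-only")
        self._rows[key] = total_sales

    def history(self, customer_id):
        # Return every snapshot for one customer, oldest first.
        return sorted(
            (day, total)
            for (cid, day), total in self._rows.items()
            if cid == customer_id
        )


history = SalesHistory()
history.load("C001", date(2023, 1, 31), 1200.0)
history.load("C001", date(2023, 2, 28), 1750.0)
print(history.history("C001"))
```

Because new snapshots are added next to the old ones, the January figure is still available after the February load, which is what makes it possible to analyze what happened and when.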

In today’s rapidly changing corporate environment, organizations are turning to cloud-based technologies
for convenient data collection, reporting, and analysis. Data warehousing is a core component of business
intelligence that enables businesses to enhance their performance, so it is important to understand what a
data warehouse is and why its use is growing in the global marketplace.

These notes give an overview of the data warehouse and cover key concepts such as data warehouse
architecture, the characteristics and benefits of a data warehouse, and data warehouse applications.

Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities, they are not the same
thing. The main difference is that a database collects data for many transactional purposes, whereas a
data warehouse collects data on an extensive scale to perform analytics. Databases provide real-time
data, while warehouses store data to be accessed by large analytical queries.

A data warehouse is an example of an OLAP (Online Analytical Processing) system, that is, an online
system for answering database queries. An OLTP (Online Transaction Processing) system is an online
system for modifying a database, for example, an ATM.
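
The contrast between the two kinds of workload can be sketched with Python's built-in sqlite3 module. The table names, account numbers, and amounts below are hypothetical, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 900.0)])


def withdraw(account_id, year, amount):
    """OLTP-style work: a short transaction that modifies a few rows (like an ATM)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        conn.execute(
            "INSERT INTO withdrawals VALUES (?, ?, ?)",
            (account_id, year, amount),
        )


withdraw(1, 2023, 100.0)
withdraw(2, 2023, 50.0)
withdraw(1, 2024, 25.0)

# OLAP-style work: a read-only query that aggregates over the whole history.
totals = conn.execute(
    "SELECT year, SUM(amount) FROM withdrawals GROUP BY year ORDER BY year"
).fetchall()
print(totals)  # [(2023, 150.0), (2024, 25.0)]
```

The withdraw function touches individual rows and must be fast and safe, while the final query scans many rows only to summarize them.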

Data Warehouse Architecture

Bottom Tier

The bottom tier or data warehouse server usually represents a relational database system. Back-end tools
are used to cleanse, transform and feed data into this layer.

Middle Tier

The middle tier represents an OLAP server that can be implemented in two ways.
The ROLAP or Relational OLAP model is an extended relational database management system that maps
operations on multidimensional data to standard relational operations.

The MOLAP or Multidimensional OLAP model acts directly on multidimensional data and operations.
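
As a rough illustration of the ROLAP idea, the sketch below translates a multidimensional request ("total units by these dimensions") into an ordinary SQL GROUP BY query and runs it with Python's sqlite3 module. The function name cuboid_query, the sales table, and its rows are assumptions made for this example and do not come from any particular OLAP product.

```python
import sqlite3


def cuboid_query(table, dimensions, measure):
    """Map 'aggregate the measure by these dimensions' to standard SQL.

    Identifiers are inserted directly into the SQL text, so this sketch is
    only suitable for trusted, hard-coded table and column names.
    """
    if not dimensions:
        return f"SELECT SUM({measure}) FROM {table}"
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM({measure}) FROM {table} GROUP BY {cols} ORDER BY {cols}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, item_type TEXT, year INTEGER, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("A", "phone", 1997, 10),
        ("A", "computer", 1997, 5),
        ("B", "phone", 1997, 7),
        ("A", "phone", 1998, 12),
    ],
)

by_branch = conn.execute(cuboid_query("sales", ["branch"], "units")).fetchall()
by_branch_year = conn.execute(cuboid_query("sales", ["branch", "year"], "units")).fetchall()
grand_total = conn.execute(cuboid_query("sales", [], "units")).fetchall()
print(by_branch)       # [('A', 27), ('B', 7)]
print(by_branch_year)  # [('A', 1997, 15), ('A', 1998, 12), ('B', 1997, 7)]
print(grand_total)     # [(34,)]
```

A MOLAP server would answer the same requests from a multidimensional array instead of generating SQL.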

Top Tier

This is the front-end client interface that gets data out from the data warehouse. It holds various tools like
query tools, analysis tools, reporting tools, and data mining tools.

How Data Warehouse Works

Data warehousing integrates data and information collected from various sources into one comprehensive
database. For example, a data warehouse might combine customer information from an organization’s
point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential
information about employees, such as salary information. Businesses use this combined data to analyze
their customers.

Data mining, one of the uses of a data warehouse, involves looking for meaningful patterns in vast
volumes of data and devising strategies for increased sales and profits.

What is data warehouse modeling?

Data warehouse modeling is the process of designing and organizing your data models within your data
warehouse platform. The design and organization process consists of setting up the appropriate databases
and schemas so that the data can be transformed and then stored in a way that makes sense to the end user.
The goal is to transform raw data into a format that is easily understandable, queryable, and optimized for
business intelligence and analytics purposes.

The Importance of Data Warehouse Modeling


Effective data warehouse modeling is crucial for maintaining data quality, optimizing performance, and
enabling complex analytics. A well-designed model not only improves query efficiency but also supports
data governance and self-service analytics, empowering business users to derive insights independently.

• Enhanced data quality and consistency
• Improved query performance and scalability
• Better support for complex analytics and reporting
• Easier data governance and compliance
• Facilitation of self-service analytics for business users

What is OLAP?

OLAP stands for Online Analytical Processing, a technology that enables multi-dimensional
analysis of business data. It provides interactive access to large amounts of data and supports complex
calculations and data aggregation. OLAP is used to support business intelligence and decision-making
processes.

A grouping of data in a multidimensional matrix is called a data cube. In data warehousing, we generally
deal with multidimensional data models, because the data is represented by multiple dimensions
and multiple attributes. This multidimensional data is represented in a data cube, since a cube can
represent a high-dimensional space. The data cube shows pictorially how the different attributes of the
data are arranged in the data model.

[Figure: a general data cube]

The example in the figure is a 3D cube with the attributes branch (A, B, C, D), item type (home,
entertainment, computer, phone, security), and year (1997, 1998, 1999).
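
A cube with these three dimensions can be sketched in plain Python as a nested list. The dimension values come from the example above, but the cell values are made-up numbers generated by a formula, and the helper names cell and total_by_year are hypothetical.

```python
branches = ["A", "B", "C", "D"]
item_types = ["home", "entertainment", "computer", "phone", "security"]
years = [1997, 1998, 1999]

# cube[b][i][y] holds a made-up "units sold" figure for one
# (branch, item type, year) combination.
cube = [
    [[(b + 1) * (i + 1) + y for y in range(len(years))] for i in range(len(item_types))]
    for b in range(len(branches))
]


def cell(branch, item_type, year):
    """Look up a single cell of the cube by its three dimension values."""
    return cube[branches.index(branch)][item_types.index(item_type)][years.index(year)]


def total_by_year(year):
    """Summarize over branch and item type, leaving only the year dimension."""
    y = years.index(year)
    return sum(
        cube[b][i][y]
        for b in range(len(branches))
        for i in range(len(item_types))
    )


print(cell("B", "computer", 1998))  # 7
print(total_by_year(1997))          # 150
print(total_by_year(1999))          # 190
```

Each cell is addressed by one value per dimension, and summing along one or more dimensions gives a more summarized view of the same data.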

Data cube classification:


The data cube can be classified into two categories:
• Multidimensional data cube: It stores large amounts of data in a multi-dimensional array. It improves
efficiency by keeping an index of each dimension, so it can retrieve data quickly.
• Relational data cube: It stores large amounts of data in relational tables. Each relational table
represents the dimensions of the data cube. It is slower than a multidimensional data cube.
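
The difference between the two storage styles can be sketched as follows. The data and the function names molap_lookup and rolap_lookup are hypothetical, and the row-by-row scan is a deliberate simplification: real relational engines use their own indexes and query optimizers.

```python
branches = ["A", "B"]
years = [1997, 1998]

# Multidimensional cube: a dense array plus one index per dimension
# that maps a dimension value to its position in the array.
branch_index = {name: pos for pos, name in enumerate(branches)}
year_index = {year: pos for pos, year in enumerate(years)}
array = [
    [10, 12],  # branch A: 1997, 1998
    [7, 9],    # branch B: 1997, 1998
]


def molap_lookup(branch, year):
    """Jump straight to the cell using the per-dimension indexes."""
    return array[branch_index[branch]][year_index[year]]


# Relational cube: the same cells stored as rows of a table.
rows = [
    ("A", 1997, 10),
    ("A", 1998, 12),
    ("B", 1997, 7),
    ("B", 1998, 9),
]


def rolap_lookup(branch, year):
    """Find the matching row by scanning the table (simplified)."""
    for row_branch, row_year, units in rows:
        if row_branch == branch and row_year == year:
            return units
    return None


print(molap_lookup("B", 1998))  # 9
print(rolap_lookup("B", 1998))  # 9
```

Both lookups return the same value; the array version reaches the cell directly, while the table version has to search for it.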

Advantages of data cubes:

• Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business data,
allowing users to view data from different perspectives and levels of detail.
• Interactivity: Data cubes provide interactive access to large amounts of data, allowing users to
easily navigate and manipulate the data to support their analysis.
• Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and efficient
querying and aggregation of data.
• Data aggregation: Data cubes support complex calculations and data aggregation, enabling users to
quickly and easily summarize large amounts of data.
• Improved decision-making: Data cubes provide a clear and comprehensive view of business data,
enabling improved decision-making and business intelligence.
• Accessibility: Data cubes can be accessed from a variety of devices and platforms, making it easy for
users to access and analyze business data from anywhere.
• Data cubes give a summarized view of data.
• Data cubes store large amounts of data in a simple way.
• Data cube operations provide quick and effective analysis.
• Data cubes improve query performance.

Disadvantages of data cubes:

• Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical
expertise.
• Data size limitations: OLAP systems can struggle with very large data sets and may require
extensive data aggregation or summarization.
• Performance issues: OLAP systems can be slow when dealing with large amounts of data,
especially when running complex queries or calculations.
• Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of OLAP
analysis.
• Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need
for specialized hardware and software.
• Inflexibility: OLAP systems may not easily accommodate changing business needs and may require
significant effort to modify or extend.

Types of OLAP Servers

There are four types of OLAP servers:

• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers

Relational OLAP

ROLAP servers are plac

Components of Data Warehouse


The main components of a data warehouse include:
• Data Sources: These are the various operational systems, databases, and external data feeds that
provide raw data to be stored in the warehouse.
• ETL (Extract, Transform, Load) Process: The ETL process is responsible for extracting data from
different sources, transforming it into a suitable format, and loading it into the data warehouse.
• Data Warehouse Database: This is the central repository where cleaned and transformed data is
stored. It is typically organized in a multidimensional format for efficient querying and reporting.
• Metadata: Metadata describes the structure, source, and usage of data within the warehouse, making
it easier for users and systems to understand and work with the data.
• Data Marts: These are smaller, more focused data repositories derived from the data warehouse,
designed to meet the needs of specific business departments or functions.
• OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data in multiple
dimensions, providing deeper insights and supporting complex analytical queries.
• End-User Access Tools: These are reporting and analysis tools, such as dashboards or Business
Intelligence (BI) tools, that enable business users to query the data warehouse and generate reports.
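
The ETL component listed above can be sketched in a few lines of Python. The two source systems, their field names, and their values are hypothetical; the point is only to show inconsistent naming, date formats, and units being brought into one consistent format before loading.

```python
from datetime import datetime

# Hypothetical raw records from two source systems that use different
# field names, date formats, and units.
POS_SALES = [{"cust": "c001", "sale_date": "2023-01-31", "amt": "1200.50"}]
WEB_SALES = [{"customer_id": "C002", "date": "31/01/2023", "amount_cents": 99900}]


def extract():
    """Extract: pull the raw records from each source."""
    return POS_SALES, WEB_SALES


def transform(pos_rows, web_rows):
    """Transform: convert both sources to one naming, date, and amount format."""
    unified = []
    for row in pos_rows:
        unified.append({
            "customer_id": row["cust"].upper(),
            "sale_date": datetime.strptime(row["sale_date"], "%Y-%m-%d").date(),
            "amount": float(row["amt"]),
        })
    for row in web_rows:
        unified.append({
            "customer_id": row["customer_id"].upper(),
            "sale_date": datetime.strptime(row["date"], "%d/%m/%Y").date(),
            "amount": row["amount_cents"] / 100,
        })
    return unified


def load(warehouse, rows):
    """Load: append the cleaned rows to the warehouse table."""
    warehouse.extend(rows)


warehouse = []
load(warehouse, transform(*extract()))
for row in warehouse:
    print(row)
```

After the transform step, both records share the same field names, date type, and currency unit, which is what the integrated characteristic of a data warehouse requires.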

Applications of Data Warehouse


1. Banking Industry
Bankers can better manage all of their available resources with the right Data Warehousing solution. They
can better analyze consumer data, government regulations, and market trends to facilitate better decision-
making.
2. Finance Industry
3. Consumer Goods Industry
4. Government and Education
5. Healthcare
6. Hospitality Industry
7. Insurance
8. Manufacturing and Distribution Industry
9. Telephone Industry
10. Services Sector

Data Generalization by Attribute-Oriented Induction

Data generalization is the process of summarizing data by replacing relatively low-level values with
higher-level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization:
1. Data cube approach:
• It is also known as the OLAP approach.
• It is efficient for summarizing historical data, for example to chart past sales.
• In this approach, computations and their results are stored in the data cube.
• It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
• It uses roll-up and drill-down operations on a data cube.
• These operations typically involve aggregate functions, such as count(), sum(), average(), and max().
• The resulting materialized views can then be used for decision support, knowledge discovery, and many
other applications.
2. Attribute-oriented induction:
• It is an on-line, query-oriented, generalization-based data analysis approach. At least in its initial
proposal, it is a relational database query-oriented technique.
• In this approach, generalization is performed on the basis of the number of distinct values of each
attribute within the relevant data set. Identical generalized tuples are then merged and their respective
counts are accumulated in order to perform aggregation.
• It is not limited to particular measures or to categorical data.
• Attribute-oriented induction uses two methods:
(i) Attribute removal.
(ii) Attribute generalization.
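
The two methods of attribute-oriented induction can be sketched on a small, made-up student relation. The records, the city-to-country concept hierarchy, and the five-year age ranges are assumptions chosen for illustration; a full implementation would also use thresholds to decide how far each attribute should be generalized.

```python
from collections import Counter

# Hypothetical task-relevant data.
students = [
    {"name": "Asha", "city": "Delhi", "age": 21},
    {"name": "Ravi", "city": "Mumbai", "age": 22},
    {"name": "Chen", "city": "Beijing", "age": 20},
    {"name": "Li", "city": "Shanghai", "age": 26},
    {"name": "Meera", "city": "Delhi", "age": 23},
]

# Hypothetical concept hierarchy: city -> country.
city_to_country = {
    "Delhi": "India",
    "Mumbai": "India",
    "Beijing": "China",
    "Shanghai": "China",
}


def age_range(age):
    """Generalize an exact age to a five-year range such as '20-24'."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def generalize(row):
    # Attribute removal: 'name' has many distinct values and no higher-level
    # concept, so it is dropped.
    # Attribute generalization: 'city' and 'age' are replaced by
    # higher-level concepts.
    return (city_to_country[row["city"]], age_range(row["age"]))


# Merge identical generalized tuples and accumulate their counts.
generalized = Counter(generalize(row) for row in students)
for (country, ages), count in sorted(generalized.items()):
    print(country, ages, count)
```

Five detailed tuples are reduced to three generalized tuples, each carrying a count of how many original tuples it represents.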

Data Cube computation in Data Mining


Data Mining can be referred to as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging. In data mining, a data cube is a multi-dimensional
array of data that is used for online analytical processing (OLAP).
Here are a few strategies for data cube computation in data mining:
1. Materialized view
This approach involves pre-computing and storing the data cube in a database. This can be done using a
materialized view, which is a pre-computed table that is based on a SELECT statement.
• Advantage: Data cube queries can be answered quickly, since the data is already pre-computed and
stored in the database.
• Disadvantage: The materialized view needs to be updated regularly to reflect changes in the
underlying data.
2. Lazy evaluation
This approach involves delaying the computation of the data cube until it is actually needed.
• Advantage: The data cube is computed on the fly, which can be more efficient if the data cube is not
needed very often.
• Disadvantage: Data cube queries may be slower, since the data cube needs to be computed each time
it is accessed.
3. Incremental update
This approach involves computing the data cube incrementally, by only updating the parts of the data
cube that have changed.
• Advantage: The data cube can be updated more efficiently, since only a small portion of it needs to
be recomputed.
• Disadvantage: It can be more complex to implement, since it requires tracking changes to the data
and updating the data cube accordingly.
4. Data cube approximation
This approach involves approximating the data cube using sampling or other techniques.
• Advantage: It can be much faster than computing the data cube exactly.
• Disadvantage: The approximated data cube may not be as accurate as the exact data cube.

5. Data warehouse
A data warehouse is a central repository of data that is designed for efficient querying and analysis. Data
cubes can be computed on top of a data warehouse, which allows for fast querying of the data. However,
data warehouses can be expensive to set up and maintain, and may not be suitable for all organizations.
6. Distributed computing
In this approach, the data cube is computed using a distributed computing system, such as Hadoop or
Spark.
• Advantage: The data cube can be computed on a large dataset that may not fit on a single machine.
• Disadvantage: Distributed computing systems can be complex to set up and maintain, and may
require specialized skills and resources.
7. In-memory computing
This approach involves storing the data in memory and computing the data cube directly from memory.
• Advantage: Queries are very fast, since the data is already in memory and does not need to be
retrieved from disk.
• Disadvantage: It may not be practical for very large datasets, since the data may not fit in memory.
8. Streaming data
This approach involves computing the data cube on a stream of data, rather than a batch of data.
• Advantage: The data cube can be updated in real time, as new data becomes available.
• Disadvantage: It can be more complex to implement, and may require specialized tools and
techniques.
Note: Sorting, hashing, and grouping are techniques that can be used to optimize data cube computation,
but they are not necessarily strategies for data cube computation in and of themselves.
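
Three of the strategies above (materialized view, lazy evaluation, and incremental update) can be contrasted in one small Python sketch. The fact rows and all function names are hypothetical, the cube is held in plain dictionaries instead of a database, and the cache used in the lazy variant is an added assumption: without it, the cuboid would be recomputed on every access as described above.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("branch", "item_type", "year")

# Hypothetical fact rows; 'units' is the measure being aggregated.
facts = [
    {"branch": "A", "item_type": "phone", "year": 1997, "units": 10},
    {"branch": "A", "item_type": "computer", "year": 1997, "units": 5},
    {"branch": "B", "item_type": "phone", "year": 1998, "units": 7},
]


def compute_cuboid(rows, dims):
    """Sum 'units' grouped by the given subset of dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["units"]
    return dict(totals)


def all_dimension_subsets():
    """Every subset of the dimensions, from the grand total to the full detail."""
    return [
        dims
        for size in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, size)
    ]


# 1. Materialized: pre-compute and store every cuboid up front.
materialized = {dims: compute_cuboid(facts, dims) for dims in all_dimension_subsets()}

# 2. Lazy: compute a cuboid only when it is first requested, then cache it.
lazy_cache = {}


def lazy_cuboid(dims):
    if dims not in lazy_cache:
        lazy_cache[dims] = compute_cuboid(facts, dims)
    return lazy_cache[dims]


# 3. Incremental: apply one new fact to each stored cuboid instead of
# recomputing everything from scratch.
def apply_new_fact(row):
    facts.append(row)
    for dims, cuboid in materialized.items():
        key = tuple(row[d] for d in dims)
        cuboid[key] = cuboid.get(key, 0) + row["units"]
    lazy_cache.clear()  # cached lazy results are now out of date


print(len(materialized))            # 8 cuboids for 3 dimensions
print(materialized[("branch",)])    # {('A',): 15, ('B',): 7}
apply_new_fact({"branch": "B", "item_type": "phone", "year": 1998, "units": 3})
print(materialized[("branch",)])    # {('A',): 15, ('B',): 10}
print(lazy_cuboid(("year",)))       # {(1997,): 15, (1998,): 10}
```

With three dimensions there are 2^3 = 8 cuboids, which shows why full materialization becomes expensive as dimensions are added and why the lazy and incremental strategies exist.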
