Data Mining Notes UNIT II
Data Warehouse:
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It contains historical data derived from
transaction data, drawn from single or multiple sources.
A typical data warehouse ranges in size from 100 GB to 100 TB, whereas an
operational database typically ranges from 100 MB to 100 GB.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view data from
different angles and dimensions.
Star Schema:
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause data
redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian
province of British Columbia. The entries for such cities may cause data redundancy along
the attributes province_or_state and country.
Snowflake Schema:
The snowflake schema is a variant of the star schema in which some dimension tables
are normalized, splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.
Fact Constellation Schema:
A fact constellation has multiple fact tables sharing dimension tables. It is also known
as a galaxy schema. The following diagram shows two fact tables, namely sales and shipping.
Schema Definition:
Star schema definition:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake schema definition (note the normalized supplier and city tables):
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
Fact constellation schema definition (the sales dimensions plus a second cube, shipping):
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:
dollar cost = sum(cost in dollars), units shipped = count(*)
Measures and Their Categorization:
A measure in a data cube is computed by applying an aggregate function. Aggregate
functions, and the measures based on them, can be organized into three categories −
Distributive
Algebraic
Holistic
Distributive.
An aggregate function is distributive if it can be computed in a distributed manner.
Suppose the data are partitioned into n sets. We apply the function to each partition,
resulting in n aggregate values. If the result derived by applying the function to
the n aggregate values is the same as that derived by applying the function to the entire
data set (without partitioning), the function can be computed in a distributed manner.
For example, count() can be computed for a data cube by first partitioning the cube
into a set of sub cubes, computing count() for each sub cube, and then summing up the
counts obtained for each sub cube. Hence, count() is a distributive
aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate
functions.
A measure is distributive if it is obtained by applying a distributive aggregate
function. Distributive measures can be computed efficiently because they can be computed
in a distributive manner.
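As a quick illustration (a minimal Python sketch, not tied to any particular OLAP engine; the data values are made up), the following shows that aggregates computed per partition and then combined equal the aggregates over the whole data set:

data = [3, 1, 4, 1, 5, 9, 2, 6]

# Partition the data into n = 2 sets.
partitions = [data[:4], data[4:]]

# Apply the aggregate function to each partition.
partial_sums = [sum(p) for p in partitions]          # [9, 22]
partial_counts = [len(p) for p in partitions]        # [4, 4]

# Distributive: combining the partial aggregates gives the global aggregate.
assert sum(partial_sums) == sum(data)                 # sum() is distributive
assert sum(partial_counts) == len(data)               # count() is distributive
assert max(max(p) for p in partitions) == max(data)   # max() is distributive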
Algebraic.
An aggregate function is algebraic if it can be computed by an algebraic function
with m arguments (where m is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
For example, avg() (average) can be computed by sum()/count(), where both sum()
and count() are distributive aggregate functions. Similarly, it can be shown that min N()
and max N() (which find the N minimum and N maximum values, respectively, in a given
set) and standard deviation() are algebraic aggregate functions.
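For instance (a minimal Python sketch, using the same made-up partitioned data as above), avg() can be computed by keeping the pair (sum, count) for each partition − a bounded number m = 2 of distributive sub-aggregates − and combining them at the end:

data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:4], data[4:]]

# For each partition, keep m = 2 distributive sub-aggregates: (sum, count).
pairs = [(sum(p), len(p)) for p in partitions]

# Combine the bounded sub-aggregates, then apply the algebraic function.
total = sum(s for s, _ in pairs)
count = sum(c for _, c in pairs)
assert total / count == sum(data) / len(data)   # avg() is algebraic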
Holistic.
An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a subaggregate. That is, there does not exist an algebraic function
with m arguments (where m is a constant) that characterizes the computation. Common
examples of holistic functions include median(), mode(), and rank().
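A minimal Python sketch (with made-up values) of why median() is holistic: the median of partition medians generally differs from the true median, so no bounded set of sub-aggregates suffices −

from statistics import median

data = [1, 2, 3, 4, 100, 101]
partitions = [[1, 2, 100], [3, 4, 101]]

partial_medians = [median(p) for p in partitions]   # [2, 4]
print(median(partial_medians))   # 3.0 -- combined from sub-aggregates
print(median(data))              # 3.5 -- true median of the full data set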
Concept Hierarchies:
A concept hierarchy consists of a set of nodes organized in a tree, where the nodes
represent values of an attribute known as concepts. A special node, “ANY”, is reserved
for the root of the tree. A level number is assigned to each node in the hierarchy: the
level of the root node is one, and the level of a non-root node is one more than the level
of its parent.
Because values are represented by nodes, the levels of nodes can also be used to describe
the levels of values. A concept hierarchy enables raw data to be handled at a higher and
more generalized level of abstraction. There are several types of concept hierarchies,
which are as follows −
Schema Hierarchy − Schema hierarchy represents the total or partial order between
attributes in the database. It can define existing semantic relationships between attributes.
In a database, more than one schema hierarchy can be generated by using multiple
sequences and grouping of attributes.
Concept hierarchies may be generated statically or dynamically, depending on the data
set. In this context, a hierarchy generated from a static data set is known as a static
concept hierarchy, while one generated from a dynamic data set is known as a dynamic
concept hierarchy.
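As an illustration (a minimal Python sketch; the city and province names reuse the star schema example above and the parent-link representation is just one possible choice), a location hierarchy can be stored as parent links and used to roll a value up to any level:

# Each node maps to its parent; "ANY" is the root (level 1).
parent = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "British Columbia": "Canada",
    "Canada": "ANY",
}

def roll_up(value, levels=1):
    # Generalize a value by climbing the given number of levels.
    for _ in range(levels):
        value = parent.get(value, "ANY")
    return value

print(roll_up("Vancouver"))            # British Columbia
print(roll_up("Victoria", levels=2))   # Canada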
Figure: (a) a hierarchy for location; (b) a lattice for time.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in either of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following
ways −
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by fixing a single value on that dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows
the pivot operation.
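To make the four operations concrete, here is a minimal sketch in Python using pandas (the column names and dollar values are assumptions modeled on the examples above, not the API of any particular OLAP server):

import pandas as pd

# A tiny fact table: one row per (time, location, item) cell.
sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "dollars_sold": [605, 825, 680, 952],
})

# Roll-up: aggregate away the item dimension (dimension reduction).
rollup = sales.groupby(["time", "location"])["dollars_sold"].sum()

# Slice: fix a single value on one dimension (time = "Q1").
slice_q1 = sales[sales["time"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[sales["location"].isin(["Toronto", "Vancouver"])
             & sales["time"].isin(["Q1", "Q2"])
             & sales["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate the data axes for an alternative presentation.
pivot = sales.pivot_table(index="location", columns="time",
                          values="dollars_sold", aggfunc="sum")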
The business analyst gets information from the data warehouse to measure performance
and make critical adjustments in order to stay ahead of other businesses in the market.
Having a data warehouse offers the following advantages −
Since a data warehouse can gather information quickly and efficiently, it can
enhance business productivity.
A data warehouse provides a consistent view of customers and items, and hence
helps in managing customer relationships.
A data warehouse also helps in bringing down costs by tracking trends and patterns
over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze
the business needs and construct a business analysis framework. Different people hold
different views regarding the design of a data warehouse. These views are as follows −
The top-down view − This view allows the selection of relevant information
needed for a data warehouse.
The data source view − This view presents the information being captured, stored,
and managed by the operational system.
The data warehouse view − This view includes the fact tables and dimension
tables. It represents the information stored inside the data warehouse.
The business query view − It is the view of the data from the viewpoint of the end-
user.
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers
of the data warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database
server. It is almost always a relational database system. We use back-end tools and
utilities to feed data into the bottom tier. These back-end tools and utilities perform
the extract, clean, load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented
in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional
data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top Tier − This tier is the front-end client layer. It holds the query tools,
reporting tools, analysis tools, and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse −
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data
warehouse models −
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
A set of views over operational databases is known as a virtual warehouse. It is easy
to build a virtual warehouse, but it requires excess capacity on operational database
servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization.
In other words, we can say that data marts contain data specific to a particular group.
For example, the marketing data mart may contain data related to items, customers, and
sales. Data marts are confined to subjects.
Points to remember about data marts −
Windows-based or Unix/Linux-based servers are used to implement data marts.
They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
The life cycle of a data mart may be complex in the long run, if its planning and design
are not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all of the information and subjects spanning an
entire organization.
It provides enterprise-wide data integration.
The data is integrated from operational systems and external information
providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes
or beyond.
Data Warehouse Back-End Tools and Utilities:
Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
Refresh: propagate the updates from the data sources to the warehouse
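A minimal Python sketch of these back-end steps (the in-memory CSV source, field names, and cleaning rule are hypothetical placeholders, not part of any standard tool):

import csv, io

# Hypothetical source data standing in for an external operational system.
SOURCE = "item_key,dollars_sold\nI2,250\nI1,400\n,99\n"

def extract(source):
    # Data extraction: read records from a (here, in-memory CSV) source.
    return list(csv.DictReader(io.StringIO(source)))

def clean(rows):
    # Data cleaning: drop records with a missing key (a placeholder rule).
    return [r for r in rows if r["item_key"]]

def transform(rows):
    # Data transformation: convert source format to warehouse format.
    for r in rows:
        r["dollars_sold"] = float(r["dollars_sold"])
    return rows

def load(rows, warehouse):
    # Load: sort and consolidate the rows into the warehouse table.
    warehouse.extend(sorted(rows, key=lambda r: r["item_key"]))

warehouse = []
load(transform(clean(extract(SOURCE))), warehouse)   # a refresh reruns this pipeline
print(warehouse)   # the cleaned, transformed rows, sorted by item_key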
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that metadata is the summarized data
that leads us to detailed data. In terms of a data warehouse, we can define metadata as
follows.
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a
given data warehouse. Along with this metadata, additional metadata is also created for
time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business metadata − It has the data ownership information, business definitions, and
changing policies.
Technical metadata − It includes database system names, table and column names and
sizes, data types and allowed values.
Operational metadata − It includes the currency of data (active, archived, or purged)
and data lineage (the history of the data and the transformations applied to it).
Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational
DBMS.
ROLAP includes the following −
Implementation of aggregation navigation logic
Optimization for each DBMS back end
Additional tools and services
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of
data. With multidimensional data stores, the storage utilization may be low if the data
set is sparse.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of
detailed information. The aggregations are stored separately in a MOLAP store.
Specialized SQL servers provide advanced query language and query processing support
for SQL queries over star and snowflake schemas in a read-only environment.
Data Warehouse Implementation
The big data that is to be analyzed and handled to draw insights from it will be
stored in data warehouses.
These warehouses are run by OLAP servers, which require processing of a query
within seconds.
So, a data warehouse needs highly efficient cube computation techniques,
access methods, and query processing techniques.
The core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions.
In SQL, aggregations are referred to as group-by's.
Each group-by can be represented as a cuboid.
The set of all group-by's forms a lattice of cuboids defining a data cube.
The compute cube operator computes aggregates over all subsets of the dimensions
specified in the operation.
It requires excessive storage space, especially for a large number of dimensions.
A data cube is a lattice of cuboids.
Suppose that we create a data cube for ProElectronics(Company) sales that contains the
following: city, item, year, and sales_in_dollars.
Compute the sum of sales, grouping by city, and item.
Compute the sum of sales, grouping by city.
Compute the sum of sales, grouping by item.
What is the total number of cuboids, or group-by's, that can be computed for this data
cube? Taking city, item, and year as the three dimensions and sales_in_dollars as the
measure, the total number of cuboids is 2^3 = 8: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item), (year), ()}.
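A minimal Python sketch of the compute cube operator over the ProElectronics example (the two rows are made-up illustrative values): it enumerates every subset of the dimensions and aggregates sales for each resulting group-by −

from itertools import combinations

dims = ("city", "item", "year")
rows = [
    {"city": "Vancouver", "item": "TV", "year": 2023, "sales_in_dollars": 400},
    {"city": "Victoria",  "item": "TV", "year": 2024, "sales_in_dollars": 300},
]

cube = {}
for k in range(len(dims) + 1):
    for group_by in combinations(dims, k):   # one cuboid per subset of dimensions
        cuboid = {}
        for r in rows:
            key = tuple(r[d] for d in group_by)
            cuboid[key] = cuboid.get(key, 0) + r["sales_in_dollars"]
        cube[group_by] = cuboid

print(len(cube))   # 8 cuboids, from the apex () to the base (city, item, year)
print(cube[()])    # {(): 700} -- the apex cuboid holds the grand total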
The base cuboid contains all three dimensions. The apex cuboid corresponds to the empty
group-by; it holds a single value, the grand total of sales. On-line analytical processing
may need to access different cuboids for different queries, so we have to compute all, or
at least some, of the cuboids in the data cube in advance. Precomputation leads to fast
response time and avoids some redundant computation.
A major challenge related to precomputation is storage space, if all the cuboids
in the data cube are computed, especially when the cube has many dimensions.
The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as
the curse of dimensionality.
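Concretely, for a cube with n dimensions where dimension i has L_i hierarchy levels (excluding the virtual top level "all"), the total number of cuboids is the product of (L_i + 1) over all dimensions. A minimal Python check for a hypothetical cube with 10 dimensions of 4 levels each:

from math import prod

levels = [4] * 10                  # 10 dimensions, 4 hierarchy levels each
total = prod(l + 1 for l in levels)
print(total)                       # 5**10 = 9,765,625 cuboids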
Cube Operation
There are three choices for data cube materialization given a base cuboid.
No Materialization
Full Materialization
Partial Materialization
If partial materialization is chosen, three factors must be considered −
Identify the subset of cuboids or subcubes to materialize.
Exploit the materialized cuboids or subcubes during query processing.
Efficiently update the materialized cuboids or subcubes during load and refresh.
Bitmap Indexing:
In bitmap indexing, an index is created on a particular column of the table. Each distinct
value in the column is associated with a bit vector, and bit operations on these vectors
are fast. The length of each bit vector equals the number of records in the base table.
The i-th bit is set if the i-th row of the base table has that value for the indexed
column. Bitmap indexing is not suitable for high-cardinality domains.
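A minimal Python sketch of a bitmap index (plain integers serve as the bit vectors; the column values are illustrative):

column = ["TV", "Modem", "TV", "Mobile", "Modem"]   # the indexed column

# One bit vector per distinct value; bit i is set if row i holds that value.
bitmaps = {}
for i, v in enumerate(column):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

print(bin(bitmaps["TV"]))                      # 0b101 -- rows 0 and 2
# The selection "TV or Modem" is a single fast bit operation:
print(bin(bitmaps["TV"] | bitmaps["Modem"]))   # 0b10111 -- rows 0, 1, 2, 4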
Join Indexing:
The join indexing method gained popularity from its use in relational database query
processing. Join index records identify joinable tuples without performing costly
join operations. Join indexing is especially useful for maintaining the relationship
between a foreign key and the matching primary keys of the joinable relation.
Suppose that there are 360 time values, 100 items, 50 branches, 30 locations, and
10 million sales tuples in the sales star data cube. If the sales fact table has recorded
sales for only 30 items, the remaining 70 items will obviously not participate in the
joins. If join indices are not used, additional I/Os have to be performed to bring the
joining portions of the fact table and dimension tables together.
To further speed up query processing, the join indexing, and bitmap indexing
methods can be integrated to form bitmapped join indices. Microsoft SQL Server and
Sybase IQ support bitmap indices. Oracle 8 uses bitmap and join indices.
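A minimal Python sketch of a join index between a sales fact table and an item dimension (the keys and row ids are made up): the index maps each dimension key directly to the fact rows that join with it −

# Fact rows: (row_id, item_key, dollars_sold)
sales_fact = [(0, "I1", 400), (1, "I2", 250), (2, "I1", 300)]

# Join index: item_key -> list of joinable fact row ids,
# precomputed once instead of scanning the fact table per query.
join_index = {}
for row_id, item_key, _ in sales_fact:
    join_index.setdefault(item_key, []).append(row_id)

print(join_index["I1"])   # [0, 2] -- joinable tuples found without a costly join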
The purpose of materializing cuboids and constructing OLAP index structures is to speed
up the query processing in data cubes.