DM Unit 2

Data warehousing is essential for businesses to collect, organize, and analyze data for strategic decision-making, overcoming challenges of data integration from diverse sources. It features a central repository that supports advanced analysis, faster queries, and aids in managing customer relationships. The document also differentiates between OLTP and OLAP systems, outlines data warehouse architecture, and discusses various models and approaches for effective data management.


Purpose of Data Warehousing:

Data warehousing helps businesses collect, organize, and analyze data for strategic decision-making.

Definition of a Data Warehouse:

• It's a central data repository separate from everyday operational databases.

• It combines and organizes data from various sources (e.g., databases, files) for easy access and analysis.

Key Features of a Data Warehouse:

• Subject-oriented: Focuses on key areas like customers, sales, and products for decision-making.

• Integrated: Combines data from multiple sources, ensuring consistency in names, formats, and measures.

• Time-variant: Stores historical data (e.g., from the last 5-10 years) to analyze changes over time.

• Nonvolatile: Data remains stable and is updated in bulk, without ongoing changes from daily operations.

Benefits of Data Warehousing:

• Supports decision-making by providing a clear, consolidated view of data.

• Helps analyze customer behavior, product performance, and operational efficiency.

• Aids in managing customer relationships and corporate resources effectively.

Data Integration Challenges:

• Organizations have diverse and distributed databases.

• Integrating such data directly for frequent queries can be complex, slow, and costly.

Advantages of Data Warehousing over Traditional Approaches:

• Traditional methods rely on real-time query translation across databases, which is slow and resource-intensive.

• Data warehouses use a pre-loaded approach, where data is processed and stored in advance for quick access.

Performance Benefits:

• Queries in a data warehouse are faster because the data is already prepared and summarized.

• Querying the warehouse doesn't disrupt local databases.

Support for Advanced Analysis:

• Stores historical data to support detailed, multidimensional queries.

• Useful for businesses to identify trends, optimize strategies, and make informed decisions.

Differences Between OLTP and OLAP Systems:

1. Purpose:
   o OLTP (Online Transaction Processing): Handles day-to-day operations like banking, inventory, payroll, and registration.
   o OLAP (Online Analytical Processing): Focuses on data analysis and decision-making, supporting tasks like reporting and strategic planning.

2. User Orientation:
   o OLTP: Customer-focused, used by clerks, clients, and IT professionals.
   o OLAP: Market-focused, used by managers, executives, and analysts.

3. Data Type:
   o OLTP: Manages current, detailed data that supports transactions.
   o OLAP: Handles historic, summarized data that supports analysis and decision-making.

4. Database Design:
   o OLTP: Uses an ER (Entity-Relationship) model with application-specific designs.
   o OLAP: Uses star or snowflake models with a subject-oriented approach.

5. Data View:
   o OLTP: Focuses on current, detailed data from a single department or enterprise.
   o OLAP: Combines data across departments or organizations, often in a summarized, multidimensional format.

6. Access Patterns:
   o OLTP: Handles short, simple read/write transactions with a need for concurrency and recovery controls.
   o OLAP: Mostly read-only complex queries, accessing large datasets for analysis.

7. Data Volume and Users:
   o OLTP: Deals with smaller datasets (up to high-order GB) and has thousands of users.
   o OLAP: Manages large datasets (in TB) with hundreds of users.

8. Performance and Metrics:
   o OLTP: Prioritizes high transaction throughput and availability.
   o OLAP: Focuses on flexible queries and fast response times for complex analyses.

9. Summary of Operations:
   o OLTP: Supports high-speed transactions using indexing and primary keys.
   o OLAP: Performs heavy scans for large-scale data analysis.

10. Example Comparison:
    o OLTP: Processing a payment, updating an inventory record.
    o OLAP: Generating a sales report for the past year, analyzing customer trends.
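The contrast above can be sketched with a toy example (the `sales` table and its rows are invented for illustration, not from the text): an OLTP workload touches individual records inside short transactions, while an OLAP workload scans and summarizes many rows.

```python
import sqlite3

# Hypothetical mini-database used only to contrast the two workloads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, item TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Toronto", "TV", 2023, 500.0),
     ("Toronto", "TV", 2024, 650.0),
     ("Vancouver", "PC", 2024, 900.0)],
)

# OLTP-style: a short read/write transaction touching one record.
conn.execute(
    "UPDATE sales SET amount = amount + 50 "
    "WHERE city = 'Toronto' AND item = 'TV' AND year = 2024"
)

# OLAP-style: a read-only query scanning and summarizing many rows.
rows = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Toronto', 1200.0), ('Vancouver', 900.0)]
```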

Why Have a Separate Data Warehouse? (Simplified in Points)

1. Performance Differences:
   o Operational databases (OLTP) are designed for fast, simple tasks like searching, updating records, and processing transactions.
   o OLAP queries are complex, requiring large data analysis, summarization, and multidimensional views. Running OLAP on operational databases would slow down their performance.

2. Concurrency Issues:
   o OLTP systems handle multiple users and transactions simultaneously, using mechanisms like locking and logging to ensure data consistency.
   o OLAP queries are mostly read-only but can interfere with these mechanisms, reducing the speed and reliability of transactional systems.

3. Different Data Needs:
   o Operational databases store current, raw, and detailed data for immediate use.
   o Data warehouses store historic, summarized, and integrated data for decision-making and analysis.

4. Data Consolidation:
   o Decision-making requires clean, high-quality, and consolidated data from multiple sources.
   o Operational databases don't typically maintain or integrate data in this way, making them unsuitable for decision support.

5. Distinct Functions:
   o OLTP focuses on running the day-to-day operations.
   o OLAP is designed for long-term strategic analysis.

6. Emerging Trends:
   o Some database vendors are optimizing operational systems to handle OLAP queries.
   o In the future, the separation between OLTP and OLAP systems may shrink, but for now, separate systems are necessary.

The Three-Tier Data Warehousing Architecture:

1. Bottom Tier: Data Warehouse Server
   o Purpose: Collects, integrates, and stores data.
   o Components:
     ▪ Data is extracted from operational databases and external sources.
     ▪ Processes like Extract, Clean, Transform, Load, and Refresh ensure data is accurate and unified.
     ▪ A metadata repository stores information about the data warehouse structure, schema, and contents.
   o Tools Used: Relational database management systems (RDBMS) and gateways like ODBC, OLEDB, and JDBC.
2. Middle Tier: OLAP Server
   o Purpose: Handles data analysis and provides multidimensional views of data.
   o Models Used:
     ▪ ROLAP (Relational OLAP): Uses relational databases for multidimensional data queries.
     ▪ MOLAP (Multidimensional OLAP): Uses specialized servers for efficient multidimensional data processing.
   o Connects the backend (data warehouse) to the front-end tools.

3. Top Tier: Front-End Tools
   o Purpose: Provides user-friendly interfaces for accessing and analyzing data.
   o Tools Include:
     ▪ Query/Report Tools: For generating reports.
     ▪ Analysis Tools: For understanding patterns and trends.
     ▪ Data Mining Tools: For predictive and trend analysis.
   o Output: Visualizations, summaries, or insights used for decision-making.

4. Flow of Data:
   o Data moves from the bottom tier (data sources) through extraction and processing to the middle tier (OLAP server), and finally to the top tier (end-user tools).

5. Integration:
   o Combines diverse data sources and processes them for user-focused analysis, ensuring efficient data management and decision support.

6. Purpose of the Architecture:
   o Ensures efficient handling, processing, and use of data for decision-making.
   o Each layer has a distinct role, improving performance and flexibility.
Data Warehouse Models and Approaches Explained Simply

Three Data Warehouse Models

1. Enterprise Warehouse:
   o Stores all organization-wide information across all subjects (e.g., sales, customers, inventory).
   o Integrates data from multiple operational systems and external sources.
   o Contains both detailed and summarized data.
   o Typically large (gigabytes to terabytes or more) and can take years to design and build.
   o Requires extensive planning and may use high-performance platforms like mainframes or parallel architectures.

2. Data Mart:
   o Focused on a specific group or department (e.g., marketing, sales).
   o Contains summarized data relevant to a particular subject area.
   o Easier and faster to build than an enterprise warehouse (weeks instead of years).
   o Implemented on low-cost servers (e.g., Windows or Linux).
   o Two types:
     ▪ Independent Data Mart: Gets data from operational systems or external sources.
     ▪ Dependent Data Mart: Gets data directly from the enterprise warehouse.

3. Virtual Warehouse:
   o A collection of views created over operational databases.
   o Only a few summary views are stored ("materialized") for better query performance.
   o Quick and easy to set up, but requires extra capacity on the operational databases.

Development Approaches

1. Top-Down Approach:
   o Starts with building a complete enterprise warehouse.
   o Advantages:
     ▪ Systematic and minimizes future integration issues.
   o Disadvantages:
     ▪ Expensive, time-consuming, and rigid.
     ▪ Requires agreement on a common data model across the organization, which can be difficult.

2. Bottom-Up Approach:
   o Begins with building individual data marts and integrates them later.
   o Advantages:
     ▪ Flexible, faster to implement, and low cost.
     ▪ Provides quick returns on investment.
   o Disadvantages:
     ▪ Can lead to integration issues when trying to unify different data marts into a single enterprise warehouse.
Recommended Method (Incremental Approach)

1. Define a high-level corporate data model (in 1-2 months) to create a consistent, integrated view of data.

2. Build independent data marts in parallel with the enterprise warehouse, based on the corporate model.

3. Use hub servers to integrate distributed data marts into a cohesive system.

4. Construct a multitier warehouse where the enterprise warehouse acts as the central data repository, distributing data to dependent data marts.

Key Functions in Data Warehouse Systems:

1. Data Extraction: Collects data from different sources, which can be external and varied.

2. Data Cleaning: Identifies and fixes errors in the data, improving its quality.

3. Data Transformation: Converts the data from its original format into a format suitable for the warehouse.

4. Data Loading: Organizes and processes data, ensuring it is correctly placed in the warehouse and optimized for retrieval.

5. Data Refresh: Updates the warehouse with the latest data from external sources.

Data Warehouse Management:

• These tools help with the management of data within the warehouse, such as cleaning, loading, and refreshing data.

Importance of Data Cleaning and Transformation:

• These processes improve data quality, which is crucial for better data mining results (as discussed in Chapter 3).

Metadata Repository:

• Metadata: Data about the data in the warehouse. It defines the warehouse objects.

Key Elements of Metadata:

1. Warehouse Structure Description: Includes data names, the warehouse schema, dimensions, and data marts.

2. Operational Metadata: Tracks the history of data migration, transformations applied, and the status of data (active or archived).

3. Summarization Algorithms: Define how data is summarized, grouped, and aggregated.

4. Mapping Information: Describes how data in the operational environment relates to the data warehouse, including data extraction rules and security.

5. System Performance Data: Includes details about indices and profiles that speed up data retrieval, along with refresh/update schedules.

6. Business Metadata: Contains business-related terms, data ownership, and policies.

Importance of Metadata:

• Data Directory: Helps users find specific data within the warehouse.

• Data Mapping Guide: Shows how data moves from the source to the warehouse.

• Summarization Guide: Documents the algorithms used to summarize detailed data at different levels.

• Persistence: Metadata should be stored securely and remain accessible on disk.

Data Warehouse Data Levels:

• The warehouse stores different types of data:
  o Current Detailed Data: Stored on disk.
  o Older Data: Stored on tertiary storage.
  o Summarized Data: Can be lightly or highly summarized and may or may not be physically stored.
Overview of Data Warehouse Modeling with Data Cubes:

1. Multidimensional Data Model: Data in data warehouses is structured in a way that allows it to be viewed from multiple dimensions (perspectives), often using a data cube.

2. Dimensions: These are the different viewpoints or entities that an organization uses to analyze data. For example, in a sales data warehouse, the dimensions might include:
   o Time (e.g., months, quarters)
   o Item (e.g., type of product sold)
   o Branch (e.g., specific store branches)
   o Location (e.g., cities or regions)

3. Dimension Tables: Each dimension has a related table (called a dimension table) that describes it in more detail. For instance, the "item" dimension table might list item names, brands, and types.

4. Fact Table: A central table in a data warehouse that stores facts (numeric measures) like sales amount, units sold, or budgeted amounts. The fact table also connects to the dimension tables through keys.

Data Cube:

5. Data Cube Concept: The data cube is a way of storing data in multiple dimensions (not just 3-D). It's often represented visually as a cube but can have any number of dimensions (n-dimensional).

6. 2-D Data Cube Example: Imagine a table showing sales data for items sold over time (e.g., quarterly sales). This table is a 2-D data cube, with the two dimensions being "time" and "item."

7. 3-D Data Cube: If we add another dimension (e.g., location), the data becomes 3-D. A 3-D cube can be viewed as multiple 2-D tables stacked together, each representing data for a different location (like sales in different cities).

8. 4-D and Higher: You can keep adding dimensions, such as a supplier dimension, making the data cube 4-D. This gets more complex, but it's still represented as a series of lower-dimensional cubes.

Cuboids:

9. Cuboids: A cuboid represents data for a specific combination of dimensions. For example, if you have four dimensions (time, item, location, and supplier), a cuboid might show summarized sales data for a particular combination of those dimensions.

10. Lattice of Cuboids: All possible combinations of dimensions form a lattice of cuboids, where each cuboid shows the data at a different level of detail (summarization). The collection of these cuboids is called a data cube.

Levels of Summarization:

11. Base Cuboid: The base cuboid holds the lowest level of detail. For example, a 4-D cuboid for time, item, location, and supplier would be the base cuboid.

12. Nonbase Cuboid: A nonbase cuboid represents summarized data. For example, a 3-D cuboid for time, item, and location is summarized over all suppliers.

13. Apex Cuboid: The apex cuboid holds the highest level of summarization, often showing the total value over all dimensions. In the example, it could show total sales across all four dimensions.

In summary:

• Data cubes allow data to be viewed in multiple dimensions (time, item, location, etc.).

• Cuboids represent different levels of data summarization.

• The base cuboid has the most detailed data, while the apex cuboid shows the highest-level summary.
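The lattice of cuboids can be enumerated directly: ignoring concept-hierarchy levels, every subset of the n dimensions is one cuboid, giving 2^n cuboids. A minimal sketch using the four example dimensions from the text:

```python
from itertools import combinations

# Each cuboid corresponds to one subset of the dimensions: the empty
# subset is the apex cuboid (everything summarized) and the full set
# of dimensions is the base cuboid (lowest level of detail).
dimensions = ("time", "item", "location", "supplier")

cuboids = [combo
           for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

print(len(cuboids))   # 16 cuboids = 2^4
print(cuboids[0])     # () -> apex cuboid
print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> base cuboid
```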

Schemas for Multidimensional Data Models:

1. Entity-Relationship Model: This is used in relational databases for online transaction processing (OLTP). It focuses on entities and the relationships between them.

2. Data Warehouse Schema: For a data warehouse, a multidimensional schema is needed to facilitate online data analysis. This schema is subject-oriented, meaning it's focused on specific business areas like sales or inventory.

Types of Schemas:

1. Star Schema:

• Structure:
  o A central fact table holds the main data (e.g., sales figures).
  o Around it are dimension tables that describe the dimensions (e.g., time, location, product).

• Features:
  o The fact table contains numeric values like dollars sold or units sold.
  o Dimension identifiers are system-generated keys linking the fact table to the dimension tables.
  o Redundancy: There can be some redundancy in dimension tables. For example, multiple cities in the same state lead to repetition of the state and country attributes.
  o Attributes: Each dimension table has attributes that may form hierarchies (ordered) or lattices (unordered).

2. Snowflake Schema:

• Structure:
  o A variant of the star schema, where some dimension tables are normalized (split into smaller related tables).

• Benefits:
  o Reduces redundancy and saves storage space because the data is split into smaller tables.

• Drawbacks:
  o Can reduce the system's performance because more joins are needed to execute queries.
  o Not as popular as the star schema due to these performance issues.

• Example:
  o A single "item" dimension table is split into two smaller tables: one for item details and one for supplier details.
  o A "location" dimension might be split into separate "location" and "city" tables.

3. Fact Constellation Schema:

• Structure:
  o Used for more complex applications; it includes multiple fact tables that share dimension tables.

• Features:
  o Multiple fact tables (e.g., sales, shipping) share common dimension tables (e.g., time, location, item).

• Also Known As: Sometimes called a Galaxy Schema because it looks like a collection of stars (fact tables) sharing dimension tables.

• Use Case: Used in data warehouses that span an entire organization and need to model multiple interconnected subjects.

Data Warehouse vs Data Mart:

1. Data Warehouse:
   o A data warehouse collects information for the entire organization, covering subjects like sales, customers, and inventory. It typically uses the fact constellation schema to model multiple related subjects.

2. Data Mart:
   o A data mart is a smaller, department-specific subset of a data warehouse, focusing on a specific area (like sales or inventory).
   o For data marts, the star schema or snowflake schema is used, with the star schema being more popular because it's simpler and more efficient.

Summary:

• Star Schema: Simple and efficient, with a central fact table and smaller dimension tables.

• Snowflake Schema: A more complex version of the star schema with normalized dimension tables to reduce redundancy.

• Fact Constellation Schema: Multiple fact tables sharing common dimension tables, used for complex applications across the organization.
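A rough sketch of the star schema idea (the table and column names below are invented for illustration, not taken from the text): a fact table holds the numeric measures and foreign keys, and a typical query joins it to a dimension table and aggregates.

```python
import sqlite3

# Minimal star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    dollars_sold REAL,
    units_sold INTEGER
);
INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (10, 'TV', 'Acme'), (11, 'PC', 'Acme');
INSERT INTO fact_sales VALUES (1, 10, 400.0, 2), (1, 11, 900.0, 1), (2, 10, 250.0, 1);
""")

# A typical star-join: aggregate the fact table through a dimension table.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM fact_sales f JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()
print(rows)  # [('Q1', 1300.0), ('Q2', 250.0)]
```

A snowflake variant would further split `dim_item` into item and supplier tables, at the cost of an extra join per query.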

Concept Hierarchies:

1. Definition: A concept hierarchy organizes low-level concepts into higher-level, more general ones.
   o Example: In the location dimension, cities like Vancouver, Toronto, New York, and Chicago can be mapped to their provinces or states (e.g., Vancouver to British Columbia) and countries (e.g., Vancouver to Canada).

2. Hierarchy Types:
   o Total Order: A strict order where each level is below the previous one (e.g., street < city < province < country).
   o Partial Order: A more flexible order where some elements can be grouped in different ways (e.g., day < month, week < year).

3. Schema Hierarchy: Concept hierarchies like these (total or partial order) are known as schema hierarchies.
   o Some hierarchies (like time) are predefined in data systems, but users can also adjust them (e.g., defining a fiscal year that starts on April 1).

4. Set-Grouping Hierarchy: Hierarchies can also be based on value ranges.
   o Example: Price ranges might group values into categories like "cheap", "moderately priced", and "expensive" (e.g., $0-$50, $51-$100).

5. Manual or Automatic Creation:
   o Manual: Concept hierarchies can be defined by users or experts.
   o Automatic: They can also be generated based on data patterns and distributions.

Measures: Categorization and Computation:

1. Definition: Measures are numeric functions that are calculated at each point in the data cube (i.e., for each combination of dimension values).
   o Example: A measure for sales data might be the total amount sold for each city, product, and quarter.

2. Categories of Measures:
   o Distributive: Measures that can be computed by dividing the data into smaller parts, calculating the measure for each part, and then combining the results.
     ▪ Examples: Sum, Count, Min, Max (e.g., sum of sales over all sub-regions).
   o Algebraic: Measures that can be computed by combining a fixed number of distributive measures.
     ▪ Example: Average = sum()/count(), where both sum and count are distributive.
   o Holistic: Measures that cannot be computed by combining partial results and need the full data.
     ▪ Examples: Median, Mode, Rank.
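The three categories can be illustrated in a few lines (the numbers are arbitrary): sum and count combine directly across partitions (distributive), average is derived from them (algebraic), and median needs the full data (holistic).

```python
from statistics import median

# Data split into three partitions, as a distributed cube computation would.
partitions = [[4.0, 6.0], [10.0], [1.0, 3.0, 6.0]]

# Distributive: combining per-partition results gives the global result.
total = sum(sum(p) for p in partitions)   # sum of per-partition sums
count = sum(len(p) for p in partitions)   # count of per-partition counts

# Algebraic: average is derived from two distributive measures.
average = total / count

print(total, count, average)  # 30.0 6 5.0

# Holistic: the median of per-partition medians is generally NOT the
# global median; it needs all the values (or an approximation).
flat = [x for p in partitions for x in p]
print(median(flat))  # 5.0
```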

3. Efficiency of Computation:
   o Distributive measures are the easiest to compute efficiently because they can be broken down into smaller sub-problems.
   o Algebraic measures are also efficient, as they combine distributive functions.
   o Holistic measures are harder to compute efficiently, but approximation techniques (e.g., estimating median values) can help.

4. Application of Measures:
   o Most data cube applications focus on distributive and algebraic measures because they are easier to calculate.
   o Holistic measures are more complex but can be approximated on large datasets.

5. Beyond Numeric Data: While most measures are numeric, they can also be applied to non-numeric data types like spatial, multimedia, or text data.

Concept Hierarchies in OLAP:

• Concept hierarchies help organize data into multiple levels of detail (e.g., cities → states → countries).

• They allow users to view data from different perspectives in OLAP (Online Analytical Processing).

• OLAP allows interactive querying to analyze data, using operations on a data cube.

Typical OLAP Operations:

1. Roll-up:
   o Definition: Aggregates data to a higher level by climbing up a concept hierarchy or by reducing dimensions.
   o Example: Instead of viewing data by city, roll-up shows data by country (climbing street → city → state → country).
   o Can also reduce dimensions (e.g., removing the time dimension to aggregate by location only).

2. Drill-down:
   o Definition: Goes from less detailed data to more detailed data.
   o Example: Moving from quarterly data to monthly data (quarter → month).
   o Can also add new dimensions (e.g., adding a customer group dimension).

3. Slice and Dice:
   o Slice: Selects data along a single dimension, resulting in a subcube.
     ▪ Example: Viewing data for Q1 only.
   o Dice: Selects data on multiple dimensions to form a subcube.
     ▪ Example: Viewing data for Toronto or Vancouver, in Q1 and Q2, for home entertainment or computers.

4. Pivot (Rotate):
   o Definition: Rotates the axes of a data cube to change the perspective or layout.
   o Example: Switching the axes between location and item to view the data a different way.

5. Other OLAP Operations:
   o Drill-across: Combines data from multiple fact tables.
   o Drill-through: Drills down to the back-end relational tables for more detailed data.
   o Additional operations may include:
     ▪ Ranking the top or bottom N items.
     ▪ Calculating moving averages, growth rates, currency conversions, and more.

OLAP's Analytical Power:

• OLAP supports advanced analytical modeling, such as:
  o Calculating ratios and variance.
  o Trend analysis, forecasting, and statistical analysis.

• OLAP is a powerful tool for deriving insights and making decisions based on multidimensional data.
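The core operations above can be sketched without any OLAP engine; the fact rows below are invented for illustration. Roll-up is a group-by along a higher hierarchy level, slice fixes one dimension value, and dice filters on several dimensions at once.

```python
# Tiny fact list standing in for a data cube (illustrative data).
facts = [
    {"city": "Toronto",   "country": "Canada", "quarter": "Q1", "item": "TV", "sales": 400},
    {"city": "Vancouver", "country": "Canada", "quarter": "Q1", "item": "PC", "sales": 300},
    {"city": "Chicago",   "country": "USA",    "quarter": "Q2", "item": "TV", "sales": 500},
]

def roll_up(rows, dim):
    """Aggregate sales upward along one dimension (e.g., city -> country)."""
    out = {}
    for r in rows:
        out[r[dim]] = out.get(r[dim], 0) + r["sales"]
    return out

# Roll-up: climb the location hierarchy from city to country.
print(roll_up(facts, "country"))   # {'Canada': 700, 'USA': 500}

# Slice: fix a single dimension value (quarter = Q1) to get a subcube.
q1 = [r for r in facts if r["quarter"] == "Q1"]
print(roll_up(q1, "city"))         # {'Toronto': 400, 'Vancouver': 300}

# Dice: select on several dimensions at once.
diced = [r for r in facts
         if r["city"] in ("Toronto", "Vancouver") and r["quarter"] in ("Q1", "Q2")]
print(roll_up(diced, "item"))      # {'TV': 400, 'PC': 300}
```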


OLAP Operations on Multidimensional Data

The following examples (originally illustrated as a figure) show how the OLAP operations manipulate a sales cube with time, location, and item dimensions:

1. Dice:
   • Description: Selects a subset of the data on multiple dimensions at once.
   • Example: Data is selected for locations "Toronto" or "Vancouver", time periods "Q1" or "Q2", and items "home entertainment" or "computer".

2. Slice:
   • Description: Extracts a single layer of data from the multidimensional cube.
   • Example: Data for the time period "Q1" is extracted for all locations and items.

3. Pivot:
   • Description: Switches the roles of rows and columns, rotating the data's perspective.
   • Example: The location dimension is swapped with the item dimension, allowing analysis of sales for each item across different locations.

4. Drill-down:
   • Description: Provides a more granular view of the data by moving from a higher level of aggregation to a lower level.
   • Example: Drilling down from "time (quarters)" to "time (months)" gives a more detailed view of sales within each quarter.

5. Roll-up:
   • Description: Aggregates data from a lower level to a higher level, providing a summarized view.
   • Example: Rolling up from "location (cities)" to "location (countries)" aggregates the data for all cities within each country.
Data Warehouse Design:

• A data warehouse helps businesses by offering valuable information for better decision-making and competitive advantage.

• It can improve productivity by providing consistent and accurate data across the organization, enhancing customer management, and reducing costs through efficient data tracking.

Business Analysis Framework:

• When designing a data warehouse, different perspectives need to be considered: the business perspective (top-down view), the technical perspective (data source view), the storage perspective (data warehouse view), and the user's perspective (business query view).

• Understanding these views helps create a structured design that aligns with business goals and technological capabilities.

Data Warehouse Design Process:

• There are two approaches to designing a data warehouse: top-down (starting with overall planning) and bottom-up (starting with small prototypes). A combined approach uses both.

• Key steps include choosing a business process to model, determining the data granularity, selecting dimensions (e.g., time, items, customers), and choosing measures (e.g., sales figures).

Data Warehouse Usage:

• Data warehouses are used by businesses for reporting, decision-making, and analyzing trends.

• They help managers generate reports, make strategic decisions, and analyze large datasets to understand patterns and trends.

• Data warehouses evolve from basic reporting tools to more advanced analysis, such as multidimensional analysis and data mining tools.

Types of Data Warehouse Applications:

• Information Processing: Basic querying and reporting.

• Analytical Processing: More advanced querying with multidimensional analysis (OLAP functions like slice-and-dice and drill-down).

• Data Mining: Advanced analysis to find hidden patterns and insights (e.g., associations, predictions).

OLAP vs. Data Mining:

• OLAP helps with summarizing and analyzing data interactively (e.g., through slicing and dicing), while data mining goes deeper, finding hidden patterns automatically.

• Data mining is more automated and complex compared to OLAP, which is more user-directed and involves simpler data summarization.

Multidimensional Data Mining:

• This combines OLAP and data mining to find patterns in multidimensional data (data structured along multiple dimensions like time, products, or geography).

• Multidimensional data mining lets the user explore data from various perspectives, offering dynamic analysis through operations like slicing, dicing, and filtering.

• Data mining can go beyond OLAP by handling complex tasks like association, classification, and time-series analysis.

Why Data Warehouses Are Important for Data Mining:

• A data warehouse provides high-quality, cleaned, and organized data that makes data mining more effective.

• OLAP functions within data warehouses help users interactively explore data, which aids in finding valuable insights during the data mining process.
Data Warehouses & OLAP Queries:

• Data warehouses store large amounts of data.

• OLAP (Online Analytical Processing) queries need to be processed very quickly, usually in seconds.

Data Cubes:

• A data cube is a multidimensional representation of data, where each dimension (e.g., city, item, year) has a corresponding set of aggregations.

• Each possible combination of these dimensions forms a cuboid.

• For example, a data cube for sales might have dimensions city, item, and year, and you may want the sum of sales for every combination of these dimensions.

Efficient Computation:

• A compute cube operator in SQL can compute aggregates over all subsets of the specified dimensions.

• However, computing all combinations can take up a lot of storage, especially with many dimensions and large data sets.

Curse of Dimensionality:

• As the number of dimensions and the number of levels in each dimension increase, the number of possible cuboids grows exponentially, making it impractical to store all of them.

• For instance, if each dimension has multiple levels (e.g., time might have "day", "month", "quarter", "year"), the number of cuboids increases even more.

• This curse of dimensionality means storage space and computation can explode as the dimensions grow.

Precomputing Cuboids:

• To speed up query processing, some cuboids are precomputed in advance.

• But if all cuboids are precomputed, they can take up too much storage.

• Instead, it is more practical to partially materialize the cube, meaning only some of the cuboids are computed and stored.

Formula for Total Cuboids:

• The total number of possible cuboids depends on the number of dimensions and the number of levels in each dimension.

• The formula is:

  Total cuboids = ∏ (i = 1 to n) of (L_i + 1)

  where n is the number of dimensions and L_i is the number of levels for dimension i; the "+1" accounts for the top-most "all" level of each dimension.

Example:

• If there are 10 dimensions, each with 4 levels (5 counting the "all" level), the total number of cuboids is 5^10 ≈ 9.8 million.

• The storage required would be massive if we tried to store all of them.

Conclusion:

• Storing all possible cuboids for a data cube is unrealistic due to the storage requirements.

• Instead, it is better to materialize only some cuboids to balance query speed against storage.
groups of data.
Data Generalization:

• Data generalization is about summarizing data by replacing specific values with broader concepts.
  o Example: Instead of using a person's exact age (e.g., 25), you could generalize it to a category like "young" or "middle-aged".

• It can also reduce the number of dimensions when summarizing data, for instance by removing unnecessary details (e.g., birth date and phone number) when studying student behavior.

Benefits of Generalization:

• Generalizing data makes it easier to describe concepts in simpler and more abstract terms.

• For example, instead of focusing on individual customer transactions, sales managers might prefer to look at sales by customer groups (based on region, income, or purchase frequency).

Concept Description:

• Concept description refers to the process of summarizing or characterizing a collection of data (e.g., frequent buyers, graduate students).

• It is not just about listing the data but about generating descriptions that help in understanding or comparing data.

• Characterization gives a concise summary of the data collection, while comparison (also called discrimination) compares different groups of data.

Challenges with Data Cubes:

• Data cubes (used in OLAP systems) are a simplified way of analyzing data by organizing it into dimensions and measures.

• However, data cubes mainly handle numeric data and simple non-numeric data. They don't easily support complex data types like spatial data, text, images, or collections of data.

Complex Data Types and Aggregation:

• OLAP systems may struggle with more complex data types (like spatial data or images) or with aggregations over non-numeric data.

• Concept description should ideally cover these complex data types and their relationships.

User Control vs. Automation:

• In OLAP, users generally control the analysis by choosing dimensions and applying operations like drill-down or roll-up to explore the data.

• Users need to understand the dimensions and may have to perform many operations to get a useful result.

• It would be more efficient to have a more automated process that helps users decide which dimensions to analyze and how far the data should be generalized to produce interesting summaries.

Attribute-Oriented Induction:

• The attribute-oriented induction (AOI) method offers an alternative approach to concept description.

• It handles complex data types better and uses a data-driven generalization process to create summaries.

• After generalizing the data, the resulting dataset is merged (if necessary) to reduce its size and is presented in the form of charts or rules.

3. AOI for Characterization (Describing Data):
requiring extensive user input.
o Characterization is the process of summarizing a data
Attribute-Oriented Induction (AOI): collection (e.g., describing the "Science Students" at a
university).
1. Introduction to AOI:
o A typical data mining query might look at attributes like
o AOI was introduced in 1989 as a method for concept
name, gender, major, and GPA, specifically for graduate
description, before the data cube approach became
students.
popular.
4. Data Focusing (Selecting Relevant Data):
o Data cubes are used for offline aggregation
(precomputing data), while AOI is an online, query-based o Data focusing means selecting relevant data based on the
method for data analysis that generalizes data based on query. This makes the analysis more efficient and
user queries. meaningful.

2. How AOI Works: o The user needs to specify which attributes to include for
mining, but sometimes it's hard to know all the relevant
o The general idea of AOI is to:
attributes.
1. Collect relevant data using a database query.
o The system can help by automatically including related
2. Perform generalization by looking at the distinct attributes (e.g., adding country and province to "city" for
values of attributes (like age or gender) in the generalizing location).
data.
5. Avoiding Irrelevant Attributes:
3. Generalization is done by either removing
o If too many attributes are selected, some may not
attributes that don't add value or generalizing
contribute to an interesting description. Attribute
them to a broader category.
relevance analysis can help filter out irrelevant attributes.
o For example, a user might unintentionally select too many too many tuples remain, generalization
attributes, which can be narrowed down using statistical continues).
methods to only those that matter.
8. Merging and Aggregating Data:
6. Generalization of Attributes:
o After generalizing, identical tuples (rows) are merged. The
o Generalization can either be: count of these tuples is accumulated so that one tuple
represents the group of similar tuples.
1. Attribute removal – Remove attributes if they
have too many distinct values and cannot be o Other aggregate functions like sum() or average() can also
generalized. be applied to numeric attributes (e.g., summing sales
numbers).
2. Attribute generalization – Apply a generalization
operation to attributes that can be generalized 9. Example of Attribute-Oriented Induction:
(e.g., changing specific ages to age ranges).
o Example for student data:
o The goal of generalization is to make the data more
1. Name – Removed because it has many distinct
concise and to help identify broader trends.
values.
7. Controlling the Level of Generalization:
2. Gender – Retained because it has only two
o It's important to control how much an attribute is values.
generalized. If generalized too much, it can lose important
3. Major – Generalized into categories like "arts &
details (overgeneralization); if not enough, it can remain
sciences" or "engineering."
too detailed (undergeneralization).
4. Birth place – Generalized to a higher level (e.g.,
o Two techniques help control generalization:
country) if there are too many distinct cities.
1. Attribute threshold control – Defines how many
5. GPA – Generalized into categories like "excellent,"
distinct values an attribute can have before it is
"very good."
generalized.
10. Result of Generalization:
2. Generalized relation threshold control – Limits
how small the generalized data set should be (if
o After generalizing, tuples that are identical (after o Aggregate counts: Combine identical generalized tuples
generalization) are merged into one, and their counts are and accumulate their counts (and any other aggregation
summed up. values like sum, average).

o The final result is a generalized relation, which is smaller, Efficient Implementation:


more concise, and easier to analyze.
• Method 1: For each generalized tuple, use a binary search to find
Attribute-Oriented Induction Process: its location in a sorted list and either update its count or insert it.

• Step 1: Collect task-relevant data from the database by using a • Method 2: Store the generalized tuples in a multi-dimensional
query (this step is efficient due to database optimization). array where each dimension represents a generalized attribute,
making insertion and aggregation faster.
• Step 2:
Complexity:
o Scan the data: Collect statistics like distinct values for
each attribute in the data. • If Method 1 is used, the time complexity is O(|W| × logp) (where
|W| is the number of tuples and p is the number of generalized
o Determine which attributes to remove: If an attribute
tuples).
has too many distinct values and can’t be generalized, it is
removed. • If Method 2 is used, the time complexity is O(N) (where N is the
number of tuples), making it more efficient.
o Set generalization levels: Decide the level of
generalization (how much the values should be Handling Large Data:
abstracted) based on the number of distinct values for
• For large datasets, the system may not need to scan all of the
each attribute.
data; instead, it can work with a sample to get the necessary
o Identify mapping pairs: For each attribute, map the statistics.
values to their generalized values at the selected level.
Dynamic Attribute Selection:
• Step 3: Create the generalized relation (Prime relation) by:
• Users may not know which attributes are truly relevant, so the
o Substitute values: Replace original values in the data with system might need to dynamically test and select the most
their generalized values. important attributes using statistical methods or correlation
analysis.

Presentation of Results:
• The results of generalization can be shown in various forms like: • Step 3: Synchronous Generalization:

o Generalized relations: Tables summarizing the data. o The target class is generalized (abstracted) to a certain
level for comparison (e.g., city, state, country).
o Visualizations: Pie charts, bar charts, etc.
o The contrasting class is generalized to the same level as
o Quantitative rules: Showing how different combinations
the target class, so the comparison is fair.
of values are distributed in the generalized relation.
• Step 4: Presentation: The comparison is presented in the form of
Class Comparison:
tables, graphs, or rules, and a contrasting measure (e.g.,
• Sometimes, users want to compare or distinguish one class (or percentage count) is used to show the differences between the
concept) from other classes instead of just describing a single target and contrasting classes.
class.
Example:
• A class comparison mines descriptions that show the differences
• Mining a Class Comparison: Suppose we want to compare
between a target class (like graduate students) and contrasting
graduate and undergraduate students based on attributes like
classes (like undergraduate students).
name, gender, GPA, etc.
• For comparison, the classes must be comparable, meaning they
o The database query is run to get data for graduate and
share similar attributes or characteristics. For example, sales in
undergraduate students.
2009 and 2010 are comparable, but "person", "address", and
"item" are not. o Irrelevant attributes (e.g., name, gender) are removed,
and only the important ones (e.g., GPA, major) are kept.
Class Comparison Process:
o The data for both classes is generalized to the same level,
• Step 1: Data Collection: Relevant data is gathered from the
and differences are analyzed (e.g., graduate students are
database and divided into a target class (e.g., graduate students)
generally older and have higher GPAs).
and one or more contrasting classes (e.g., undergraduate
students). o The results are presented with contrasting measures, like
the percentage of graduate students with a specific GPA
• Step 2: Dimension Relevance Analysis: If there are many
vs. the percentage of undergraduates.
dimensions (attributes), analyze which ones are most relevant to
the comparison. This helps in selecting only the important Comparison vs. Characterization:
attributes for further analysis.
• Attribute-oriented induction for class comparison differs from
data cube methods. It allows for generalization on-demand,
without needing to precompute a data cube.

• This method works with various types of data (not just relational
data), such as spatial or multimedia data.

• It can automatically filter out irrelevant attributes, but it may not


efficiently support "drilling down" to deeper levels than those
provided in the generalized data.

• Combining this method with data cube technology can offer a


balance between precomputed results and the flexibility of on-
demand generalization.

Summary:

• Attribute-oriented induction is a flexible method for comparing


classes based on generalized data.

• It allows users to compare classes on relevant dimensions and


present the results visually for easy interpretation.

• It can be applied to a variety of data types and doesn't require


precomputing data cubes, making it suitable for online queries.

You might also like