DM Unit 2

Data warehousing helps businesses collect, organize, and analyze data for strategic decision-making. Organizations have diverse and distributed databases, and integrating such data directly for frequent queries can be complex, slow, and costly.

Definition of a Data Warehouse:
• It's a central data repository separate from everyday operational databases.
• It combines and organizes data from various sources (e.g., databases, files) for easy access and analysis.

Key Features of a Data Warehouse:
• Subject-oriented: Focuses on key areas like customers, sales, and products for decision-making.
• Integrated: Combines data from multiple sources, ensuring consistency in names, formats, and measures.
• Time-variant: Stores historical data (e.g., from the last 5-10 years) to analyze changes over time.
• Nonvolatile: Data remains stable and is updated in bulk, without ongoing changes from daily operations.

Benefits of Data Warehousing:
• Supports decision-making by providing a clear, consolidated view of data.
• Helps analyze customer behavior, product performance, and operational efficiency.
• Aids in managing customer relationships and corporate resources effectively.

Advantages of Data Warehousing over Traditional Approaches:
• Traditional methods rely on real-time query translation across databases, which is slow and resource-intensive.
• Data warehouses use a pre-loaded approach, where data is processed and stored in advance for quick access.

Performance Benefits:
• Queries in a data warehouse are faster because the data is already prepared and summarized.
• Querying the warehouse doesn't disrupt local operational databases.

Support for Advanced Analysis:
• Stores historical data to support detailed, multi-dimensional queries.
• Useful for businesses to identify trends, optimize strategies, and make informed decisions.

Differences Between OLTP and OLAP Systems:

1. Purpose:
 o OLTP (Online Transaction Processing): Handles day-to-day operations like banking, inventory, payroll, and registration.
 o OLAP (Online Analytical Processing): Focuses on data analysis and decision-making, supporting tasks like reporting and strategic planning.

2. User Orientation:
 o OLTP: Customer-focused, used by clerks, clients, and IT professionals.
 o OLAP: Market-focused, used by managers, executives, and analysts.

3. Data Type:
 o OLTP: Manages current, detailed data that supports transactions.
 o OLAP: Handles historic, summarized data that supports analysis and decision-making.

4. Database Design:
 o OLTP: Uses an ER (Entity-Relationship) model with application-specific designs.
 o OLAP: Uses star or snowflake models with a subject-oriented approach.

5. Data View:
 o OLTP: Focuses on current, detailed data from a single department or enterprise.
 o OLAP: Combines data across departments or organizations, often in a summarized, multidimensional format.

6. Access Patterns:
 o OLTP: Handles short, simple read/write transactions with a need for concurrency and recovery controls.
 o OLAP: Mostly read-only complex queries, accessing large datasets for analysis.

7. Data Volume and Users:
 o OLTP: Deals with smaller datasets (up to high-order gigabytes) and has thousands of users.
 o OLAP: Manages large datasets (in terabytes) with hundreds of users.

8. Performance and Metrics:
 o OLTP: Prioritizes high transaction throughput and availability.
 o OLAP: Focuses on flexible queries and faster response times for complex analyses.

9. Summary of Operations:
 o OLTP: Supports high-speed transactions using indexing and primary keys.
 o OLAP: Performs heavy scans for large-scale data analysis.

10. Example Comparison:
 o OLTP: Processing a payment, updating an inventory record.
 o OLAP: Generating a sales report for the past year, analyzing customer trends.
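The access-pattern contrast above can be sketched with SQLite: the OLTP statement touches one record inside a short transaction, while the OLAP query scans and aggregates many records. The `sales` table, its columns, and its values are hypothetical.

```python
import sqlite3

# Hypothetical schema for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (city TEXT, item TEXT, year INT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Delhi", "phone", 2023, 120.0),
    ("Delhi", "laptop", 2023, 900.0),
    ("Mumbai", "phone", 2024, 150.0),
])

# OLTP-style access: a short read/write transaction updating one record.
with con:
    con.execute(
        "UPDATE sales SET amount = amount + 10 "
        "WHERE city = 'Delhi' AND item = 'phone'"
    )

# OLAP-style access: a read-only aggregate scanning many records.
rows = con.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(rows)  # [(2023, 1030.0), (2024, 150.0)]
```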
Why Have a Separate Data Warehouse? (Simplified in Points)

1. Performance Differences:
 o Operational databases (OLTP) are designed for fast, simple tasks like searching, updating records, and processing transactions.
 o OLAP queries are complex, requiring large data analysis, summarization, and multidimensional views. Running OLAP on operational databases would slow down their performance.

2. Concurrency Issues:
 o OLTP systems handle multiple users and transactions simultaneously, using mechanisms like locking and logging to ensure data consistency.
 o OLAP queries are mostly read-only but can interfere with these mechanisms, reducing the speed and reliability of transactional systems.

3. Different Data Needs:
 o Operational databases store current, raw, and detailed data for immediate use.
 o Data warehouses store historic, summarized, and integrated data for decision-making and analysis.

4. Data Consolidation:
 o Decision-making requires clean, high-quality, and consolidated data from multiple sources.
 o Operational databases don't typically maintain or integrate data in this way, making them unsuitable for decision support.

5. Distinct Functions:
 o OLTP focuses on running the day-to-day operations.
 o OLAP is designed for long-term strategic analysis.

6. Emerging Trends:
 o Some database vendors are optimizing operational systems to handle OLAP queries.
 o In the future, the separation between OLTP and OLAP systems may shrink, but for now, separate systems are necessary.

Explanation of the Three-Tier Data Warehousing Architecture:

1. Bottom Tier: Data Warehouse Server
 o Purpose: Collects, integrates, and stores data.
 o Components:
  ▪ Data is extracted from operational databases and external sources.
  ▪ Processes like Extract, Clean, Transform, Load, and Refresh ensure data is accurate and unified.
  ▪ A metadata repository stores information about the data warehouse structure, schema, and contents.
 o Tools Used: Relational database management systems (RDBMS) and gateways like ODBC, OLEDB, and JDBC.
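The Extract, Clean, Transform, and Load steps above can be sketched as a minimal pipeline. The record layout, field names, and cleaning rule are assumptions for illustration, not a prescribed implementation.

```python
# Hypothetical source records; the second one is deliberately dirty.
raw = [
    {"name": " Alice ", "city": "delhi", "amount": "120"},
    {"name": "Bob", "city": "MUMBAI", "amount": "bad"},
]

def extract(records):
    # Extraction: in practice this reads from operational DBs or files.
    return list(records)

def clean(records):
    # Cleaning: drop records whose amount is not numeric.
    return [r for r in records if r["amount"].strip().isdigit()]

def transform(records):
    # Transformation: normalize text fields and convert types.
    return [
        {"name": r["name"].strip(),
         "city": r["city"].title(),
         "amount": int(r["amount"])}
        for r in records
    ]

def load(records, warehouse):
    # Loading: place the unified records into the warehouse store.
    warehouse.extend(records)
    return warehouse

warehouse = load(transform(clean(extract(raw))), [])
print(warehouse)  # [{'name': 'Alice', 'city': 'Delhi', 'amount': 120}]
```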
2. Middle Tier: OLAP Server
 o Purpose: Handles data analysis and provides multidimensional views of data.
  ▪ ROLAP (Relational OLAP): Uses relational databases for multidimensional data queries.
  ▪ MOLAP (Multidimensional OLAP): Uses specialized servers for efficient multidimensional data processing.
 o Connects the backend (data warehouse) to the front-end tools.

3. Top Tier: Front-End Tools
 o Purpose: Provides user-friendly interfaces for accessing and analyzing data.
 o Tools Include:
  ▪ Query/Report Tools: For generating reports.
  ▪ Analysis Tools: For understanding patterns and trends.
  ▪ Data Mining Tools: For predictive and trend analysis.
 o Output: Visualizations, summaries, or insights used for decision-making.

4. Flow of Data:
 o Data moves from the bottom tier (data sources) through extraction and processing to the middle tier (OLAP server), and finally to the top tier (end-user tools).
 o The architecture combines diverse data sources and processes them for user-focused analysis, ensuring efficient data management and decision support.

5. Purpose of the Architecture:
 o Ensures efficient handling, processing, and use of data for decision-making.
 o Each layer has a distinct role, improving performance and flexibility.

Data Warehouse Models and Approaches Explained Simply

Three Data Warehouse Models

1. Enterprise Warehouse:
 o Stores all organization-wide information across all subjects (e.g., sales, customers, inventory).
 o Integrates data from multiple operational systems and external sources.
 o Contains both detailed and summarized data.
 o Typically large (gigabytes to terabytes or more) and can take years to design and build.

Virtual Warehouse:
 o A collection of views created over operational databases.
 o Quick and easy to set up.
 o Disadvantages: requires extra capacity on the operational database servers.

Recommended Development Approach:

1. Define a high-level corporate data model (in 1-2 months) to create a consistent, integrated view of data.
2. Build independent data marts in parallel with the enterprise warehouse based on the corporate model.
3. Use hub servers to integrate distributed data marts into a cohesive system.
4. Construct a multitier warehouse where the enterprise warehouse acts as the central data repository, distributing data to dependent data marts.

Key Functions in Data Warehouse Systems:

1. Data Extraction: Collects data from different sources, which can be external and varied.
2. Data Cleaning: Identifies and fixes errors in the data, improving its quality.
3. Data Transformation: Converts the data from its original format into a format suitable for the warehouse.
4. Data Loading: Organizes and processes data, ensuring it is correctly placed in the warehouse and optimized for retrieval.

Importance of Data Cleaning and Transformation:
• These processes improve data quality, which is crucial for better data mining results (as discussed in Chapter 3).

Metadata Repository:
• Metadata: Data about the data in the warehouse. It defines the warehouse objects.

Key Elements of Metadata:

1. Warehouse Structure Description: Includes data names, warehouse schema, dimensions, and data marts.
2. Operational Metadata: Tracks the history of data migration, transformations applied, and the status of data (active or archived).
3. Summarization Algorithms: Defines how data is summarized, grouped, and aggregated.
4. Mapping Information: Describes how data in the operational environment relates to the data warehouse, including data extraction rules and security.
5. System Performance Data: Includes details about indices and profiles that speed up data retrieval, along with refresh/update schedules.
6. Business Metadata: Contains business-related terms, data ownership, and policies.

Importance of Metadata:
• Data Directory: Helps users find specific data within the warehouse.
• Data Mapping Guide: Shows how data moves from the source to the warehouse.
• Summarization Guide: Helps with the algorithms used to summarize detailed data at different levels.
• Persistence: Metadata should be stored securely and remain accessible on disk.
• The warehouse stores different types of data:
 o Current Detailed Data: Stored on disk.
 o Older Data: Stored on tertiary storage.
 o Summarized Data: Can be lightly or highly summarized and may or may not be physically stored.

Overview of Data Warehouse Modeling with Data Cubes:

1. Multidimensional Data Model: Data in data warehouses is structured in a way that allows it to be viewed from multiple dimensions (perspectives), often using a data cube.

2. Dimensions: These are different viewpoints or entities that an organization uses to analyze data. For example, in a sales data warehouse, the dimensions might include:
 o Time (e.g., months, quarters)
 o Item (e.g., type of product sold)
 o Branch (e.g., specific store branches)
 o Location (e.g., cities or regions)

3. Dimension Tables: Each dimension has a related table (called a dimension table) that describes it in more detail. For instance, the "item" dimension table might list item names, brands, and types.

4. Fact Table: A central table in a data warehouse that stores facts (numeric measures) like sales amount, units sold, or budgeted amounts. The fact table also connects to the dimension tables through keys.

5. Data Cube Concept: The data cube is a way of storing data in multiple dimensions (not just 3-D). It's often represented visually as a cube but can have any number of dimensions (n-dimensional).

6. 2-D Data Cube Example: Imagine a table showing sales data for items sold over time (e.g., quarterly sales). This table is a 2-D data cube, with the two dimensions being "time" and "item."

7. 3-D Data Cube: If we add another dimension (e.g., location), the data becomes 3-D. A 3-D cube can be viewed as multiple 2-D tables stacked together, each representing data for different locations (like sales in different cities).
8. 4-D and Higher: You can keep adding dimensions, such as a supplier dimension, making the data cube 4-D. This can get more complex, but it's still represented as a series of lower-dimensional cubes.

Cuboids:

9. Cuboids: A cuboid represents data for a specific combination of dimensions. For example, if you have four dimensions (time, item, location, and supplier), a cuboid might show summarized sales data for a particular combination of those dimensions.

10. Lattice of Cuboids: All possible combinations of dimensions form a lattice of cuboids, where each cuboid shows the data at different levels of detail (summarization). The collection of these cuboids is called a data cube.

Levels of Summarization:

11. Base Cuboid: The base cuboid holds the lowest level of detail. For example, a 4-D cuboid for time, item, location, and supplier would be the base cuboid.

12. Nonbase Cuboid: A nonbase cuboid represents summarized data. For example, a 3-D cuboid for time, item, and location, but summarized for all suppliers.

13. Apex Cuboid: The apex cuboid holds the highest level of summarization, often showing the total value for all dimensions. In the example, it could show total sales across all four dimensions.
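The lattice of cuboids can be sketched by enumerating every subset of a small set of dimensions: each subset keys one cuboid, the full set is the base cuboid, and the empty set is the apex cuboid. The sample facts and their values are assumptions.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical base facts over three dimensions (time, item, location).
facts = [
    ("Q1", "phone", "Delhi", 100),
    ("Q1", "phone", "Mumbai", 80),
    ("Q2", "laptop", "Delhi", 300),
]
dims = ("time", "item", "location")

# Every subset of the dimensions defines one cuboid; the 2^3 = 8 subsets
# together form the lattice.
lattice = {}
for k in range(len(dims) + 1):
    for idx in combinations(range(len(dims)), k):
        cuboid = defaultdict(int)
        for fact in facts:
            key = tuple(fact[i] for i in idx)  # keep only chosen dimensions
            cuboid[key] += fact[-1]            # sum the sales measure
        lattice[tuple(dims[i] for i in idx)] = dict(cuboid)

print(len(lattice))        # 8 cuboids in the lattice
print(lattice[()])         # apex cuboid: the total over everything
print(lattice[("time",)])  # summarized for all items and locations
```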
In summary:
• Data cubes allow data to be viewed in multiple dimensions (time, item, location, etc.).
• Cuboids represent different levels of data summarization.
• The base cuboid has the most detailed data, while the apex cuboid shows the highest-level summary.

Schemas for Multidimensional Data Models:

1. Entity-Relationship Model: This is used in relational databases for online transaction processing (OLTP). It focuses on entities and relationships between them.
2. Data Warehouse Schema: For a data warehouse, a multidimensional schema is needed to facilitate online data analysis. This schema is subject-oriented, meaning it's focused on specific business areas like sales or inventory.

Types of Schemas:

1. Star Schema:
• Structure:
 o A central fact table holds the main data (e.g., sales figures).
 o Around it are dimension tables that describe the dimensions (e.g., time, location, product).
• Features:
 o The fact table contains numeric values like dollars sold or units sold.
 o Dimension identifiers are system-generated keys linking the fact table to the dimension tables.

2. Snowflake Schema:
• Features:
 o Normalizes dimension tables into smaller tables to reduce redundancy.
 o Can reduce the system's performance as more joins are needed to execute queries.
 o Not as popular as the star schema due to these performance issues.
• Example:
 o In the snowflake schema, a single "item" dimension table is split into two smaller tables: one for item details and one for supplier details.
 o A "location" dimension might be split into separate "location" and "city" tables.

3. Fact Constellation Schema:
 o Multiple fact tables share common dimension tables.

4. Data Warehouse:
 o A data warehouse collects information for the entire organization, including subjects like sales, customers, and inventory. It typically uses the fact constellation schema to model multiple related subjects.

5. Data Mart:
 o A data mart is a smaller, department-specific subset of a data warehouse, focusing on a specific area (like sales or inventory).
 o For data marts, the star schema or snowflake schema is used, with the star schema being more popular because it's simpler and more efficient.

Summary:
• Star Schema: Simple and efficient, with a central fact table and smaller dimension tables.
• Snowflake Schema: A more complex version of the star schema with normalized dimension tables to reduce redundancy.
• Fact Constellation Schema: Multiple fact tables sharing common dimension tables, used for complex applications across the organization.

Concept Hierarchies:
 o Total Order: A strict ordering of levels, one above the other (e.g., street < city < state < country).
 o Partial Order: A more flexible order where some elements can be grouped in different ways (e.g., day < month, week < year).

3. Schema Hierarchy: Concept hierarchies like these (total or partial order) are known as schema hierarchies.
 o Some hierarchies (like time) are predefined in data systems, but users can also adjust them (e.g., defining a fiscal year that starts on April 1).

4. Set-Grouping Hierarchy: Hierarchies can also be based on value ranges.
 o Example: Price ranges might group values into categories like "cheap", "moderately priced", and "expensive" (e.g., $0-$50, $51-$100).

5. Manual or Automatic Creation:
 o Hierarchies can be specified manually by users or domain experts, or generated automatically based on statistical analysis of the data.
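A set-grouping hierarchy like the price ranges above can be sketched as a small mapping function; the exact cut-off points are assumptions echoing the example ranges.

```python
# Set-grouping hierarchy sketch: generalize numeric prices into ranges.
def price_category(price):
    if price <= 50:
        return "cheap"
    if price <= 100:
        return "moderately priced"
    return "expensive"

print([price_category(p) for p in (30, 75, 240)])
# ['cheap', 'moderately priced', 'expensive']
```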
Measures of a Data Cube:
 o Distributive: Measures that can be computed by partitioning the data and combining the partial results (e.g., sum(), count(), min(), max()).
 o Algebraic: Measures that require combining the results of distributive functions to calculate them.
  ▪ Example: Average = sum()/count(), where both sum and count are distributive functions.
 o Holistic: Measures that cannot be computed by a fixed set of rules and need more complex calculations.
  ▪ Examples: Median, Mode, Rank.

OLAP Operations:

2. Slice:
 • Description: This operation extracts a single layer of data from the multidimensional cube by fixing the value of one dimension.
 • Example: A "slice" operation where data for the time period "Q1" is extracted for all locations and items.

4. Pivot (Rotate):
 • Description: Rotates the data axes to provide an alternative presentation of the data.

Concept Hierarchies in OLAP:
• Concept hierarchies help organize data into multiple levels of detail (e.g., cities → states → countries).
• They allow users to view data from different perspectives in OLAP (Online Analytical Processing).
• OLAP allows interactive querying to analyze data, using operations on a data cube.

Data Warehouse Design:
• A data warehouse helps businesses by offering valuable information for better decision-making and competitive advantage.
• Data warehouses are used by businesses for reporting, decision-making, and analyzing trends.
• They help managers generate reports, make strategic decisions, and analyze large datasets to understand patterns and trends.
• Data warehouses evolve from basic reporting tools to more advanced analysis, such as multidimensional analysis and using data mining tools.

Types of Data Warehouse Applications:
• Information Processing: Basic querying and reporting.
• Analytical Processing: More advanced querying with multidimensional analysis (OLAP functions like slice-and-dice, drill-down).
• Data Mining: Advanced analysis to find hidden patterns and insights (e.g., associations, predictions).
• Multidimensional data mining enables the user to explore data from various perspectives, offering dynamic analysis through operations like slicing, dicing, and filtering.
• Data mining can go beyond OLAP by handling complex tasks like association, classification, and time-series analysis.

Why Data Warehouses Are Important for Data Mining:
• A data warehouse provides high-quality, cleaned, and organized data that makes data mining more effective.
• OLAP functions within data warehouses help users interactively explore data, which aids in finding valuable insights during the data mining process.

Data Warehouses & OLAP Queries:
• Data warehouses store large amounts of data.
• OLAP (Online Analytical Processing) queries need to be processed very quickly, usually in seconds.

Data Cubes:
• A data cube is a multi-dimensional representation of data, where each dimension (e.g., city, item, year) has a corresponding set of aggregations.
• Each possible combination of these dimensions forms a cuboid.
• For example, a data cube for sales might have dimensions like city, item, and year, and you want to compute the sum of sales for different combinations of these dimensions.

Efficient Computation:
• A compute cube operator in SQL can compute aggregates over all subsets of the specified dimensions.
• However, computing all combinations can take up a lot of storage, especially with many dimensions and large data sets.

Curse of Dimensionality:
• As the number of dimensions and levels in each dimension increases, the number of possible cuboids grows exponentially, making it impractical to store all of them.
• For instance, if each dimension has multiple levels (e.g., time might have "day", "month", "quarter", "year"), the number of cuboids increases even more.
• The curse of dimensionality means storage space and computation can explode as the dimensions grow.

Precomputing Cuboids:
• To speed up query processing, some cuboids are precomputed in advance.
• But if all cuboids are precomputed, it can take up too much storage.
• Instead, it is more practical to partially materialize the cuboids, meaning only some of the cuboids are computed and stored.

Formula for Total Cuboids:
• The total number of possible cuboids depends on the number of dimensions and the number of levels in each dimension.
• The formula for the total cuboids is:

  Total cuboids = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1)

  where L_i is the number of levels for dimension i (excluding the top "all" level), and we add 1 to include that all level.

Example:
• If there are 10 dimensions, each with 4 levels plus the "all" level (5 in total), the total number of cuboids is 5^10, about 9.8 million.
• The storage required for these cuboids could be massive if we try to store all of them.

Conclusion:
• Storing all possible cuboids for a data cube is unrealistic due to the storage requirements.
• Instead, it is better to materialize only some cuboids to balance speed and storage efficiency.
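The cuboid-count formula above is a one-liner to check; the level counts below are the ones from the example.

```python
import math

# Total cuboids = product over all n dimensions of (L_i + 1), where L_i is
# the number of levels of dimension i, not counting the top "all" level.
def total_cuboids(levels):
    return math.prod(l + 1 for l in levels)

# The example above: 10 dimensions, each with 4 levels plus "all" (5 each).
print(total_cuboids([4] * 10))  # 9765625, roughly 9.8 million
```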
Data Generalization:
• Data generalization is about summarizing data by replacing specific values with broader concepts.
 o Example: Instead of using a person's exact age (e.g., 25), you could generalize it to a category like "young" or "middle-aged".
• It can also reduce the number of dimensions when summarizing data, like removing unnecessary details (e.g., birth date and phone number) when studying student behavior.

Benefits of Generalization:
• Generalizing data makes it easier to describe concepts in simpler and more abstract terms.
• For example, instead of focusing on individual customer transactions, sales managers might prefer to look at sales by customer groups (based on region, income, or purchase frequency).

Concept Description:
• Concept description refers to the process of summarizing or characterizing a collection of data (e.g., frequent buyers, graduate students).
• It is not just about listing the data but generating descriptions that help in understanding or comparing data.
• Characterization gives a concise summary of the data collection, while comparison (also called discrimination) compares different groups of data.

Challenges with Data Cubes:
• Data cubes (used in OLAP systems) are a simplified way of analyzing data by organizing it into dimensions and measures.
• However, data cubes mainly handle numeric data and simple non-numeric data. They don't easily support complex data types like spatial data, text, images, or collections of data.

Complex Data Types and Aggregation:
• OLAP systems may struggle with handling more complex data types (like spatial data or images) or aggregations involving non-numeric data.
• Concept description should ideally include these complex data types and their relationships.

User Control vs. Automation:
• In OLAP, users generally control the analysis by choosing dimensions and applying operations like drill-down or roll-up to explore the data.
• Users need to understand the dimensions and may have to perform many operations to get a useful result.
• It would be more efficient if there was a more automated process to help users decide which dimensions to analyze and how much the data should be generalized to create interesting summaries.
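The generalization idea above (dropping irrelevant attributes and replacing exact values with broader concepts) can be sketched as follows; the age bands and field names are assumptions for illustration.

```python
# Generalization sketch: remove attributes irrelevant to the analysis
# (attribute removal) and abstract exact ages into broad categories.
def generalize(record):
    age = record["age"]
    band = "young" if age < 30 else "middle-aged" if age < 60 else "senior"
    # birth_date and phone are dropped; age is replaced by its category.
    return {"name": record["name"], "age_group": band}

student = {"name": "Alice", "age": 25,
           "birth_date": "1999-01-01", "phone": "555-0100"}
print(generalize(student))  # {'name': 'Alice', 'age_group': 'young'}
```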
Attribute-Oriented Induction:
• The attribute-oriented induction method offers an alternative approach to concept description.
• It works better for handling complex data types and uses a data-driven generalization process to create summaries without requiring extensive user input.

Attribute-Oriented Induction (AOI):

1. Introduction to AOI:
 o AOI was introduced in 1989 as a method for concept description, before the data cube approach became popular.
 o Data cubes are used for offline aggregation (precomputing data), while AOI is an online, query-based method for data analysis that generalizes data based on user queries.

2. How AOI Works:
 o The general idea of AOI is to:
  1. Collect relevant data using a database query.
  2. Perform generalization by looking at the distinct values of attributes (like age or gender) in the data.
  3. Generalize by either removing attributes that don't add value or generalizing them to a broader category.
  4. After generalizing the data, merge the resulting dataset (if necessary) to reduce its size and present it in the form of charts or rules.

3. AOI for Characterization (Describing Data):
 o Characterization is the process of summarizing a data collection (e.g., describing the "Science Students" at a university).
 o A typical data mining query might look at attributes like name, gender, major, and GPA, specifically for graduate students.

4. Data Focusing (Selecting Relevant Data):
 o Data focusing means selecting relevant data based on the query. This makes the analysis more efficient and meaningful.
 o The user needs to specify which attributes to include for mining, but sometimes it's hard to know all the relevant attributes.
 o The system can help by automatically including related attributes (e.g., adding country and province to "city" for generalizing location).

5. Avoiding Irrelevant Attributes:
 o If too many attributes are selected, some may not contribute to an interesting description. Attribute relevance analysis can help filter out irrelevant attributes.
 o For example, a user might unintentionally select too many attributes, which can be narrowed down using statistical methods to only those that matter.

6. Generalization of Attributes:
 o Generalization can either be:
  1. Attribute removal – Remove attributes if they have too many distinct values and cannot be generalized.
  2. Attribute generalization – Apply a generalization operation to attributes that can be generalized (e.g., changing specific ages to age ranges).
 o The goal of generalization is to make the data more concise and to help identify broader trends.

7. Controlling the Level of Generalization:
 o It's important to control how much an attribute is generalized. If generalized too much, it can lose important details (overgeneralization); if not enough, it can remain too detailed (undergeneralization).
 o Two techniques help control generalization:
  1. Attribute threshold control – Defines how many distinct values an attribute can have before it is generalized.
  2. Generalized relation threshold control – Limits how small the generalized data set should be (if too many tuples remain, generalization continues).

8. Merging and Aggregating Data:
 o After generalizing, identical tuples (rows) are merged. The count of these tuples is accumulated so that one tuple represents the group of similar tuples.
 o Other aggregate functions like sum() or average() can also be applied to numeric attributes (e.g., summing sales numbers).

9. Example of Attribute-Oriented Induction:
 o Example for student data:
  1. Name – Removed because it has many distinct values.
  2. Gender – Retained because it has only two values.
  3. Major – Generalized into categories like "arts & sciences" or "engineering."
  4. Birth place – Generalized to a higher level (e.g., country) if there are too many distinct cities.
  5. GPA – Generalized into categories like "excellent," "very good."

10. Result of Generalization:
 o After generalizing, tuples that are identical (after generalization) are merged into one, and their counts are summed up.
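The student example above can be sketched end to end: drop `name` (attribute removal), generalize `major` through a concept hierarchy (attribute generalization), then merge identical generalized tuples while accumulating counts. The hierarchy and the sample records are assumptions.

```python
from collections import Counter

# Hypothetical concept hierarchy for the 'major' attribute.
MAJOR_HIERARCHY = {
    "physics": "arts & sciences",
    "history": "arts & sciences",
    "cs": "engineering",
}

students = [
    {"name": "A", "gender": "F", "major": "physics"},
    {"name": "B", "gender": "F", "major": "cs"},
    {"name": "C", "gender": "F", "major": "history"},
]

# 'name' is simply not kept; identical generalized tuples are merged and
# counted, so one tuple stands for a whole group of similar students.
generalized = Counter(
    (s["gender"], MAJOR_HIERARCHY[s["major"]])
    for s in students
)
print(dict(generalized))
# {('F', 'arts & sciences'): 2, ('F', 'engineering'): 1}
```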
Basic Steps of the AOI Algorithm:

• Step 1: Collect task-relevant data from the database by using a query (this step is efficient due to database optimization).

• Step 2:
 o Scan the data: Collect statistics like distinct values for each attribute in the data.
 o Determine which attributes to remove: If an attribute has too many distinct values and can't be generalized, it is removed.
 o Set generalization levels: Decide the level of generalization (how much the values should be abstracted) based on the number of distinct values for each attribute.
 o Identify mapping pairs: For each attribute, map the values to their generalized values at the selected level.

• Step 3: Create the generalized relation (prime relation) by:
 o Substitute values: Replace original values in the data with their generalized values.
 o Aggregate counts: Combine identical generalized tuples and accumulate their counts (and any other aggregation values like sum, average).

• Method 1: Insert each generalized tuple into a sorted prime relation, using binary search to find matching tuples.
• Method 2: Store the generalized tuples in a multi-dimensional array where each dimension represents a generalized attribute, making insertion and aggregation faster.

Complexity:
• If Method 1 is used, the time complexity is O(|W| × log p), where |W| is the number of tuples and p is the number of generalized tuples.
• If Method 2 is used, the time complexity is O(N), where N is the number of tuples, making it more efficient.

Handling Large Data:
• For large datasets, the system may not need to scan all of the data; instead, it can work with a sample to get the necessary statistics.

Dynamic Attribute Selection:
• Users may not know which attributes are truly relevant, so the system might need to dynamically test and select the most important attributes using statistical methods or correlation analysis.
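Method 2 above can be sketched with a plain nested list standing in for the multi-dimensional array: one axis per generalized attribute, one O(1) cell update per input tuple. The attribute levels and tuples are assumptions.

```python
# Hypothetical levels of the two generalized attributes.
gender_levels = ["F", "M"]
major_levels = ["arts & sciences", "engineering"]

# counts[i][j] accumulates the count for (gender i, major j).
counts = [[0] * len(major_levels) for _ in gender_levels]

generalized_tuples = [
    ("F", "engineering"),
    ("M", "engineering"),
    ("F", "arts & sciences"),
]
for gender, major in generalized_tuples:
    # Each tuple is one direct array update, so the pass is O(N) overall
    # (a real implementation would precompute value->index dictionaries).
    counts[gender_levels.index(gender)][major_levels.index(major)] += 1

print(counts)  # [[1, 1], [0, 1]]
```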
Presentation of Results:
• The results of generalization can be shown in various forms like:
 o Generalized relations: Tables summarizing the data.
 o Visualizations: Pie charts, bar charts, etc.
 o Quantitative rules: Showing how different combinations of values are distributed in the generalized relation.

Class Comparison:
• Sometimes, users want to compare or distinguish one class (or concept) from other classes instead of just describing a single class.
• A class comparison mines descriptions that show the differences between a target class (like graduate students) and contrasting classes (like undergraduate students).
• For comparison, the classes must be comparable, meaning they share similar attributes or characteristics. For example, sales in 2009 and 2010 are comparable, but "person", "address", and "item" are not.

Class Comparison Process:
• Step 1: Data Collection: Relevant data is gathered from the database and divided into a target class (e.g., graduate students) and one or more contrasting classes (e.g., undergraduate students).
• Step 2: Dimension Relevance Analysis: If there are many dimensions (attributes), analyze which ones are most relevant to the comparison. This helps in selecting only the important attributes for further analysis.
• Step 3: Synchronous Generalization:
 o The target class is generalized (abstracted) to a certain level for comparison (e.g., city, state, country).
 o The contrasting class is generalized to the same level as the target class, so the comparison is fair.
• Step 4: Presentation: The comparison is presented in the form of tables, graphs, or rules, and a contrasting measure (e.g., percentage count) is used to show the differences between the target and contrasting classes.

Example:
• Mining a Class Comparison: Suppose we want to compare graduate and undergraduate students based on attributes like name, gender, GPA, etc.
 o The database query is run to get data for graduate and undergraduate students.
 o Irrelevant attributes (e.g., name, gender) are removed, and only the important ones (e.g., GPA, major) are kept.
 o The data for both classes is generalized to the same level, and differences are analyzed (e.g., graduate students are generally older and have higher GPAs).
 o The results are presented with contrasting measures, like the percentage of graduate students with a specific GPA vs. the percentage of undergraduates.

Comparison vs. Characterization:
• Attribute-oriented induction for class comparison differs from data cube methods. It allows for generalization on demand, without needing to precompute a data cube.
• This method works with various types of data (not just relational data), such as spatial or multimedia data.

Summary: