DW DM Notes
3 0 0 3
TOTAL: 45 PERIODS
TEXT BOOKS:
1. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining & OLAP”, Tata McGraw-Hill Edition, Tenth Reprint 2007.
2. Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Second Edition,
Elsevier, 2007.
REFERENCES:
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”, Pearson Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, “Insight into Data Mining Theory and Practice”, Eastern Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, “Introduction to Data Mining with Case Studies”, Eastern Economy Edition, Prentice Hall of India, 2006.
4. Daniel T. Larose, “Data Mining Methods and Models”, Wiley-Interscience, 2006.
UNIT I DATA WAREHOUSING 10
Data Warehousing Components – Building a Data Warehouse – Mapping the Data Warehouse to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction, Cleanup, and Transformation Tools – Metadata.
1. Source Systems
2. Data Staging Area
3. Presentation servers
The data travels from source systems to presentation servers via the data staging area. The entire
process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and
transfer). Oracle’s ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server’s
ETL tool is called Data Transformation Services (DTS).
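As a rough illustration of the ETL flow described above, the sketch below (plain Python with an invented record layout, not any vendor tool such as OWB or DTS) extracts rows from a hypothetical source system, applies a simple staging-area transformation, and loads the result into an in-memory presentation structure.

# Minimal ETL sketch: source -> staging (transform) -> presentation.
# The record layout and field names are illustrative assumptions.
source_system = [
    {"cust_id": "101", "name": " alice ", "amount": "250.00"},
    {"cust_id": "102", "name": "BOB", "amount": "99.50"},
]

def extract(records):
    """Extract: pull only the fields the warehouse needs."""
    return [{"cust_id": r["cust_id"], "name": r["name"], "amount": r["amount"]}
            for r in records]

def transform(records):
    """Staging-area transformation: clean names, convert types."""
    staged = []
    for r in records:
        staged.append({
            "cust_id": int(r["cust_id"]),
            "name": r["name"].strip().title(),
            "amount": float(r["amount"]),
        })
    return staged

def load(records, warehouse):
    """Load: append the transformed rows to the presentation store."""
    warehouse.extend(records)

warehouse_table = []
load(transform(extract(source_system)), warehouse_table)
print(warehouse_table)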
Figure: Typical data warehouse architecture, showing the operational data sources, load manager, warehouse manager, query manager, detailed data, lightly and highly summarized data, meta data, archive/backup storage, and end-user access tools.
Each component and the tasks performed by them are explained below:
1. OPERATIONAL DATA
The source of data for the data warehouse is supplied from:
(i) Data from mainframe systems in the traditional network and hierarchical format.
(ii) Data from relational DBMSs such as Oracle and Informix.
(iii) In addition to this internal data, operational data also includes external data obtained from commercial databases and from databases associated with suppliers and customers.
2. LOAD MANAGER
The load manager performs all the operations associated with extraction and loading of data into the data warehouse. These operations include simple transformations of the data to prepare the data
for entry into the warehouse. The size and complexity of this component will vary between data
warehouses and may be constructed using a combination of vendor data loading tools and
custom built programs.
3. WAREHOUSE MANAGER
The warehouse manager performs all the operations associated with the management of data in
the warehouse. This component is built using vendor data management tools and custom built
programs. The operations performed by warehouse manager include:
(i) Analysis of data to ensure consistency
(ii) Transformation and merging the source data from temporary storage into data
warehouse tables
(iii) Creation of indexes and views on the base tables
(iv) Denormalization
(v) Generation of aggregations
(vi) Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate.
4. QUERY MANAGER
The query manager performs all operations associated with management of user queries. This
component is usually constructed using vendor end-user access tools, data warehousing
monitoring tools, database facilities and custom built programs. The complexity of a query
manager is determined by facilities provided by the end-user access tools and database.
5. DETAILED DATA
This area of the warehouse stores all the detailed data in the database schema. In most cases the detailed data is not stored online but is aggregated to the next level of detail. However, detailed data is added regularly to the warehouse to supplement the aggregated data.
6. LIGHTLY AND HIGHLY SUMMARIZED DATA
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager. This area of the warehouse is transient
as it will be subject to change on an ongoing basis in order to respond to the changing query
profiles. The purpose of the summarized information is to speed up the query performance. The
summarized data is updated continuously as new data is loaded into the warehouse.
7. ARCHIVE AND BACK UP DATA
This area of the warehouse stores detailed and summarized data for the purpose of archiving and
back up. The data is transferred to storage archives such as magnetic tapes or optical disks.
8. META DATA
The data warehouse also stores all the Meta data (data about data) definitions used by all
processes in the warehouse. It is used for a variety of purposes, including:
(i) The extraction and loading process – Meta data is used to map data sources to a
common view of information within the warehouse.
(ii) The warehouse management process – Meta data is used to automate the
production of summary tables.
(iii) As part of Query Management process Meta data is used to direct a query to the
most appropriate data source.
The structure of Meta data will differ in each process, because the purpose is different. More
about Meta data will be discussed in the later Lecture Notes.
9. END-USER ACCESS TOOLS
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools.
The examples of some of the end user access tools can be:
(i) Reporting and Query Tools
(ii) Application Development Tools
(iii) Executive Information Systems Tools
(iv) Online Analytical Processing Tools
(v) Data Mining Tools
Three-tier Data warehouse architecture
The bottom tier is a warehouse database server, which is almost always a relational database system. The middle tier is an OLAP server, which is typically implemented using either (1) a relational OLAP (ROLAP) model or (2) a multidimensional OLAP (MOLAP) model. The
top tier is a client, which contains query and reporting tools, analysis tools, and/or data mining
tools (e.g., trend analysis, prediction, and so on).
From the architecture point of view, there are three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.
• Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is connected to specific, selected subjects. For
example, a marketing data mart may connect its subjects to customer, item, and sales.
The data contained in data marts tend to be summarized. Depending on the source of
data, data marts can be categorized into the following two classes:
(i).Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area.
(ii).Dependent data marts are sourced directly from enterprise data warehouses.
• Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason, it is
necessary to extract the relevant data from the operational database before bringing into the data
warehouse. Many commercial tools are available to help with the extraction process. Data
Junction is one of the commercial products. The user of one of these tools typically has an easy-
to-use windowed interface by which to specify the following:
(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done internally by
SQL Select statement.
(iii) What are those to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?
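The questions above map naturally onto a small extraction script. The sketch below uses Python's built-in sqlite3 module against a hypothetical sales table (table and column names are assumptions, not from the notes) to show how a tool answers “which table”, “which fields”, and “what they are called in the target”; scheduling (item v) is left to an external scheduler.

import sqlite3

# Build a toy source database so the example is self-contained.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (txn_id INTEGER, item TEXT, city TEXT, amount REAL, internal_flag TEXT)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                [(1, "computer", "Chennai", 45000.0, "x"),
                 (2, "printer", "Delhi", 8000.0, "y")])

# (i) which table, (ii) which fields, (iii) what they are called in the target.
rows = src.execute("SELECT item, city, amount AS sales_amount FROM sales").fetchall()

# (iv) target format: here just a list of dicts standing in for the warehouse load file.
extract_file = [{"item": i, "city": c, "sales_amount": a} for (i, c, a) in rows]
print(extract_file)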
TRANSFORM
The operational databases are developed according to their own sets of priorities, which keep changing with the requirements. Therefore those who develop a data warehouse based on these databases are typically faced with inconsistencies among their data sources. The transformation process deals with rectifying any such inconsistencies.
CLEANSING
Information quality is the key consideration in determining the value of the information. The
developer of the data warehouse is not usually in a position to change the quality of its
underlying historic data, though a data warehousing project can put a spotlight on the data quality
issues and lead to improvements for the future. It is, therefore, usually necessary to go through
the data entered into the data warehouse and make it as error free as possible. This process is
known as Data Cleansing.
Data Cleansing must deal with many types of possible errors. These include missing data and
incorrect data at one source; inconsistent data and conflicting data when two or more source are
involved. There are several algorithms followed to clean the data, which will be discussed in the
coming lecture notes.
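As a hedged sketch of the kinds of checks a cleansing routine applies, the snippet below flags missing values in one record and conflicting values when two records for the same entity disagree (the record layout is invented for illustration).

# Data cleansing sketch: detect missing and conflicting values.
source_a = {"cust_id": 11, "name": "Kristen Smith", "zip": None}      # missing zip
source_b = {"cust_id": 11, "name": "Kristen Smith", "zip": "77777"}

def find_problems(rec_a, rec_b):
    problems = []
    for field in rec_a:
        va, vb = rec_a.get(field), rec_b.get(field)
        if va is None or vb is None:
            problems.append(("missing", field))
        elif va != vb:
            problems.append(("conflict", field, va, vb))
    return problems

print(find_problems(source_a, source_b))   # [('missing', 'zip')]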
LOADING
Loading often implies physical movement of the data from the computer(s) storing the source
database(s) to that which will store the data warehouse database, assuming it is different. This
takes place immediately after the extraction phase. The most common channel for data
movement is a high-speed communication link. Ex: Oracle Warehouse Builder is the API from
Oracle, which provides the features to perform the ETL task on Oracle Data Warehouse.
Single-source problems
The data quality of a source largely depends on the degree to which it is governed by schema and
integrity constraints controlling permissible data values. For sources without schema, such as files, there
are few restrictions on what data can be entered and stored, giving rise to a high probability of errors and
inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (e.g.,
the relational approach requires simple attribute values, referential integrity, etc.) as well as application-
specific integrity constraints. Schema-related data quality problems thus occur because of the lack of
appropriate model-specific or application-specific integrity constraints, e.g., due to data model limitations
or poor schema design, or because only a few integrity constraints were defined to limit the overhead for
integrity control. Instance-specific problems relate to errors and inconsistencies that cannot be prevented
at the schema level (e.g., misspellings).
For both schema- and instance-level problems we can differentiate different problem scopes: attribute
(field), record, record type and source; examples for the various cases are shown in Tables 1 and 2. Note
that uniqueness constraints specified at the schema level do not prevent duplicated instances, e.g., if
information on the same real world entity is entered twice with different attribute values (see example in
Table 2).
Multi-source problems
The problems present in single sources are aggravated when multiple sources need to be integrated. Each
source may contain dirty data and the data in the sources may be represented differently, overlap or
contradict. This is because the sources are typically developed, deployed and maintained independently to
serve specific needs. This results in a large degree of heterogeneity w.r.t. data management systems, data
models, schema designs and the actual data.
At the schema level, data model and schema design differences are to be addressed by the steps of schema
translation and schema integration, respectively. The main problems w.r.t. schema design are naming and
structural conflicts. Naming conflicts arise when the same name is used for different objects (homonyms)
or different names are used for the same object (synonyms). Structural conflicts occur in many variations
and refer to different representations of the same object in different sources, e.g., attribute vs. table
representation, different component structure, different data types, different integrity constraints, etc. In
addition to schema-level conflicts, many conflicts appear only at the instance level (data conflicts). All
problems from the single-source case can occur with different representations in different sources (e.g.,
duplicated records, contradicting records,…). Furthermore, even when there are the same attribute names
and data types, there may be different value representations (e.g., for marital status) or different
interpretation of the values (e.g., measurement units Dollar vs. Euro) across sources. Moreover,
information in the sources may be provided at different aggregation levels (e.g., sales per product vs. sales
per product group) or refer to different points in time (e.g. current sales as of yesterday for source 1 vs. as
of last week for source 2).
A main problem for cleaning data from multiple sources is to identify overlapping data, in particular
matching records referring to the same real-world entity (e.g., customer). This problem is also referred to
as the object identity problem, duplicate elimination or the merge/purge problem. Frequently, the
information is only partially redundant and the sources may complement each other by providing
additional information about an entity. Thus duplicate information should be purged out and
complementing information should be consolidated and merged in order to achieve a consistent view of
real world entities.
The two sources in the example of Fig. 3 are both in relational format but exhibit schema and data
conflicts. At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno,
Sex/Gender) and structural conflicts (different representations for names and addresses). At the instance
level, we note that there are different gender representations (“0”/”1” vs. “F”/”M”) and presumably a
duplicate record (Kristen Smith). The latter observation also reveals that while Cid/Cno are both source-
specific identifiers, their contents are not comparable between the sources; different numbers (11/493)
may refer to the same person while different persons can have the same number (24). Solving these
problems requires both schema integration and data cleaning; the third table shows a possible solution.
Note that the schema conflicts should be resolved first to allow data cleaning, in particular detection of
duplicates based on a uniform representation of names and addresses, and matching of the Gender/Sex
values.
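Following the Customer/Client example, here is a rough sketch of the merge/purge idea: bring both sources to a uniform representation first (resolving the schema conflicts), then match records on the normalized fields and merge complementary information. The field names loosely mirror the example; the matching key and gender mapping are assumptions.

# Merge/purge sketch: normalize, match on a simple key, merge complementary fields.
customers = [{"Cid": 11, "Name": "Kristen Smith", "Sex": "0", "City": "Miami"}]
clients = [{"Cno": 493, "LastName": "Smith", "FirstName": "Kristen",
            "Gender": "F", "City": "Miami"}]

def normalize_customer(r):
    first, last = r["Name"].split(" ", 1)
    gender = {"0": "F", "1": "M"}[r["Sex"]]          # unify gender coding
    return {"first": first.lower(), "last": last.lower(),
            "gender": gender, "city": r["City"], "ids": {("Customer", r["Cid"])}}

def normalize_client(r):
    return {"first": r["FirstName"].lower(), "last": r["LastName"].lower(),
            "gender": r["Gender"], "city": r["City"], "ids": {("Client", r["Cno"])}}

def merge_if_duplicate(a, b):
    if (a["first"], a["last"], a["city"]) == (b["first"], b["last"], b["city"]):
        merged = dict(a)
        merged["ids"] = a["ids"] | b["ids"]          # keep both source identifiers
        return merged
    return None

print(merge_if_duplicate(normalize_customer(customers[0]),
                         normalize_client(clients[0])))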
Data analysis
Metadata reflected in schemas is typically insufficient to assess the data quality of a source, especially if
only a few integrity constraints are enforced. It is thus important to analyse the actual instances to obtain
real (reengineered) metadata on data characteristics or unusual value patterns. This metadata helps finding
data quality problems. Moreover, it can effectively contribute to identify attribute correspondences
between source schemas (schema matching), based on which automatic data transformations can be
derived.
There are two related approaches for data analysis, data profiling and data mining. Data profiling focuses
on the instance analysis of individual attributes. It derives information such as the data type, length, value
range, discrete values and their frequency, variance, uniqueness, occurrence of null values, typical string
pattern (e.g., for phone numbers), etc., providing an exact view of various quality aspects of the attribute.
Table 3 shows examples of how this metadata can help detecting data quality problems.
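A minimal data-profiling pass works one attribute (column) at a time and computes a few of the metadata items listed above: inferred type, min/max, distinct values and their frequencies, and the number of nulls. The sample column is invented.

from collections import Counter

def profile(values):
    """Instance-level profile of a single attribute (column)."""
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "frequency": Counter(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "inferred_type": type(non_null[0]).__name__ if non_null else "unknown",
    }

ages = [23, 31, 31, None, 27, 31]
print(profile(ages))
# {'null_count': 1, 'distinct': 3, 'frequency': Counter({31: 3, 23: 1, 27: 1}), ...}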
Data mining helps discover specific data patterns in large data sets, e.g., relationships holding between
several attributes. This is the focus of so-called descriptive data mining models including clustering,
summarization, association discovery and sequence discovery [10]. As shown in [28], integrity
constraints among attributes such as functional dependencies or application-specific “business rules” can
be derived, which can be used to complete missing values, correct illegal values and identify duplicate
records across data sources. For example, an association rule with high confidence can hint to data quality
problems in instances violating this rule. So a confidence of 99% for rule “total=quantity*unit price”
indicates that 1% of the records do not comply and may require closer examination.
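The “total = quantity * unit price” rule can be checked directly: measure what fraction of the records satisfy it and flag the violators for closer examination. The sample rows below are made up.

# Check compliance with the business rule total = quantity * unit_price.
rows = [
    {"quantity": 2, "unit_price": 10.0, "total": 20.0},
    {"quantity": 3, "unit_price": 5.0, "total": 15.0},
    {"quantity": 4, "unit_price": 2.5, "total": 9.0},   # violates the rule
]

violations = [r for r in rows if abs(r["quantity"] * r["unit_price"] - r["total"]) > 1e-9]
confidence = 1 - len(violations) / len(rows)
print(f"rule confidence = {confidence:.0%}, suspect rows = {violations}")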
Figure 4 (not reproduced here) defines the necessary data transformations to be applied to the first source. The transformation defines a view on
which further mappings can be performed. The transformation performs a schema restructuring with
additional attributes in the view obtained by splitting the name and address attributes of the source. The
required data extractions are achieved by UDFs (shown in boldface). The UDF implementations can
contain cleaning logic, e.g., to remove misspellings in city names or provide missing zip codes.
UDFs may still imply a substantial implementation effort and do not support all necessary schema
transformations. In particular, simple and frequently needed functions such as attribute splitting or
merging are not generically supported but need often to be re-implemented in application-specific
variations (see specific extract functions in Fig. 4).
Conflict resolution
A set of transformation steps has to be specified and executed to resolve the various schema- and instance-level data quality problems that are reflected in the data sources at hand. Several types of
transformations are to be performed on the individual data sources in order to deal with single-source
problems and to prepare for integration with other sources. In addition to a possible schema translation,
these preparatory steps typically include:
Extracting values from free-form attributes (attribute split): Free-form attributes often capture
multiple individual values that should be extracted to achieve a more precise representation and support
further cleaning steps such as instance matching and duplicate elimination. Typical examples are name
and address fields (Table 2, Fig. 3, Fig. 4). Required transformations in this step are reordering of values
within a field to deal with word transpositions, and value extraction for attribute splitting.
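A sketch of attribute splitting for free-form name and address fields is shown below; the patterns only handle the simple “First Last” and “street, city zip” shapes and are assumptions, whereas real tools ship much richer parsers.

import re

def split_name(full_name):
    """'Kristen Smith' -> ('Kristen', 'Smith'); crude: first token vs. the rest."""
    first, _, last = full_name.strip().partition(" ")
    return first, last

def split_address(address):
    """'2 Hurley Pl, South Fork MN 48503' -> (street, city, zip); very rough."""
    m = re.match(r"(.+),\s*(.+)\s+(\d{5})$", address.strip())
    return m.groups() if m else (address, None, None)

print(split_name("Kristen Smith"))
print(split_address("2 Hurley Pl, South Fork MN 48503"))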
Validation and correction: This step examines each source instance for data entry errors and tries to
correct them automatically as far as possible. Spell checking based on dictionary lookup is useful for
identifying and correcting misspellings. Furthermore, dictionaries on geographic names and zip codes
help to correct address data. Attribute dependencies (birthdate – age, total price – unit price / quantity,
city – phone area code,…) can be utilized to detect problems and substitute missing values or correct
wrong values.
Standardization: To facilitate instance matching and integration, attribute values should be converted
to a consistent and uniform format. For example, date and time entries should be brought into a specific
format; names and other string data should be converted to either upper or lower case, etc. Text data may
be condensed and unified by performing stemming, removing prefixes, suffixes, and stop words.
Furthermore, abbreviations and encoding schemes should consistently be resolved by consulting special
synonym dictionaries or applying predefined conversion rules. Dealing with multi-source problems
requires restructuring of schemas to achieve a schema integration, including steps such as splitting,
merging, folding and unfolding of attributes and tables. At the instance level, conflicting representations
need to be resolved and overlapping data must be dealt with. The duplicate elimination task is typically
performed after most other transformation and cleaning steps, especially after having cleaned single-
source errors and conflicting representations. It is performed either on two cleaned sources at a time or on
a single already integrated data set. Duplicate elimination requires first identifying (i.e., matching) similar records concerning the same real-world entity. In a second step, similar records are merged into one
record containing all relevant attributes without redundancy. Furthermore, redundant records are purged.
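A small standardization sketch in the spirit of the steps above: dates are brought to one format, strings to one case, and abbreviations resolved through a synonym dictionary. The dictionary and the accepted date formats are illustrative assumptions.

from datetime import datetime

ABBREVIATIONS = {"st.": "street", "rd.": "road", "mn": "minnesota"}  # toy synonym dictionary

def standardize_date(value, input_formats=("%d/%m/%Y", "%Y-%m-%d")):
    """Bring differently formatted dates into a single ISO representation."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable; leave for manual correction

def standardize_text(value):
    tokens = [ABBREVIATIONS.get(t.lower(), t.lower()) for t in value.split()]
    return " ".join(tokens)

print(standardize_date("12/01/2006"))             # '2006-01-12'
print(standardize_text("2 Hurley St. South Fork MN"))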
Tool support
ETL tools
A large number of commercial tools support the ETL process for data warehouses in a comprehensive
way, e.g., COPYMANAGER (InformationBuilders), DATASTAGE (Informix/Ardent), EXTRACT (ETI),
POWERMART (Informatica), DECISIONBASE (CA/Platinum), DATATRANSFORMATIONSERVICE
(Microsoft), METASUITE (Minerva/Carleton), SAGENTSOLUTIONPLATFORM (Sagent), and
WAREHOUSEADMINISTRATOR (SAS). They use a repository built on a DBMS to manage all metadata
about the data sources, target schemas, mappings, script programs, etc., in a uniform way. Schemas and
data are extracted from operational data sources via both native file and DBMS gateways as well as
standard interfaces such as ODBC and EDA. Data transformations are defined with an easy-to-use
graphical interface. To specify individual mapping steps, a proprietary rule language and a comprehensive
library of predefined conversion functions are typically provided. The tools also support reusing existing
transformation solutions, such as external C/C++ routines, by providing an interface to integrate them into
the internal transformation library. Transformation processing is carried out either by an engine that
interprets the specified transformations at runtime, or by compiled code. All engine-based tools (e.g.,
COPYMANAGER, DECISIONBASE, POWERMART, DATASTAGE, WAREHOUSEADMINISTRATOR), possess a
scheduler and support workflows with complex execution dependencies among mapping jobs. A
workflow may also invoke external tools, e.g., for specialized cleaning tasks such as name/address
cleaning or duplicate elimination. ETL tools typically have little built-in data cleaning capabilities but
allow the user to specify cleaning functionality via a proprietary API. There is usually no data analysis
support to automatically detect data errors and inconsistencies. However, users can implement such logic
with the metadata maintained and by determining content characteristics with the help of aggregation
functions (sum, count, min, max, median, variance, deviation,…). The provided transformation library
covers many data transformation and cleaning needs, such as data type conversions (e.g., date
reformatting), string functions (e.g., split, merge, replace, sub-string search), arithmetic, scientific and
statistical functions, etc. Extraction of values from free-form attributes is not completely automatic but the
user has to specify the delimiters separating sub-values. The rule languages typically cover if-then and
case constructs that help handling exceptions in data values, such as misspellings, abbreviations, missing
or cryptic values, and values outside of range. These problems can also be addressed by using a table
lookup construct and join functionality. Support for instance matching is typically restricted to the use of
the join construct and some simple string matching functions, e.g., exact or wildcard matching and
soundex. However, user-defined field matching functions as well as functions for correlating field
similarities can be programmed and added to the internal transformation library.
Metadata repository
Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes. A metadata repository should contain:
• A description of the structure of the data warehouse. This includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents;
• Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails);
• the algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports;
• The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, transformation rules and defaults, data refresh and purging rules, and security
(user authorization and access control).
• Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles; and
• Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
UNIT II BUSINESS ANALYSIS
Reporting and Query tools and Applications – Tool Categories – The Need for Applications –
Cognos Impromptu – Online Analytical Processing (OLAP) – Need – Multidimensional Data
Model – OLAP Guidelines – Multidimensional versus Multirelational OLAP – Categories of
Tools – OLAP Tools and the Internet.
The major distinguishing features between OLTP and OLAP are summarized as follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals. An
OLAP system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and
an application oriented database design. An OLAP system typically adopts either a star or
snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP
system often spans multiple versions of a database schema. OLAP systems also deal with
information that originates from different organizations, integrating information from many data
stores. Because of their huge volume, OLAP data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations although many could be complex
queries.
Table: Comparison between OLTP and OLAP systems.
Multidimensional Data Model
The most popular data model for data warehouses is a multidimensional model. This
model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Let's have a look at each of these schema types.
• Star schema: The star schema is a modeling paradigm in which the data warehouse
contains (1) a large central table (fact table), and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact table.
• Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake. The major
difference between the snowflake and star schema models is that the dimension tables of
the snowflake model may be kept in normalized form. Such tables are easy to maintain and also save storage space, because a dimension table can become very large when the dimensional structure is included as columns.
Figure: Fact constellation schema of a data warehouse for sales and shipping.
A Concept Hierarchy
Concept hierarchies allow data to be handled at varying levels of abstraction
1. Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing-up a
concept hierarchy for a dimension or by dimension reduction. Figure shows the result of a roll-up
operation performed on the central cube by climbing up the concept hierarchy for location. This
hierarchy was defined as the total order street < city < province or state < country.
2. Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping-down a concept hierarchy for a
dimension or introducing additional dimensions. Figure shows the result of a drill-down
operation performed on the central cube by stepping down a concept hierarchy for time defined
as day < month < quarter < year. Drill-down occurs by descending the time hierarchy from the
level of quarter to the more detailed level of month.
3. Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube. Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = “Q2”. The dice operation defines a
subcube by performing a selection on two or more dimensions.
4. Pivot (rotate): Pivot is a visualization operation which rotates the data axes in view in order
to provide an alternative presentation of the data. Figure shows a pivot operation where the item
and location axes in a 2-D slice are rotated.
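The four operations can be mimicked on a small fact table with pandas (an assumption of this sketch, not something the notes prescribe): roll-up is a group-by at a coarser level, slice is a selection on one dimension, and pivot is a reorientation of the axes.

import pandas as pd

# Toy fact table: one row per (time, location, item) cell.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Chicago", "Toronto", "Chicago", "Toronto"],
    "item":    ["computer", "computer", "phone", "phone"],
    "amount":  [400, 350, 300, 250],
})

# Roll-up along location: aggregate the city dimension away (dimension reduction).
rollup = sales.groupby(["quarter", "item"], as_index=False)["amount"].sum()

# Slice: select the sub-cube where time = "Q2".
slice_q2 = sales[sales["quarter"] == "Q2"]

# Pivot: rotate the axes of a 2-D view (item vs. city).
pivoted = sales.pivot_table(index="item", columns="city", values="amount", aggfunc="sum")

print(rollup, slice_q2, pivoted, sep="\n\n")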
On-Line Analytical Mining (OLAM) (also called OLAP mining), which integrates on-
line analytical processing (OLAP) with data mining and mining knowledge in multidimensional
databases, is particularly important for the following reasons.
Most data mining tools need to work on integrated, consistent, and cleaned data, which
requires costly data cleaning, data transformation and data integration as preprocessing steps. A
data warehouse constructed by such preprocessing serves as a valuable source of high quality
data for OLAP as well as for data mining.
Effective data mining needs exploratory data analysis. A user will often want to traverse
through a database, select portions of relevant data, analyze them at different granularities, and
present knowledge/results in different forms. On-line analytical mining provides facilities for
data mining on different subsets of data and at different levels of abstraction, by drilling,
pivoting, filtering, dicing and slicing on a data cube and on some intermediate data mining
results.
By integrating OLAP with multiple data mining functions, on-line analytical mining
provides users with the flexibility to select desired data mining functions and swap data mining
tasks dynamically.
• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray
et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
• Computation
• Partition arrays into chunks (a small subcube which fits in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in “multiway” by visiting cube cells in the order which
minimizes the # of times to visit each cell, and reduces memory access and
storage cost.
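As a hedged sketch of what the cube by operator above computes, the snippet below materializes the union of group-bys over every subset of the listed dimensions (the chunk-based multiway array optimization itself is not reproduced here; the sample tuples are invented).

from collections import defaultdict
from itertools import combinations

# (item, city, year, amount) tuples standing in for the SALES relation.
sales = [("computer", "Chennai", 2006, 1500),
         ("computer", "Delhi", 2006, 1200),
         ("printer", "Chennai", 2007, 300)]
dims = ("item", "city", "year")

cube = {}
for k in range(len(dims) + 1):
    for group in combinations(range(len(dims)), k):      # every subset of dimensions
        agg = defaultdict(int)
        for row in sales:
            key = tuple(row[i] for i in group)            # dropped dimensions act as 'ALL'
            agg[key] += row[3]
        cube[tuple(dims[i] for i in group)] = dict(agg)

print(cube[()])           # grand total: {(): 3000}
print(cube[("item",)])    # {('computer',): 2700, ('printer',): 300}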
The bitmap index is an alternative representation of the record ID (RID) list. In the
bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. If the domain of a given attribute consists of n values, then n bits are needed for each entry in the bitmap index.
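A minimal bitmap-index sketch follows: one bit vector per distinct attribute value, with bit i set when row i carries that value. The column data are invented.

# Build a bitmap index on one attribute of a small table.
city_column = ["Chennai", "Delhi", "Chennai", "Mumbai"]

bitmap = {}
for rid, value in enumerate(city_column):
    bitmap.setdefault(value, [0] * len(city_column))[rid] = 1

print(bitmap)
# {'Chennai': [1, 0, 1, 0], 'Delhi': [0, 1, 0, 0], 'Mumbai': [0, 0, 0, 1]}

# Predicates become bitwise logic over the vectors, e.g. city = 'Chennai' AND NOT city = 'Delhi':
chennai_not_delhi = [a & (1 - b) for a, b in zip(bitmap["Chennai"], bitmap["Delhi"])]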
The join indexing method gained popularity from its use in relational database query
processing. Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a relational
database. For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively.
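A sketch of building a join index for R(RID, A) and S(B, SID) on A = B, materializing the joinable (RID, SID) pairs over toy data:

# Join index: precompute the joinable (RID, SID) pairs of R and S on A = B.
R = [(1, "Sony"), (2, "LG"), (3, "Sony")]      # (RID, A)
S = [("Sony", 10), ("Samsung", 11)]            # (B, SID)

join_index = [(rid, sid) for (rid, a) in R for (b, sid) in S if a == b]
print(join_index)   # [(1, 10), (3, 10)]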
1. Determine which operations should be performed on the available cuboids. This involves
transforming any selection, projection, roll-up (group-by) and drill-down operations specified in
the query into corresponding SQL and/or OLAP operations. For example, slicing and dicing of a
data cube may correspond to selection and/or projection operations on a materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied. This
involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the set using knowledge of “dominance” relationships among the cuboids, estimating the cost of using each remaining cuboid, and selecting the cuboid with the least cost.
Data warehouse back-end tools and utilities
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These
tools and facilities include the following functions:
1. Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources;
2. Data cleaning, which detects errors in the data and rectifies them when possible;
3. Data transformation, which converts data from legacy or host format to warehouse
format;
4. Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions;
5. Refresh, which propagates the updates from the data sources to the warehouse.
6. Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a good set of data warehouse management tools.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefitting from the greater scalability of ROLAP and the faster computation of
MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational
databases, some relational and data warehousing firms (e.g., Red Brick) implement specialized
SQL servers which provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
MDDB versus RDBMS:
• Storage: an MDDB stores data in multidimensional arrays, whereas an RDBMS stores data in relations (tables).
• Inspection: direct inspection of an array gives a great deal of information; the same is not true of relations.
• Size: MDDBs can handle only limited-size databases (< 100 GB), while RDBMSs have a proven track record for handling VLDBs.
• Loading: MDDBs take long to load and update; highly volatile data are better handled by an RDBMS.
• Aggregation: MDDBs support aggregations better; RDBMSs are catching up with aggregate navigators.
• Investment: MDDBs require new investments and new skill sets; most enterprises have already made significant investments in RDBMS technology and skill sets.
• Complexity: an MDDB adds complexity to the overall system architecture; an RDBMS adds no additional complexity.
• Restrictions: MDDBs limit the number of fact and dimension tables; RDBMSs have no such restriction.
• Examples of MDDBs: Arbor Essbase, Brio Query Enterprise, Dimensional Insight DI Diver, Oracle Express Server.
• Examples of RDBMSs: IBM DB2, Microsoft SQL Server, Oracle RDBMS, Red Brick Systems Red Brick Warehouse.
UNIT III DATA MINING
What is Data?
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
1. Task-relevant data: This is the database portion to be investigated. For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada. In particular, you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the data relating to these customers, together with the attributes of interest, be retrieved. These are referred to as relevant attributes.
2. The kinds of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association, classification, clustering, or
evolution analysis. For instance, if studying the buying habits of customers in Canada, you may
choose to mine associations between customer profiles and the items that these customers like to
buy
3. Background knowledge: Users can specify background knowledge, or knowledge about the
domain to be mined. This knowledge is useful for guiding the knowledge discovery process, and
for evaluating the patterns found. There are several kinds of background knowledge.
4. Interestingness measures: These functions are used to separate uninteresting patterns from
knowledge. They may be used to guide the mining process, or after discovery, to evaluate the
discovered patterns. Different kinds of knowledge may have different interestingness measures.
5. Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation, such as rules, tables, charts, graphs, decision trees, and cubes.
Figure: Primitives for specifying a data mining task.
Knowledge Discovery in Databases or KDD
The architecture of a typical data mining system may have the following major components:
1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, or other kinds of information repositories on which data cleaning and integration techniques may be performed.
2. Database or data warehouse server. The database or data warehouse server is responsible
for fetching the relevant data, based on the user's data mining request.
3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate
the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may
also be included.
4. Data mining engine. This is essential to the data mining system and ideally consists of a set
of functional modules for tasks such as characterization, association analysis, classification,
evolution and deviation analysis.
5. Pattern evaluation module. This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search towards interesting patterns. It
may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the implementation
of the data mining method used.
6. Graphical user interface. This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results.
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories Descriptive and
Predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions. In
some cases, users may have no idea of which kinds of patterns in their data may be interesting,
and hence may like to search for several different kinds of patterns in parallel. Thus it is
important to have a data mining system that can mine multiple kinds of patterns to accommodate
different user expectations or applications. Furthermore, data mining systems should be able to
discover patterns at various granularities. To encourage interactive and exploratory mining, users
should be able to easily “play” with the output patterns, such as by mouse clicking. Operations
that can be specified by simple mouse clicks include adding or dropping a dimension (or an
attribute), swapping rows and columns (pivoting, or axis rotation), changing dimension
representations (e.g., from a 3-D cube to a sequence of 2-D cross tabulations, or crosstabs), or
using OLAP roll-up or drill-down operations along dimensions. Such operations allow data
patterns to be expressed from different angles of view and at multiple levels of abstraction.
Data mining systems should also allow users to specify hints to guide or focus the search for
interesting patterns. Since some patterns may not hold for all of the data in the database, a
measure of certainty or “trustworthiness” is usually associated with each discovered pattern. Data
mining functionalities, and the kinds of patterns they can discover, are described below.
Data can be associated with classes or concepts. For example, in the AllElectronics store, classes
of items for sale include computers and printers, and concepts of customers include bigSpenders
and budgetSpenders. It can be useful to describe individual classes and concepts in summarized,
concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived via (1) data characterization, by summarizing the
data of the class under study (often called the target class) in general terms, or (2) data
discrimination, by comparison of the target class with one or a set of comparative classes (often
called the contrasting classes), or (3) both data characterization and discrimination.
Association Analysis
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market
basket or transaction data analysis. More formally, association rules are of the form X ⇒ Y, i.e., A1 ∧ ... ∧ Am ⇒ B1 ∧ ... ∧ Bn, where Ai (for i ∈ {1, ..., m}) and Bj (for j ∈ {1, ..., n}) are attribute-value pairs. The association rule X ⇒ Y is interpreted as “database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y”.
A rule may express an association between more than one attribute, or predicate (e.g., age, income, and buys). Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, such a rule is referred to as a multidimensional association rule.
Suppose, as a marketing manager of AllElectronics, you would like to determine which items are
frequently purchased together within the same transactions. An example of such a rule is
contains(T, “computer”) ⇒ contains(T, “software”) [support = 1%, confidence = 50%]
meaning that if a transaction T contains “computer”, there is a 50% chance that it contains “software” as well, and 1% of all of the transactions contain both. This association rule involves a single attribute or predicate (i.e., contains) which repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules. Dropping the predicate notation, the above rule can be written simply as “computer ⇒ software [1%, 50%]”.
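Support and confidence of the computer ⇒ software rule can be computed directly from a list of transactions, as the sketch below shows (the transaction set is small and invented, so the numbers differ from the 1%/50% quoted in the text).

# Support and confidence of the rule contains(T, "computer") => contains(T, "software").
transactions = [
    {"computer", "software", "printer"},
    {"computer"},
    {"printer", "scanner"},
    {"computer", "software"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer_only = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)      # P(computer AND software)
confidence = both / computer_only       # P(software | computer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")   # 50%, 67%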
Classification is the process of finding a set of models (or functions) which describe and
distinguish data classes or concepts, for the purposes of being able to use the model to predict the
class of objects whose class label is unknown. The derived model is based on the analysis of a
set of training data (i.e., data objects whose class label is known). The derived model may be
represented in various forms, such as classification (IF-THEN) rules, decision trees,
mathematical formulae, or neural networks. A decision tree is a flow-chart-like tree structure,
where each node denotes a test on an attribute value, each branch represents an outcome of the
test, and tree leaves represent classes or class distributions. Decision trees can be easily
converted to classification rules. A neural network is a collection of linear threshold units that
can be trained to distinguish objects of different classes. Classification can be used for predicting
the class label of data objects. However, in many applications, one may like to predict some
missing or unavailable data values rather than class labels. This is usually the case when the
predicted values are numerical data, and is often specifically referred to as prediction. Although
prediction may refer to both data value prediction and class label prediction, it is usually
confined to data value prediction and thus is distinct from classification. Prediction also
encompasses the identification of distribution trends based on the available data. Classification
and prediction may need to be preceded by relevance analysis which attempts to identify at
tributes that do not contribute to the classification or prediction process.
Clustering Analysis
Clustering analyzes data objects without consulting a known class label. In general, the class
labels are not present in the training data simply because they are not known to begin with.
Clustering can be used to generate such labels. The objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is,
clusters of objects are formed so that objects within a cluster have high similarity in comparison
to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can
be viewed as a class of objects, from which rules can be derived.
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association,
classification, or clustering of time-related data, distinct features of such an analysis include
time-series data analysis, sequence or periodicity pattern matching, and similarity-based data
analysis.
Interestingness of Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules. This raises the question of which of these patterns are actually interesting.
A pattern is interesting if (1) it is easily understood by humans, (2) valid on new or test data with
some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if it
validates a hypothesis that the user sought to confirm. An interesting pattern represents
knowledge. Several objective measures of pattern interestingness exist. These are based on the
structure of discovered patterns and the statistics underlying them. An objective measure for
association rules of the form X ⇒ Y is rule support, representing the percentage of data samples
that the given rule satisfies. Another objective measure for association rules is confidence, which
assesses the degree of certainty of the detected association. It is defined as the conditional
probability that a pattern Y is true given that X is true. More formally, support and confidence
are defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
Data mining is an interdisciplinary field, the confluence of a set of disciplines including database
systems, statistics, machine learning, visualization, and information science. Moreover,
depending on the data mining approach used, techniques from other disciplines may be applied,
such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive
logic programming, or high performance computing. Depending on the kinds of data to be mined
or on the given data mining application, the data mining system may also integrate techniques
from spatial data analysis, information retrieval, pattern recognition, image analysis, signal
processing, computer graphics, Web technology, economics, or psychology. Because of the
diversity of disciplines contributing to data mining, data mining research is expected to generate
a large variety of data mining systems. Therefore, it is necessary to provide a clear classification
of data mining systems. Such a classification may help potential users distinguish data mining
systems and identify those that best match their needs. Data mining systems can be categorized
according to various criteria, as follows._ Classification according to the kinds of databases
mined. A data mining system can be classified according to the kinds of databases mined.
Database systems themselves can be classified according to different criteria (such as data
models, or the types of data or applications involved), each of which may require its own data
mining technique. Data mining systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational, transactional,
object-oriented, object-relational, or data warehouse mining system. If classifying according to
the special types of data handled, we may have a spatial, time-series, text, or multimedia data
mining system, or a World-Wide Web mining system. Other system types include heterogeneous
data mining systems, and legacy data mining systems.
Classification according to the kinds of knowledge mined. Data mining systems can be
categorized according to the kinds of knowledge they mine, i.e., based on data mining
functionalities, such as characterization, discrimination, association, classification, clustering,
trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data
mining system usually provides multiple and/or integrated data mining functionalities. Moreover,
data mining systems can also be distinguished based on the granularity or levels of abstraction of
the knowledge mined, including generalized knowledge (at a high level of abstraction),
primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering
several levels of abstraction). An advanced data mining system should facilitate the discovery of
knowledge at multiple levels of abstraction.
2. Performance issues. These include efficiency, scalability, and parallelization of data mining
algorithms.
Efficiency and scalability of data mining algorithms.
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable. That is, the running time of a data mining algorithm
must be predictable and acceptable in large databases. Algorithms with exponential or even
medium-order polynomial complexity will not be of practical use. From a database perspective
on knowledge discovery, efficiency and scalability are key issues in the implementation of data
mining systems. Many of the issues discussed above under mining methodology and user-
interaction must also consider efficiency and scalability.
DataPreprocessing
Data cleaning.
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like “Unknown”. If missing values are replaced by, say, “Unknown”, then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common – that of “Unknown”. Hence, although this
method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average
income of All Electronics customers is $28,000. Use this value to replace the missing value for
income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction. For example, using
the other customer attributes in your data set, you may construct a decision tree to predict the
missing values for income.
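A sketch of methods 4 and 5 from the list above: a missing income is replaced either with the overall attribute mean or with the mean for tuples in the same class (credit risk category). The data are made up.

# Fill missing 'income' using the attribute mean (method 4)
# or the class-conditional mean (method 5).
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 26000, "risk": "low"},
    {"income": 60000, "risk": "high"},
    {"income": None, "risk": "low"},      # value to fill in
]

def mean(values):
    return sum(values) / len(values)

overall_mean = mean([c["income"] for c in customers if c["income"] is not None])
class_mean = mean([c["income"] for c in customers
                   if c["income"] is not None and c["risk"] == "low"])

for c in customers:
    if c["income"] is None:
        c["income"] = class_mean          # or overall_mean for method 4
print(overall_mean, class_mean, customers[-1])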
1. Binning methods:
Binning methods smooth a sorted data value by consulting the “neighborhood”, or values
around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing. Figure illustrates
some binning techniques.
In this example, the data for price are first sorted and partitioned into equi-depth bins (of
depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value
in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin boundaries. Each bin
value is then replaced by the closest boundary value.
(i). Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii). Partition into (equi-depth) bins:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii). Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29
(iv). Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
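The example above can be reproduced with a short routine, assuming equi-depth bins of depth 3 over the already-sorted values.

# Smoothing by bin means and by bin boundaries over equi-depth bins of depth 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]          # already sorted
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]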
2. Clustering:
Outliers may be detected by clustering, where similar values are organized into groups or
“clusters”. Intuitively, values which fall outside of the set of clusters may be considered outliers.
Figure: Outliers may be detected by clustering analysis.
3. Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the “best" line to fit two variables, so that one variable can be
used to predict the other. Multiple linear regression is an extension of linear regression, where
more than two variables are involved and the data are fit to a multidimensional surface.
There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references. For example, errors made
at data entry may be corrected by performing a paper trace. This may be coupled with routines
designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be
used to detect the violation of known data constraints. For example, known functional
dependencies between attributes can be used to find values contradicting the functional
constraints.
Data transformation.
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified range,
such as -1.0 to 1.0, or 0 to 1.0.
There are three main methods for data normalization : min-max normalization, z-score
normalization, and normalization by decimal scaling.
(i). Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute A. Min-max
normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing
v' = ((v − minA) / (maxA − minA)) * (new_maxA − new_minA) + new_minA
(ii). In z-score normalization (or zero-mean normalization), the values for an attribute A are
normalized based on the mean and standard deviation of A. A value v of A is normalized to v'
by computing
v' = (v − meanA) / std_devA
where meanA and std_devA are the mean and standard deviation, respectively, of attribute A.
This method of normalization is useful when the actual minimum and maximum of attribute A
are unknown, or when there are outliers which dominate the min-max normalization.
(iii). Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal places moved depends on the maximum absolute value of A.
A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
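The three normalization methods can be sketched as follows; the sample income value, range, mean, and standard deviation are made up for illustration.

import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linear map of v from [min_a, max_a] onto [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Zero-mean normalization
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    # Divide by 10^j, where j is the smallest integer with max(|v'|) < 1
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))   # about 0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling(917, 986))      # 0.917 (j = 3 for values up to 986)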
2. Smoothing, which works to remove the noise from data. Such techniques include binning,
clustering, and regression, as described under data cleaning above.
3. Aggregation, where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total amounts.
4. Generalization of the data, where low level or 'primitive' (raw) data are replaced by higher
level concepts through the use of concept hierarchies. For example, categorical attributes, like
street, can be generalized to higher level concepts, like city or county.
Data reduction.
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions
are detected and removed (attribute subset selection).
3. Data compression, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters instead
of the actual data), or nonparametric methods such as clustering, sampling, and the use of
histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at
multiple levels of abstraction, and are a powerful tool for data mining.
Attribute subset selection (dimension reduction):
– Select a minimum set of features such that the probability distribution of the different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
– Reduces the number of patterns discovered, making them easier to understand
Heuristic methods:
1. Step-wise forward selection: The procedure starts with an empty set of attributes. The best of
the original attributes is determined and added to the set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the set.
2. Step-wise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: At each step, the procedure
selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms, such as ID3 and C4.5, were originally
intended for classification. Decision tree induction constructs a flow-chart-like structure where
each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an
outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the
algorithm chooses the “best" attribute to
partition the data into individual classes.
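A hedged sketch of step-wise forward selection (method 1 above). The score() function stands in for whatever relevance measure is used (information gain, wrapper accuracy, and so on) and is assumed to be supplied by the caller; the toy scoring function here is purely illustrative.

def forward_selection(all_attrs, score, k):
    """Greedy step-wise forward selection of k attributes.

    score(subset) is a caller-supplied function returning a goodness
    value for a candidate attribute subset (higher is better)."""
    selected = []
    remaining = list(all_attrs)
    while remaining and len(selected) < k:
        # Pick the attribute whose addition gives the best score
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: score a subset by counting vowels in the attribute names
attrs = ["age", "income", "student", "credit_rating"]
toy_score = lambda subset: sum(ch in "aeiou" for a in subset for ch in a)
print(forward_selection(attrs, toy_score, 2))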
Data compression
Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector D, transforms it to a numerically different vector, D', of wavelet
coefficients. The two vectors are of the same length.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing
technique involving sines and cosines. In general, however, the DWT achieves better lossy
compression.
1. The length, L, of the input data vector must be an integer power of two. This condition
can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing,
such as a sum or weighted average. The second performs a weighted difference.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of
length L/2. In general, these respectively represent a smoothed version of the input data,
and the high-frequency content of it.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated
the wavelet coefficients of the transformed data.
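A minimal sketch of the pairwise smoothing/difference recursion described above, using simple Haar-like averages and differences; production DWT implementations use properly scaled filter coefficients, so treat this only as an illustration of the steps.

def haar_like_dwt(data):
    # Assumes len(data) is a power of two (pad with zeros beforehand if not)
    coeffs = []
    current = list(data)
    while len(current) > 1:
        smooth = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        detail = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = detail + coeffs   # keep the high-frequency content from each level
        current = smooth           # recurse on the smoothed, half-length data
    return current + coeffs        # overall average followed by the detail coefficients

print(haar_like_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0, 0.0, -1.0, -1.0, 0.0]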
Principal components analysis (PCA)
1. The input data are normalized, so that each attribute falls within the same range. This step
helps ensure that attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes N orthonormal vectors which provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These vectors are
referred to as the principal components. The input data are a linear combination of the principal
components.
3. The principal components are sorted in order of decreasing “significance" or strength. The
principal components essentially serve as a new set of axes for the data, providing important
information about variance.
4. Since the components are sorted according to decreasing order of “significance", the size of the
data can be reduced by eliminating the weaker components, i.e., those with low variance. Using
the strongest principal components, it should be possible to reconstruct a good approximation of
the original data.
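A compact numpy sketch of these four steps (normalize, find the orthonormal components, sort them by variance, keep the strongest and project); the small data matrix is made up for illustration.

import numpy as np

def pca_reduce(X, k):
    # 1. Normalize each attribute (zero mean, unit variance)
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Orthonormal basis: eigenvectors of the covariance matrix
    cov = np.cov(Xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort components by decreasing "significance" (variance)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order]
    # 4. Keep only the k strongest components and project the data onto them
    return Xn @ components[:, :k]

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_reduce(X, 1))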
Numerosity reduction
Regression and log-linear models can be used to approximate the given data. In linear
regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a
response variable), can be modeled as a linear function of another random variable, X (called a
predictor variable), with the equation
Y = α + β X
where the variance of Y is assumed to be constant, and α and β are regression coefficients
specifying the Y-intercept and the slope of the line. These coefficients can be solved for by the
method of least squares, which minimizes the error between the actual data points and the
estimate of the line.
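A short sketch of fitting Y = α + βX by least squares using the closed-form estimates; the data points are made up for illustration.

from statistics import mean

def least_squares(xs, ys):
    # beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), alpha = y_bar - beta * x_bar
    x_bar, y_bar = mean(xs), mean(ys)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
           sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

xs = [1, 2, 3, 4, 5]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
alpha, beta = least_squares(xs, ys)
print(alpha, beta)   # the fitted line approximates y = alpha + beta * x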
Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or
buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a bucket
typically reflects the average frequency of the values represented by the bucket.
1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (such as
the width of $10 for the buckets in Figure 3.8).
2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that,
roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same
number of contiguous data samples).
3. V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-
optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the
original values that each bucket represents, where bucket weight is equal to the number of values
in the bucket.
4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent
values. A bucket boundary is established between the pairs having the β − 1 largest
differences, where β, the number of buckets, is user-specified.
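Equi-width and equi-depth bucketing can be sketched as follows; the price list is made up for illustration (V-optimal and MaxDiff bucketing require a search over boundaries and are not shown).

def equi_width_buckets(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx].append(v)
    return buckets

def equi_depth_buckets(values, num_buckets):
    vals = sorted(values)
    depth = len(vals) // num_buckets
    return [vals[i * depth:(i + 1) * depth] if i < num_buckets - 1 else vals[i * depth:]
            for i in range(num_buckets)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 15, 15, 15, 18, 18, 20]
print(equi_width_buckets(prices, 2))   # buckets covering equal value ranges
print(equi_depth_buckets(prices, 2))   # buckets holding roughly equal counts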
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are “similar" to one another and “dissimilar" to
objects in other clusters. Similarity is commonly defined in terms of how “close" the objects are
in space, based on a distance function. The “quality" of a cluster may be represented by its
diameter, the maximum distance between any two objects in the cluster. Centroid distance is an
alternative measure of cluster quality, and is defined as the average distance of each cluster
object from the cluster centroid.
Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be
represented by a much smaller random sample (or subset) of the data. Suppose that a large data
set, D, contains N tuples. Let's have a look at some possible samples for D.
1. Simple random sample without replacement (SRSWOR) of size n: created by drawing n of the
N tuples from D (n < N), where every tuple has an equal chance of being drawn and, once
drawn, cannot be drawn again.
2. Simple random sample with replacement (SRSWR) of size n: similar to SRSWOR, except that
each time a tuple is drawn it is recorded and then replaced, so it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters”, then a SRS
of m clusters can be obtained, where m < M. For example, tuples in a database are usually
retrieved a page at a time, so that each page can be considered a cluster. A reduced data
representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster
sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called “strata", a stratified
sample of D is generated by obtaining a SRS at each stratum. This helps to ensure a
representative sample, especially when the data are skewed. For example, a stratified sample
may be obtained from customer data, where a stratum is created for each customer age group.
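The four kinds of samples can be sketched with the standard-library random module; the data set, page size of 100, and the age_group field are hypothetical.

import random
from collections import defaultdict

D = [{"id": i, "age_group": random.choice(["<30", "30-50", ">50"])} for i in range(1000)]
n = 50

# 1. SRSWOR: simple random sample without replacement
srswor = random.sample(D, n)

# 2. SRSWR: simple random sample with replacement
srswr = [random.choice(D) for _ in range(n)]

# 3. Cluster sample: treat each "page" of 100 tuples as a cluster, take an SRS of pages
pages = [D[i:i + 100] for i in range(0, len(D), 100)]
cluster_sample = [t for page in random.sample(pages, 2) for t in page]

# 4. Stratified sample: a proportional SRS within each age-group stratum
strata = defaultdict(list)
for t in D:
    strata[t["age_group"]].append(t)
stratified = []
for group, tuples in strata.items():
    share = max(1, round(n * len(tuples) / len(D)))
    stratified.extend(random.sample(tuples, share))

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))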
Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining Various
Kinds of Association Rules – Correlation Analysis – Constraint Based Association Mining –
Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian
Classification – Rule Based Classification – Classification by Back propagation – Support
Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods –
Prediction
Association Mining
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a
customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of
items
– E.g., 98% of people who purchase tires and auto accessories also get automotive
services done
• Applications
– * Maintenance Agreement (What the store should do to boost Maintenance
Agreement sales)
– Home Electronics * (What other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”
• Find all the rules X & Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction having {X, Y} also
contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

For the transactions above, let minimum support be 50% and minimum confidence 50%; we have
– A ⇒ C (support 50%, confidence 66.6%)
– C ⇒ A (support 50%, confidence 100%)
Association Rule Mining: A Road Map
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for an association rule
means that 2% of all the transactions under analysis show that computer and financial
management software are purchased together. A confidence of 60% means that 60% of the
customers who purchased a computer also bought the software. Typically, association rules are
considered interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold.
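The support and confidence figures for the small transaction table above can be checked with a few lines of Python; this is purely illustrative.

transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    # fraction of transactions containing every item in itemset
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability that a transaction with lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))        # 0.5   -> 50% support
print(confidence({"A"}, {"C"}))   # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}))   # 1.0   -> C => A has 100% confidence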
Apriori: the method that mines the complete set of frequent itemsets with candidate generation.
Apriori property: all nonempty subsets of a frequent itemset must also be frequent (equivalently,
any superset of an infrequent itemset cannot be frequent). The Apriori algorithm uses this
property to prune the candidate itemsets generated at each pass.
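A compact, illustrative Apriori sketch showing candidate generation and the pruning enabled by the Apriori property; this is a simplified version of the idea, not an optimized implementation.

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join L(k-1) with itself
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates having an infrequent (k-1)-subset (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count support and keep the frequent candidates
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= current
        k += 1
    return frequent

T = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(T, 0.5))   # frequent itemsets at 50% support: {A}, {B}, {C}, {A, C}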
FP-growth: the method that mines the complete set of frequent itemsets without candidate
generation. It compresses the database into an FP-tree, with a header table linking the
occurrences of each frequent item in the tree.
Benefits of the FP-tree structure:
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never be larger than the original database (not counting node-links and counts)
– Example: For Connect-4 DB, compression ratio could be over 100
Figure: a concept hierarchy for items (Food at the top level, Milk and Bread at the next level,
and brands such as Fraser and Sunset at the lowest level), used with the encoded transaction
table below.
TID Items
T1 {111, 121, 211, 221}
T2 {111, 211, 222, 323}
T3 {112, 122, 221, 411}
T4 {111, 121}
T5 {111, 122, 211, 221, 413}
Mining Multi-Level Associations
– If adopting the same min_support across multi-levels then toss t if any of t’s
ancestors is infrequent.
– If adopting reduced min_support at lower levels then examine only those
descendents whose ancestor’s support is frequent/non-negligible.
Correlation in detail.
• Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
• A constrained association query (CAQ) is in the form of {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including a frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A. e.g. S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}. e.g. S.Price < 100
• v θ S, θ is ∈ or ∉. e.g. snacks ∉ S.Type
• V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
– e.g. {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg},
and θ ∈ {=, ≠, <, ≤, >, ≥}.
• e.g. count(S1.Type) = 1, avg(S2.Price) ≥ 100
1. Anti-Monotone Constraint
• A constraint C is anti-monotone iff, whenever an itemset S violates C, so does any superset
of S (the frequency constraint itself is anti-monotone)
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set, if it can be expressed as σp(I) for some selection
predicate p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set, if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t.
SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SATCs(I) is a succinct power set
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint
implies that each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies
that each pattern of which S is a suffix w.r.t. R also satisfies C
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based
on A1, i.e., it contains a subset belonging to A1
• Example:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint
alone is not affected by the iterative support counting.
Classification and Prediction
• Classification: predicts categorical class labels; constructs a model from a training set and the
values (class labels) of a classifying attribute, and uses the model to classify new data
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
Classification—A Two-Step Process
• Model construction: describe a set of predetermined classes from the training set; the model
is represented as classification rules, decision trees, or mathematical formulae
• Model usage: estimate the accuracy of the model on a test set, then use it to classify future or
unknown objects
• Example:
– Weather problem: build a decision tree to guide the decision about whether or not
to play tennis.
– Dataset
(weather.nominal.arff)
• Validation:
– Using the training set as a test set gives an over-optimistic estimate of classification accuracy.
– Expected accuracy on a separate test set will generally be lower.
– 10-fold cross validation is more robust than using the training set as a test set.
• Divide the data into 10 sets with about the same proportion of class label values
as in the original set.
• Run the classification 10 times, each time testing on one of the 10 sets and
training on the remaining 9/10 of the data.
• Average the accuracy over the 10 runs.
– Ratio validation: 67% training set / 33% test set.
– Best: having a separate training set and test set.
• Results:
– Classification accuracy (correctly classified instances).
– Errors (absolute mean, root squared mean, …)
– Kappa statistic (measures agreement between predicted and observed classification
on a -100% to 100% scale; it is the proportion of agreement remaining after chance
agreement has been excluded, so 100% means complete agreement and 0% means
the agreement is no better than chance)
• Results:
– TP (True Positive) rate per class label
– FP (False Positive) rate
– Precision = TP / (TP + FP) * 100%
– Recall = TP rate = TP / (TP + FN) * 100%
– F-measure = (2 * precision * recall) / (precision + recall)
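These metrics follow directly from the confusion-matrix counts; the counts below are made up for illustration.

# Hypothetical confusion-matrix counts for one class label
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)
recall = TP / (TP + FN)          # also the TP rate
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(round(precision, 3), round(recall, 3), round(f_measure, 3), round(accuracy, 3))
# 0.8 0.889 0.842 0.85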
• ID3 characteristics:
– Requires nominal values
– Improved into C4.5
• Dealing with numeric attributes
• Dealing with missing values
• Dealing with noisy data
• Generating rules from trees
Tree Mining in Clementine
• Methods:
– C5.0: target field must be categorical, predictor fields may be numeric or
categorical, provides multiple splits on the field that provides the maximum
information gain at each level
– QUEST: target field must be categorical, predictor fields may be numeric ranges
or categorical, statistical binary split
– C&RT: target and predictor fields may be numeric ranges or categorical,
statistical binary split based on regression
– CHAID: target and predictor fields may be numeric ranges or categorical,
statistical binary split based on chi-square
Bayesian Classification:
• Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be measured
Bayesian Theorem
• Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes
theorem
• Greatly reduces the computation cost, only count the class distribution.
Conditional probabilities counted from the play-tennis training data (P = play, N = don't play):

Outlook     P    N        Temperature   P    N
sunny       2/9  3/5      hot           2/9  2/5
overcast    4/9  0        mild          4/9  2/5
rain        3/9  2/5      cool          3/9  1/5

Humidity    P    N        Windy         P    N
high        3/9  4/5      true          3/9  3/5
normal      6/9  1/5      false         6/9  2/5

Bayesian classification
The classification problem may be formalized using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
• X=<x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny, windy=true,…)
• Idea: assign to sample X the class label C such that P(C|X) is maximal
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
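A naive Bayes sketch using the play-tennis counts above; the class priors P(P) = 9/14 and P(N) = 5/14 follow from the 9 play / 5 don't-play tuples behind those counts. Illustrative only.

# P(attribute value | class) taken from the counts table above
cond = {
    "P": {"outlook=sunny": 2/9, "temperature=cool": 3/9,
          "humidity=high": 3/9, "windy=true": 3/9},
    "N": {"outlook=sunny": 3/5, "temperature=cool": 1/5,
          "humidity=high": 4/5, "windy=true": 3/5},
}
prior = {"P": 9/14, "N": 5/14}

X = ["outlook=sunny", "temperature=cool", "humidity=high", "windy=true"]

# P(C|X) is proportional to P(X|C) * P(C); with the naive independence assumption,
# P(X|C) is the product of the individual conditional probabilities
score = {}
for c in ("P", "N"):
    p = prior[c]
    for attr in X:
        p *= cond[c][attr]
    score[c] = p

print(score)                      # N gets the higher score for this tuple
print(max(score, key=score.get))  # -> 'N' (don't play)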
A CPT (conditional probability table) shows the conditional probability for each possible combination of its parents.
Association-Based Classification
Discarding one or more subtrees and replacing them with leaves simplify a decision tree, and
that is the main task in decision-tree pruning. In replacing the subtree with a leaf, the algorithm
expects to lower the predicted error rate and increase the quality of a classification model. But
computation of error rate is not simple. An error rate based only on a training data set does not
provide a suitable estimate. One possibility to estimate the predicted error rate is to use a new,
additional set of test samples if they are available, or to use the cross-validation techniques. This
technique divides initially available samples into equal sized blocks and, for each block, the tree
is constructed from all samples except this block and tested with a given block of samples. With
the available training and testing samples, the basic idea of decision tree-pruning is to remove
parts of the tree (subtrees) that do not contribute to the classification accuracy of unseen testing
samples, producing a less complex and thus more comprehensible tree. There are two ways in
which the recursive-partitioning method can be modified:
1. Deciding not to divide a set of samples any further under some conditions. The stopping
criterion is usually based on some statistical tests, such as the χ2 test: If there are no
significant differences in classification accuracy before and after division, then represent
a current node as a leaf. The decision is made in advance, before splitting, and therefore
this approach is called prepruning.
2. Removing retrospectively some of the tree structure using selected accuracy criteria. The
decision in this process of postpruning is made after the tree has been built.
C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted
error rate. This method is called pessimistic pruning. For every node in a tree, the estimation of
the upper confidence limit Ucf is computed using the statistical tables for the binomial distribution
(given in most textbooks on statistics). Parameter Ucf is a function of ∣Ti∣ and E for a given node.
C4.5 uses the default confidence level of 25%, and compares U25%(∣Ti∣, E) for a given node Ti
with a weighted confidence of its leaves. Weights are the total number of cases for every leaf. If
the predicted error for a root node in a subtree is less than weighted sum of U25% for the leaves
(predicted error for the subtree), then a subtree will be replaced with its root node, which
becomes a new leaf in a pruned tree.
Let us illustrate this procedure with one simple example. A subtree of a decision tree is
given in Figure, where the root node is the test x 1 on three possible values {1, 2, 3} of the
attribute A. The children of the root node are leaves denoted with corresponding classes and
(∣Ti∣/E) parameters. The question is to estimate the possibility of pruning the subtree and
replacing it with its root node as a new, generalized leaf node.
To analyze the possibility of replacing the subtree with a leaf node it is necessary to
compute a predicted error PE for the initial tree and for a replaced node. Using default
confidence of 25%, the upper confidence limits for all nodes are collected from statistical tables:
U25%(6, 0) = 0.206, U25%(9, 0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these
values, the predicted errors for the initial tree and the replaced node are
PEtree = 6 · 0.206 + 9 · 0.143 + 1 · 0.750 = 3.273
PEnode = 16 · 0.157 = 2.512
Since the existing subtree has a higher value of predicted error than the replaced node, it
is recommended that the decision tree be pruned and the subtree replaced with the new leaf node.
Rule Based Classification
Classification by Backpropagation
◼ Backpropagation: A neural network learning algorithm
◼ Started by psychologists and neurobiologists to develop and test computational analogues
of neurons
◼ A neural network: A set of connected input/output units where each connection has a
weight associated with it
◼ During the learning phase, the network learns by adjusting the weights so as to be able
to predict the correct class label of the input tuples
◼ Also referred to as connectionist learning due to the connections between units
Neural Network as a Classifier
◼ Weakness
◼ Long training time
◼ Require a number of parameters typically best determined empirically, e.g., the
network topology or “structure”
◼ Poor interpretability: difficult to interpret the symbolic meaning behind the
learned weights and of “hidden units” in the network
◼ Strength
◼ High tolerance to noisy data
◼ Ability to classify untrained patterns
◼ Well-suited for continuous-valued inputs and outputs
◼ Successful on a wide array of real-world data
◼ Algorithms are inherently parallel
◼ Techniques have recently been developed for the extraction of rules from trained
neural networks
A Neuron (= a perceptron)
◼ The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping
◼ The inputs to the network correspond to the attributes measured for each training tuple
◼ Inputs are fed simultaneously into the units making up the input layer
◼ They are then weighted and fed simultaneously to a hidden layer
◼ The number of hidden layers is arbitrary, although usually only one
◼ The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction
◼ The network is feed-forward in that none of the weights cycles back to an input unit or to
an output unit of a previous layer
◼ From a statistical point of view, networks perform nonlinear regression: Given enough
hidden units and enough training samples, they can closely approximate any function
Backpropagation
◼ Iteratively process a set of training tuples & compare the network's prediction with the
actual known target value
◼ For each training tuple, the weights are modified to minimize the mean squared error
between the network's prediction and the actual target value
◼ Modifications are made in the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence “backpropagation”
◼ Steps
◼ Initialize weights (to small random #s) and biases in the network
◼ Propagate the inputs forward (by applying activation function)
◼ Backpropagate the error (by updating weights and biases)
◼ Terminating condition (when error is very small, etc.)
◼ Efficiency of backpropagation: Each epoch (one iteration through the training set) takes
O(|D| * w), with |D| tuples and w weights, but # of epochs can be exponential to n, the
number of inputs, in the worst case
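A minimal numpy sketch of the steps above (initialize, feed forward, backpropagate, repeat) for one hidden layer of sigmoid units; the XOR data, topology, and learning rate are arbitrary illustrative choices, not from the notes.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # target (XOR)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Initialize weights and biases to small random numbers
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # Propagate the inputs forward
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backpropagate the error (squared-error gradient; sigmoid derivative is out*(1-out))
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Update weights and biases in the "backwards" direction
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # should move toward [0, 1, 1, 0] as training proceeds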
◼ Rule extraction from networks: network pruning
◼ Simplify the network structure by removing weighted links that have the least
effect on the trained network
◼ Then perform link, unit, or activation value clustering
◼ The set of input and activation values are studied to derive rules describing the
relationship between the input and hidden unit layers
◼ Sensitivity analysis: assess the impact that a given input variable has on a network output.
The knowledge gained from this analysis can be represented in rules
SVM—General Philosophy
Associative Classification May Achieve High Accuracy and Efficiency (Cong et al.
SIGMOD05)
Other Classification Methods
The k-Nearest Neighbor Algorithm
Genetic Algorithms
Figure: A rough set approximation of the set of tuples of the class C using lower and upper
approximation sets of C. The rectangular regions represent equivalence classes
Cluster Analysis
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use
this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earth-quake studies: Observed earth quake epicenters should be clustered along continent
faults
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all
of the hidden patterns.
Interval-valued variables
• Standardize data
– Calculate the mean absolute deviation:
s_f = (1/n)(|x1f − mf| + |x2f − mf| + ... + |xnf − mf|)
where mf = (1/n)(x1f + x2f + ... + xnf)
– Calculate the standardized measurement (z-score): zif = (xif − mf) / s_f
• Distances are normally used to measure the similarity or dissimilarity between two data
objects
• Some popular ones include the Minkowski distance:
d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + ... + |xip − xjp|^q)^(1/q)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q
is a positive integer
• If q = 1, d is the Manhattan distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
• If q = 2, d is the Euclidean distance:
d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)
– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• Also one can use weighted distance, parametric Pearson product moment correlation, or
other dissimilarity measures.
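The distance measures above translate directly into code; the two sample points are made up for illustration.

def minkowski(x, y, q):
    # d(i, j) = (sum of |x_k - y_k|^q)^(1/q)
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

manhattan = lambda x, y: minkowski(x, y, 1)
euclidean = lambda x, y: minkowski(x, y, 2)

i, j = (1, 7, 3), (4, 3, 2)
print(manhattan(i, j))   # 3 + 4 + 1 = 8
print(euclidean(i, j))   # sqrt(9 + 16 + 1), about 5.10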
Binary Variables
A contingency table for binary data (object i vs. object j):

                     Object j
                     1        0        sum
Object i    1        a        b        a + b
            0        c        d        c + d
            sum      a + c    b + d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d(i, j) = (b + c) / (a + b + c)
Dissimilarity between Binary Variables
• Example
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow,
blue, green
• Method 1: Simple matching
– m: # of matches, p: total # of variables
d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
– creating a new binary variable for each of the M nominal states
Ordinal Variables
• Order is important; an ordinal variable can be treated like an interval-scaled one:
– replace xif by its rank rif ∈ {1, …, Mf}
– map the range of each variable onto [0, 1] by replacing the rank with
zif = (rif − 1) / (Mf − 1)
– compute the dissimilarity using the methods for interval-scaled variables, treating zif
as interval-scaled
Ratio-Scaled Variables
• A positive measurement on a nonlinear scale; commonly handled by applying a logarithmic
transformation and then treating the result as interval-scaled
Variables of Mixed Types
• A database may contain several variable types; they can be combined into a single
dissimilarity measure:
d(i, j) = ( Σf=1..p δij(f) dij(f) ) / ( Σf=1..p δij(f) )
– f is binary or nominal:
dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled:
• compute the ranks rif and zif = (rif − 1) / (Mf − 1)
• and treat zif as interval-scaled
The K-Means Clustering Method
• Example: scatter-plot figures (on a 10 × 10 plane) showing points being reassigned to the
nearest cluster and the cluster means being updated at each iteration.
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be found using
techniques such as: deterministic annealing and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
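A plain-Python k-means sketch matching the strengths and weaknesses listed above (k must be chosen in advance and the result depends on the initial means, so it may stop at a local optimum); the sample points are made up for illustration.

import random

def kmeans(points, k, iters=100):
    means = random.sample(points, k)                 # initial cluster means
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins the cluster of its nearest mean
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])))
            clusters[idx].append(p)
        # Update step: recompute each mean from its members
        new_means = [tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else means[c]
                     for c, cl in enumerate(clusters)]
        if new_means == means:                       # converged (possibly a local optimum)
            break
        means = new_means
    return means, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
means, clusters = kmeans(pts, 2)
print(means)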
• Focusing techniques and spatial access structures may further improve its performance
(Ester et al.’95)
Hierarchical Clustering
Use distance matrix as clustering criteria. This method does not require the number of clusters k
as an input, but needs a termination condition
Decompose data objects into several levels of nested partitioning (tree of clusters), called a
dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then
each connected component forms a cluster.
Rock: Algorithm
{1,2,3} 3 {1,2,4}
• Algorithm
– Draw random sample
– Cluster with links
– Label data in disk
CHAMELEON
Algorithms of hierarchical cluster analysis are divided into two categories: divisible (divisive)
algorithms and agglomerative algorithms. A divisible algorithm starts from the entire set of
samples X and divides it into a partition of subsets, then divides each subset into smaller sets,
and so on. Thus, a divisible algorithm generates a sequence of partitions that is ordered from a
coarser one to a finer one. An agglomerative algorithm first regards each object as an initial
cluster. The clusters are merged into a coarser partition, and the merging process proceeds until
the trivial partition is obtained: all objects are in one large cluster. This process of clustering is a
bottom-up process, where the partitions go from a finer one to a coarser one.
The basic steps of the agglomerative clustering algorithm are the same for most variants. These steps are
1. Place each sample in its own cluster. Construct the list of inter-cluster distances for all
distinct unordered pairs of samples, and sort this list in ascending order.
2. Step through the sorted list of distances, forming for each distinct threshold value dk a
graph of the samples where pairs of samples closer than dk are connected into a new cluster
by a graph edge. If all the samples are members of a connected graph, stop. Otherwise,
repeat this step.
3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at the
desired dissimilarity level forming a partition (clusters) identified by simple connected
components in the corresponding subgraph.
Let us consider five points {x1, x2, x3, x4, x5} with the following coordinates as a two-
dimensional sample for clustering:
The distances between these points using the Euclidian measure are
d(x1 , x2 ) =2, d(x1, x3) = 2.5, d(x1, x4) = 5.39, d(x1, x5) = 5
The distances between points as clusters in the first iteration are the same for both single-
link and complete-link clustering. Further computation for these two algorithms is different.
Using agglomerative single-link clustering, the following steps are performed to create a cluster
and to represent the cluster structure as a dendrogram.
There are two main types of clustering techniques, those that create a hierarchy of
clusters and those that do not. The hierarchical clustering techniques create a hierarchy of
clusters from small to big. The main reason for this is that, as was already stated, clustering is
an unsupervised learning technique, and as such, there is no absolutely correct answer. For this
reason and depending on the particular application of the clustering, fewer or greater numbers of
clusters may be desired. With a hierarchy of clusters defined it is possible to choose the number
of clusters that are desired. At the extreme it is possible to have as many clusters as there are
records in the database. In this case the records within the cluster are optimally similar to each
other (since there is only one) and certainly different from the other clusters. But of course such
a clustering technique misses the point in the sense that the idea of clustering is to find useful
patterns in the database that summarize it and make it easier to understand. Any clustering
algorithm that ends up with as many clusters as there are records has not helped the user
understand the data any better. Thus one of the main points about clustering is that there be
many fewer clusters than there are original records. Exactly how many clusters should be
formed is a matter of interpretation. The advantage of hierarchical clustering methods is that
they allow the end user to choose from either many clusters or only a few.
The hierarchy of clusters is usually viewed as a tree where the smallest clusters merge
together to create the next highest level of clusters and those at that level merge together to
create the next highest level of clusters. Figure 1.5 below shows how several clusters might form
a hierarchy. When a hierarchy of clusters like this is created the user can determine what the
right number of clusters is that adequately summarizes the data while still providing useful
information (at the other extreme a single cluster containing all the records is a great
summarization but does not contain enough specific information to be useful).
This hierarchy of clusters is created through the algorithm that builds the clusters. There are
two main types of hierarchical clustering algorithms: agglomerative algorithms, which start with
many small clusters and merge them, and divisive algorithms, which start with one large cluster
and repeatedly split it.
Of the two the agglomerative techniques are the most commonly used for clustering and have
more algorithms developed for them. We’ll talk about these in more detail in the next section.
The non-hierarchical techniques in general are faster to create from the historical database but
require that the user make some decision about the number of clusters desired or the minimum
“nearness” required for two records to be within the same cluster. These non-hierarchical
techniques are often run multiple times, starting off with some arbitrary or even random
clustering and then iteratively improving the clustering by shuffling some records around.
Alternatively, these techniques sometimes create clusters in only one pass through the database,
adding records to existing clusters when a good candidate cluster exists and creating new
clusters when none does. Because the clusters that are formed can depend on the initial choice of
starting clusters, or even on how many clusters are requested, these techniques can be less
repeatable than the hierarchical techniques and can sometimes create either too many or too few
clusters, since the number of clusters is predetermined by the user rather than determined solely
by the patterns inherent in the database.
Figure 1.5 Diagram showing a hierarchy of clusters. Clusters at the lowest level are merged
together to form larger clusters at the next level of the hierarchy.
Non-Hierarchical Clustering
There are two main non-hierarchical clustering techniques. Both of them are very fast to
compute on the database but have some drawbacks. The first are the single pass methods. They
derive their name from the fact that the database must only be passed through once in order to
create the clusters (i.e. each record is only read from the database once). The other class of
techniques are called reallocation methods. They get their name from the movement or
“reallocation” of records from one cluster to another in order to create better clusters. The
reallocation techniques do use multiple passes through the database but are relatively fast in
comparison to the hierarchical techniques.
Hierarchical Clustering
Hierarchical clustering has the advantage over non-hierarchical techniques in that the
clusters are defined solely by the data (not by the users predetermining the number of clusters)
and that the number of clusters can be increased or decreased by simply moving up and down the
hierarchy.
The hierarchy is created by starting either at the top (one cluster that includes all records)
and subdividing (divisive clustering) or by starting at the bottom with as many clusters as there
are records and merging (agglomerative clustering). Usually the merging and subdividing are
done two clusters at a time.
The main distinction between the techniques is their ability to favor long, scraggly
clusters that are linked together record by record, or to favor the detection of the more classical,
compact or spherical cluster that was shown at the beginning of this section. It may seem strange
to want to form these long snaking chain-like clusters, but in some cases they are the patterns that
the user would like to have detected in the database. These are the times when the underlying
space looks quite different from the spherical clusters and the clusters that should be formed are
not based on the distance from the center of the cluster but instead based on the records being
“linked” together. Consider the example shown in Figure 1.6 or in Figure 1.7. In these cases
there are two clusters that are not very spherical in shape but could be detected by the single link
technique.
When looking at the layout of the data in Figure 1.6 there appear to be two relatively flat
clusters running parallel to each other along the income axis. Neither the complete link nor Ward's
method would, however, return these two clusters to the user. These techniques rely on creating
a “center” for each cluster and picking these centers so that the average distance of each record
from its center is minimized. Points that are very distant from these centers would necessarily
fall into a different cluster.
Figure 1.6 An example of elongated clusters which would not be recovered by the complete
link or Ward's methods but would be by the single-link method.
Density-Based Clustering Methods
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-neighbourhood of that point
• NEps(p): {q belongs to D | dist(p,q) <= Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q wrt.
Eps, MinPts if
– 1) p belongs to NEps(q)
– 2) core point condition:
|NEps (q)| >= MinPts
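The Eps-neighbourhood and the core-point condition can be sketched directly; this is only a fragment for illustration, since a full DBSCAN implementation would go on to expand clusters from the core points via density-reachability.

from math import dist   # Python 3.8+

def eps_neighbourhood(D, p, eps):
    # N_Eps(p) = {q in D | dist(p, q) <= Eps}
    return [q for q in D if dist(p, q) <= eps]

def is_core_point(D, p, eps, min_pts):
    # Core point condition: |N_Eps(p)| >= MinPts
    return len(eps_neighbourhood(D, p, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    # p is directly density-reachable from q wrt. Eps, MinPts
    return p in eps_neighbourhood(D, q, eps) and is_core_point(D, q, eps, min_pts)

D = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (25, 80)]
print(is_core_point(D, (1, 1), eps=2, min_pts=3))                       # True
print(directly_density_reachable(D, (2, 1), (1, 1), eps=2, min_pts=3))  # True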
DENCLUE: clustering based on density distribution functions
• Uses grid cells but only keeps information about grid cells that do actually contain data
points and manages these cells in a tree-based access structure.
• Influence function: describes the impact of a data point within its neighborhood.
• Overall density of the data space can be calculated as the sum of the influence function of
all data points.
• Clusters can be determined mathematically by identifying density attractors.
• Density attractors are local maxima of the overall density function.
Grid-Based Methods
Using multi-resolution grid data structure
• Several interesting methods
– STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz
(1997)
– WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach using wavelet method
– CLIQUE: Agrawal, et al. (SIGMOD’98)
WaveCluster (1998)
• Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach which applies wavelet transform to the feature
space
– A wavelet transform is a signal processing technique that decomposes a signal
into different frequency sub-bands.
• Both grid-based and density-based
• Input parameters:
– # of grid cells for each dimension
– the wavelet, and the # of applications of wavelet transform.
Outlier Analysis
What Is Outlier Discovery?
• What are outliers?
– The set of objects that are considerably dissimilar from the remainder of the data
– Example: Sports: Michael Jordan, Wayne Gretzky, ...
• Problem
– Find top n outlier points
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
Outlier Discovery: Statistical Approaches
• Assume a model of the underlying distribution that generates the data set (e.g., a normal
distribution)
• Use discordancy tests depending on
– data distribution
– distribution parameter (e.g., mean, variance)
– number of expected outliers
• Drawbacks
– most tests are for single attribute
– In many cases, data distribution may not be known
Outlier Discovery: Distance-Based Approach
• Introduced to counter the main limitations imposed by statistical methods
– We need multi-dimensional analysis without knowing data distribution.
• Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least
a fraction p of the objects in T lies at a distance greater than D from O
• Algorithms for mining distance-based outliers
– Index-based algorithm
– Nested-loop algorithm
– Cell-based algorithm
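A simple nested-loop sketch of DB(p, D)-outlier detection as defined above; illustrative only, since the index-based and cell-based algorithms scale much better on large data sets. The sample points are made up.

from math import dist

def db_outliers(T, p, D):
    # O is a DB(p, D)-outlier if at least a fraction p of the objects in T
    # lie at a distance greater than D from O
    outliers = []
    for i, o in enumerate(T):
        far = sum(1 for j, x in enumerate(T) if j != i and dist(o, x) > D)
        if far / (len(T) - 1) >= p:
            outliers.append(o)
    return outliers

T = [(1, 1), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (9, 9)]
print(db_outliers(T, p=0.9, D=3))   # -> [(9, 9)]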
Outlier Discovery: Deviation-Based Approach
• Identifies outliers by examining the main characteristics of objects in a group
• Objects that “deviate” from this description are considered outliers
• sequential exception technique
– simulates the way in which humans can distinguish unusual objects from among a
series of supposedly like objects
• OLAP data cube technique
– uses data cubes to identify regions of anomalies in large multidimensional data