Data Mining - Reference - 1
Based on this view, the architecture of a typical data mining system may have the
following major components.
Database, Data Warehouse, World Wide Web, or Other Information
Repository: This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories. Data cleaning and data integration
techniques may be performed on the data.
Database or Data Warehouse Server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining
request.
Knowledge Base: This is the domain knowledge that is used to guide the search
or evaluate the interestingness of resulting patterns. It is often stored in the
form of a set of rules. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction.
Data Mining Engine: This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as characterization,
association and correlation analysis, classification, prediction, cluster analysis,
outlier analysis, and evolution analysis.
Pattern Evaluation Module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns.
User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining
query or task. In addition, this component allows the user to browse database
and data warehouse schemas or data structures, evaluate mined patterns, and
visualize the patterns in different forms.
Figure: Architecture of Data Mining System
Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering
analyzes data objects without consulting a known class label. Clustering can be used to
generate such labels. The objects are clustered or grouped based on the principle of
maximizing the intra-class similarity and minimizing the interclass similarity. That is, clusters
of objects are formed so that objects within a cluster have high similarity in comparison
to one another, but are very dissimilar to objects in other clusters.
Figure: Three data clusters
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection,
the rare events can be more interesting than the more regularly occurring ones. Outliers
may be detected using statistical tests that assume a distribution or probability model
for the data, or using distance measures where objects that are a substantial distance
from any other cluster are considered outliers. For example, outlier analysis may
uncover fraudulent usage of credit cards by detecting purchases of extremely large
amounts for a given account number in comparison to regular charges incurred by the
same account.
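As a rough sketch of how such a check might be expressed in SQL, assume a hypothetical table charges(account_no, charge_date, amount); the three-standard-deviation cutoff is likewise an illustrative choice, not a rule from the text.
-- Flag charges that are unusually large compared with the same account's history.
SELECT c.account_no, c.charge_date, c.amount
FROM charges c
JOIN (
    SELECT account_no,
           AVG(amount)    AS mean_amount,
           STDDEV(amount) AS sd_amount   -- STDDEV/STDDEV_SAMP, depending on the DBMS
    FROM charges
    GROUP BY account_no
) s ON s.account_no = c.account_no
WHERE c.amount > s.mean_amount + 3 * s.sd_amount;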
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Distinct features of such an analysis include time-series
data analysis, sequence or periodicity pattern matching, and similarity-based data
analysis. For example, suppose you have major stock market (time-series) data from the
last several years available from the Nepal Stock Exchange and you would like to invest in
shares of high-tech industrial companies. A data mining study of stock exchange data
may identify stock evolution regularities for overall stocks and for the stocks of
particular companies. Such regularities may help predict future trends in stock market
prices, contributing to your decision making regarding stock investments.
Major Issues in Data Mining
Major issues in data mining are about mining methodology, user interaction,
performance, and diverse data types. These issues are introduced below:
Mining methodology and user interaction issues
Mining different kinds of knowledge in databases: Because different users can be
interested in different kinds of knowledge, data mining should cover a wide
spectrum of data analysis and knowledge discovery tasks, including data
characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis.
These tasks may use the same database in different ways and require the
development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction: Because it is difficult
to know exactly what can be discovered within a database, the data mining
process should be interactive. Interactive mining allows users to focus the search
for patterns, providing and refining data mining requests based on returned
results.
Incorporation of background knowledge: Domain knowledge related to databases,
such as integrity constraints and deduction rules, can help focus and speed up a
data mining process, or judge the interestingness of discovered patterns.
Data mining query languages and ad hoc data mining: High-level data mining query
languages need to be developed to allow users to describe ad hoc data mining
tasks by facilitating the specification of the relevant sets of data for analysis, the
domain knowledge, the kinds of knowledge to be mined, and the conditions and
constraints to be enforced on the discovered patterns.
Presentation and visualization of data mining results: Discovered knowledge should
be expressed in high-level languages, visual representations, or other expressive
forms so that the knowledge can be easily understood and directly usable by
humans.
Handling noisy or incomplete data: The data stored in a database may reflect noise,
exceptional cases, or incomplete data objects. These objects may confuse the data
mining process, causing the knowledge model constructed to overfit the data.
Thus data cleaning methods and data analysis methods that can handle noise are
required, as well as outlier mining methods for the discovery and analysis of
exceptional cases.
Pattern evaluation—the interestingness problem: A data mining system can uncover
thousands of patterns. Many of the patterns discovered may be uninteresting to
the given user, either because they represent common knowledge or lack
novelty. Thus the use of interestingness measures or user-specified constraints to
guide the discovery process and reduce the search space is another active area of
research.
Performance Issues
Efficiency and scalability of data mining algorithms: Running time of a data mining
algorithm must be predictable and acceptable in large databases. From a
database perspective on knowledge discovery, efficiency and scalability are key
issues in the implementation of data mining systems.
Parallel, distributed, and incremental mining algorithms: The huge size of many
databases, the wide distribution of data, and the computational complexity of
some data mining methods are factors motivating the development of parallel
and distributed data mining algorithms. Such algorithms divide the data into
partitions, which are processed in parallel. The results from the partitions are
then merged. Moreover, the high cost of some data mining processes promotes
the need for incremental data mining algorithms that incorporate database
updates without having to mine the entire data again “from scratch.” Such
algorithms perform knowledge modification incrementally to amend and
strengthen what was previously discovered.
Unit 2
Key Features of Data Warehouse
The key features of a data warehouse are discussed below:
Subject Oriented - A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations; rather it focuses on
modeling and analysis of data for decision making.
Integrated - A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from
the historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new
data is added to it. A data warehouse is kept separate from the operational
database and therefore frequent changes in operational database are not reflected
in the data warehouse.
Differences between OLTP and OLAP Systems
Data Contents: An OLTP system manages current, detailed data. An OLAP system, in
contrast, manages large amounts of historical data, provides facilities for summarization
and aggregation, and stores and manages information at different levels of granularity.
These features make the data easier to use in informed decision making.
Database Design: An OLTP system usually adopts an entity-relationship (ER)
data model and an application-oriented database design. An OLAP system
typically adopts either a star or snowflake model (discussed below)
and a subject oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise
or department, without referring to historical data or data in different
organizations. In contrast, an OLAP system often spans multiple versions of a
database schema, due to the evolutionary process of an organization. OLAP
systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume,
OLAP data are stored on multiple storage media.
Access Patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. However, accesses to OLAP systems are mostly read-only
operations (because most data warehouses store historical rather than up-to-date
information).
Why Separate Data Warehouse?
Databases store huge amounts of data. Now the major question is “why not perform on-
line analytical processing directly on such databases instead of spending additional time and
resources to construct a separate data warehouse?” A major reason for such a separation is
to help promote the high performance of both systems.
An operational database is designed and tuned from known tasks and
workloads, such as indexing and hashing using primary keys, searching for
particular records, and optimizing canned queries. On the other hand, data
warehouse queries are often complex. They involve the computation of large
groups of data at summarized levels, and may require the use of special data
organization, access, and implementation methods based on multidimensional
views. Processing OLAP queries in operational databases would substantially
degrade the performance of operational tasks.
Concurrency control and recovery mechanisms, such as locking and logging, are
required to ensure the consistency and robustness of transactions in database
systems. An OLAP query often needs read-only access of data records for
summarization and aggregation. Concurrency control and recovery mechanisms,
if applied for such OLAP operations may jeopardize the execution of concurrent
transactions and thus substantially reduce the throughput of an OLTP system.
Finally, the separation of operational databases from data warehouses is based
on the different structures, contents, and uses of the data in these two systems.
Decision support requires historical data, whereas operational databases do not
typically maintain historical data. In this context, the data in operational
databases, though abundant, is usually far from complete for decision making.
A multidimensional data model is typically organized around a central theme, such as sales.
This theme is represented by a fact table. Facts are numerical measures: the quantities by
which we want to analyze relationships between dimensions. Examples of facts for a sales
data warehouse include dollars_sold (sales amount in dollars), units_sold (number of units
sold), and amount_budgeted. The fact table contains the names of the facts, or measures,
as well as keys to each of the related dimension tables.
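A minimal relational sketch of such a fact table (the table and column names are illustrative; the corresponding dimension tables are assumed to exist separately):
-- One row per (time, item, location) combination; measures plus dimension keys.
CREATE TABLE sales_fact (
    time_key      INTEGER NOT NULL,      -- key into a time dimension table
    item_key      INTEGER NOT NULL,      -- key into an item dimension table
    location_key  INTEGER NOT NULL,      -- key into a location dimension table
    dollars_sold  DECIMAL(12,2),         -- measure: sales amount in dollars
    units_sold    INTEGER,               -- measure: number of units sold
    PRIMARY KEY (time_key, item_key, location_key)
);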
Figure: Sales data for an organization according to the dimensions time, item, and location. The
measure displayed is dollars_sold.
Figure: A 3-D data cube representation of the data in the above table, according to the dimensions
time, item, and location. The measure displayed is dollars_sold (in thousands).
Suppose that we would now like to view our sales data with an additional fourth
dimension, such as supplier. Viewing things in 4-D becomes tricky. However, we can
think of a 4-D cube as being a series of 3-D cubes, as shown in Figure below.
Figure: A 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars_sold (in thousands).
If we continue in this way, we may display any n-dimensional data as a series of (n-1)
dimensional cubes. The data cube is a metaphor for multidimensional data storage. The
actual physical storage of such data may differ from its logical representation. The
important thing to remember is that data cubes are n-dimensional and do not confine
data to 3-D.
Star Schema
It is a data warehouse schema that contains two types of tables: a fact table and dimension
tables. The fact table lies at the center, and the dimension tables are connected to the fact
table so that a star shape is formed.
Fact Tables: A fact table typically has two types of columns: foreign keys to
dimension tables and measures, i.e., columns that contain numeric facts. A fact table
can contain facts at a detailed or an aggregated level.
Dimension Tables: Dimension tables usually have a relatively small number of
records compared to fact tables, but each record may have a very large number
of attributes to describe the fact data.
Each dimension in the star schema has only one dimension table and each table holds a
set of attributes. This constraint may cause data redundancy. The following diagram
shows the sales data of a company with respect to the four dimensions, namely time,
item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions. The
fact table also contains the measures, namely dollars_sold and units_sold.
Since the star schema contains de-normalized dimension tables, it leads to simpler queries
due to the smaller number of join operations, and it also leads to better system performance.
On the other hand, it is difficult to maintain the integrity of data in a star schema because of
the de-normalized tables. It is the most widely used data warehouse schema and is also
recommended by Oracle.
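To illustrate the point about joins, a typical query against such a star schema touches the fact table plus only the dimension tables it actually needs; the table and column names below are assumed for the sketch.
-- Total dollars sold per country and year: one join per dimension used.
SELECT l.country, t.year, SUM(f.dollars_sold) AS total_dollars_sold
FROM sales_fact f
JOIN time_dim     t ON t.time_key     = f.time_key
JOIN location_dim l ON l.location_key = f.location_key
GROUP BY l.country, t.year;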
Snowflake Schema
The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake. For example, the item
dimension table of the star schema can be normalized and split into two dimension tables,
namely an item table and a supplier table.
Due to normalization, the dimension tables are easier to maintain and storage space is saved. However, this
saving of space is negligible in comparison to the typical magnitude of the fact table.
Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted. Hence, although the snowflake schema reduces
redundancy, it is not as popular as the star schema in data warehouse design.
Fact Constellation Schema
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation. A fact constellation schema allows dimension tables to be
shared between fact tables. For example, the following schema specifies two fact tables,
sales and shipping. The sales table definition is identical to that of the star schema. The
shipping table has five dimensions, or keys: item key, time key, shipper key, from location,
and to location, and two measures: dollars cost and units shipped.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The
two primitives, cube definition and dimension definition, can be used for defining the
data warehouses and data marts.
Star Schema Definition
The star schema that we have discussed can be defined using Data Mining Query
Language (DMQL) as follows.
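A sketch of such a definition, built from the define cube and define dimension primitives used elsewhere in these notes, is shown below; the attribute lists inside each dimension are illustrative.
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, month, quarter, year)
define dimension item as (item_key, item_name, brand, type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)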
Dimensions can also be shared across cubes; in the fact constellation example above, the
shipping cube reuses the location dimension of the sales cube:
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Meta Data
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as a metadata for
the contents in the book. In other words, we can say that metadata is the summarized
data that leads us to detailed data. In terms of data warehouse, we can define metadata
as follows.
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Data Marts
A data mart is a subject-oriented archive that stores data and uses the retrieved set of
information to assist and support the requirements involved within a particular
business function or department. Data marts exist within a single organizational data
warehouse repository. Data marts improve end-user response time by allowing users to
have access to the specific type of data they need to view most often.
A data mart is basically a condensed and more focused version of a data warehouse that
reflects the regulations and process specifications of each business unit within an
organization. Each data mart is dedicated to a specific business function or region. This
subset of data may span across many or all of an enterprise’s functional subject areas. It
is common for multiple data marts to be used in order to serve the needs of each
individual business unit (different data marts can be used to obtain specific information
for various enterprise departments, such as accounting, marketing, sales, etc.).
Unit 3
Data Warehouse Architecture
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers
of the data warehouse architecture.
Bottom Tier - The bottom tier of the architecture is the data warehouse database
server. It is typically a relational database system. We use back-end tools and utilities
to feed data into the bottom tier. These back-end tools and utilities perform the
extract, clean, load, and refresh functions.
Middle Tier - In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on
multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements
the multidimensional data and operations.
Top-Tier - This tier is the front-end client layer. This layer holds the query tools
and reporting tools, analysis tools and data mining tools.
From the perspective of architecture, there are three data warehouse models:
Virtual Warehouse
Data Mart
Enterprise Warehouse
Virtual Warehouse
A view over operational databases is known as a virtual warehouse. It is easy to build a
virtual warehouse, but building one requires excess capacity on the operational database
servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization. In other words, we can claim that data marts contain
data specific to a particular group. For example, the marketing data mart may contain
data related to items, customers, and sales. Data marts are confined to subjects.
Enterprise Warehouse
An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It provides enterprise-wide data integration. The data is integrated
from operational systems and external information providers. This information can
vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It
consists of third-party system software, C programs, and shell scripts. The size and
complexity of warehouse managers varies between specific solutions. A warehouse
manager includes the following:
The controlling process
Stored procedures or C with SQL
Backup/Recovery tool
SQL Scripts
Operations Performed by Warehouse Manager
A warehouse manager analyzes the data to perform consistency and referential
integrity checks.
Creates indexes, business views, partition views against the base data.
Generates new aggregations and updates existing aggregations (a small example
follows this list). Generates normalizations.
Transforms and merges the source data into the published data warehouse.
Backs up the data in the data warehouse.
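As a small example of the aggregation step above, a warehouse manager job might rebuild a summary table along these lines (table names are assumed; CREATE TABLE ... AS is not available in every DBMS, some use SELECT ... INTO instead):
-- Monthly sales summary derived from the detailed fact table.
CREATE TABLE sales_monthly_summary AS
SELECT t.year, t.month, f.item_key,
       SUM(f.dollars_sold) AS dollars_sold,
       SUM(f.units_sold)   AS units_sold
FROM sales_fact f
JOIN time_dim t ON t.time_key = f.time_key
GROUP BY t.year, t.month, f.item_key;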
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data. Here is the list of OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up: Roll-up performs aggregation on a data cube in any of the following ways: By
climbing up a concept hierarchy for a dimension or by dimension reduction. The
following diagram illustrates how roll-up works.
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country". On rolling up,
the data is aggregated by ascending the location hierarchy from the level of city to the
level of country, so the data is grouped by country rather than by city. Alternatively, when
roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year." On drilling down,
the time dimension is descended from the level of quarter to the level of month. When
drill-down is performed, one or more dimensions from the data cube are added. It
navigates the data from less detailed data to highly detailed data.
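In relational terms, roll-up and drill-down simply change the level of the concept hierarchy used in the grouping; a sketch with assumed star-schema table names:
-- More detailed level: sales by city and quarter.
SELECT l.city, t.quarter, SUM(f.dollars_sold) AS dollars_sold
FROM sales_fact f
JOIN location_dim l ON l.location_key = f.location_key
JOIN time_dim     t ON t.time_key     = f.time_key
GROUP BY l.city, t.quarter;

-- Rolled up on location: climb the hierarchy from city to country.
SELECT l.country, t.quarter, SUM(f.dollars_sold) AS dollars_sold
FROM sales_fact f
JOIN location_dim l ON l.location_key = f.location_key
JOIN time_dim     t ON t.time_key     = f.time_key
GROUP BY l.country, t.quarter;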
Slice: The slice operation selects one particular dimension from a given cube and
provides a new sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1". It will
form a new sub-cube by selecting one or more dimensions.
Dice: Dice selects two or more dimensions from a given cube and provides a new sub-
cube. Consider the following diagram that shows the dice operation.
The dice operation on the cube, based on the following selection criteria, involves three
dimensions; a relational query sketch follows the criteria.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
Pivot: The pivot operation is also known as rotation. It rotates the data axes in view in
order to provide an alternative presentation of data. Consider the following diagram
that shows the pivot operation.
OLAP vs OLTP
OLAP: Involves historical processing of information. OLTP: Involves day-to-day processing.
OLAP: Used by knowledge workers such as executives, managers, and analysts. OLTP: Used by clerks, DBAs, or database professionals.
OLAP: Useful in analyzing the business. OLTP: Useful in running the business.
OLAP: Based on the Star, Snowflake, and Fact Constellation schemas. OLTP: Based on the Entity-Relationship model.
OLAP: Provides summarized and consolidated data. OLTP: Provides primitive and highly detailed data.
OLAP: Highly flexible. OLTP: Provides high performance.
Types of OLAP Servers
We have four types of OLAP servers:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers
Relational OLAP:- Relational OLAP servers are placed between relational back-end
server and client front-end tools. To store and manage the warehouse data, the
relational OLAP uses relational or extended-relational DBMS. ROLAP includes the
following:
Implementation of aggregation navigation logic
Optimization for each DBMS back-end
Additional tools and services
Advantages
ROLAP servers can be easily used with existing RDBMS.
ROLAP tools do not use pre-calculated data cubes.
Disadvantages
Poor query performance.
Some limitations of scalability depending on the technology architecture that is
utilized.
Multidimensional OLAP:- Multidimensional OLAP (MOLAP) servers support multidimensional
views of data through array-based multidimensional storage engines.
Advantages
MOLAP allows fastest indexing to the pre-computed summarized data.
Easier to use, therefore MOLAP is suitable for inexperienced users.
Disadvantages
MOLAP is not capable of containing detailed data.
The storage utilization may be low if the data set is sparse.
MOLAP maintains a separate database for data cubes, whereas ROLAP may not require
storage space other than that available in the data warehouse.
Unit 4
Computation of Data Cubes and OLAP Queries
Data warehouses contain huge volumes of data. OLAP servers demand that decision
support queries be answered in the order of seconds. Therefore, it is crucial for data
warehouse systems to support highly efficient cube computation techniques, access
methods, and query processing techniques. At the core of multidimensional data
analysis is the efficient computation of aggregations across many sets of dimensions. In
SQL terms, these aggregations are referred to as group-by’s. Each group-by can be
represented by a cuboid, where the set of group-by's forms a lattice of cuboids defining a
data cube.
Taking the three attributes, city, item, and year, as the dimensions for the data cube, and
sales_in_dollars as the measure, the total number of cuboids, or group-by's, that can be
computed for this data cube is 2^3 = 8. The possible group-by's are the following: {(city,
item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}, where () means that
the group-by is empty (i.e., the dimensions are not grouped). These group-by’s form a
lattice of cuboids for the data cube, as shown in Figure below. The base cuboid contains
all three dimensions, city, item, and year. It can return the total sales for any combination
of the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case where the
group-by is empty. It contains the total sum of all sales. The base cuboid is the least
generalized (most specific) of the cuboids. The apex cuboid is the most generalized
(least specific) of the cuboids, and is often denoted as all. If we start at the apex cuboid
and explore downward in the lattice, this is equivalent to drilling down within the data
cube. If we start at the base cuboid and explore upward, this is akin to rolling up.
An SQL query containing no group-by, such as “compute the sum of total sales,” is a
zero-dimensional operation. An SQL query containing one group-by, such as “compute the
sum of sales, group by city,” is a one-dimensional operation. A cube operator on n
dimensions is equivalent to a collection of group by statements, one for each subset of
the n dimensions. Therefore, the cube operator is the n-dimensional generalization of
the group by operator. Based on the syntax of DMQL, the data cube in this example
could be defined as
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid.
A statement such as
compute cube sales_cube
would explicitly instruct the system to compute the sales aggregate cuboids for all of
the eight subsets of the set {city, item, year}, including the empty subset.
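In SQL systems that support the CUBE grouping extension, the same collection of group-by's can be requested in one statement; the sales table and column names here are assumed:
-- Computes all 2^3 = 8 group-by's (cuboids) over city, item, and year.
SELECT city, item, year, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY CUBE (city, item, year);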
On-line analytical processing may need to access different cuboids for different queries.
Therefore, it may seem like a good idea to compute all or at least some of the cuboids in
a data cube in advance. Pre-computation leads to fast response time and avoids some
redundant computation. A major challenge related to this pre-computation, however, is
that the required storage space may explode if all of the cuboids in a data cube are pre-
computed, especially when the cube has many dimensions. The storage requirements
are even more excessive when many of the dimensions have associated concept
hierarchies, each with multiple levels. This problem is referred to as the curse of
dimensionality.
If there were no hierarchies associated with each dimension, then the total number of
cuboids for an n-dimensional data cube, as we have seen above, is 2^n. However, in
practice, many dimensions do have hierarchies. For example, the dimension time is
usually not explored at only one conceptual level, such as year, but rather at multiple
conceptual levels, such as in the hierarchy “day < month < quarter < year”. For an n-
dimensional data cube, where dimension i has Li associated levels, the total number of
cuboids that can be generated is:
Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)
Example
If the cube has 10 dimensions and each dimension has 4 levels, what will be the number of
cuboids generated?
Solution
Here n = 10 and Li = 4 for i = 1, 2, ..., 10.
Thus
Total number of cuboids = (4 + 1)^10 = 5 × 5 × ... × 5 = 5^10 = 9,765,625 ≈ 9.8 × 10^6
From above example, it is unrealistic to pre-compute and materialize all of the cuboids
that can possibly be generated for a data cube (or from a base cuboid). If there are many
cuboids, and these cuboids are large in size, a more reasonable option is partial
materialization, that is, to materialize only some of the possible cuboids that can be
generated.
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not pre-compute any of the non-base cuboids. This leads to
computing expensive multidimensional aggregates on the fly, which can be extremely slow.
2. Full materialization: Pre-compute all of the cuboids. The resulting lattice of
computed cuboids is referred to as the full cube. This choice typically requires
huge amounts of memory space in order to store all of the pre-computed
cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of
possible cuboids. Alternatively, we may compute a subset of the cube, which
contains only those cells that satisfy some user-specified criterion, such as where
the tuple count of each cell is above some threshold. We will use the term sub-
cube to refer to the latter case, where only some of the cells may be pre-computed
for various cuboids. Partial materialization represents an interesting trade-off
between storage space and response time.
The partial materialization of cuboids or sub-cubes should consider three factors: (1)
identify the subset of cuboids or sub-cubes to materialize; (2) exploit the materialized
cuboids or sub-cubes during query processing; and (3) efficiently update the
materialized cuboids or sub-cubes during load and refresh.
The selection of the subset of cuboids or sub-cubes to materialize should take into
account the queries in the workload, their frequencies, and their accessing costs. In
addition, it should consider workload characteristics, the cost for incremental updates,
and the total storage requirements. A popular approach is to materialize the set of
cuboids on which other frequently referenced cuboids are based. Alternatively, we can
compute an iceberg cube, which is a data cube that stores only those cube cells whose
aggregate value (e.g., count) is above some minimum support threshold. Another
common strategy is to materialize a shell cube. This involves pre-computing the cuboids
for only a small number of dimensions (such as 3 to 5) of a data cube. Queries on
additional combinations of the dimensions can be computed on-the-fly. An iceberg cube
can be specified with an SQL query, as shown in the following example.
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
The compute cube statement specifies the pre-computation of the iceberg cube, sales_iceberg,
with the dimensions month, city, and customer_group, and the aggregate measure
count(). The input tuples are in the salesInfo relation. The cube by clause specifies that
aggregates (group-by’s) are to be formed for each of the possible subsets of the given
dimensions.
Refresh
Refreshing a warehouse consists of propagating updates on source data to
correspondingly update the base data and derived data stored in the warehouse. There
are two sets of issues to consider: when to refresh, and how to refresh. Usually, the
warehouse is refreshed periodically (e.g., daily or weekly). Only if some OLAP queries
need current data is it necessary to propagate every update. The refresh policy is set by
the warehouse administrator, depending on user needs, and may be different for
different sources.
Refresh techniques may also depend on the characteristics of the source and the
capabilities of the database servers. Extracting an entire source file or database is
usually too expensive, but may be the only choice for legacy data sources. Most
contemporary database systems provide replication servers that support incremental
techniques for propagating updates from a primary database to one or more replicas.
Such replication servers can be used to incrementally refresh a warehouse when the
sources change.
Integration Testing: It is performed to test whether the various components work well together after integration.
System Testing
In system testing, the whole data warehouse application is tested together.
The purpose of system testing is to check whether the entire system works
correctly together or not.
System testing is performed by the testing team.
Since the size of the whole data warehouse is very large, it is usually possible to
perform only minimal system testing before the test plan can be enacted.
Unit 5
Data Mining Definition and Task
There is a huge amount of data available in the Information Industry. This data is of no
use until it is converted into useful information. It is necessary to analyze this huge
amount of data and extract useful information from it. Data Mining is defined as
extracting information from huge sets of data. In other words, we can say that data mining is the
procedure of mining knowledge from data.
On the basis of the kind of patterns to be mined, there are two types of tasks that are
performed by Data Mining:
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here
is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Classification and Prediction
Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
Prediction is used to predict missing or unavailable numerical data values rather than
class labels.
unknown information (i.e., knowledge) from large collections of digitized data. KDD
consists of several steps, and Data Mining is one of them. Data Mining is the application of
a specific algorithm in order to extract patterns from data. Nonetheless, the terms KDD and
Data Mining are often used interchangeably. In summary, Data Mining is only the application
of a specific algorithm based on the overall goal of the KDD process.
in various topics available. The challenge is how to keep those books in a way that
readers can take several books on a particular topic without hassle. By using a clustering
technique, we can keep books that have some kind of similarity in one cluster or one
shelf and label it with a meaningful name. If readers want to grab books on that topic,
they would only have to go to that shelf instead of searching the entire library.
Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship
between independent variables and the relationship between dependent and
independent variables. For instance, the prediction technique can be used in sales to
predict future profit: if we consider sales an independent variable, profit could be a
dependent variable. Then, based on the historical sales and profit data, we can
draw a fitted regression curve that is used for profit prediction.
Sequential Patterns
Often used over longer-term data, sequential patterns are a useful method for
identifying trends, or regular occurrences of similar events. For example, with customer
data you can identify that customers buy a particular collection of products together at
different times of the year. In a shopping basket application, you can use this
information to automatically suggest that certain items be added to a basket based on
their frequency and past purchasing history.
Decision trees
Decision tree is one of the most used data mining techniques because its model is easy
to understand for users. In decision tree technique, the root of the decision tree is a
simple question or condition that has multiple answers. Each answer then leads to a set
of questions or conditions that help us determine the data so that we can make the final
decision based on it.
versions, although some specialize in one operating system only. In addition,
while some may concentrate on one database type, most will be able to handle
any data using online analytical processing or a similar technology.
Dashboards: Installed in computers to monitor information in a database,
dashboards reflect data changes and updates onscreen — often in the form of a
chart or table — enabling the user to see how the business is performing.
Historical data also can be referenced, enabling the user to see where things have
changed (e.g., increase in sales from the same period last year). This functionality
makes dashboards easy to use and particularly appealing to managers who wish
to have an overview of the company's performance.
Text-mining Tools: The third type of data mining tool sometimes is called a text-
mining tool because of its ability to mine data from different kinds of text —
from Microsoft Word and Acrobat PDF documents to simple text files, for
example. These tools scan content and convert the selected data into a format
that is compatible with the tool's database, thus providing users with an easy and
convenient way of accessing data without the need to open different
applications. Scanned content can be unstructured (i.e., information is scattered
almost randomly across the document, including e-mails, Internet pages, audio
and video data) or structured (i.e., the data's form and purpose is known, such as
content found in a database). Capturing these inputs can provide organizations
with a wealth of information that can be mined to discover trends, concepts, and
attitudes.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.
Unit 6
Integrating Data Mining with SQL Databases: OLE DB for Data Mining
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for
the DBMiner data mining system. The Data Mining Query Language is actually based
on the Structured Query Language (SQL). Data Mining Query Languages can be
designed to support ad hoc and interactive data mining. The DMQL can work with
databases and data warehouses as well. DMQL can be used to define data mining tasks.
The language adopts an SQL-like syntax, so that it can easily be integrated with the
relational query language SQL. The syntax of DMQL is defined in an extended BNF
grammar, where “[ ]” represents 0 or one occurrence, “{ }” represents 0 or more
occurrences, and words in sans serif font represent keywords.
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID and P.cust_ID = C.cust_ID
and C.country = “Sri Lanka”
group by P.date
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%. For
example, the following describes customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%
Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
For example, a user may define big spenders as customers who purchase items that cost
$100 or more on average, and budget spenders as customers who purchase items at
less than $100 on average. The mining of discriminant descriptions for customers
from each of these categories can be specified in the DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥$100
versus budgetSpenders where avg(I.price)< $100
analyze count
Association
The syntax for Association is−
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For Example
mine associations as buyingHabits
matching P(X:customer, W) ^ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z
are object variables. For example:
age(X, “20-30”) ^ income(X, “40-50K”) ⇒ buys(X, “Computer”)
This rule states that customers between 20 and 30 years of age, with an annual income of
between 40K and 50K, are likely to purchase a Computer.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Example:
mine classifications as classifyCustomerCreditRating
analyze credit_rating
For categorical attributes or dimensions, each value represents a class (such as low-risk,
medium-risk, high-risk). For numeric attributes, each class is defined by a range (such as
20-39, 40-59, 60-89 for age).
Prediction
The syntax for prediction is
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
level4: {60, ..., 89} < level1: senior
- Operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
This statement says that the hierarchy is generated by the default clustering algorithm with a
fan-out value of 5. The fan-out value determines the number of child nodes per level of the
tree while generating the concept hierarchy.
- Rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
You would like to know the percentage of customers having that characteristic. In
particular, you are only interested in purchases made in Canada, and paid with an
American Express credit card. You would like to view the resulting descriptions in the
form of a table.
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
Unit 7
The problem of association rule mining is defined as follows. Let I = {I1, I2, ..., In} be a set
of binary attributes called items. Let D = {T1, T2, ..., Tm} be a set of transactions called
the database. Each transaction in D has a unique transaction ID and contains a subset of the
items in I. A rule is defined as an implication of the form:
X ⇒ Y
where X, Y ⊆ I and X ∩ Y = ∅.
Every rule is composed of two different sets of items, also known as itemsets, X and Y,
where X is called the antecedent or left-hand-side (LHS) and Y the consequent or
right-hand-side (RHS).
To illustrate the concepts, we use a small example from the supermarket domain. The
table above shows a small database containing the items, where, in each entry, the value 1
means the presence of the item in the corresponding transaction, and the value 0 represents
the absence of the item in that transaction. An example rule for the supermarket could be
{butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.
For example, in the above data set, the association rule bread ⇒ milk has a confidence
of 2/3, since 66.66% of all transactions containing bread also contain milk.
Rules that satisfy both a minimum support threshold and a minimum confidence
threshold are called strong. By convention, we write support and confidence values so
as to occur between 0% and 100%, rather than 0 to 1.0.
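For reference, the two measures used above are normally defined as follows (standard definitions):
support(X ⇒ Y) = support_count(X ∪ Y) / |D|, the fraction of all transactions in D that contain both X and Y.
confidence(X ⇒ Y) = support_count(X ∪ Y) / support_count(X), the fraction of the transactions containing X that also contain Y.
Thus the confidence of 2/3 for bread ⇒ milk above simply says that support_count({bread, milk}) / support_count({bread}) = 2/3.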
Apriori Algorithm
It is a classic algorithm used in data mining for learning association rules. It is very
simple. Learning association rules basically means finding the items that are purchased
together more frequently than others. The name of the algorithm is based on the fact
that the algorithm uses prior knowledge of frequent item set properties.
Example
Solution
Step 1: Generating 1-itemset Frequent Pattern
The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying
minimum support. In the first iteration of the algorithm, each item is a member of the
set of candidate 1-itemsets, C1.
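The next pass of the worked example (generating the candidate 2-itemsets C2 and counting their support, i.e., the step between Step 1 and Step 3) can be sketched in SQL over a hypothetical basket(trans_id, item) table:
-- Candidate 2-itemsets with their support counts; a.item < b.item avoids
-- counting {I2, I1} separately from {I1, I2}. A full Apriori pass would first
-- restrict both sides to the frequent 1-itemsets in L1.
SELECT a.item AS item1, b.item AS item2, COUNT(*) AS support_count
FROM basket a
JOIN basket b ON a.trans_id = b.trans_id AND a.item < b.item
GROUP BY a.item, b.item
HAVING COUNT(*) >= 2;   -- assumed minimum support count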
Step 3: Generating 3-itemset Frequent Pattern
The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori
Property. In order to find C3, we compute L2 join L2. C3 = L2 join L2 = {{I1, I2, I3}, {I1, I2, I5},
{I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Now, the join step is complete and the prune
step will be used to reduce the size of C3. Prune step helps to avoid heavy computation
due to large Ck.
Based on the Apriori property that all subsets of a frequent item set must also be
frequent, we can determine that the latter four candidates cannot possibly be frequent. For
example, let's take {I1, I2, I3}. The 2-item subsets of it are {I1, I2}, {I1, I3} and {I2, I3}. Since
all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3. Let's
take another example, {I2, I3, I5}, which shows how the pruning is performed. The 2-item
subsets are {I2, I3}, {I2, I5} and {I3, I5}. BUT, {I3, I5} is not a member of L2 and hence it
is not frequent, violating the Apriori property. Thus we will have to remove {I2, I3, I5} from
operation for Pruning. Now, the transactions in D are scanned in order to determine L3,
consisting of those candidates 3-itemsets in C3 having minimum support.
These frequent itemsets will be used to generate strong association rules ( where strong
association rules satisfy both minimum support & minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets
Procedure:
For each frequent itemset l, generate all nonempty subsets of l. For every nonempty
subset s of l, output the rule “s ⇒ (l − s)” if support_count(l) / support_count(s) >=
min_conf, where min_conf is the minimum confidence threshold.
Back To Example
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3},
{I1,I2,I5}}.
Let’s take l = {I1, I2, I5}. Its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%. The resulting association rules are
shown below, each listed with its confidence.
R1: I1 ^ I2 => I5
Confidence = support_count({I1, I2, I5}) / support_count({I1, I2})
Hash-based technique
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck,
for k > 1. For example, when scanning each transaction in the database to generate the
frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the
2-itemsets for each transaction, hash them into the different buckets of a hash table
structure, and increase the corresponding bucket counts. A 2-itemset whose
corresponding bucket count in the hash table is below the support threshold cannot be
frequent and thus should be removed from the candidate set. Such a hash-based
technique may substantially reduce the number of the candidate k-itemsets examined.
Transaction Reduction
It reduces the number of transactions scanned in future iterations. A transaction that
does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
Therefore, such a transaction can be marked or removed from further consideration
because subsequent scans of the database for j-itemsets, where j > k, will not require it.
Partitioning
The set of transactions may be divided into a number of disjoint subsets. Then each
partition is searched for frequent itemsets. These frequent itemsets are called local
frequent itemsets. Any itemset that is potentially frequent with respect to D must occur as a
frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are
candidate itemsets with respect to D. The collection of frequent itemsets from all
partitions forms the global candidate itemsets with respect to D.
Sampling
A random sample S (usually large enough to fit in main memory) may be obtained
from the overall set of transactions D, and the sample is searched for frequent itemsets.
These frequent itemsets are called sample frequent itemsets. Because we are searching for
frequent itemsets in S rather than in D, it is possible that we will miss some of the global
frequent itemsets. To lessen this possibility, we use a lower support threshold than
minimum support to find the frequent itemsets local to S.
Unit 8
Classification and Prediction
There are two forms of data analysis that can be used to extract models describing
important classes or to predict future data trends. These two forms are as follows:
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification:
A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
A marketing manager at a company needs to predict whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction: Suppose
the marketing manager needs to predict how much a given customer will spend during
a sale at his company. In this example we are required to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a
model or a predictor will be constructed that predicts a continuous-valued-function or
ordered value. Regression analysis is a statistical methodology that is most often used
for numeric prediction.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes two
steps:
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and their
associated class labels.
Each tuple that constitutes the training set belongs to a predefined category or class.
These tuples can also be referred to as samples, objects, or data points.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the
data involves the following activities:
Data Cleaning − Data cleaning involves removing noise and treating missing
values. Noise is removed by applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with the most commonly
occurring value for that attribute.
Relevance Analysis − The database may also contain irrelevant attributes.
Correlation analysis is used to know whether any two given attributes are
related, so that irrelevant or redundant attributes can be excluded.
Data Transformation and Reduction − The data can be transformed by any of the
following methods:
o Normalization − Normalization involves scaling all values of a given
attribute so that they fall within a small specified range. It is used when
neural networks or methods involving distance measurements are used
in the learning step.
o Generalization − Data can also be transformed by generalizing it to
higher-level concepts. For this purpose, we can use concept hierarchies.
Classification by Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training
tuples. A decision tree is a flowchart-like tree structure in which each internal (non-leaf)
node denotes a test on an attribute, each branch represents an outcome of the test, each
leaf (terminal) node holds a class label, and the root node is the topmost node.
To classify a tuple, its attribute values are tested against the decision tree: a path is
traced from the root to a leaf node, which holds the class prediction for that tuple.
Example
RID  age    income  student  credit-rating  Class
1    youth  high    no       fair           ?
The attribute values of this tuple are tested against the decision tree, and the leaf
reached gives its predicted class.
Advantages of Decision Trees
They do not require any domain knowledge or parameter setting.
They can handle high-dimensional data.
Their representation is intuitive and easily understood by humans.
Learning and classification are simple and fast.
They generally have good accuracy.
Algorithm for Constructing Decision Trees
A decision tree is constructed using a greedy algorithm: the tree is built in a top-down,
recursive, divide-and-conquer manner.
1. At the start, all the training tuples are at the root.
2. The tuples are partitioned recursively based on selected attributes (chosen by an
attribute selection measure such as information gain).
3. If all samples at a given node belong to the same class,
label the node with that class.
4. If there are no remaining attributes for further partitioning,
majority voting is employed for classifying the leaf.
5. If there are no samples left,
label the leaf (for example with the majority class of its parent) and terminate
that branch.
6. Else
go to step 2.
Example
RID  age          student  credit-rating  Class: buys-computer
1    youth        yes      fair           yes
2    youth        yes      fair           yes
3    youth        yes      fair           no
4    youth        no       fair           no
5    middle-aged  no       excellent      yes
6    senior       yes      fair           no
7    senior       yes      excellent      yes
Solution
Figure: Steps 1–6 show the decision tree being built by recursively partitioning the
training tuples on the selected attributes. A programmatic sketch of the same procedure
follows.
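The following is a minimal ID3-style sketch of the greedy procedure above, using information gain as the attribute selection measure and the training tuples from the example; the attribute names and the nested-dictionary representation of the tree are illustrative choices, not taken from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    base = entropy([r[target] for r in rows])
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:           # step 3: all tuples belong to one class
        return labels[0]
    if not attributes:                  # step 4: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)   # greedy attribute selection
    tree = {attr: {}}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, remaining, target)  # step 6: recurse
    return tree

# Training tuples from the example table above.
data = [
    {"age": "youth", "student": "yes", "credit": "fair", "buys": "yes"},
    {"age": "youth", "student": "yes", "credit": "fair", "buys": "yes"},
    {"age": "youth", "student": "yes", "credit": "fair", "buys": "no"},
    {"age": "youth", "student": "no", "credit": "fair", "buys": "no"},
    {"age": "middle-aged", "student": "no", "credit": "excellent", "buys": "yes"},
    {"age": "senior", "student": "yes", "credit": "fair", "buys": "no"},
    {"age": "senior", "student": "yes", "credit": "excellent", "buys": "yes"},
]
print(build_tree(data, ["age", "student", "credit"], "buys"))
```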
Prediction
As already mentioned, numeric prediction is the task of predicting continuous values
for a given input. For example, we may wish to predict the salary of college graduates
with 10 years of work experience, or the potential sales of a new product given its price.
The most widely used approach for numeric prediction is regression. Regression
analysis can be used to model the relationship between one or more independent or
predictor variables and a dependent or response variable. In general, the values of the
predictor variables are known. The response variable is what we want to predict.
Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor
variable, x. It is the simplest form of regression, and models y as a linear function of x.
That is,
y = b+wx
where b and w are regression coefficients specifying the y-intercept and slope of the
line, respectively. These coefficients can also be thought of as weights, so that we can
equivalently write
y = w0 + w1x
Let D be a training set consisting of values of the predictor variable, x, for some
population and their associated values for the response variable, y. The training set
contains |D| data points of the form (x1, y1), (x2, y2), …, (x|D|, y|D|). The regression
coefficients can be estimated by the method of least squares, using the following
equations:
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²
w0 = ȳ − w1·x̄
where x̄ and ȳ are the means of the x-values and y-values in D, respectively.
Example
Table: number of years of work experience (x) of a college graduate and the
corresponding salary (y). Predict the salary after 10 years of experience.
Solution
From the data, x̄ = 10.4 and ȳ = 72100. Substituting these means into the equations
above gives w1 and w0, and the predicted salary is obtained by evaluating
y = w0 + w1·10.
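Since the data table itself is not reproduced above, the following sketch uses a small made-up set of (experience, salary) pairs purely to illustrate how the least-squares equations are applied; the numbers are assumptions, not the example's data.

```python
# Least-squares estimation of w0 and w1 for straight-line regression y = w0 + w1*x.
# The (experience, salary) pairs below are illustrative only; they are not the
# data from the example table.
data = [(3, 30000), (8, 57000), (9, 64000), (13, 72000), (6, 43000),
        (11, 59000), (16, 83000), (1, 20000)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

w1 = sum((x - x_bar) * (y - y_bar) for x, y in data) / \
     sum((x - x_bar) ** 2 for x, _ in data)
w0 = y_bar - w1 * x_bar

print(f"w1 = {w1:.2f}, w0 = {w0:.2f}")
print(f"predicted salary for x = 10 years: {w0 + w1 * 10:.2f}")
```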
Unit 8
Introduction to Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters. A cluster of
data objects can be treated collectively as one group. Although classification is an
effective means for distinguishing groups or classes of objects, it requires the often
costly collection and labeling of a large set of training tuples or patterns, which the
classifier uses to model each group.
The dissimilarity between objects is usually measured by a distance function. For two
two-dimensional points x = (x1, y1) and y = (x2, y2), the Euclidean distance is
d(x, y) = √((x2 − x1)² + (y2 − y1)²)
More generally, the Minkowski distance is defined as d(x, y) = (Σi |xi − yi|^p)^(1/p),
where p is a positive integer; such a distance is also called the Lp norm in some
literature. It represents the Manhattan distance when p = 1 (i.e., the L1 norm) and the
Euclidean distance when p = 2 (i.e., the L2 norm). A small sketch of these distance
functions is given below.
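A minimal sketch of these distance measures, with example points chosen only for illustration:

```python
def minkowski(a, b, p):
    """Lp (Minkowski) distance between two equal-length point tuples."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):    # L1 norm, p = 1
    return minkowski(a, b, 1)

def euclidean(a, b):    # L2 norm, p = 2
    return minkowski(a, b, 2)

# Illustrative points.
x, y = (1, 3), (4, 5)
print(manhattan(x, y))   # 3 + 2 = 5
print(euclidean(x, y))   # sqrt(9 + 4) ≈ 3.61
```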
Categorization of Clustering Algorithms
Many clustering algorithms exist in the literature. In general, the major clustering
methods can be classified into the following categories.
Partitioning methods: Given a database of n objects or data tuples, a partitioning
method constructs k partitions of the data, where each partition represents a
cluster and k <n. Given k, the number of partitions to construct, a partitioning
method creates an initial partitioning. It then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one
group to another.
Hierarchical methods: A hierarchical method creates a hierarchical
decomposition of the given set of data objects. A hierarchical method can be
classified as being either agglomerative or divisive. The agglomerative approach, also
called the bottom-up approach, starts with each object forming a separate group.
It successively merges the objects or groups that are close to one another, until all
of the groups are merged into one (the topmost level of the hierarchy), or until a
termination condition holds. The divisive approach, also called the top-down
approach, starts with all of the objects in the same cluster. In each successive
iteration, a cluster is split up into smaller clusters, until eventually each object is
in one cluster, or until a termination condition holds.
Density-based methods: Most partitioning methods cluster objects based on the
distance between objects. Such methods can find only spherical-shaped clusters
and encounter difficulty at discovering clusters of arbitrary shapes. Other
clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some
threshold.
Model-based methods: Model-based methods hypothesize a model for each of
the clusters and find the best fit of the data to the given model. EM is an
algorithm that performs expectation-maximization analysis based on statistical
modeling.
K-Means Clustering
K-means is a partitioning method that groups the n objects into k clusters, each cluster
being represented by the mean (centroid) of its members. Its main steps are:
1) Select k initial cluster centers. These should be chosen in a cunning way, because
different locations cause different results; the better choice is to place them as far
away from each other as possible.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster whose center is nearest.
4) Recalculate the new cluster center using vi = (1/ci) Σ_{j=1}^{ci} xj, where ci
represents the number of data points in the i-th cluster.
5) Repeat steps 2–4 until no data point changes cluster.
Advantages
Fast, robust, and easy to understand.
Gives the best results when the clusters are distinct or well separated from each other.
Disadvantages
The learning algorithm requires a priori specification of the number of cluster
centers.
A random choice of the initial cluster centers may not lead to a fruitful result.
It is applicable only when the mean is defined, and therefore fails for categorical data.
The algorithm performs poorly on non-linearly separable data sets.
Example
Divide the data points {(1,1), (2,1), (4,3), (5,4)} into two clusters.
Solution
Let p1 = (1,1), p2 = (2,1), p3 = (4,3), p4 = (5,4).
Initial step
Let c1 = (1,1) and c2 = (2,1) be the two initial cluster centers.
Iteration 1
Calculate the distance between each cluster center and each data point:
d(c1, p1) = 0                              d(c2, p1) = √((2−1)² + (1−1)²) = 1
d(c1, p2) = √((2−1)² + (1−1)²) = 1         d(c2, p2) = 0
d(c1, p3) = √((4−1)² + (3−1)²) = 3.61      d(c2, p3) = √((4−2)² + (3−1)²) = 2.83
d(c1, p4) = √((5−1)² + (4−1)²) = 5         d(c2, p4) = √((5−2)² + (4−1)²) = 4.24
Assigning each point to its nearest center gives Cluster 1 = {p1} and Cluster 2 = {p2, p3, p4}.
The new centers are c1 = (1,1) and c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3).
Iteration 2
Calculate the distance between the new cluster centers and each data point:
d(c1, p1) = 0                              d(c2, p1) = √((11/3−1)² + (8/3−1)²) = 3.14
d(c1, p2) = 1                              d(c2, p2) = √((11/3−2)² + (8/3−1)²) = 2.36
d(c1, p3) = 3.61                           d(c2, p3) = √((4−11/3)² + (3−8/3)²) = 0.47
d(c1, p4) = 5                              d(c2, p4) = √((5−11/3)² + (4−8/3)²) = 1.89
Now p2 is closer to c1, so the assignments become Cluster 1 = {p1, p2} and
Cluster 2 = {p3, p4}, and the centers are recomputed as c1 = (1.5, 1) and c2 = (4.5, 3.5).
A further iteration leaves the assignments unchanged, so the algorithm terminates with
Cluster 1 = {p1, p2} = {(1,1), (2,1)}
Cluster 2 = {p3, p4} = {(4,3), (5,4)}
A sketch of the full computation in code is given below.
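Below is a minimal k-means sketch, written in plain Python, that reproduces the computation above for the example's four points and two initial centers; the function name and the convergence test are implementation choices of this sketch, not prescribed by the text.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic k-means: assign points to the nearest center, recompute centroids, repeat."""
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:   # converged: no center moved
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centers, clusters = kmeans(points, centers=[(1, 1), (2, 1)])
print(centers)    # [(1.5, 1.0), (4.5, 3.5)]
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
```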
K-Medoids Clustering
The k-medoids algorithm is a partitioning technique of clustering that clusters the data
set of n objects into k clusters, with k known a priori. It is more robust to noise and
outliers than k-means, because it minimizes a sum of pairwise dissimilarities instead of
a sum of squared Euclidean distances.
Algorithm
The most common realization of k-medoids clustering is the Partitioning Around
Medoids (PAM) algorithm, which works as follows (a sketch in code is given after the
steps):
1. Initialize: randomly select (without replacement) k of the n data points as the
medoids.
2. Associate each data point with the closest medoid.
3. While the cost of the configuration decreases:
For each medoid m and each non-medoid data point o:
Swap m and o, and recompute the cost (the sum of the distances of the
points to their medoid).
If the total cost of the configuration increased in the previous step, undo
the swap.
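A minimal PAM sketch following the steps above; it uses the Manhattan distance (as in the worked example below) and exhaustively tries every medoid/non-medoid swap. The helper names are illustrative.

```python
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cost(points, medoids):
    """Sum of distances from each point to its closest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)          # step 1: random initial medoids
    best = cost(points, medoids)
    improved = True
    while improved:                             # step 3: loop while the cost decreases
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]   # try swapping medoid i and o
                c = cost(points, candidate)
                if c < best:                    # keep the swap only if the cost drops
                    medoids, best, improved = candidate, c, True
    # step 2 (final): associate each point with its closest medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: manhattan(p, m))
        clusters[nearest].append(p)
    return medoids, best, clusters

points = [(1, 3), (4, 5), (6, 3), (3, 4), (2, 1)]
print(pam(points, k=2))
```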
Example
Cluster the following data set of five objects into two clusters, i.e. k = 2, using the
Manhattan distance. The data points are {(1,3), (4,5), (6,3), (3,4), (2,1)}.
Solution
Let p1=(1,3) p2=(4,5) p3=(6,3) p4=(3,4) p5=(2,1)
Initial step
Let m1 = p1 = (1,3) and m2 = p4 = (3,4) be the two initial medoids.
Iteration 1
Calculate the Manhattan distance between each medoid and each non-medoid point:
d(m1, p2) = 3 + 2 = 5        d(m2, p2) = 1 + 1 = 2
d(m1, p3) = 5 + 0 = 5        d(m2, p3) = 3 + 1 = 4
d(m1, p5) = 1 + 2 = 3        d(m2, p5) = 1 + 3 = 4
Each point is associated with its closest medoid, giving Cluster 1 = {p1, p5} and
Cluster 2 = {p2, p3, p4}, with total cost 2 + 4 + 3 = 9.
Iteration 2
Try swapping m1 and p2, so that m1 = p2 = (4,5) and m2 = p4 = (3,4):
d(m1, p1) = 3 + 2 = 5        d(m2, p1) = 2 + 1 = 3
d(m1, p3) = 2 + 2 = 4        d(m2, p3) = 3 + 1 = 4
d(m1, p5) = 2 + 4 = 6        d(m2, p5) = 1 + 3 = 4
The total cost of this configuration is 3 + 4 + 4 = 11, which is higher than 9, so the swap
is undone. The remaining candidate swaps are evaluated in the same way, and none of
them lowers the cost, so the algorithm keeps the medoids p1 and p4. The final clusters are
Cluster 1 = {p1, p5}
Cluster 2 = {p2, p3, p4}
CLARANS (“Randomized” CLARA)
CLARANS (Clustering Large Applications based upon RANdomized Search) draws a
sample of neighboring solutions dynamically while it searches. A solution is a set of k
medoids, and the solution space can be represented by a graph in which every node is a
potential solution, i.e., a set of k medoids.
Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of
clusters. Strategies for hierarchical clustering generally fall into two types:
Agglomerative: This is a "bottom up" approach: each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
A distance matrix is used to decide which clusters to merge or split. There are lots of
alternative ways to define the distance between two sets of points; the most common
ones are given below.
Single-Link Distance
Single-Link Distance between clusters Ci and Cj is the minimum distance between any
object in Ci and any object in Cj. The distance is defined by the two most similar
objects
Dsl(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
Complete-Link Distance
Complete-Link Distance between clusters Ci and Cj is the maximum distance between
any object in Ci and any object in Cj. The distance is defined by the two most
dissimilar objects.
Dcl(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
Group Average Distance
Group Average Distance between clusters Ci and Cj is the average distance between any
object in Ci and any object in Cj .
Davg(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} d(x, y)
Centroid Distance
Centroid Distance between clusters Ci and Cj is the distance between the centroid ri of
Ci and the centroid rj of Cj .
Example
Using single-linkage (minimum-distance) agglomerative clustering, build a hierarchy of
clusters for the six points below.
Solution
Assume A = (1,1), B = (1.5,1.5), C = (5,5), D = (3,4), E = (4,4), F = (3,3.5).
Distance matrix (Euclidean distances, lower triangle):
      A     B     C     D     E     F
A  0.00
B  0.71  0.00
C  5.66  4.95  0.00
D  3.61  2.92  2.24  0.00
E  4.24  3.54  1.41  1.00  0.00
F  3.20  2.50  2.50  0.50  1.12  0.00
In this case, the closest pair of clusters is F and D, with the shortest distance of 0.50.
Thus, we group D and F into cluster (D, F) and update the distance matrix. Distances
between the ungrouped clusters do not change from the original distance matrix; the
distance between the newly grouped cluster (D, F) and any other cluster is taken as the
minimum of that cluster's distances to D and to F.
Looking at the lower triangle of the updated distance matrix, we find that the closest
distance is now 0.71, between cluster A and cluster B. Thus, we group A and B into a
single cluster (A, B) and update the distance matrix again; apart from the row and
column for (A, B), the entries of the new matrix are unchanged.
Observing the updated distance matrix, the closest distance between clusters is now
1.00, between cluster E and cluster (D, F) (the distance from E to D). Thus, we merge
them into cluster ((D, F), E) and update the distance matrix once more.
After that, we merge cluster ((D, F), E) and cluster C, at distance 1.41 (the distance from
C to E), into a new cluster (((D, F), E), C). The final merge joins (((D, F), E), C) with
(A, B), at which point all objects belong to a single cluster and the hierarchy is complete.
A sketch that reproduces this sequence of merges programmatically is given below.
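The merge sequence above can be cross-checked with a short script; this is a sketch that assumes SciPy is available and uses its single-linkage implementation:

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# The six example points.
points = [(1, 1), (1.5, 1.5), (5, 5), (3, 4), (4, 4), (3, 3.5)]
labels = ["A", "B", "C", "D", "E", "F"]

# Pairwise Euclidean distances (the initial distance matrix).
condensed = pdist(points)
print(squareform(condensed).round(2))

# Single-linkage agglomeration. Each row of Z describes one merge:
# (cluster index 1, cluster index 2, merge distance, number of points merged).
Z = linkage(condensed, method="single")
print(Z.round(2))
# Expected merge order: (D, F) at 0.50, (A, B) at 0.71,
# ((D, F), E) at 1.00, then C joins at 1.41, and finally the two groups join.
```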
Divisive Clustering
A cluster hierarchy can also be generated top-down. This variant of hierarchical
clustering is called top-down or divisive clustering. We start at the top with all data in
one cluster. The cluster is split into two clusters such that the objects in one subgroup
are far from the objects in the other. This procedure is applied recursively until the
required number of clusters is formed. The method is not considered attractive in
general, because there exist O(2^n) ways of splitting a cluster of n objects.
Algorithm
1. Start with all data points in a single cluster.
2. Repeat:
• Choose the cluster to be split.
• Split this cluster into two.
3. Until K clusters are created.
A simple heuristic sketch of this procedure follows.
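The sketch below is one simple splitting heuristic, not an algorithm prescribed by the text: at each step it splits the cluster with the largest diameter by seeding two subgroups with its two farthest-apart points and assigning every other member to the nearer seed.

```python
import math

def diameter(cluster):
    """Largest pairwise distance inside a cluster (0 for singletons)."""
    return max((math.dist(a, b) for a in cluster for b in cluster), default=0.0)

def split(cluster):
    """Split a cluster around its two farthest-apart points."""
    a, b = max(((a, b) for a in cluster for b in cluster),
               key=lambda pair: math.dist(*pair))
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right

def divisive(points, k):
    clusters = [list(points)]                 # step 1: everything in one cluster
    while len(clusters) < k:                  # step 3: stop at K clusters
        widest = max(clusters, key=diameter)  # step 2a: choose the cluster to split
        clusters.remove(widest)
        clusters.extend(split(widest))        # step 2b: split it into two
    return clusters

points = [(1, 1), (1.5, 1.5), (5, 5), (3, 4), (4, 4), (3, 3.5)]
print(divisive(points, k=2))
```

With the six points from the agglomerative example, this heuristic produces the same two top-level groups, {A, B} and {C, D, E, F}.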
Dissimilarity Measures of Binary Attributes
For two objects i and j described by binary attributes, let
q – the number of attributes that equal 1 for both objects,
r – the number of attributes that equal 1 for object i but 0 for object j,
s – the number of attributes that equal 0 for object i but 1 for object j,
t – the number of attributes that equal 0 for both objects.
For symmetric binary attributes, the dissimilarity is d(i, j) = (r + s) / (q + r + s + t);
for asymmetric binary attributes, where matches on 0 are considered unimportant,
d(i, j) = (r + s) / (q + r + s).
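A small sketch of these counts and ratios (the two example attribute vectors are arbitrary):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Dissimilarity between two equal-length 0/1 attribute vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    denom = q + r + s + t if symmetric else q + r + s
    return (r + s) / denom

print(binary_dissimilarity([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))                   # symmetric
print(binary_dissimilarity([1, 0, 1, 1, 0], [1, 1, 0, 1, 0], symmetric=False))  # asymmetric
```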
Data Transformation
Data transformation is the process of converting data or information from one format to
another, usually from the format of a source system into the required format of a new
destination system. The usual process involves converting documents, but data
conversions sometimes involve the conversion of a program from one computer
language to another to enable the program to run on a different platform. The usual
reason for this data migration is the adoption of a new system that's totally different
from the previous one.
Data Transformation involves two key phases:
1. Data Mapping: The mapping of elements from the source system to the
destination system, capturing all transformations that occur. This becomes more
complicated when there are complex transformations, such as many-to-one or
one-to-many transformation rules.
2. Code Generation: The creation of the actual transformation program. The
resulting data map specification is used to create an executable program to run
on computer systems.
Data Normalization
Normalization is normally done when there is a distance computation involved in the
algorithm. Some of the techniques of normalization are:
Min-Max Normalization
Min-max normalization transforms a value A of an attribute into a value B that fits in
the range [C, D]. It is given by the formula
B = ((A − min_A) / (max_A − min_A)) × (D − C) + C
where min_A and max_A are the minimum and maximum values of the attribute.
Consider the example below: the salary value is 50000 and we want to transform it into
the range [0.0, 1.0]. The maximum salary is 55000 and the minimum salary is 25000, so
the new scaled value for 50000 will be
(50000 − 25000) / (55000 − 25000) = 0.833.
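A minimal sketch of min-max normalization applied to the salary example (the surrounding list of salary values is illustrative):

```python
def min_max(value, min_a, max_a, new_min=0.0, new_max=1.0):
    """Scale `value` from [min_a, max_a] into [new_min, new_max]."""
    return (value - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

salaries = [25000, 32000, 41000, 50000, 55000]   # illustrative attribute values
lo, hi = min(salaries), max(salaries)
print(min_max(50000, lo, hi))                    # 0.833...
print([round(min_max(s, lo, hi), 3) for s in salaries])
```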
Unit 9
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of
text-based documents. Examples of information retrieval systems include online library
catalogue systems, online document management systems, and web search systems.
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
An information retrieval system often needs to trade off recall for precision or vice
versa, and the F-score is the commonly used single measure of this trade-off. It is
defined as the harmonic mean of recall and precision:
F-score = (recall × precision) / ((recall + precision) / 2) = 2 × precision × recall / (precision + recall)
A short sketch of all three measures is given below.
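A minimal sketch computing the three measures from sets of relevant and retrieved document identifiers (the document IDs are made up):

```python
def precision_recall_f(relevant, retrieved):
    """Precision, recall and F-score from sets of document IDs."""
    hits = len(relevant & retrieved)             # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

relevant = {"d1", "d2", "d3", "d5"}              # documents relevant to the query
retrieved = {"d2", "d3", "d4", "d6"}             # documents returned by the system
print(precision_recall_f(relevant, retrieved))   # (0.5, 0.5, 0.5)
```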