
SDHR DEGREE & PG COLLEGE :: TIRUPATI

DATA WAREHOUSING AND DATA MINING

Unit – I:
Data Warehousing & OLAP, Basic Data Mining Tasks: Classification - Regression - Time Series Analysis - Prediction - Clustering - Summarization - Association Rules - Sequence Discovery - Data Mining versus Knowledge Discovery in Databases - The Development of Data Mining - Data Mining Issues - Data Mining Metrics - Social Implications of Data Mining - The Future, Data Pre-Processing.

DATA WAREHOUSING & OLAP


Data warehouses generalize and consolidate data in multidimensional space. The construction of a data warehouse involves steps such as data cleaning, data integration, and data transformation. These steps are collectively called data preprocessing, and the preprocessed data can be used for
data mining.

Data warehouses provide tools for online analytical processing (OLAP) and they can be
used for interactive analysis of multidimensional data of varied granularities. Data warehouses and
OLAP facilitate effective data generalization and data mining. The data mining functions, such as
association, classification, prediction, and clustering, can be integrated with OLAP operations to
enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse
has become an increasingly important platform for data analysis and OLAP and will provide an effective
platform for data mining. Therefore, data warehousing and OLAP form an essential step in the
knowledge discovery process.

Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouse systems are
valuable tools in today’s competitive, fast-evolving world. Many people feel that with competition
mounting in every industry, data warehousing is the latest must-have marketing weapon—a way to
retain customers by learning more about their needs.

Data warehouses have been defined in many ways. Loosely speaking, a data warehouse refers
to a data repository that is maintained separately from an organization’s operational databases. Data
warehouse systems allow for integration of a variety of application systems. They support information
processing by providing a solid platform of consolidated historic data for analysis.

According to William H. Inmon, a leading architect in the construction of data warehouse systems,
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process”.

The definition presents the major features of a data warehouse: (a) subject-oriented, (b) integrated, (c) time-variant, and (d) nonvolatile. These four features distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.
a) Subject-oriented: A data warehouse is organized around major subjects such as customer,
supplier, product, and sales. Rather than concentrating on the day-to-day operations and
transaction processing of an organization, a data warehouse focuses on the modeling and
analysis of data for decision makers. Hence, data warehouses typically provide a simple and
concise view of particular subject issues by excluding data that are not useful in the decision
support process.
b) Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records. Data cleaning
and data integration techniques are applied to ensure consistency in naming conventions,
encoding structures, attribute measures, and so on.
c) Time-variant: Data are stored to provide information from an historic perspective (e.g., the
past 5–10 years). Every key structure in the data warehouse contains, either implicitly or
explicitly, a time element.
d) Nonvolatile: A data warehouse is always a physically separate store of data transformed from
the application data found in the operational environment. Due to this separation, a data
warehouse does not require transaction processing, recovery, and concurrency control
mechanisms. It usually requires only two operations in data accessing i.e., initial loading of
data and access of data.

Generally, a data warehouse is a semantically consistent data store that serves as a physical
implementation of a decision support data model. It stores the information an enterprise needs to
make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by
integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries,
analytical reporting, and decision making.

Differences between Operational Database Systems and Data Warehouses


Since most people are familiar with commercial relational database systems, it is easy to
understand what a data warehouse is by comparing these two kinds of systems. The major task of
online operational database systems is to perform online transaction and query processing. These
systems are called online transaction processing (OLTP) systems. They cover most of the day-
to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll,
registration, and accounting.

Data warehouse systems, on the other hand, serve users or knowledge workers in the role of
data analysis and decision making. Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users. These systems are known as online
analytical processing (OLAP) systems. The major distinguishing features of OLTP and OLAP are
summarized as follows:

• Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
• Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historic data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier to use for informed decision
making.
• Database design: An OLTP system usually adopts an entity-relationship (ER) data model and
an application-oriented database design. An OLAP system typically adopts either a star or a
snowflake model and a subject-oriented database design.


• View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historic data or data in different organizations. In contrast, an OLAP system
often spans multiple versions of a database schema, due to the evolutionary process of an
organization. An OLAP system also deals with information that originates from different
organizations, integrating information from many data stores. Because of their huge volume,
OLAP data are stored on multiple storage media.
• Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations (because most data warehouses
store historic rather than up-to-date information), although many could be complex queries.

The operational databases or traditional databases support the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms (e.g., locking and logging) are required to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access to data records for summarization and aggregation. Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP system.

DATA WAREHOUSING ARCHITECTURE


A data warehouse uses a three-tier architecture, i.e., (a) a bottom tier (data warehouse server), (b) a middle tier (OLAP server), and (c) a top tier (front-end tools).

a) The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (e.g., customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g.,
to merge similar data from different sources into a unified format), as well as load and refresh
functions to update the data warehouse. The data are extracted using application program
interfaces known as gateways. This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.

b) The middle tier is an OLAP server that is typically implemented using either (1) a relational
OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or (2) a multidimensional OLAP
(MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data
and operations).

c) The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools.


Figure 1: Three-Tier Data Warehouse Architecture

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
A Data Warehouse can be constructed by using three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
covering all aspects of the entire organization. It provides corporate-wide data integration, usually
from one or more operational systems or external information providers, and is cross-functional in
scope. It typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be
implemented on traditional mainframes, computer super servers, or parallel architecture platforms.
It requires extensive business modeling and may take years to design and build.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group
of users. The scope is confined to specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data contained in data marts tend to be
summarized.

Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or
Windows based. The implementation cycle of a data mart is more likely to be measured in weeks
rather than months or years. Depending on the source of data, data marts can be categorized as
independent or dependent.

Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a particular
department or geographic area. Dependent data marts are sourced directly from enterprise data
warehouses.


Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on operational database servers.

Extraction, Transformation, and Loading


Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:

a) Data extraction: In the data extraction process, the extraction tool typically gathers data
from multiple, heterogeneous, and external sources.
b) Data cleaning: Data cleaning is the process of detecting errors in the data and rectifying them
when possible.
c) Data transformation: Data transformation is the process of converting data from legacy or
host format to warehouse format.
d) Load: The load function sorts, summarizes, consolidates, computes views, checks
integrity, and builds indices and partitions.
e) Refresh: The refresh function propagates the updates from the data sources to the warehouse.
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems
usually provide a good set of data warehouse management tools. Data cleaning and data
transformation are important steps in improving the data quality and, subsequently, the data mining
results.
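The following Python sketch (using the pandas library) illustrates these back-end functions on a small scale: it extracts records from two hypothetical CSV sources, cleans them, transforms them into a common warehouse format, and loads a summarized result. The file names, column names, and conversion rate are assumptions made only for illustration; they are not part of any particular warehouse product.

```python
# A minimal ETL sketch using pandas; file names and columns are hypothetical.
import pandas as pd

# Extraction: gather data from multiple, heterogeneous sources.
orders_eu = pd.read_csv("orders_eu.csv")   # assumed columns: order_id, order_date, amount_eur
orders_us = pd.read_csv("orders_us.csv")   # assumed columns: order_id, order_date, amount_usd

# Cleaning: remove duplicate records and rows with missing amounts.
orders_eu = orders_eu.drop_duplicates().dropna(subset=["amount_eur"])
orders_us = orders_us.drop_duplicates().dropna(subset=["amount_usd"])

# Transformation: convert both sources into a unified warehouse format.
EUR_TO_USD = 1.1  # assumed fixed rate, for illustration only
orders_eu = orders_eu.assign(amount_usd=orders_eu["amount_eur"] * EUR_TO_USD)
unified = pd.concat(
    [orders_eu[["order_id", "order_date", "amount_usd"]],
     orders_us[["order_id", "order_date", "amount_usd"]]],
    ignore_index=True,
)
unified["order_date"] = pd.to_datetime(unified["order_date"])

# Load: summarize (consolidate) and write the result to the warehouse store.
daily_sales = unified.groupby(unified["order_date"].dt.date)["amount_usd"].sum()
unified.to_csv("warehouse_orders.csv", index=False)
daily_sales.to_csv("warehouse_daily_sales.csv")
```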

Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted data, the
source of the extracted data, and missing fields that have been added by data cleaning or integration
processes. A data warehouse contains different levels of summarization, of which metadata is one.

A metadata repository should contain the following:

• A description of the data warehouse structure, which includes the warehouse schema,
view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and
contents.
• Operational metadata, which include data lineage (history of migrated data and the sequence
of transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).


• The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
• Mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, transformation rules and defaults, data refresh and purging rules, and security (user
authorization and access control).
• Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of refresh,
update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

Data Mining
We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need. “Now we are living in the information age” is a popular saying; however, we are actually living in the data age. Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life.

The explosive growth of available data volume is a result of the computerization of our society
and the fast development of powerful data collection and storage tools. Businesses worldwide generate
gigantic data sets, including sales transactions, stock trading records, product descriptions, sales
promotions, company profiles and performance, and customer feedback. For example, large stores,
such as Wal-Mart, handle hundreds of millions of transactions per week at thousands of branches
around the world. Scientific and engineering practices generate high orders of petabytes of data in a
continuous manner, from remote sensing, process measuring, scientific experiments, system
performance, engineering observations, and environment surveillance.

Every day, the global telecommunication networks carry tens of petabytes of data. The medical
and health industry generates tremendous amounts of data from medical records, patient monitoring,
and medical imaging. Billions of Web searches supported by search engines process tens of petabytes
of data daily. Communities and social media have become increasingly important data sources,
producing digital pictures and videos, blogs, Web communities, and various kinds of social networks.
The list of sources that generate huge amounts of data is endless.

This explosively growing, widely available, and gigantic body of data makes our time truly the data age. In order to process this huge amount of data into useful knowledge, we require a variety of tools. This necessity has led to the birth of data mining. The data mining field is young, dynamic, and promising.

Nowadays, simple structured query language (SQL) queries are not adequate to support these increased demands for information. Data mining is used to meet these needs. The term data mining is often defined as finding hidden information in a database. Alternatively, data mining is also called exploratory data analysis, data-driven discovery, and deductive learning.


In traditional databases, users access the database through a well-defined query stated in a language such as SQL. The output of the query consists of the data from the database that satisfies the query. The output is usually a subset of the database, but it may also be an extracted view or may contain aggregations. In data mining, the access of the database differs from this traditional access in several ways:

• Query: In Data Mining the query might not be well formed or precisely stated. The data miner
might not even be exactly sure of what he wants to see.
• Data: In Data Mining the data accessed is usually a different version from that of the original
operational database. The data have been cleansed and modified to better support the mining
process.
• Output: In Data Mining the output of the data mining query probably is not a subset of the
database. Instead it is the output of some analysis of the contents of the database.

Data mining involves many different algorithms to accomplish different tasks. All of these
algorithms attempt to fit a model to the data. The algorithms examine the data and determine a model
that is closest to the characteristics of the data being examined. Data mining algorithms can be
characterized as consisting of three parts:

• Model: The purpose of the algorithm is to fit a model to the data.


• Preference: Some criteria must be used to fit one model over another.
• Search: All algorithms require some technique to search the data.

Figure 2: Data Mining Models

A predictive model makes a prediction about values of data using known results found from
different data. Predictive modeling may be made based on the use of other historical data. Predictive
model data mining tasks include classification, regression, time series analysis, and prediction.

A descriptive model identifies patterns or relationships in data. Unlike the predictive model,
a descriptive model serves as a way to explore the properties of the data examined, not to predict
new properties. Clustering, summarization, association rules, and sequence discovery are usually
viewed as descriptive in nature.


Supervised, Unsupervised and Semi-Supervised Learning


In Data Mining the Learning process or algorithms are classified as

a) Supervised: In the supervised learning mode we have input variables (X) and an output
variable (Y), and we use an algorithm to learn the mapping function from the input to the
output, i.e.,

Y = f(X)
The goal of supervised learning is to approximate the mapping function so well that,
when we have new input data (X), we can predict the output variable (Y) for that data.
This type of learning is called supervised learning because the algorithm learns from the
training dataset, and this can be thought of as a teacher supervising the learning process.
We know the correct answers; the algorithm iteratively makes predictions on the training
data and is corrected by the teacher. Learning stops when the algorithm achieves an
acceptable level of performance. Supervised learning problems can be further grouped into
regression and classification problems.

b) Unsupervised: In the unsupervised learning model we have only input data (X) and no
corresponding output variables. The goal of unsupervised learning is to model the underlying
structure or distribution in the data in order to learn more about the data. This is called
unsupervised learning because, unlike supervised learning, there are no correct answers and
there is no teacher. Algorithms are left to their own devices to discover and present the
interesting structure in the data. Unsupervised learning problems can be further grouped into
clustering and association problems.

c) Semi-Supervised: In the semi-supervised learning process we have a large amount of input
data (X) and only some of the data are labeled (Y). These kinds of problems sit in between
supervised and unsupervised learning. A good example is a photo archive where only some of
the images are labeled (e.g., dog, cat, person) and the majority are unlabeled. Many real-world
machine learning problems fall into this area, because it can be expensive or time consuming
to label data, as it may require access to domain experts, whereas unlabeled data are cheap
and easy to collect and store.

We can use unsupervised learning techniques to discover and learn the structure in the input variables. We can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed those predictions back into the supervised learning algorithm as training data, and use the model to make predictions on new, unseen data.
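The difference between these learning modes can be sketched with scikit-learn on synthetic data. The dataset and the choice of logistic regression and k-means are assumptions made only for illustration; any classifier or clustering algorithm would serve the same purpose.

```python
# Supervised vs. unsupervised learning on synthetic data (a minimal sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Labeled data: X holds the input variables, y the known output variable.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: learn the mapping Y = f(X) from the training set, then predict.
clf = LogisticRegression().fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: only X is used; the algorithm discovers structure (clusters).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((labels == c).sum()) for c in set(labels)])

# Semi-supervised (idea only): train on the small labeled portion, predict
# labels for the unlabeled portion, and feed those predictions back as
# additional training data.
```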

Basic Data Mining Tasks

Classification
“Classification is a process which maps data into predefined groups or classes”. It is
often referred to as supervised learning because the classes are determined before examining the
data. Two examples of classification applications are determining whether to make a bank loan and
identifying credit risks. Classification algorithms require that the classes be defined based on data
attribute values. They often describe these classes by looking at the characteristics of data already
known to belong to the classes. Pattern recognition is a type of classification where an input pattern
is classified into one of several classes based on its similarity to these predefined classes.

Example 1: An airport security screening station is used to determine if passengers are potential
terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance
between eyes, size and shape of mouth, shape of head, etc.) is identified. This pattern is compared
to entries in a database to see if it matches any patterns that are associated with known offenders.

Example 2: The bank authorities want to know whether a loan applicant is capable of repaying a loan or not. The bank constructs a model using past data, from which some rules are derived; based on these rules, the bank authorities decide whether the loan should be sanctioned or not.
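A minimal sketch of the loan example using a decision tree classifier follows. The attributes, the tiny training set, and the class labels are invented for illustration; a real bank would use far more data and attributes.

```python
# Classification sketch: map loan applicants into predefined classes.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [annual_income_in_thousands, years_employed, existing_loans]
X_train = [[25, 1, 2], [60, 8, 0], [45, 5, 1], [18, 0, 3], [80, 12, 0], [30, 2, 2]]
y_train = ["reject", "approve", "approve", "reject", "approve", "reject"]

# Build the model (the "rules") from past data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify a new applicant into one of the predefined classes.
new_applicant = [[50, 4, 1]]
print(model.predict(new_applicant))
```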

Regression
“Regression is a process used to map a data item to a real valued prediction variable”. In fact, regression involves learning the function that does this mapping. Regression assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is “best”.

Example: An employee wishes to reach a certain level of savings before retirement. Periodically, the employee predicts what her retirement savings will be based on its current value and several past values. The employee uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.
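The savings example can be sketched with an ordinary least-squares line fitted to past balances. The yearly figures below are invented; the point is only the idea of fitting a linear function and extrapolating it.

```python
# Simple linear regression sketch: fit past savings to a line and extrapolate.
import numpy as np

years = np.array([0, 1, 2, 3, 4])                    # past observation points
savings = np.array([1.0, 1.35, 1.72, 2.08, 2.41])    # balance in lakhs (assumed values)

# Fit savings ~ slope * year + intercept (degree-1 least-squares polynomial).
slope, intercept = np.polyfit(years, savings, deg=1)

# Use the fitted function to predict the balance 10 years from the start.
predicted = slope * 10 + intercept
print(f"predicted savings after 10 years: {predicted:.2f} lakhs")
```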

Time Series Analysis


With the help of time series analysis, the value of an attribute is examined as it varies over
time. The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.). A
time series plot is used to visualize the time series.

From the above figure we can easily see that the plots for Y and Z have similar behavior, while X
appears to have less volatility. There are three basic functions performed in time series analysis:

• In one case, distance measures are used to determine the similarity between different
time series.
• In the second case, the structure of the line is examined to determine (and perhaps
classify) its behavior.
• A third application would be to use the historical time series plot to predict future values.


Example: Mr. Smith is trying to determine whether to purchase stock from Companies X, Y, or Z.
For a period of one month he charts the daily stock price for each company. The above Figure shows
the time series plot that Mr. Smith has generated. Using this and similar information available from
his stockbroker, Mr. Smith decides to purchase stock X because it is less volatile while overall showing
a slightly larger relative amount of growth than either of the other stocks. As a matter of fact, the
stocks for Y and Z have a similar behavior. The behavior of Y between days 6 and 20 is identical to
that for Z between days 13 and 27.
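A brief sketch of two of these functions, measuring volatility and similarity between series, is given below. The three price series are generated artificially to mimic the example; they are not real stock data.

```python
# Time series sketch: compare volatility and similarity of three price series.
import numpy as np

days = np.arange(30)
rng = np.random.default_rng(0)
x = 100 + 0.3 * days + rng.normal(0, 0.5, 30)        # steady, low volatility
y = 100 + 5 * np.sin(days / 3) + 0.2 * days          # more volatile
z = 100 + 5 * np.sin((days - 7) / 3) + 0.2 * days    # Y's behavior shifted in time

# Volatility: standard deviation of the day-to-day changes.
for name, series in [("X", x), ("Y", y), ("Z", z)]:
    print(name, "volatility:", round(float(np.diff(series).std()), 2))

# Similarity: Euclidean distance between series (smaller means more similar).
print("dist(Y, Z):", round(float(np.linalg.norm(y - z)), 2))
print("dist(X, Y):", round(float(np.linalg.norm(x - y)), 2))
```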

Prediction
Many real-world data mining applications can be seen as predicting future data states based
on past and current data. Prediction can be viewed as a type of classification. The difference is that
prediction is predicting a future state rather than a current state. Here we are referring to a type of
application rather than to a type of data mining modeling approach. Prediction applications include
flooding, speech recognition, machine learning, and pattern recognition. Although future values may
be predicted using time series analysis or regression techniques, other approaches may be used as
well. The following Example illustrates the process.

Example: Predicting flooding is a difficult problem. One approach uses monitors placed at various
points in the river. These monitors collect data relevant to flood prediction: water level, rain
amount, time, humidity, and so on. Then the water level at a potential flooding point in the river can
be predicted based on the data collected by the sensors upriver from this point. The prediction must
be made with respect to the time the data were collected.

Clustering
Clustering is similar to classification except that the groups are not predefined, but rather
defined by the data alone. Clustering is alternatively referred to as unsupervised learning or
segmentation. It can be thought of as partitioning or segmenting the data into groups that might or
might not be disjointed. The clustering is usually accomplished by determining the similarity among
the data on predefined attributes. The most similar data are grouped into clusters. The clusters are
not predefined; a domain expert is often required to interpret the meaning of the created clusters.

A special type of clustering is called segmentation. With segmentation, a database is partitioned into disjointed groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering. In other circles, segmentation is viewed as a specific type of clustering applied to a database itself. In this text we use the two terms, clustering and segmentation, interchangeably.
The following Example provides a simple clustering example.

Example: A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location, and physical characteristics of
potential customers (age, height, weight, etc.). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering
of potential customers based on the determined attribute values. The results of the clustering exercise
are then used by management to create special catalogs and distribute them to the correct target
population based on the cluster for that catalog.
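The catalog example can be sketched with k-means over a few assumed customer attributes. The data, the attributes, and the choice of three clusters are illustrative assumptions only.

```python
# Clustering sketch: group customers without predefined classes (k-means).
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual_income_in_thousands, age, distance_from_store_km]
customers = np.array([
    [30, 25, 5], [32, 27, 6], [85, 45, 20],
    [90, 50, 22], [55, 35, 10], [58, 33, 12],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("cluster labels:", kmeans.labels_)        # which segment each customer falls into
print("cluster centers:\n", kmeans.cluster_centers_)

# A domain expert would then interpret each cluster (e.g., "young, nearby,
# lower income") and design a catalog targeted to it.
```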


Summarization
Summarization maps data into subsets with associated simple descriptions. Summarization is
also called characterization or generalization. It extracts or derives representative information about
the database. This may be accomplished by actually retrieving portions of the data. Alternatively,
summary type information (such as the mean of some numeric attribute) can be derived from the
data. The summarization succinctly characterizes the contents of the database. The following example
illustrates this process.

Example: One of the many criteria used to compare universities by the U.S. News & World Report is the average SAT or ACT score. This is a summarization used to estimate the type and intellectual level of the student body.
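As a tiny illustration, a summary statistic such as the average score can be derived directly from the data; the scores below are invented.

```python
# Summarization sketch: characterize a student body by its average test score.
import statistics

sat_scores = [1190, 1240, 1310, 1150, 1280, 1220, 1350]  # hypothetical sample
print("mean SAT score:", round(statistics.mean(sat_scores)))
print("median SAT score:", statistics.median(sat_scores))
```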

Association Rules
Association rules are also called link analysis, alternatively referred to as affinity analysis or association, and refer to the data mining task of uncovering relationships among data. The best example of this type of application is to determine association rules. An association rule is a model
that identifies specific types of data associations. These associations are often used in the retail
sales community to identify items that are frequently purchased together. The following example
illustrates the use of association rules in market basket analysis. Here the data analyzed consist of
information about what items a customer purchases. Associations are also used in many other
applications such as predicting the failure of telecommunication switches.

Example: A grocery store retailer is trying to decide whether to put bread on sale. To help determine
the impact of this decision, the retailer generates association rules that show what other products are
frequently purchased with bread. He finds that 60% of the time that bread is sold so are pretzels and
that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the association
between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the
bread is placed. In addition, he decides not to place either of these items on sale at the same time.

While using association rules users must be cautioned that these are not causal relationships.
They do not represent any relationship inherent in the actual data (as is true with functional
dependencies) or in the real world. There probably is no relationship between bread and pretzels that
causes them to be purchased together. And there is no guarantee that this association will apply in
the future. However, association rules can be used to assist retail store management in effective
advertising, marketing, and inventory control.
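A minimal sketch of how figures such as the retailer's 60% arise is given below: the support and confidence of a rule like bread => pretzels are computed over a small, made-up set of market baskets.

```python
# Association rule sketch: support and confidence for "bread => pretzels".
baskets = [                      # hypothetical market-basket transactions
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly", "milk"},
    {"milk", "eggs"},
    {"bread", "pretzels", "milk"},
]

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]   # <= is the subset test
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

support = sum(1 for t in baskets if {"bread", "pretzels"} <= t) / len(baskets)
print("support(bread, pretzels):", support)                                           # 0.6
print("confidence(bread -> pretzels):", confidence({"bread"}, {"pretzels"}, baskets))  # 0.75
```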

Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential patterns in data.
These patterns are based on a time sequence of actions. These patterns are similar to associations in
that data (or events) are found to be related, but the relationship is based on time. Unlike a market
basket analysis, which requires the items to be purchased at the same time, in sequence discovery
the items are purchased over time in some order. The following example illustrates the discovery of some simple
patterns. A similar type of discovery can be seen in the sequence within which data are purchased. For example, most people who purchase CD players may be found to purchase CDs within one week. As we will see, temporal association rules really fall into this category.

Example: The Webmaster at the XYZ Corp. periodically analyzes the Web log data to determine how
users of XYZ's Web pages access them. He is interested in determining what sequences of pages
are frequently accessed. He determines that 70 percent of the users of page A follow one of the
following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C). He then determines to add a
link directly from page A to page C.
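The Webmaster's analysis can be sketched by counting ordered page-access patterns over session logs; the sessions below are invented to mirror the example.

```python
# Sequence discovery sketch: count time-ordered page-access patterns.
from collections import Counter

# Hypothetical user sessions (each is a time-ordered list of pages visited).
sessions = [
    ["A", "B", "C"], ["A", "D", "B", "C"], ["A", "E", "B", "C"],
    ["A", "B", "C"], ["D", "B"], ["A", "C"],
]

# How often does each full access sequence occur?
pattern_counts = Counter(tuple(s) for s in sessions)
print(pattern_counts.most_common(3))

# What fraction of sessions that start at page A eventually reach page C?
starts_at_a = [s for s in sessions if s[0] == "A"]
reach_c = [s for s in starts_at_a if "C" in s[1:]]
print("fraction of A-sessions reaching C:", len(reach_c) / len(starts_at_a))
```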

DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES


The terms knowledge discovery in databases (KDD) and data mining are often used interchangeably. KDD, the process of discovering useful (hidden) patterns in data, goes by many other names: knowledge extraction, information discovery, exploratory data analysis, information harvesting, and unsupervised pattern recognition. Over the last few years KDD has been used to refer to a process consisting of many steps, while data mining is only one of these steps.

• DEFINITION: Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data.
• DEFINITION: Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.

The KDD process is often said to be nontrivial; however, we take the larger view that KDD is an all-encompassing concept. A traditional SQL database query can be viewed as the data mining part of a KDD process. Indeed, this may be viewed as somewhat simple and trivial. However, this was not the case 30 years ago. If we were to advance 30 years into the future, we might find that processes thought of today as nontrivial and complex will be viewed as equally simple. The definition of KDD includes the keyword useful. Although some definitions have included the term "potentially useful," we believe that if the information found in the process is not useful, then it really is not information. Of course, the idea of being useful is relative and depends on the individuals involved.

Figure 3: KDD Process

KDD is a process that involves many different steps. The input to the KDD process is the data,
and the output is the useful information desired by the users. However, the objective may be unclear
or inexact. The process itself is interactive and may require much elapsed time. To ensure the
usefulness and accuracy of the results of the process, interaction throughout the process with both
domain experts and technical experts might be needed. The KDD process consists of the following five
steps, they are


1) Selection: The data needed for the data mining process may be obtained from many different
and heterogeneous data sources. This first step obtains the data from various databases, files,
and non-electronic sources.
2) Preprocessing: The data to be used by the process may have incorrect or missing data. There
may be anomalous data from multiple sources involving different data types and metrics. There
may be many different activities performed at this time. Erroneous data may be corrected or
removed, whereas missing data must be supplied or predicted (often using data mining tools).
3) Transformation: Data from different sources must be converted into a common format for
processing. Some data may be encoded or transformed into more usable formats. Data
reduction may be used to reduce the number of possible data values being considered.
4) Data mining: Based on the data mining task being performed, this step applies algorithms to
the transformed data to generate the desired results.
5) Interpretation/evaluation: Presenting the mining result is extremely important because the
usefulness of the results is dependent on it. Various visualization and GUI strategies are used
at this last step.

Popular Visualization Techniques


Visualization refers to the visual presentation of data. The old expression "a picture is worth a
thousand words" certainly is true when examining the structure of data. For example, a line graph
that shows the distribution of a data variable is easier to understand and perhaps more informative
than the formula for the corresponding distribution. The use of visualization techniques allows users to
summarize, extract, and grasp more complex results than more mathematical or text type descriptions
of the results. Visualization techniques include:

• Graphical: Traditional graph structures including bar charts, pie charts, histograms,
and line graphs may be used.
• Geometric: Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based: Using figures, colors, or other icons can improve the presentation of the results.
• Pixel-based: With these techniques each data value is shown as a uniquely colored pixel.
• Hierarchical: These techniques hierarchically divide the display area (screen) into regions
based on data values.
• Hybrid: The preceding approaches can be combined into one display.
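A few of these techniques (a histogram, a box plot, and a scatter diagram) can be produced directly with matplotlib. The data plotted here are random numbers used purely as placeholders.

```python
# Visualization sketch: graphical and geometric techniques with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(50, 10, 500)                 # some numeric attribute (placeholder)
x, y = rng.normal(size=500), rng.normal(size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(values, bins=20)                    # graphical: histogram
axes[0].set_title("Histogram")
axes[1].boxplot(values)                          # geometric: box plot
axes[1].set_title("Box plot")
axes[2].scatter(x, y, s=5)                       # geometric: scatter diagram
axes[2].set_title("Scatter diagram")
plt.tight_layout()
plt.show()
```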

The Development of Data Mining


The current evolution of data mining functions and products is the result of years of influence
from many disciplines, including databases, information retrieval, statistics, algorithms, and machine
learning. The following figure describes the historical perspective of data mining.


Figure 4: Historical Perspective of Data Mining

Apart from the above areas another computer science area that has had a major impact on
the KDD process is multimedia and graphics. A major goal of KDD is to be able to describe the results
of the KDD process in a meaningful manner. Because many different results are often produced, this
is a nontrivial problem. Visualization techniques often involve sophisticated multimedia and graphics
presentations. In addition, data mining techniques can be applied to multimedia applications. The
following figure shows the time-line of data mining.

Unlike previous research in these disparate areas, a major trend in the database community is
to combine results from these seemingly different disciplines into one unifying data or algorithmic
approach. Although in its infancy, the ultimate goal of this evolution is to develop a "big picture" view
of the area that will facilitate integration of the various types of applications into real-world user
domains.

The above figure shows developments in the areas of artificial intelligence (AI), information
retrieval (IR), databases (DB), and statistics (Stat) leading to the current view of data mining. These
different historical influences, which have led to the development of the total data mining area, have
given rise to different views of what data mining functions actually are:

• Induction is used to proceed from very specific knowledge to more general information.
This type of technique is often found in AI applications.
• Because the primary objective of data mining is to describe some characteristics of a set of
data by a general model, this approach can be viewed as a type of compression. Here the
detailed data within the database are abstracted and compressed to a smaller description of
the data characteristics that are found in the model.
• As stated earlier, the data mining process itself can be viewed as a type of querying the
underlying database. Indeed, an ongoing direction of data mining research is how to define a
data mining query and whether a query language (like SQL) can be developed to capture the
many different types of data mining queries.
• Describing a large database can be viewed as using approximation to help uncover hidden
information about the data.
• When dealing with large databases, the impact of size and efficiency of developing an abstract
model can be thought of as a type of search problem.
DATA MINING ISSUES
There are many important implementation issues associated with data mining:

1) Human interaction: Since data mining problems are often not precisely stated, interfaces
may be needed with both domain and technical experts. Technical experts are used to
formulate the queries and assist in interpreting the results. Users are needed to identify training
data and desired results.
2) Overfitting: When a model is generated that is associated with a given database state, it is
desirable that the model also fit future database states. Overfitting occurs when the model
does not fit future states. This may be caused by assumptions that are made about the data
or may simply be caused by the small size of the training database. For example, a classification
model for an employee database may be developed to classify employees as short, medium,
or tall. If the training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in the training
database under five feet eight. In this case, many future employees would be erroneously
classified as short. Overfitting can arise under other circumstances as well, even though the
data are not changing.
3) Outliers: There are often many data entries that do not fit nicely into the derived model. This
becomes even more of an issue with very large databases. If a model is developed that includes
these outliers, then the model may not behave well for data that are not outliers.
4) Interpretation of results: Currently, data mining output may require experts to correctly
interpret the results, which might otherwise be meaningless to the average database user.
5) Visualization of results: To easily view and understand the output of data mining algorithms,
visualization of the results is helpful.
6) Large datasets: The massive datasets associated with data mining create problems when
applying algorithms designed for small datasets. Many modeling applications grow
exponentially on the dataset size and thus are too inefficient for larger datasets. Sampling and
parallelization are effective tools to attack this scalability problem.
7) High dimensionality: A conventional database schema may be composed of many different
attributes. The problem here is that not all attributes may be needed to solve a given data
mining problem. In fact, the use of some attributes may interfere with the correct completion
of a data mining task. The use of other attributes may simply increase the overall complexity
and decrease the efficiency of an algorithm. This problem is sometimes referred to as the
dimensionality curse, meaning that there are many attributes (dimensions) involved and it is
difficult to determine which ones should be used. One solution to this high dimensionality
problem is to reduce the number of attributes, which is known as dimensionality reduction.
However, determining which attributes are not needed is not always easy to do.
8) Multimedia data: Most previous data mining algorithms are targeted to traditional data types
(numeric, character, text, etc.). The use of multimedia data such as is found in GIS databases
complicates or invalidates many proposed algorithms.
9) Missing data: During the preprocessing phase of KDD, missing data may be replaced with
estimates. This and other approaches to handling missing data can lead to invalid results in
the data mining step.
10) Irrelevant data: Some attributes in the database might not be of interest to the data mining
task being developed.
11) Noisy data: Some attribute values might be invalid or incorrect. These values are often
corrected before running data mining applications.
12) Changing data: Databases cannot be assumed to be static. However, most data mining
algorithms do assume a static database. This requires that the algorithm be completely rerun
anytime the database changes.
13) Integration: The KDD process is not currently integrated into normal data processing
activities. KDD requests may be treated as special, unusual, or one-time needs. This makes
them inefficient, ineffective, and not general enough to be used on an ongoing basis.
Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.
14) Application: Determining the intended use for the information obtained from the data mining
function is a challenge. Indeed, how business executives can effectively use the output is
sometimes considered the more difficult part, not the running of the algorithms themselves.
Because the data are of a type that has not previously been known, business practices may
have to be modified to determine how to effectively use the information uncovered.

The above issues have to be addressed by researchers when developing data mining algorithms and products.

SOCIAL IMPLICATIONS OF DATA MINING


The integration of data mining techniques into normal day-to-day activities has become commonplace. We are confronted daily with targeted advertising, and businesses have become more efficient through the use of data mining activities to reduce costs. Data mining adversaries, however, are concerned that this information is being obtained at the cost of reduced privacy. Privacy is a fundamental right of every human being. The mining of private information leads to many problems, and this is one of the major problems that researchers need to address. Data mining applications can derive much demographic information concerning customers that was previously not known or hidden in the data. The unauthorized use of such data could result in the disclosure of information that is deemed to be confidential.

There is an increase in interest in data mining techniques targeted to such applications as fraud detection, identifying criminal suspects, and prediction of potential terrorists. These can be
viewed as types of classification problems. The approach that is often used here is one of "profiling"
the typical behavior or characteristics involved. Indeed, many classification techniques work by
identifying the attribute values that commonly occur for the target class. Subsequent records will be
then classified based on these attribute values. Keep in mind that these approaches to classification are imperfect and mistakes can be made. Just because an individual makes a series of credit card purchases that are similar to those often made when a card is stolen does not mean that the card is stolen or that the individual is a criminal. Users of data mining techniques must be sensitive to these issues and must not violate any privacy directives or guidelines.

THE FUTURE
The advent of the relational data model and SQL were milestones in the evolution of database
systems. Currently, data mining is little more than a set of tools that can be used to uncover previously
hidden information in a database. While there are many tools to aid in this process, there is no all-
encompassing model or approach. Over the next few years, the following things need to be developed.

• Efficient algorithms with better interface techniques should be developed.


• Current data mining tools require much human interaction not only to define the request, but
also to interpret the results. As the tools become better and more integrated, this extensive
human interaction is likely to decrease.
• The various data mining applications currently available are of diverse types, so the
development of a complete data mining model is required.
• A "query language" has to be developed which includes traditional functions as well as more
complicated requests such as those found in OLAP and data mining applications.
• A data mining query language (DMQL) based on SQL has been proposed. Unlike SQL where
the access is assumed to be only to relational databases, DMQL allows access to background
information such as concept hierarchies. Another difference is that the retrieved data need not
be a subset or aggregate of data from relations. Thus, a DMQL statement must indicate the
type of knowledge to be mined.

DATA PREPROCESSING
Data are of value when they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Data preprocessing is an important data mining technique which involves transforming raw
data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking
in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues. Data preprocessing prepares raw data for further processing.

Three of the elements defining data quality are accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values).

• The data collection instruments used may be faulty.


• There may have been human or computer errors occurring at data entry.


• Users may purposely submit incorrect data values for mandatory fields when they do not wish
to submit personal information (e.g., by choosing the default value “January 1” displayed for
birthday). This kind of data is known as disguised missing data.
• Errors in data transmission can also occur.
• There may be technology limitations such as limited buffer size for coordinating synchronized
data transfer and consumption.
• Incorrect data may also result from inconsistencies in naming conventions or data codes, or
inconsistent formats for input fields (e.g., date).
• Duplicate tuples also require data cleaning.

Incomplete data can occur for a number of reasons. Attributes of interest may not always be
available, such as customer information for sales transaction data. Other data may not be included
simply because they were not considered important at the time of entry. Relevant data may not be
recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent
with other recorded data may have been deleted. Furthermore, the recording of the data history or
modifications may have been overlooked. Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.

Major Tasks in Data Preprocessing


The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation.

Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty,
they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data
can cause confusion for the mining procedure, resulting in unreliable output. Although most mining
routines have some procedures for dealing with incomplete or noisy data, they are not always robust.
Instead, they may concentrate on avoiding overfitting the data to the function being modeled.
Therefore, a useful preprocessing step is to run your data through some data cleaning routines.

Data Integration is the process of integrating multiple databases, data cubes, or files from
multiple sources. Yet some attributes representing a given concept may have different names in
different databases, causing inconsistencies and redundancies. For example, the attribute for customer
identification may be referred to as “customer id” in one data store and “cust id” in another. Naming
inconsistencies may also occur for attribute values.

Data reduction obtains a reduced representation of the data set that is much smaller in
volume, but it produces the same (or almost the same) analytical results. Data reduction strategies
include dimensionality reduction and numerosity reduction. In dimensionality reduction, data
encoding schemes are applied so as to obtain a reduced or “compressed” representation of the
original data. Examples include data compression techniques (e.g., wavelet transforms and principal
components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute
construction (e.g., where a small set of more useful attributes is derived from the original set). In
numerosity reduction, the data are replaced by alternative, smaller representations using
parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms,
clusters, sampling, or data aggregation).
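Two of these strategies, dimensionality reduction with principal components analysis and numerosity reduction by sampling, are sketched below with scikit-learn and NumPy. The synthetic data set and the 90% variance threshold are assumptions for illustration only.

```python
# Data reduction sketch: PCA (dimensionality reduction) and random sampling
# (numerosity reduction) on a synthetic data set.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))                            # 3 underlying factors
data = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(1000, 20))

# Dimensionality reduction: keep enough components to explain ~90% of the variance.
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(data)
print("attributes kept:", reduced.shape[1], "of", data.shape[1])

# Numerosity reduction: a simple random sample of 10% of the tuples.
sample_idx = rng.choice(len(data), size=len(data) // 10, replace=False)
sample = data[sample_idx]
print("sample size:", sample.shape[0])
```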

Data Cleaning
The available real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. The following are some basic methods of data cleaning:

Missing Values
The following methods are used to fill the missing values.

1. Ignore the tuple: This method is adopted when the class label is missing in the given data. This
method is not very effective, unless the tuple contains several attributes with missing values. It is
especially poor when the percentage of missing values per attribute varies considerably. By
ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data
could have been useful to the task at hand.
2. Filling in the missing value manually: Filling in the missing value manually is a time-consuming
process and may not be feasible when there are many missing values.
3. Using a global constant to fill the missing value: In this method we replace all missing attribute
values by the same constant, such as a label like “Unknown” or a special value such as −∞. If the missing values
are replaced say, “Unknown,” then the mining program may mistakenly think that they form an
interesting concept, since they all have a value in common—that of “Unknown.” Hence, this
method is simple, but it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in
the missing value: Measures of central tendency indicate the “middle” value of a data
distribution. For normal (symmetric) data distributions the mean can be used, while skewed data
distributions should employ the median. For example, suppose that the data distribution regarding
the income of customers is symmetric and that the mean income is Rs 56,000/-. This mean income
can be used to replace the missing values for income.
5. Use the attribute mean or median for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk, we may replace the
missing value with the mean income value for customers in the same credit risk category as that
of the given tuple.


6. Use the most probable value to fill in the missing value: In this method the missing value
may be determined with regression, inference-based tools using a Bayesian formalism, or decision
tree induction.
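As a minimal illustration of methods 3 to 5 above, the following Python sketch (assuming pandas and
NumPy are available) fills a hypothetical income column in three different ways; the column names and
values are invented.

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, np.nan, 31000, np.nan, 62000],
})

# Method 3: fill with a global constant (a label such as "Unknown" would be used for text columns)
filled_constant = df["income"].fillna(-1)

# Method 4: fill with a measure of central tendency (mean for symmetric data, median for skewed data)
filled_mean = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean income of tuples belonging to the same class (here, credit_risk)
filled_class_mean = df["income"].fillna(df.groupby("credit_risk")["income"].transform("mean"))

print(filled_constant.tolist())
print(filled_mean.tolist())
print(filled_class_mean.tolist())
```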

Noisy Data
Noise is a random error or variance in a measured variable. The basic statistical description
techniques (e.g., boxplots and scatter plots), and methods of data visualization can be used to identify
outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we
“smooth” out the data to remove the noise? The following techniques are used to smooth the data by
removing the noise.

1. Binning: Binning is one of the techniques used to smooth a sorted data value by consulting
its “neighborhood,” that is, the values around it. In this technique the sorted values are distributed
into a number of “buckets,” or bins. The following smoothing techniques are used:

a) Smoothing using bin means


b) Smoothing using bin boundaries
c) Regression and
d) Outlier analysis

a) Smoothing using the bin means: In this method the given data for a column is first sorted and
divided into equal-frequency bins; the mean of each bin is then computed and every value in the
bin is replaced by that mean. Example: If the given values for the price column are
{4, 8, 15, 21, 21, 24, 25, 28, 34}, there are nine values. These nine values can be grouped into
3 equal bins, i.e., Bin-1 {4, 8, 15}, Bin-2 {21, 21, 24}, Bin-3 {25, 28, 34}. The mean of Bin-1 is 9,
of Bin-2 is 22, and of Bin-3 is 29. Replacing the original values in the bins by these means gives
Bin-1 {9, 9, 9}, Bin-2 {22, 22, 22} and Bin-3 {29, 29, 29}.

b) Smoothing using the bin boundaries: In this method the given data for the column is sorted and
divided into equal-frequency bins; the minimum and maximum value in each bin are identified,
and these are called the bin boundaries. Each value in a bin is then replaced by the closer of the
two bin boundaries. Example: For the same price values {4, 8, 15, 21, 21, 24, 25, 28, 34} and the
same bins Bin-1 {4, 8, 15}, Bin-2 {21, 21, 24}, Bin-3 {25, 28, 34}, the boundaries of Bin-1 are 4
and 15, of Bin-2 are 21 and 24, and of Bin-3 are 25 and 34. Replacing each original value by its
closest bin boundary gives Bin-1 {4, 4, 15}, Bin-2 {21, 21, 24} and Bin-3 {25, 25, 34}. Both
binning methods are sketched in code after the figure below.


Figure 5: Binning for Data smoothing
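The following is a minimal Python sketch of the two binning methods above, applied to the same nine
price values; no external library is needed.

```python
# Equal-frequency binning of the sorted price values into 3 bins
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the bin's min and max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```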

c) Regression: Data smoothing can also be done by regression, a technique that conforms
data values to a function. Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to predict the other. Multiple
linear regression is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
d) Outlier analysis: Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set
of clusters may be considered outliers. Consider the following figure.

Figure 6: Customer Data (plot representing customers from three different places)
From the above figure, the customers have been grouped based on place. The clustering has
created three groups of customers, which are represented as circles. Some customers do not fit
into any of the three groups; these are called outliers.
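A minimal sketch of outlier detection through clustering, assuming scikit-learn and NumPy are available;
the three customer groups, the stray point, and the distance cutoff are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight groups of 2-D customer points around invented centers, plus one stray point
rng = np.random.default_rng(0)
centers = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(5, 2)) for c in centers] + [np.array([[5.0, 10.0]])])

# Cluster the points and measure each point's distance to its own cluster center
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
distances = np.linalg.norm(points - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Points that fall far outside every cluster (beyond an illustrative cutoff) are treated as outliers
print(points[distances > 2.0])   # expected to flag roughly the stray point near (5, 10)
```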

Discrepancy Detection
Another basic step in data cleaning as a process is discrepancy detection. Discrepancies can
be caused by several factors, including poorly designed data entry forms that have many optional
fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge
information about themselves), and data decay (e.g., outdated addresses). Discrepancies may also
arise from inconsistent data representations and inconsistent use of codes. Other sources of
discrepancies include errors in instrumentation devices that record data and system errors. Errors can
also occur when the data are (inadequately) used for purposes other than originally intended. There
may also be inconsistencies due
to data integration (e.g., where a given attribute can have different names in different databases).
The data should also be examined regarding unique rules, consecutive rules, and null rules.

A unique rule says that each value of the given attribute must be different from all other
values for that attribute. A consecutive rule says that there can be no missing values between the
lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check
numbers). A null rule specifies the use of blanks, question marks, special characters, or other strings
that may indicate the null condition and how such values should be handled.

Reasons for missing values may include (1) the person originally asked to provide a value for
the attribute refuses and/or finds that the information requested is not applicable (e.g., a license
number attribute left blank by non-drivers ); (2) the data entry person does not know the correct
value; or (3) the value is to be provided by a later step of the process.

The null rule should specify how to record the null condition, for example, such as to store
zero for numeric attributes, a blank for character attributes, or any other conventions that may be in
use (e.g., entries like “don’t know” or “?” should be transformed to blank).

Data Integration
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data
set. This can help improve the accuracy and speed of the subsequent data mining process. The
semantic heterogeneity and structure of data pose great challenges in data integration. The
following are the major problems in data integration:

1) Matching schemas and objects from different sources (entity identification)
2) Detecting whether any attributes are correlated (redundancy)
3) Tuple duplication and
4) Detection and resolution of data value conflicts.

Entity Identification Problem: Data analysis tasks often involve data integration, which combines
data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat files. Matching up
equivalent real-world entities from multiple data sources is a challenging task and this is referred to
as the entity identification problem. For example, how can the data analyst or the computer be sure
that customer_id in one database and cust_number in another refer to the same attribute? Examples
of metadata for each attribute include the name, meaning, data type, and range of values permitted
for the attribute, and null rules for handling blank, zero, or null values. Such metadata can be used to
help avoid errors in schema integration. The metadata may also be used to help transform the data
(e.g., where data codes for pay type in one database may be “H” and “S” but 1 and 2 in another).


Redundancy and Correlation Analysis: Redundancy is another important issue in data integration.
An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from
another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set. Some of the redundancies can be detected by correlation
analysis. Given two attributes, such analysis can measure how strongly one attribute implies the
other, based on the available data. For nominal data, we use the X2 (chi-square) test. For numeric
attributes, we can use the correlation coefficient and covariance, both of which assess how one
attribute’s values vary from those of another.
Further readings: correlation analysis, Chi-Square test, correlation coefficient and
covariance
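A minimal sketch of both redundancy checks, assuming SciPy and NumPy are available; the contingency
table of two nominal attributes and the two numeric columns are made-up illustrations.

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Chi-square test for two nominal attributes (observed counts in a 2 x 2 contingency table)
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.4g}")   # a tiny p-value suggests the attributes are correlated

# Correlation coefficient and covariance for two numeric attributes
age = np.array([23, 31, 45, 52, 60, 38])
income = np.array([25000, 32000, 50000, 61000, 70000, 41000])
r, _ = pearsonr(age, income)
cov = np.cov(age, income)[0, 1]
print(f"correlation coefficient = {r:.3f}, covariance = {cov:.1f}")
```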

Tuple Duplication: In addition to detecting redundancies between attributes, duplication should also
be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique
data entry case). The use of de-normalized tables (often done to improve performance by avoiding
joins) is another source of data redundancy. Inconsistencies often arise between various duplicates,
due to inaccurate data entry or updating some but not all data occurrences. For example, if a purchase
order database contains attributes for the purchaser’s name and address instead of a key to this
information in a purchaser database, discrepancies can occur, such as the same purchaser’s name
appearing with different addresses within the purchase order database.

Data Value Conflict Detection and Resolution: Data integration also involves the detection and
resolution of data value conflicts. For example, for the same real-world entity, attribute values from
different sources may differ. This may be due to differences in representation, scaling, or encoding.
For instance, a weight attribute may be stored in metric units in one system and British imperial units
in another.

Data Reduction
Data reduction techniques are applied to obtain a reduced representation of the data set that
is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on
the reduced data set should be more efficient yet produce the same (or almost the same) analytical
results.
Data reduction strategies include:
a) dimensionality reduction,
b) numerosity reduction, and
c) data compression.

a) Dimensionality reduction is the process of reducing the number of random variables or attributes
under consideration. Dimensionality reduction methods include wavelet transforms and principal
components analysis, which transform or project the original data onto a smaller space. Attribute
subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or
redundant attributes or dimensions are detected and removed.

b) Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation. These techniques may be parametric or nonparametric. For
parametric methods, a model is used to estimate the data, so that typically only the data
parameters need to be stored, instead of the actual data. Regression and log-linear models are the
best examples. Nonparametric
methods for storing reduced representations of the data include histograms, clustering, sampling,
and data cube aggregation.

c) In data compression, transformations are applied so as to obtain a reduced or “compressed”


representation of the original data. If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called lossless. If, instead, we can
reconstruct only an approximation of the original data, then the data reduction is called lossy. There
are several lossless algorithms for string compression; however, they typically allow only limited
data manipulation. Dimensionality reduction and numerosity reduction techniques can also be
considered forms of data Compression.
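A minimal sketch, assuming scikit-learn and NumPy are available, of two of the strategies above:
principal components analysis for dimensionality reduction and simple random sampling for numerosity
reduction. The data set is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 10))      # 1000 tuples, 10 numeric attributes

# Dimensionality reduction: project the 10 attributes onto 3 principal components
reduced = PCA(n_components=3).fit_transform(data)
print(reduced.shape)                    # (1000, 3)

# Numerosity reduction (nonparametric): keep a 10% simple random sample of the tuples
sample = data[rng.choice(len(data), size=100, replace=False)]
print(sample.shape)                     # (100, 10)
```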

There are many other ways of organizing methods of data reduction. The computational time
spent on data reduction should not outweigh or “erase” the time saved by mining on a reduced data
set size.


UNIT II:
Basic Data Mining Tasks, Principles of dimensional modeling-design decisions, Dimensional Modeling
basics-E-R Modeling versus Dimensional modeling-use of CASE tools-The star schema-Review of a simple
STAR schema, inside a dimension table, inside the fact table, the factless fact table, Data Granularity.
Star Schema keys: primary keys, surrogate keys, foreign keys. Advantages of star schema, Dimensional
Modeling: Updates to the dimensional tables-Miscellaneous Dimensions-The Snowflake schema-
Aggregate fact tables-Families of stars

Basic Data Mining Tasks

Classification
“Classification is a process which maps data into predefined groups or classes”. It is
often referred to as supervised learning because the classes are determined before examining the
data. Two examples of classification applications are determining whether to make a bank loan and
identifying credit risks. Classification algorithms require that the classes be defined based on data
attribute values. They often describe these classes by looking at the characteristics of data already
known to belong to the classes. Pattern recognition is a type of classification where an input pattern
is classified into one of several classes based on its similarity to these predefined classes.

Example 1: An airport security screening station is used to determine if passengers are potential
terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance
between eyes, size and shape of mouth, shape of head, etc.) is identified. This pattern is compared
to entries in a database to see if it matches any patterns that are associated with known offenders.

Example 2: The bank authorities want to know whether a loan applicant is capable of repaying a loan
or not. For this, the bank constructs a model using past data, from which rules are derived; based on
these rules the bank authorities decide whether the loan should be sanctioned or not.
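A minimal Python sketch of Example 2, assuming scikit-learn is available; the applicant attributes, the
tiny training set, and the class labels are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Past loan data: [annual income in thousands, years employed] and the known class label
X_train = [[20, 1], [35, 4], [60, 8], [15, 0], [80, 12], [28, 2]]
y_train = ["defaulted", "repaid", "repaid", "defaulted", "repaid", "defaulted"]

# Supervised learning: the classes (repaid / defaulted) are defined before examining the data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Map a new loan applicant into one of the predefined classes
print(model.predict([[45, 5]]))   # e.g. ['repaid']
```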

Regression
“Regression is a process used to map a data item to a real valued prediction variable”.
In actuality, regression involves learning the function that performs this mapping. Regression
assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then
determines the best function of this type that models the given data. Some type of error analysis is
used to determine which function is "best”.

Example 1: An employee wishes to reach a certain level of savings before retirement. Periodically,
the employee predicts what her retirement savings will be based on its current value and several past
values. The employee uses a simple linear regression formula to predict this value by fitting past
behavior to a linear function and then using this function to predict the values at points in the future.
Based on these values, she then alters her investment portfolio.
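A minimal sketch of this example using NumPy's least-squares polynomial fit; the yearly savings figures
and the prediction horizon are hypothetical.

```python
import numpy as np

# Past retirement-savings balances (hypothetical), observed at years 1..5
years = np.array([1, 2, 3, 4, 5])
savings = np.array([12000, 18500, 24800, 31200, 37900])

# Fit a simple linear function: savings = slope * year + intercept
slope, intercept = np.polyfit(years, savings, deg=1)

# Use the fitted function to predict the balance at a future point (year 15)
print(f"Predicted savings at year 15: {slope * 15 + intercept:,.0f}")
```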


Time Series Analysis


With the help of time series analysis, the value of an attribute is examined as it varies over
time. The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.). A
time series plot is used to visualize the time series.

From the above figure we can easily see that the plots for Y and Z have similar behavior, while X
appears to have less volatility. There are three basic functions performed in time series analysis:

• In one case, distance measures are used to determine the similarity between different
time series.
• In the second case, the structure of the line is examined to determine (and perhaps
classify) its behavior.
• A third application would be to use the historical time series plot to predict future values.

Example: Mr. Smith is trying to determine whether to purchase stock from Companies X, Y, or Z.
For a period of one month he charts the daily stock price for each company. The above Figure shows
the time series plot that Mr. Smith has generated. Using this and similar information available from
his stockbroker, Mr. Smith decides to purchase stock X because it is less volatile while overall showing
a slightly larger relative amount of growth than either of the other stocks. As a matter of fact, the
stocks for Y and Z have a similar behavior. The behavior of Y between days 6 and 20 is identical to
that for Z between days 13 and 27.
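A minimal NumPy sketch of two of the three functions listed above: similarity between series measured
by Euclidean distance, and volatility measured by the standard deviation of day-to-day changes. The
daily prices for X, Y, and Z are invented.

```python
import numpy as np

# Hypothetical daily closing prices for stocks X, Y and Z over ten days
x = np.array([10.0, 10.1, 10.3, 10.2, 10.4, 10.5, 10.6, 10.8, 10.9, 11.0])
y = np.array([10.0, 11.5, 9.0, 12.0, 10.5, 13.0, 11.0, 14.0, 12.5, 15.0])
z = y + 0.2   # Z behaves almost identically to Y

# Similarity: a smaller Euclidean distance means more similar behavior
print("d(Y, Z) =", round(float(np.linalg.norm(y - z)), 2), " d(X, Y) =", round(float(np.linalg.norm(x - y)), 2))

# Volatility: standard deviation of the day-to-day changes
for name, series in [("X", x), ("Y", y), ("Z", z)]:
    print(name, "volatility:", round(float(np.diff(series).std()), 2))
```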

Prediction
Many real-world data mining applications can be seen as predicting future data states based
on past and current data. Prediction can be viewed as a type of classification. The difference is that
prediction is predicting a future state rather than a current state. Here we are referring to a type of
application rather than to a type of data mining modeling approach. Prediction applications include
flooding, speech recognition, machine learning, and pattern recognition. Although future values may
be predicted using time series analysis or regression techniques, other approaches may be used as
well. The following Example illustrates the process.

Example: Predicting flooding is a difficult problem. One approach uses monitors placed at various
points in the river. These monitors collect data relevant to flood prediction: water level, rain
amount, time, humidity, and so on. Then the water level at a potential flooding point in the river can
be predicted based on the data collected by the sensors upriver from this point. The prediction must
be made with respect to the time the data were collected.


Clustering
Clustering is similar to classification except that the groups are not predefined, but rather
defined by the data alone. Clustering is alternatively referred to as unsupervised learning or
segmentation. It can be thought of as partitioning or segmenting the data into groups that might or
might not be disjointed. The clustering is usually accomplished by determining the similarity among
the data on predefined attributes. The most similar data are grouped into clusters. The clusters are
not predefined; a domain expert is often required to interpret the meaning of the created clusters.

A special type of clustering is called segmentation. With segmentation a database is partitioned


into disjointed groupings of similar tuples called segments. Segmentation is often viewed as being
identical to clustering. In other circles segmentation is viewed as a specific type of clustering applied
to a database itself. In this text we use the two terms, clustering and segmentation, interchangeably.
The following Example provides a simple clustering example.

Example: A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location, and physical characteristics of
potential customers (age, height, weight, etc.). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering
of potential customers based on the determined attribute values. The results of the clustering exercise
are then used by management to create special catalogs and distribute them to the correct target
population based on the cluster for that catalog.
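A minimal sketch of the catalog-segmentation idea, assuming pandas and scikit-learn are available; the
customer attributes are invented, and a domain expert would still have to interpret the resulting
segments.

```python
import pandas as pd
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "income": [25000, 27000, 90000, 95000, 55000, 60000, 26000, 92000, 58000],
    "age":    [22,    25,    50,    55,    38,    40,    23,    52,    37],
})

# Unsupervised learning: the three groups are defined by the data, not known in advance
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)

# Profile each segment so that management can design a catalog targeted at it
print(customers.groupby("segment").mean())
```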

Summarization
Summarization maps data into subsets with associated simple descriptions. Summarization is
also called characterization or generalization. It extracts or derives representative information about
the database. This may be accomplished by actually retrieving portions of the data. Alternatively,
summary type information (such as the mean of some numeric attribute) can be derived from the
data. The summarization succinctly characterizes the contents of the database. The following example
illustrates this process.

Example: One of the many criteria used to compare universities by the U.S. News & World Report is
the average SAT or ACT score. This is a summarization used to estimate the type and intellectual
level of the student body.

Association Rules
Association rules, also called link analysis and alternatively referred to as affinity analysis
or association, refer to the data mining task of uncovering relationships among data. The best
example of this type of application is to determine association rules. An association rule is a model
that identifies specific types of data associations. These associations are often used in the retail
sales community to identify items that are frequently purchased together. The following example
illustrates the use of association rules in market basket analysis. Here the data analyzed consist of
information about what items a customer purchases. Associations are also used in many other
applications such as predicting the failure of telecommunication switches.


Example: A grocery store retailer is trying to decide whether to put bread on sale. To help determine
the impact of this decision, the retailer generates association rules that show what other products are
frequently purchased with bread. He finds that 60% of the time that bread is sold so are pretzels and
that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the association
between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the
bread is placed. In addition, he decides not to place either of these items on sale at the same time.

While using association rules users must be cautioned that these are not causal relationships.
They do not represent any relationship inherent in the actual data (as is true with functional
dependencies) or in the real world. There probably is no relationship between bread and pretzels that
causes them to be purchased together. And there is no guarantee that this association will apply in
the future. However, association rules can be used to assist retail store management in effective
advertising, marketing, and inventory control.
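A minimal sketch that computes the support and confidence behind a rule such as bread => jelly from a
handful of made-up market-basket transactions; no external library is needed.

```python
# Each transaction is the set of items in one customer's basket (invented data)
transactions = [
    {"bread", "jelly", "milk"},
    {"bread", "pretzels"},
    {"bread", "jelly", "pretzels"},
    {"milk", "eggs"},
    {"bread", "jelly"},
]

def support(itemset):
    """Fraction of all transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support({bread, jelly}) =", support({"bread", "jelly"}))            # 3/5 = 0.6
print("confidence(bread => jelly) =", confidence({"bread"}, {"jelly"}))    # 3/4 = 0.75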

Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential patterns in data.
These patterns are based on a time sequence of actions. These patterns are similar to associations in
that data (or events) are found to be related, but the relationship is based on time. Unlike a market
basket analysis, which requires the items to be purchased at the same time, in sequence discovery
the items are purchased over time in some order. The following example illustrates the discovery of some simple
patterns. A similar type of discovery can be seen in the sequence within which data are purchased. For
example, most people who purchase CD players may be found to purchase CDs within one week. As we
will see, temporal association rules really fall into this category.

Example: The Webmaster at the XYZ Corp. periodically analyzes the Web log data to determine how
users of the XYZ's Web pages access them. He is interested in determining what sequences of pages
are frequently accessed. He determines that 70 percent of the users of page A follow one of the
following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C). He then determines to add a
link directly from page A to page C.


Principles of Dimensional Modeling and Design Decisions


A dimension is a collection of logically related attributes and is viewed as an axis for modeling
the data. The time dimension could be divided into many different granularities: millennium, century,
decade, year, month, day, hour, minute, or second. Within each dimension, these entities form levels
on which various Decision Support Systems questions may be asked. The specific data stored in the
dimensions are called the facts and usually they are in the form of numeric data.

“A dimensional model is a data structure technique optimized for Data warehousing


tools”. The concept of Dimensional Modelling was developed by Ralph Kimball and is comprised of
"fact" and "dimension" tables. A Dimensional model is designed to read, summarize, analyze numeric
information like values, balances, counts, weights, etc. in a data warehouse. Dimensional models are
used in data warehouse systems; they are not a good fit for operational relational (OLTP) systems.

a) Elements of Dimensional Data Model


1. Fact: Facts are the measurements/metrics or facts from your business process. For a Sales
business process, a measurement would be quarterly sales number.
2. Dimension: Dimension provides the context surrounding a business process event. In simple
terms, they give who, what, where of a fact. In the Sales business process, for the fact
quarterly sales number, dimensions would be Who – Customer Names, Where – Location,
What – Product Name. In other words, a dimension is a window to view information in the
facts.
3. Attributes: The Attributes are the various characteristics of the dimension. In the Location
dimension, the attributes can be State, Country, Zip-code etc. Attributes are used to search,
filter, or classify facts. Dimension Tables contain Attributes.
4. Fact Table: A fact table is the primary table in a dimensional model. A Fact Table contains 1)
Measurements /facts, 2) Foreign key to dimension table.
5. Dimension table: A dimension table has the following characteristics: A) it contains the
dimensions of a fact; B) it is joined to the fact table via a foreign key; C) dimension tables are
de-normalized tables; D) the dimension attributes are the various columns in a dimension
table; E) dimensions offer descriptive characteristics of the facts with the help of their
attributes; F) no set limit is given for the number of dimensions; G) a dimension can also
contain one or more hierarchical relationships.

b) Steps for designing Dimensional Modeling


The accuracy in creating the dimensional model determines the success of the data warehouse
implementation. The following sequence of steps is followed to create a dimension model:

1. Identify the Business Process
2. Identify the Grain (level of detail)
3. Identify the Dimensions
4. Identify the Facts
5. Build the Schema (Star)

The model should describe the Why, How much, When/Where/Who and What of your business
process


1. Identify the business process


The first step is identifying the actual business process that a data warehouse should cover. This could be
Marketing, Sales, HR, etc. The selection of the business process also depends on the quality of data
available for that process. It is the most important step of the Data Modeling process, and a failure
here would have cascading and irreparable defects. To describe the business process, you can use plain
text or use basic Business Process Modelling Notation (BPMN) or Unified Modelling Language (UML).

2. Identify the grain


The Grain describes the level of detail for the business problem/solution. It is the process of
identifying the lowest level of information for any table in your data warehouse. If a table contains
sales data for every day, then it should be daily granularity. If a table contains total sales data for
each month, then it has monthly granularity.

3) Identify the dimensions


Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data
should be stored. For example, the date dimension may contain data like a year, month and weekday.

4) Identify the Fact


This step is co-associated with the business users of the system because this is where they get
access to data stored in the data warehouse. Most of the fact table rows are numerical values like
price or cost per unit, etc. Example The CEO at an MNC wants to find the sales for specific products in
different locations on a daily basis. The fact here is Sum of Sales by product by location by time.

5) Build Schema
In this step, we implement the Dimension Model. A schema is nothing but the database
structure (arrangement of tables). There are two popular schemas, i.e., the Star and Snowflake schemas.
a) Star Schema: The star schema architecture is easy to design. It is called a star schema because the
diagram resembles a star, with points radiating from a center. The center of the star consists of
the fact table, and the points of the star are dimension tables. The fact table in a star schema
is in third normal form, whereas the dimension tables are de-normalized.
b) Snowflake Schema: The snowflake schema is an extension of the star schema. In a snowflake
schema, each dimension is normalized and connected to further dimension tables.

Multidimensional Schemas
Data Warehouse systems are modeled using Multidimensional schemas. The
multidimensional schemas are designed to address the unique needs of very large databases
designed for the analytical purpose (OLAP). The following are the most popular multidimensional
schemas each having its unique advantages.

1 Star Schema
2 Snowflake Schema and
3 Galaxy Schema
1. Star Schema
The star schema is the simplest type of Data Warehouse schema and this schema is also
known as Star Schema as its structure resembles a star. In the Star schema, the center of the star
can have one
fact table and a number of associated dimension tables. It is also known as the Star Join Schema
and is optimized for querying large data sets.

Figure 7: Star Schema


From the above figure we can see dimension tables around the fact table. The fact table is at
the center which contains keys to every dimension table like Deal_ID, Model ID, Date_ID, Product_ID,
Branch_ID & other attributes like Units sold and revenue.

Characteristics of Star Schema:


• Every dimension in a star schema is represented by only one dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The Star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, in the above figure, Country_ID does
not have Country lookup table as an OLTP design would have.
• The schema is widely supported by Business Intelligence (BI) Tools

Snowflake Schema
A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions to
the existing star schema. It is called snowflake because its diagram resembles a Snowflake. The
dimension tables are normalized, which splits the data into additional tables. In the snowflake schema
figure, the fact table is at the center just as in the star schema, and the dimension tables are placed
around the fact table. These dimension tables are further normalized into additional tables that are
linked to them.


Figure 8: Snowflake Schema

Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema is that it uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Due to the multiple tables, query performance is reduced.
• The primary challenge that you will face while using the snowflake schema is that you
need to perform more maintenance effort because of the larger number of lookup tables.

Key Differences between Star and Snow Flake Schema

1. Star Schema: Hierarchies for the dimensions are stored in the dimension table.
   Snowflake Schema: Hierarchies are divided into separate tables.
2. Star Schema: It contains a fact table surrounded by dimension tables.
   Snowflake Schema: One fact table surrounded by dimension tables, which are in turn surrounded by further dimension tables.
3. Star Schema: Only a single join creates the relationship between the fact table and any dimension table.
   Snowflake Schema: Many joins are required to fetch the data.
4. Star Schema: Simple DB design.
   Snowflake Schema: Very complex DB design.
5. Star Schema: De-normalized data structure; queries also run faster.
   Snowflake Schema: Normalized data structure.
6. Star Schema: High level of data redundancy.
   Snowflake Schema: Very low level of data redundancy.
7. Star Schema: A single dimension table contains aggregated data.
   Snowflake Schema: Data is split into different dimension tables.
8. Star Schema: Cube processing is faster.
   Snowflake Schema: Cube processing might be slow because of the complex joins.
9. Star Schema: Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions.
   Snowflake Schema: Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.


Galaxy schema
A Galaxy Schema contains two fact tables that share dimension tables. It is also called a Fact
Constellation Schema. The schema is viewed as a collection of stars, hence the name Galaxy Schema.

Figure 9: Galaxy Schema

From the above figure we can see that there are two fact tables, i.e., Revenue and Product. In a
Galaxy schema the fact tables share dimension tables, and these shared dimensions are called conformed dimensions.

Characteristics of Galaxy Schema:


• The dimensions in this schema are separated into separate dimensions based on the
various levels of hierarchy. For example, if geography has four levels of hierarchy like
region, country, state, and city then Galaxy schema should have four dimensions.
• Moreover, it is possible to build this type of schema by splitting the one-star schema into
more star schemas.
• The dimensions in this schema are large, as they need to be built based on the
levels of hierarchy.
• This schema is helpful for aggregating fact tables for better understanding.

E-R Modeling versus Dimensional modeling

E-R Model is the most popular data model for operational or OLTP systems and we adopt the
Entity-
Relationship (E-R) modeling technique to create the data models for these systems. The main
functions of E-R modeling are to a) remove data redundancy, b) ensure data consistency, and c)
express microscopic relationships. This model is very useful for reporting and point
queries.

A dimensional model is a method in which the data is stored in two types of tables, namely the
fact table and the dimension tables. It has only a physical model and is good for ad hoc query analysis.
Dimensional modeling is useful for capturing critical measures and viewing them along dimensions.


E-R Modeling
1. Data is stored in an RDBMS.
2. Tables are units of storage.
3. Data is normalized and used for OLTP; optimized for OLTP processing.
4. Several tables and chains of relationships among them.
5. Volatile (several updates) and time variant.
6. SQL is used to manipulate data.
7. Detailed level of transactional data.
8. Normal reports.

Dimensional Modeling
1. Data is stored in an RDBMS or a multi-dimensional database.
2. Cubes are units of storage.
3. Data is de-normalized and used in data warehouses and data marts; optimized for OLAP.
4. Few tables; fact tables are connected to dimension tables.
5. Non-volatile and time variant.
6. ETL tools are used to manipulate data.
7. Summary of bulky transactional data (aggregates and measures) used in business decisions.
8. User friendly, interactive, drag-and-drop multi-dimensional OLAP reports.

Case Tools
A very large array of CASE (computer-aided software engineering) tools is available to support
every type of computing effort in today’s environment. Over the years, the CASE tools market has
matured with some leading vendors producing sophisticated tools. In today’s industry, no aspect of
computing seems to be beyond the scope of CASE tools. The following are sample uses of CASE tools
for different purposes.

In database design and development, CASE tools aid in the design, development, and
implementation of database systems. The following are the main features of CASE tools applicable to
the database development life cycle (DDLC).

LOGICAL DATA MODELING

• Defining and naming entities and attributes


• Selecting primary keys
• Designating alternate key attributes
• Defining one-to-one and one-to-many relationships
• Resolving many-to-many relationships
• Specifying special relationship types (n-ary, recursive, subtype)
• Defining foreign keys and specifying identifying and non -identifying relationships
• Establishing referential integrity constraints
• Completing the entity-relationship diagram (ERD)

PHYSICAL DATA MODELING (for the relational model)

• Transforming entities into tables


• Converting attributes into columns
• Assigning primary and foreign keys
• Defining data validation constraints


• Defining triggers and stored procedures for business rules


• Including triggers for INSERT, UPDATE, and DELETE to preserve referential integrity
• Set data types based on target DBMS

DIMENSIONAL DATA MODELING (For Data Warehousing)

• Defining fact tables


• Defining dimension tables
• Designing the STAR schema
• Designing outrigger tables (snowflake schema)
• Accounting for slowly changing dimensions
• Defining and attaching data warehouse rules
• Defining data warehouse sources
• Importing from data warehouse sources
• Attaching sources to columns

CALCULATING PHYSICAL STORAGE SPACE

• Estimating database table sizes


• Establishing volumes
• Setting parameters for space calculations

DOMAIN DICTIONARY

• Establishing standards
• Setting domain inheritances and overrides
• Creating domains
• Defining domain properties
• Changing domain properties

We can use a case tool to define the tables, the attributes, and the relationships. We can assign
the primary keys, indicate the foreign keys, and construct the entity-relationship
diagrams. All these operations are done very easily using graphical user interfaces and powerful drag-
and-drop facilities. After creating an initial model, we can add fields, delete fields, change field
characteristics, create new relationships, and make any number of revisions with utmost ease.
Another very useful function found in the case tools is the ability to forward-engineer the model and
generate the schema for the target database system. Forward-engineering is easily done with these
case tools.

For modeling the data warehouse, we need to concentrate on the dimensional modeling technique.
By using the case tools we can create fact tables, dimension tables, and establish the relationships
between each dimension table and the fact table. The result of these operations is a STAR schema for
the developed model.

Star Schema
STAR schema is the fundamental data design technique for the data warehouse. In the Star
schema, the center of the star has one fact table and a number of associated dimension tables.
It is also known as Star Join Schema and is optimized for querying large data sets.


Review of Simple Star Schema


The following is a graphical representation of a simple STAR schema designed for order analysis
by a manufacturing company. With this design, the marketing department is able to analyze how
orders are received and how they are processed.

Figure 10: Simple Star Schema for order analysis


Figure 10 shows a simple STAR schema for order analysis. The star schema consists
of the “Order Measures” fact table shown in the middle of the schema diagram. The “Order Measures” fact
table is surrounded by four dimension tables, i.e., “customer”, “salesperson”, “order date”, and
“product”.

From the point of view of the marketing department, the users in this department will analyze
the orders using dollar amounts, cost, profit margin, and sold quantity. This information is found in
the fact table of the structure. The users will analyze these measurements by breaking down the
numbers in combinations by customer, salesperson, date, and product. All these dimensions along
which the users will analyze are found in the structure. The structure mirrors how the users normally
view their critical measures along their business dimensions.

The above STAR schema structure automatically answers the questions of what, when, by
whom, and to whom. From the above schema, the users can easily visualize the answers to these
questions: For a given amount of dollars, what was the product sold? Who was the customer? Which
salesperson brought the order? When was the order placed?

When a query is made against the data warehouse, the results of the query are produced by
combining or joining one of more dimension tables with the fact table. The joins are between the fact
table and individual dimension tables. The relationship of a particular row in the fact table is with the
rows in each dimension table.
Take a simple query against the STAR schema. Let us say that the marketing department
wants the quantity sold and order dollars for product bigpart-1, relating to customers in the state
of Maine, obtained by salesperson Jane Doe, during the month of June. Figure 11 shows how this
query is formulated from the STAR schema. Constraints and filters for queries are easily
understood by looking at the STAR schema.
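A minimal sketch of how such a star-join query could be written against the schema of Figure 10, using
SQLite through Python. The table and column names (order_measures, customer, salesperson,
order_date, product and their keys) and the sample rows are assumptions based on the diagram, not the
actual warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product        (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE customer       (customer_key INTEGER PRIMARY KEY, customer_name TEXT, state TEXT);
CREATE TABLE salesperson    (salesperson_key INTEGER PRIMARY KEY, salesperson_name TEXT);
CREATE TABLE order_date     (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE order_measures (product_key INTEGER, customer_key INTEGER, salesperson_key INTEGER,
                             date_key INTEGER, quantity_sold INTEGER, order_dollars REAL);

INSERT INTO product        VALUES (1, 'bigpart-1'), (2, 'bigpart-2');
INSERT INTO customer       VALUES (1, 'Acme Corp', 'Maine'), (2, 'Zenith Inc', 'Texas');
INSERT INTO salesperson    VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO order_date     VALUES (1, 'June', 1999), (2, 'July', 1999);
INSERT INTO order_measures VALUES (1, 1, 1, 1, 100, 5000.0), (2, 2, 2, 2, 40, 2100.0);
""")

# Star join: constrain each dimension table, then aggregate the measures from the fact table
row = conn.execute("""
SELECT SUM(f.quantity_sold), SUM(f.order_dollars)
FROM order_measures f
JOIN product p     ON f.product_key = p.product_key
JOIN customer c    ON f.customer_key = c.customer_key
JOIN salesperson s ON f.salesperson_key = s.salesperson_key
JOIN order_date d  ON f.date_key = d.date_key
WHERE p.product_name = 'bigpart-1' AND c.state = 'Maine'
  AND s.salesperson_name = 'Jane Doe' AND d.month = 'June';
""").fetchone()
print(row)   # (100, 5000.0)
```

Note that, as described above, every join path runs between the fact table and an individual dimension
table; the dimension tables are never joined to each other.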


Figure 11: Understanding a query from the STAR schema.

Inside a Dimension Table


The key components of the STAR schema are the dimension tables. These dimension tables represent
the business dimensions along which the metrics are analyzed. The contents of a dimension table and
its characteristics are shown in Figure 12.

Figure 12: Inside a Dimension Table


From the above figure the following observations can be made.

• Dimension table key: Primary key of the dimension table uniquely identifies each row in the table.
• Table is wide: Typically, a dimension table has many columns or attributes. It is not uncommon
for some dimension tables to have more than fifty attributes. Therefore, we say that the dimension
table is wide. If you lay it out as a table with columns and rows, the table is spread out horizontally.
• Textual attributes: The attributes in a dimension table contain text-type information; we rarely
find numerical values that are used for calculations. These attributes represent the textual
descriptions of the components within the business dimensions. Users will compose their queries
using these descriptors.
• Attributes not directly related: Most of the attributes in a dimension table are not directly
related to the other attributes in the table. For example, package size is not directly related to
product brand; however, package size and product brand could both be attributes of the product
dimension table.
• Not normalized: The attributes in a dimension table are used over and over again in queries. An
attribute is taken as a constraint in a query and applied directly to the metrics in the fact table.
For efficient query performance, it is best if the query picks up an attribute from the dimension
table and goes directly to the fact table and not through other intermediary tables. If you
normalize the
dimension table, you will be creating such intermediary tables and that will not be efficient.
Therefore, a dimension table is flattened out, not normalized.
• Drilling down, rolling up: The attributes in a dimension table provide the ability to get to the
details from higher levels of aggregation to lower levels of details. For example, the three attributes
zip, city, and state form a hierarchy. You may get the total sales by state, then drill down to total
sales by city, and then by zip. Going the other way, you may first get the totals by zip, and then
roll up to totals by city and state.
• Multiple hierarchies: Dimension tables often provide for multiple hierarchies, so that drilling
down may be performed along any of the multiple hierarchies, for example a product dimension
table for a department store. In this business, the marketing department may have its way of
classifying the products into product categories and product departments. On the other hand, the
accounting department may group the products differently into categories and product
departments. So in this case, the product dimension table will have the attributes of marketing–
product–category, marketing–product–department, finance–product–category, and finance–
product–department.
• Less number of records: A dimension table typically has fewer number of records or rows than
the fact table. A product dimension table for an automaker may have just 500 rows. On the other
hand, the fact table may contain millions of rows.

Inside the Fact Table


The fact table contains the measurements. Hence we need to keep the details at the lowest
possible level. In the department store fact table for sales analysis, we may keep the units sold by
individual transactions at the cashier. Some fact tables may just contain summary data. These are
called aggregate fact tables. The Figure 13 lists the characteristics of a fact table.

Figure 13: Inside the Fact Table

The following are the characteristics of the fact table:

• Concatenated Key: A row in the fact table relates to a combination of rows from all the dimension
tables. In this example of a fact table, you find quantity ordered as an attribute. Let us say the
dimension tables are product, time, customer, and sales representative. For these dimension
tables, assume that the lowest levels in the dimension hierarchies are individual product, a
calendar date, a specific customer, and a single sales representative. Then a single row in the fact
table must relate to a particular product, a specific calendar date, a specific customer, and an
individual sales representative. This means the row in the fact table must be identified by the
primary keys of these four dimension tables. Thus, the primary key of the fact table must be the
concatenation of the primary keys of all the dimension tables.
• Data Grain: Grain is an important characteristic of the fact table. As we know, the data grain is
the level of detail for the measurements or metrics. In this example, the metrics are at the detailed
level.
The quantity ordered relates to the quantity of a particular product on a single order, on a certain
date, for a specific customer, and procured by a specific sales representative. If we keep the
quantity ordered as the quantity of a specific product for each month, then the data grain is
different and is at a higher level.
• Fully Additive Measures: Additive facts are facts that can be summed up through all of the
dimensions in the fact table. The values of these attributes may be summed up by simple addition.
Such measures are known as fully additive measures. Aggregation of fully additive measures is
done by simple addition. When we run queries to aggregate measures in the fact table, we will
have to make sure that these measures are fully additive.
• Semiadditive Measures: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others. For example, if the order_dollars is 120 and
extended_cost is 100, the margin_percentage is 20. This is a calculated metric derived from the
order_dollars and extended_cost. If you are aggregating the numbers from rows in the fact table
relating to all the customers in a particular state, you cannot add up the margin_percentages from
all these rows and come up with the aggregated number. Derived attributes such as
margin_percentage are not additive. They are known as semiadditive measures.
• Table Deep, Not Wide: Typically a fact table contains fewer attributes than a dimension table.
Usually, there are about 10 attributes or less. But the number of records in a fact table is very
large in comparison. Take a very simplistic example of 3 products, 5 customers, 30 days, and 10
sales representatives represented as rows in the dimension tables. Even in this example, the
number of fact table rows will be 4500, very large in comparison with the dimension table rows. If
you lay the fact table out as a two-dimensional table, you will note that the fact table is narrow
with a small number of columns, but very deep with a large number of rows.
• Sparse Data: A single row in the fact table relates to a particular product, a specific calendar
date, a specific customer, and an individual sales representative. In other words, for a particular
product, a specific calendar date, a specific customer, and an individual sales representative, there
is a corresponding row in the fact table. What happens when the date represents a closed holiday
and no orders are received and processed? The fact table rows for such dates will not have values
for the measures. Also, there could be other combinations of dimension table attributes, values for
which the fact table rows will have null measures. Therefore, it is important to realize this type of
sparse data and understand that the fact table could have gaps.
• Degenerate Dimensions: Look closely at the example of the fact table. You find the attributes
of order_number and order_line. These are not measures or metrics or facts. Then why are these
attributes in the fact table? When you pick up attributes for the dimension tables and the fact
tables from operational systems, you will be left with some data elements in the operational
systems that are neither facts nor strictly dimension attributes. Examples of such attributes are
reference numbers like order numbers, invoice numbers, order line numbers, and so on. These
attributes are useful in some types of analyses. For example, you may be looking for average
number of products per order. Then you will have to relate the products to the order number to
calculate the average. Attributes such as order_number and order_line in the example are called
degenerate dimensions and these are kept as attributes of the fact table.

The Factless Fact Table


Apart from the concatenated primary key, a fact table contains facts or measures. Let us say
we are building a fact table to track the attendance of students. For analyzing student attendance,
the
possible dimensions are student, course, date, room, and professor. The attendance may be affected
by any of these dimensions. When you want to mark the attendance relating to a particular course,
date, room, and professor, what is the measurement you come up for recording the event? In the fact
table row, the attendance will be indicated with the number one. Every fact table row will contain the
number one as attendance. If so, why bother to record the number one in every fact table row? There
is no need to do this. The very presence of a corresponding fact table row could indicate the
attendance. This type of situation arises when the fact table represents events. Such fact tables really
do not need to contain facts. They are “factless” fact tables. Figure 10-12 shows a typical factless fact
table.

Figure 14: Factless Fact Table

Data Granularity


Granularity represents the level of detail in the fact table. If the fact table is at the lowest
grain, then the facts or metrics are at the lowest possible level at which they could be captured from
the operational systems. When you keep the fact table at the lowest grain, the users could drill down
to the lowest level of detail from the data warehouse without the need to go to the operational systems
themselves. Base level fact tables must be at the natural lowest levels of all corresponding dimensions.
By doing this, queries for drill-down and roll-up can be performed efficiently.

Let us say we want to add a new attribute of district in the sales representative dimension.
This change will not warrant any changes in the fact table rows because these are already at the
lowest level of individual sales representative. This is a “graceful” change because all the old queries
will continue to run without any changes. Similarly, let us assume we want to add a new dimension
of promotion. Now you will have to recast the fact table rows to include promotion dimensions. Still,
the fact table grain will be at the lowest level. Even here, the old queries will still run without any
changes. This is also a “graceful” change. Fact tables at the lowest grain facilitate “graceful”
extensions. But we have to pay the price in terms of storage and maintenance for the fact table at
the lowest grain. Lowest grain necessarily means large numbers of fact table rows. In practice,
however, we build aggregate fact tables to support queries looking for summary numbers.

There are two more advantages of granular fact tables. Granular fact tables serve as natural
destinations for current operational data that may be extracted frequently from operational systems.
Further, the more recent data mining applications need details at the lowest grain. Data warehouses
feed data into data mining applications.


STAR SCHEMA KEYS

Figure 15: STAR Schema Keys


Primary Keys

Each row in a dimension table is identified by a unique value of an attribute designated as the
primary key of the dimension. In a product dimension table, the primary key identifies each product
uniquely. In the customer dimension table, the customer number identifies each customer uniquely.
Similarly, in the sales representative dimension table, the social security number of the sales
representative identifies each sales representative. However, we should not use production system keys
as primary keys in the STAR schema; surrogate keys should be used instead.

Surrogate Keys
The surrogate keys are simply system-generated sequence numbers. They do not have any
built-in meanings. Of course, the surrogate keys will be mapped to the production system keys.
Nevertheless, they are different. The general practice is to keep the operational system keys as
additional attributes in the dimension tables. The STORE KEY in the above diagram is the surrogate
primary key for the store dimension table. The operational system primary key for the store reference
table may be kept as just another non-key attribute in the store dimension table.

Foreign Key
Each dimension table is in a one-to-many relationship with the central fact table. So the
primary key of each dimension table must be a foreign key in the fact table. If there are four dimension
tables of product, date, customer, and sales representative, then the primary key of each of these four
tables must be present in the orders fact table as foreign keys.
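
To make the role of these keys concrete, the following is a minimal Python sketch (not taken from the
text; the table and column names are illustrative) showing a surrogate-keyed dimension table and a
fact row that references it through a foreign key.

from itertools import count

surrogate_key = count(1)              # system-generated sequence numbers
product_dim = {}                      # surrogate key -> dimension row

def add_product(natural_key, name):
    # Keep the operational (natural) key as an ordinary attribute of the row.
    key = next(surrogate_key)
    product_dim[key] = {"PRODUCT_KEY": key, "product_code": natural_key, "name": name}
    return key

p1 = add_product("SKU-100", "Widget")

# A fact row carries the dimension surrogate keys as foreign keys plus the measures.
fact_orders = [{"PRODUCT_KEY": p1, "DATE_KEY": 20240105, "units": 3, "amount": 29.97}]
print(fact_orders[0]["PRODUCT_KEY"] in product_dim)   # True: the foreign key resolves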

Advantages of Star Schema


Star schemas are easy for end users and applications to understand and navigate. With a well-
designed schema, users can quickly analyze large, multidimensional data sets. The main advantages
of star schemas in a decision-support environment are:
1) Query Performance
2) Load Performance and administration
3) Built-in referential integrity
4) Easily understood


1) Query performance: Because a star schema database has a small number of tables and clear join
paths, queries run faster than they do against an OLTP system. Small single-table queries, usually of
dimension tables, are almost instantaneous. Large join queries that involve multiple tables take only
seconds or minutes to run. In a star schema database design, the dimensions are linked only through
the central fact table. When two dimension tables are used in a query, only one join path, intersecting
the fact table, exists between those two tables. This design feature enforces accurate and consistent
query results.

2) Load performance and administration: Structural simplicity also reduces the time required to
load large batches of data into a star schema database. By defining facts and dimensions and
separating them into different tables, the impact of a load operation is reduced. Dimension tables can
be populated once and occasionally refreshed. You can add new facts regularly and selectively by
appending records to a fact table.

3) Built-in referential integrity: A star schema has referential integrity built in when data is loaded.
Referential integrity is enforced because each record in a dimension table has a unique primary key,
and all keys in the fact tables are legitimate foreign keys drawn from the dimension tables. A record
in the fact table that is not related correctly to a dimension cannot be given the correct key value to be
retrieved.

4) Easily understood: A star schema is easy to understand and navigate, with dimensions joined
only through the fact table. These joins are more significant to the end user, because they represent
the fundamental relationship between parts of the underlying business. Users can also browse
dimension table attributes before constructing a query.


UPDATES TO THE DIMENSIONAL TABLES


Dimensions in data management and data warehousing contain relatively static data about
such entities as geographical locations, customers, or products. Data captured by Slowly Changing
Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule.
Some scenarios can cause referential integrity problems.
For example, a database may contain a fact table that stores sales records. This fact table
would be linked to dimensions by means of foreign keys. One of these dimensions may contain data
about the company's salespeople: e.g., the regional offices in which they work. However, the
salespeople are sometimes transferred from one regional office to another. For historical sales
reporting purposes it may be necessary to keep a record of the fact that a particular sales person had
been assigned to a particular regional office at an earlier date, whereas that sales person is now
assigned to a different regional office. Dealing with these issues involves SCD management
methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

1. Type 0: retain original
2. Type 1: overwrite
3. Type 2: add new row
4. Type 3: add new attribute
5. Type 4: add history table
6. Type 6: hybrid

1. Type 0: retain original:


The Type 0 method is passive: when the value of a dimension attribute changes, no action is performed.
Values remain as they were at the time the dimension record was first inserted. In certain
circumstances history is preserved with a Type 0. Higher order types are employed to guarantee the
preservation of history, whereas Type 0 provides the least (or no) control. This approach is rarely used.

2. Type 1: overwrite:
This methodology overwrites old with new data, and therefore does not track historical data.

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA
From the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key.
Technically, the surrogate key is not necessary, since the row will be unique by the natural key
(Supplier_Code). However, to optimize performance on joins use integer rather than character keys
(unless the number of bytes in the character key is less than the number of bytes in the integer key).
If the supplier relocates the headquarters to Illinois the record would be overwritten:
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL

The disadvantage of the Type 1 method is that there is no history in the data warehouse. It
has the advantage however that it's easy to maintain. If one has calculated an aggregate table
summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.
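
A minimal Python sketch of the Type 1 overwrite, assuming the supplier dimension is held as a simple
in-memory dictionary (an illustration of the idea only, not a production ETL routine):

# SCD Type 1: overwrite the changed attribute in place; no history is kept.
supplier_dim = {
    123: {"Supplier_Code": "ABC", "Supplier_Name": "Acme Supply Co", "Supplier_State": "CA"},
}

def scd_type1_update(supplier_key, **changes):
    # The old attribute values are simply lost after the update.
    supplier_dim[supplier_key].update(changes)

scd_type1_update(123, Supplier_State="IL")
print(supplier_dim[123]["Supplier_State"])   # IL (the earlier value CA is gone)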


3. Type 2: add new row:


• This method tracks historical data by creating multiple records for a given natural key in
the dimensional tables with separate surrogate keys and/or different version numbers.
Unlimited history is preserved for each insert.
• For example, if the supplier relocates to Illinois the version numbers will be
incremented sequentially:
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA              0
123           ABC            Acme Supply Co  IL              1

Another method is to add 'effective date' columns.


Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Start_Date   End_Date
123           ABC            Acme Supply Co  CA              01-Jan-2000  21-Dec-2004
123           ABC            Acme Supply Co  IL              22-Dec-2004  NULL
The null End_Date in row two indicates the current tuple version. In some cases, a standardized
surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included
in an index, and so that null-value substitution is not required when querying.

Transactions that reference a particular surrogate key (Supplier_Key) are then permanently
bound to the time slices defined by that row of the slowly changing dimension table. An aggregate
table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier
was in at the time of the transaction; no update is needed. To reference the entity via the natural
key, it is necessary to remove the unique constraint on the natural key, which makes enforcement of
referential integrity by the DBMS impossible.

If there are retroactive changes made to the contents of the dimension, or if new attributes
are added to the dimension (for example a Sales_Rep column) which have different effective dates from
those already defined, then this can result in the existing transactions needing to be updated to reflect
the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice
if the dimensional model is subject to change.
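
A minimal Python sketch of the Type 2 approach with effective-date columns, again using an in-memory
list of rows as a stand-in for the dimension table; the previous version is closed one day before the
change takes effect, matching the example table above:

from datetime import date, timedelta

# SCD Type 2: each change adds a new row with its own surrogate key and dates.
rows = [
    {"Supplier_Key": 123, "Supplier_Code": "ABC", "Supplier_State": "CA",
     "Start_Date": date(2000, 1, 1), "End_Date": None},     # None marks the current version
]
next_key = 124   # next free surrogate key (illustrative)

def scd_type2_change(supplier_code, new_state, change_date):
    global next_key
    # Close the currently open version one day before the change takes effect.
    current = next(r for r in rows
                   if r["Supplier_Code"] == supplier_code and r["End_Date"] is None)
    current["End_Date"] = change_date - timedelta(days=1)
    # Append the new version; unlimited history is preserved.
    rows.append({"Supplier_Key": next_key, "Supplier_Code": supplier_code,
                 "Supplier_State": new_state, "Start_Date": change_date, "End_Date": None})
    next_key += 1

scd_type2_change("ABC", "IL", date(2004, 12, 22))
print([(r["Supplier_State"], r["Start_Date"], r["End_Date"]) for r in rows])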

4. Type 3: add new attribute

This method tracks changes using separate columns and preserves limited history. The Type
3 preserves limited history as it is limited to the number of columns designated for storing historical
data. The original table structure in Type 1 and Type 2 is the same but Type 3 adds additional columns.
In the following example, an additional column has been added to the table to record the supplier's
original state - only the previous history is stored.


Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       22-Dec-2004     IL
This record contains a column for the original state and a column for the current state; it cannot
track the changes if the supplier relocates a second time. One variation of this is to create the field
Previous_Supplier_State instead of Original_Supplier_State, which would track only the most recent
historical change.

5. Type 4: add history table


The Type 4 method is usually referred to as using "history tables", where one table keeps
the current data, and an additional table is used to keep a record of some or all changes. Both the
surrogate keys are referenced in the Fact table to enhance query performance.
For the above example, the original table name is Supplier and the history table is Supplier_History.

Supplier

Supplier_Key  Supplier_Code  Supplier_Name             Supplier_State
123           ABC            Acme & Johnson Supply Co  IL

Supplier_History

Supplier_Key  Supplier_Code  Supplier_Name             Supplier_State  Create_Date
123           ABC            Acme Supply Co            CA              14-June-2003
123           ABC            Acme & Johnson Supply Co  IL              22-Dec-2004

This method resembles how database audit tables and change data capture techniques function.

6.Type 6: hybrid
The Type 6 method combines the approaches of Types 1, 2 and 3 (1 + 2 + 3 = 6). The term is said to
have been coined by Ralph Kimball during a conversation with Stephen Pace from Kalido. Ralph Kimball
calls this method “Unpredictable Changes with Single-Version Overlay” in The Data Warehouse Toolkit.
The Supplier table starts out with one record for our example supplier:

Supplier_Key  Row_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Create_Date  End_Date     Current_Flag
123           1        ABC            Acme Supply Co  CA             CA                01-Jan-2000  31-Dec-2009  Y
The Current_State and the Historical_State are the same. The optional Current_Flag
attribute indicates that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2
processing, however a row key is included to ensure we have a unique key for each row:


Supplier_Key  Row_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Create_Date  End_Date     Current_Flag
123           1        ABC            Acme Supply Co  IL             CA                01-Jan-2000  21-Dec-2004  N
123           2        ABC            Acme Supply Co  IL             IL                22-Dec-2004  31-Dec-2009  Y
We overwrite the Current_Flag information in the first record (Row_Key = 1) with the new
information, as in Type 1 processing. We create a new record to track the changes, as in Type 2
processing. And we store the history in a second State column (Historical_State), which incorporates
Type 3 processing.
For example, if the supplier were to relocate again, we would add another record to the Supplier
dimension, and we would overwrite the contents of the Current_State column:

Supplier_Key  Row_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Create_Date  End_Date     Current_Flag
123           1        ABC            Acme Supply Co  NY             CA                01-Jan-2000  21-Dec-2004  N
123           2        ABC            Acme Supply Co  NY             IL                22-Dec-2004  03-Feb-2008  N
123           3        ABC            Acme Supply Co  NY             NY                04-Feb-2008  31-Dec-2009  Y
Note that, for the current record (Current_Flag = 'Y'), the Current_State and the
Historical_State are always the same.
Aggregate fact tables
Aggregate fact tables are special fact tables in a data warehouse that contain new metrics
derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc.) or from other
specialized functions that output totals derived from a grouping of the base data. These new metrics,
called “aggregate facts” or “summary statistics,” are stored and maintained in the data warehouse
database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions
are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries.
The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than
the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users,
start adding aggregates. You can even use this “trick” in your operational systems to serve as a
foundation for operational reports. I’ve always done this for any report referred to by my users as a
“Summary”.

For example, take the “Orders” business process from an online catalog company where
you might have customer orders in a fact table called FactOrders with dimensions Customer,
Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to
start thinking about aggregates.
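
A minimal Python sketch (with illustrative order rows, not real data) of how an aggregate fact table at
the product-by-month grain could be derived from the atomic FactOrders rows:

from collections import defaultdict

# Atomic fact rows at the order-line grain (values are illustrative).
fact_orders = [
    {"product": "P1", "order_date": "2024-01-05", "amount": 100.0},
    {"product": "P1", "order_date": "2024-01-20", "amount": 250.0},
    {"product": "P2", "order_date": "2024-02-02", "amount": 80.0},
]

# Aggregate fact table at the coarser product-by-month grain.
agg_orders_month = defaultdict(lambda: {"total_amount": 0.0, "order_count": 0})
for row in fact_orders:
    key = (row["product"], row["order_date"][:7])     # (product, year-month)
    agg_orders_month[key]["total_amount"] += row["amount"]
    agg_orders_month[key]["order_count"] += 1

print(agg_orders_month[("P1", "2024-01")])   # {'total_amount': 350.0, 'order_count': 2}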


Families of Stars
Most of the data warehouses contain multiple STAR schema structures. Each STAR serves a
specific purpose to track the measures stored in the fact table. When you have a collection of related
STAR schemas, you may call the collection a family of STARS. Families of STARS are formed for
various reasons. We may form a family by just adding aggregate fact tables and the derived dimension
tables to support the aggregates. We create a core fact table containing facts interesting to most
users and customized fact tables for specific user groups. Many factors lead to the existence of families
of STARS. The fact tables of the STARS in a family share dimension tables. Generally the time
dimension is shared by most of the fact tables in the group.

Figure 16: Families of Stars


UNIT-III:
Classification: Introduction-Issues in classification. Statistical Based Algorithms: Regression-Bayesian
Classification-Distance based algorithm-Simple approach-K nearest approach. Decision tree based
algorithms-ID3-C4.5 & C5.0-CART-Scalable DT Techniques
Neural network based algorithms-Propagation-NN Supervised Learning-Radial basis function networks-
Perceptrons-Rule based algorithms

Classification
Introduction
Classification is the most familiar and most popular data mining technique. Best examples of
classification applications include image and pattern recognition, medical diagnosis, loan approval,
detecting faults in industry applications, and classifying financial market trends. Estimation and
prediction may be viewed as types of classification.

When someone estimates your age or guesses the number of marbles in a jar, these are
actually classification problems. Prediction can be thought of as classifying an attribute
value into one of a set of possible classes. It is often viewed as forecasting a continuous value, while
classification forecasts a discrete value. Classification is frequently performed by simply applying
knowledge of the data. Consider the following example.

In a class of students, teachers classify students based on their grades as A, B, C, D, or F. By using
simple boundaries (60, 70, 80, 90), the following classification is possible:

90 ≤ grade          A
80 ≤ grade < 90     B
70 ≤ grade < 80     C
60 ≤ grade < 70     D
grade < 60          F
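
The grade boundaries above can be written directly as a small classification function; this is only an
illustrative sketch:

def classify_grade(grade):
    # Map a numeric grade to one of the predefined classes A, B, C, D, F.
    if grade >= 90:
        return "A"
    elif grade >= 80:
        return "B"
    elif grade >= 70:
        return "C"
    elif grade >= 60:
        return "D"
    return "F"

print([classify_grade(g) for g in (95, 83, 71, 64, 40)])   # ['A', 'B', 'C', 'D', 'F']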
All approaches to performing classification assume some knowledge of the data. Often a
training set is used to develop the specific parameters required by the technique. Training data
consist of sample input data as well as the classification assignment for the data. Domain experts
may also be used to assist in the process. Consider the following Definition of Classification.

Definition of Classification

Given a database D = {t1, t2, ..., tn} of tuples (items, records) and a set of classes C = {C1, C2, ..., Cm},
the classification problem is to define a mapping f : D → C where each ti is assigned to one class. A class,
Cj, contains precisely those tuples mapped to it; that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D}.

The above definition views classification as a mapping from the database to the set of classes.
It is also important to note that the classes are predefined, and they are non-overlapping, and partition
the entire database. Each tuple in the database is assigned to exactly one class. The classes that exist
for a classification problem are indeed equivalence classes. The above said classification problem is
implemented in two phases:


1. Create a specific model by evaluating the training data. This step has as input the training data
(including defined classification for each tuple) and as output a definition of the model
developed. The model created classifies the training data as accurately as possible.
2. Apply the model developed in step 1 by classifying tuples from the target database.

Of the two phases, the second phase actually performs the classification according to the classification
definition; however, most research has been applied to Phase 1, since Phase 2 is often straightforward.
There are three basic methods for solving the classification problem: specifying boundaries, using
probability distributions, and using posterior probabilities.

Figure 17: Classification Problem


Suppose we are given a database that consists of tuples of the form t = (x, y) where 0 ≤ x ≤ 8
and 0 ≤ y ≤ 10. Figure 17 illustrates the classification problem: Figure 17(a) shows the predefined
classes obtained by dividing the reference space, Figure 17(b) provides sample input data, and Figure 17(c)
shows the classification of the data based on the defined classes.

A major issue associated with classification is that of overfitting. If the classification strategy
fits the training data exactly, it may not be applicable to a broader population of data. For example,
suppose that the training data has erroneous or noisy data. Certainly in this case, fitting the data
exactly is not desired.

Issues in classification

1) Missing Data: Missing data values cause problems during both the training phase and the
classification process itself. Missing values in the training data must be handled and may
produce an inaccurate result. Missing data in a tuple to be classified must be able to be handled
by the resulting classification scheme. There are many approaches to handing missing data:
• Ignore the missing data.
• Assume a value for the missing data. This may be determined by using some method to
predict what the value could be.
• Assign a special value for the missing data. This means that the value of missing data is
taken to be a specific value all of its own.
2) Measuring Performance: Comparing the performance of two classification results produced by two different
classification tools is very hard. The performance of classification algorithms is usually examined by
evaluating the accuracy of the classification. However, since classification is often a complicated
problem, the correct answer may depend on the user. Traditional algorithm evaluation approaches
such as determining the space and time overhead can be used, but these approaches are usually
secondary.


Classification accuracy is usually calculated by determining the percentage of tuples placed in the
correct class.

The performance of classification can also be evaluated as with information retrieval systems. With only two
classes, there are four possible outcomes of the classification. In the following, the four possible outcomes
are defined based upon whether or not a person has a disease.

a) true positives (TP): These are cases in which we predicted yes (they have the disease),
and they do have the disease.
b) true negatives (TN): We predicted no, and they don't have the disease.
c) false positives (FP): We predicted yes, but they don't actually have the disease. (Also
known as a "Type I error.")
d) false negatives (FN): We predicted no, but they actually do have the disease. (Also known
as a "Type II error.")

An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve or ROC


(relative operating characteristic) curve shows the relationship between false positives and true
positives. An OC curve was originally used in the communications area to examine false alarm rates. It
has also been used in information retrieval to examine fallout (percentage of retrieved items that are not
relevant) versus recall (percentage of relevant items that are retrieved). In the OC curve the horizontal
axis has the percentage of false positives and the vertical axis has the percentage of true positives
for a database sample. At the beginning of evaluating a sample, there are none of either category,
while at the end there are 100 percent of each. When evaluating the results for a specific sample, the
curve looks like a jagged stair-step, as seen in Figure 18, as each new tuple is either a false positive
or a true positive. A more smoothed version of the OC curve can also be obtained.

Figure 18: Operating characteristic curve (OC)


A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m
classes, a confusion matrix is an m x m matrix where entry ci,j indicates the number of tuples from
D that were assigned to class Cj but where the correct class is Ci . Obviously, the best solutions will
have only zero values outside the diagonal. Figure 20 shows a confusion matrix for the height example
in Figure 19, where the Output1 assignment is assumed to be correct and the Output2 assignment is
what is actually made.


Figure 19: Data for Height Classification

Figure 20: Confusion Matrix


Statistical Based Algorithms

Regression
Regression problems deal with the estimation of an output value based on input values. Regression is
used for classification when the input values are values from the database D and the output values
represent the classes. Regression can be used to solve classification problems, and it can also be used
for other applications such as forecasting. Regression can be performed using many different types of
techniques, including neural networks. In practice, regression takes a set of data and fits the data to a
formula. In statistics, linear regression is a linear approach to model the relationship between a scalar
response (dependent variable) and one or more explanatory variables. If there is one explanatory
variable, it is called simple linear regression; if there is more than one explanatory variable, it is called
multiple linear regression.

Consider the dataset:
X: 1, 2, 4, 3, 5
Y: 1, 3, 3, 2, 5

From the above dataset we have a single input variable (X), and (Y) is the output variable. In order to
perform regression on the above dataset, we use simple linear regression because we have one input
variable, i.e., X. If we had multiple input attributes, i.e., X1, X2, ..., Xn, then we would need to use
multiple linear regression. Simple linear regression can be defined by using the following equation.
Y=B0+B1 * X
In the above equation, Y is the output variable that we want to predict, X is the input variable, and B0
and B1 are the coefficients that we need to estimate. B1 can be calculated by using the formula

B1 = Sum((Xi - Mean(X)) * (Yi - Mean(Y))) / Sum((Xi - Mean(X))^2)

where Mean() is the average value of the variable in the dataset; Xi and Yi indicate that the calculation
is repeated across all values in the dataset, and i refers to the ith value of X or Y.

The value of B0 can be calculated by using B1, for which we use the following formula.
B0=Mean(Y) –B1*Mean(X)
Consider the following dataset and calculate the linear regression coefficients.

X: 1, 2, 4, 3, 5
Y: 1, 3, 3, 2, 5

First we need to calculate the mean values for X and Y:

Mean(X) = 1/n * Sum(1, 2, 4, 3, 5) = 1/5 * 15 = 3
Mean(Y) = 1/n * Sum(1, 3, 3, 2, 5) = 1/5 * 14 = 2.8

where n is the number of values; in this case it is 5.


In order to calculate B1, first we need to calculate the numerator, i.e., Sum((Xi - Mean(X)) * (Yi - Mean(Y))).
For this we need to calculate the error of each value from its mean, i.e., Xi - Mean(X) and Yi - Mean(Y).

X   Y   Mean(X)  Mean(Y)  X-Mean(X)  Y-Mean(Y)
1   1   3        2.8      -2         -1.8
2   3   3        2.8      -1          0.2
4   3   3        2.8       1          0.2
3   2   3        2.8       0         -0.8
5   5   3        2.8       2          2.2

Now we need to multiply the error for each X with the error for each Y and calculate the
sum of these products.

X-Mean(X)  Y-Mean(Y)  (X-Mean(X))*(Y-Mean(Y))
-2         -1.8        3.6
-1          0.2       -0.2
 1          0.2        0.2
 0         -0.8        0
 2          2.2        4.4
                Sum =  8

Now we need to calculate the denominator of B1, i.e., Sum((Xi - Mean(X))^2). This is calculated as the
sum of the squared differences of each X value from the mean. We have already calculated the difference
of each X value from the mean, i.e., X - Mean(X); now we need to square each value and calculate the sum.

X-Mean(X)  (X-Mean(X))^2
-2          4
-1          1
 1          1
 0          0
 2          4
   Sum =   10

Now we can calculate B1 = 8 / 10 = 0.8

Now we can calculate B0 = Mean(Y) - B1 * Mean(X)
                        = 2.8 - 0.8 * 3
                        = 0.4

Hence B0 = 0.4


Making the predictions:


Now we have the coefficients for our simple linear regression equation, i.e., Y = B0 + B1 * X = 0.4 + 0.8 * X

X   Predicted Y
1   1.2
2   2.0
4   3.6
3   2.8
5   4.4

(Chart: Predicted Y plotted against X for the five data points)

Estimating Error: We can calculate the error of our predictions by using the Root Mean Squared Error (RMSE):

RMSE = Sqrt( Sum((pi - yi)^2) / n )

where Sqrt is the square root, pi is the predicted value, yi is the actual value, i is the index of a specific
instance, and n is the number of predictions.

Predicted Y   Y   Error   Squared Error
1.2           1    0.2    0.04
2.0           3   -1      1
3.6           3    0.6    0.36
2.8           2    0.8    0.64
4.4           5   -0.6    0.36
              Sum of Squared Error = 2.4

i.e., RMSE = Sqrt(2.4 / 5) = Sqrt(0.48) ≈ 0.692
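
The whole calculation can be reproduced with a short Python sketch; the values printed match the
hand-worked results above (B1 = 0.8, B0 = 0.4, RMSE ≈ 0.69):

from math import sqrt

X = [1, 2, 4, 3, 5]
Y = [1, 3, 3, 2, 5]

mean_x = sum(X) / len(X)                                   # 3.0
mean_y = sum(Y) / len(Y)                                   # 2.8

# B1 = Sum((Xi - Mean(X)) * (Yi - Mean(Y))) / Sum((Xi - Mean(X))^2)
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
      / sum((x - mean_x) ** 2 for x in X))                 # 8 / 10 = 0.8
b0 = mean_y - b1 * mean_x                                  # 2.8 - 0.8 * 3 = 0.4

predictions = [b0 + b1 * x for x in X]                     # [1.2, 2.0, 3.6, 2.8, 4.4]
rmse = sqrt(sum((p - y) ** 2 for p, y in zip(predictions, Y)) / len(Y))
print(round(b0, 2), round(b1, 2), round(rmse, 3))          # 0.4 0.8 0.693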

Bayesian Classification

Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the statistical
classifiers. Bayesian classifiers can predict class membership probabilities such as the probability that
a given tuple belongs to a particular class. To understand Naive Bayes classification we first need to
understand Bayes Theorem. Bayes Theorem works on the conditional probability. Conditional
probability is a probability of occurrence of something, given that something else has already occurred.
Naive Bayes is a kind of classifier which implements Bayes Theorem.
Naive Bayes predicts membership probabilities for each class, such as the probability that a given
record or data point belongs to a particular class. The class which has the highest probability is


considered the most likely class. This is also known as Maximum A Posteriori (MAP). Naive Bayes
classifier assumes that all the features are unrelated to each other, so the absence or presence of a
feature does not influence the presence or absence of another feature.

Formula-
P(H|E)=(P(E|H) * P(H))/P(E)

From the above equation where,


P(H) is the prior probability.
P(E) is the probability of the evidence(regardless of the hypothesis).
P(E|H) is the probability of the evidence given the hypothesis is true.
P(H|E) is the probability of the hypothesis given that the evidence is
there.

Example-
Let us assume we have data of 1000 fruits from which some are banana, orange, and some other
fruit, each fruit has classified using three characteristics: i.e., round, Sweet, Red

Training set:
Type Round Not Round Sweet Not Sweet Red Not Red Total
Apple 400 100 250 150 450 50 500
Banana 0 300 150 300 0 200 200
Other fruit 100 100 150 50 150 50 300
Total 500 500 550 350 900 100 1000

Step 1: finding the ‘prior’ probabilities for each class of fruits.

P(Apple) = 500 / 1000 = 0.50


P(Banana) = 200 / 1000 = 0.20
P(other fruit) = 300 / 1000 = 0.30

Step 2: finding the probability of the evidence:

P(round) = 500/1000 = 0.5

P(Sweet) = 550/1000 = 0.55

P(red) = 900/1000 = 0.9

Step 3: finding the probability (likelihood) of the evidence for each class:

P(round|Apple) = 400/500 = 0.8

P(Sweet|Apple) = 250/500 = 0.5

P(Red|Apple) = 450/500 = 0.9

P(round|Banana) = 0/200 = 0


P(round|Other Fruit) = 100/300 ≈ 0.33

P(Sweet|Other Fruit) = 150/300 = 0.5

P(Red|Other Fruit) = 150/300 = 0.5

P(Not red|Other Fruit) = 50/300 ≈ 0.17

Step 4: Putting the values in the equation:

P(Apple|Round, Sweet, and Red)

= P(Round|Apple) * P(Sweet|Apple) * P(Red|Apple) * P(Apple)/P(Round) * P(Sweet) * P(Red)

= 0.8 * 0.5 * 0.9 * 0.5 / P(evidence)

= 0.18 / P(evidence)

P(Banana|Round, Sweet and Red) = P(Round|Banana) * P(Sweet|Banana) * P(Red|Banana) * P(Banana) / P(evidence) = 0, since P(Round|Banana) = 0

P(Other Fruit|Round, Sweet and Red) = ( P(Round|Other fruit) * P(Sweet|Other fruit) *


P(Red|Other fruit) * P(Other Fruit) ) /P(evidence)

= (100/300 * 150/300 * 150/300 * 300/1000) / P(evidence)

=( 0.33*0.5 * 0.5 * 0.3) / P(evidence)

=(0.02475)/P(evidence)

Through this example we classify that this round, sweet and red colour fruit is likely to be an Apple.
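
A minimal Python sketch of the same computation, using the counts from the training-set table;
P(evidence) is a common factor for all classes and can be ignored when only the most likely class is needed:

counts = {   # class -> counts of round, sweet, red, and the class total
    "Apple":       {"round": 400, "sweet": 250, "red": 450, "total": 500},
    "Banana":      {"round": 0,   "sweet": 150, "red": 0,   "total": 200},
    "Other fruit": {"round": 100, "sweet": 150, "red": 150, "total": 300},
}
n = 1000   # total number of fruits

def score(cls):
    # P(round|cls) * P(sweet|cls) * P(red|cls) * P(cls), ignoring P(evidence).
    c = counts[cls]
    prior = c["total"] / n
    likelihood = (c["round"] / c["total"]) * (c["sweet"] / c["total"]) * (c["red"] / c["total"])
    return likelihood * prior

scores = {cls: score(cls) for cls in counts}
print(scores)                        # Apple ~= 0.18, Banana = 0.0, Other fruit ~= 0.025
print(max(scores, key=scores.get))   # 'Apple'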

DISTANCE-BASED ALGORITHMS
Each item that is mapped to the same class may be thought of as more similar to the other
items in that class than it is to the items found in other classes. Therefore, similarity (or distance)
measures may be used to identify the "alikeness" of different items in the database. The concept of a
similarity measure is well known to anyone who has performed Internet searches using a search engine.
In these cases, the set of Web pages represents the whole database, and these are
divided into two classes: those that answer your query and those that do not. Those that answer your
query should be more alike than those that do not answer your query. The similarity in this case is
defined by the query you state, usually a keyword list. Thus, the retrieved pages are similar because
they all contain (to some degree) the keyword list you have specified.

The idea of similarity measures can be abstracted and applied to more general classification
problems. The difficulty lies in how the similarity measures are defined and applied to the items in the
database. Since most similarity measures assume numeric (and often discrete) values, they might be
difficult to use for more general or abstract data types. A mapping from the attribute domain to a
subset of the integers may be used.


Using a similarity measure for classification where the classes are predefined is somewhat
simpler than using a similarity measure for clustering where the classes are not known in advance.
Consider the example of IR. Each IR query provides the class definition in the form of the IR query
itself. So the classification problem then becomes one of determining similarity not among all tuples
in the database but between each tuple and the query. This makes the problem an O(n) problem
rather than an O(n^2) problem.

Simple Approach
In general classification can be defined by using the IR approach, in this method we have a
representative of each class, we can perform classification by assigning each tuple to the class to which
it is most similar. We assume here that each tuple, ti , in the database is defined as a vector (ti1 , ti2,
. . . , tik) of numeric values. In the same way, we assume that each class Cj is defined by a tuple (Cj1,
Cj2, . . . , Cjk) of numeric values. The classification problem is then restated as follows.

Given a database D = {t1, t2, ..., tn} of tuples, where each tuple ti = (ti1, ti2, ..., tik) contains numeric
values, and a set of classes C = {C1, ..., Cm}, where each class Cj = (Cj1, Cj2, ..., Cjk) has numeric values,
the classification problem is to assign each ti to the class Cj such that sim(ti, Cj) ≥ sim(ti, Cl) for all
Cl ∈ C, where Cl ≠ Cj.

In order to calculate similarity measures, the representative vector for each class must be
determined. Consider the following classification diagram

The above figure refers to three classes; from this we can determine a representative for
each class by calculating the center of each region. Thus class A is represented by {4, 7.5}, class B
by {2, 2.5}, and class C by {6, 2.5}. A simple classification technique, then, would be to place each
item in the class where it is most similar (closest) to the center of that class. The representative for
the class may be found in other ways. For example, in pattern recognition problems, a predefined
pattern can be used to represent each class. Once a similarity measure is defined, each item to be
classified will be compared to each predefined pattern. The item will be placed in the class with the
largest similarity value.

The diagram illustrates the use of this simple approach to


perform classification using the data found in the above
Figure. The three large dark circles are the class
representatives for the three classes. The dashed lines show
the distance from each item to the closest center. The simple approach can be implemented by
computing the center of each class and assigning each tuple to the class whose center is closest,
as in the sketch below.
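
A minimal Python sketch of this nearest-center classification, using the class centers quoted above;
this is an illustration of the idea rather than the textbook's algorithm listing:

from math import dist   # Euclidean distance (Python 3.8+)

# Class representatives: the centers of the three regions quoted above.
centers = {"A": (4, 7.5), "B": (2, 2.5), "C": (6, 2.5)}

def classify(point):
    # Assign the point to the class whose representative center is closest.
    return min(centers, key=lambda c: dist(point, centers[c]))

print(classify((5, 8)))   # 'A'  (closest to the center (4, 7.5))
print(classify((6, 3)))   # 'C'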


K Nearest Neighbors
Another important classification method based on distance measures is “K nearest neighbors”
(KNN). In KNN method we assume that the entire training set includes not only the data in the set
but also the desired classification for each item. In effect, the training data become the model. When
a classification is to be made for a new item, its distance to each item in the training set must be
determined. Only the K closest entries in the training set are considered further. The new item is then
placed in the class that contains the most items from this set of K closest items. Consider the following
diagram of KNN.

From the following image of KNN, the points in the training set
are shown and K = 3. The three closest items in the training set are
shown; t will be placed in the class to which most of these are
members.

How does the KNN algorithm work?


Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC)
and
green squares (GS) :

From the figure we need to find out the class


of the blue star (BS). BS can either be RC or GS and
nothing else. The “K” in the KNN algorithm is the number of nearest
neighbors we wish to take a vote from. Let’s say K = 3.
Hence, we will now make a circle with BS as center
just as big as to enclose only three data points on the
plane.


The three closest points to BS are all RC. Hence, with


good confidence level we can say that the BS should
belong to the class RC. Here, the choice became
very obvious as all three votes from the closest
neighbor went to RC. The choice of the parameter K
is very crucial in this algorithm.

First let us try to understand what exactly “K” influences in the algorithm. If we look at the
last example, given that all 6 training observations remain constant, with a given K value we can
draw boundaries for each class. These boundaries will segregate RC from GS. The same way, let’s try
to see the effect of value “K” on the class boundaries. Following are the different boundaries separating
the two classes with different values of K.

If you watch carefully, you can see that the boundary becomes smoother with increasing value
of
K. With K increasing to infinity it finally becomes all blue or all red depending on the total majority.
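
A minimal KNN sketch in Python; the six training points are illustrative stand-ins for the red circles
and green squares in the figure:

from math import dist
from collections import Counter

# Labelled training points: illustrative red circles (RC) and green squares (GS).
training = [((1, 1), "RC"), ((2, 1), "RC"), ((1.5, 2), "RC"),
            ((5, 5), "GS"), ((6, 5), "GS"), ((5, 6), "GS")]

def knn_classify(point, k=3):
    # Sort the training items by distance to the query point and keep the K closest.
    neighbours = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    # Majority vote among the K nearest neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((2, 2), k=3))   # 'RC' - all three nearest points are red circles
print(knn_classify((5, 4), k=3))   # 'GS'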


Decision Tree Based Algorithm


Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike other
supervised learning algorithms, decision tree algorithm can be used for solving regression and
classification problems too. The general motive of using Decision Tree is to create a training model
which can be used to predict class or value of target variables by learning decision rules inferred
from prior data (training data).

Decision trees are easy to understand compared with other
classification algorithms. The decision tree algorithm tries to solve the problem by using a tree
representation. Each internal node of the tree corresponds to an attribute, and each leaf node
corresponds to a class label.

Decision Tree Algorithm Pseudocode


1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each
subset contains data with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the
tree.

In decision trees, for predicting a class label for a record we start from the root of the tree. We
compare the values of the root attribute with record’s attribute. On the basis of comparison, we follow
the branch corresponding to that value and jump to the next node. We continue comparing our
record’s attribute values with other internal nodes of the tree until we reach a leaf node with predicted
class value. Now that we know how the modeled decision tree can be used to predict the target class
or value, let us understand how we can create the decision tree model.


Decision trees follow a Sum of Products (SOP) representation. In the above images, you can see how we
can answer questions such as “Can we accept the new job offer?” and “Do we use a computer daily?” by
traversing from the root node to a leaf node. This is a sum of products representation; the Sum of
Products (SOP) form is also known as Disjunctive Normal Form. For a class, every branch from the root
of the tree to a leaf node having the same class is a conjunction (product) of values, and different
branches ending in that class form a disjunction (sum).

The primary challenge in building a decision tree is to identify which attribute to consider as the root
node and which attributes to use at each level. This can be done by using the concept of attribute
selection. We have different attribute selection measures to identify the attribute which can be
considered as the root node at each level.

Attribute Selection: If a dataset consists of “n” attributes, then deciding which attribute to place
at the root or at different levels of the tree as internal nodes is a complicated step. Just randomly
selecting any node to be the root does not solve the issue; if we follow a random approach, it may give
us bad results with low accuracy. For solving this attribute selection problem, researchers worked and
devised some solutions. They suggested using measures such as information gain and the gini index.
These measures calculate a value for every attribute. The values are sorted, and attributes are
placed in the tree by following that order, i.e., the attribute with the highest value (in the case of
information gain) is placed at the root. When using information gain as the criterion, we assume
attributes to be categorical, and for the gini index, attributes are assumed to be continuous. A small
sketch of the information gain calculation is given below.
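
A small Python sketch of the entropy and information-gain calculation on a tiny made-up dataset
(the attribute and class names are hypothetical):

from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Entropy of the target minus the weighted entropy after splitting on the attribute.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

data = [{"outlook": "sunny", "play": "no"}, {"outlook": "sunny", "play": "no"},
        {"outlook": "rain", "play": "yes"}, {"outlook": "rain", "play": "yes"}]
print(information_gain(data, "outlook", "play"))   # 1.0: the split separates the classes perfectly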

Decision Tree Algorithm Advantages and Disadvantages


Advantages Disadvantages
1. Decision Trees are easy to explain. It 1. There is a high probability of overfitting
results in a set of rules. in Decision Tree.
2. It follows the same approach as 2. Generally, it gives low prediction accuracy
humans generally follow while making for a dataset as compared to other machine
decisions. learning algorithms.
3. Interpretation of a complex Decision Tree 3. Information gain in a decision tree with
model can be simplified by its categorical variables gives a biased
visualizations. Even a naive person can response for attributes with greater no. of
understand logic. categories.
4. The Number of hyper-parameters to be 4. Calculations can become complex when
tuned is almost null. there are many class labels.


Decision Tree Algorithms

1) ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates
a multiway tree, finding for each node (i. e. in a greedy manner) the categorical feature that will yield
the largest information gain for categorical targets. Trees are grown to their maximum size and
then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.

2) C4.5 is the successor to ID3 and removed the restriction that features must be categorical by
dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous
attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the
ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the
order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy
of the rule improves without it.

3. CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it
supports numerical target variables (regression) and does not compute rule sets. CART
constructs binary trees using the feature and threshold that yields the largest information gain
at each node.


Neural Networks
Neural networks are parallel computing devices, which are basically an attempt to make a
computer model of the brain. The main objective is to develop a system to perform various
computational tasks faster than the traditional systems. These tasks include pattern recognition and
classification, approximation, optimization, and data clustering.

Artificial Neural Network: Artificial Neural Network (ANN) is an efficient computing system whose
central theme is borrowed from the analogy of biological neural networks. ANNs are also named as
“artificial neural systems,” or “parallel distributed processing systems,” or “connectionist systems.”
ANN acquires a large collection of units that are interconnected in some pattern to allow
communication between the units. These units, also referred to as nodes or neurons, are simple
processors which operate in parallel.

Every neuron is connected with other neuron through a connection link. Each connection link
is associated with a weight that has information about the input signal. This is the most useful
information for neurons to solve a particular problem because the weight usually excites or inhibits
the signal that is being communicated. Each neuron has an internal state, which is called an activation
signal. Output signals, which are produced after combining the input signals and activation rule, may
be sent to other units.

Biological Neuron: A nerve cell (neuron) is a special biological cell that processes information in the
human brain. According to estimates by scientists, there is a huge number of neurons in the human
brain, approximately 10^11, with numerous interconnections, approximately 10^15. The schematic
diagram of the working of a biological neuron in the human brain is shown below.


From the above diagram, a typical neuron consists of the following four parts with the help of
which we can explain its working –

• Dendrites − They are tree-like branches, responsible for receiving the information from other
neurons it is connected to. In other sense, we can say that they are like the ears of neuron.
• Soma − It is the cell body of the neuron and is responsible for processing of information, they
have received from dendrites.
• Axon − It is just like a cable through which neurons send the information.
• Synapses − It is the connection between the axon and other neuron dendrites.
ANN versus BNN: The following are the differences between Artificial Neural Network (ANN) and
Biological Neural Network (BNN), let us take a look at the similarities based on the terminology
between these two.

Biological Neural Network Artificial Neural Network


(BNN) (ANN)
Soma Node
Dendrites Input
Synapse Weights or Interconnections
Axon Output

The following table shows the comparison between ANN and BNN based on criteria mentioned.

Criteria      BNN                                ANN
Processing    Massively parallel; slow but       Massively parallel; fast but
              superior to ANN                    inferior to BNN
Size          10^11 neurons and 10^15            10^2 to 10^4 nodes (mainly depends on the
              interconnections                   type of application and network designer)
Learning      They can tolerate ambiguity        Very precise, structured and formatted data
                                                 is required to tolerate ambiguity


Fault         Performance degrades with          It is capable of robust performance,
tolerance     even partial damage                hence has the potential to be fault tolerant
Storage       Stores the information in          Stores the information in continuous
capacity      the synapse                        memory locations

Model of Artificial Neural Network : The following diagram represents the general model of
ANN followed by its processing.

For the above general model of an artificial neural network, the net input can be calculated as follows:

yin = x1*w1 + x2*w2 + x3*w3 + ... + xm*wm

The output can be calculated by applying the activation function over the net input:

Y = F(yin)

Output = function (net input calculated)
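
A minimal Python sketch of this computation for a single neuron, using a simple step function as the
activation F (the weights, inputs, and threshold are illustrative):

def step(yin, threshold=0.5):
    # A simple binary activation function F.
    return 1 if yin >= threshold else 0

def neuron_output(inputs, weights):
    yin = sum(x * w for x, w in zip(inputs, weights))   # yin = x1*w1 + ... + xm*wm
    return step(yin)                                    # Y = F(yin)

print(neuron_output([1, 0, 1], [0.4, 0.9, 0.2]))   # yin = 0.6 -> output 1
print(neuron_output([0, 0, 1], [0.4, 0.9, 0.2]))   # yin = 0.2 -> output 0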

Processing of an ANN depends upon the following three building blocks:

• Network Topology
• Adjustments of Weights or Learning
• Activation Functions

A network topology is the arrangement of a network along with its nodes and connecting lines.
According to the topology, ANN can be classified as the following kinds –

Feedforward Network : It is a non-recurrent network having processing units/nodes in layers and all
the nodes in a layer are connected with the nodes of the previous layers. The connection has different
weights upon them. There is no feedback loop means the signal can only flow in one direction, from
input to output. It may be divided into the following two types i.e Single layer feedforward network and
Multilayer feedforward network.

a) Single layer feedforward network − The concept is of feedforward ANN having only one weighted
layer. In other words, we can say the input layer is fully connected to the output layer.


b) Multilayer feedforward network − The concept is of a feedforward ANN having more than one
weighted layer. As this network has one or more layers between the input and the output
layer, these intermediate layers are called hidden layers.

Feedback Network: As the name suggests, a feedback network has feedback paths, which means the
signal can flow in both directions using loops. This makes it a non-linear dynamic system, which
changes continuously until it reaches a state of equilibrium. It may be divided into the following types
i.e Recurrent networks, Fully recurrent network and Jordan Network.

a) Recurrent networks − They are feedback networks with closed loops. Following are the two types
of recurrent networks.
b) Fully recurrent network − It is the simplest neural network architecture because all nodes are
connected to all other nodes and each node works as both input and output.

c) Jordan network − It is a closed loop network in which the output will go to the input again as
feedback as shown in the following diagram.

Adjustments of Weights or Learning


Learning, in artificial neural network, is the method of modifying the weights of connections
between the neurons of a specified network. Learning in ANN can be classified into three categories
namely a) supervised learning, b) unsupervised learning, and c) reinforcement learning.


a) Supervised Learning: As the name suggests, this type of learning is done under the supervision
of a teacher. This learning process is dependent. During the training of ANN under supervised learning,
the input vector is presented to the network, which will give an output vector. This output vector is
compared with the desired output vector. An error signal is generated, if there is a difference between
the actual output and the desired output vector. On the basis of this error signal, the weights are
adjusted until the actual output is matched with the desired output.
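
A hedged Python sketch of this error-driven weight adjustment, using a perceptron-style update rule as
one possible supervised scheme (the learning rate, threshold, and training pair are illustrative):

def train_step(weights, inputs, desired, rate=0.1):
    # Compute the actual output, derive the error signal, and adjust the weights.
    actual = 1 if sum(x * w for x, w in zip(inputs, weights)) >= 0.5 else 0
    error = desired - actual
    return [w + rate * error * x for w, x in zip(weights, inputs)]

weights = [0.2, 0.1]
for _ in range(20):                     # repeat until the actual output matches the desired one
    weights = train_step(weights, inputs=[1, 1], desired=1)
print(weights)                          # approximately [0.3, 0.2]: the adjusted weights now give output 1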

b) Unsupervised Learning: As the name suggests, this type of learning is done without the
supervision of a teacher. This learning process is independent. During the training of ANN under
unsupervised learning, the input vectors of similar type are combined to form clusters. When a new
input pattern is applied, then the neural network gives an output response indicating the class to which
the input pattern belongs.

There is no feedback from the environment as to what the desired output should be or whether it is
correct or incorrect. Hence, in this type of learning, the network itself must discover the patterns and
features from the input data, and the relationship between the input data and the output.

c) Reinforcement Learning: As the name suggests, this type of learning is used to reinforce or
strengthen the network based on some critic information. This learning process is similar to supervised
learning; however, we may have much less information.

During the training of network under reinforcement learning, the network receives some
feedback from the environment. This makes it somewhat similar to supervised learning. However, the
feedback obtained here is evaluative not instructive, which means there is no teacher as in supervised
learning. After receiving the feedback, the network performs adjustments of the weights to get better
critic information in future.

Neural networks versus conventional computers


Neural networks take a different approach to problem solving than that of conventional
computers. Conventional computers use an algorithmic approach i.e. the computer follows a set of
instructions in order to solve a problem. Unless the specific steps that the computer needs to follow
are known the computer cannot solve the problem. That restricts the problem solving capability of


conventional computers to problems that we already understand and know how to solve. But
computers would be so much more useful if they could do things that we don't exactly know how to
do.

Neural networks process information in a similar way the human brain does. The network is
composed of a large number of highly interconnected processing elements(neurones) working in
parallel to solve a specific problem. Neural networks learn by example. They cannot be programmed
to perform a specific task. The examples must be selected carefully otherwise useful time is wasted
or even worse the network might be functioning incorrectly. The disadvantage is that because the
network finds out how to solve the problem by itself, its operation can be unpredictable.

On the other hand, conventional computers use a cognitive approach to problem solving;
the way the problem is to be solved must be known and stated in small, unambiguous instructions.
These instructions are then converted to a high level language program and then into machine
code that the computer can understand. These machines are totally predictable; if anything goes
wrong is due to a software or hardware fault.

Neural networks and conventional algorithmic computers are not in competition but
complement each other. There are tasks that are more suited to an algorithmic approach, like arithmetic
operations, and tasks that are more suited to neural networks. Moreover, a large number of tasks
require systems that use a combination of the two approaches (normally a conventional computer is
used to supervise the neural network) in order to perform at maximum efficiency.


UNIT – IV:
Clustering: Introduction-Similarity & distance measures-Outliers-Hierarchical algorithms-Agglomerative
algorithms-Divisive clustering-Partitional algorithms-Minimum spanning tree-Squared error clustering
algorithm-K-means clustering-Nearest neighbor algorithm-PAM algorithm-Bond energy algorithm-
Clustering with Genetic algorithms-Clustering with neural networks-Clustering large databases-BIRCH-
DBSCAN-CURE algorithm-Clustering with categorical attributes.

Clustering
Clustering is similar to classification in which data are grouped. However, unlike classification,
the groups are not predefined. Instead, the grouping is accomplished by finding similarities between
data according to characteristics found in the actual data. The groups are called clusters. Some authors
view clustering as a special type of classification. Many definitions for clusters have been proposed:

• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the cluster
and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database
are grouped together. This is done to partition or segment the database into components that then
give the user a more general view of the data. A simple example of clustering is as follows. This
example illustrates the fact that determining how to do the clustering is not straightforward.

Example
An international online catalog company wishes to group its customers based on common
features. Company management does not have any predefined labels for these groups. Based on the
outcome of the grouping, they will target marketing and advertising campaigns to the different groups.
The information they have about the customers includes income, age, number of children, marital
status, and education. The following Table shows some tuples from this database for customers in the
United States. Depending on the type of advertising not all attributes are important. For example,
suppose the advertising is for a special sale on children’s clothes. We could target the advertising only
to the persons with children. One possible clustering is that shown by the divisions of the table. The
first group of people have young children and a high school degree, while the second group is similar
but have no children. The third group has both children and a college degree. The last two groups have
higher incomes and at least a college degree. The very last group has children. Different clusterings
would have been found by examining age or marital status.


The above image shows that a given set of data may be clustered on different attributes. In (a), a
group of homes in a geographic area is shown. In (b), the clustering is based on the location of the home:
homes that are geographically close to each other are clustered together. In (c), the homes are
grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine,
anthropology, marketing, and economics. Clustering applications include plant and animal
classification, disease classification, image processing, pattern recognition, and document retrieval.
One of the first domains in which clustering was used was biological taxonomy. Recent uses include
examining Web log data to detect usage patterns. When clustering is applied to a real-world database,
many interesting problems occur:
• Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can
be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger
clusters, these outliers will be forced to be placed in some cluster. This process may result in
the creation of poor clusters by combining two existing clusters and leaving the outlier in its
own cluster.
• Dynamic data in the database implies that cluster membership may change over time.
• Interpreting the semantic meaning of each cluster may be difficult. With classification, the
labeling of the classes is known ahead of time. However, with clustering, this may not be the
case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning
of each cluster may not be obvious. Here is where a domain expert is needed to assign a label
or interpretation for each cluster.
• There is no one correct answer to a clustering problem. In fact, many answers may be found.
The exact number of clusters required is not easy to determine. Again, a domain expert may
be required. For example, suppose we have a set of data about plants that have been collected
during a field trip. Without any prior knowledge of plant classification, if we attempt to divide
this set of data into similar groupings, it would not be clear how many groups should be
created.
• Another related issue is what data should be used for clustering. Unlike learning during a
classification process, where there is some a priori knowledge concerning what the attributes
of each classification should be, in clustering we have no supervised learning to aid the process.
Indeed, clustering can be viewed as similar to unsupervised learning.

A classification of the different types of clustering algorithms is shown in the Figure.


Clustering algorithms themselves may be viewed as hierarchical or partitional.


Hierarchical: This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There
are two approaches:
Agglomerative Approach: This approach is also known as the bottom-up approach. Here we
start with each object forming a separate group. It keeps on merging the objects or groups
that are close to one another, and it keeps doing so until all of the groups are merged into one
or until the termination condition holds.

Divisive Approach: This approach is also known as the top-down approach. Here we start
with all of the objects in the same cluster. In each successive iteration, a cluster is split up into
smaller clusters. This is done until each object is in its own cluster or the termination condition holds.
These hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.

Partitional: Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. That is, it classifies the data
into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method creates an initial partitioning.
It then uses an iterative relocation technique to improve the partitioning by moving objects from one
group to another.

Similarity & distance measures


The similarity measure is a measure of how alike two data objects are. In a data mining
context, a similarity measure is usually expressed through a distance whose dimensions represent features
of the objects. If this distance is small, there is a high degree of similarity; if the distance is large,
there is a low degree of similarity. Similarity is subjective and highly dependent on the domain and application.
For example, two fruits may be considered similar because of color, size, or taste. Care should be taken when
calculating distance across dimensions/features that are unrelated. The relative values of each
feature must be normalized, or one feature could end up dominating the distance calculation.
Similarity is measured in the range 0 to 1, i.e., [0, 1].

Two main considerations about similarity (where X and Y are two objects):

• Similarity = 1 if X = Y

• Similarity = 0 if X ≠ Y

Popular Similarity Distance Measures

1) Euclidean distance: This is the most common distance measure; in most cases, when people speak
simply of “distance” they mean Euclidean distance. When the data is dense or continuous, it is the best
proximity measure. The Euclidean distance between two points is the length of the straight-line path
connecting them, and it is given by the Pythagorean Theorem: in a plane with p1 at (x1, y1) and p2 at
(x2, y2), Euclidean distance = √((x1 – x2)² + (y1 – y2)²).

2) Manhattan distance: This is a metric in which the distance between two points is the sum of the absolute
differences of their Cartesian coordinates. Put simply, it is the total of the differences between the
x-coordinates and the y-coordinates. Suppose we have two points A and B; to find the Manhattan distance
between them, we just sum the absolute variation along the x-axis and the y-axis, i.e., how much the two
points A and B differ in X and in Y. More formally, the Manhattan distance between two points is measured
along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2),

Manhattan distance = |x1 – x2| + |y1 – y2|

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1
distance or L1 norm, Minkowski’s L1 distance, taxi-cab metric, or city block distance.
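
As a quick illustration of the two measures, here is a minimal Python sketch; the points p1 and p2 are invented for the example and are not taken from the notes.

# Illustrative sketch: Euclidean vs. Manhattan distance for two hypothetical points.
import math

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p, q):
    # Sum of the absolute coordinate differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p1 = (1, 2)
p2 = (4, 6)
print(euclidean_distance(p1, p2))   # 5.0 -> sqrt((1-4)^2 + (2-6)^2)
print(manhattan_distance(p1, p2))   # 7   -> |1-4| + |2-6|

Note how the Manhattan distance is never smaller than the Euclidean distance for the same pair of points, since it measures the path along the axes rather than the straight line.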

OUTLIERS
Outliers are sample points with values much different from those of the remaining set of data.
Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data
value) or could be correct data values that are simply much different from the remaining data. A person
who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value
probably would be viewed as an outlier. Some clustering techniques do not perform well with the
presence of outliers. Consider the following example.

From the above image, if three clusters are found (solid line), the outlier will occur in a cluster
by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data
will be placed in one cluster because they are closer together than the outlier. This problem is
complicated by the fact that many clustering algorithms actually have as input the number of desired
clusters to be found.

Clustering algorithms may actually find and remove outliers to ensure that they perform better.
However, care must be taken in actually removing outliers. For example, suppose that the data mining
problem is to predict flooding. Extremely high water level values occur very infrequently, and when
compared with the normal water level values may seem to be outliers. However, removing these
values may not allow the data mining algorithms to work effectively because there would be no data
that showed that floods ever actually occurred.

Outlier detection, or outlier mining, is the process of identifying outliers in a set of data.
Clustering, or other data mining, algorithms may then choose to remove or treat these values
differently. Some outlier detection techniques are based on statistical techniques. These usually
assume that the set of data follows a known distribution and that outliers can be detected by well-
known tests such as discordancy tests. However, these tests are not very realistic for real-world data
because real-world data values may not follow well-defined data distributions. Also, most of these
tests assume a single attribute value, while many attributes are involved in real-world datasets.
Alternative detection techniques may be based on distance measures.

HIERARCHICAL ALGORITHMS
Hierarchical clustering algorithms actually create sets of clusters. The following example illustrates
the concept.

The above figure shows six elements, {A, B, C, D, E, F}, to be clustered. Parts (a) to (e) of
the figure show five different sets of clusters. In part (a), each cluster is viewed to consist of a single element.
Hierarchical algorithms differ in how the sets are created. A tree data structure, called a dendrogram,
can be used to illustrate the hierarchical clustering technique and the sets of different clusters.

From the above figure, the root in a dendrogram tree contains one cluster where all elements
are together. The leaves in the dendrogram each consist of a single element cluster. Internal nodes in
the dendrogram represent new clusters formed by merging the clusters that appear as its children in
the tree. Each level in the tree is associated with the distance measure that was used to merge the
clusters. All clusters created at a particular level were combined because the children clusters had a
distance between them less than the distance value associated with this level in the tree.

The space complexity for hierarchical algorithms is O(n²) because this is the space required for
the adjacency matrix. The space required for the dendrogram is O(kn), which is much less than O(n²).
The time complexity for hierarchical algorithms is O(kn²) because there is one iteration for each
level in the dendrogram.

Hierarchical techniques are well suited for many clustering applications that naturally exhibit a
nesting relationship between clusters. For example, in biology, plant and animal taxonomies could
easily be viewed as a hierarchy of clusters.

Agglomerative Hierarchical Clustering Technique: In this technique, each data point is initially
considered as an individual cluster. At each iteration, similar clusters merge with other clusters until
one cluster or K clusters are formed. The basic agglomerative algorithm is straightforward.

• Compute the proximity matrix


• Let each data point be a cluster
• Repeat: Merge the two closest clusters and update the proximity matrix
• Until only a single cluster remains

The key operation is the computation of the proximity of two clusters. To understand this better, let's look at
a pictorial representation of the Agglomerative Hierarchical Clustering Technique.

Agglomerative Hierarchical Clustering Technique

From the above figure we have six data points {A,B,C,D,E,F}.


• Step- 1: In the initial step, we calculate the proximity of individual points and
consider all the six data points as individual clusters as shown in the image above.
• Step- 2: In step two, similar clusters are merged together to form a single
cluster. Suppose B, C and D, E are similar clusters that are merged in step two.
Now, we are left with four clusters, which are A, BC, DE, and F.


• Step- 3: We again calculate the proximity of new clusters and merge the similar
clusters to form new clusters A, BC, DEF.
• Step- 4: Calculate the proximity of the new clusters. The clusters DEF and BC
are similar and merged together to form a new cluster. We’re now left with two
clusters A, BCDEF.
• Step- 5: Finally, all the clusters are merged together and form a single cluster.

The Hierarchical clustering Technique can be visualized using a dendrogram. A dendrogram is a
tree-like diagram that records the sequences of merges or splits.
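
As an illustration, the following minimal sketch uses SciPy (assumed to be available) to perform single-link agglomerative clustering on six invented 2-D points and to draw the corresponding dendrogram; the data and the choice of at most three clusters are purely illustrative.

# Sketch: single-link agglomerative clustering and its dendrogram with SciPy.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

points = [[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9], [9.5, 9]]  # hypothetical data

# 'single' link merges the two clusters with the closest pair of points.
Z = linkage(points, method='single')

# Cut the hierarchy so that at most 3 clusters remain.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)            # e.g. [1 1 2 2 3 3]

dendrogram(Z)            # tree-like record of the sequence of merges
plt.show()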

PARTITIONAL ALGORITHMS

Nonhierarchical or partitional clustering creates the clusters in one step as opposed to several
steps. Only one set of clusters is created, although several different sets of clusters may be created
internally within the various algorithms. Since only one set of clusters is output, the user must input
the desired number, k, of clusters. In addition, some metric or criterion function is used to determine
the goodness of any proposed solution. This measure of quality could be the average distance between
clusters or some other metric. The solution with the best value for the criterion function is the clustering
solution used. One common measure is a squared error metric, which measures the squared distance
from each point to the centroid of its associated cluster:

se = Σ_{j=1..k} Σ_{t_i ∈ K_j} || t_i − C_j ||²

The problem with partitional algorithms is that they suffer from a combinatorial explosion due
to the number of possible solutions. Clearly, searching all possible clustering alternatives usually would
not be feasible. For example, given a measurement criterion, a naive approach could look at all possible
sets of k clusters. There are S(n, k) possible combinations to examine, where S(n, k) is the Stirling
number of the second kind (the number of ways to partition n items into k nonempty subsets):

S(n, k) = (1/k!) Σ_{j=0..k} (−1)^j C(k, j) (k − j)^n

For example, there are 11,259,666,000 different ways to cluster 19 items into 4 clusters. Thus, most
algorithms look only at a small subset of all possible clusterings, using some strategy to identify sensible
clusters.

The following are the most common partitional algorithms:


1. Minimum Spanning Tree
2. Squared Error Clustering Algorithm
3. K-Means Clustering
4. Nearest Neighbor Algorithm
5. PAM Algorithm
6. Bond Energy Algorithm
7. Clustering with Genetic Algorithms
8. Clustering with Neural Networks


1. Minimum Spanning Tree


This is a very simplistic approach, but it illustrates how partitional algorithms work. The following
algorithm illustrates how the clustering problem is viewed as defining a mapping; the output of this algorithm
shows the clusters as a set of ordered pairs (t_i, j) where f(t_i) = K_j.

The problem is how to define "inconsistent." It could be defined as in the earlier divisive MST
algorithm, based on distance. This would remove the largest k − 1 edges from the starting, completely
connected graph and yield the same results as the corresponding level in the dendrogram. Zahn
proposes more reasonable inconsistency measures based on the weight (distance) of an edge as
compared to those close to it. For example, an inconsistent edge would be one whose weight is much
larger than the average of the adjacent edges.

The time complexity of this algorithm is again dominated by the MST procedure, which is
O(n²). At most, k − 1 edges will be removed, so the last three steps of the algorithm, assuming each
step takes a constant time, are only O(k − 1). Although determining the inconsistent edges in M may be
quite complicated, it will not require a time greater than the number of edges in M. When looking at
edges adjacent to one edge, there are at most k − 2 of these edges. In this case, then, the last three
steps are O(k²), and the total algorithm is still O(n²).
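
The following minimal sketch illustrates the idea under a simplifying assumption: "inconsistent" edges are simply taken to be the k − 1 longest MST edges. The sample points, and the use of SciPy's MST and connected-components routines, are illustrative only.

# Sketch: MST-based partitional clustering (remove the k-1 longest MST edges).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

points = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [20, 20]])  # hypothetical data
k = 3                                      # desired number of clusters

dist = squareform(pdist(points))           # full pairwise distance (adjacency) matrix
mst = minimum_spanning_tree(csr_matrix(dist)).toarray()

# Treat the k-1 largest MST edges as the "inconsistent" ones and remove them.
edges = np.argwhere(mst > 0)
weights = mst[mst > 0]
for idx in np.argsort(weights)[-(k - 1):]:
    i, j = edges[idx]
    mst[i, j] = 0

# Each connected component of the remaining forest is one cluster.
n_clusters, labels = connected_components(csr_matrix(mst), directed=False)
print(n_clusters, labels)                  # 3 clusters, e.g. [0 0 0 1 1 2]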

2. Squared Error Clustering Algorithm


The squared error clustering algorithm minimizes the squared error. The squared error for a
cluster is the sum of the squared Euclidean distances between each element in the cluster and the
cluster centroid, C_i. Given a cluster K_i, let the set of items mapped to that cluster be {t_i1, t_i2, ...,
t_im}. The squared error is defined as

se(K_i) = Σ_{j=1..m} || t_ij − C_i ||²

Given a set of clusters K = {K_1, K_2, ..., K_k}, the squared error for K is defined as

se(K) = Σ_{i=1..k} se(K_i)

For each iteration in the squared error algorithm, each tuple is assigned to the cluster with the
closest center. Since there are k clusters and n items, this is an O(kn) operation. Assuming t
iterations, this becomes an O(tkn) algorithm.
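
As a small illustration of the definitions above, the following sketch computes the squared error for a given set of clusters; the two clusters K1 and K2 are made-up examples.

# Sketch: squared error of a clustering = sum, over all clusters, of the
# squared Euclidean distances from each member to its cluster centroid.
import numpy as np

def squared_error(clusters):
    total = 0.0
    for items in clusters:
        items = np.asarray(items, dtype=float)
        centroid = items.mean(axis=0)                  # C_i
        total += ((items - centroid) ** 2).sum()       # sum of squared distances
    return total

K1 = [[1, 1], [2, 1], [1, 2]]      # hypothetical cluster 1
K2 = [[8, 8], [9, 9]]              # hypothetical cluster 2
print(squared_error([K1, K2]))     # smaller values indicate tighter clusters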

3. K-Means Clustering
K- means is an iterative clustering algorithm in which items are moved among sets of clusters
until the desired set is reached. As such, it may be viewed as a type of squared error algorithm,
although the convergence criteria need not be defined based on the squared error. A high degree of
similarity among elements in clusters is obtained, while a high degree of dissimilarity among elements
in different clusters is achieved simultaneously.
The cluster mean of K_i = {t_i1, t_i2, ..., t_im} is defined as

m_i = (1/m) Σ_{j=1..m} t_ij
This definition assumes that each tuple has only one numeric value as opposed to a tuple with
many attribute values. This algorithm assumes that the desired number of clusters, k, is an input
parameter. The following Algorithm shows the K-means algorithm.

Example: Suppose that we are given the following items to cluster: {2, 4, 10, 12, 3,
20, 30, 11, 25} and suppose that k = 2. We initially assign the means to the first two values:
m1 = 2 and m2 = 4. Using Euclidean distance, we find that initially K1 = {2, 3} and K2 = {4, 10,
12, 20, 30, 11, 25}. The value 3 is equally close to both means, so we arbitrarily choose K1.
Any desired assignment could be used in the case of ties. We then recalculate the means to get
m1 = 2.5 and m2 = 16. We again make assignments to clusters to get K1 = {2, 3, 4} and K2
= {10, 12, 20, 30, 11, 25}. Continuing in this fashion, we obtain the following:

From the above table, the clusters in the last two steps are identical. This yields
identical means, and thus the means have converged. Hence our solution to this problem is
K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 30, 25}.

The time complexity of K-means is O(tkn) where t is the number of iterations. K-means finds
a local optimum and may actually miss the global optimum. K-Means does not work on categorical data
because the mean must be defined on the attribute type. Only convex-shaped clusters are found. It
also does not handle outliers well. One variation of K-means, K-modes, does handle categorical data.
Instead of using means, it uses modes. Although the K-means algorithm often produces good results,
it is not time-efficient and does not scale well.
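
For illustration, the worked example above can be reproduced with scikit-learn's KMeans (assuming the library is installed); note that scikit-learn chooses its initial means differently from the hand-worked example, but on this data it converges to the same two clusters.

# Sketch: K-means on the 1-D example above, using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each item
print(km.cluster_centers_)  # converged means, roughly 7 and 25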

4. Nearest Neighbor Algorithm


The nearest neighbor algorithm uses the single link technique. With this serial algorithm, items are
iteratively merged into the existing clusters that are closest. In this algorithm a threshold, t, is used
to determine if items will be added to existing clusters or if a new cluster is created.

Example: From the above table, A is initially placed in a cluster by itself, so we have
K1 = {A}. We then look at B to decide if it should be added to K1 or be placed in a new cluster.
Since dis(A, B) = 1, which is less than the threshold of 2, we place B in K1 to get K1 = {A,
B}. When looking at C, we see that its distance to both A and B is 2, so we add it to the cluster
to get K1 = {A, B, C}. Since dis(D, C) = 1 < 2, we get K1 = {A, B, C, D}. Finally, looking
at E, we see that the closest item in K1 has a distance of 3, which is greater than 2, so we
place it in its own cluster: K2 = {E}.

The complexity of the nearest neighbor algorithm actually depends on the number of items. For
each loop, each item must be compared to each item already in a cluster. Obviously, this is n in the
worst case. Thus, the time complexity is O(n²). Since we do need to examine the distances between
items often, we assume that the space requirement is also O(n²).
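
A minimal sketch of this threshold-based procedure is shown below; the items, the distance function, and the threshold t = 2 are illustrative and are not taken from the original example.

# Sketch: nearest neighbor (single link) clustering with a threshold t.
# Each new item joins the cluster containing its closest already-clustered
# item, provided that distance is within t; otherwise it starts a new cluster.
def nearest_neighbor_clustering(items, dist, t):
    clusters = [[items[0]]]                     # first item forms its own cluster
    for item in items[1:]:
        # index of the cluster whose closest member is nearest to this item
        best = min(range(len(clusters)),
                   key=lambda c: min(dist(item, m) for m in clusters[c]))
        best_dist = min(dist(item, m) for m in clusters[best])
        if best_dist <= t:
            clusters[best].append(item)
        else:
            clusters.append([item])             # start a new cluster
    return clusters

# Illustrative 1-D data with absolute difference as the distance and t = 2.
print(nearest_neighbor_clustering([1, 2, 3, 10, 11, 30],
                                  dist=lambda a, b: abs(a - b), t=2))
# -> [[1, 2, 3], [10, 11], [30]]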

5. PAM Algorithm
The PAM (partitioning around medoids) algorithm, also called the K-medoids algorithm,
represents a cluster by a medoid. Using a medoid is an approach that handles outliers well. The
following is the PAM algorithm.


Initially, a random set of k items is taken to be the set of medoids. Then at each step, all items from
the input dataset that are not currently medoids are examined one by one to see if they should be
medoids. That is, the algorithm determines whether there is an item that should replace one of the
existing medoids. By looking at all pairs of medoid, non-medoid objects, the algorithm chooses the
pair that improves the overall quality of the clustering the best and exchanges them. Quality here is
measured by the sum of all distances from a non-medoid object to the medoid for the cluster it is in.
An item is assigned to the cluster represented by the medoid to which it is closest (minimum distance).
We assume that K_i is the cluster represented by medoid t_i. Suppose t_i is a current medoid and we wish
to determine whether it should be exchanged with a non-medoid t_h. We wish to do this swap only if
the overall impact to the cost (sum of the distances to cluster medoids) represents an improvement.
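
The sketch below illustrates only the core swap test of PAM, i.e., computing how the total cost changes if a current medoid were exchanged with a non-medoid item; the data and the distance function are invented, and the full algorithm would repeat this test over all medoid/non-medoid pairs.

# Sketch of PAM's core step: the change in total cost if a current medoid
# were swapped with a non-medoid item (negative change = swap is worthwhile).
def total_cost(items, medoids, dist):
    # Each item contributes its distance to the closest medoid.
    return sum(min(dist(x, m) for m in medoids) for x in items)

def swap_gain(items, medoids, old_medoid, candidate, dist):
    new_medoids = [candidate if m == old_medoid else m for m in medoids]
    return total_cost(items, new_medoids, dist) - total_cost(items, medoids, dist)

data = [1, 2, 3, 10, 11, 12]                  # hypothetical 1-D items
d = lambda a, b: abs(a - b)
print(swap_gain(data, medoids=[1, 10], old_medoid=1, candidate=2, dist=d))  # -1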

6. Bond Energy Algorithm


The bond energy algorithm (BEA) was developed and has been used in the database design area to
determine how to group data and how to physically place data on a disk. It can be used to cluster
attributes based on usage and then perform logical or physical design accordingly. With BEA, the
affinity (bond) between database attributes is based on common usage. This bond is used by the
clustering algorithm as a similarity measure. The actual measure counts the number of times the two
attributes are used together in a given time. To find this, all common queries must be identified.

The idea is that attributes that are used together form a cluster and should be stored together.
In a distributed database, each resulting cluster is called a vertical fragment and may be stored at
different sites from other fragments. The basic steps of this clustering algorithm are :

1. Create an attribute affinity matrix in which each entry indicates the affinity between the two
associated attributes. The entries in the similarity matrix are based on the frequency of common
usage of attribute pairs.
2. The BEA then converts this similarity matrix to a BOND matrix in which the entries represent
a type of nearest neighbor bonding based on probability of co-access. The BEA algorithm
rearranges rows or columns so that similar attributes appear close together in the matrix.
3. Finally, the designer draws boxes around regions in the matrix with high similarity.

7. Clustering with Genetic Algorithms

With genetic algorithms, each possible clustering of the data items is represented as an individual
(chromosome) in a population. A fitness function, such as the squared error of the corresponding
clustering, is used to evaluate each individual, and crossover and mutation operators are applied over
successive generations so that better clusterings evolve.

8. Clustering with Neural Networks

Neural networks (NNs) that use unsupervised learning attempt to find features in the data that
characterize the desired output. They look for clusters of like data. These types of NNs are often called
self-organizing neural networks. There are two basic types of unsupervised learning: noncompetitive
and competitive. With the noncompetitive or Hebbian learning, the weight between two nodes is
changed to be proportional to both output values. With competitive learning, nodes are allowed to
compete and the winner takes all.

This approach usually assumes a two-layer NN in which all nodes from one layer are connected to all
nodes in the other layer. As training occurs, nodes in the output layer become associated with certain
tuples in the input dataset. Thus, this provides a grouping of these tuples together into a cluster.
Imagine every input tuple having each attribute value input to a specific input node in the NN. The number of input nodes is the same as the
number of attributes. We can thus associate each weight to each output node with one of the attributes
from the input tuple. When a tuple is input to the NN, all output nodes produce an output value. The
node with the weights more similar to the input tuple is declared the winner. Its weights are then
adjusted. This process continues with each tuple input from the training set. With a large and varied
enough training set, over time each output node should become associated with a set of tuples. The
input weights to the node are then close to an average of the tuples in this cluster.
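
A minimal sketch of the winner-take-all idea is given below; the data, the number of output nodes, the learning rate, and the number of training passes are all illustrative assumptions.

# Sketch: winner-take-all competitive learning for clustering (illustrative).
import numpy as np

rng = np.random.default_rng(0)
data = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 0.8], [1.0, 0.9]])  # made-up tuples

n_output_nodes = 2                        # one weight vector per output node
weights = rng.random((n_output_nodes, data.shape[1]))
learning_rate = 0.3

for _ in range(50):                       # repeated passes over the training set
    for x in data:
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # closest node wins
        weights[winner] += learning_rate * (x - weights[winner])  # move toward x

# After training, each output node's weights approximate the mean of one cluster,
# and each tuple is assigned to the output node it is closest to.
print(weights)
print([int(np.argmin(np.linalg.norm(weights - x, axis=1))) for x in data])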

BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is designed for clustering
a large amount of metric data. It assumes that there may be a limited amount of main memory and
achieves a linear I/O time, requiring only one database scan. It is incremental and hierarchical, and it
uses an outlier handling technique. The basic idea of the algorithm is that a tree is built that captures
needed information to perform clustering. The clustering is then performed on the tree itself, where
labelings of nodes in the tree contain the needed information to calculate distance values. A major
characteristic of the BIRCH algorithm is the use of the clustering feature, which is a triple that contains
information about a cluster. The clustering feature provides a summary of the information about one
cluster. By this definition it is clear that BIRCH applies only to numeric data.
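
The clustering feature is commonly defined as the triple CF = (N, LS, SS): the number of points in the cluster, their linear sum, and their sum of squares. The sketch below, with invented points, shows why this summary is useful, namely that CFs are additive and statistics such as the centroid can be recovered from them; it illustrates the triple only, not BIRCH's full CF-tree.

# Sketch: a BIRCH-style clustering feature CF = (N, LS, SS) for numeric data.
import numpy as np

def make_cf(points):
    pts = np.asarray(points, dtype=float)
    return (len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0))

def merge_cf(cf1, cf2):
    # CFs are additive, which is what lets BIRCH summarize clusters compactly.
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def centroid(cf):
    n, ls, _ = cf
    return ls / n

cf_a = make_cf([[1, 1], [2, 2]])
cf_b = make_cf([[3, 3]])
print(centroid(merge_cf(cf_a, cf_b)))    # [2. 2.] for the three points combined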

DBSCAN
Clusters are dense regions in the data space, separated by regions of the lower density
of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key
idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.

DBSCAN algorithm requires two parameters –


1. eps: This defines the neighborhood around a data point, i.e., if the distance between two points is
lower than or equal to ‘eps’ then they are considered neighbors. If the eps value is chosen too
small, then a large part of the data will be considered outliers. If it is chosen very large, then
the clusters will merge and the majority of the data points will be in the same cluster. One way
to find the eps value is based on the k-distance graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the
larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived
from the number of dimensions D in the dataset as MinPts >= D + 1. The value of MinPts must be
chosen to be at least 3.

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point that has fewer than MinPts points within eps but is in the neighborhood of a
core point.
Noise or outlier: A point that is neither a core point nor a border point.

DBSCAN algorithm will work in the following steps –


1. Find all the neighbor points within eps of each point and identify the core points, i.e., those with
more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all its density-connected points and assign them to the same cluster as the core
point. Two points a and b are said to be density connected if there exists a point c that has a
sufficient number of points in its neighborhood and both a and b are within the eps
distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of
e, which in turn is a neighbor of a, then b is density connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
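
A minimal sketch of the above procedure, using scikit-learn's DBSCAN implementation (assumed installed), is shown below; eps and min_samples correspond to eps and MinPts above, the sample points are invented, and points labelled -1 are treated as noise.

# Sketch: DBSCAN via scikit-learn; label -1 marks noise/outlier points.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0],      # dense region 1
              [8, 8], [8.1, 8.2], [7.9, 8.0],      # dense region 2
              [50, 50]])                            # isolated point (noise)

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [ 0  0  0  1  1  1 -1 ]

Changing eps or min_samples changes which points qualify as core points and therefore which clusters are found.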

Disadvantages of K-Means:
1. K-Means forms spherical clusters only. The algorithm fails when the data is not spherical (i.e.,
does not have the same variance in all directions).

2. The K-Means algorithm is sensitive to outliers. Outliers can skew the clusters in K-Means to a very large
extent.

3. The K-Means algorithm requires one to specify the number of clusters a priori.

Basically, the DBSCAN algorithm overcomes all of the above-mentioned drawbacks of the K-Means
algorithm. DBSCAN identifies dense regions by grouping together data points that are
close to each other based on a distance measurement.
CURE Algorithm
One objective for the CURE (Clustering Using REpresentatives) clustering algorithm is to handle
outliers well. It has both a hierarchical component and a partitioning component. First, a constant
number of points, c, are chosen from each cluster. These well-scattered points are then shrunk toward
the cluster's centroid by applying a shrinkage factor, α. When α is 1, all points are shrunk to just one point, the centroid. These points represent the cluster
better than a single point (such as a medoid or centroid) could. With multiple representative points,
clusters of unusual shapes (not just a sphere) can be better represented. CURE then uses a hierarchical
clustering algorithm. At each step in the agglomerative algorithm, clusters with the closest pair of
representative points are chosen to be merged. The distance between them is defined as the minimum
distance between any pair of points in the representative sets from the two clusters.
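
The fragment below illustrates only the shrinking of representative points toward the centroid by a factor α; the cluster points, the chosen representatives, and α = 0.5 are invented for the example, and this is not the full CURE algorithm.

# Sketch: shrinking CURE representative points toward the centroid by factor alpha.
import numpy as np

cluster = np.array([[0, 0], [0, 4], [4, 0], [4, 4]], dtype=float)   # hypothetical cluster
representatives = np.array([[0, 0], [4, 4]], dtype=float)           # well-scattered points
alpha = 0.5

centroid = cluster.mean(axis=0)                               # [2, 2]
shrunk = representatives + alpha * (centroid - representatives)
print(shrunk)   # [[1. 1.] [3. 3.]]; with alpha = 1 both collapse onto the centroid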
