DWDM Notes 1-5 Units
UNIT-I
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources.
It simplifies the reporting and analysis processes of the organization.
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by
integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries,
and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
The data warehouse is the center of the architecture for information systems of the 1990s. It supports informational
processing by providing a solid platform of integrated, historical data from which to do analysis. It provides the facility
for integration in a world of non-integrated application systems, and it is achieved in an evolutionary, step-at-a-time
fashion. The data warehouse organizes and stores the data needed for informational, analytical processing over a long
historical time perspective.
A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data. It contains granular
corporate data; the concept of granularity is explained in detail later in these notes.
Subject orientation
The main feature of the data warehouse is that the data is oriented around major subject areas of business. Figure 2 shows
the contrast between the two types of orientations.
The operational world is designed around applications and functions such as loans, savings, bankcard, and trust for a
financial institution. The data warehouse world is organized around major subjects such as customer, vendor, product,
and activity. The alignment around subject areas affects the design and implementation of the data found in the data
warehouse.
Another important way in which the application oriented operational data differs from data warehouse data is in the
relationships of data. Operational data maintains an ongoing relationship between two or more tables based on a business
rule that is in effect. Data warehouse data spans a spectrum of time and the relationships found in the data warehouse are
vast.
Integration
The most important aspect of the data warehouse environment is data integration. The very essence of the data
warehouse environment is that the data contained within the boundaries of the warehouse is integrated. The integration
shows up in many ways: consistency of naming conventions, consistency in measurement of variables, consistency in the
physical attributes of the data, and so forth. Figure 3 shows the concept of integration in a data warehouse.
Time Variant:
All data in the data warehouse is accurate as of some moment in time. This basic characteristic of data in the warehouse
is very different from data found in the operational environment. In the operational environment data is accurate as of the
moment of access. In other words, in the operational environment when you access a unit of data, you expect that it will
reflect accurate values as of the moment of access. Because data in the data warehouse is accurate as of some moment in
time (i.e., not "right now"), data
found in the warehouse is said to be "time variant". Figure 4 shows the time variance of data warehouse data.
The time variance of data in the warehouse shows up in different ways. The simplest is that the warehouse holds data for a
time horizon of 10 to 15 years, whereas in an operational environment the time span is much shorter.
The second way that time variance shows up in the data warehouse is in the key structure. Every key structure in the data
warehouse contains - implicitly or explicitly - an element of time, such as day, week, month, etc. The element of time is
almost always at the bottom of the concatenated key found in the data warehouse.
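As an illustration, a minimal sketch of such a key structure in generic SQL (the table and column names are hypothetical, not taken from these notes): the element of time sits in the concatenated primary key.

    -- Fact table whose concatenated key contains an element of time
    CREATE TABLE account_snapshot (
        account_id    INTEGER NOT NULL,   -- business key
        snapshot_date DATE    NOT NULL,   -- the element of time
        balance       DECIMAL(12,2),
        PRIMARY KEY (account_id, snapshot_date)
    );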
The third way that time variance appears is that data warehouse data, once correctly recorded, cannot be updated. Data
warehouse data is, for all practical purposes, a long series of snapshots. Of course if the snapshot of data has been taken
incorrectly, then snapshots can be changed. But assuming that snapshots are made properly, they are not altered once
made.
Non Volatile:
Figure 5 explains the concept of non-volatility; it shows that updates (inserts, deletes, and changes) are done
regularly to the operational environment on a record by record basis. But the basic manipulation of data that occurs in the
data warehouse is much simpler. There are only two kinds of operations that occur in the data warehouse - the initial
loading of data, and the access of data. There is no update of data (in the general sense of update) in the data warehouse
as a normal part of processing.
Structure of a Data warehouse
Data warehouses have a distinct structure. There are different levels of summarization and detail. The structure of a data
warehouse is shown by Figure 6.
Older detail data is data that is stored on some form of mass storage. It is infrequently accessed and is stored at a level of
detail consistent with current detailed data.
Lightly summarized data is data that is distilled from the low level of detail found at the current detailed level. This level
of the data warehouse is almost always stored on disk storage.
Highly summarized data is compact and easily accessible. Sometimes the highly summarized data is found in the data
warehouse environment and in other cases it is found outside the immediate walls of the technology that houses the data
warehouse.
The final component of the data warehouse is that of meta data. In many ways meta data sits in a different dimension than
other data warehouse data, because meta data contains no data directly taken from the operational environment. Meta
data plays a special and very important role in the data warehouse. Meta data is used as:
• a directory to help the DSS analyst locate the contents of the data warehouse,
• a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse
environment.
Meta data plays a much more important role in the data warehouse environment than it ever did in the classical
operational environment.
Flow of data
There is a normal and predictable flow of data within the data warehouse. Figure 7 shows that flow.
Data enters the data warehouse from the operational environment. Upon entering the data warehouse, data goes into the
current detail level of detail, as shown. It resides there and is used there until one of three events occurs:
• it is purged,
• it is summarized, and/or
• it is archived
The aging process inside a data warehouse moves current detail data to old detail data, based on the age of data. As the
data is summarized, it passes from the lightly summarized data to highly summarized.
Based on the above facts we realize that the data warehouse is not built all at once. Instead it is designed and populated
one step at a time; it develops in an evolutionary rather than revolutionary fashion. Building a data warehouse all at once
would be very expensive and the results would not be very accurate, so it is always suggested that the environment be
built using a step-by-step approach.
Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in
the data warehouse environment. Only two types of data operations are performed in the data warehouse:
1. Data loading
2. Data access
Here are some major differences between an operational application and a data warehouse.
Decision support systems are a class of computer-based information systems, including knowledge-based systems, that
support decision-making activities.
historically changing data; whatever the scenario, the changes are recorded in the warehouse data. Finally, data
relationships in the operational environment are turned into artifacts in the data warehouse.
During this analysis we group the data which seldom changes and the data which regularly changes, and then perform a
stability analysis to create groups of data having similar characteristics.
The stability analysis is done as shown in the figure.
Data modeling (data modelling) is the process of creating a data model for the data to be stored in a database. This data
model is a conceptual representation of data objects, the associations between different data objects, and the rules. Data
modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government
policies on the data. Data models ensure consistency in naming conventions, default values, semantics, and security while
ensuring quality of the data.
A data model emphasizes what data is needed and how it should be organized instead of what operations need to be
performed on the data. A data model is like an architect's building plan: it helps to build a conceptual model and set the
relationships between data items.
Ensures that all data objects required by the database are accurately represented. Omission of data
will lead to creation of faulty reports and produce incorrect results.
A data model helps design the database at the conceptual, physical and logical levels.
Data Model structure helps to define the relational tables, primary and foreign keys and stored
procedures.
It provides a clear picture of the base data and can be used by database developers to create a
physical database.
It is also helpful to identify missing and redundant data.
Though the initial creation of data model is labor and time consuming, in the long run, it makes your IT
infrastructure upgrade and maintenance cheaper and faster.
1. Conceptual: This Data Model defines WHAT the system contains. This model is typically created by
Business stakeholders and Data Architects. The purpose is to organize, scope and define business
concepts and rules.
2. Logical: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by
Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.
3. Physical: This Data Model describes HOW the system will be implemented using a specific DBMS
system. This model is typically created by DBA and developers. The purpose is actual implementation
of the database.
Conceptual Model
The main aim of this model is to establish the entities, their attributes, and their relationships. In this Data
modeling level, there is hardly any detail available of the actual Database structure.
For example:
Customer and Product are two entities. Customer number and name are attributes of the Customer
entity
Product name and price are attributes of product entity
Sale is the relationship between the customer and product
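A minimal sketch of how this conceptual example might be carried down to the physical level, in generic SQL (all table names, column names and datatypes are illustrative assumptions):

    -- Customer and Product become tables; the Sale relationship becomes a table
    -- that references both of them.
    CREATE TABLE customer (
        customer_number INTEGER PRIMARY KEY,
        customer_name   VARCHAR(100)
    );

    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        price        DECIMAL(10,2)
    );

    CREATE TABLE sale (
        sale_id         INTEGER PRIMARY KEY,
        customer_number INTEGER REFERENCES customer (customer_number),
        product_id      INTEGER REFERENCES product (product_id),
        sale_date       DATE,
        quantity        INTEGER
    );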
Characteristics of a conceptual data model
Conceptual data models, also known as domain models, create a common vocabulary for all stakeholders by establishing
basic concepts and scope.
Logical data models add further information to the conceptual model elements. A logical model defines the structure of
the data elements and sets the relationships between them.
The advantage of the logical data model is that it provides the foundation on which the physical model is formed.
However, the modeling structure remains generic.
At this data modeling level, no primary or secondary key is defined. At this level, you need to verify and adjust the
connector details that were set earlier for relationships.
Describes data needs for a single project but could integrate with other logical data models based on
the scope of the project.
Designed and developed independently from the DBMS.
Data attributes will have datatypes with exact precisions and lengths.
Normalization is typically applied to the model up to 3NF.
A physical data model describes the database-specific implementation of the data model. It offers database abstraction
and helps generate the schema, because of the richness of metadata offered by a physical data model.
This type of data model also helps to visualize the database structure. It helps to model database column keys,
constraints, indexes, triggers, and other RDBMS features.
The physical data model describes the data needs of a single project or application, though it may be integrated with
other physical data models based on project scope.
The data model contains relationships between tables that address cardinality and nullability of the relationships.
Developed for a specific version of a DBMS, location, data storage or technology to be used in the
project.
Columns should have exact datatypes, lengths assigned and default values.
Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are defined.
The main goal of designing a data model is to make certain that data objects offered by the functional team are
represented accurately.
The data model should be detailed enough to be used for building the physical database.
The information in the data model can be used for defining the relationships between tables, primary and foreign keys,
and stored procedures.
The data model helps the business to communicate within and across organizations.
The data model helps to document data mappings in the ETL process.
It helps to recognize correct sources of data to populate the model.
To develop a data model, one should know the characteristics of the physical data storage.
It is a navigational system, which makes application development and management complex; thus, it requires knowledge
of the underlying data.
Even a small change made in the structure requires modification of the entire application.
There is no set data manipulation language in the DBMS.
Conclusion
Data modeling is the process of developing data model for the data to be stored in a Database.
Data Models ensure consistency in naming conventions, default values, semantics, security while
ensuring quality of the data.
Data Model structure helps to define the relational tables, primary and foreign keys and stored
procedures.
There are three types of data models: conceptual, logical, and physical.
The main aim of conceptual model is to establish the entities, their attributes, and their relationships.
Logical data model defines the structure of the data elements and set the relationships between them.
A Physical Data Model describes the database specific implementation of the data model.
The main goal of a designing data model is to make certain that data objects offered by the functional
team are represented accurately.
The biggest drawback is that even a small change made in the structure requires modification of the entire application.
Granularity refers to the level of detail or summarization of the units of data in the data warehouse. The more detail there
is, the lower the level of granularity. The less detail there is, the higher the level of granularity. For example, a simple
transaction would be at a low level of granularity. A summary of all transactions for the month would be at a high level
of granularity. Granularity of data has always been a major design issue. In early operational systems, granularity was
taken for granted. When detailed data is being updated, it is almost a given that data be stored at the lowest level of
granularity.
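A hedged sketch of the two levels just described, in generic SQL (table and column names are assumptions): the detail table holds one row per transaction (low granularity), while the summary table holds one row per account per month (high granularity) and can be derived from the detail.

    -- Low level of granularity: one row per transaction
    CREATE TABLE transaction_detail (
        account_id INTEGER,
        tx_date    DATE,
        amount     DECIMAL(12,2)
    );

    -- High level of granularity: one row per account per month
    CREATE TABLE monthly_summary (
        account_id   INTEGER,
        month_start  DATE,
        total_amount DECIMAL(14,2),
        tx_count     INTEGER
    );

    -- Deriving the summary from the detail
    INSERT INTO monthly_summary (account_id, month_start, total_amount, tx_count)
    SELECT account_id,
           CAST(DATE_TRUNC('month', tx_date) AS DATE),  -- PostgreSQL-style month truncation; other dialects differ
           SUM(amount),
           COUNT(*)
    FROM transaction_detail
    GROUP BY account_id, CAST(DATE_TRUNC('month', tx_date) AS DATE);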
Major design issues of the data warehouse: granularity, partitioning, and proper design.
Determining the level of granularity is the most important design issue in the
data warehouse environment.
The single most important aspect and issue of the design of the data warehouse is the issue of granularity. It refers to the
detail or summarization of the units of data in the data warehouse. The more detail there is, the lower the granularity
level. The less detail there is, the higher the granularity level.
Granularity is a major design issue in the data warehouse as it profoundly affects the volume of data. The figure below
shows the issue of granularity in a data warehouse.
Granularity is most important to the data warehouse architect because it affects all the environments that depend on the
data warehouse for data. The main issue of granularity is getting it at the right level: the level of granularity needs to be
neither too high nor too low.
Raw Estimates
The starting point in determining the appropriate level of granularity is to do a rough estimate of the number of rows that
will be in the data warehouse. If there are very few rows in the data warehouse, then any level of granularity will be fine.
After these row projections are made, the index data space projections are calculated: we identify the length of the key or
element of data and determine whether the key will exist for each and every entry in the primary table.
The raw estimate of the number of rows of data that will reside in the data warehouse tells the architect a great deal.
–If there are only 10,000 rows, almost any level of granularity will do.
–If there are 10 million rows, a low level of granularity is possible.
–If there are 10 billion rows, not only is a higher level of granularity needed, but a major portion of the data will probably
go into overflow storage.
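A small worked estimate, using purely assumed figures, shows how strongly granularity drives the row count:

    Assume 100,000 accounts, an average of 200 transactions per account per year,
    and 5 years of history.
    Rows at transaction-level granularity: 100,000 x 200 x 5 = 100,000,000 rows
    Rows at monthly-summary granularity:   100,000 x 12 x 5  =   6,000,000 rows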
Data in the data warehouse grows at a rate never seen before; the combination of historical data and detailed data
produces a phenomenal growth rate. It is only with data warehousing that terms such as terabyte and petabyte came into
common use. As data keeps growing, some part of it becomes inactively used and is sometimes called dormant data, so it
is better to keep this kind of dormant data on external storage media.
Data stored externally is much less expensive to keep than data that resides on disk storage. However, because the data is
external it can be more difficult to retrieve, and the resulting performance issues have a strong effect on the choice of
granularity. It is usually the rough estimates that tell whether overflow storage should be considered or not.
Levels of Granularity
After the simple analysis is done, the next step is to determine the level of granularity for the data residing on disk
storage. Determining the level of granularity requires some common sense and intuition. A very low level of granularity
does not make sense because many resources are needed to store, analyze, and process the data, while a very high level of
granularity means that detailed analysis has to fall back on data residing in external (overflow) storage. Hence this is a
tricky issue, and the only practical way to handle it is to put the data in front of the user and let him or her decide what the
data should be. The figure below shows the iterative loop which needs to be followed.
The process which needs to be followed is:
Build a small subset quickly and get user feedback
Use prototyping
Look at what other people have done
Work with an experienced user
Look at what the organization has now
Hold sessions with simulated output
Sometimes there is a great need for efficiency in storing and accessing data and for the ability to analyze the data in great
detail. When an organization has huge volumes of data, it makes sense to consider two or more levels of granularity in the
detailed portion of the data warehouse. The figure below shows two levels of granularity in a data warehouse, using a
phone company example that fits the needs of most shops. There is a huge amount of data at the operational level; up to
30 days of detail is stored in the operational environment, after which the data passes to the lightly and highly
summarized zones.
This handling of granularity not only helps the data warehouse and its data marts, it also supports the processes of
exploration and data mining. Exploration and data mining take masses of detailed historical data and examine it to find
previously unknown patterns of business activity.
It is usually said that if both granularity and partitioning are done properly, then almost all other aspects of the data
warehouse implementation come easily. Proper partitioning of data allows the data to grow and to be managed.
Partitioning of data:
The main purpose of partitioning is to break up the data into small, manageable physical units; the main advantage is that
the developer has greater flexibility in managing those physical units of data.
The main tasks that become easier when data is partitioned are as follows:
Restructuring
Indexing
Sequential scanning
Reorganization
Recovery
Monitoring
In short, the main aim of this activity is flexible access to data. Partitioning can be done in many different ways. One of
the major issues facing the data warehouse developer is whether partitioning is done at the system level or the application
level. Partitioning at the system level is a function of the DBMS and, to some extent, the operating system.
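A hedged sketch of system-level (DBMS) partitioning, using Oracle-style range-partition syntax (table, column and partition names are assumptions): each year of detail is placed in its own physical unit, so a single year can be indexed, reorganized, recovered or archived independently.

    CREATE TABLE sales_detail (
        sale_id   INTEGER,
        sale_date DATE,
        amount    DECIMAL(12,2)
    )
    PARTITION BY RANGE (sale_date) (
        PARTITION sales_2021 VALUES LESS THAN (DATE '2022-01-01'),
        PARTITION sales_2022 VALUES LESS THAN (DATE '2023-01-01'),
        PARTITION sales_2023 VALUES LESS THAN (DATE '2024-01-01')
    );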
Building a Data warehouse
There are two factors that drive you to build and use data warehouse. They are:
Business factors:
Business users want to make decisions quickly and correctly using all available data.
Technological factors:
To address the incompatibility of operational data stores.
IT infrastructure is changing rapidly: its capacity is increasing and its cost is decreasing, which makes building a data
warehouse feasible.
There are several things to be considered while building a successful data warehouse.
Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following two approaches:
Top - Down Approach (Suggested by Bill Inmon)
Bottom - Up Approach (Suggested by Ralph Kimball)
In the top down approach suggested by Bill Inmon, we build a centralized storage area to house corporate wide business
data. This repository (storage area) is called Enterprise Data Warehouse (EDW). The data in the EDW is stored in a
normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data.
The data in the EDW is stored at the most detail level. The reason to build the EDW on the most detail level is to
leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to provide for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.
The advantage of using the Top Down approach is that we build a centralized repository to
provide for one version of truth for business data. This is very important for the data to be reliable, consistent across
subject areas and for reconciliation in case of data related contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial investment. The business has
to wait for the EDW to be implemented, followed by the building of the data marts, before they can access their reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to build a data warehouse. Here we
build the data marts separately at different points of time as and when the specific subject area requirements are clear.
The data marts are integrated or combined together to form a data warehouse. Separate data marts are combined through
the use of conformed dimensions and conformed facts. A conformed dimension or a conformed fact is one that can be
shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across
separate data marts. A conformed dimension means exactly the same thing with every fact table it is joined to.
Conformed fact has the same definition of measures, same dimensions joined to it and at
the same granularity across data marts.
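A minimal sketch of a conformed dimension in generic SQL (all names are illustrative): the same date dimension, with the same keys and attribute names, is shared by a sales data mart and a shipments data mart, so results from the two marts can be combined.

    -- Conformed date dimension shared by both data marts
    CREATE TABLE dim_date (
        date_key       INTEGER PRIMARY KEY,
        full_date      DATE,
        calendar_month INTEGER,
        calendar_year  INTEGER
    );

    -- Fact table in the sales data mart
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date (date_key),
        product_key  INTEGER,
        sales_amount DECIMAL(12,2)
    );

    -- Fact table in the shipments data mart, joined to the same dimension
    CREATE TABLE fact_shipments (
        date_key      INTEGER REFERENCES dim_date (date_key),
        product_key   INTEGER,
        units_shipped INTEGER
    );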
The bottom up approach helps us incrementally build the warehouse by developing and integrating data marts as and
when the requirements are clear. We don’t have to wait for knowing the overall requirements of the warehouse. We
should implement the bottom up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear.
The advantage of using the Bottom Up approach is that they do not require high initial costs
and have a faster implementation time; hence the business can start using the marts much earlier as compared to the top-
down approach.
The disadvantage of using the Bottom Up approach is that it stores data in a de-normalized format, hence there would be
high space usage for detailed data.
Design considerations
To be successful, a data warehouse designer must adopt a holistic approach, that is, consider all data warehouse
components as parts of a single complex system and take into account all possible data sources and all known usage
requirements.
Most successful data warehouses that meet these requirements have these common
characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
Data warehouse is difficult to build due to the following reason:
Heterogeneity of data sources
Use of historical data
Growing nature of data base
The data warehouse design approach must be a business-driven, continuous and iterative engineering approach. In
addition to the general considerations, the following specific points are relevant to data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is the template that
describes how information will be organized within the integrated warehouse framework. The data warehouse data must
be detailed data; it must be formatted, cleaned up and transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to find definitions or
subject areas. In other words, it must provide decision support oriented pointers to warehouse data and thus provides a
logical link between warehouse data and decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data
volumes continue to grow. Therefore, it becomes necessary to know how the data should be divided across
multiple servers and which users should get access to which types of data. The data can be distributed based on the
subject area, location (geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the implementation of the data warehouse. All
selected tools must be compatible with the given data warehouse environment and with each other. All tools must be able
to use a common Meta data repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The dbms that supports the warehouse data
The communication infrastructure that connects data marts, operational systems and
end users
The hardware and software to support meta data repository
The systems management framework that enables admin of the entire environment
Implementation considerations
The following logical steps needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the db tech and platform
Extract the data from operational db, transform it, clean it up and load it into the
warehouse
Choose db access and reporting tools
Choose db connectivity software
Choose data analysis and presentation s/w
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose a tool is based on
the type of data that can be selected using the tool and the kind of access it permits for a particular user. The following
lists the various types of data that can be accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data
Data for testing of hypothesis, trends and patterns
Predefined repeatable queries
Ad hoc user specified queries
Reporting and analysis data
Complex queries with multiple joins, multi level sub queries and sophisticated search criteria
User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the warehouse. There are
three classes of users:
Casual users: are most comfortable in retrieving info from warehouse in pre defined formats and running pre existing
queries and reports. These users do not need tools that allow for building standard and ad hoc reports
Power Users: can use pre defined as well as user defined queries to create simple and ad hoc reports. These users can
engage in drill down operations. These users may have the experience of using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard analysis on the info they
retrieve. These users have the knowledge about the use of query and report tools.
Components of Datawarehouse
Architecture
A data warehouse is an environment, not a product. It is based on a relational database management system that
functions as the central repository for informational data.
The central repository of information is surrounded by a number of key components designed to make the environment
functional, manageable and accessible.
The data sources for the data warehouse come from operational applications. The data entering the data warehouse is
transformed into an integrated structure and format.
The transformation process involves conversion, summarization, filtering and condensation.
The data warehouse must be capable of holding and managing large volumes of data as well as different data structures
over time.
This is item number 1 in the architecture diagram above. These components perform conversions, summarization, key
changes, structural changes and condensation. The data transformation is required so that the information can be used by
decision support tools.
The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, and SQL DDL code
etc., to move the data into data warehouse from multiple operational systems.
Database heterogeneity: refers to differences among DBMSs, such as different data models, different access languages,
different data navigation methods, operations, concurrency, integrity and recovery processes, etc.
Data heterogeneity: refers to the different ways the data is defined and used in different modules. Vendors addressing
this include Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.
Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two:
Technical Meta data:
It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks. It includes,
Information about data stores
Transformation descriptions. That is mapping methods from operational database to warehouse database.
Warehouse Object and data structure definitions for target data
The rules used to perform clean up, and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data
acquisition history, data access etc.,
Meta data helps the users to understand the content and find the data. Meta data is stored in a separate data store known
as the information directory or meta data repository, which helps to integrate, maintain and view the contents of the data
warehouse.
Access tools
Its purpose is to provide info to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools. They are:
Production reporting tool used to generate regular operational reports
Desktop report writer are inexpensive desktop tools designed for end users.
Managed Query tools: used to generate SQL queries. They use meta-layer software between users and databases,
which offers point-and-click creation of SQL statements. This tool is a preferred choice of users to perform
segment identification, demographic analysis, territory management and preparation of customer mailing lists
etc.
Application development tools: graphical data access environments which integrate OLAP tools with the data
warehouse and can be used to access all database systems.
OLAP tools: are used to analyze the data in multidimensional and complex views. To enable multidimensional
properties they use MDDB and MRDB, where MDDB refers to multidimensional databases and MRDB refers to
multirelational databases.
Data mining tools: are used to discover knowledge from the data warehouse data also can be used for data visualization
and data correction purposes.
Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a dedicated user group. They are
used for rapid delivery of enhanced decision support functionality to end users. A data mart is used in the following
situations:
Extremely urgent user requirement
The absence of a budget for a full scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mind-sized project
Information delivery system
• It is used to enable the process of subscribing for data warehouse information.
• Delivery to one or more destinations according to a specified scheduling algorithm.
Metadata
Metadata is one of the most important aspects of data warehousing. It is data about data stored in the warehouse and its
users.
Metadata contains:
i. The location and description of warehouse system and data components (warehouse objects).
ii. Names, definition, structure and content of the data warehouse and end user views.
iii. Identification of reliable data sources (systems of record).
iv. Integration and transformation rules
- used to generate the data warehouse; these include the mapping method from operational databases into the
warehouse, and algorithms used to convert, enhance, or transform data.
v. Integration and transformation rules
-used to deliver data to end
-user analytical tools.
vi. Subscription information
-for the information delivery to the analysis subscribers.
vii. Data warehouse operational information,
- which includes a history of warehouse updates, refreshments, snapshots, versions, ownership authorizations and
extract audit trail.
viii. Metrics
- used to analyze warehouse usage and performance and end user usage patterns.
ix. Security
- authorizations access control lists, etc.
Metadata Interchange Initiative (idea)
In a situation such as a data warehouse, different tools must be able to freely and easily access, and in some cases
manipulate and update, metadata that was created by other tools and stored in a variety of different storage formats. The
way to achieve this goal is to establish at least a minimum common set of interchange standards and guidelines that the
different vendors' tools can fulfill. This effort, offered by the data warehousing vendors, is known as the metadata
interchange initiative.
The metadata interchange standard defines two different meta models:
The application meta model — the tables, etc., used to "hold" the metadata for a
particular application.
The metadata meta model — the set of objects that the metadata interchange standard can be used to describe.
These represent the information that is common to one or more classes of tools, such as data extraction tools, replication
tools, user query tools and database servers.
Metadata Repository(storage)
• The data warehouse architecture framework includes the metadata interchange
framework as one of its components.
• It defines a number of components all of which interact with each other via the
architecturally defined layer of metadata.
Metadata repository management software can be used to map the source data to the
target database, generate code for data transformations, integrate and transform the data, and control moving data to the
warehouse.
Metadata defines the contents and location of data (data model) in the warehouse,
relationships between the operational databases and the data warehouse and the business views of the warehouse data that
are accessible by end-user tools.
A data warehouse design ensures a mechanism for maintaining the metadata repository, and all the access paths to the
data warehouse must have metadata as an entry point.
There is a variety of access paths into the data warehouse, and at the same time many tool classes can be involved in the
process.
• Metadata can define all data elements and their attributes, data sources and timing, and the rules that govern data use
and data transformation.
• Metadata needs to be collected as the warehouse is designed and built.
• Even though there are a number of tools available to help users understand and use the warehouse, these tools need to
be carefully evaluated before any purchasing decision is made.
Implementation Examples
Platinum technologies, R&O, Prism solutions and Logic works.
Metadata Trends
The data warehouse arena must include external data within the data warehouse. The data warehouse must reduce costs,
increase competitiveness and improve business quickness.
The process of integrating external and internal data into the warehouse faces a number of challenges:
Inconsistent data formats
Missing or invalid data
Different levels of aggregation
Semantic inconsistency
Unknown or questionable data quality and timeliness
Data warehouses integrate various data types such as alphanumeric data, text, voice, image, full-motion video, and web
pages in HTML format.
UNIT-II
Mapping the Data Warehouse to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data
Extraction, Cleanup, and Transformation Tools – Reporting and Query tools and Applications – Online Analytical
Processing (OLAP) – Need – Multidimensional Data Model – OLAP Guidelines – Multidimensional versus
Multirelational OLAP – Categorization of OLAP Tools.
Relational data base technology for data warehouse
The functions of a data warehouse are based on relational database technology, implemented in a parallel manner. There
are two advantages of having parallel relational database technology for the data warehouse:
Linear speed-up: refers to the ability to increase the number of processors in order to reduce response time
proportionally for the same workload.
Linear scale-up: refers to the ability to provide the same performance on the same requests as the database size
increases, given proportionally more resources.
Horizontal parallelism: means that the database is partitioned across multiple disks, and parallel processing occurs
within a specific task that is performed concurrently on different processors against different sets of data.
Vertical parallelism: This occurs among different tasks. All query components such as scan, join, sort etc are executed
in parallel in a pipelined fashion. In other words, an output from one task becomes an input into another task.
Data partitioning:
Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done
randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another option for random
partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time
searching for it across all disks. The various intelligent partitioning schemes include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value of the partitioning key
for each row
Key range partitioning: Rows are placed and located in the partitions according to the value of the partitioning key.
That is all the rows with the key value from A to K are in partition 1, L to T are in partition 2 and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk, etc. This is useful
for small reference tables.
User-defined partitioning: allows a table to be partitioned on the basis of a user-defined expression.
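A hedged sketch of two of these schemes in Oracle-style syntax (table, column and partition names are assumptions): hash partitioning lets the DBMS compute the partition from the key value, while key range partitioning places rows according to the value of the partitioning key.

    -- Hash partitioning: the partition number is calculated from the key value
    CREATE TABLE customer_calls (
        customer_id INTEGER,
        call_date   DATE,
        duration    INTEGER
    )
    PARTITION BY HASH (customer_id) PARTITIONS 4;

    -- Key range partitioning: names A-K go to p1, L-T to p2, the rest to p3
    CREATE TABLE customer_master (
        customer_name VARCHAR(100),
        region        VARCHAR(30)
    )
    PARTITION BY RANGE (customer_name) (
        PARTITION p1 VALUES LESS THAN ('L'),
        PARTITION p2 VALUES LESS THAN ('U'),
        PARTITION p3 VALUES LESS THAN (MAXVALUE)
    );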
Shared Memory Architecture
Tightly coupled shared memory systems, illustrated in the following figure, have the following characteristics:
Multiple PUs share memory.
Each PU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be used with Oracle
Parallel Server in a tightly coupled system, where memory is shared among the multiple PUs, and is accessible by all the
PUs through a memory bus.
Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors. These include various system
components such as the memory bandwidth, PU to PU communication bandwidth, the memory available on the system,
the I/O bandwidth, and the bandwidth of the common bus.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in following figure, have the following
characteristics:
Each node consists of one or more PUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
The cluster illustrated in figure is composed of multiple tightly coupled nodes. The Distributed Lock Manager (DLM ) is
required. Examples of loosely coupled systems are VAXclusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache consistency must be
maintained across the nodes and a lock manager is needed to maintain the consistency. Additionally, instance locks using
the DLM on the Oracle level must be maintained to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance
impact is dependent on the hardware and software components, such as the bandwidth of the high-speed bus through
which the nodes communicate, and DLM performance.
Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given
disk. If a table or database is located on that disk, access depends entirely on the PU which owns it. Shared nothing
systems can be represented as follows:
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless, adding more PUs and
disks can improve scaleup. Oracle Parallel Server can access the disks on a shared nothing system as long as the
operating system provides transparent disk access, but this access is expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile
to consider data-dependent routing to alleviate contention.
DBMS management tools help to configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a
serial RDBMS.
Price/Performance: The parallel RDBMS can demonstrate a near-linear speed-up and scale-up at reasonable costs.
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a collection of related data
items, consisting of measures and context data. It typically represents business items or business transactions. A
dimension is a collection of data that describe one business dimension. Dimensions determine the contextual background
for the facts; they are the parameters over which we want to perform OLAP. A measure is a numeric attribute of a fact,
representing the performance or behavior of the business relative to the dimensions.
Considering Relational context, there are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational data base semantics is provided by the data base
schema design called the star schema. The basic premise of the star schema is that information can be classified into two groups:
Facts
Dimension
Star schema has one large central table (fact table) and a set of smaller tables (dimensions) arranged in a radial pattern
around the central table.
Facts are core data element being analyzed while dimensions are attributes about the facts.
The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch,
and location.
The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram
resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star
are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional
tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is most commonly used
nowadays and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of
foreign keys from the primary keys of related dimension tables. A fact table typically has two types of columns: foreign
keys to dimension tables, and measures, which contain numeric facts. A fact table can contain facts at a detail or
aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed
by a Time dimension (profit by month, quarter, year), Region dimension (profit by country, state, city), Product
dimension (profit for product1, product2).
Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities),
clients, products, times, and channels.
Measures are numeric data based on columns in a fact table. They are the primary data which end users are interested in.
E.g. a sales fact table may contain a profit measure which represents profit on each sale.
The main characteristics of the star schema:
Simple structure -> easy to understand schema
Great query effectiveness -> small number of tables to join
Relatively long time for loading data into dimension tables -> de-normalization and data redundancy mean that the
tables can become large.
The most commonly used schema in data warehouse implementations -> widely supported by a large number of business
intelligence tools
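A minimal star-schema sketch for the sales example above, in generic SQL (all table and column names are illustrative): one central fact table carries foreign keys to the time, item, branch and location dimensions plus the numeric measures.

    CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, full_date DATE, calendar_month INTEGER, calendar_year INTEGER);
    CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name VARCHAR(100), brand VARCHAR(50));
    CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name VARCHAR(100));
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city VARCHAR(50), state VARCHAR(50), country VARCHAR(50));

    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES dim_time (time_key),
        item_key     INTEGER REFERENCES dim_item (item_key),
        branch_key   INTEGER REFERENCES dim_branch (branch_key),
        location_key INTEGER REFERENCES dim_location (location_key),
        units_sold   INTEGER,         -- measure
        dollars_sold DECIMAL(12,2)    -- measure
    );

A typical query then joins the fact table to whichever dimension tables the analysis needs.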
2. Level Indicator.
The dimension table design includes a level of hierarchy indicator for every record.
Every query that is retrieving detail records from a table that stores details and aggregates must use this indicator as an
additional constraint to obtain a correct result.
If the user is not aware of the level indicator, or its values are incorrect, an otherwise valid query may result in a totally
invalid answer.
An alternative to using the level indicator is the snowflake schema, in which aggregate fact tables are created separately
from detail tables. A snowflake schema contains separate fact tables for each level of aggregation.
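A hedged illustration of the pitfall (assuming a combined detail/aggregate fact table like sales_fact above, extended with a level_ind column): every query must constrain the level indicator, otherwise detail rows and aggregate rows are mixed and double-counted.

    -- Correct: only detail-level rows are summed
    SELECT item_key, SUM(dollars_sold)
    FROM sales_fact
    WHERE level_ind = 'DETAIL'
    GROUP BY item_key;

    -- Otherwise valid, but the answer is wrong: detail rows and monthly
    -- aggregate rows are added together
    SELECT item_key, SUM(dollars_sold)
    FROM sales_fact
    GROUP BY item_key;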
Other problems with the star schema design - Pairwise Join Problem
Joining 5 tables requires joining the first two tables, then joining the result with the third table, and so on. The
intermediate result of every join operation is used to join with the next table. Selecting the best order of pairwise joins
rarely can be solved in a reasonable amount of time.
2. Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-one relationships among
sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed
snowflake structure visualizes the hierarchical structure of dimensions very well (a sketch follows item 3 below).
3. Fact constellation schema: for each star schema it is possible to construct a fact constellation schema (for example, by
splitting the original star schema into several star schemas, each of which describes facts at another level of the dimension
hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. The main
shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of
aggregation must be considered and selected. Moreover, dimension tables are still large.
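As a sketch of the snowflake decomposition described in item 2 (generic SQL, names assumed), the location dimension of the earlier star example can be split so that the many-to-one city-to-state relationship becomes its own table, forming a hierarchy.

    CREATE TABLE dim_state (
        state_key  INTEGER PRIMARY KEY,
        state_name VARCHAR(50),
        country    VARCHAR(50)
    );

    -- The location (city) dimension now references the state dimension
    CREATE TABLE dim_city (
        location_key INTEGER PRIMARY KEY,
        city_name    VARCHAR(50),
        state_key    INTEGER REFERENCES dim_state (state_key)
    );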
4.2 STAR join and STAR Index.
A STAR join is a high-speed, single-pass, parallelizable multi-table join method. It performs many joins in a single
operation using indexing technology. For query processing, indexes are used on the columns and rows of the selected
tables.
Red Brick's RDBMS indexes, called STAR indexes, are used for STAR join performance. STAR indexes are created on
one or more foreign key columns of a fact table. A STAR index contains information that relates the dimensions of a fact
table to the rows that contain those dimensions. STAR indexes are very space-efficient. The presence of a STAR index
allows Red Brick's RDBMS to quickly identify which target rows of the fact table are of interest for a particular set of
dimensions. Also, because STAR indexes are created over foreign keys, no assumptions are made about the type of
queries which can use the STAR indexes.
Overview:
SYBASE IQ is a separate SQL database.
Once loaded, SYBASE IQ converts all data into a series of bitmaps, which are then highly compressed and stored on
disk.
SYBASE positions SYBASE IQ as a read-only database for data marts, with a practical size limitation currently placed
at 100 Gbytes.
Data cardinality: Bitmap indexes are used to optimize queries against low- cardinality data
— that is, data in which the total number of possible values is relatively low.
(Cardinal meaning – important)
Fig: - Bitmap index
For example, the data cardinality of an address pin code might be 50 (50 possible values), while the data cardinality of
gender is only 2 (male and female).
If the bit for a given index is "on", the value exists in the record. Here, a 10,000-row employee table that contains the
"gender" column is bitmap-indexed for this value.
Bitmap indexes can become bulky and even unsuitable for high cardinality data where the range of possible values is
high. For example, values like "income" or "revenue" may have an almost infinite number of values.
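A hedged sketch of the low-cardinality case above (Oracle-style syntax for the index definition; the table, index name and sample data are illustrative): each distinct value of the indexed column gets its own bit vector, one bit per row.

    -- Oracle-style bitmap index on a low-cardinality column
    CREATE BITMAP INDEX emp_gender_bix ON employee (gender);

    -- Conceptually, for rows 1..6 with genders M, F, F, M, F, M:
    --   bitmap for 'M': 1 0 0 1 0 1
    --   bitmap for 'F': 0 1 1 0 1 0
    -- A bit is "on" when that value occurs in the corresponding row.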
SYBASE IQ uses a patented technique called Bit-wise technology to build bitmap indexes for high-cardinality data.
Index types: The first release of SYBASE IQ provides five index techniques.
A traditional RDBMS approach to storing data in memory and on disk is to store it one row at a time, so that each row can
be viewed and accessed as a single record. This approach works well for OLTP environments, in which a typical
transaction accesses one record at a time.
However, for the set-processing, ad hoc query environment of data warehousing, the goal is to retrieve multiple values of
several columns. For example, if the problem is to calculate the average, maximum and minimum salary, column-wise
storage of the salary field lets the DBMS read only the salary column rather than every entire row.
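A short sketch of the kind of query that benefits (generic SQL, names assumed): it touches only the salary column, so a column-wise store can read just that column instead of every complete row.

    -- Only the salary column is needed to answer this query
    SELECT AVG(salary) AS avg_salary,
           MAX(salary) AS max_salary,
           MIN(salary) AS min_salary
    FROM employee;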
Data Extraction, Cleanup, and Transformation Tools
Tool Requirements
The tools that bring data contents and formats from operational and external data stores into the data warehouse must
perform the following tasks:
• Data transformation - from one format to another, to account for possible differences between the source and target
platforms.
• Data transformation and calculation - based on the application of business rules.
• Data consolidation and integration - which includes combining several source records into a single record to be loaded
into the warehouse.
• Metadata synchronization and management- which includes storing and/or updating meta data definitions about source
data files, transformation actions, loading formats, and events, etc.
The following are the criteria that affect a tool's ability to transform, consolidate, integrate and repair the data:
1. The ability to identify data - in the data source environments that can be read by the conversion tool is important.
2. Support for flat files and indexed files (e.g., VSAM, IMS and CA-IDMS) is critical.
3. The capability to merge data from multiple data stores is required in many installations.
4. The specification interface to indicate the data to be extracted and the conversion criteria is important.
5. The ability to read information from data dictionaries or import information from warehouse products is desired.
6. The code generated by the tool should be completely maintainable from within the development environment.
7. Selective data extraction of both data elements and records enables users to extract only the required data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data-type and character-set translation is a requirement when moving data between incompatible
systems.
10. The capability to create summarization, aggregation and derivation records and field is very important.
11. Vendor stability and support for the product items must be carefully evaluated.
Vendor Approaches
Integrated solutions can fall into one of the categories described below.
• Code generators create 3GL/4GL transformation programs based on source and target data definitions and on data
transformation and enhancement rules defined by the developer. This approach reduces the need for an organization to
write its own data capture, transformation, and load programs.
• Database data replication tools utilize database triggers or a recovery log to capture changes to a single data source on
one system and apply the changes to a copy of the source data located on a different system.
• Rule-driven dynamic transformation engines (data mart builders) capture data from a source system at user-defined
intervals, transform the data, and then send and load the results into a target environment, typically a data mart.
Vendor Solution
• Prism solutions
• SAS Institute
• Validity Corporation
• Information Builders
Prism solutions: While Enterprise/Access focuses on providing access to legacy data, Prism Warehouse Manager
provides a solution for data warehousing by mapping source data to a target DBMS to be used as a warehouse.
Prism Warehouse Manager can extract data from multiple source environments, including DB2, IDMS, IMS, VSAM,
RMS, and sequential files under UNIX or MVS. It has strategic relationships with Pyramid and Informix.
SAS institute:
SAS starts from the premise that critical data still resides in the data center and offers its traditional SAS System tools to
serve data warehousing functions. Its data repository function can act to build the informational database.
SAS Data Access Engines serve as extraction tools to combine common variables, transform data representation forms
for consistency, consolidate redundant data, and use business rules to produce computed values in the warehouse.
SAS engines can work with hierarchical and relational database and sequential files.
Validity Corporation:
Validity Corporation's Integrity data reengineering tool is used to investigate, standardize, transform and integrate data
from multiple operational systems and external sources.
Integrity is a specialized, multipurpose data tool that organizations apply on projects such as:
• Data audits
• Data warehouse and decision support systems
• Customer information files and householding applications
• Client/server business applications such as SAP R/3, Oracle, and Hogan
• System consolidations.
Information Builders:
EDA/SQL from Information Builders is a product that can be used as a data extraction, transformation, and legacy data
access component of a tool suite for building a data warehouse.
EDA/SQL implements a client/server model that is optimized for higher performance
EDA/SQL supports copy management, data quality management, data replication capabilities, and standards
support for both ODBC and the X/Open CLI.
Transformation Engines
1. Informatica:
Informatica has put forward a multicompany metadata integration initiative. Informatica joined forces with Andyne, Brio,
Business Objects, Cognos, Information Advantage, InfoSpace, IQ Software, and MicroStrategy to deliver a "back-end"
architecture and publish API specifications supporting its technical and business metadata.
2. Power Mart:
Informatica's flagship product, the PowerMart suite, consists of the following components.
• PowerMart Designer
• PowerMart Server
• The Informatica Server Manager
• The Informatica Repository
• Informatica PowerCapture
3. Constellar:
The Constellar Hub consists of a set of components supporting distributed transformation management capabilities.
The product is designed to handle the movement and transformation of data for both data migration and data distribution
in an operational environment, and for capturing operational data for loading into a data warehouse.
The transformation hub performs the tasks of data cleanup and transformation.
The Hub Supports:
Record reformatting and restructuring.
Field level data transformation, validation and table look up.
File and multi-file set-level data transformation and validation.
The creation of intermediate results for further downstream transformation by the hub.
Reporting and query tools for data analysis:-
The principal purpose of data warehousing is to provide information to business users for strategic decision making.
These users interact with the data warehouse using front-end tools, or by getting the required information through the
information delivery system.
Tool Categories
There are five categories of decision support tools
1. Reporting
2. Managed query
3. Executive information systems (EIS)
4. On-line analytical processing (OLAP)
5. Data mining (DM)
Reporting tools:
Reporting tools can be divided into production reporting tools and desktop report writers.
Production reporting tools: Companies use production reporting tools to generate regular operational reports or to
support high-volume batch jobs, e.g., calculating and printing pay checks.
Production reporting tools include third-generation languages such as COBOL, specialized fourth-generation languages
such as Information Builders, Inc.'s Focus, and high-end client/server tools such as MITI's SQR.
Report writers: These are inexpensive desktop tools designed for end users. Products such as Seagate Software's Crystal
Reports allow users to design and run reports without having to rely on the IS department.
In general, report writers have graphical interfaces and built-in charting functions. They can pull groups of data from a
variety of data sources and integrate them in a single report.
Leading report writers include Crystal Reports, Actuate, and Platinum Technology, Inc.'s InfoReports. Vendors are trying
to increase the scalability of report writers by supporting three-tiered architectures in which report processing is done
on a Windows NT or UNIX server.
Report writers also are beginning to offer object-oriented interfaces for designing and manipulating reports and modules
for performing ad hoc queries and OLAP analysis.
Other tools are IQ Software's IQ Objects, Andyne Computing Ltd.'s GQL, IBM's Decision Server, Speedware Corp.'s
Esperant (formerly sold by Software AG), and Oracle Corp.'s Discoverer/2000.
EIS tools:
EIS tools include Pilot Software, Inc.'s Lightship, Platinum Technology's Forest and Trees, Comshare, Inc.'s Commander
Decision, Oracle's Express Analyzer, and SAS Institute, Inc.'s SAS/EIS.
EIS vendors are moving in two directions.
Many are adding managed query functions to compete head-on with other decision support tools.
Others are building packaged applications that address horizontal functions, such as sales, budgeting, and marketing, or
vertical industries such as financial services.
Ex: Platinum Technologies offers Risk Advisor.
OLAP tools:
These tools provide an intuitive way to view corporate data.
They aggregate data along common business subjects or dimensions and then let users navigate through the hierarchies
and dimensions with the click of a mouse button.
Some tools, such as Arbor Software Corp.'s Essbase and Oracle's Express, pre-aggregate data in special multidimensional
databases.
Other tools work directly against relational data and aggregate data on the fly, such as MicroStrategy, Inc.'s DSS Agent
or Information Advantage, Inc.'s DecisionSuite.
Data mining tools:
Data mining tools provide insights into corporate data that are not easily discerned with managed query or OLAP tools.
They use a variety of statistical and artificial intelligence (AI) algorithms to analyze the correlation of variables in the
data and search out interesting patterns and relationships to investigate.
Data mining tools, such as IBM's Intelligent Miner, are expensive and require statisticians to implement and manage.
Others include DataMind Corp.'s DataMind, Pilot's Discovery Server, and tools from Business Objects and SAS Institute.
These tools offer simple user interfaces that plug in directly to existing OLAP tools or databases and can be run directly
against data warehouses.
For example, all end-user tools use metadata definitions to obtain access to data stored in the warehouse, and some of
these tools (e.g., OLAP tools) may employ additional or intermediary data stores (e.g., data marts, multidimensional
databases).
Applications
Organizations use a familiar application development approach to build a query and reporting environment for the data
warehouse. There are several reasons for doing this:
A legacy DSS or EIS system is still being used, and the reporting facilities appear adequate.
An organization has made a large investment in a particular application development environment (eg., Visual C++,
Power Builder).
A new tool may require an additional investment in developers' skill sets, software, and infrastructure, all or part of
which was not budgeted for in the planning stages of the project.
The business users do not want to get involved in this phase of the project, and will continue to rely on the IT
organization to deliver periodic reports in a familiar format.
A particular reporting requirement may be too complicated for an available reporting tool to handle.
All these reasons are perfectly valid and in many cases result in a timely and cost-effective delivery of a reporting system
for a data warehouse.
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable
multidimensional viewing, analysis and querying of large amounts of data.
E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or
enable them to analyze their company's historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are designed to ask complex queries of large
multidimensional collections of data. Because of this, OLAP usually goes hand in hand with data warehousing.
OLAP is an application architecture, not intrinsically a data warehouse or a database management system (DBMS).
Whether or not it utilizes a data warehouse, OLAP is becoming an architecture that an increasing number of enterprises
are implementing to support analytical applications.
The majority of OLAP applications are deployed in a "stovepipe" fashion, using specialized MDDBMS technology, a
narrow set of data, and a preassembled application user interface.
Business problems such as market analysis and financial forecasting require query-centric database schemas that are
array-oriented and multidimensional in nature.
These business problems are characterized by the need to retrieve large numbers of records from very large data sets
(hundreds of gigabytes and even terabytes). The multidimensional nature of the problems it is designed to address is the
key driver for OLAP.
The result set may look like a multidimensional spreadsheet (hence the term multidimensional). Although all the
necessary data can be represented in a relational database and accessed via SQL, the two-dimensional relational model of
data and the Structured Query Language (SQL) have limitations for such complex real-world problems.
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP. Because OLAP is on-
line, it must provide answers quickly; analysts create iterative queries during interactive sessions, not in batch jobs that
run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is
designed to solve complex queries in real time.
One way to understand the multidimensional data model is to view it as a cube. The table on the left contains detailed
sales data by product, market, and time. The cube on the right associates sales numbers (units sold) with the dimensions
product type, market, and time, with the unit variables organized as cells in an array.
This figure also gives a different understanding of the drill-down operation: the values involved need not be defined as
directly related in the underlying tables; the cube relates them directly.
As the number and size of the dimensions increase, the size of the cube grows exponentially; for example, a cube with
dimensions of cardinality 1,000 (product) x 100 (market) x 365 (time) already contains 36.5 million cells.
The response time of the cube depends on the size of the cube.
OLAP Operations (Operations in Multidimensional Data Model:)
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
1.Roll-up
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < state < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country, so
the data is grouped into countries rather than cities.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.
The following diagram illustrates how drill-down works:
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the
following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative
presentation of data. Consider the following diagram that shows the pivot operation. In this the item and location axes in
2-D slice are rotated.
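The four operations described above can be illustrated concretely. The following is a minimal sketch in Python using
pandas on a toy sales table; the column names (item, city, country, quarter, sales) and the figures are invented for
illustration and are not part of the original notes.

# Minimal sketch of OLAP-style operations on a toy sales table using pandas.
import pandas as pd

df = pd.DataFrame({
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [605, 825, 14, 400],
})

# Roll-up: climb the location hierarchy from city to country (fewer, coarser groups).
rollup = df.groupby(["country", "quarter"])["sales"].sum()

# Drill-down would go the other way, e.g. from quarter to month, if monthly data existed.

# Slice: fix one dimension on a single value (time = "Q1").
slice_q1 = df[df["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = df[df["city"].isin(["Toronto", "Vancouver"])
          & df["quarter"].isin(["Q1", "Q2"])
          & df["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate the axes of a 2-D view (items as rows, cities as columns).
pivot = df.pivot_table(index="item", columns="city", values="sales", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")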
Dr. E. F. Codd, the "father" of the relational model, created a list of twelve rules for OLAP systems.
These rules are:
1).Multidimensional conceptual view: The OLAP should provide a suitable multidimensional business model that suits
the business problems and requirements.
2).Transparency: (OLAP must be transparent to its users.)
The OLAP system's technology, the underlying database and computing architecture (client/server, mainframe gateways,
etc.) and the heterogeneity of input data sources should be transparent to users, to preserve their productivity and
proficiency with familiar front-end environments and tools (e.g., MS Windows, MS Excel).
3).Accessibility: (The OLAP tool should access only the data required for the analysis.)
The OLAP system should access only the data actually required to perform the analysis. The system should be able to
access data from all heterogeneous enterprise data sources required for the analysis.
4).Consistent reporting performance: (The size of the database should not affect performance.)
As the number of dimensions and the size of the database increase, users should not notice any significant decrease in
performance.
5).Client/server architecture:(c/s architecture to ensure better performance and flexibility ).
The OLAP system has to conform to client/server architectural principles for maximum price and performance,
flexibility, adaptivity and interoperability
6).Generic dimensionality: Every data dimension should be equivalent in its structure and operational capabilities.
7).Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse matrix and so maintain the level
of performance.
8).Multi-user support: The OLAP should allow several users working concurrently to work together on a specific model.
9).Unrestricted cross-dimensional operations: The OLAP systems must be able to recognize dimensional hierarchies and
automatically perform associated roll-up calculations within and across dimensions.
10).Intuitive data manipulation: Consolidation path reorientation, pivoting, drill-down, roll-up, and other manipulations
should be accomplished via direct point-and-click and drag-and-drop operations on the cells of the cube.
11).Flexible reporting: The ability to arrange rows, columns, and cells in a fashion that facilitates analysis by spontaneous
visual presentation of analytical report must exist.
12).Unlimited dimensions and aggregation levels: This depends on the kind of business, where multiple dimensions and
defining hierarchies can be made.
Multidimensional structure: "A variation of the relational model that uses multidimensional structures to organize data
and express the relationships between data".
Multidimensional: MOLAP
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
MOLAP stores data in optimized multidimensional array storage, rather than in a relational database. It therefore
requires the pre-computation and storage of information in the cube, an operation known as processing.
MOLAP analytical operations:-
Consolidation: involves the aggregation of data such as roll-ups or complex expressions involving interrelated data. For
example, branch offices can be rolled up to cities and rolled up to countries.
Drill-Down: is the reverse of consolidation and involves displaying the detailed data that comprises the consolidated
data.
Slicing and dicing: refers to the ability to look at the data from different viewpoints. Slicing and dicing is often
performed along a time axis in order to analyze trends and find patterns.
Multi relational OLAP: ROLAP
ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables
and new tables are created to hold the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality.
Comparison:
MOLAP implementations are prone to database explosion when there is a high number of dimensions, many
pre-calculated results, and sparse multidimensional data, all of which require large amounts of storage space.
MOLAP generally delivers better query performance because of its indexing and storage optimizations.
MOLAP may also need less storage space than ROLAP because the specialized storage typically includes compression
techniques.
ROLAP is generally more scalable. However, large-volume pre-processing is difficult to implement efficiently, so it is
frequently skipped; ROLAP query performance can therefore suffer.
Because ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions
it can use.
A chart comparing capabilities of these two classes of OLAP tools.
The area of the circle implies data size.
Fig: - OLAP style comparison
1. MOLAP
MOLAP architectures enable excellent performance when the data is utilized as designed, and predictable application
response times for applications addressing a narrow breadth of data for a specific DSS requirement.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited to MOLAP technology
(e.g., financial analysis and budgeting). Examples include Arbor Software's Essbase and Oracle's Express Server.
First, there are limitations in the ability of the data structures to support multiple subject areas of data (a common trait of
many strategic DSS applications) and the detail data required by many analysis applications. This has begun to be
addressed in some products, utilizing basic "reach through" mechanisms that enable the MOLAP tools to access detail
data maintained in an RDBMS.
MOLAP products require a different set of skills and tools for the database administrator to build and maintain the
database, thus increasing the cost and complexity of support.
Hybrid solutions have as their primary characteristic the integration of specialized multidimensional data storage with
RDBMS technology, providing users with a facility that tightly "couples" the multidimensional data structures (MDDSs)
with data maintained in an RDBMS.
This approach can be very useful for organizations with performance — sensitive multidimensional analysis
requirements and that have built, or are in the process of building, a data warehouse architecture that contains multiple
subject areas.
E.g., aggregations by product and sales region can be stored and maintained in a persistent structure. These structures can
be automatically refreshed at predetermined intervals established by an administrator.
2. ROLAP
ROLAP is the fastest growing style of OLAP technology, with new vendors (e.g., Sagent Technology) entering the
market at an accelerating pace. These products work directly against relational data through a dictionary layer of
metadata, bypassing any requirement for creating a static multidimensional data structure.
This enables multiple multidimensional views of the two-dimensional relational tables to be created without the need to
structure the data around the desired view.
Some of the products in this segment have developed strong SQL-generation engines to support the complexity of
multidimensional analysis.
While flexibility is an attractive feature of ROLAP products, there are products in this segment that recommend, or
require, the use of highly denormalized database designs (e.g., star schema).
Some products (e.g., Andyne's Pablo) that have their roots in ad hoc query have developed features to provide "datacube"
and "slice and dice" analysis capabilities. This is achieved by first developing a query to select data from the DBMS,
which then delivers the requested data to the desktop, where it is placed into a data cube. This data cube can be stored and
maintained locally to reduce the overhead required to create the structure each time the query is executed.
Once the data is in the data cube, users can perform multidimensional analysis (i.e., slice, dice, and pivot operations)
against it. The simplicity of the installation and administration of such products makes them particularly attractive to
organizations looking to provide seasoned users with more sophisticated analysis capabilities, without the significant cost
and maintenance of more complex products.
While this mechanism allows each user the flexibility to build a custom data cube, the lack of data consistency among
users and the relatively small amount of data that can be efficiently maintained are significant challenges facing tool
administrators. Examples include Cognos Software's PowerPlay, Andyne Software's Pablo, Business Objects' Mercury
Project, Dimensional Insight's CrossTarget, and Speedware's Media.
Course code 1151CS114: Data Warehousing and Data Mining (L-T-P-C: 3-0-0-3)
UNIT-III
Data Mining
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable
patterns (models) in data.
Data mining techniques support automatic searching of data; they try to uncover patterns and trends in the data and also
derive rules from these patterns, which help the user to support, review, and examine decisions in a related business or
scientific area.
Data mining refers to extracting or mining knowledge from large amounts of data. Data mining and
knowledge discovery in databases is a new interdisciplinary field, merging ideas from statistics, machine learning,
databases, and parallel computing.
Fig:2 - Data mining — searching for knowledge (interesting patterns) in your data
KDD: Knowledge Discovery in Databases (KDD) was formalized as the process of seeking knowledge from data.
Fayyad et al. distinguish between KDD and data mining by giving the following definitions:
Knowledge Discovery in Databases (KDD) is the process of identifying a valid, novel, potentially useful, and ultimately
understandable structure in data. This process involves selecting or sampling data from a data warehouse, cleaning or
pre-processing it, transforming or reducing it, applying a data mining component to produce a structure, and then
evaluating the derived structure.
Data mining is a step in the KDD process concerned with the algorithmic means by which patterns or structures are
enumerated from the data under acceptable, computational efficiency limitations.
Steps in KDD process:
Data cleaning: It is the process of removing noise and inconsistent data.
Data integrating: It is the process of combining data from multiple sources.
Data selection: It is the process of retrieving relevant data from the databases.
Data transformation: In this process, data are transformed or consolidated into forms
suitable for mining by performing summary or aggregation operations.
Data mining: This is an essential process where intelligent methods are applied in order to extract data patterns.
Pattern evaluation: The patterns obtained in the data mining stage are converted into
knowledge based on some interestingness measures.
Knowledge presentation: Visualization and knowledge representation techniques are used to present the mined
knowledge to the user.
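The steps above can be mapped onto a tiny end-to-end workflow. The following is a hedged sketch in Python (pandas);
the tables, column names, and the trivial "mining" step are all invented for illustration and are not part of the original
notes.

# Hedged sketch mapping the KDD steps onto a tiny pandas workflow.
import pandas as pd

raw = pd.DataFrame({"customer_id": [1, 1, 2, 2, 3, 3],
                    "amount":      [120.0, 80.0, 15.0, None, 640.0, 50.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 52, 41]})

clean = raw.dropna(subset=["amount"])                          # data cleaning: drop noisy/missing rows
merged = clean.merge(customers, on="customer_id")              # data integration: combine sources
selected = merged[["customer_id", "age", "amount"]]            # data selection: keep relevant attributes
summary = (selected.groupby("customer_id")                     # data transformation: aggregate per customer
           .agg(total_spend=("amount", "sum"), age=("age", "first")))

# "Data mining" step: a deliberately trivial pattern search (top-spending customers).
pattern = summary[summary["total_spend"] >= summary["total_spend"].quantile(0.5)]

# Pattern evaluation / knowledge presentation: keep and report the interesting part.
print(pattern.sort_values("total_spend", ascending=False))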
Architecture of Data Mining System
Major components.
Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data,
based on the user's data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into
different levels of abstraction. Knowledge such as user beliefs, thresholds, and metadata can also be used to assess a
pattern's interestingness.
Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation analysis, classification, cluster analysis, evolution analysis, and
outlier analysis.
Pattern evaluation module: This component uses interestingness measures and interacts with the data mining modules
so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered
patterns. Alternatively, the pattern evaluation module may be integrated with the mining module.
Graphical user interface: This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a task or data mining query for performing exploratory data mining based on
intermediate data mining results.
This module allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns,
and visualize the patterns in different forms such as maps, charts, etc.
Data mining should be applicable to any kind of information repository. This includes
Flat files
Relational databases,
Data warehouses,
Transactional databases,
Advanced database systems,
World-Wide Web.
Flat files: Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to
be applied. The data in these files can be transactions, time series data, scientific measurements, etc.
Relational databases: A relational database is a collection of tables. Each table consists of a set of attributes (columns or
fields) and a set of tuples (records or rows). Each tuple is identified by a unique key and is described by a set of attribute
values. Entity relationships (ER) data model is often constructed for relational databases. Relational data can be accessed
by database queries written in a relational query language.
e.g Product and market table
Data warehouse:
A data warehouse is a repository of information collected from multiple sources, stored
under a unified scheme residing on a single site.
A data warehouse is formed by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema.
Data warehouse is formed by data cubes. Each dimension is an attribute and each cell represents the aggregate measure.
A data warehouse collects information about subjects that cover an entire organization whereas data mart focuses on
selected subjects. The multidimensional data view makes Online Analytical Processing (OLAP) easier.
Transactional databases: A transactional database consists of a file where each record represents a transaction. A
transaction includes transaction identity number, list of items, date of transactions etc.
Advanced databases:
Object oriented databases: Object oriented databases are based on the object-oriented programming concept. Each entity
is considered as an object which encapsulates data and code into a single unit; similar objects are grouped into a class.
Object-relational databases: Object relational databases are constructed based on an object-relational data model, which
extends the basic relational data model by handling complex data types, class hierarchies, and object inheritance.
Spatial databases: A spatial database stores a large amount of space-related data, such as maps, preprocessed remote
sensing or medical imaging data and VLSI chip layout data. Spatial data may be represented in raster format, consisting
of n-dimensional bit maps or pixel maps.
Temporal Databases, Sequence Databases, and Time-Series Databases
A temporal database typically stores relational data that include time-related attributes.
A sequence database stores sequences of ordered events, with or without a concrete notion of time, e.g., customer
shopping sequences and Web click streams.
A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly,
daily, weekly). E.g stock exchange, inventory control, observation of temperature and wind.
Text databases and multimedia databases: Text databases contain word descriptions of objects, such as long sentences
or paragraphs, warning messages, summary reports, etc. A text database consists of a large collection of documents from
various sources. Data stored in most text databases are semi-structured data.
A multimedia database stores and manages a large collection of multimedia objects such as audio data, image, video,
sequence and hypertext data.
Heterogeneous databases and legacy databases:
A heterogeneous database consists of a set of interconnected, autonomous component databases. The components
communicate in order to exchange information and answer queries.
A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational
or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file
systems.
The heterogeneous databases in a legacy database may be connected by intra or intercomputer networks.
The World Wide Web: The World Wide Web and its associated distributed information services, such as Yahoo!,
Google, America Online, and AltaVista, provide worldwide, online
information services. Capturing user access patterns in such distributed information environments is called Web usage
mining or Weblog mining.
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data
mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Users often have no idea which kinds of patterns in their data may be interesting, so a data mining system should be able
to search for several different kinds of patterns in parallel.
Data characterization: It is a summarization of the general characteristics of a class (target class) of data. The data
related to the user specified class are collected by a database query. Several methods like OLAP roll up operation and
attribute-oriented induction technique are used for effective data summarization and characterization. The output of data
characterization can be presented in various forms such as pie charts, bar charts, curves, multidimensional cubes, and
multidimensional tables. The resulting descriptions can also be presented as generalized relations or in rule form, called
characteristic rules.
Data discrimination is a comparison of the general features of target class data objects with the general features of
objects from one or a set of contrasting classes. The output of data discrimination can be presented in the same manner as
data characterization. Discrimination descriptions expressed in rule form are referred to as discriminant rules. E.g the user
may like to compare the general features of software products whose sales increased by 10% in the last year with those
whose sales decreased by at least 30% during the same period
Association analysis.
single-dimensional association rule.
A marketing manager of AllElectronics may want to find which items are frequently purchased together within the same
transactions. An example of such a rule, mined from the AllElectronics transactional database, is
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a
single predicate are referred to as single-dimensional association rules. Dropping the predicate notation, the above rule
can be written simply as "computer => software [1%, 50%]".
Multidimensional association rule.
Consider “AllElectronics” relational database relating to purchases.
A data mining system may find association rules like
age(X, "20...29") ∧ income(X, "20K...29K") => buys(X, "CD player") [support = 2%, confidence = 60%]
The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years
of age with an income of 20,000 to 29,000 and have purchased a CD player at AllElectronics. There is a 60% probability
that a customer in this age and income group will purchase a CD player. Note that this is an association between more
than one attribute, or predicate (i.e., age, income, and buys).
Classification
Classification is the process of finding a model that describes and distinguishes data classes, so that the model can be
used to predict the class of objects whose class label is unknown. The derived model may be represented in forms such as
classification rules, decision trees, or neural networks.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch
represents an outcome of the test, and the tree leaves represent classes or class distributions. A neural network is typically
a collection of neuron-like processing units with weighted connections between the units.
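As a concrete illustration of the decision-tree model form, the following is a minimal sketch using scikit-learn; the tiny
(age, income) -> buys dataset and the feature names are invented for illustration.

# Minimal sketch of training and inspecting a small decision tree classifier.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30000], [45, 60000], [35, 40000], [50, 80000], [23, 20000], [40, 70000]]
y = ["no", "yes", "no", "yes", "no", "yes"]          # class labels

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each internal node tests an attribute, each branch is a test outcome,
# and each leaf holds a class, matching the flow-chart structure described above.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 65000]]))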
Prediction
Prediction models continuous-valued functions; it is used to predict missing or unavailable numerical data values.
Prediction covers both numeric prediction and class label prediction. Regression analysis is a statistical methodology that
is often used for numeric prediction. Prediction also includes the identification of distribution trends based on the
available data.
Clustering Analysis
Clustering analyzes data objects without consulting a known class label. Clusters can be grouped based on the principle
of maximizing the intra-class similarity and minimizing the interclass similarity. Clustering is a method of grouping data
into different groups, so that data in each group share similar trends and patterns. The objectives of clustering are
* To uncover natural groupings
* To initiate hypotheses about the data
* To find consistent and valid organization of the data
5 .Outlier Analysis
A database may contain data objects that do not comply with the general model or behaviour of the data.
These data objects are called outliers. Most data mining methods discard outliers as noise or exceptions, but in
applications such as credit card fraud detection, cell phone cloning fraud, and the detection of suspicious activities, the
rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining. Outliers may be detected using statistical tests, distance measures, or deviation-based methods.
6. Evolution Analysis
Data evolution analysis describes and models regularities or trends of objects whose behaviour changes over time.
Evolution analysis is normally used to predict future trends for effective decision making. It may include
characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-
related data, time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis. E.g.,
stock market (time-series) data of the last several years, available from the New York Stock Exchange, may be mined for
regularities that help an investor decide whether to invest in shares of high-tech industrial companies.
Objective interestingness measures: These are based on the structure of discovered patterns and the statistics underlying
them. An objective measure for association rules of the form X => Y is rule support, representing the percentage of
transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X ∪ Y), where
X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.
Another objective measure for association rules is confidence, which measures the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X
also contains Y. More formally, support and confidence are defined as
support(X => Y) = P(X ∪ Y)
confidence(X => Y) = P(Y | X)
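As a worked illustration of these two measures, the following is a minimal Python sketch that computes them over a
small transaction database; the items and transactions are invented for illustration.

# Computing support and confidence for the rule milk => bread on invented data.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
    {"milk", "butter"},
]

X, Y = {"milk"}, {"bread"}
n = len(transactions)

support_X  = sum(X <= t for t in transactions) / n          # P(X): fraction containing X
support_XY = sum((X | Y) <= t for t in transactions) / n    # P(X ∪ Y): fraction containing both X and Y
confidence = support_XY / support_X                         # P(Y | X)

print(f"support(milk => bread)    = {support_XY:.2f}")      # 3/5 = 0.60
print(f"confidence(milk => bread) = {confidence:.2f}")      # 3/4 = 0.75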
For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting. Rules below
this threshold likely reflect noise, exceptions, or minority cases and are probably of less value.
Subjective interestingness measures: These are based on user beliefs about the data. These measures find patterns
interesting if the patterns are unexpected (contradicting a user's belief) or offer strategic information on which the user
can act. Patterns that are expected can also be interesting if they confirm a hypothesis that the user wished to validate or
resemble a user's hunch.
Can a data mining system generate all of the interesting patterns? This refers to the completeness of a data mining
algorithm. It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns.
Can a data mining system generate only interesting patterns? This is an optimization problem in data mining. It is highly
desirable for data mining systems to generate only the interesting patterns, because then neither users nor the system
would have to search through the generated patterns in order to identify the truly interesting ones.
Data mining is an interdisciplinary field that merges a set of disciplines, including database systems, statistics, machine
learning, visualization, and information science.
Depending on the data mining approach used, techniques from other disciplines may be
applied, such as
o neural networks,
o fuzzy and/or rough set theory,
o knowledge representation,
o inductive logic programming,
o high-performance computing.
Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from
o spatial data analysis,
o information retrieval,
o pattern recognition,
o image analysis,
o signal processing,
o computer graphics,
o Web technology,
o economics,
o business,
o bioinformatics,
o psychology
Data mining systems can be categorized according to various criteria, as follows:
Classification according to the kinds of databases mined: Database systems can be
classified according to different criteria, and each may require its own data mining technique. For example, if classifying
according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If
classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, or
multimedia data mining system, or a World Wide Web mining system.
Classification according to the kinds of knowledge mined:
o This classification is based on data mining functionalities, such as characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
A complete data mining system usually provides multiple and/or integrated data mining
functionalities.
o Moreover, data mining systems can be distinguished based on the granularity or levels of abstraction of the knowledge
mined, including generalized knowledge, primitive-level knowledge, or knowledge at multiple levels.
o An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.
o Data mining systems can also be categorized as those that mine data regularities (commonly occurring patterns) versus
those that mine data irregularities (such as exceptions, or outliers).
In general, concept description, association and correlation analysis, classification, prediction, and clustering mine data
regularities, rejecting outliers as noise. These methods may also help detect outliers.
Classification according to the kinds of techniques utilized: Data mining systems can be described according to the
degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or
the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning,
statistics, visualization, pattern recognition, neural networks, and so on).
The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is
interested. This includes the database attributes or data warehouse dimensions of interest
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful
for guiding the knowledge discovery process and for evaluating the patterns found.
Concept hierarchies (shown in Fig 2) are a popular form of background knowledge, which allow data to be mined at
multiple levels of abstraction.
The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or,
after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness
measures.
The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns
are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with
data mining systems. This facilitates a data mining system’s communication with other information systems and its
integration with the overall information processing environment.
When a DM system works in an environment that requires it to communicate with other information system components,
such as DB and DW systems, possible integration schemes include
No coupling,
Loose coupling,
Semi tight coupling,
Tight coupling
No coupling: means that a DM system will not utilize any function of a DB or DW system. It may fetch data from a file
system, process data using some data mining algorithms, and then store the mining results in another file.
Drawbacks.
First, a DB system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and processing
data. Without using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting,
cleaning, and transforming data. In DB/DW systems, data tend to be well organized, indexed, cleaned, integrated, or
consolidated, so that finding the task-relevant, high-quality data becomes an easy task.
Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems. Without any
coupling of such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a
system into an information processing environment. Thus, no coupling represents a poor design.
Loose coupling: means that a DM system will use some facilities of a DB or DW system, fetching data from a data
repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a
designated place in a database or data warehouse.( In computing and systems design a loosely coupled system is
one in which each of its components has, or makes use of, little or no knowledge of the definitions of other
separate components. Subareas include the coupling of classes, interfaces, data, and services. Loose
coupling is the opposite of tight coupling. )
Advantages: Loose coupling is better than no coupling because it can fetch any portion of data stored in databases or data
warehouses by using query processing, indexing, and other system facilities.
Drawbacks : However, many loosely coupled mining systems are main memory-based. Because mining does not explore
data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to
achieve high scalability and good performance with large data sets.
Semitight coupling: means that besides linking a DM system to a DB/DW system, efficient implementations of a few
essential data mining primitives can be provided in the DB/DW system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of
some essential statistical measures, such as sum, count, max, min, standard deviation, and so on.
Moreover, some frequently used intermediate mining results can be precomputed and stored in the DB/DW system.
Tight coupling: means that a DM system is smoothly integrated into the DB/DW system. This approach is highly
desirable because it facilitates efficient implementations of data mining functions, high system performance, and an
integrated information processing environment.
Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and
query processing methods of a DB or DW system. With advances in technology, DM, DB, and DW systems will
integrate together as one information system with multiple functionalities. This will provide a uniform information processing
environment.
Mining methodology and user interaction issues: These reflect the kinds of knowledge mined, the ability to mine
knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.
Mining different kinds of knowledge in databases: Data mining should cover a wide spectrum of data analysis and
knowledge discovery tasks, including data characterization, discrimination, association, classification, prediction,
clustering, and outlier analysis.
Interactive mining of knowledge at multiple levels of abstraction: The data mining process
should be interactive. Interactive mining allows users to focus the search for patterns, providing and refining data mining
requests based on returned results.
Incorporation of background knowledge: Background knowledge may be used to guide the discovery process and
allow discovered patterns to be expressed in concise terms and at different levels of abstraction.
Data mining query languages and ad hoc mining: Relational query languages (such as SQL) allow users to pose ad hoc
queries for data retrieval; in the same way, high-level data mining query languages are needed to allow users to describe
ad hoc data mining tasks.
Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level
languages, visual representations, or other expressive forms and directly usable by humans.
Handling noisy or incomplete data: When mining data regularities, noisy or incomplete data objects may confuse the
process, causing the knowledge model constructed to overfit the data.
Pattern evaluation--the interestingness problem: A data mining system can uncover thousands of patterns. Many of the
patterns discovered may be uninteresting to the given user, representing common knowledge or lacking newness.
Performance issues:
Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount of data in
databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the wide distribution of data,
and the computational complexity of some data mining methods are factors motivating the development of algorithms
that divide data into partitions that can be processed in parallel.
Data Preprocessing :-
Data in the real world is dirty.
o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“ ”
o noisy: containing errors or outliers
e.g., Salary=“-10”
o inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Binning
o first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, or bin boundaries
Regression
o smooth by fitting the data into regression functions
Clustering
o detect and remove outliers
Combined computer and human inspection
o detect suspicious values and check by human (e.g., deal with possible outliers)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
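The following is a minimal Python sketch of equal-depth (equal-frequency) binning of this sorted price data, followed by
smoothing by bin means and by bin boundaries; the choice of a bin depth of 4 (three bins) is an assumption.

# Equal-depth binning and smoothing of the sorted price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4                                              # 3 bins of 4 values each

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
# -> [4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]

smoothed_by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
# -> bin means 9.0, 22.75, 29.25; each value is replaced by its bin mean

smoothed_by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
                          for b in bins]
# -> each value is replaced by the closest bin boundary (min or max of its bin)

print(smoothed_by_means)
print(smoothed_by_boundaries)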
Cluster Analysis
III. Data Integration and Transformation
Data integration:
o Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id and B.cust-no refer to the same attribute
o Integrate metadata from different sources
Entity identification problem:
o Identify real world entities from multiple data sources,
o e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
o For the same real world entity, attribute values from different sources are different
o Possible reasons: different representations, different scales,
e.g., metric vs. British units
Correlation analysis (categorical data): the Χ2 (chi-square) statistic is computed over the contingency table of the two
attributes as Χ2 = Σ (observed count - expected count)^2 / expected count.
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected
count
Correlation does not imply causality
o The number of hospitals and the number of car thefts in a city are correlated
o Both are causally linked to a third variable: population
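As a hedged illustration of the Χ2 test, the following sketch uses SciPy on an invented 2x2 contingency table of two
nominal attributes (the counts and attribute names are assumptions).

# Chi-square test of correlation between two nominal attributes (invented counts).
from scipy.stats import chi2_contingency

#                    plays_chess  does_not_play
observed = [[250, 200],     # likes_science_fiction
            [50, 1000]]     # does_not_like

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
# A large chi2 (and small p) means the two attributes are unlikely to be independent;
# the cells that differ most from `expected` contribute most to the statistic.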
Data Transformation
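A minimal sketch of two transformations commonly applied at this stage, min-max scaling and z-score standardization;
the income values are invented, and these particular methods are assumptions rather than a list taken from the notes.

# Min-max scaling and z-score standardization of a numeric attribute.
values = [73600, 54000, 32000, 98000, 61000]

lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]          # rescaled to [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscore = [(v - mean) / std for v in values]              # mean 0, unit variance

print(minmax)
print(zscore)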
Data reduction
Data reduction necessity
o A database/data warehouse may store terabytes of data
o Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
o Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or
almost the same) analytical results
2. Dimensionality Reduction: Wavelet Transformation
Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
Method:
o Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
o Each transform has 2 functions: smoothing, difference
o Applies to pairs of data, resulting in two sets of data of length L/2
o Applies two functions recursively, until reaches the desired length
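A minimal pure-Python sketch of this pyramid procedure using the Haar wavelet (averaging and differencing pairs, then
recursing on the smoothed half); the input series is invented for illustration.

# Hierarchical Haar wavelet transform: smoothing (averages) and difference coefficients.
def haar_dwt(data):
    n = 1
    while n < len(data):
        n *= 2
    data = list(data) + [0.0] * (n - len(data))          # pad length to a power of 2

    coeffs = []
    while len(data) > 1:
        smooth = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = detail + coeffs      # keep the difference coefficients of each level
        data = smooth                 # recurse on the smoothed half-length series
    return data + coeffs              # overall average followed by the details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# Small detail coefficients can be dropped (set to 0) to obtain a lossy compressed approximation.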
3. Data Compression
String compression
o There are extensive theories and well-tuned algorithms
o Typically lossless
o But only limited manipulation is possible without expansion
Audio/video compression
o Typically lossy compression, with progressive refinement
o Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
Time sequence is not audio
o Typically short and vary slowly with time
4. Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods
o Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
o Example: Log-linear models: obtain the value at a point in m-D space as the product of values on
appropriate marginal subspaces
Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling
Histograms
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
o Equal-width: equal bucket range
o Equal-frequency (or equal-depth)
o V-optimal: with the least histogram variance (weighted sum of the original values that
each bucket represents)
o MaxDiff: set bucket boundaries between adjacent pairs of values having the β-1 largest differences
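A minimal NumPy sketch of the equal-width and equal-frequency (equal-depth) partitioning rules above, on a small list
of invented prices.

# Equal-width vs. equal-frequency histogram buckets for a small price list.
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30])

# Equal-width: 3 buckets covering equal value ranges.
width_edges = np.linspace(prices.min(), prices.max(), 4)
width_counts, _ = np.histogram(prices, bins=width_edges)

# Equal-frequency (equal-depth): 3 buckets holding roughly the same number of values.
depth_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
depth_counts, _ = np.histogram(prices, bins=depth_edges)

print("equal-width edges:", width_edges, "counts:", width_counts)
print("equal-depth edges:", depth_edges, "counts:", depth_counts)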
Clustering
Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and
diameter) only
Can be very effective if data is clustered but not if data is “dirty”
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms.
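A minimal scikit-learn sketch of clustering used as numerosity reduction, storing only each cluster's centroid and size
instead of the raw points; the points themselves are invented.

# Clustering as data reduction: keep centroids and cluster sizes only.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],      # one natural group
                   [8.0, 8.0], [8.5, 7.9], [7.8, 8.2]])     # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Reduced representation: 2 centroids plus cluster sizes instead of 6 raw points.
sizes = np.bincount(km.labels_)
print(km.cluster_centers_, sizes)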
Sampling
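A minimal pandas sketch of sampling as a data reduction technique (simple random sampling with and without
replacement, and a stratified sample); the DataFrame and its columns are invented for illustration.

# Simple random and stratified sampling of a toy customer table.
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 1001),
                   "region": ["north", "south", "east", "west"] * 250})

srswor = df.sample(n=100, replace=False, random_state=1)   # simple random sample without replacement
srswr  = df.sample(n=100, replace=True,  random_state=1)   # simple random sample with replacement

# Stratified sample: take 10% from every region so small strata are not missed.
stratified = df.groupby("region").sample(frac=0.1, random_state=1)

print(len(srswor), len(srswr), len(stratified))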
Discretization and concept hierarchy generation
Discretization:
o Divide the range of a continuous attribute into intervals
o Some classification algorithms only accept categorical attributes.
o Reduce data size by discretization
o Prepare for further analysis
o Reduce the number of values for a given continuous attribute by dividing the range of the attribute into
intervals
o Interval labels can then be used to replace actual data values
o Supervised vs. unsupervised
o Split (top-down) vs. merge (bottom-up)
o Discretization can be performed recursively on an attribute
Concept hierarchy formation
o Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as young, middle-aged, or senior)
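A minimal pandas sketch of discretizing a continuous age attribute into intervals and then replacing the numbers with
higher-level concept labels; the ages, bin edges, and labels are invented for illustration.

# Discretization of age and mapping to a concept hierarchy level.
import pandas as pd

ages = pd.Series([23, 31, 45, 52, 38, 67, 29, 70])

intervals = pd.cut(ages, bins=[0, 30, 55, 120])                      # interval labels
concepts  = pd.cut(ages, bins=[0, 30, 55, 120],
                   labels=["young", "middle-aged", "senior"])        # higher-level concepts

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))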
UNIT-IV
Association rules are if-then statements that help to show the probability of relationships between data items
within large data sets in various types of databases. Association rule mining has a number of applications
and is widely used to help discover sales correlations in transactional data or in medical data sets.
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in
transaction databases, relational databases, and other information repositories
Applications – basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
Association Rule: Basic Concepts
• Given a database of transactions, where each transaction is a list of items (purchased by a customer in a visit)
• Find all rules that correlate the presence of one set of items with that of another set of items
• Find frequent patterns
• A typical example of frequent itemset mining is market basket analysis
Association rule performance measures
• Support
• Confidence
• Minimum support threshold
• Minimum confidence threshold
1.Frequent patterns are patterns (such as item sets, subsequences, or substructures) that appear in a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent
itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a
shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
Frequent itemset mining is used to find associations and correlations among items in large
transactional or relational data sets. With large amounts of data continuously collected and stored, many industries are
interested in mining such patterns from their databases. This can help in many business decision-making processes, such
as catalogue design, cross marketing, and customer shopping behaviour analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that customers place in their
“shopping baskets” (Figure 5.1).
87
The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items
are frequently purchased together by customers. For example, if customers are buying milk, how many of
them also buy bread on the same trip to the supermarket? Such information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
This is also known, simply, as the frequency, support count, or count of the itemset.
Note that the itemset support defined in Equation is sometimes referred to as relative support, whereas the occurrence
frequency is called the absolute support.
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called
Strong Association Rules.
1.Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined
minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support
and minimum confidence.
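As a hedged sketch of this second step, the following Python code generates strong rules from frequent itemsets once
their support counts are known; the itemsets, counts, number of transactions, and confidence threshold are invented for
illustration.

# Generating strong association rules from frequent itemsets and their support counts.
from itertools import combinations

support = {                       # absolute support counts for frequent itemsets
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}
n_transactions, min_conf = 9, 0.7

for itemset, count in support.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = count / support[lhs]                   # P(rhs | lhs)
            if conf >= min_conf:
                rhs = itemset - lhs
                print(f"{set(lhs)} => {set(rhs)} "
                      f"[support={count / n_transactions:.2f}, confidence={conf:.2f}]")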
Closed Itemsets : An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same
support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S.
Maximal frequent itemset: An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there
exists no super-itemset Y such that X ⊂ Y and Y is frequent in S.
Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined: The following can be mined based on the Completeness of
patterns.
Frequent itemsets, Closed frequent itemsets, Maximal frequent itemsets,
Constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints),
Approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent
itemsets),
Near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets),
Top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k),
Frequent itemset mining: mining of frequent itemsets (sets of items) from transactional or relational data sets.
Sequential pattern mining: searches for frequent subsequences in a sequence data set
Structured pattern mining: searches for frequent substructures in a structured dataset.
Mining Methods
Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean
association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent
itemset properties.
Apriori uses an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of
the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property, presented below, is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
1.The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
The Apriori Algorithm: Basics
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean
association rules.
Key Concepts :
• Frequent Itemsets: The sets of items that have minimum support (denoted by Li for the i-itemset).
• Apriori Property: Any subset of a frequent itemset must be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
The Apriori Algorithm Steps
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate association rules.
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Step 2: Generating 2-itemset Frequent Pattern
To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 Join L1 to generate a candidate set of 2-
itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as
shown in the middle table).
The set of frequent 2-itemsets, L2 , is then determined, consisting of those candidate 2-itemsets in C2 having
minimum support.
Note: We haven’t used Apriori Property yet.
Step 3: Generating 3-itemset Frequent Pattern
The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori Property.
In order to find C3, we compute L2 Join L2.
94
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now, Join step is complete and Prune step will be used to reduce the size of C3.
Prune step helps to avoid heavy computation due to large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four
latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of
{I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed.
Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to
remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidates 3-itemsets in C3
having minimum support.
Step 4: Generating 4-itemset Frequent Pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3,
I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori
algorithm.
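A rough Python sketch of the level-wise join, prune and count steps described above (only an illustration; the transaction list used is an assumption, chosen to be consistent with the example):

from itertools import combinations

def apriori(transactions, min_sup_count):
    # L1: frequent 1-itemsets and their support counts
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup_count}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: merge frequent (k-1)-itemsets whose union has exactly k items
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    candidates.add(union)
        # Prune step: every (k-1)-subset of a surviving candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database once to count the remaining candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_sup_count}
        frequent.update(L)
        k += 1
    return frequent

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]   # assumed data
print(apriori(D, min_sup_count=2))

With min_sup_count = 2 this returns the frequent itemsets with their counts; under the assumed data the largest ones are the 3-itemsets {I1, I2, I3} and {I1, I2, I5}, consistent with the walk-through above.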
Generating association rules from the frequent itemsets (back to the example):
L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3},
{I1,I2,I5}}.
Let's take l = {I1,I2,I5}.
Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because
these are the only ones generated that are strong.
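Since the individual rules and their confidences are not reproduced here, the sketch below derives them for l = {I1, I2, I5} from a set of support counts; only the counts of I1, I2 and I5 appear in these notes, so the 2-itemset counts used are illustrative assumptions:

from itertools import combinations

sup = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
       frozenset(["I1","I2"]): 4, frozenset(["I1","I5"]): 2, frozenset(["I2","I5"]): 2,
       frozenset(["I1","I2","I5"]): 2}   # assumed support counts

def rules_from_itemset(l, sup, min_conf):
    l = frozenset(l)
    for r in range(1, len(l)):                      # every nonempty proper subset as antecedent
        for antecedent in map(frozenset, combinations(l, r)):
            conf = sup[l] / sup[antecedent]         # confidence = sup(l) / sup(antecedent)
            if conf >= min_conf:
                yield antecedent, l - antecedent, conf

for a, b, conf in rules_from_itemset({"I1", "I2", "I5"}, sup, min_conf=0.70):
    print(set(a), "=>", set(b), round(conf, 2))

With these assumed counts, exactly three rules reach the 70% confidence threshold (each with 100% confidence), consistent with the statement that only three of the generated rules are strong.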
Improving the efficiency of Apriori:
Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot
be frequent.
Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness.
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
• Apriori Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Apriori Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans, where m is the size of the longest frequent itemset.
4.Mining Frequent Itemsets without Candidate Generation
Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method
A divide-and-conquer methodology: decompose mining tasks into smaller ones
Avoid candidate generation: sub-database test only!
Consider the same previous example of a database, D , consisting of 9 transactions.
Suppose min. support count required is 2 (i.e. min_sup = 2/9 = 22 % )
The first scan of database is same as Apriori, which derives the set of 1-itemsets &
their support counts.
The set of frequent items is sorted in the order of descending support count.
The resulting set is denoted as L = {I2:7, I1:6, I3:6, I4:2, I5:2}
FP-Growth Method: Construction of FP-Tree
First, create the root of the tree, labeled with “null”.
Scan the database D a second time. (First time we scanned it to create 1-itemset and then L).
The items in each transaction are processed in L order (i.e. sorted order).
A branch is created for each transaction with items having their support count separated by colon.
Whenever the same node is encountered in another transaction, we just increment the support count of the
common node or Prefix.
To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree
via a chain of node-links.
Now, The problem of mining frequent patterns in database is transformed to that of mining the FP-Tree.
Steps:
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base which consists of the set of prefix paths in the FP-Tree co-occurring with suffix
pattern.
3. Then, Construct its conditional FP-Tree & perform mining on such a tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the frequent patterns generated from a
conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemset.
Pseudo Code
Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment
growth.
Input:
• D, a transaction database;
• min sup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed in the following steps:
(a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support
count descending order as L, the list of frequent items.
(b) Create the root of an FP-tree, and label it as “null.” For each transaction Trans in D
do the following.
Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p | P],
where p is the first element and P is the remaining list. Call insert_tree([p | P], T), which is performed as follows. If T has a
child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be
1, let its parent link be linked to T, and let its node-link be linked to the nodes with the same item-name via the node-link structure. If P is
nonempty, call insert_tree(P, N) recursively.
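As a rough illustration of the insert_tree procedure above (only a sketch, not the full FP-growth algorithm), the following Python fragment builds an FP-tree from transactions that are assumed to be already sorted in L order:

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}        # item -> child FPNode
        self.node_link = None     # next node carrying the same item (header-table chain)

def insert_tree(items, node, header):
    # items is the L-ordered frequent-item list [p | P] of one transaction
    if not items:
        return
    p, rest = items[0], items[1:]
    if p in node.children:                    # shared prefix: just increment the count
        node.children[p].count += 1
    else:                                     # new branch: create node N and link it into the header chain
        child = FPNode(p, node)
        node.children[p] = child
        child.node_link, header[p] = header.get(p), child
    insert_tree(rest, node.children[p], header)

root, header = FPNode(None, None), {}
for trans in [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"]]:   # hypothetical, L-ordered
    insert_tree(trans, root, header)

The header table then gives, for each item, a chain of node-links that the mining step (conditional pattern bases and conditional FP-trees) can traverse.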
Mining Various Kinds of Association Rules
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel
association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. A
top-down strategy is used, where counts are collected for the calculation of frequent itemsets at each concept level,
starting at the concept level 1 and working downward in the hierarchy towards the specific concept levels, until no more
frequent itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as
Apriori or its variations.
Using uniform minimum support for all levels (referred to as uniform support):
The same minimum support threshold is used when mining at each level of abstraction. For example, in the following figure, a
minimum support threshold of 5% is used throughout (e.g., for mining from "computer" down to "laptop computer"). Both
"computer" and "laptop computer" are found to be frequent, while "desktop computer" is not. The method is also
simple in that users are required to specify only one minimum support threshold. An Apriori-like optimization technique can be
used, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing
any item whose ancestors do not have minimum support.
Using reduced minimum support at lower levels (referred to as reduced support):
Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the
corresponding threshold is. For example, in Figure, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.
When mining multilevel rules, users often know which items or groups are more important than others, so it is useful to set
up user-specific, item-based, or group-based minimum support thresholds.
For example, a user could set the minimum support thresholds based on product price or on items of interest, such as low
support thresholds for laptop computers and flash drives, so that association patterns containing items in these categories can be found.
Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association
rules. For example, a rule with three predicates (age, occupation, and buys), each of which occurs only once,
has no repeated predicates. Multidimensional association rules with no repeated predicates are called inter
dimensional association rules.
We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of
some predicates. These rules are called hybrid dimensional association rules.
An example of such a rule is one where the predicate buys is repeated, e.g., a rule whose left-hand side tests age and one purchased item and whose right-hand side tests another purchased item.
Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized. Here we consider 2-D quantitative association rules of the form
Aquan1 ^ Aquan2 => Acat,
where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and Acat tests a categorical attribute from the task-relevant data. Such rules have been referred to as two-dimensional quantitative association rules, because they contain
two quantitative dimensions.
An example of such a 2-D quantitative association rule is a rule that predicts buys(X, "HDTV") from intervals of the quantitative attributes age and income; such rules can be mined by mapping pairs of quantitative attributes onto a 2-D grid.
Finding frequent predicate sets: Once the 2-D array containing the count distribution for each category is set up, it can
be scanned to find the frequent predicate sets (those satisfying minimum support) that also satisfy minimum confidence.
Strong association rules can then be generated from these predicate sets, using a rule generation algorithm.
Clustering the association rules: The strong association rules obtained in the Previous step are then mapped to a 2-D
grid. Following figure shows a 2-D grid for 2-D quantitative association rules predicting the condition buys (X, “HDTV”)
on the rule right-hand side, given the quantitative attributes age and income.
The four Xs in the grid correspond to four rules of this form for adjacent age and income intervals.
The four rules can be combined or "clustered" together to form a single simpler rule that subsumes and replaces them,
covering the merged age and income intervals.
Correlation Analysis (correlation - relationship)
Strong Rules Are Not Necessarily Interesting: An Example
Suppose we analyse transactions at an AllElectronics shop with respect to the purchase of computer games and videos. Let game refer to the
transactions containing computer games, and video refer to those containing videos.
Of the 10,000 transactions analyzed, 6,000 of the customer transactions included computer games, while 7,500 included
videos, and 4,000 included both computer games and videos.
If a minimum support of 30% and a minimum confidence of 60% are given, then the following association rule is discovered:
buys(X, "computer games") => buys(X, "videos") [support = 40%, confidence = 66%]
The above rule is a strong association rule since its support value of 4,000/10,000 = 40% and confidence value of
4,000/6,000 = 66% satisfy the minimum support and minimum confidence thresholds, respectively. However, the rule is
misleading because the probability of purchasing videos is 75%, which is even larger than 66%. The example also
illustrates that the confidence of a rule A => B can be misleading in that it is only an estimate of the conditional
probability of itemset B given itemset A. It does not measure the real strength of the correlation and implication between
A and B.
That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets
A and B. Lift is a simple correlation measure. The occurrence of itemset A is independent of the occurrence of itemset B if
P(A ∪ B) = P(A)P(B); otherwise, A and B are dependent and correlated. The lift between the occurrences of A and B is defined as
lift(A, B) = P(A ∪ B) / (P(A) P(B)).
If the resulting value of above Equation is less than 1, then the occurrence of A is negatively
correlated with the occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies
the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.
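A quick check of the example above with lift (the figures are those given in the text):

n_total, n_game, n_video, n_both = 10000, 6000, 7500, 4000
p_game, p_video, p_both = n_game / n_total, n_video / n_total, n_both / n_total

lift = p_both / (p_game * p_video)   # P(game and video) / (P(game) * P(video))
print(round(lift, 2))                # 0.89, i.e. less than 1

Since the value is below 1, buying computer games and buying videos are negatively correlated, confirming that the strong rule above is misleading.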
A data mining process may uncover thousands of rules, many of which are uninteresting to the user. A good practice is to have the user
specify constraints to limit the search space. This strategy is known as constraint-based mining.
The constraints can include the following:
Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the concept
hierarchies, to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support,
confidence, and correlation.
Rule constraints: These specify the form of rules to be mined. Such constraints may be
expressed as metarules.
Metarule-guided mining.
E.g. Consider a market analyst for AllElectronics who has access to data describing customers (such as customer age, address, and credit rating)
as well as the list of customer transactions. The analyst is interested in finding associations between customer traits and the items that customers buy.
However, rather than finding all of the association rules, the analyst would like to know only which pairs of customer traits promote the sale of office
software.
An example of such a metarule is
P1(X, Y) ^ P2(X, W) => buys(X, "office software") (1)
where P1 and P2 are variables that are instantiated to attributes from the given database during the mining process, X is a
variable representing a customer, and Y and W take on values of the attributes assigned to P1 and P2, respectively.
The data mining system can then search for rules that match the given metarule. For instance, Rule (2) matches or
complies with Metarule (1).
age(X, "30…39") ^ income(X, "41K…60K") => buys(X, "office software") (2)
Basic Concepts
What Is Classification? What Is Prediction?
Databases are rich with hidden information that can be used for intelligent decision making.
Classification and prediction are two forms of data analysis that can be used to extract models describing important data
classes or to predict future data trends. Whereas classification predicts categorical (discrete) labels, prediction models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction
model to predict the expenditures of customers on computer equipment given their income and occupation.
A bank loans officer needs analysis of her data in order to learn which loan applicants are
“safe”and which are “risky” for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy
a new computer.
A medical researcher wants to analyze breast cancer data in order to predict which one of
three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where
a model or classifier is constructed to predict categorical labels, such as “safe” or “risky” for the loan application data;
“yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data.
Suppose the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics.
This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued
function, or ordered value, as opposed to a categorical label. This model is a predictor. Regression analysis is a statistical
methodology that is most often used for numeric prediction, hence the two terms are often used synonymously.
Classification and numeric prediction are the two major types of prediction problems.
Here the term prediction is used to refer to numeric prediction. How does classification work? Data classification is a two-step process, as
shown for the loan application data of Figure 1.
Issues: Data Preparation
Data cleaning
o Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
o Remove the irrelevant or redundant attributes
Data transformation
o Generalize and/or normalize data
Decision tree induction is the learning of decision trees from class-labelled training tuples. A decision tree is a flowchart-
like tree structure, where each internal node (non leaf node)
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (or terminal node) holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by
rectangles, and leaf nodes are denoted by ovals.
Decision trees are used for classification- Given a tuple, X, for which the associated class label is unknown, the attribute
values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class
prediction for that tuple. Decision trees can easily be converted to classification rules.
“Why are decision tree classifiers so popular?”
The construction of decision tree classifiers does not require any domain knowledge
Decision trees can handle high dimensional data.
The learning and classification steps of decision tree induction are simple and fast.
Decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application areas, such as medicine,
manufacturing and production, financial analysis, astronomy, and molecular biology.
Algorithm
Basic algorithm (a greedy algorithm)
o Tree is constructed in a top-down recursive divide-and-conquer manner
o At start, all the training examples are at the root
o Attributes are categorical (if continuous-valued, they are discretized in advance)
o Examples are partitioned recursively based on selected attributes
o Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
o Conditions for stopping partitioning
o All samples for a given node belong to the same class
o There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
o There are no samples left
2. Attribute Selection Measures
o An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of
class-labelled training tuples into individual classes.
o If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition
would be pure.
o Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are
to be split.
1). Information gain
Information gain is defined as the difference between the original information
requirement (i.e., based on just the proportion of classes) and the new requirement (i.e.,
obtained after partitioning on A). That is,
Gain(A) = Info(D) − InfoA(D),
where Info(D) = −Σi pi log2(pi) is the expected information (entropy) needed to classify a tuple in D, and
InfoA(D) = Σj (|Dj| / |D|) × Info(Dj) is the expected information required after partitioning D on attribute A.
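A minimal Python sketch of these quantities (the attribute values and class labels used are hypothetical):

from math import log2
from collections import Counter

def info(labels):
    # Info(D) = -sum p_i * log2(p_i) over the class proportions in D
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    # Gain(A) = Info(D) - InfoA(D), where InfoA(D) weights each partition by |Dj|/|D|
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    expected = sum(len(part) / len(labels) * info(part) for part in partitions.values())
    return info(labels) - expected

rows   = [("youth",), ("youth",), ("senior",), ("senior",), ("senior",)]   # one attribute
labels = ["no", "no", "yes", "yes", "no"]                                  # class labels
print(round(info_gain(rows, labels, 0), 3))

The attribute with the highest information gain is chosen as the splitting attribute at the current node.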
2). Gain ratio
C4.5 (a successor of ID3) uses gain ratio, a normalization of information gain, to overcome the bias of information gain toward attributes with many values:
Gain Ratio(A) = Gain(A) / SplitInfo(A), where SplitInfoA(D) = −Σj (|Dj| / |D|) log2(|Dj| / |D|).
3. Gini index
The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of D, a data
partition or set of training tuples, as
Gini(D) = 1 − Σi pi²,
where pi is the probability that a tuple in D belongs to class Ci.
E.g. When evaluating candidate binary splits on the income attribute, the split on the subset {medium, high} yields a Gini index of 0.30, which is the lowest, and is therefore the best split.
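A small sketch of the Gini computation for a candidate binary split (the class labels are hypothetical):

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions in D
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # weighted Gini index of a binary split D -> (D1, D2)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(round(gini_split(["yes", "yes", "no"], ["no", "no", "no", "yes"]), 3))

The candidate split with the lowest weighted Gini index is selected.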
Tree Pruning
Overfitting: An induced tree may overfit the training data
o Too many branches, some may reflect differences due to noise or outliers
o Poor accuracy for unseen samples
Two approaches to avoid overfitting
o Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
o Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal
decision making against which other methods can be measured
Practical difficulty: requires initial knowledge of many probabilities, and involves significant computational cost
Naïve Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <=30, Income = medium, Student = yes , Credit_rating = Fair)
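The full AllElectronics training table is not reproduced here, so the following is only a sketch of how a naïve Bayesian classifier scores each class for the sample X, using a tiny hypothetical training set with the same four attributes:

from collections import defaultdict

def naive_bayes_scores(train, x):
    # returns P(X|Ci) * P(Ci) for each class, assuming class-conditional independence
    n = len(train)
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)
    scores = {}
    for label, rows in by_class.items():
        prior = len(rows) / n
        likelihood = 1.0
        for i, value in enumerate(x):
            matches = sum(1 for r in rows if r[i] == value)
            likelihood *= matches / len(rows)      # P(x_i | Ci), no smoothing in this sketch
        scores[label] = prior * likelihood
    return scores

train = [(("<=30",  "high",   "no",  "fair"),      "no"),    # hypothetical tuples:
         (("<=30",  "medium", "yes", "fair"),      "yes"),   # (age, income, student, credit_rating)
         (("31..40","high",   "no",  "fair"),      "yes"),
         ((">40",   "medium", "yes", "excellent"), "no")]
X = ("<=30", "medium", "yes", "fair")
print(naive_bayes_scores(train, X))   # the class with the larger score is predicted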
o Advantages
Easy to implement
Good results obtained in most of the cases
o Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
Training Bayesian belief networks. If the network structure is known and some variables are hidden, a gradient descent
(greedy hill-climbing) method, analogous to neural network learning, can be used.
If the network structure is unknown and all variables are observable, search through the
model space to reconstruct the network topology.
If the structure is unknown and all variables are hidden, no good algorithms are known for this purpose.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set D, let ncovers be
the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and |D| be the number of
tuples in D. We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D| and accuracy(R) = ncorrect / ncovers.
e.g. Consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify both tuples. Therefore, coverage(R1)
= 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%. (See table.)
IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using
a sequential covering algorithm. In this the rules are learned sequentially (one at a time), where each rule for a given class
will ideally cover many of the tuples of that class (and none of the tuples of other classes). Algorithm: Sequential
covering. Learn a set of IF-THEN rules for classification.
Input:
D, a data set class-labeled tuples;
Att vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn_ One_ Rule(D, Att_ vals, c);
(5) remove tuples covered by Rule from D;
(6) until terminating condition;
(7) Rule_ set = Rule_ set +Rule; // add new rule to rule set
(8) endfor
(9) return Rule_ Set;
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class
label of the input tuples
Also referred to as connectionist learning due to the connections between units
For Example
The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping.
A Multilayer Feed-Forward Neural Network
Working process of Multilayer Feed-Forward Neural Network
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the
network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a
previous layer
From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough
training samples, they can closely approximate any function
Backpropagation
Iteratively process a set of training tuples & compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's
prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the
first hidden layer, hence “backpropagation”
Steps
o Initialize weights (to small random #s) and biases in the network
o Propagate the inputs forward (by applying activation function)
o Backpropagate the error (by updating weights and biases)
o Terminating condition (when error is very small, etc.)
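A very small numpy sketch of these steps for a one-hidden-layer network with sigmoid units (the data, layer sizes and learning rate are all arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                      # 4 training tuples, 3 input attributes
y = np.array([[0.0], [1.0], [1.0], [0.0]])  # known target values

W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros((1, 4))   # small random initial weights
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(1000):
    # propagate the inputs forward
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagate the error (squared-error gradient; sigmoid derivative is out*(1-out))
    err_out = (out - y) * out * (1 - out)
    err_h = (err_out @ W2.T) * h * (1 - h)
    # update weights and biases in the backwards direction
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ err_h;    b1 -= lr * err_h.sum(axis=0, keepdims=True)

print(out.round(2))   # network predictions after training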
Backpropagation and Interpretability
Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and
w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case.
Rule extraction from networks: network pruning
o Simplify the network structure by removing weighted links that have the least effect on the trained network
o Then perform link, unit, or activation value clustering
o The set of input and activation values are studied to derive rules describing the relationship between the input
and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from
this analysis can be represented in rules
Let the data D be (X1, y1), (X2, y2), …, (X|D|, y|D|), where Xi is a training tuple and yi is its associated class label.
There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one.
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
A separating hyperplane can be written as
W . X+b = 0;
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1x1 +w2x2 = 0:
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
127
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors.
Finding the maximum marginal hyperplane becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear
constraints, which can be solved using quadratic programming (QP) and Lagrangian multipliers.
That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls
on or below H2 belongs to class -1. Combining the two inequalities of above two Equations
we get
yi (w0 + w1x1 + w2x2) ≥ 1, for all i.
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the “sides” defining the margin) satisfy above Equation and
are called support vectors.
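For illustration only (this is not from the notes; it uses the scikit-learn library, assumed to be available), a linear-kernel SVM can be trained on a small toy set and the tuples it selects as support vectors inspected:

import numpy as np
from sklearn.svm import SVC

# toy, linearly separable data: class -1 in the lower-left region, class +1 in the upper-right
X = np.array([[1, 1], [1, 2], [2, 1], [3, 3], [3, 4], [4, 3]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates the hard-margin case
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # the learned W and b of the hyperplane W.X + b = 0
print(clf.support_vectors_)         # training tuples lying on the margin hyperplanes H1/H2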
2. The Case When the Data Are Linearly inseparable
Associative classification. Association rules show strong associations between attribute-value pairs (or items) that occur frequently in a given data
set. Such analysis is useful in many decision-making processes, such as product placement, catalog design, and cross-
marketing. Association rules are mined in a two-step process: frequent itemset mining, and rule generation.
The first step searches for patterns of attribute-value pairs that occur repeatedly in a data set, where each attribute-value
pair is considered an item. The resulting attribute value pairs form frequent itemsets.
The second step analyses the frequent itemsets in order to generate association rules.
Advantages
o It explores highly confident associations among multiple attributes and may overcome some constraints by decision-
tree induction, which considers only one attribute at a time
o It is more accurate than some traditional classification methods, such as C4.5
Classification: Based on evaluating a set of rules in the form of
p1 ^ p2 … ^ pi => Aclass = C (confidence, support)
CPAR uses an algorithm for classification known as FOIL (First Order Inductive Learner). FOIL builds rules to
differentiate positive tuples ( having class buys computer = yes) from negative tuples (such as buys computer = no).
For multiclass problems, FOIL is applied to each class. That is, for a class, C, all tuples of class C are considered
positive tuples, while the rest are considered negative tuples. Rules are generated to differentiate C tuples from all others.
Each time a rule is generated, the positive samples it satisfies (or covers) are removed until all the positive tuples in the
data set are covered.
CPAR relaxes this step by allowing the covered tuples to remain under consideration, but reducing their weight. The
process is repeated for each class. The resulting rules are merged to form the classifier rule set.
Eager learners
Decision tree induction, Bayesian classification, rule-based classification, classification by backpropagation, support
vector machines, and classification based on association rule mining—are all examples of eager learners.
Eager learners - when given a set of training tuples, will construct a classification model before receiving new tuples to
classify.
Lazy Learners
o In a lazy approach, when given a training tuple, a lazy learner simply stores it (or does only a little minor processing) and
waits until it is given a test tuple. Only when it sees the test tuple does it perform generalization in order to classify the tuple based on
its similarity to the stored training tuples.
o Lazy learners do less work when a training tuple is presented and more work when making a classification or
prediction. Because lazy learners store the training tuples or "instances," they are also referred to as instance-based
learners, even though all learning is essentially based on instances.
Examples of lazy learners:
k-nearest neighbour classifiers
case-based reasoning classifiers
Nearest-neighbour classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar
to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional
space. All of the training tuples are stored in an n-dimensional pattern space.
When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that
are closest to the unknown tuple. These k training tuples are the k “nearest neighbours” of the unknown tuple.
“Closeness” is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is
dist(X1, X2) = sqrt( Σi (x1i − x2i)² ).
For k-
nearest-neighbour classification, the unknown tuple is assigned the most common class among its k nearest neighbours.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
Nearest-neighbour classifiers can also be used for prediction, that is, to return a real-valued prediction for a given
unknown tuple. In this case, the classifier returns the average value of the real-valued labels associated with the k nearest
neighbours of the unknown tuple.
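A minimal sketch of both uses (the training tuples are hypothetical):

from collections import Counter
from math import dist                 # Euclidean distance (Python 3.8+)

def knn_classify(train, query, k=3):
    # train: list of (attribute_tuple, label); majority class of the k nearest neighbours
    neighbours = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def knn_predict(train, query, k=3):
    # numeric prediction: average of the real-valued labels of the k nearest neighbours
    neighbours = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return sum(label for _, label in neighbours) / k

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.5), "B"), ((3.2, 3.0), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "A" with this toy data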
Case-Based Reasoning (CBR)
Case-based reasoning classifiers use a database of problem solutions to solve new problems. CBR stores the tuples or
“cases” for problem solving as complex symbolic descriptions. e.g Medical education - where patient case histories and
treatments are used to help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is
found, then the associated solution to that case is returned. If no identical case is found, then the case-based reasoner will
search for training cases having components that are similar to those of the new case.
Ideally, these training cases may be considered as neighbours of the new case. If cases are represented as graphs, this
involves searching for subgraphs that are similar to subgraphs within the new case. The case-based reasoner tries to
combine the solutions of the neighbouring training cases in order to propose a solution for the new case.
Genetic Algorithms: An initial population is created, consisting of randomly generated rules, each encoded as a string of bits.
o If an attribute has k > 2 values, k bits can be used to encode the attribute’s values
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their
offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves when each rule in P satisfies a pre-specified threshold
Slow but easily parallelizable
Predictions (Numeric prediction / Regression)
1. Linear Regression
2. Nonlinear Regression
3. Other Regression-Based Methods
Numeric prediction is the task of predicting continuous values for given input, e.g., to predict the salary of an employee with
10 years of work experience, or the potential sales of a new product.
An approach for numeric prediction is regression, a statistical methodology. Regression analysis can be used to model the
relationship between one or more independent or predictor variables and a dependent or response variable (which is
continuous-valued).
The predictor variables are the attributes of the tuple. In general, the values of the predictor variables are known. The
response variable is unknown so predict it.
The line can be modeled as y = w0 + w1 x, and the regression coefficients can be estimated by the method of least squares:
w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² and w0 = ȳ − w1 x̄,
where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
Example. Straight-line regression using the method of least squares. The table shows a set of paired data where x is the number
of years of work experience of an employee and y is the corresponding salary of the employee.
The 2-D data can be graphed on a scatter plot, as in Figure. The plot suggests a linear relationship between the two
variables, x and y.
We model the relationship between salary and the number of years of work experience with the equation y = w0 + w1 x.
Given the above data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these values into the equations for w1 and w0, we obtain the estimated coefficients and hence the equation of the least squares line.
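The data table itself is not reproduced here, so the sketch below fits the least squares line on hypothetical (x, y) pairs, chosen so that their means are 9.1 and 55.4 as in the example:

def least_squares(xs, ys):
    # fit y = w0 + w1*x by the method of least squares
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    w1 = num / den
    w0 = y_bar - w1 * x_bar
    return w0, w1

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # years of experience (assumed values)
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]    # salary in $1000s (assumed values)
w0, w1 = least_squares(xs, ys)
print(round(w0, 1), round(w1, 1))                # intercept and slope of the fitted line
print(round(w0 + w1 * 10, 1))                    # predicted salary for 10 years of experience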
2. Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into linear regression model.
For example,
Other functions, such as power function, can also be transformed to linear model.
Some models are intractably nonlinear (e.g., a sum of exponential terms); it is still
possible to obtain least squares estimates through extensive calculation on more complex formulae.
Regression and model trees tend to be more accurate than linear regression when
the data are not represented well by a simple linear model
UNIT V
1. Define Clustering?
Clustering is a process of grouping the physical or conceptual data object into clusters.
• Clustering is used in biology to develop new plant and animal taxonomies.
• Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize the customer groups on the basis of purchasing patterns.
• Clustering is used in the identification of groups of automobile insurance policy customers.
• Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
• Clustering is used to classify documents on the web for information discovery.
5.What are the different types of data used for cluster analysis?
The different types of data used for cluster analysis are interval scaled, binary, nominal, ordinal and
ratio scaled data.
7. Define Binary variables? And what are the two types of binary variables?
A binary variable has only two states, 0 and 1: when the state is 0 the variable is absent, and when
the state is 1 the variable is present. There are two types of binary variables,
symmetric and asymmetric. Symmetric binary variables are those whose two states are equally valuable and carry the same
weight. Asymmetric binary variables are those whose two states are not equally important and do not carry the same weight.
13. What is CURE?
Clustering Using Representatives is called CURE. Most clustering algorithms work well only on spherical clusters of
similar size. CURE overcomes this problem and is more robust with
respect to outliers.
Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to
overcome the drawbacks of the CURE method. In this method two clusters are merged if the interconnectivity between
the two clusters is greater than the interconnectivity between the objects within each cluster.
Density based method deals with arbitrary shaped clusters. In density-based method, clusters are formed on the
basis of the region where the density of the objects is high.
In the grid-based method, objects are represented by a multiresolution grid data structure. All the objects are quantized into a
finite number of cells and the collection of cells builds the grid structure of objects. The clustering operations are
performed on that grid structure. This method is widely used because its processing time is very fast and is
independent of the number of objects.
Statistical Information Grid is called as STING; it is a grid based multi resolution clustering method. In STING
method, all the objects are contained into rectangular cells, these cells are kept into various levels of resolutions and
these levels are arranged in a hierarchical structure.
WaveCluster applies wavelet transformation to the feature space for finding dense regions. Each grid cell summarizes the
information of the group of objects that map into the cell. A wavelet transformation is a signal processing technique that
decomposes a signal into various frequency sub-bands.
22. What are the reasons for not using the linear regression model to estimate the output data?
There are many reasons. One is that the data do not fit a linear model. It is also possible that the data
do actually represent a linear model, but the linear model generated is poor because noise or outliers exist
in the data. Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.
23. What are the two approaches used by regression to perform classification?
Regression can be used to perform classification using the following approaches.
Course Code: 1151CS114    Course Title: DATA WAREHOUSING AND DATA MINING    L T P C: 3 0 0 3
UNIT-V
Cluster Analysis - Types of Data – Categorization of Major Clustering Methods - K- means – Partitioning Methods –
Hierarchical Methods - Outlier Analysis – Data Mining Applications – Social Impacts of Data Mining – Mining WWW -
Mining Text Database – Mining Spatial Databases - Case Studies (Simulation Tool).
Cluster. A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
Clustering. The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
Cluster analysis has wide applications, including market or customer segmentation, pattern recognition,
biological studies, spatial data analysis, Web document classification, etc.
Pattern Recognition
Spatial Data Analysis
o Create thematic maps in Geographical information system by clustering feature
spaces
o Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
World Wide Web
o Document classification
o Cluster Weblog data to discover groups of similar access patterns
Clustering Applications - Marketing , Land use, Insurance, City-planning , Earth- quake studies.
Types of Data in cluster analysis
Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents,
countries, and so on. The two data structures are used.
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also
called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a
relational table, or n-by-p matrix (n objects × p variables):
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all
pairs of n objects. It is often represented by an n-by-n table:
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j)
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal
ratio, and vector variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define “similar enough” or “good enough”
o the answer is typically highly subjective.
Interval-scaled variables are continuous measurements of a roughly linear scale. Examples -weight and height,
latitude and longitude coordinates and weather temperature.
After standardization, or without standardization in certain applications, the dissimilarity or similarity between the
objects described by interval-scaled variables is typically computed based on the distance between each pair of
objects.
1). The most popular distance measure is Euclidean distance, which is defined as
d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + … + (xip − xjp)² ),
while the Manhattan (city block) distance is defined as
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|.
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance
function: d(i, j) ≥ 0, d(i, i) = 0, d(i, j) = d(j, i), and d(i, j) ≤ d(i, h) + d(h, j).
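A minimal sketch of the two distance measures on objects described by interval-scaled variables (the two example objects are hypothetical):

def euclidean(x, y):
    # d(i, j) = sqrt(sum_k (x_k - y_k)^2)
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    # d(i, j) = sum_k |x_k - y_k|
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((1, 2), (3, 5)))   # about 3.61
print(manhattan((1, 2), (3, 5)))   # 5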
2. Binary variables
Dissimilarity between Binary Variables:
A categorical variable is a generalization of the binary variable in that it can take on more
than two states. For example, map colour is a categorical variable that may have, say, five
states: red, yellow, green, pink, and blue.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
d(i, j) = (p − m) / p,
where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is
the total number of variables.
Consider only the object-identifier and the test-1 column, which contains the categorical variable. Applying the above equation to each pair of objects gives the dissimilarity matrix.
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has
Mf states. These ordered states define the ranking 1,
….., Mf .
o From the above table, consider only the object-identifier and the continuous ordinal variable, test-2.
There are three states for test-2, namely fair, good, and excellent, that is, Mf = 3.
o For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3,
respectively.
o Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
o For step 3, we can use, say, the Euclidean distance (Equation (7.5)), which results in the following dissimilarity
matrix:
Ratio-scaled variables make a positive measurement on a nonlinear scale, such as an approximately exponential scale AeBt or Ae−Bt,
where A and B are positive constants, and t typically represents time. E.g., the growth of a bacteria population, or the
decay of a radioactive element.
One method to handle ratio-scaled variables when computing the dissimilarity between objects is to apply a
logarithmic transformation to the ratio-scaled variable, yif = log(xif).
o This time, from the above table, consider only the object-identifier and the ratio-scaled variable, test-3.
o Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1
to 4, respectively.
o Using the Euclidean distance on the transformed values, we obtain the following dissimilarity
matrix:
A database may contain all the six types of variables
o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects
5. Vector objects: For vector objects such as keyword vectors of documents, a commonly used similarity measure is the cosine measure,
s(x, y) = (x · y) / (||x|| ||y||), where x and y are two vectors.
Categorization of Major Clustering Methods
Clustering is a dynamic field of research in data mining. Many clustering algorithms have been developed. These
can be categorized into (i).Partitioning methods, (ii).hierarchical methods,(iii). density-based methods, (iv).grid-
based methods, (v).model-based methods, (vi).methods for high-dimensional data, and (vii), constraint based
methods.
A partitioning method first creates an initial set of k partitions, where parameter k is the number of partitions to
construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects
from one group to another. Typical partitioning methods include k-means, k-medoids, CLARANS, and their
improvements.
A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can be
classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical
decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration
can be improved by analyzing object linkages at each hierarchical partitioning (such as in ROCK and Chameleon),
or by first performing microclustering (that is, grouping objects into “microclusters”) and then operating on the
microclusters with other clustering techniques, such as iterative relocation (as in BIRCH).
A density-based method clusters objects based on the notion of density. It either grows clusters according to the
density of neighborhood objects (such as in DBSCAN) or according to some density function (such as in
DENCLUE). OPTICS is a density-based method that generates an augmented ordering of the clustering structure of
the data.
A grid-based method first quantizes the object space into a finite number of cells that form a grid structure, and
then performs clustering on the grid structure. STING is a typical example of a grid-based method based on
statistical information stored in grid cells. WaveCluster and CLIQUE are two clustering algorithms that are both grid
based and density-based.
A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to that model.
Examples of model-based clustering include the EM algorithm (which uses a mixture density model), conceptual
clustering (such as COBWEB), and neural network approaches (such as self-organizing feature maps).
Clustering high-dimensional data is of vital importance, because in many advanced applications, data objects such
as text documents and microarray data are high- dimensional in nature. There are three typical methods to handle
high dimensional data sets: dimension-growth subspace clustering, represented by CLIQUE, dimension-
reduction projected clustering, represented by PROCLUS, and frequent pattern–based clustering, represented by
pCluster.
A constraint-based clustering method groups objects based on application-dependent or user-specified constraints.
Typical examples include clustering with the existence of obstacle
objects, clustering under user-specified constraints, and semi-supervised clustering based on "weak" supervision
(such as pairs of objects labeled as belonging to the same or different clusters).
One person’s noise could be another person’s signal. Outlier detection and analysis are very useful for fraud
detection, customized marketing, medical analysis, and many other tasks. Computer-based outlier analysis methods
typically follow either a statistical distribution-based approach, a distance-based approach, a density-based local
outlier detection approach, or a deviation-based approach.
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects
into k partitions (k ≤ n), where each partition represents a cluster. The commonly used partitioning methods are (i).
k-means, (ii). k-medoids.
o k-means. where each cluster’s center is represented by the mean value of the
objects in the cluster. i.e Each cluster is represented by the center of the cluster.
o Algorithm
Input:
k: the number of clusters,
D: a data set containing n objects.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean
value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
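A minimal Python sketch of steps (1)-(5) above on hypothetical 2-D points (plain k-means, no library):

import random

def k_means(points, k, max_iter=100):
    centers = random.sample(points, k)                     # (1) arbitrary initial cluster centers
    for _ in range(max_iter):
        # (3) (re)assign each object to the cluster whose mean is closest
        assign = [min(range(k),
                      key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
                  for p in points]
        # (4) update each cluster mean
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new_centers.append(centers[c] if not members else
                               (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
        if new_centers == centers:                         # (5) until no change
            break
        centers = new_centers
    return centers, assign

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (1.2, 0.8), (8.5, 9.0)]
print(k_means(pts, k=2))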
o The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially
distort the distribution of the data.
o k-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be
used, which is the most centrally located object in a cluster.
The K-Medoids Clustering Methods
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, Orandom;
(5) compute the total cost, S, of swapping representative object, Oj, with Orandom;
(6) if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
(7) until no change;
CLARA (Clustering LARge Applications) - Sampling based method
PAM works efficiently for small data sets but does not scale well for large data sets.
Built into statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the
output
Strength: deals with larger data sets than PAM
Weakness:
o Efficiency depends on the sample size
o A good clustering based on samples will not necessarily represent a good clustering of the whole data
set if the sample is biased
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering
methods can be further classified as either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting)
fashion.
Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and
then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until
certain termination conditions are satisfied..
Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering
by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object
forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is
obtained.
Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected
component forms a cluster.
BIRCH: Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
o Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
o Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF- tree
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
o A nonleaf node in a tree has descendants or “children”
o The nonleaf nodes store sums of the CFs of their children
ROCK example: the Jaccard coefficient can be misleading for categorical (market basket) data. Consider two clusters of transactions, C1 over the items {a, b, c, d, e} and
C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}.
o The Jaccard coefficient may lead to a wrong clustering result: within C1 it ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d}),
while across C1 and C2 it could be as high as 0.5 ({a, b, c}, {a, b, f}).
o Jaccard coefficient-based similarity function: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|.
158
o Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}. Then sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2, while sim(T1, T3) = |{a, b}| / |{a, b, c, f}| = 2/4 = 0.5.
Major features:
o Discover clusters of arbitrary shape
o Handle noise
o One scan
o Need density parameters as termination condition
Methods: (1) DBSCAN, (2) OPTICS, (3) DENCLUE
1) .DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High
Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density based clustering
algorithm.
The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary
shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
Density-reachability and density connectivity.
Consider the figure, for a given ε represented by the radius of the circles, and MinPts = 3.
Labeled points m, p, o, and r are core objects because each is in an ε-neighbourhood containing at least three
points.
q is directly density-reachable from m. m is directly density-reachable from p and vice versa.
q is (indirectly) density-reachable from p because q is directly density-reachable from
m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a
core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.
o, r, and s are all density-connected.
DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created.
DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may
involve the merge of a few density-reachable clusters. The process terminates when no new point can be added
to any cluster.
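For illustration only (this is not from the notes; it uses the scikit-learn library, assumed to be available), DBSCAN can be run with eps playing the role of the ε radius and min_samples the role of MinPts:

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical 2-D points: two dense groups plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
              [4.5, 15.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points in the same cluster share a label; -1 marks noise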
OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. The cluster ordering can
be used to extract basic clustering information such as cluster centers or arbitrary-shaped clusters, as well as provide
the basic clustering structure.
Fig : OPTICS terminology.
Core-distance and reachability-distance.
For example, the figure above shows the reachability plot for a simple two-dimensional data set, which presents a general
overview of how the data are structured and clustered. The data objects are plotted in cluster order (horizontal axis)
together with their respective reachability-distance (vertical axis). The three Gaussian “bumps” in the plot reflect
three clusters in the data set.
3). DENCLUE (DENsity-based CLUstEring) Clustering Based on
Density Distribution Functions
DENCLUE is a clustering method based on a set of density distribution functions. The method is built on the
following ideas:
(1) the influence of each data point can be formally modeled using a mathematical function called an influence
function, which describes the impact of a data point within its neighborhood;
(2) the overall density of the data space can be modeled analytically as the sum of the influence functions of all data points; and
(3) clusters can then be determined mathematically by identifying density attractors, which are local maxima of the overall density function (a toy sketch of this density computation follows).
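A toy sketch of ideas (1) and (2), summing a Gaussian influence function over all data points to obtain the overall density at a location (plain numpy; the bandwidth sigma is an assumed parameter, and this is not a full DENCLUE implementation):

import numpy as np

def gaussian_influence(x, d, sigma=1.0):
    # Influence of data point d at location x (Gaussian kernel), idea (1).
    return np.exp(-np.sum((x - d) ** 2) / (2 * sigma ** 2))

def overall_density(x, data, sigma=1.0):
    # Overall density at x = sum of the influences of all data points, idea (2).
    return sum(gaussian_influence(x, d, sigma) for d in data)

rng = np.random.default_rng(3)
data = rng.normal(0, 1, (200, 2))
print(overall_density(np.array([0.0, 0.0]), data))   # near a density attractor
print(overall_density(np.array([6.0, 6.0]), data))   # far from the data, density close to 0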
Advantages: DENCLUE has a solid mathematical foundation, handles data sets with large amounts of noise well, and allows a compact description of arbitrarily shaped clusters.
Grid-Based Methods
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into
rectangular cells. These cells form a hierarchical structure. Each cell at a high level is partitioned to form a
number of cells at the next lower level.
Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level
cells.
These parameters include:
o an attribute-independent parameter: count;
o attribute-dependent parameters: mean, stdev (standard deviation), min, max;
o the attribute's type of distribution, such as normal, uniform, exponential, or none.
When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-
level cells are calculated directly from the data.
The value of distribution may either be assigned by the user, if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ² (chi-square) test.
The type of distribution of a higher-level cell can be computed based on the majority of distribution types
of its corresponding lower-level cells in conjunction with a threshold filtering process.
If the distributions of the lower level cells disagree with each other and fail the threshold test, the
distribution type of the high-level cell is set to none.
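A small sketch of how parent-cell statistics can be derived from child-cell statistics without rescanning the data (plain numpy; it assumes one numeric attribute per cell and a flat list of four child cells, both of which are illustrative simplifications):

import numpy as np

def cell_stats(points):
    # Statistics stored per bottom-level cell: count, mean, stdev, min, max (one attribute).
    a = np.asarray(points, dtype=float)
    return {"count": len(a), "mean": a.mean(), "stdev": a.std(), "min": a.min(), "max": a.max()}

def merge_cells(cells):
    # Parent-cell parameters computed purely from child-cell parameters.
    n = sum(c["count"] for c in cells)
    mean = sum(c["count"] * c["mean"] for c in cells) / n
    # E[X^2] per child = stdev^2 + mean^2; combine, then subtract the parent mean^2.
    ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2) for c in cells) / n
    return {"count": n, "mean": mean, "stdev": float(np.sqrt(ex2 - mean ** 2)),
            "min": min(c["min"] for c in cells), "max": max(c["max"] for c in cells)}

children = [cell_stats(np.random.default_rng(i).normal(i, 1, 100)) for i in range(4)]
print(merge_cells(children))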
Advantages: the grid-based computation is query-independent, the grid structure facilitates parallel processing and incremental updating, and the method is highly efficient.
Model-Based Clustering Methods
Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model.
Such methods are often based on the assumption that the data are generated by a mixture of underlying probability
distributions.
Typical methods
o Statistical approach
EM (Expectation maximization), AutoClass
o Machine learning approach
COBWEB, CLASSIT
o Neural network approach
SOM (Self-Organizing Feature Map)
(i) Statistical approach: EM (Expectation Maximization)
o Objects are assigned to clusters according to weights (membership probabilities); objects whose scores place them in the same mixture component belong to the same cluster
o Expectation step: assign each object to a cluster with a probability based on the current model parameters
o Maximization step: re-estimate the model parameters so as to maximize the likelihood of the data
o The algorithm converges quickly but may not reach the global optimum (a minimal sketch using a Gaussian mixture follows)
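A minimal sketch of EM-based mixture clustering, assuming scikit-learn; GaussianMixture runs the expectation and maximization steps internally, and the data and n_components are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# n_init > 1 restarts EM from several initializations to reduce the risk of
# converging to a poor local optimum (EM does not guarantee the global optimum).
gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
labels = gm.predict(X)              # hard assignment: component with the highest score
probs = gm.predict_proba(X)[:5]     # soft membership probabilities used during the E-step
print(labels[:5], probs.round(3))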
Conceptual clustering
o A form of clustering in machine learning
o Produces a classification scheme for a set of unlabeled objects
o Finds characteristic description for each concept (class)
COBWEB (Fisher, 1987)
o A popular and simple method of incremental conceptual learning
o Creates a hierarchical clustering in the form of a classification tree
o Each node refers to a concept and contains a probabilistic description of that concept
Working method:
o For a given new object, COBWEB decides where to incorporate it into the classification tree. To do this, COBWEB descends the tree along an appropriate path, updating counts along the way, in search of the "best host" or node at which to classify the object.
o If the object does not really belong to any of the concepts represented in the tree, it may be better to create a new node for it. The object is then placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value.
Limitations of COBWEB
o The assumption that the attributes are independent of each other is often too strong because correlation may
exist
o Not suitable for clustering large database data – skewed tree and expensive probability distributions
CLASSIT
o an extension of COBWEB for incremental clustering of continuous data
o suffers similar problems as COBWEB
Clustering High-Dimensional Data
Partition the data space and find the number of points that lie inside each cell of the partition (a toy sketch of this counting step follows the list below).
Identify the subspaces that contain clusters using the Apriori principle.
Identify clusters:
o Determine dense units in all subspaces of interest
o Determine connected dense units in all subspaces of interest
Generate a minimal description for the clusters:
o Determine maximal regions that cover a cluster of connected dense units for each cluster
o Determine a minimal cover for each cluster
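A toy sketch of the counting step for one 2-D subspace (plain Python; the cell width and density threshold are assumed values, not part of the method as stated above):

from collections import Counter
import numpy as np

rng = np.random.default_rng(5)
points = rng.normal(0, 1, (300, 2))          # toy data in a 2-D subspace (e.g., age, salary)

width = 0.5                                  # assumed cell width for every dimension
threshold = 10                               # assumed minimum count for a "dense" unit

# Map every point to the grid cell that contains it, then count points per cell.
cells = Counter(tuple((p // width).astype(int)) for p in points)
dense_units = {cell: n for cell, n in cells.items() if n >= threshold}
print(len(dense_units), "dense units out of", len(cells), "occupied cells")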
Fig .Dense units found with respect to age for the dimensions salary and vacation are
intersected in order to provide a candidate search space for dense units of higher dimensionality.
Strength
o automatically finds subspaces of the highest dimensionality such that high density
clusters exist in those subspaces
o insensitive to the order of records in input and does not presume some canonical
data distribution
o scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
Weakness
o The accuracy of the clustering result may be degraded at the expense of simplicity
of the method
Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters.
This leads to a search for dense regions in all subspaces of some desired dimensionality and avoids generating a large number of overlapped clusters in projected dimensions of lower dimensionality.
The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster refinement.
o Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects.
Text documents are clustered based on the frequent terms they contain. A term can be made up of a single word or of several words. Terms are first extracted, and a stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large, and collectively a large set of documents will contain a very large set of different terms.
Advantage: It automatically generates a description for the generated clusters in terms
of their frequent term sets.
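A small sketch of representing documents as stemmed term sets and selecting frequent terms (plain Python; the crude suffix-stripping stemmer and the support threshold of 2 are simplifications for illustration):

from collections import Counter

def stem(word):
    # Very crude stemmer used only for illustration (strips a few common suffixes).
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

docs = ["data mining finds patterns in data",
        "mining frequent patterns from text documents",
        "documents are clustered by frequent terms"]

term_sets = [{stem(w) for w in d.lower().split()} for d in docs]   # one term set per document
support = Counter(t for ts in term_sets for t in ts)               # document frequency per term
frequent_terms = {t for t, n in support.items() if n >= 2}         # terms in at least 2 documents
print(frequent_terms)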
o Figure 1 shows a fragment of microarray data containing only three genes (taken as "objects") and ten attributes (columns a to j). It is difficult to see any pattern in the raw data.
o However, if two subsets of attributes, {b, c, h, j, e} and {f, d, a, g, i}, are selected and plotted as in Figure 2(a) and (b) respectively, clear patterns appear:
o Figure. 2(a) forms a shift pattern, where the three curves are similar to each other with
respect to a shift operation along the y-axis.
o Figure.2(b) forms a scaling pattern, where the three curves are similar to each other with
respect to a scaling operation along the y-axis.
Fig: Raw data from a fragment of microarray data containing only 3 objects and 10 attributes
Clustering with obstacle objects using a partitioning approach requires that the distance
between each object and its corresponding cluster center be re-evaluated at each iteration
whenever the cluster center is changed.
E.g., a city may have rivers, bridges, highways, lakes, and mountains; we do not want to swim across a river to reach an ATM.
Fig. (a): First, a point p is visible from another point q in region R if the straight line joining p and q does not intersect any obstacles.
The shortest path between two points, p and q, will be a subpath of VG’ as shown in Figure (a).
We see that it begins with an edge from p to either v1, v2, or v3, goes through some path in VG,
and then ends with an edge from either v4 or v5 to q.
Fig. (b): To reduce the cost of distance computation between any two pairs of objects, microclustering techniques can be used. This can be done by first triangulating the region R into
triangles, and then grouping nearby points in the same triangle into microclusters, as shown in
Figure (b).
After that, precomputation can be performed to build two kinds of join indices based on the
shortest paths:
o VV index: indices for any pair of obstacle vertices
o MV index: indices for any pair of a micro-cluster and an obstacle vertex
e.g., A parcel delivery company with n customers would like to determine locations for k
service stations so as to minimize the traveling distance between customers and service
stations.
The company’s customers are considered as either high-value customers (requiring
frequent, regular services) or ordinary customers (requiring occasional services).
The manager has specified two constraints: each station should serve (1) at least 100 high-
value customers and (2) at least 5,000 ordinary customers.
Outlier Analysis
Data objects that are grossly different from, or inconsistent with, the remaining set of data are called outliers. Outliers can be caused by measurement or execution errors.
E.g., the display of a person's age as 999.
Outlier detection and analysis is an interesting data mining task, referred to as outlier
mining.
Applications:
o Fraud Detection (Credit card, telecommunications, criminal activity in e-
Commerce)
o Customized Marketing (high/low income buying habits)
o Medical Treatments (unusual responses to various drugs)
o Analysis of performance statistics (professional athletes)
o Weather Prediction
o Financial Applications (loan approval, stock tracking)
A statistical discordancy test examines two hypotheses:
a working hypothesis, and
an alternative hypothesis.
Working hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H: oi ∈ F, where i = 1, 2, ..., n.
A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the distribution F.
Alternative hypothesis
An alternative hypothesis, H̄, states that oi comes from another distribution model, G; it is adopted when oi is found to be discordant with respect to F.
There are different kinds of alternative distributions (a simple numeric discordancy check is sketched after this list):
o Inherent alternative distribution
o Mixture alternative distribution
o Slippage alternative distribution
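A toy discordancy check under a normal working hypothesis, flagging objects whose standardized distance from the mean is unusually large (plain numpy; the 3-sigma cut-off is an assumed significance rule, not a specific test named in the text):

import numpy as np

rng = np.random.default_rng(6)
values = np.append(rng.normal(50, 5, 200), [999.0])   # a person's age recorded as 999

z = (values - values.mean()) / values.std()           # standardized under H: values come from a normal F
discordant = values[np.abs(z) > 3]                    # objects significantly far from F
print(discordant)                                     # only the 999 value is flagged here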
An object, O, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from O.
Index-based algorithm
Given a data set, the index-based algorithm uses multidimensional indexing structures, such as
R-trees or k-d trees, to search for neighbours of each object o within radius dmin around that
object.
o Nested-loop algorithm
This algorithm avoids index structure construction and tries to minimize the number of I/Os. It
divides the memory buffer space into two halves and the data set into several logical blocks. I/O
efficiency can be achieved by choosing the order in which blocks are loaded into each half.
o Cell-based algorithm: a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.
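A brute-force sketch of the DB(pct, dmin) definition above, using a nested-loop style distance check in plain numpy (the pct and dmin values are illustrative; this is not the optimized index- or cell-based algorithm):

import numpy as np

def db_outliers(X, pct=0.95, dmin=3.0):
    # DB(pct, dmin)-outliers: objects for which at least a fraction pct of the
    # other objects lie at a distance greater than dmin (brute-force O(n^2) check).
    n = len(X)
    out = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        far = (dists > dmin).sum()            # objects farther than dmin from X[i]
        if far / (n - 1) >= pct:
            out.append(i)
    return out

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[10.0, 10.0]]])   # one far-away point
print(db_outliers(X))                                          # the isolated point's index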
Deviation-based outlier detection identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers; hence, the term deviations is sometimes used to refer to outliers.
Techniques
o Sequential Exception Technique
o OLAP Data Cube Technique
Dissimilarity function: It is any function that, if given a set of objects, returns a low value if
the objects are similar to one another. The greater the dissimilarity among the objects, the
higher the value returned by the function.
Cardinality function: This is typically the count of the number of objects in a given set.
Smoothing factor: This function is computed for each subset in the sequence. It assesses
how much the dissimilarity can be reduced by removing the subset from the original set of
objects.
Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality.
Design and construction of data warehouses for multidimensional data analysis and data
mining
o View the debt and revenue changes by month, by region, by sector, and by other
factors
o Access statistical information such as max, min, total, average, trend, etc.
Loan payment prediction/consumer credit policy analysis
o feature selection and attribute relevance ranking
o Loan payment performance
o Consumer credit rating
Classification and clustering of customers for targeted marketing
o multidimensional segmentation by nearest-neighbor, classification, decision trees,
etc. to identify customer groups or associate a new customer to an appropriate
customer group
Detection of money laundering and other financial crimes
o integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
o Tools: data visualization, linkage analysis, classification, clustering tools, outlier
analysis, and sequential pattern analysis tools (find unusual access sequences)
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Applications of retail data mining
o Identify customer buying behaviors
o Discover customer shopping patterns and trends
o Improve the quality of customer service
o Achieve better customer retention and satisfaction
o Enhance goods consumption ratios
o Design more effective goods transportation and distribution policies
Examples
Ex. 1. Design and construction of data warehouses based on the benefits of data mining
Ex. 2. Multidimensional analysis of sales, customers, products, time, and region
Ex. 3. Analysis of the effectiveness of sales campaigns
Ex. 4. Customer retention: analysis of customer loyalty
o Use customer loyalty card information to register sequences of purchases of
particular customers
o Use sequential pattern mining to investigate changes in customer
consumption or loyalty
o Suggest adjustments on the pricing and variety of goods
Ex. 5. Purchase recommendation and cross-reference of items
Telecommunication industry: a rapidly expanding and highly competitive industry with a great demand for data mining in order to
o Understand the business involved
o Identify telecommunication patterns
o Catch fraudulent activities
o Make better use of resources
o Improve the quality of service
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
Humans have around 30,000 genes
Tremendous number of ways that the nucleotides can be ordered and sequenced to form
distinct genes
Data mining may contribute to biological data analysis in several aspects; these are discussed later, under data mining for biological data analysis.
Data Mining - Mining World Wide Web
The World Wide Web contains huge amounts of information that provides a rich source for data mining.
The web is too huge − The size of the web is enormous and rapidly increasing, which makes the web appear too large for data warehousing and data mining.
Complexity of web pages − Web pages have no unifying structure. They are far more complex than traditional text documents. There are huge numbers of documents in the digital library of the web, and these libraries are not arranged in any particular order.
Web is dynamic information source − The information on the web is rapidly updated. The data such as
news, stock markets, weather, sports, shopping, etc., are regularly updated.
Diversity of user communities − The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.
Relevancy of information − A particular user is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp the desired results.
The DOM structure was initially introduced for presentation in the browser, not for describing the semantic structure of a web page. Therefore, the DOM structure cannot correctly identify the semantic relationships between the different parts of a web page.
Such a semantic structure corresponds to a tree structure. In this tree each node corresponds to a block.
A value is assigned to each node. This value is called the Degree of Coherence. This value is assigned to
indicate the coherent content in the block based on visual perception.
The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that it finds the
separators between these blocks.
The separators refer to the horizontal or vertical lines in a web page that visually cross with no blocks.
The semantics of the web page is constructed on the basis of these blocks.
Text databases consist of huge collections of documents. This information is collected from several sources, such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the amount of information, text databases are growing rapidly. In many text databases the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author, publishing_date, etc. But along with the structured data, the document also contains unstructured text components, such as the abstract and contents.
Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and
extracting useful information from the data. Users require tools to compare the documents and rank their
importance and relevance. Therefore, text mining has become popular and an essential theme in data mining.
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based documents. Some database-system features are not usually present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include online library catalogue systems and web search engines.
In such search problems, the user takes an initiative to pull relevant information out from a collection. This is
appropriate when the user has ad-hoc information need, i.e., a short-term need. But if the user has a long-term
information need, then the retrieval system can also take an initiative to push any newly arrived information item to
the user.
This kind of access to information is called Information Filtering. And the corresponding systems are known as
Filtering Systems or Recommender Systems.
There are three fundamental measures for assessing the quality of text retrieval −
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as
−
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to trade recall for precision, or vice versa. F-score is defined as the harmonic mean of recall and precision −
F-score = (2 × Precision × Recall) / (Precision + Recall)
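A tiny worked computation of the three measures from sets of relevant and retrieved document ids (plain Python; the ids are made up):

relevant  = {1, 2, 3, 4, 5, 6, 7, 8}      # documents actually relevant to the query
retrieved = {2, 3, 5, 9, 10}              # documents returned by the retrieval system

hits = relevant & retrieved               # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)    # 3/5 = 0.6
recall    = len(hits) / len(relevant)     # 3/8 = 0.375
f_score   = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
print(precision, recall, round(f_score, 3))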
Text mining can be used to make large quantities of unstructured data accessible and useful, thereby not only generating value but also delivering ROI from unstructured data management, as seen in applications of text mining for risk management software and cybercrime prevention.
These 10 text mining examples can give you an idea of how this technology is helping organizations today.
1 – Risk management
No matter the industry, insufficient risk analysis is often a leading cause of failure. This is especially true in the financial industry, where adoption of risk management software based on text mining technology can dramatically increase the ability to mitigate risk, enabling complete management of thousands of sources and petabytes of text documents, and providing the ability to link information together and access the right information at the right time.
2 – Knowledge management
Not being able to find important information quickly is always a challenge when managing large volumes of text
documents—just ask anyone in the healthcare industry. Here, organizations are challenged with a tremendous
amount of information—decades of research in genomics and molecular techniques, for example, as well as
volumes of clinical patient data—that could potentially be useful for their largest profit center: new product
development. Here, knowledge management software based on text mining offers a clear and reliable solution for the "info-glut" problem.
3 – Cybercrime prevention
The anonymous nature of the internet and the many communication features operated through it contribute to the
increased risk of internet-based crimes. Today, text mining intelligence and anti-crime applications are making
internet crime prevention easier for any enterprise and law enforcement or intelligence agencies.
4 – Customer care service
Text mining, as well as natural language processing, is a frequent application for customer care. Today, text analytics software is frequently adopted to improve customer experience using different sources of valuable information, such as surveys, trouble tickets, and customer call notes, to improve the quality, effectiveness, and speed of problem resolution. Text analysis is also used to provide rapid, automated responses to customers, dramatically reducing their reliance on call center operators to solve problems.
5 – Fraud detection
Text analytics is a tremendously effective technology in any domain where the majority of information is collected as text. Insurance companies are taking advantage of text mining technologies by combining the results of text analysis with structured data to prevent fraud and swiftly process claims.
6 – Contextual Advertising
Digital advertising is a moderately new and growing field of application for text analytics. Here, companies such as Admantx have made text mining the core engine for contextual retargeting, with great success. Compared to the traditional cookie-based approach, contextual advertising provides better accuracy and completely preserves the user's privacy.
7 – Business intelligence
This process is used by large companies to uphold and support decision making. Here, text mining really makes the difference, enabling the analyst to quickly jump to the answer even when analyzing petabytes of internal and open-source data. Applications such as the Cogito Intelligence Platform are able to monitor thousands of sources and analyze large data volumes to extract only the relevant content.
8 – Content enrichment
While it’s true that working with text content still requires a bit of human effort, text analytics techniques make a
significant difference when it comes to being able to more effectively manage large volumes of information. Text
mining techniques enrich content, providing a scalable layer to tag, organize and summarize the available
content that makes it suitable for a variety of purposes.
9 – Spam filtering
E-mail is an effective, fast and reasonably cheap way to communicate, but it comes with a dark side: spam.
Today, spam is a major issue for internet service providers, increasing their costs for service management and hardware/software updates; for users, spam is an entry point for viruses and impacts productivity. Text mining techniques can be implemented to improve the effectiveness of statistical filtering methods, as sketched below.
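A minimal sketch of a statistics-based spam filter of the kind mentioned above, using bag-of-words counts and naive Bayes (assumes scikit-learn; the tiny training set is purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_msgs  = ["win a free prize now", "cheap loans click here",
               "meeting moved to 3pm", "please review the attached report"]
train_label = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_msgs, train_label)
print(model.predict(["free prize, click now", "report for the 3pm meeting"]))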
10 – Social media data analysis
Today, social media is one of the most prolific sources of unstructured data, and organizations have taken notice. Social media is increasingly being recognized as a valuable source of market and customer intelligence, and companies are using it to analyze or predict customer needs and to understand the perception of their brand. Text analytics can address both needs by analyzing large volumes of unstructured data, extracting opinions, emotions, and sentiment, and relating them to brands and products.
A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or
medical imaging data, and VLSI chip layout data.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.
Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic
information. The term geostatistics is often associated with continuous geographic space, whereas the term spatial
statistics is often associated with discrete space.
A spatial data warehouse can be constructed by integrating spatial data so as to facilitate spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.
A nonspatial dimension contains only nonspatial data. Nonspatial dimensions temperature and precipitation can be
constructed for the warehouse in Example 10.5, since each contains nonspatial data whose generalizations are
nonspatial (such as “hot” for temperature and “wet” for precipitation).
A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose generalization,
starting at a certain high level, becomes nonspatial. For example, the spatial dimension city relays geographic data
for the U.S. map. Suppose that the dimension’s spatial representation of, say, Seattle is generalized to the string
“pacific northwest.” Although “pacific northwest” is a spatial concept, its representation
is not spatial (since, in our example, it is a string). It therefore plays the role of a nonspatial dimension.
A spatial-to-spatial dimension is a dimension whose primitive-level data and all of its high-level generalized data are spatial. For example, the dimension equi_temperature_region contains spatial data, as do all of its generalizations, such as regions covering 0-5 degrees (Celsius), 5-10 degrees, and so on.
A numerical measure contains only numerical data. For example, one measure in a spatial data warehouse could be
the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, and so on.
Numerical measures can be further classified into distributive, algebraic, and holistic.
A spatial measure contains a collection of pointers to spatial objects. For example, in a generalization (or roll-up) in
the spatial data cube of Example 10.5, the regions with the same range of temperature and precipitation will be
grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.
A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contains
spatial dimensions but no spatial measures, its OLAP operations, such as drilling or pivoting, can be implemented in
a manner similar to that for nonspatial data cubes.
For example, two different roll-ups on the BC weather map data (Figure 10.2) may produce two different
generalized region maps, as shown in Figure 10.4, each being the result of merging a large number of small (probe)
regions from Figure 10.2.
Figure 10.3 presents hierarchies for each of the dimensions in the BC weather warehouse.
Mining Spatial Association and Co-location Patterns
Similar to the mining of association rules in transactional and relational databases, spatial association rules can be
mined in spatial databases. A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following is a spatial association rule:
is_a(X, "school") ∧ close_to(X, "sports_center") ⇒ close_to(X, "park") [0.5%, 80%]
This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belongs to such a case.
Spatial data clustering identifies clusters, or densely populated regions, according to some distance measurement in
a large, multidimensional data set.
Spatial classification analyzes spatial objects to derive classification schemes in relevance to certain spatial properties, such as the neighborhood of a district, highway, or river.
Data Mining for Biological Data Analysis
Current situation: highly distributed, uncontrolled generation and use of a wide variety of DNA data
o Data cleaning and data integration methods developed in data mining will help
Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences
o Compare the frequently occurring patterns of each class (e.g., diseased and
healthy)
o Identify gene sequence patterns that play roles in various diseases
Discovery of structural patterns and analysis of genetic networks and protein pathways:
Association analysis: identification of co-occurring gene sequences
o Most diseases are not triggered by a single gene but by a combination of genes
acting together
o Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
Path analysis: linking genes to different disease development stages
o Different genes may become active at different stages of the disease
o Develop pharmaceutical interventions that target the
different stages separately
Visualization tools and genetic data analysis
Vast amounts of data have been collected from scientific domains (including geosciences,
astronomy, and meteorology) using sophisticated telescopes, multispectral high-resolution
remote satellite sensors, and global positioning systems.
Large data sets are being generated due to fast numerical simulations in various fields, such
as climate and ecosystem modeling, chemical engineering, fluid dynamics, and structural
mechanics.
Some of the challenges brought about by emerging scientific applications of data mining include the following:
o Data warehouses and data preprocessing
o Mining complex data types
o Graph-based mining
o Visualization tools and domain-specific knowledge
VI. Data Mining for Intrusion Detection
The security of our computer systems and data is at constant risk. The extensive growth of
the Internet and increasing availability of tools and tricks for interrupting and attacking
networks have prompted intrusion detection to become a critical component of network
administration.
An intrusion can be defined as any set of actions that threaten the integrity, confidentiality, or availability of a network resource.
The following are areas in which data mining technology is applied or further developed for intrusion detection: