
UNIT – I

DATA MINING BASICS

Introduction: Definition of data mining – Data mining vs query tools – Machine learning – Steps in data mining process – Overview of data mining techniques.

Introduction:

There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so it is necessary to analyze the data and extract useful information from it.

Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we
would be able to use this information in many applications such as Fraud Detection, Market
Analysis, Production Control, Science Exploration, etc.

Data Mining:

Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. It is the process of
discovering patterns in large data sets involving methods at the intersection of machine
learning, statistics, and database systems.

Definitions of Data Mining:

● "The nontrivial extraction of implicit, previously unknown and potentially useful information from data." – (Piatetsky-Shapiro)

● "The automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses and the Web, or data streams." – (Han)

● "The process of discovering patterns in data. The process must be automatic or semiautomatic. The patterns discovered must be meaningful." – (Witten)

● "Finding hidden information in a database." – (Dunham)

● "The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database." – (Roiger)

Why Data Mining?
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, research projects, and market analysis to engineering design and science exploration.
Data mining can be viewed as a result of the natural evolution of information technology. An evolutionary path has been witnessed in the database industry in the development of the following functionalities:
● Data Collection
● Database Creation
● Data Management (Storage, retrieval, transaction processing)
● Data Analysis and Understanding
Why Data Mining is important?
● Large amounts of current and historical data are being stored.
● As databases grow larger, decision-making directly from the data is not possible; we need knowledge derived from the stored data.
● Data Sources
o Health-related services, e.g., medical analysis.
o Commercial, e.g., marketing and sales.
o Financial.
o Scientific, e.g., NASA, Genome.
▪ DOD and Intelligence
● Desired analysis
o Support for planning (historical supply and demand trends)
o Yield management (scanning airline seat reservations)
o System performance (detect abnormal behavior in a system)
o Mature database analysis (clean up the data sources)

Data Mining Applications :


Data mining is highly useful, and the information or knowledge so extracted can be used for any of the following applications:

● Market Analysis

● Fraud Detection
● Customer Retention
● Production Control
● Science Exploration

Apart from these, data mining can also be used in areas such as sports, astrology, and Internet Web Surf-Aid.

Data Mining Vs Query Tool:

All data mining queries use the Data Mining Extensions (DMX) language. DMX can
be used to create models for all kinds of machine learning tasks, including classification, risk
analysis, generation of recommendations, and linear regression. We can also write DMX
queries to get information about the patterns and statistics that were generated when we
processed the model.

We can write our own DMX, or we can build basic DMX using a tool such as the Prediction Query Builder and then modify it. Both SQL Server Management Studio and Visual Studio with Analysis Services projects provide tools that help us to build DMX prediction queries. This section describes how to create and execute data mining queries using these tools.

Prediction Query Builder

Prediction Query Builder is included in the Mining Model Prediction tab of Data
Mining Designer, which is available in both SQL Server Management Studio and Visual
Studio with Analysis Services projects.

When we use the query builder, we select a mining model, add new case data, and add prediction functions. We can then switch to the text editor to modify the query manually, or switch to the Results pane to view the results of the query.

Query Editor

The Query Editor in SQL Server Management Studio also lets you build and run
DMX queries. You can connect to an instance of Analysis Services, and then select a
database, mining structure columns, and a mining model. The Metadata Explorer contains a list of prediction functions that you can browse.

DMX Templates

SQL Server Management Studio provides interactive DMX query templates that you
can use to build DMX queries. If you do not see the list of templates, click View on the
toolbar, and select Template Explorer. To see all Analysis Services templates, including
templates for DMX, MDX, and XMLA, click the cube icon.

To build a query using a template, we can drag the template into an open query
window, or we can double-click the template to open a new connection and a new query
pane.

Machine Learning:

On the other hand, machine learning is the process of developing algorithms that improve through experience derived from data. It's the design, study, and
development of algorithms that permit machines to learn without human intervention. It’s a
tool to make machines smarter, eliminating the human element (but not eliminating humans
themselves; that would be wrong).

Machine learning can look at patterns and learn from them to adapt behaviour for
future incidents, while data mining is typically used as an information source for machine
learning to pull from.

Steps in Data Mining Process:

With the enormous amount of data stored in files, databases, and other repositories, it
is increasingly important, if not necessary, to develop powerful means for analysis and
perhaps interpretation of such data and for the extraction of interesting knowledge that could
help in decision-making.

Data Mining, also popularly known as Knowledge Discovery in Databases (KDD),


refers to the nontrivial extraction of implicit, previously unknown and potentially useful
information from data in databases. While data mining and knowledge discovery in databases
(or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge
discovery process. The following figure (Figure 1.1) shows data mining as a step in an
iterative knowledge discovery process.

Figure 1.1: Data Mining is the core of Knowledge Discovery process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.

Figure 1.2: Various steps in Data Mining

The iterative process consists of the following steps:

1. Data Cleaning: A phase in which noisy data and irrelevant data are removed from the collection.

2. Data Integration: At this stage, multiple data sources, often heterogeneous, may be combined into a common source.

3. Data Selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.

4. Data Transformation: Also known as data consolidation; at this step the selected data is transformed into forms appropriate for the mining procedure.

5. Data Mining: The crucial step in which clever techniques are applied to extract potentially useful patterns.

6. Pattern Evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.

7. Knowledge Representation: The final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
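To make the flow of these steps concrete, here is a minimal, hypothetical Python sketch that chains them together with pandas and scikit-learn; the file names, column names, and the choice of clustering as the mining technique are illustrative assumptions, not part of the process definition above.

```python
# Hypothetical sketch of the KDD steps using pandas and scikit-learn.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1-2. Data cleaning and integration: load two assumed sources and combine them.
sales = pd.read_csv("sales.csv")          # assumed file
customers = pd.read_csv("customers.csv")  # assumed file
data = sales.merge(customers, on="customer_id").dropna()

# 3. Data selection: keep only the attributes relevant to the analysis.
selected = data[["age", "income", "total_spent"]]

# 4. Data transformation: scale the selected attributes for mining.
scaled = StandardScaler().fit_transform(selected)

# 5. Data mining: apply a clustering technique to extract patterns.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# 6-7. Pattern evaluation and presentation: summarize each discovered group.
data["segment"] = labels
print(data.groupby("segment")[["age", "income", "total_spent"]].mean())
```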

Data Mining Techniques

1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.
For example, if we're evaluating data on individual customers' financial backgrounds and purchase histories, we might be able to classify them as "low," "medium," or "high" credit risks. We could then use these classifications to learn even more about those customers.
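As a rough illustration of the credit-risk example, the following hypothetical Python sketch trains a decision tree classifier with scikit-learn; the feature values and labels are made up for demonstration.

```python
# Hypothetical sketch: classifying customers into credit-risk classes.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed training data: [income in $k, past purchases, missed payments].
X = [[25, 3, 4], [80, 20, 0], [45, 10, 1], [30, 2, 5], [95, 30, 0], [60, 12, 2]]
y = ["high", "low", "medium", "high", "low", "medium"]  # credit-risk labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.predict(X_test))        # predicted risk class for unseen customers
print(model.score(X_test, y_test))  # fraction classified correctly
```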
2. Association Rules:
Association is related to tracking patterns, but is more specific to dependently linked variables. In this case, we look for specific events or attributes that are highly correlated with another event or attribute. This data mining technique helps to find the association between two or more items. It discovers hidden patterns in the data set.

For example, we might notice that when your customers buy a specific item, they also
often buy a second, related item. This is usually what’s used to populate “people also bought”
sections of online stores.
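A minimal, hypothetical sketch of the "people also bought" idea in Python/pandas is shown below; the transaction data and item names are assumptions, and the pairwise co-occurrence counts stand in for full association-rule mining.

```python
# Hypothetical sketch: counting which items are bought together ("people also bought").
import pandas as pd

# Assumed transaction data: one row per (transaction, item).
baskets = pd.DataFrame({
    "transaction": [1, 1, 2, 2, 3, 3, 3, 4],
    "item": ["bread", "butter", "bread", "butter", "bread", "milk", "butter", "milk"],
})

# One-hot encode items per transaction, then count pairwise co-occurrence.
onehot = pd.crosstab(baskets["transaction"], baskets["item"]).astype(bool)
co_occurrence = onehot.T.astype(int) @ onehot.astype(int)
print(co_occurrence)  # e.g. how often "bread" and "butter" appear in the same basket
```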

3. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period. One of the most basic techniques in data mining is
learning to recognize patterns in our data sets. This is usually recognition of some aberration
in our data happening at regular intervals, or an ebb and flow of a certain variable over time.
For example, we might see that your sales of a certain product seem to spike just before
the holidays, or notice that warmer weather drives more people to your website.
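The following hypothetical Python/pandas sketch illustrates the holiday-spike example by averaging sales per calendar month; the figures are invented for demonstration.

```python
# Hypothetical sketch: spotting a recurring seasonal pattern in transaction data.
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=24, freq="MS"),  # assumed monthly data
    "units": [90, 85, 88, 95, 100, 98, 97, 99, 110, 130, 170, 240,
              92, 88, 90, 97, 103, 99, 98, 101, 115, 135, 175, 250],
})

# Average sales per calendar month across years reveals the pre-holiday spike.
by_month = sales.groupby(sales["date"].dt.month)["units"].mean()
print(by_month)
```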
4. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behaviour. This technique can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc. Outlier detection is also called outlier analysis or outlier mining.
For example, if our purchasers are almost exclusively male, but during one strange week
in July, there’s a huge spike in female purchasers, we’ll want to investigate the spike and see
what drove it, so we can either replicate it or better understand our audience in the process.
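As a rough sketch of outlier detection, the hypothetical Python example below flags the unusual week with a simple z-score rule; the weekly counts are invented, and a z-score threshold is only one of many possible detection methods.

```python
# Hypothetical sketch: flagging an unusual week of purchases with a simple z-score rule.
import pandas as pd

weekly_female_buyers = pd.Series([52, 48, 55, 50, 47, 53, 310, 49, 51])  # assumed counts

z_scores = (weekly_female_buyers - weekly_female_buyers.mean()) / weekly_female_buyers.std()
outliers = weekly_female_buyers[z_scores.abs() > 2]  # weeks far from the usual behaviour
print(outliers)  # the 310-buyer week would be reported for investigation
```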
5. Clustering:
Clustering analysis is a data mining technique to identify data that are like each other.
This process helps to understand the differences and similarities between the data. It is very
similar to classification, but involves grouping chunks of data together based on their
similarities.
For example, we might choose to cluster different demographics of our audience into
different packets based on how much disposable income they have, or how often they tend to
shop at your store.
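A minimal, hypothetical Python sketch of this kind of clustering with scikit-learn's k-means is shown below; the income and visit figures are assumptions.

```python
# Hypothetical sketch: clustering customers by disposable income and visit frequency.
from sklearn.cluster import KMeans

# Assumed data: [monthly disposable income in $, store visits per month].
customers = [[500, 2], [550, 3], [2500, 8], [2600, 9], [1200, 5], [1300, 4]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # typical income/visits profile of each cluster
```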
6. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used primarily as a form of planning and modeling to identify the likelihood of a specific variable, given the presence of other variables.
For example, we could use it to project a certain price, based on other factors like availability, consumer demand, and competition. More specifically, regression's main focus is to help you uncover the exact relationship between two (or more) variables in a given data set.
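The price-projection example might look roughly like the following hypothetical Python sketch using scikit-learn's linear regression; the factor values and prices are invented.

```python
# Hypothetical sketch: projecting a price from availability, demand, and competition.
from sklearn.linear_model import LinearRegression

# Assumed data: [units available, consumer demand index, competitor count] -> price.
X = [[100, 80, 5], [50, 90, 4], [200, 60, 8], [30, 95, 3], [150, 70, 6]]
y = [19.99, 24.99, 14.99, 29.99, 17.49]

model = LinearRegression().fit(X, y)
print(model.coef_)                   # estimated effect of each factor on price
print(model.predict([[80, 85, 4]]))  # projected price for a new scenario
```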
7. Prediction:
Prediction is one of the most valuable data mining techniques, since it's used to project the types of data you'll see in the future. In many cases, just recognizing and understanding historical trends is enough to chart a somewhat accurate prediction of what will happen in the future. Prediction uses a combination of the other data mining techniques such as trends, sequential patterns, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
For example, you might review consumers’ credit histories and past purchases to predict
whether they’ll be a credit risk in the future.

Unit – II
Data models
Multidimensional Data model – Data cube – Dimension Modeling – OLAP operation –
Meta Data – Types of Meta Data.

2. Data Models

2.1. Multidimensional Data Model

The dimensional model was developed for implementing data in data warehouses and data marts. The multidimensional data model provides both a mechanism to store data and a way to perform business analysis.

The multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes. A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store's sales records, dimensions allow the store to keep track of things like monthly sales of items, and the branches and locations.

A multidimensional database helps to provide data-related answers to complex business


queries quickly and accurately. Data warehouses and Online Analytical Processing (OLAP)
tools are based on a multidimensional data model. OLAP in data warehousing enables users
to view data from different angles and dimensions.

Logical Multidimensional Data Model:

Figure 2.1.1: Multidimensional Model

The multidimensional data model is designed to solve complex queries in real time. The
multidimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels, and attributes. The simplicity of the model is inherent because it defines
objects that represent real-world business entities.

• Cubes:

Logical cubes provide a means of organizing measures that have the same shape, that is,
they have the exact same dimensions. Measures in the same cube have the same relationships
to other logical objects and can easily be analyzed and displayed together.

• Measures:

Measures populate the cells of a logical cube with the facts collected about business
operations. Measures are organized by dimensions, which typically include a Time
dimension. Measures are static and consistent while analysts are using them to inform their
decisions. They are updated in a batch window at regular intervals: weekly, daily, or
periodically throughout the day. Many applications refresh their data by adding periods to the
time dimension of a measure, and may also roll off an equal number of the oldest time
periods. Each update provides a fixed historical record of a particular business activity for that interval. Other applications do a full rebuild of their data rather than performing incremental updates.

• Dimensions:

Dimensions contain a set of unique values that identify and categorize data. They form
the edges of a logical cube, and thus of the measures within the cube. Because measures are
typically multidimensional, a single value in a measure must be qualified by a member of
each dimension to be meaningful. For example, the Sales measure has four dimensions:
Time, Customer, Product, and Channel. A particular Sales value (43,613.50) only has
meaning when it is qualified by a specific time period (Feb-01), a customer (Warren
Systems), a product (Portable PCs), and a channel (Catalog).

• Dimensions Attributes:

A dimension consists of members. For example, the members of a product dimension


are the individual products. Members have attributes that identify them and provide further
information. For example, some possible attributes for a product dimension are the product
code, name, type, color, and size. Each member of a dimension must have a business key to
identify it in streams of transactional data. If the dimension is defined as a hierarchy, the
lower levels of the hierarchy must also have an attribute that identifies the parent of each
member. Another common attribute is the business name, which enables analysis software to
make reports more comprehensible by using the business name in place of the business key.

• Levels:

Level represents a position in the hierarchy. Each level above the base (or most detailed)
level contains aggregate values for the levels below it. The members at different levels have a
one-to-many parent-child relation. For example, Q1-02 and Q2-02 are the children of 2002,
thus 2002 is the parent of Q1-02 and Q2-02.

• Hierarchies:

A hierarchy is a way to organize data at different levels of aggregation. In viewing data, analysts use dimension hierarchies to recognize trends at one level, drill down to lower levels to identify reasons for these trends, and roll up to higher levels to see what effect these trends have on a larger sector of the business.

2.1.1. Schemas of Multidimensional Data Model

• Star Schema Model

• Snow Flake Schema Model

• Fact Constellations

1 Star Schema Model:

The star schema is a modelling paradigm in which the data warehouse contains (1) a large central table (fact table), and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.

An example of a star schema for All Electronics sales is shown in Figure 2.1.2. Sales are considered along four dimensions, namely time, item, branch, and location. The schema contains a central fact table for sales which contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes.

For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. Entries for such cities in the location dimension table will create redundancy among the attributes province or state and country, i.e., (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).

Figure 2.1.2: Star schema of a data warehouse for sales.
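To make the star layout concrete, here is a small, hypothetical Python/pandas sketch of a sales fact table joined to two of its dimension tables; the table and column names loosely mirror the example above but are assumptions.

```python
# Hypothetical sketch of a star schema: a central fact table joined to dimension tables.
import pandas as pd

time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2023, 2023]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["TV", "PC"], "brand": ["A", "B"]})

# Fact table: one row per sale, holding only keys and measures (dollars sold, units sold).
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "dollars_sold": [1200.0, 800.0, 1500.0],
    "units_sold": [2, 1, 3],
})

# A typical star join: resolve the keys against the dimension tables, then aggregate.
report = (sales_fact.merge(time_dim, on="time_key")
                    .merge(item_dim, on="item_key")
                    .groupby(["quarter", "brand"])[["dollars_sold", "units_sold"]].sum())
print(report)
```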

2 Snow Flake Schema Model:

The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage space, because a dimension table can become extremely large when the dimensional structure is included as columns.

Since much of this space holds redundant data, creating a normalized structure reduces the overall space requirement. However, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. Consequently, system performance may be adversely impacted. Performance benchmarking can be used to determine what is best for your design.

An example of a snowflake schema for All Electronics sales is given in Figure 2.1.3. Here, the sales fact table is identical to that of the star schema in Figure 2.1.2. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two tables: new location and city. The location key of the new location table now links to the city dimension.

Figure 2.1.3: Snow Flake schema of a data warehouse for sales.

Notice that further normalization can be performed on province or state and country in the snowflake schema shown in Figure 2.1.3 when desirable. A compromise between the star schema and the snowflake schema is to adopt a mixed schema where only the very large dimension tables are normalized. Normalizing large dimension tables saves storage space, while keeping small dimension tables unnormalized may reduce the cost and performance degradation due to joins on multiple dimension tables. Doing both may lead to an overall performance gain. However, careful performance tuning could be required to determine which dimension tables should be normalized and split into multiple tables.

3 Fact Constellations:

Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.

An example of a fact constellation schema is shown in Figure 2.1.4. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 2.1.2). The shipping table has five dimensions, or keys: time key, item key, shipper key, from location, and to location, and two measures: dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between both the sales and shipping fact tables.

Figure 2.1.4: Fact constellation schema of a data warehouse for sales and shipping.

2.2. Data Cube

2.2.1. Introduction

A data cube can also be described as the multidimensional extensions of two dimensional
tables. It can be viewed as a collection of identical 2-D tables stacked upon one another. Data
cubes are used to represent data that is too complex to be described by a table of columns and
rows. As such, data cubes can go far beyond 3-D to include many more dimensions.

Definition

A data cube is a three-dimensional (3D) (or higher-dimensional) range of values that is generally used to explain the time sequence of an image's data. It is a data abstraction to evaluate aggregated data from a variety of viewpoints. It is also useful for imaging spectroscopy, as a spectrally-resolved image is depicted as a 3-D volume.

A data cube is generally used to easily interpret data. It is especially useful when representing data together with dimensions as certain measures of business requirements. Every dimension of a cube represents a certain characteristic of the database, for example, daily, monthly or yearly sales. As Figure 2.2.1 shows, the data included inside a data cube makes it possible to analyze almost all the figures for virtually any or all customers, sales agents, products, and much more. Thus, a data cube can help to establish trends and analyze performance.

Figure 2.2.1: Data cube

2.2.2. Categories of Data Cube

Data cubes are mainly categorized into two categories:

1. Multidimensional Data Cube.

2. Relational OLAP

1. Multidimensional Data Cube

Most OLAP products are developed based on a structure where the cube is patterned as a multidimensional array. These multidimensional OLAP (MOLAP) products usually offer improved performance compared to other approaches, mainly because they can be indexed directly into the structure of the data cube to gather subsets of data. When the number of dimensions is greater, the cube becomes sparser. That means that several cells that represent particular attribute combinations will not contain any aggregated data.

This in turn boosts the storage requirements, which may reach undesirable levels at times,
making the MOLAP solution untenable for huge data sets with many dimensions.
Compression techniques might help; however, their use can damage the natural indexing of
MOLAP. Figure 2.2.2 shows the MOLAP model.


Figure 2.2.2: MOLAP Model

2. Relational OLAP

Relational OLAP makes use of the relational database model. The ROLAP data cube is implemented as a collection of relational tables (approximately twice as many as the number of dimensions) instead of a multidimensional array. Each of these tables, known as a cuboid, signifies a particular view. Figure 2.2.3 shows the ROLAP model.

Figure 2.2.3:ROLAP Model

• Data Cube Measure

The term data cube is applied in contexts where these arrays are massively larger than the hosting computer's main memory; examples include multi-terabyte/petabyte data warehouses and time series of image data. The data cube is used to represent data (sometimes called facts) along some measure of interest.

2.2.3. Cube Data Source

A cube data source is a data source in which hierarchies and aggregations have been
created by the cube's designer in advance.

Cubes are very powerful and can return information very quickly, often much more
quickly than a relational data source. However, the reason for a cube's speed is that all its
aggregations and hierarchies are pre-built. These definitions remain static until the cube is
rebuilt. Thus, cube data sources are not as flexible as relational data sources if the types of
questions you need to ask were not anticipated by the original designer, or if they change
after the cube was built.

The cube data sources supported in Tableau are,

• Oracle Essbase

• Teradata OLAP

• Microsoft Analysis Services (MSAS)

• SAP NetWeaver Business Warehouse

• Microsoft Power Pivot

2.2.4. Create calculated members using MDX formulas

When working with a cube data source, you can create calculated members using MDX
formulas instead of creating Tableau formulas. MDX, which stands for Multidimensional
Expressions, is a query language for OLAP databases. With MDX calculated members, you
can create more complex calculations and reference both measures and dimensions.

A calculated member can be either a calculated measure, which is a new field in the data
source just like a calculated field, or a calculated dimension member, which is a new member
within an existing hierarchy.

• For details, see How to Create a Calculated Member.

2.2.5. Defining Calculated Members

If you are using a multidimensional data source, you can create calculated members using
MDX formulas instead of Tableau formulas. A calculated member can be either a calculated
measure, which is a new field in the data source just like a calculated field, or a calculated
dimension member, which is a new member within an existing hierarchy. For example, if a
dimension Product has three members (Soda, Coffee, and Crackers) you can define a new
calculated member Beverages that sums the Soda and Coffee members. When you then place
the Products dimension on the Rows shelf it displays four rows: Soda, Coffee, Crackers, and
Beverages.

You can define a calculated dimension member by selecting Calculated Members from the Data pane menu. In the Calculated Members dialog box that opens, you can create, delete, and edit calculated members, as shown in Figure 2.2.4.

Figure 2.2.4: Defining Calculated Members

To create new calculated members do the following:


1. Click New to add a new row to the list of calculated members at the top of the dialog box.

2. Type a Name for the new calculated member in the Member Definition area of the dialog
box.

3. Specify the Parent member for the new calculated member. All Member is selected by
default. However, you can choose Selected Member to browse the hierarchy and select a
specific parent member.

Note: Specifying a parent member is not available if you are connected to Oracle Essbase.

4. Give the new member a solve order. Sometimes a single cell in your data source can be determined by more than one calculated member; the solve order specifies which formula takes precedence.

5. If you are connected to a Microsoft Analysis Services data source, the calculation editor
contains a Run before SSAS check box. Choose this option to execute the Tableau
calculation before any Microsoft Analysis Services calculations. For information on
connecting to Microsoft Analysis Services data sources.

6. Type or paste an MDX expression into the large white text box.

7. Click Check Formula to verify that the formula is valid.

8. When finished, click OK.

The new member displays in the Data pane either in the Measures area, if you chose [Measures] as the parent member, or in the Dimensions area under the specified parent member. You can use the new member just like any other field in the view.

2.2.6. Data Cube Aggregation

Data cube aggregation is any process in which information is gathered and expressed in a
summary form, for purposes such as statistical analysis. A common aggregation purpose is to
get more information about particular groups based on specific variables such as age,
profession, or income. The information about such groups can then be used for Web site
personalization to choose content and advertising likely to appeal to an individual belonging
to one or more groups for which data has been collected. For example, a site that sells music
CDs might advertise certain CDs based on the age of the user and the data aggregate for their
age group. Online analytic processing (OLAP) is a simple type of data aggregation in which
the marketer uses an online reporting mechanism to process the information.
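As a rough illustration of this kind of aggregation, the hypothetical Python/pandas sketch below summarizes CD purchases by age group and genre; the data and column names are invented.

```python
# Hypothetical sketch: aggregating purchase data by age group, OLAP-report style.
import pandas as pd

purchases = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35", "36-50"],
    "genre": ["pop", "rock", "rock", "jazz", "jazz"],
    "cds_bought": [3, 1, 2, 4, 5],
})

# Summary form: total CDs bought per age group and genre.
summary = purchases.pivot_table(index="age_group", columns="genre",
                                values="cds_bought", aggfunc="sum", fill_value=0)
print(summary)  # e.g. which genres to advertise to each age group
```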

Data cube aggregation can be user-based: personal data aggregation services offer the
user a single point for collection of their personal information from other Web sites. The
customer uses a single master personal identification number (PIN) to give them access to
their various accounts (such as those for financial institutions, airlines, book and music clubs,
and so on). Performing this type of data aggregation is sometimes referred to as "screen
scraping."

2.3. Dimension Modelling

2.3.1. Introduction

Dimensional modelling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, which includes a set of methods, techniques and concepts for use in data warehouse design. The approach focuses on identifying the key business processes within a business and modelling and implementing these first before adding additional business processes, a bottom-up approach. An alternative approach from Inmon advocates a top-down design of the model of all the enterprise data using tools such as entity-relationship modelling (ER).

• Definition

A dimensional model is a data structure technique optimized for Data warehousing


tools. Facts are the measurements/metrics or facts from your business process. Dimension
provides the context surrounding a business process event. The Attributes are the various
characteristics of the dimension.

• What is Dimensional Model?

A dimensional model is a data structure technique optimized for data warehousing tools. The concept of Dimensional Modelling was developed by Ralph Kimball and is comprised of "fact" and "dimension" tables. A dimensional model is designed to read, summarize, and analyze numeric information like values, balances, counts, weights, etc. in a data warehouse. In contrast, relational models are optimized for addition, updating and deletion of data in a real-time Online Transaction Processing system. For instance, in the relational model, normalization and ER models reduce redundancy in data. On the contrary, the dimensional model arranges data in such a way that it is easier to retrieve information and generate reports. Hence, dimensional models are used in data warehouse systems and are not a good fit for relational systems.

• Example of Dimensional Model

Dimensional data modelling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, organization, etc.

Figure 2.3.1: Example of Dimension Model

2.3.2. Characteristics of a Dimensional Model:

The simplicity of a dimensional model is inherent because it defines objects that represent
real-world business entities. Analysts know which business measures they are interested in
examining, which dimensions and attributes make the data meaningful, and how the
dimensions of their business are organized into levels and hierarchies.

✓ Measures. Measures store quantifiable business data (such as sales, expenses, and
inventory). Measures are also called "facts". Measures are organized by one or more
dimensions and may be stored or calculated at query time.

✓ Stored Measures. Stored measures are loaded and stored at the leaf level. Commonly,
there is also a percentage of summary data that is stored. Summary data that is not stored is
dynamically aggregated when queried.

✓ Calculated Measures. Calculated measures are measures whose values are calculated
dynamically at query time. Only the calculation rules are stored in the database. Common

23
calculations include measures such as ratios, differences, totals and moving averages.
Calculations do not require disk storage space, and they do not extend the processing time
required for data maintenance.

✓ Dimensions. A dimension is a structure that categorizes data to enable users to answer business questions. Commonly used dimensions are Customers, Products, and Time. A dimension's structure is organized hierarchically based on parent-child relationships.

2.3.3. Elements of Dimensional Data Model

Figure 2.3.2: Base Element of the Data warehouse

• Fact

Facts are the measurements/metrics or facts from your business process. For a Sales business
process, a measurement would be quarterly sales number

• Dimension

Dimension provides the context surrounding a business process event. In simple terms,
they give who, what, where of a fact. In the Sales business process, for the fact quarterly sales
number, dimensions would be

✓ Who – Customer Names

✓ Where – Location

✓ What – Product Name

• Attributes

The Attributes are the various characteristics of the dimension. In the Location
dimension, the attributes can be

✓ State

✓ Country

✓ Zip code etc.

Attributes are used to search, filter, or classify facts. Dimension Tables contain Attributes

• Fact Table

A fact table is a primary table in a dimensional model. A Fact Table contains

✓ Measurements/facts

✓ Foreign key to dimension table

✓ Dimension table

A dimension table contains the dimensions of a fact. They are joined to the fact table via a foreign key. Dimension tables are de-normalized tables. The Dimension Attributes are the various columns in a dimension table.

A dimension offers descriptive characteristics of the facts with the help of their attributes. There is no set limit on the number of dimensions. A dimension can also contain one or more hierarchical relationships.

2.3.4. Steps of Dimensional Modelling

The accuracy in creating your dimensional model determines the success of your data warehouse implementation. Here are the steps to create a dimension model:

1. Identify Business Process

2. Identify Grain (level of detail)

3. Identify Dimensions

4. Identify Facts

5. Build Star

1. Identify the business process

Identifying the actual business process a data warehouse should cover. This could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization. The selection of the business process also depends on the quality of data available for that process. It is the most important step of the data modeling process, and a failure here would have cascading and irreparable defects.

To describe the business process, you can use plain text or use basic Business Process Modelling Notation (BPMN) or Unified Modeling Language (UML).

2. Identify the grain

The Grain describes the level of detail for the business problem/solution. It is the process
of identifying the lowest level of information for any table in your data warehouse. If a table
contains sales data for every day, then it should be daily granularity. If a table contains total
sales data for each month, then it has monthly granularity.

• Example of Grain

✓ The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.

✓ So, the grain is "product sale information by location by the day."

Figure 2.3.3: Step of the Dimension Model

3. Identify the dimensions

Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the
data should be stored. For example, the date dimension may contain data like a year, month
and weekday.

• Example of Dimensions:

The CEO at an MNC wants to find the sales for specific products in different locations on
a daily basis.

✓ Dimensions: Product, Location and Time

✓ Attributes: For Product: Product key (Foreign Key), Name, Type, Specifications

✓ Hierarchies: For Location: Country, State, City, Street Address, Name

4. Identify the Fact

This step is co-associated with the business users of the system because this is where they
get access to data stored in the data warehouse. Most of the fact table rows are numerical
values like price or cost per unit, etc.

• Example of Facts:

The CEO at an MNC wants to find the sales for specific products in different
locations on a daily basis. The fact here is Sum of Sales by product by location by time.

5. Build Schema

In this step, you implement the Dimension Model. A schema is nothing but the database
structure (arrangement of tables). There are two popular schemas

• Star Schema

The star schema architecture is easy to design. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. The fact table in a star schema is in third normal form, whereas the dimension tables are de-normalized.

• Snowflake Schema

The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and connected to further dimension tables.

Popular Schemas – Star Schema, Snow Flake Schema

2.3.5. Rules for dimensional modeling:

Dimensional Data Modeling is one of the data modeling techniques used in data warehouse design. The concept of Dimensional Modeling was developed by Ralph Kimball and is comprised of fact and dimension tables. Since the main goal of this modeling is to improve data retrieval, it is optimized for the SELECT operation. The advantage of using this model is that we can store data in such a way that it is easier to store and retrieve the data once it is stored in a data warehouse. The dimensional model is the data model used by many OLAP systems.

✓ Load atomic data into dimensional structures.

✓ Build dimensional models around business processes.

✓ Need to ensure that every fact table has an associated date dimension table.

✓ Ensure that all facts in a single fact table are at the same grain or level of detail.

✓ It's essential to store report labels and filter domain values in dimension tables

✓ Need to ensure that dimension tables use a surrogate key

✓ Continuously balance requirements and realities to deliver a business solution that supports decision-making.

2.3.6. Benefits of dimensional modeling

✓ Standardization of dimensions allows easy reporting across areas of the business.

✓ Dimension tables store the history of the dimensional information.

✓ It allows an entirely new dimension to be introduced without major disruption to the fact table.

✓ Dimensional modelling also stores data in such a fashion that it is easier to retrieve the information once the data is stored in the database.

✓ Compared to the normalized model, dimensional tables are easier to understand.

✓ Information is grouped into clear and simple business categories.

✓ The dimensional model is readily understandable by the business. This model is based on business terms, so the business knows what each fact, dimension, or attribute means.

✓ Dimensional models are denormalized and optimized for fast data querying. Many relational database platforms recognize this model and optimize query execution plans to aid in performance.

✓ Dimensional modeling creates a schema which is optimized for high performance. It


means fewer joins and helps with minimized data redundancy.

✓ The dimensional model also helps to boost query performance. It is more denormalized and is therefore optimized for querying.

✓ Dimensional models can comfortably accommodate change. Dimension tables can


have more columns added to them without affecting existing business intelligence
applications using these tables.

Understandability. Compared to the normalized model, the dimensional model is easier to


understand and more intuitive. In dimensional models, information is grouped into coherent
business categories or dimensions, making it easier to read and interpret. Simplicity also
allows software to navigate databases efficiently. In normalized models, data is divided into
many discrete entities and even a simple business process might result in dozens of tables
joined together in a complex way.

Query performance. Dimensional models are more denormalized and optimized for data
querying, while normalized models seek to eliminate data redundancies and are optimized for
transaction loading and updating. The predictable framework of a dimensional model allows
the database to make strong assumptions about the data which may have a positive impact on
performance. Each dimension is an equivalent entry point into the fact table, and this
symmetrical structure allows effective handling of complex queries. Query optimization for
star-joined databases is simple, predictable, and controllable.

Extensibility. Dimensional models are scalable and easily accommodate unexpected new
data. Existing tables can be changed in place either by simply adding new data rows into the
table or executing SQL alter table commands. No queries or applications that sit on top of the
data warehouse need to be reprogrammed to accommodate changes. Old queries and
applications continue to run without yielding different results. But in normalized models each
modification should be considered carefully, because of the complex dependencies between
database tables.

2.3.7. Dimensional Modeling Basics:

Dimensional modeling gets its name from the business dimensions we need to
incorporate into the logical data model. It is a logical design technique to structure the
business dimensions and the metrics that are analyzed along these dimensions.

This modeling technique is intuitive for that purpose. The model has also proved to
provide high performance for queries and analysis. The multidimensional information
package diagram we have discussed is the foundation for the dimensional model.
Therefore, the dimensional model consists of the specific data structures needed to
represent the business dimensions. These data structures also contain the metrics or facts.
In Chapter 5, we discussed information package diagrams in sufficient detail. We
specifically looked at an information package diagram for automaker sales. Please go back
and review Figure 5-5 in that chapter. What do you see? In the bottom section of the diagram,
you observe the list of measurements or metrics that the automaker wants to use for analysis.
Next, look at the column headings.
These are the business dimensions along which the automaker wants to analyze the
measurements or metrics. Under each column heading you see the dimension hierarchies and
categories within that business dimension. What you see under each column heading are the
attributes relating to that business dimension.
Reviewing the information package diagram for automaker sales, we notice three
types of data entities: (1) measurements or metrics, (2) business dimensions, and (3)
attributes for each business dimension. So when we put together the dimensional model to
represent the information contained in the automaker sales information package, we need to
come up with data structures to represent these three types of data entities.
Let us discuss how we can do this. First, let us work with the measurements or
metrics seen at the bottom of the information package diagram. These are the facts for
analysis. In the automaker sales diagram, the facts are as follows:
Actual sale price, MSRP sale price, options price, full price, dealer add-ons, dealer credits, dealer invoice, amount of down payment, manufacturer proceeds, amount financed.

Each of these data items is a measurement or fact. Actual sale price is a fact about what
the actual price was for the sale. Full price is a fact about what the full price was relating to
the sale. As we review each of these factual items, we find that we can group all of these into
a single data structure. In relational database terminology, you may call the data structure a
relational table. So the metrics or facts from the information package diagram will form the
fact table. For the automaker sales analysis this fact table would be the automaker sales fact
table. Look at Figure 10-2 showing how the fact table is formed. The fact table gets its name
from the subject for analysis; in this case, it is automaker sales. Each fact item or
measurement goes into the fact table as an attribute for automaker sales. We have determined
one of the data structures to be included in the dimensional model for automaker sales and derived the fact table from the information package diagram. Let us now move on to the other sections of the information package diagram, taking the business dimensions one by one.
Look at the product business dimension in Figure 5-5. The product business dimension is
used when we want to analyze the facts by products. Sometimes our analysis could be a
breakdown by individual models. Another analysis could be at a higher level by product
lines. Yet another analysis could be at even a higher level by product categories. The list of
data items relating to the product dimension are as follows:

Model name, model year, package styling, product line, product category, exterior colour, interior colour, first model year.

What can we do with all these data items in our dimensional model? All of these relate to the product in some way. We can, therefore, group all of these
data items in one data structure or one relational table. We can call this table the product
dimension table. The data items in the above list would all be attributes in this table. Looking
further into the information package diagram, we note the other business dimensions shown
as column headings. In the case of the automaker sales information package diagram, these
other business dimensions are dealer, customer demographics, payment method, and time.
Just as we formed the product dimension table, we can form the remaining dimension tables
of dealer, customer demographics, payment method, and time. The data items shown within
each column would then be the attributes for each corresponding dimension table. Figure 10-
3 puts all of this together. It shows how the various dimension tables are formed from the
information package diagram. Look at the figure closely and see how each dimension table is formed.

So far we have formed the fact table and the dimension tables. How should these tables be arranged in the dimensional model? What are the relationships and how should we
mark the relationships in the model? The dimensional model should primarily facilitate
queries and analyses. What would be the types of queries and analyses? These would be
queries and analyses where the metrics inside the fact table are analyzed across one or more
dimensions using the dimension table attributes. Let us examine a typical query against the
automaker sales data. How much sales proceeds did the Jeep Cherokee, Year 2000 Model
with standard options, generate in January 2000 at Big Sam Auto dealership for buyers who
own their homes and who took 3-year leases, financed by Daimler-Chrysler Financing? We
are analyzing actual sale price, MSRP sale price, and full price. We are analyzing these facts
along attributes in the various dimension tables. The attributes in the dimension tables act as
constraints and filters.

Figure 2.3.4: Formation of the automaker dimension tables.

2.4. OLAP Operations

2.4.1. OLAP cube

OLAP is a category of software that allows users to analyze information from multiple
database systems at the same time. It is a technology that enables analysts to extract and view
business data from different points of view. OLAP stands for Online Analytical Processing.

Analysts frequently need to group, aggregate and join data. These operations in relational
databases are resource intensive. With OLAP data can be pre-calculated and pre-aggregated,
making analysis faster.

OLAP databases are divided into one or more cubes. The cubes are designed in such a
way that creating and viewing reports become easy.

At the core of the OLAP concept is the OLAP cube.

The OLAP cube is a data structure optimized for very quick data analysis (Figure 2.4.1). The OLAP cube consists of numeric facts called measures which are categorized by dimensions. The OLAP cube is also called a hypercube.

Figure 2.4.1: OLAP cube

Usually, data operations and analysis are performed using the simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional data.
However, OLAP contains multidimensional data, with data usually obtained from different and unrelated sources. Using a spreadsheet is not an optimal option. The cube can store and
analyze multidimensional data in a logical and orderly manner.

• How does it work?

A data warehouse extracts information from multiple data sources and formats like text files, Excel sheets, multimedia files, etc. The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or OLAP cube) where information is pre-calculated in advance for further analysis.

2.4.2. Basic operations of OLAP

Four types of analytical operations in OLAP are:

1. Roll-up

2. Drill-down

3. Slice and dice

4. Pivot (rotate)

1. Roll-up

Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in two ways:

a) Reducing dimensions.

b) Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.

Consider Figure 2.4.2 for the roll-up explanation.

Figure 2.4.2 Roll Up

▪ In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
▪ The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become 2000 after roll-up.
▪ In this aggregation process, data in the location hierarchy moves up from city to country.
▪ In the roll-up process at least one or more dimensions need to be removed. In this example, the Quarter dimension is removed.

2. Drill-down

In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:

▪ Moving down the concept hierarchy
▪ Increasing a dimension

Figure 2.4.3: Drill Down

From Figure 2.4.3, the drill-down is explained as follows:

▪ Quarter Q1 is drilled down to the months January, February, and March, and the corresponding sales are also registered.
▪ In this example, the dimension Months is added.

3. Slice:

Here, one dimension is selected, and a new sub-cube is created. Figure 2.4.4 explains how the slice operation is performed:

▪ Dimension Time is Sliced with Q1 as the filter.


▪ A new cube is created altogether.

Figure 2.4.4: Slice Operation

4. Dice

This operation is similar to a slice. The difference in dice is you select 2 or more
dimensions that result in the creation of a sub-cube.

Figure 2.4.5: Dice Operation
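The following hypothetical Python/pandas sketch mimics the roll-up, slice, and dice operations on a toy sales cube; the figures loosely echo the New Jersey / Los Angeles example above, and the column names are assumptions.

```python
# Hypothetical sketch of OLAP-style operations on a tiny sales cube using pandas.
import pandas as pd

cube = pd.DataFrame({
    "city": ["New Jersey", "Los Angeles", "New Jersey", "Los Angeles"],
    "country": ["USA", "USA", "USA", "USA"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item": ["Mobile", "Mobile", "Modem", "Modem"],
    "sales": [440, 1560, 500, 1200],
})

# Roll-up: climb the location hierarchy from city to country (a dimension is reduced).
rollup = cube.groupby(["country", "item"])["sales"].sum()

# Drill-down would go the other way, e.g. from quarter to month, if monthly data existed.

# Slice: fix one dimension (quarter = Q1) to obtain a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[(cube["quarter"] == "Q1") & (cube["city"].isin(["New Jersey", "Los Angeles"]))]

print(rollup, slice_q1, dice, sep="\n\n")
```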

5. Pivot

In pivot, you rotate the data axes to provide an alternative presentation of the data. In Figure 2.4.6, the pivot is based on item types.

Figure 2.4.6: Pivot Operation

2.5. Meta Data

Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata.

For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.

• Metadata is the road-map to a data warehouse.

• Metadata in a data warehouse defines the warehouse objects.

• Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.

2.5.1. Categories of Metadata

Metadata can be broadly categorized into three categories as shown in Figure 2.5.1:

• Business Metadata − It has the data ownership information, business definition, and
changing policies.

• Technical Metadata − It includes database system names, table and column names and
sizes, data types and allowed values. Technical metadata also includes structural information
such as primary and foreign key attributes and indices.

• Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history of
data migrated and transformation applied on it.

Figure 2.5.1: Categories of Metadata

2.5.2. Role of Metadata

Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data. The various roles of metadata are explained
below (figure 2.5.2).

• Metadata acts as a directory.

• This directory helps the decision support system to locate the contents of the data
warehouse.

• Metadata helps in decision support system for mapping of data when data is transformed
from operational environment to data warehouse environment.

• Metadata helps in summarization between current detailed data and highly summarized
data.

• Metadata also helps in summarization between lightly detailed data and highly summarized
data.

• Metadata is used for query tools.

• Metadata is used in extraction and cleansing tools.

• Metadata is used in reporting tools.

• Metadata is used in transformation tools.

• Metadata plays an important role in loading functions

Figure 2.5.2: Roles of Metadata

2.5.3. Metadata Repository

Metadata repository is an integral part of a data warehouse system. It has the following
metadata;
• Definition of data warehouse − It includes the description of structure of data warehouse.
The description is defined by schema, view, hierarchies, derived data definitions, and data
mart locations and contents.
• Business metadata − It contains the data ownership information, business definition, and changing policies.
• Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history of
data migrated and transformation applied on it.
• Data for mapping from operational environment to data warehouse − It includes the
source databases and their contents, data extraction, data partition cleaning, transformation
rules, data refresh and purging rules.
• Algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.

2.5.4. Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, you keep the information about the logical data structures, the information about the files and addresses, the information about the indexes, and so on. The data dictionary contains data about the data in the database.

2.5.5. Metadata in the Data Warehouse

Think of metadata as the Yellow Pages of your town. In almost the same manner, the metadata component serves as a directory of the contents of your data warehouse. Because of the importance of metadata in a data warehouse, it is discussed in detail in the sections that follow.

Why is metadata especially important in a data warehouse?

1. First, it acts as the glue that connects all parts of the data warehouse.

2. Next, it provides information about the contents and structures to the developers.

3. Finally, it opens the door to the end-users and makes the contents recognizable in their
own terms

2.6. Types of Metadata

Metadata in a data warehouse fall into three major categories:

1. Operational Metadata

2. Extraction and Transformation Metadata

3. End-User Metadata

2.6.1. Operational Metadata

Data for the data warehouse comes from several operational systems of the enterprise. These source systems contain different data structures. The data elements selected for the data warehouse have various field lengths and data types. In selecting data from the source systems for the data warehouse, you split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When you deliver information to the end-users, you must be able to tie it back to the original source data sets. Operational metadata contain all of this information about the operational data sources.

2.6.2. Extraction and Transformation Metadata

Extraction and transformation metadata contain data about the extraction of data from the source systems, namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this category of metadata contains information about all the data transformations that take place in the data staging area.

2.6.3. End-User Metadata

The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find information from the data warehouse and allows them to use their own business terminology and look for information in those ways in which they normally think of the business.

2.6.4. Another variation in Types of Metadata
There are only three main types, but it’s important to understand each type and how they function to make your assets more easily discoverable. So, if you’re not sure what the difference is between structural metadata, administrative metadata, and descriptive metadata (spoiler alert: those are the three main types of metadata), let’s clear up the confusion.

a) Structural Metadata

Let’s start with the basics. Structural metadata is data that indicates how a digital asset is
organized, such as how pages in a book are organized to form chapters, or the notes that
make up a notebook in Evernote or OneNote. Structural metadata also indicates whether a
particular asset is part of a single collection or multiple collections and facilitates the
navigation and presentation of information in an electronic resource. Examples include:

• Page numbers

• Sections

• Chapters

• Indexes

• Table of contents

Beyond basic organization, structural metadata is the key to documenting the relationship
between two assets. For example, it’s used to indicate that a specific stock photo was used in
a particular sales brochure, or that one asset is a raw, unedited version of another.

b) Administrative Metadata

Administrative metadata relates to the technical source of a digital asset. It includes data
such as the file type, as well as when and how the asset was created. This is also the type of
metadata that relates to usage rights and intellectual property, providing information such as
the owner of an asset, where and how it can be used, and the duration a digital asset can be
used for those allowable purposes under the current license.

The National Information Standards Organization (NISO) actually breaks administrative metadata down into three sub-types:

• Technical Metadata – Information necessary for decoding and rendering files

• Preservation Metadata – Information necessary for the long-term management and archiving of digital assets

• Rights Metadata – Information pertaining to intellectual property and usage rights

A Creative Commons license, for instance, is administrative metadata. Other examples include the date a digital asset was created, and for photos, administrative data might include the camera model used to take the photo, light source, and resolution. In addition, administrative metadata is used to indicate who can access a digital asset, the key to effective permissions management in a DAM system.

c) Descriptive Metadata

Descriptive metadata is essential for discovering and identifying assets. Why? It’s
information that describes the asset, such as the asset’s title, author, and relevant keywords.
Descriptive metadata is what allows you to locate a book in a particular genre published after
2016, for instance, as a book’s metadata would include both genre and publication date. In
fact, the ISBN system is a good example of an early effort to use metadata to centralize
information and make it easier to locate resources (in this case, books in a traditional library).

Essentially, descriptive metadata includes any information describing the asset that can be
used for later identification and discovery. According to Cornell University, this includes:

• Unique identifiers (such as an ISBN)

• Physical attributes (such as file dimensions or Pantone colors)

• Bibliographic attributes (such as the author or creator, title, and keywords)

Descriptive metadata can be the most robust of all the types of metadata, simply because
there are many ways to describe an asset. When implementing a DAM solution, standardizing
the specific attributes used to describe your assets and how they’re documented is the key to
streamlined discoverability.

Unit- III

Data pre-processing and characterization:

Data cleaning – Data Integration and Transformation – Data reduction – Data mining Query language – Generalization – Summarization – Association rule mining.

2.3 Data cleaning:

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

2.3.1 Missing Values:

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. To fill in the missing values for this attribute, consider the following methods.

(I) Ignore the Tuple:

This is usually done when the class label is missing. This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.

(II) Fill in the missing value manually:

This approach is time consuming and may not be feasible given a large data set with
many missing values.

(III) Use a global constant to fill in the missing value:

Replace all missing attribute values by the same constant, such as a label like "unknown" or −∞. If missing values are replaced by, say, "unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "unknown".

(IV) Use the attribute mean to fill in the missing value:

For example: suppose that the average income of AllElectronics customers is $56,000. Use this value to replace the missing value for income.

(V) Use the attribute mean for all samples belonging to the same class as the given tuple:

If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

(VI) Use the most probable value to fill in the missing value:

This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
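The following is a minimal Python sketch (not from the text) illustrating methods (III), (IV), and (V) above. The record layout, attribute names, and income values are assumed for illustration only.

# Sketch: filling in missing 'income' values (assumed sample data)
records = [
    {"credit_risk": "low",  "income": 58000},
    {"credit_risk": "low",  "income": None},      # missing value
    {"credit_risk": "high", "income": 31000},
    {"credit_risk": "high", "income": None},      # missing value
    {"credit_risk": "low",  "income": 62000},
]

# (III) Use a global constant, e.g. the label "unknown"
filled_constant = [dict(r, income=(r["income"] if r["income"] is not None else "unknown"))
                   for r in records]

# (IV) Use the overall attribute mean
known = [r["income"] for r in records if r["income"] is not None]
overall_mean = sum(known) / len(known)

# (V) Use the mean of the same class (here, the same credit_risk category)
def class_mean(cls):
    vals = [r["income"] for r in records
            if r["credit_risk"] == cls and r["income"] is not None]
    return sum(vals) / len(vals)

filled_by_class = [dict(r, income=(r["income"] if r["income"] is not None
                                   else class_mean(r["credit_risk"])))
                   for r in records]

print(overall_mean)        # mean over the known incomes
print(filled_by_class)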

2.3.2 Noisy Data:

Noise is a random error or variance in a measured variable. Given a numerical attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let us look at the following data smoothing techniques:

Partition into (equal-frequency) bins:

Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:

Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:

Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Figure 2.11 Binning methods for data smoothing

1. Binning:

Binning methods smooth a sorted data value by consulting its "neighbourhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.

The data for price are first sorted and then partitioned into equal-frequency bins of size 3. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.

In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
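A minimal Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, using the sorted price values from Figure 2.11. The helper names are illustrative, not from the text.

# Sketch: equal-frequency binning and smoothing (values from Figure 2.11)
sorted_prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3   # equal-frequency bins of size 3

bins = [sorted_prices[i:i + bin_size] for i in range(0, len(sorted_prices), bin_size)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest of min/max of its bin
def closest_boundary(value, b):
    lo, hi = min(b), max(b)
    return lo if abs(value - lo) <= abs(value - hi) else hi

by_boundaries = [[closest_boundary(v, b) for v in b] for b in bins]

print(bins)            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]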

2. Regression:

Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the best line to fit two attributes, so that one attribute can be used
to predict the other.

Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface.

Figure A 2-D plot of customer data with respect to customer locations in a city, showing
three data cluster. Each cluster centroid is marked with a “+”, representing the average point
in space for that cluster. Outliers may be detected as values that fall outside of sets of cluster.

3. Clustering:

Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.

2.3.3 Data cleaning as a process:

Missing values, noise, and inconsistencies contribute to inaccurate data. We have looked at techniques for handling missing data and for smoothing data. But data cleaning is a big job. What about data cleaning as a process? How exactly does one proceed in tackling this task? Are there any tools out there to help?
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human errors in data entry, and deliberate errors.

There may also be inconsistencies due to data integration. So how can we proceed with discrepancy detection? As a starting point, use any knowledge you may already have regarding properties of the data. Such knowledge, or "data about data", is referred to as metadata.

As a data analyst, you should be on the lookout for the inconsistent use of codes and any inconsistent data representations, such as "2004/12/25" and "25/12/2004" for date. Field overloading is another source of errors that typically results when developers squeeze new attribute definitions into unused portions of already defined attributes.

The data should also be examined regarding unique rules, consecutive rules, and null rules. A unique rule says that each value of the given attribute must be different from all other values for that attribute.

A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute and that all values must also be unique. A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.

The reasons for missing values may include: (i) the person originally asked to provide a value for the attribute refuses or finds that the information requested is not applicable; (ii) the data entry person does not know the correct value; (iii) the value is to be provided by a later step of the process.

The null rule should specify how to record the null condition, such as to store zero for numerical attributes, a blank for character attributes, or any other convention that may be in use.

There are a number of different commercial tools that can aid in the step of discrepancy detection. Data scrubbing tools use simple domain knowledge to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.

Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and detecting data that violate such conditions. They are variants of data
mining tools.

Commercial tools can assist in the data transformation step. Data migration tools allow simple transformations to be specified, such as to replace the string "gender" by "sex". ETL (extraction/transformation/loading) tools allow users to specify transformations through a graphical user interface.

2.4 Data Integration and Transformation:

Data integration merges data from multiple data sources. The data may also need to be transformed into forms appropriate for mining.

2.4.1 Data Integration:

The data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.

Schema integration and object matching can be tricky: how can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.

Ex: how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute?

Metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, as well as null rules for handling blank, zero, or null values.

Redundancy is another important issue. An attribute may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.

Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.

r_{A,B} = Σ_{i=1}^{N} (a_i − Ā)(b_i − B̄) / (N σ_A σ_B) = ( Σ_{i=1}^{N} (a_i b_i) − N Ā B̄ ) / (N σ_A σ_B)        (2.8)

where N is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ (a_i b_i) is the sum of the AB cross-product. Note that −1 ≤ r_{A,B} ≤ +1. If r_{A,B} is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.

The higher the value, the stronger the correlation. Hence, a high value may indicate that A (or B) may be removed as a redundancy. If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
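A short Python sketch (illustrative only) of Equation (2.8), computing the correlation coefficient of two numeric attributes from paired values. The sample lists a and b are assumed data, not from the text.

# Sketch: correlation coefficient r_A,B of two numeric attributes (Equation 2.8)
from math import sqrt

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # population standard deviations
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / (n * std_a * std_b)

a = [2, 4, 6, 8, 10]      # assumed sample values
b = [1, 3, 5, 7, 11]
print(correlation(a, b))  # close to +1: strongly positively correlated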
For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test:

χ² = Σ_{i=1}^{c} Σ_{j=1}^{r} ( (o_ij − e_ij)² / e_ij )        (2.9)

where o_ij is the observed frequency of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be computed as

e_ij = ( count(A = a_i) × count(B = b_j) ) / N        (2.10)

where N is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The sum in Equation (2.9) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those whose actual count is very different from that expected.

Table 2.2

                  male         female        Total
fiction           250 (90)     200 (360)     450
non_fiction       50 (210)     1000 (840)    1050
Total             300          1200          1500

Table 2.2 shows a 2×2 contingency table for the data of the example: are gender and preferred reading correlated? The expected frequencies are shown in parentheses. The χ² statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom.
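A Python sketch (illustrative) of Equations (2.9) and (2.10) applied to the observed counts of Table 2.2; the expected frequencies it computes agree with the values shown in parentheses in the table.

# Sketch: chi-square test on the 2x2 contingency table of Table 2.2
observed = [[250, 200],      # fiction:      male, female
            [50, 1000]]      # non-fiction:  male, female

row_totals = [sum(row) for row in observed]              # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]        # [300, 1200]
n = sum(row_totals)                                      # 1500

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n         # Equation (2.10)
        chi_square += (o_ij - e_ij) ** 2 / e_ij          # Equation (2.9)

print(round(chi_square, 1))   # about 507.9, with (2-1)(2-1) = 1 degree of freedom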

2.4.2 Data Transformation:

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Smoothing: this works to remove noise from the data. Such techniques include binning, regression, and clustering.

Aggregation:

Where summary or aggregation operation are applied to the data.

Ex: the daily sales data may be aggregated so as to compute monthly and annual total
amount.

Generalization of the data, where low-level or primitive data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes such as street can be generalized to higher-level concepts like city or country.

Normalization where the attributed data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0 or 0.0 to 1.0

Attribute construction, where new attributes are constructed and added from the given set of attributes to help the mining process.

Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A.

Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out of bounds" error if a future input case for normalization falls outside of the original data range for A.

Example:

In z-score normalization, a value v of A is normalized to v' = (v − Ā) / σ_A, where Ā and σ_A are the mean and standard deviation of A. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.

Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
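A Python sketch (illustrative only) of the three normalization methods. The income example reproduces the z-score value of 1.225 given above; the min/max bounds of $12,000 and $98,000 and the decimal-scaling values are assumed for illustration.

# Sketch: min-max, z-score, and decimal-scaling normalization
def min_max(v, min_a, max_a, new_min, new_max):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / (10 ** j) >= 1:    # smallest j such that max(|v'|) < 1
        j += 1
    return v / (10 ** j)

# income example from the text: mean $54,000, standard deviation $16,000
print(z_score(73600, 54000, 16000))           # 1.225

# assumed bounds: map income in [12000, 98000] to the range [0.0, 1.0]
print(min_max(73600, 12000, 98000, 0.0, 1.0))

# assumed maximum absolute value 986, so j = 3
print(decimal_scaling(917, 986))              # 0.917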

2.5 Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Mining on the reduced data set should be more efficient, yet produce the same (or almost the same) analytical results.

(i) Data cube aggregation; where aggregation operations are applied to the data in
the construction of a data cube.

(ii) Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

(iii) Dimensionality reduction, where encoding mechanisms are used to reduce the
data set size.

(iv) Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representation. Such as parametric models or non parametric
methods. Such as clustering, sample and the use of histograms.

(v) Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.

2.5.1 Data Cube Aggregation:

Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter, for the years 2002 to 2004. You are, however, interested in the annual sales rather than the total per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.

Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer.
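A small Python sketch (with assumed quarterly figures, for illustration only) of the aggregation described above: quarterly sales rolled up to annual totals.

# Sketch: aggregating quarterly sales to annual sales (assumed figures)
from collections import defaultdict

quarterly_sales = [      # (year, quarter, sales)
    (2002, "Q1", 224000), (2002, "Q2", 408000), (2002, "Q3", 350000), (2002, "Q4", 586000),
    (2003, "Q1", 310000), (2003, "Q2", 402000), (2003, "Q3", 390000), (2003, "Q4", 612000),
]

annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount       # roll up: sum out the quarter dimension

print(dict(annual_sales))              # {2002: 1568000, 2003: 1714000}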

2.5.2 Attributed subset selection:

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the classes is as close as possible to the original distribution obtained using all attributes.

Greedy (heuristic) methods for attribute subset selection.

(i) Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set at each step.

(ii) Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

(iii) Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined, so that at each step the procedure selects the best attribute and removes the worst from among the remaining attributes.

(iv) Decision tree induction: Decision tree algorithms, such as ID3, C4.5 and CART, were originally intended for classification. Decision tree induction constructs a flowchart-like structure in which each internal node denotes a test on an attribute.
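A simplified Python sketch of stepwise forward selection, item (i) above. The scoring function is a stand-in (assumed) for whatever attribute-quality measure is used, such as information gain; the attribute names and relevance values are hypothetical.

# Sketch: greedy stepwise forward selection with a pluggable scoring function
def forward_selection(attributes, score, k):
    """Greedily add the attribute that most improves the score until k are chosen.
    `score(subset)` is an assumed callback returning a quality value for a subset."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy scoring function (assumed): prefers a fixed ranking of attributes
relevance = {"income": 3.0, "age": 2.0, "credit_rating": 1.5, "phone_number": 0.1}
score = lambda subset: sum(relevance[a] for a in subset)

print(forward_selection(relevance.keys(), score, k=2))   # ['income', 'age']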

Dimensionality Reduction:

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data.

If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.

If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.

Wavelet Transformation:

The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.

The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.

Figure: Examples of wavelet families. The number next to a wavelet name is the number of vanishing moments of the wavelet; this is a set of mathematical relationships that the coefficients must satisfy and is related to the number of coefficients.

(i) The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).

(ii) Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

(iii) The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x_2i, x_2i+1). This results in two sets of data of length L/2.

(iv) The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets obtained are of length 2.

(v) Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
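A minimal Python sketch of the procedure above using the simple Haar wavelet, with pairwise averages as the smoothing function and pairwise half-differences as the detail function. Padding and normalization conventions vary between implementations, so this is only an assumed, unnormalized variant.

# Sketch: one-dimensional Haar wavelet transform by repeated averaging and differencing
def haar_dwt(x):
    data = list(x)
    # (i) pad with zeros so the length is a power of 2
    while len(data) & (len(data) - 1):
        data.append(0)
    details = []
    # (iv) recursively apply the two functions until one approximation value remains
    while len(data) > 1:
        avgs  = [(data[2*i] + data[2*i + 1]) / 2 for i in range(len(data) // 2)]  # smoothing
        diffs = [(data[2*i] - data[2*i + 1]) / 2 for i in range(len(data) // 2)]  # detail
        details = diffs + details
        data = avgs
    return data + details        # (v) wavelet coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))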

Principal Components Analysis:

The data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.


(i) The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with larger domains will not dominate attributes with smaller domains.

(ii) PCA computes k orthonormal vectors that provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.

(iii) The principal components are sorted in order of decreasing "significance" or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance.

(iv) Because the components are sorted according to decreasing order of "significance", the size of the data can be reduced by eliminating the weaker components, that is, those with low variance.
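A compact numpy sketch (illustrative, not the only way) of the four PCA steps above: normalize, compute orthonormal components, sort by significance, and keep the strongest k. The sample matrix X is assumed data.

# Sketch: principal components analysis via eigendecomposition of the covariance matrix
import numpy as np

def pca(X, k):
    # (i) normalize so each attribute falls within the same range (z-score here)
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # (ii) orthonormal vectors: eigenvectors of the covariance matrix
    cov = np.cov(Xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # (iii) sort components by decreasing "significance" (variance explained)
    order = np.argsort(eigvals)[::-1]
    # (iv) keep only the k strongest components and project the data onto them
    components = eigvecs[:, order[:k]]
    return Xn @ components

X = np.array([[2.5, 2.4, 1.0], [0.5, 0.7, 0.2], [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9], [3.1, 3.0, 1.4]])
print(pca(X, k=2).shape)    # (5, 2): five tuples reduced to two dimensions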

2.5.4 Numerosity Reduction:

These techniques may be parametric or non-parametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.

Log linear models, which estimate discrete multidimensional probability
distribution, are an example. Nonparametric methods for storing reduced representation of
the data include histograms, clustering and sampling.

Regression and Log Linear Models:

Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line.

A random variable, y, can be modeled as a linear function of another random variable, x, with the equation

y = w x + b,

where the variance of y is assumed to be constant. In the context of data mining, x and y are numerical database attributes. The coefficients w and b specify the slope of the line and the y-intercept, respectively.
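A least-squares sketch in Python (illustrative) for fitting w and b in y = wx + b; the sample points are assumed data.

# Sketch: fitting y = w*x + b by ordinary least squares
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5]                 # assumed sample points
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
w, b = fit_line(xs, ys)
print(w, b)                          # slope close to 2, intercept close to 0
print(w * 6 + b)                     # use the fitted line to predict y at x = 6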

Multiple linear regression is an extension of linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.

Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions, we can consider each tuple as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.

Regression and log linear models can both be used on sparse data, although their
application may be limited. While both methods can handle skewed data, regression does
exceptionally well.

Histograms:

Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.

Example (histogram): The following data are a list of prices of commonly sold items at AllElectronics. The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 28, 28, 30, 30, 30.

To further reduce the data, it is common to have each bucket denote a continuous range of values for the given attribute.

Equal-width: In an equal-width histogram, the width of each bucket range is uniform.

Equal-frequency: In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant.

V-optimal: If we consider all of the possible histograms for a given number of buckets, the V-optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.

MaxDiff:

In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for pairs having the β − 1 largest differences, where β is the user-specified number of buckets.

An equal-width histogram for price,where values are aggregated so that each bucket has a
uniform width of $10.
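A Python sketch (illustrative) of an equal-width histogram with $10-wide buckets over the sorted price list given above; the bucket indexing convention is an assumption of this sketch.

# Sketch: equal-width histogram with buckets of width 10 over the price list
from collections import Counter

prices = [1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,
          18,18,18,18,20,20,20,20,20,20,21,21,21,21,25,25,25,25,28,28,30,30,30]

width = 10
buckets = Counter((p - 1) // width for p in prices)   # bucket 0 = 1-10, 1 = 11-20, 2 = 21-30

for b in sorted(buckets):
    low, high = b * width + 1, (b + 1) * width
    print(f"{low}-{high}: {buckets[b]} values")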

Clustering:

Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and dissimilar to objects in other clusters.

Similarity is commonly defined in terms of how close the objects are in space, based on a distance function. The quality of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.

The root of a B+-tree for a given set of data .

In data reduction, the cluster representation of the data is used to replace the actual data. The effectiveness of this technique depends on the nature of the data. It is much more effective for data that can be organized into distinct clusters than for smeared data.

Sampling:

Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample of the data. Suppose that a large data set, D, contains N tuples.

Figure Sampling can be used for data reduction.

Simple random sample without replacement (SRSWOR) of size s:

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.

Simple random sample with replacement (SRSWR) of size s:

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

Cluster sample:

If the tuples in D are grouped into M mutually disjoint "clusters", then a simple random sample of s clusters can be obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster.

Stratified sample:

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining a simple random sample at each stratum. This helps ensure a representative sample, especially when the data are skewed.
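A Python sketch (illustrative) of the four sampling schemes using the standard library random module; the toy data set, page size, strata, and sample sizes are assumptions made for this example.

# Sketch: SRSWOR, SRSWR, cluster sampling, and stratified sampling
import random
random.seed(42)

D = list(range(1, 101))                         # a toy data set of N = 100 "tuples"

srswor = random.sample(D, 10)                   # without replacement: 10 distinct tuples
srswr  = [random.choice(D) for _ in range(10)]  # with replacement: duplicates possible

# cluster sample: group tuples into "pages" of 20 and draw s = 2 whole clusters
pages = [D[i:i + 20] for i in range(0, len(D), 20)]
cluster_sample = [t for page in random.sample(pages, 2) for t in page]

# stratified sample: draw from each stratum (here, odd vs. even tuples)
strata = {"odd": [t for t in D if t % 2], "even": [t for t in D if t % 2 == 0]}
stratified = [t for group in strata.values() for t in random.sample(group, 5)]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))   # 10 10 40 10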

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, as opposed to N, the data set size.

Sampling complexity is thus potentially sublinear to the size of the data. Other data reduction techniques can require at least one complete pass through D.

DATA MINING-QUERY LANGUAGE

A DMQL can provide the ability to support ad hoc and interactive data mining by providing a standardized language like SQL:

* To achieve a similar effect to that which SQL has on relational databases

* To provide a foundation for system development and evolution

* To facilitate information exchange, technology transfer, commercialization and wide acceptance.

Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitation, and underlying mechanisms of the various kinds of data mining tasks.

Syntax for specification of

* The set of task-relevant data to be mined.

* The kind of knowledge to be mined.

* The background knowledge to be used in the discovery process.

* The interestingness measures and thresholds for pattern evaluation.

* The expected representation for visualizing the discovered patterns.


Syntax for task – relevant data specification:

* use database database_name, or use data warehouse data_warehouse_name

* from relation(s)/cube(s) [where condition]

* in relevance to att_or_dim_list

* order by order_list

* group by grouping_list

* having condition

Example:

This example shows how to use DMQL to specify the task-relevant data, the
mining of associations between items frequently purchased at AB Company by Sri Lankan
customers, with respect to customer income and age. In addition, the user specifies that the
data are to be grouped by date. The data are retrieved from a relational database.

use database ABCompany_db
in relevance to I.name, I.price, C.income, C.age
from customer C, item I, purchases P, items_sold S
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID and P.cust_ID = C.cust_ID
and C.country = "Sri Lanka"
group by P.date

Syntax for Specifying the Kind of Knowledge to be mined:

The (Mine_Knowledge_Specification) statement is used to specify the kind of knowledge to be mined. In other words, it indicates the data mining functionality to be performed. Its syntax is defined below for characterization, discrimination, association, and classification.

Characterization:

(Mine_Knowledge_Specification) ::= mine characteristics [as (pattern_name)]
analyze (measure(s))

This specifies that characteristic descriptions are to be mined. The analyze clause, when used for characterization, specifies aggregate measures, such as count, sum, or count% (percentage count, i.e., the percentage of tuples in the relevant data set with the specified characteristics). These measures are to be computed for each data characteristic found.

Syntax for Concept Hierarchy Specification

Concept hierarchies allow the mining of knowledge at multiple levels of


abstraction. In order to accommodate the different viewpoints of users with regard to the
data, there may be more than one concept hierarchy per attribute or dimension. For instance,
some users may prefer to organize them according to languages used. In such cases,

* A user can indicate which concept hierarchy is to be used with the statement: use hierarchy (hierarchy_name) for (attribute_or_dimension). Otherwise, a default hierarchy per attribute or dimension is used.

* Use different syntax to define different types of hierarchies:

schema hierarchies:
define hierarchy time_hierarchy on date as [date, month, quarter, year]

set-grouping hierarchies:
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

operation-derived hierarchies:
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)

Syntax for Interestingness Measure Specification

The user can help control the number of uninteresting patterns returned by the data mining system by specifying measures of pattern interestingness and their corresponding thresholds. Interestingness measures and thresholds can be specified by the user with the statement:

with [(interest_measure_name)] threshold = (threshold_value)

Example:

with support threshold=0.05

with confidence threshold=0.7

Syntax for Pattern Presentation and Visualization Specification

"How can users specify the forms of presentation and visualization to be used in
displaying the discovered patterns in one or more forms, including rules, tables cross tabs, pie
or bar charts, decision trees, cubes, curves or surface-We define the DMQL display statement
for this purpose; display a

(Result_form)

Where the (result_form) could be any of the knowledge presentation or


visualization forms listed above.

Interactive mining should allow the discovered patterns to be viewed at different concept levels or from different angles. This can be accomplished with roll-up and drill-down operations. Patterns can be rolled up, or viewed at a more general level, by climbing up the concept hierarchy of an attribute or dimension (replacing lower-level concept values by higher-level values). Generalization can also be performed by dropping attributes or dimensions.

The user can alternately view the patterns at different levels of abstractions with
the use of following DMQL syntax:

(Multilevel_Manipulation) ::= roll up on (attribute_or_dimension)

| drill down on (attribute_or_dimension)

| add (attribute_or_dimension)

| drop (attribute_or_dimension)

Putting all together-An example of a DMQL query

In the above discussion, we presented DMQL syntax for specifying data mining queries in terms of the five data mining primitives. For a given query, these primitives define the task-relevant data, the kind of knowledge to be mined, the concept hierarchies and interestingness measures to be used, and the representation forms for pattern visualization. Here we put these components together. Let's look at an example of the full specification of a DMQL query.

The full specification of a DMQL query

use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchases P, items_sold S, works_at W, branch B
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID and P.cust_ID = C.cust_ID
and P.method_paid = "AmEx" and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
and B.address = "Canada" and I.price >= 100
with noise threshold = 0.05
display as table

Unit- III

2-Marks Question and Answers

Data pre-processing and characterization:

Data cleaning – Data Integration and Transformation – Data reduction – Data mining Query language – Generalization – Summarization – Association rule mining.

1. Define data pre-processing techniques?

Data cleaning

Data integration

Data transformation

Data reduction

2. Explain the smoothing techniques?

Binning

Clustering

Regression

3. Explain data transformation in detail?

Smoothing

Aggregation

Generalization

Normalization

Attribute construction

4. Explain normalization in detail?

Min max normalization

Z-score normalization

Normalization by decimal scaling

5. Explain data reduction?

Data cube aggregation

Attribute subset selection

Dimensional reduction

Numerosity reduction

6. Explain parametric methods and non-parametric methods of reduction?

Parametric:

Regression model

Log linear model

Non-parametric:

Sampling

Histogram

Clustering

7. Explain data discrimination and concept hierarchy generation?

Segmentation by natural partitioning

Binning

Histogram analysis

Cluster analysis

8. Explain data mining primitives?

Task relevant data

Kinds of knowledge to be mined

Concept hierarchies

Interesting measures

9. What is data cleaning?

Data cleaning means removing the inconsistent data or noise and collecting necessary
information of a collection of interrelated data.

10. What are the types of concept hierarchies?

A concept hierarchy defines a sequence of mappings from a set of low-level concepts


to higher-level, more general concepts. Concept hierarchies allow specialization, or drilling
down, where by concept values are replaced by lower-level concepts.

11. Write the strategies for data reduction

Data cube aggregation

Attribute subset selection

Dimensionality reduction

Numerosity reduction

Discretization and concept hierarchy generation

12. Why is it important to have data mining query language?

The design of an effective data mining query language requires a deep understanding
of the power, limitation, and underlying mechanisms of the various kinds of data mining
tasks.

13. List the five primitives for specifying a data mining task.

The set of task-relevant data to be mined

The kind of knowledge to be mined

The background knowledge to be used in the discovery process

The interestingness measures and thresholds for pattern evaluation

The expected representation for visualizing the discovered pattern

14. What is data generalization?

It is a process that abstracts a large set of task-relevant data in a database from relatively low conceptual levels to higher conceptual levels. Two approaches for generalization:

1) Data cube approach

2) Attribute-oriented induction approach

15. How concept hierarchies are useful in data mining?

A concept hierarchy for a given numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts (such as youth, middle-aged, or senior). Although detail is lost by such data generalization, the generalized data may be more meaningful and easier to interpret.

16. How do you clean the data?

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

For missing values

1. Ignore the tuple


2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class as the given
tuple.
6. Use the most probable value
For noisy Data

1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it.
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters".

17. Define DMQL

Data Mining Query Language

It specifies clauses and syntaxes for performing different types of data mining tasks
for examples data classification, data mining association rules. Also it uses SQL-like syntaxes
to mine databases.

18. Define Data Mining –Query Language.

A DMQL can provide the ability to support ad-hoc and interactive data mining. By
providing a standing a standardized language like SQL

1. To achieve a similar effect like that SQL has on relational database


2. Foundation for system development and evolution
3. Facilitate information exchange, technology transfer, commercialization and wide
acceptance.

19. Define Association Rule Mining.

1. Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases.

2. It is intended to identify strong rules discovered in databases using different measures of interestingness.

20. When we can say the association rules are interesting?

Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. Users or domain can set such thresholds.

21. Define support and confidence in association rule mining.

Support s is the percentage of transactions in D that contain A ∪ B.

Confidence c is the percentage of transactions in D containing A that also contain B.

Support(A ⇒ B) = P(A ∪ B)

Confidence(A ⇒ B) = P(B | A)
22. How is association rules mined from large database?

Find all frequent item sets:

Generate strong association rules from frequent item sets

23. Describe the different classification of association rule mining.

Types of values handled in the rules

i) Boolean association rule


ii) Quantitative association rule

Based on the dimensions of data involved

i) Single dimensional association rule


ii) Multidimensional association rule

Based on the levels of abstraction involved

i) Multilevel association rule


ii) Single-level association rule

Based on various extensions

i) Correlation analysis
ii) Mining max patterns

24. What is the purpose of apriori algorithm?

Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules; the name of the algorithm is based on the fact that the algorithm user prior
knowledge of frequent item set properties.

25. How to generate association rules from frequent item sets?

For each frequent itemset l, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

26. Give few techniques to improve the efficiency of apriori algorithm.

Hash based technique

Transaction reduction

Portioning

Sampling

Dynamic item counting

27. What are the factors affecting the performance of the Apriori candidate generation technique?

The need to generate a huge number of candidate sets.

The need to repeatedly scan the database and check a large set of candidates by pattern matching.

28. Describe the method of generating frequent item sets without candidate generation.

Frequent –pattern growth adopts divide-and-conquer strategy.

Steps:

Compress the database representing frequent items into a frequent pattern tree or FP
tree. Divide the compressed database into a set of conditional database mine each conditional
database separately.

29. Mention a few approaches to mining multilevel association rules.

Uniform minimum support for all levels

Using reduced minimum support at lower levels, level-by-level independent

Level-cross filtering by single item

Level-cross filtering by K-item set

30. What are multidimensional association rules?

Inter-dimension association rule: a multidimensional association rule with no repeated predicates or dimensions.

Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions.

31. Define constraint-based association mining.

Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following: knowledge type constraints, data constraints, dimension/level constraints, interestingness constraints, and rule constraints.

32. Define the concept of classification.

Two step process

A model is built describing a predefined set of data classes or concepts.

The model is constructed by analyzing database tuples described by attributes. The


model is used for classification.

33. What is decision tree?

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.

Objectives type questions and answers:

1. The objectives of data pre-processing include size reduction of the input space, ______________, data normalization, noise reduction and feature extraction.

(a) Algorithm (b) smoother relationship (c) star schema (d) generalization

Ans: (b) smoother relationship

2. __________ is a random error or variable in a measured variable.

(a) Data (b) file (c) noise (d) techniques

Ans: (c) noise

3. Smoothing by bin medians can be employed, in which each bin value is replaced by
the bin______________.

(a) Median (b) mode (c) boundaries (d) maximum.

Ans: (a) median

4. The smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the ___________.

(a) equal width (b) range (c) binning (d) bin boundaries

Ans: (d) bin boundaries.

5. _________ is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

(a) regression (b) linear (c) multiple linear regression (d) multidimensional

Ans: (c) multiple linear regression

6. A ____________ says that each value of the given attribute must be different from all
other values for that attribute.

(a) process (b) unique rule (c) consecutive (d) null value.

Ans: (b) unique rule.

7. A______ specifies the use of blanks, question marks, special characters or other string
that may indicate the null condition and how such values should be handled.

(a) attribute (b) variable (c) null value (d) number

Ans: (c) null value

8. Data _____________ tools use simple domain knowledge to detect errors and make
correction in the data.

(a) migration (b) auditing (c) scrubbing (d) extraction.

Ans: (c) scrubbing .

9. Data _______ tools find discrepancies by analyzing the data to discover rules and
relationship and detecting data that violate such condition.

(a) Transformation (b) scrubbing (c) auditing (d) loading.

Ans: (c) auditing.

10. The ETL expand ______________-.

(a) Extraction / transformation / loading.

(b) Execute / transformation /loading

(c) Extraction / transport / loading

(d) Extraction / transformation/location

Ans: (a) Extraction / transformation / loading.

11. It is important to keep updating the metadata to reflect _______________.

(a) Integration (b) Concept (c) Knowledge (d) Clearing

Ans: (c) Knowledge

12. Some redundancies can be detected by _____________.

(a) Correlation analysis (b) Correlation coefficient (c) Entity identification (d) Normalization

Ans: (a) Correlation analysis

13. A third important issue in data integration is the detection and resolution of _____________.

(a)Constraints (b) Data value conflicts (c) Currencies (d) heterogeneity

Ans: (b) Data value conflicts

14. In ____________the data are transformed (or) consolidated into forms appropriate for
mining.

(a) Clearing (b) Integration (c) Noisy data (d) Data transformation

Ans: (d) Data transformation

15. Normalization is where the attribute data are scaled so as to fall within a small specified range, such as ___________.

(a)-1.0 to 1.0 (b) 1.0 to -1.0 (c)-1.0 to 0.0 (d) 0.0 to -1.0

Ans: (a)-1.0 to 1.0

16. Data cube__________, where aggregation operations are applied to the data in the
construction of a data cube.

(a)Attribute (b) Aggregation (c) Reduction (d) Dimensionality

Ans: (b) Aggregation

17. A cube at the highest level of abstraction is the ____________.

(a)Apex cuboid (b)Base cuboid (c)Analytical (d)Lattice of cuboids

Ans: (a)Apex cuboid

18. These methods are typically ____________ in that, while searching through attribute space, they always make what looks to be the best choice at the time.
(a)Greedy (b) Subset (c) attributes (d) Elimination

Ans: (a)Greedy

19. A ___________ for an attribute, A, partitions the data distribution of A into disjoint subsets.
(a) Distribution (b) histogram (c) singleton (d) range

Ans: (b) histogram

20. If we consider each child of a parent node as a bucket, then an index tree can be considered as a ____________.

(a) Random histogram (b) sequential histogram (c) hierarchical histogram (d) direct histogram

Ans: (c) hierarchical histogram

Unit- IV

Classification and predictions:

Classifications:

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large.

Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions.

Ex:

We can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict expenditures. Many classification methods have been proposed by researchers in machine learning, pattern recognition and statistics.

These include methods for building decision tree classifiers, Bayesian classifiers, Bayesian belief networks and rule-based classifiers.

Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing and medical diagnosis.

What is classification:

In classification, a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data. In contrast, numeric prediction constructs a model that predicts a continuous-valued function, or ordered value, as opposed to a categorical label.

Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the two terms are often used synonymously.

Classification works:

1. In the first step, a classifier is built describing a predetermined set of data classes. This is the learning step, where a classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels.

A tuple X is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each tuple X is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.

The individual tuples making up the training set are referred to as training
tuples and selected from the data base under analysis. Data tuples can be referred to as
samples, examples, instances data points or objects.

The class label of each training tuple is provided, this step is also known as
supervised learning.
1. Training data are analyzed by a classification algorithm.

2. The class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules.
3. Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

Figure: Training data are analyzed by a classification algorithm. Here, the class label attribute is tenured, and the learned model or classifier is represented in the form of classification rules.

Because the class label of each training tuple is provided, this step is also known as
supervised learning.

Step:2

In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated. A test set is used, made up of test tuples and their associated class labels. They are independent of the training tuples, meaning that they were not used to construct the classifier.

The accuracy of a predictor is estimated by computing an error based on the difference between the predicted value and the actual known value of y for each of the test tuples.

Decision Tree Induction:

A decision tree is a flow chart like tree structure, where each internal node
(non leaf node) denotes a test on an attribute and each branch represents an outcome of the
test and each leaf node holds a class label. The topmost node in a tree is the root node.

Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees, whereas others can produce non-binary trees.

Most algorithms for decision tree induction follow a top-down approach, which starts with a training set of tuples and their associated class labels.

We describe a basic algorithm for learning decision trees. During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes.

Algorithm:

Generate_decision_tree:

Generate a decision tree from the training tuples of data partition D.

Input:

1. Data partition, D, which is a set of training tuples and their associated class labels.
2. attribute_list, the set of candidate attributes.
3. Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.

Output:

A decision tree.

Method:

1. Create a node N;
2. If the tuples in D are all of the same class, C, then
3. Return N as a leaf node labeled with the class C;
4. If attribute_list is empty then
5. Return N as a leaf node labeled with the majority class in D; // majority voting
6. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
7. Label node N with the splitting criterion;
8. If the splitting attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
9. attribute_list ← attribute_list − splitting_attribute; // remove the splitting attribute
10. For each outcome j of the splitting criterion // partition the tuples and grow subtrees for each partition
11. Let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. If Dj is empty then
13. Attach a leaf labeled with the majority class in D to node N;
14. Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N; end for
15. Return N;

The tree starts as a single node, N, representing the training tuples in D.

If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class.

The algorithm calls Attribute_selection_method to determine the splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the best way to separate or partition the tuples in D into individual classes.

The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test.

The node N is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from node N for each of the outcomes of the splitting criterion.

The tuples in D are partitioned accordingly (steps 10 to 11). There are three possible scenarios. Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on the training data.

1. A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the known
values of A. A branch is created for each known value aj of A and labeled with that
value. Partition Dj is the subset of class-labeled tuples in D having value aj of A.

2. A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the
conditions A <= split_point and A > split_point, where split_point is the split point returned by
the attribute selection method as part of the splitting criterion. The tuples are partitioned such
that D1 holds the subset of class-labeled tuples in D for which A <= split_point, while D2 holds
the rest.

3. A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for
A, returned by the attribute selection method as part of the splitting criterion. It is a subset
of the known values of A. Two branches are grown from N by convention. The left
branch out of N is labeled yes, so that D1 corresponds to the subset of class-labeled tuples in D that satisfy the test. The right branch out of N is
labeled no, so that D2 corresponds to the subset of class-labeled tuples from D that do
not satisfy the test.

The algorithm uses the same process recursively to form a decision tree for the
tuples at each resulting partition Dj.

Incremental versions of decision tree induction have also been proposed.


When given new training data, these restructure the decision tree acquired from
learning on previous training data, rather than relearning a new tree from scratch.
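The following is a minimal Python sketch of the recursive tree-growing procedure described above, assuming discrete-valued attributes and multiway splits. The helper name attribute_selection is a placeholder for the "Attribute selection method" (for example, one based on information gain) and must be supplied by the caller; the data layout (a list of (attribute-dict, class-label) pairs) is likewise only an illustrative choice.

from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection):
    """D is a list of (attributes, label) pairs, where attributes is a dict."""
    labels = [label for _, label in D]
    # Steps 2-3: all tuples in D belong to the same class -> return a leaf
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Steps 4-5: attribute list is empty -> leaf labeled with the majority class
    if not attribute_list:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Steps 6-7: let the supplied selection method pick the splitting attribute
    splitting_attribute = attribute_selection(D, attribute_list)
    node = {"test": splitting_attribute, "branches": {}}
    # Step 9: remove the splitting attribute (multiway split on a discrete attribute)
    remaining = [a for a in attribute_list if a != splitting_attribute]
    # Steps 10-14: one branch and one subtree per outcome (known value) of the attribute
    for value in {attrs[splitting_attribute] for attrs, _ in D}:
        Dj = [(attrs, label) for attrs, label in D if attrs[splitting_attribute] == value]
        if not Dj:  # steps 12-13 (cannot occur here, kept only to mirror the pseudocode)
            node["branches"][value] = {"leaf": Counter(labels).most_common(1)[0][0]}
        else:
            node["branches"][value] = generate_decision_tree(Dj, remaining, attribute_selection)
    return node  # step 15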

Attribute selection measures:

An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.

Attribute selection measures are known as splitting rules because they determine how
the tuples at a given node are to be split. The attribute selection measure provides a ranking
for each attribute describing the given training tuples. The attribute having the best score for
the measure is chosen as the splitting attribute for the given tuples.

If the splitting attribute is continuous-valued or if we are restricted to binary trees, then,
respectively, either a split point or a splitting subset must also be determined as part of the
splitting criterion.

The tree node created for partition D is labeled with the splitting criterion, branches
are grown for each outcome of the criterion, and the tuples are partitioned accordingly.

Information Gain:

Information gain is an attribute selection measure. Let node N represent or hold the tuples
from partition D. The attribute with the highest information gain is chosen as the splitting
attribute for node N.

The expected information needed to classify a tuple in D is given by


Info(D) = -Σ (i=1 to m) p_i log2(p_i)

Here ,

p_i is the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
A log function to the base 2 is used because the information is encoded in bits.

Info(D) is just the average amount of information needed to identify the class label of
a tuple in D. Note that, at this point, the information we have is based solely on the
proportions of tuples of each class.

Info(D) is also known as the entropy of D. The information needed (after using A to split D
into v partitions) to classify D is given by

Info_A(D) = Σ (j=1 to v) (|Dj| / |D|) × Info(Dj)

The term |Dj| / |D| acts as the weight of the jth partition. Info_A(D) is the expected
information required to classify a tuple from D based on the partitioning by A. The smaller
the expected information (still) required, the greater the purity of the partitions.

The information gained by branching on attribute A is defined as the difference between the original information requirement (based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A).

That is,

Gain(A) = Info(D) - Info_A(D)

In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A.

The attribute A with the highest information gain, Gain(A), is chosen as the splitting
attribute at node N.

Example: Class P: buys_computer = "yes" (9 tuples)

Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

(5/14) I(2, 3) means that the partition "age <= 30" has 5 out of the 14 samples, with 2 yes and 3 no.

Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Gain (income) = 0.029

Gain (student) = 0.151

Gain (credit rating) = 0.048
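As a quick check of the arithmetic above, the following sketch recomputes Info(D), Info_age(D) and Gain(age) for the 9 "yes" / 5 "no" example; the partition counts (2/3, 4/0, 3/2) are taken directly from the worked example.

import math

def info(counts):
    """Entropy Info(D) of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(info([9, 5]), 3))                       # Info(D) = 0.940

# Info_age(D): weighted entropy of the three age partitions (2 yes/3 no, 4/0, 3/2)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum((sum(p) / 14) * info(p) for p in partitions)
print(round(info_age, 3))                           # 0.694

print(round(info([9, 5]) - info_age, 3))            # Gain(age) = 0.246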

Computing information gain for continuous-valued attributes:


Consider attribute A to be a continuous-valued attribute. The best split point
for A must be determined.

Sort the values of A in increasing order.

The midpoint between each pair of adjacent values is considered as a possible split point.

(ai + ai+1) / 2 is the midpoint between the values ai and ai+1.

The point with the minimum expected information required for A is selected as the
split point for A.

Split Point:

D1 is the set of tuples in D satisfying A <= split_point, and D2 is the set of tuples in D satisfying A > split_point.

The information gain measure is biased towards attributes with a large number of values. The gain ratio overcomes this bias by normalizing the gain with a "split information" value:

GainRatio(A) = Gain(A) / SplitInfo_A(D)


SplitInfo_A(D) = -Σ (j=1 to v) (|Dj| / |D|) × log2(|Dj| / |D|)

SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

gain_ratio(income) = 0.029 / 1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute.
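Continuing the same example, here is a short sketch of the split information and gain ratio computation; it assumes, as in the example, that income splits the 14 tuples into partitions of sizes 4, 6 and 4.

import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the v partitions."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])            # income partitions D into subsets of size 4, 6 and 4
print(round(si, 3))                   # 1.557
print(round(0.029 / si, 3))           # gain_ratio(income) = Gain(income) / SplitInfo = 0.019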

BAYES CLASSIFICATION:

Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian
classification is based on Bayes' theorem.

BAYES THEOREM:

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who
did early work in probability and decision theory during the 18th century.

Total probability theorem:


P(B) = Σ (i=1 to m) P(B | Ai) P(Ai)

Bayes theorem:

P(H | X) = P(X | H) P(H) / P(X)

Let X be a data sample ("evidence") whose class label is unknown.

Let H be a hypothesis that X belongs to class C.

Classification is to determine P(H | X) (i.e., the posterior probability), the probability that the
hypothesis holds given the observed data sample X.

P(H) (prior probability): the initial probability of H.

E.g., the probability that X will buy a computer, regardless of age and income.

P(X): the probability that the sample data X is observed.

P(X | H) (likelihood): the probability of observing the sample X, given that the hypothesis H holds.

E.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income.

Prediction based on bayes theorem:

Given training data X, the posterior probability of a hypothesis H, P(H | X), follows Bayes' theorem:

P(H | X) = P(X | H) P(H) / P(X)

This can be viewed as posterior = likelihood × prior / evidence.

We predict that X belongs to class Ci if the probability P(Ci | X) is the highest among all P(Ck | X) for the k
classes. A limitation is that this requires initial knowledge of many probabilities, involving
significant computational cost.

Naive bayes classifier:

The naive Bayes classifier, or simple Bayesian classifier, works as follows.

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n
measurements made on the tuple from n attributes, respectively A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple X, the classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is,
the naive Bayesian classifier predicts that tuple X belongs to class Ci if and only if

P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i

Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,

P(Ci | X) = P(X | Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize
P(X | Ci).
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X | Ci).

To reduce computation in evaluating P(X | Ci), the naive assumption of class-conditional independence is made. This presumes that the attribute values are conditionally
independent of one another, given the class label of the tuple. Thus,

P(X | Ci) = Π (k=1 to n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
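The following is a minimal sketch of a naive Bayes classifier for categorical attributes, implementing the P(Ci) × Π P(xk | Ci) computation above with simple frequency estimates (no Laplace smoothing). The function names and the (attribute-dict, class-label) data layout are illustrative choices, not part of the original text.

from collections import Counter, defaultdict

def train_naive_bayes(D):
    """D is a list of (attributes, label) pairs; attributes is a dict of categorical values.
    Returns the class priors P(Ci) and the counts needed to estimate P(xk | Ci)."""
    class_counts = Counter(label for _, label in D)
    value_counts = defaultdict(Counter)            # (class, attribute) -> Counter of values
    for attrs, label in D:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    priors = {c: n / len(D) for c, n in class_counts.items()}
    return priors, value_counts, class_counts

def predict(x, priors, value_counts, class_counts):
    """Return the class Ci maximizing P(Ci) * prod_k P(xk | Ci)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= value_counts[(c, attr)][value] / class_counts[c]   # frequency estimate
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Example use (toy data): model = train_naive_bayes(D); predict({"age": "<=30"}, *model)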

Naive bayes classifier:

Advantages:

It is easy to implement.

Good results are obtained in most of the cases.

Limitations:

The assumption of class-conditional independence causes a loss of accuracy.

Practically, dependencies exist among variables.

E.g., patient profile: age, family history, etc.

Symptoms: fever, cough, etc.; disease: lung cancer, diabetes, etc.

Dependencies among these cannot be modelled by the naive Bayes classifier.

Bayesian belief networks:

Bayesian belief networks (also known as Bayesian networks or probabilistic networks) allow class-conditional independencies to be defined between subsets of variables.

Concept :

The naive Bayesian classifier makes the assumption of class-conditional independence, that is, given the class label of a tuple, the values of the attributes are
assumed to be conditionally independent of one another. This simplifies computation.

Bayesian belief networks specify joint conditional probability distributions. They
allow class-conditional independencies to be defined between subsets of the variables. They
provide a graphical model of causal relationships, on which learning can be performed.

A belief network is defined by two components: a directed acyclic graph and a set of conditional probability tables.

Each node in the directed acyclic graph represents a random variable. The variables
may be discrete- or continuous-valued. They may correspond to actual attributes given in the
data or to hidden variables believed to form a relationship.

Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node
Z, then Y is a parent (or immediate predecessor) of Z, and Z is a descendant of Y. Each variable is
conditionally independent of its non-descendants in the graph, given its parents.


Unit-V

Cluster Analysis

Introduction

What is cluster analysis (or data segmentation)?

➢ Clustering is the process of grouping a set of data objects into multiple groups.

➢ Cluster analysis or clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to

one another, yet dissimilar to objects in other clusters. The set of clusters resulting from a
cluster analysis can be referred to as a clustering.

Why should I learn Cluster analysis?

➢ Clustering is useful in that it can lead to the discovery of previously unknown


groups within the data.

Where is clustering used?

Cluster analysis has been widely used in many applications such as,

➢ business intelligence,

➢ image pattern recognition,

➢ Web search,

➢ Biology, and security.

5.1 Typical requirements of clustering in data mining:

1. Scalability – Clustering algorithms should work for huge databases

2. Ability to deal with different types of attributes – Clustering algorithms should work not
only for numeric data, but also for other data types.

3. Discovery of clusters with arbitrary shape – Clustering algorithms (based on distance


measures) should work for clusters of any shape.

4. Minimal requirements for domain knowledge to determine input parameters –


Clustering results are sensitive to input parameters to a clustering algorithm (example–
number of desired clusters). Determining the value of these parameters is difficult and
requires some domain knowledge.

5. Ability to deal with noisy data – Outlier, missing, unknown and erroneous data detected
by a clustering algorithm may lead to clusters of poor quality.

6. Insensitivity to the order of input records – Clustering algorithms should produce the same
results even if the order of the input records is changed.

7. High dimensionality – Data in high dimensional space can be sparse and highly skewed,
hence it is challenging for a clustering algorithm to cluster data objects in high dimensional
space.

8. Constraint-based clustering – In Real world scenario, clusters are performed based on


various constraints. It is a challenging task to find groups of data with good clustering
behaviour and satisfying various constraints.

9. Interpretability and usability – Clustering results should be interpretable,
comprehensible and usable. So we should study how an application goal may influence the
selection of clustering methods.

5.2 TYPES OF DATA

Numerical Data

• Examples include weight, marks, height, price, salary, and count.

• There are a number of methods for computing similarity between these data.

E.g. Euclidean distance, Manhattan distance.
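As a small sketch of the two distance measures mentioned above, for numeric vectors of equal length:

import math

def euclidean(x, y):
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance between two numeric vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean([1, 2], [4, 6]))   # 5.0
print(manhattan([1, 2], [4, 6]))   # 7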

Binary Data

• Examples include gender, marital status.

• A simple method involves counting how many attribute values of two objects differ among the n attributes and using this as an indication of distance.

Qualitative Nominal Data

• This is similar to binary data which may take more than 2 values but has no natural
order.

Examples include religion, foods or colors.

Qualitative Ranked Data

• This is similar to qualitative nominal data except that data has an order associated
with it.

• Examples include: 1) grades A, B, C, and D 2) sizes S, M, L and XL.

• One method of computing distance involves transforming the values to numeric
values according to their rank. For example, grades A, B, C, D could be transformed
to 4.0, 3.0, 2.0 and 1.0.

5.3 Major Clustering Methods:

➢ Partitioning Methods

➢ Hierarchical Methods

➢ Density-Based Methods

➢ Grid-Based Methods

➢ Model-Based Methods

Partitioning Methods

The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the problem
specification concise, we can assume that the number of clusters is given as background
knowledge. This parameter is the starting point for partitioning methods.

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition
represents a cluster. The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects within a cluster are
"similar" to one another and "dissimilar" to objects in other clusters in terms of the data set
attributes. Commonly used partitioning methods are k-means and k-medoids.

K-Means: A Centroid-Based Technique

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent


that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be
defined in various ways, such as by the mean or medoid of the objects (or points) assigned to
the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster,
is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and
y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum
of squared error between all objects in Ci and the centroid ci, defined as

E = Σ (i=1 to k) Σ (p ∈ Ci) dist(p, ci)^2

“How does the k-means algorithm work”?

The k-means algorithm defines the centroid of a cluster as the mean value
of the points within the cluster. It proceeds as follows. First, it randomly selects k of the
objects in D, each of which initially represents a cluster mean or center. For each of the
remaining objects, an object is assigned to the cluster to which it is the most similar, based on
the Euclidean distance between the object and the cluster mean. The k-means algorithm then
iteratively improves the within-cluster variation. For each cluster, it computes the new mean
using the objects assigned to the cluster in the previous iteration. All the objects are then
reassigned using the updated means as the new cluster centers. The iterations continue until
the assignment is stable, that is, the clusters formed in the current round are the same as those
formed in the previous round.

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

(1) arbitrarily choose k objects from D as the initial cluster centers;

(2) repeat

(3) (re)assign each object to the cluster to which the object is the most similar,based on the
mean value of the objects in the cluster;

(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;

(5) until no change;
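Below is a minimal sketch of the k-means loop just described, for points given as coordinate tuples and using Euclidean distance; the random choice of initial centers and the stopping test ("no change") follow steps (1)-(5) above. Function and parameter names are illustrative.

import math
import random

def kmeans(points, k, seed=0):
    """points is a list of coordinate tuples; returns k clusters as lists of points."""
    random.seed(seed)
    centers = random.sample(points, k)                       # step (1): arbitrary initial centers
    while True:
        # step (3): (re)assign each object to the cluster with the closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # step (4): update each cluster mean
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                           # step (5): until no change
            return clusters
        centers = new_centers

# Example: kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2) groups the two pairs of nearby points.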


5.4 Model-Based Clustering Methods

Model-based clustering methods attempt to optimize the fit between the
given data and some mathematical model. Such methods are often based on the assumption
that the data are generated by a mixture of underlying probability distributions.
Unlike conventional clustering, which identifies groups of objects, model-based clustering
methods also find characteristic descriptions for each group, where each group represents a
concept or class.

The most frequently used induction methods are:

5.4.1 Decision Trees:

In decision trees, the data is represented by a hierarchical tree, where each


leaf refers to a concept and contains a probabilistic description of that concept. Several
algorithms produce classification trees for representing the unlabelled data.

The most well-known algorithms are:

COBWEB:

This algorithm assumes that all attributes are independent (an often too
naive assumption). Its aim is to achieve high predictability of nominal variable values, given
a cluster. This algorithm is not suitable for clustering large database data (Fisher, 1987).

CLASSIT:

An extension of COBWEB for continuous-valued data; unfortunately, it has similar problems to the COBWEB algorithm.

5.4.2 Neural Networks:

This type of algorithm represents each cluster by a neuron or "prototype".

The input data is also represented by neurons, which are connected to the prototype neurons.
Each such connection has a weight, which is adapted during learning. A very
popular neural algorithm for clustering is the self-organizing map (SOM). This algorithm
constructs a single-layered network.

o The prototype neurons compete for the current instance.

o The winner is the neuron whose weight vector is closest to the instance currently presented.

o The winner and its neighbours learn by having their weights adjusted.

The SOM algorithm is successfully used for vector quantization and speech recognition.

o It is useful for visualizing high-dimensional data in 2D or 3D space.

o However, it is sensitive to the initial selection of weight vector, as well


as to its different parameters, such as the learning rate and neighbourhood radius.

5.4 Outlier analysis

➢ An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.

➢ Outliers are different from noisy data. Outliers fall into three categories:

➢ global outliers,

➢ contextual (or conditional) outliers,

➢ collective outliers

5.4.1 Types of outlier:

Outliers can be classified into three categories, namely

Global outlier — Object significantly deviates from the rest of the data set

Contextual outlier — Object deviates significantly based on a selected context.

Collective outlier — A subset of data objects collectively deviate significantly from


the whole data set, even if the individual data objects may not be outliers.

5.4.2 Outlier Detection Methods

Supervised Methods- Supervised methods model data normality and abnormality. Domain
experts examine and label a sample of the underlying data. Outlier detection can then be
modeled as a classification problem.

Unsupervised- Unsupervised outlier detection methods make an implicit assumption: The


normal objects are somewhat “clustered.” In other words, an unsupervised outlier detection
method expects that normal objects follow a pattern far more frequently than outliers. Normal
objects do not have to fall into one group sharing high similarity.

Semi-supervised- Semi supervised outlier detection methods can be regarded as applications


of semi supervised learning methods. For example, when some labeled normal objects are
available, we can use them, together with unlabeled objects that are close by, to train a model
for normal objects.
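As a concrete illustration of the unsupervised idea (normal objects follow a pattern far more frequently than outliers), the sketch below flags global outliers as values lying more than a chosen number of standard deviations from the mean. This is just one simple statistical approach, not the only detection method; the data and threshold are illustrative.

import statistics

def global_outliers(values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

data = [10, 11, 9, 10, 12, 11, 10, 95]            # 95 deviates strongly from the rest
print(global_outliers(data, threshold=2.0))       # [95]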

5.5 WEB MINING

Web-mining is the application of data-mining techniques to extract


knowledge from web-data. (i.e. web-content, web-structure, and web-usage data). We interact
with the web for the following purposes:

1) Finding Relevant Information

• We use the search-engine to find specific information on the web.

• Query triggered process: We specify a simple keyword-query and the response from a search-engine is a list of pages, ranked by their similarity to the query.

• Search tools have the following problems:

i) Low precision: This is due to the irrelevance of many of the search-


results. We may get many pages of information which are not really relevant to our query.

ii) Low recall: This is due to the inability to index all the information available
on the web, because some of the relevant pages are not properly indexed.

2) Discovering New Knowledge from the Web

→ Data triggered process: This assumes that

→ we already have a collection of web-data and

→ we want to extract potentially useful knowledge out of it

3) Personalized Web Page Synthesis

• We may wish to synthesize a web-page for different individuals from the


available set of web-pages.

• While interacting with the web, individuals have their own preferences
for the style of the

content and presentation.

4) Learning about Individual Users

→ This is about knowing

→what the customers do and

→what the customers want

• Within this problem, there are sub-problems such as→ problems related to effective web-
site design and management

→ problems related to marketing etc


→ Techniques from web-mining can be used to solve these problems.

Other related techniques from different research areas, such as DB(database), IR(information
retrieval) & NLP(natural language processing), can also be used.

Web-mining has 3 main operations:

→ Clustering (e.g. finding natural groupings of users, pages)

→ Associations (e.g. which URLs tend to be requested together)

→ Sequential analysis (e.g. the order in which URLs tend to be accessed)

• Web mining techniques can be classified into 3 areas of interest

→ web-content mining (e.g. text, image, records, etc.)

→ web-structure mining (e.g. hyperlinks, tags, etc)

→ web-usage mining (e.g. http logs, app server logs, etc.)

5.6 WEB CONTENT MINING

• This is the process of extracting useful information from the contents of web-documents.

• We see more & more government information are gradually being placed on the web in
recent years.

• We have

→ Digital libraries which users can access from the web

→ web-applications which users can access through web-interfaces

• Some of the web-data are hidden-data, and some are generated dynamically as a result of
queries and reside in the DBMSs.

• The web-content consists of different types of data such as text, image, audio, video as well
as

hyperlinks.

• Most of the research on web-mining is focused on the text or hypertext contents.

• The textual-parts of web-data consist of

→ unstructured-data such as free texts

→ semi structured-data such as HTML documents &

→ structured-data such as data in the tables

• Much of the web-data is unstructured, free text-data. As a result, text-mining techniques can
be directly employed for web-mining.

• Issues addressed in text mining are, topic discovery, extracting association patterns,
clustering of web documents and classification of Web Pages.

• Research activities on this topic have drawn heavily on techniques developed in other
disciplines such as IR (Information Retrieval) and NLP (Natural Language Processing).

5.7 WEB USAGE MINING

• This deals with studying the data generated by the web-surfer's sessions (or behaviours).

• Web-content/structure mining utilise the primary-data (or real) on the web.

On the other hand, web-usage mining extracts the secondary-data derived from the
interactions of the users with the web.

• The secondary-data includes the data from

→ web-server access logs → browser logs

→ user transactions/queries → user profiles

→ registration/bookmark data → cookies

• There are 2 main approaches in web-usage mining:

1) General Access Pattern Tracking

• This can be used to learn user-navigation patterns.

• This can be used to analyze the web-logs to understand access-patterns and trends.

• This can shed better light on the structure & grouping of resource providers.

2) Customized Usage Tracking

• This can be used to learn a user-profile in adaptive interfaces (personalized).

• This can be used to analyze individual trends.

• Main purpose: is to customize web-sites to users.

• Based on user access-patterns, following things can be dynamically customized for each
user over time:

→ information displayed

→ depth of site-structure

→ format of resources

• The mining techniques can be classified into 2 commonly used approaches:

1) The first approach maps the usage-data of the web-server into relational-tables before a
traditional data-mining technique is applied.

2) The second approach uses the log-data directly by utilizing special pre-processing
techniques.

5.8 WEB STRUCTURE MINING

• The structure of a typical web-graph consists of web-pages as nodes, and hyperlinks as


edges connecting related pages.

• Web-Structure mining is the process of discovering structure information from the web.

• This type of mining can be performed either at the (intra-page) document level or at the
(inter- page) hyperlink level.

• This can be used to classify web-pages.

• This can be used to generate information such as the similarity & relationship between
different web-sites.

PageRank

• PageRank is a metric for ranking hypertext documents based on their quality.

• The key idea is that a page has a high rank if it is pointed to by many highly ranked pages.
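A minimal power-iteration sketch of the PageRank idea follows, where each page passes a share of its rank to the pages it links to; the damping factor 0.85 is the commonly used value, the tiny graph is hypothetical, and dangling pages are ignored for simplicity.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to; returns page -> rank."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for p, outlinks in links.items():
            if not outlinks:
                continue                                  # dangling page: rank not redistributed
            share = damping * rank[p] / len(outlinks)
            for q in outlinks:
                new_rank[q] += share                      # p endorses every page it points to
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}         # a tiny hypothetical link graph
print(pagerank(graph))                                    # C collects the most endorsements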

Clustering & Determining Similar Pages

• For determining the collection of similar pages, we need to define the similarity measure
between the pages. There are 2 basic similarity functions:

1) Co-citation: For a pair of nodes p and q, the co-citation is the number of nodes that point
to both p and q.

2) Bibliographic coupling: For a pair of nodes p and q, the bibliographic coupling is equal to
the number of nodes that have links from both p and q.
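The two similarity functions can be sketched directly from their definitions, given a dictionary mapping each node to the nodes it links to (the toy graph below is hypothetical):

def co_citation(links, p, q):
    """Number of nodes that point to both p and q."""
    return sum(1 for node, out in links.items() if p in out and q in out)

def bibliographic_coupling(links, p, q):
    """Number of nodes that receive links from both p and q."""
    return len(set(links.get(p, [])) & set(links.get(q, [])))

graph = {"A": ["C", "D"], "B": ["C", "D"], "C": [], "D": []}
print(co_citation(graph, "C", "D"))              # 2 -- both A and B point to C and D
print(bibliographic_coupling(graph, "A", "B"))   # 2 -- A and B both link to C and D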

Social Network Analysis

• This can be used to measure the relative standing or importance of individuals in a network.

• The basic idea is that if a web-page points a link to another web-page, then the former is, in
some sense, endorsing the importance of the latter.

• Links in the network may have different weights, corresponding to the strength of
endorsement.

5.9 SPATIAL DATA MINING

• This refers to the extraction of knowledge, spatial relationships, or other interesting patterns
not explicitly stored in spatial-databases.

• Consider a map of the city of Mysore containing various natural and man-made geographic
features, and clusters of points (where each point marks the location of a particular house).

• The houses might be important because of their size, or their current market value.

• Clustering algorithms can be used to assign each point to exactly one cluster, with the
number of clusters being defined by the user.

• We can mine varieties of information by identifying likely relationships.

• For ex, "the land-value of cluster of residential area around ‘Mysore Palace’ is high".

• Such information could be of value to realtors, investors, or prospective home buyers.

• This problem is not so simple because there may be a large number of features to consider.

• We need to be able to detect relationships among large numbers of geo-referenced objects


without incurring significant overheads.

5.9.1 SPATIAL MINING TASKS

• This includes:

o → finding characteristic rules

o → discriminant rules

o → association rules

• A spatial-characteristic rule is a general description of spatial-data. For


example, a rule describing the general price ranges of houses in various geographic regions in
a city.

• A spatial-discriminant rule is a general description of the features


discriminating a class of spatial-data from other classes, For example, the comparison of
price range of houses in different geographical regions.


5.9.2 SPATIAL CLUSTERING

• The key idea of a density based cluster is that for each point of a cluster, its epsilon
neighbourhood has to contain at least a minimum number of points.

• We can generalize this concept in 2 different ways:

• First, any other symmetric & reflexive neighbourhood relationship can be used instead of an
epsilon neighbourhood. It may be more appropriate to use topological relations such as
intersects, meets or above/below to group spatially extended objects.

• Second, instead of simply counting the objects in a neighbourhood of an object, other

measures to define the "cardinality" of that neighbourhood can be used as well.
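A minimal sketch of the epsilon-neighbourhood test at the heart of density-based clustering: a point is a core point if its epsilon-neighbourhood contains at least a minimum number of points. The eps and min_pts values below are illustrative.

import math

def epsilon_neighbourhood(points, p, eps):
    """All points within distance eps of p (including p itself)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """True if the eps-neighbourhood of p contains at least min_pts points."""
    return len(epsilon_neighbourhood(points, p, eps)) >= min_pts

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(is_core_point(pts, (0, 0), eps=1.5, min_pts=3))    # True  -- dense region
print(is_core_point(pts, (10, 10), eps=1.5, min_pts=3))  # False -- isolated point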


Spatial Characterization

A spatial characterization is a description of the spatial and non-spatial properties which are typical for the target objects but not for the whole database.

For instance, different object types in a geographic database are mountains, lakes, highways, railroads, etc.

Spatial characterization considers both

→ properties of the target objects and → properties of their neighbours

A spatial characterization rule of the form "Apartments in Sainikpur have a high occupancy rate of retired army officers" is an example.
5.10 Sequence
A sequence database consists of sequences of ordered elements or events, recorded with or
without a concrete notion of time. There are many applications involving sequence data.
Typical examples include customer shopping sequences, Web clickstreams, biological
sequences, sequences of events in science and engineering, and in natural and social
developments. In this section, we study sequential pattern mining in transactional databases.
“What is sequential pattern mining?”
Sequential pattern mining is the mining of frequently occurring ordered events or
subsequences as patterns. An example of a sequential pattern is “Customers who buy a Canon
digital camera are likely to buy an HP color printer within a month.” For retail data,
sequential patterns are useful for shelf placement and promotions. This industry, as well as
telecommunications and other businesses, may also use sequential patterns for targeted
marketing, customer retention, and many other tasks.
5.11 Time-Series
“What is a time-series database?”
A time-series database consists of sequences of values or events obtained over repeated
measurements of time.
The values are typically measured at equal time intervals (e.g., hourly, daily, weekly). Time-series
databases are popular in many applications, such as stock market analysis, economic
and sales forecasting, budgetary analysis, utility studies, inventory studies, yield projections,
workload projections,

process and quality control, observation of natural phenomena (such as atmosphere,


temperature, wind, earthquake), scientific and engineering experiments, and medical
treatments.

A time-series database is also a sequence database. However, a sequence database is any
database that consists of sequences of ordered events, with or without concrete notions of
time.

For example, Web page traversal sequences and customer shopping transaction sequences are
sequence data, but they may not be time-series data.

Two mark Questions:

1. What is clustering?

Clustering is the process of grouping a set of data objects into multiple groups.

2. What are the applications available in cluster analysis?

1. business intelligence,

2. image pattern recognition,

3. Web search,

4. Biology and security.

3. Write any Five requirements of clustering in data mining?

1. Scalability.

2. Ability to deal with different types of attributes

3. Discovery of clusters with arbitrary shape

4. Minimal requirements for domain knowledge to determine input parameters.

5. Ability to deal with noisy data.

4. How many data types are available in clustering?

Numerical Data

Binary Data

Qualitative Nominal Data

Qualitative Ranked Data

5. Types of clustering methods.

1. Partitioning Methods

2. Hierarchical Methods

3. Density-Based Methods

4. Grid-Based Methods

5. Model-Based Methods


6. What are supervised methods?

Supervised methods model data normality and abnormality. Domain


experts examine and label a sample of the underlying data. Outlier detection can then be
modelled as a classification problem.

7. What is web mining?

Web-mining is the application of data-mining techniques to extract


knowledge from web-data. (i.e. web-content, web-structure, and web-usage data).

9. Short Note: Page Rank.

• Page Rank is a metric for ranking hypertext documents based on their quality.

• The key idea is that a page has a high rank if it is pointed to by many highly ranked pages.

10. What is a time-series database?

A time-series database consists of sequences of values or events obtained


over repeated measurements of time.

11. What is Sequence?

A sequence database consists of sequences of ordered elements or events, recorded with or


without a concrete notion of time.

Five mark questions:

1. Explain the partitioning method in clustering.

2. Explain web mining.

3. Explain outlier analysis.

Ten marks Questions:

1. Explain spatial mining.

2. Explain web content mining.

3. Explain the k-means algorithm.

