DWM Course

This document provides a summary of a syllabus for a course on Data Warehousing and Data Mining. It includes the course objectives, an introduction to the topics that will be covered in each of the 8 units, including data preprocessing, data warehouses, data mining primitives and languages, concept description, association rule mining, classification and prediction, cluster analysis, and mining complex data types. It lists the required textbooks and references. The document was prepared by an Associate Professor in the Department of Information Technology at Geethanjali College of Engineering and Technology.

Geethanjali College of Engineering and Technology

DEPARTMENT OF INFORMATION TECHNOLOGY


Name of the Subject/Lab Course: Operating Systems
JNTU Code: 550311
Programme: UG/PG
Branch: IT
Version No: 1 *
Year: III
Document Number: GCET/IT/304 **
Semester: 2
No. of Pages: 90
Classification status (Unrestricted/Restricted):
Distribution List:

Prepared by:
1) Name: Y. RAJU
2) Sign:
3) Designation: ASSOC. PROF.
4) Date:

Updated by:
1) Name:
2) Sign:
3) Designation:
4) Date:

Verified by:
1) Name:
2) Sign:
3) Designation:
4) Date:

Approved by (HOD):
1) Name:
2) Sign:
3) Date:

* For Q.C. only. If it is prepared for the first time, 1; if it is updated, 2 or 3.
** GCET/Dept./3 indicates 3rd year; 04 indicates fourth in the list of the JNTU Syllabus book.

SYLLABUS
UNIT-I
INTRODUCTION: Fundamentals of data mining, Data Mining Functionalities,
Classification of Data Mining systems, Major issues in Data Mining.
Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and
Transformation, Data Reduction, Discretization and Concept Hierarchy Generation.

UNIT-II
DATA WAREHOUSE AND OLAP TECHNOLOGY FOR DATA MINING: Data Warehouse, Multidimensional
Data Model, Data Warehouse Architecture, Data Warehouse Implementation,
Further Development of Data Cube Technology, From Data Warehousing to
Data Mining.

UNIT-III
DATA MINING PRIMITIVES, LANGUAGES AND SYSTEM ARCHITECTURES: Data
Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based
on a Data Mining Query Language, Architectures of Data Mining Systems.

UNIT-IV
CONCEPT DESCRIPTION: Characterization and Comparison: Data Generalization and
Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute
Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining
Descriptive Statistical Measures in Large Databases.

UNIT-V
MINING ASSOCIATION RULES IN LARGE DATABASES: Association Rule Mining,
Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining
Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association
Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation
Analysis, Constraint-Based Association Mining.

UNIT-VI
CLASSIFICATION AND PREDICTION: Issues Regarding Classification and Prediction,
Classification by Decision Tree Induction, Bayesian Classification, Classification by
Backpropagation, Classification Based on Concepts from Association Rule Mining, Other
Classification Methods, Prediction, Classifier Accuracy.

UNIT-VII
CLUSTER ANALYSIS INTRODUCTION: Types of Data in Cluster Analysis, A
Categorization of Major Clustering Methods, Partitioning Methods, Density-Based Methods,
Grid-Based Methods, Model-Based Clustering Methods, Outlier Analysis.

UNIT-VIII
MINING COMPLEX TYPES OF DATA: Multidimensional Analysis and Descriptive Mining of
Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining
Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web.

TEXT BOOKS :
1. Data Mining: Concepts and Techniques - JIAWEI HAN & MICHELINE KAMBER, Harcourt India.
REFERENCES :
1. Data Mining: Introductory and Advanced Topics - MARGARET H DUNHAM, PEARSON EDUCATION.
2. Data Mining Techniques - ARUN K PUJARI, University Press.
3. Data Warehousing in the Real World - SAM ANAHORY & DENNIS MURRAY, Pearson Edn Asia.
4. Data Warehousing Fundamentals - PAULRAJ PONNAIAH, WILEY STUDENT EDITION.
5. The Data Warehouse Lifecycle Toolkit - RALPH KIMBALL, WILEY STUDENT EDITION.
For more details, visit Http://www.jntu.

GEETHANJALI COLLEGE OF ENGINEERING & TECHNOLOGY


CHEERYAL (V) KEESARA (M) RR District.
Department of: IT
Year and Semester to Whom Subject is Offered: III BTech, IISem
Name of the Subject: Data Warehousing and Data Mining
Name of the Faculty:Y.RAJU

Designation: Asso. Professor

Department: IT

1.1. Introduction to the subject:

Data mining, the extraction of hidden predictive information from large databases, is
a powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. Data mining tools predict future trends and
behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision support systems. Data mining tools
can answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may miss
because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining
techniques can be implemented rapidly on existing software and hardware platforms to enhance
the value of existing information resources, and can be integrated with new products and systems
as they are brought on-line. When implemented on high performance client/server or parallel
processing computers, data mining tools can analyze massive databases to deliver answers to
questions such as, "Which clients are most likely to respond to my next promotional mailing, and
why?"
This white paper provides an introduction to the basic technologies of data mining.
Examples of profitable applications illustrate its relevance to today's business environment,
and a basic description shows how data warehouse architectures can evolve to deliver the
value of data mining to end users.

1.2.Objectives of the subject


Improve Quality of Data
Since a common DSS deficiency is "dirty data," it is almost guaranteed that you will have
to address the quality of your data during every data warehouse iteration. Data cleansing is a
sticky problem in data warehousing. On one hand, a data warehouse is supposed to provide
clean, integrated, consistent and reconciled data from multiple sources. On the other hand, we are
faced with a development schedule of 6-12 months. It is almost impossible to achieve both
without making some compromises. The difficulty lies in determining what compromises to
make. Here are some guidelines for determining your specific goal to cleanse your source data:

Never try to cleanse ALL the data. Everyone would like to have all the data perfectly
clean, but nobody is willing to pay for the cleansing or to wait for it to get done. To clean it all
would simply take too long. The time and cost involved often exceed the benefit.
Never cleanse NOTHING. In other words, always plan to clean something. After all,
one of the reasons for building the data warehouse is to provide cleaner and more reliable data
than you have in your existing OLTP or DSS systems.
Determine the benefits of having clean data. Examine the reasons for building the data
warehouse:

Do you have inconsistent reports?

What is the cause for these inconsistencies?

Is the cause dirty data or is it programming errors?

What dollars are lost due to dirty data?

Which data is dirty?

Determine the cost for cleansing the data. Before you make cleansing all the dirty data
your goal, you must determine the cleansing cost for each dirty data element. Examine how long
it would take to perform the following tasks:

Analyze the data

Determine the correct data values and correction algorithms

Write the data cleansing programs

Correct the old files and databases (if appropriate)

Compare cost for cleansing to dollars lost by leaving it dirty. Everything in business
must be cost-justified. This applies to data cleansing as well. For each data element, compare the
cost for cleansing it to the business loss being incurred by leaving it dirty and decide whether to
include it in your data cleansing goal. If the dollars lost exceed the cost of cleansing, put the data on
the "to be cleansed" list. If the cost for cleansing exceeds the dollars lost, do not put the data on the
"to be cleansed" list.
Prioritize the dirty data you considered for your data cleansing goal. A difficult part
of compromising is balancing the time you have for the project with the goals you are trying to
achieve. Even though you may have been cautious in selecting dirty data for your cleansing goal,
you may still have too much dirty data on your "to be cleansed" list. Prioritize your list.
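
The cost comparison and prioritization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names and dollar figures are hypothetical, and ranking by net benefit (dollars lost minus cleansing cost) is one possible way to prioritize, not a rule given in these notes.

```python
from dataclasses import dataclass


@dataclass
class DirtyElement:
    name: str               # hypothetical data element name
    cleansing_cost: float   # estimated cost to cleanse it
    dollars_lost: float     # estimated business loss if it stays dirty


def build_cleansing_list(elements):
    """Keep elements whose loss exceeds the cleansing cost,
    ordered by net benefit (assumed priority rule), highest first."""
    selected = [e for e in elements if e.dollars_lost > e.cleansing_cost]
    return sorted(selected,
                  key=lambda e: e.dollars_lost - e.cleansing_cost,
                  reverse=True)


# Hypothetical figures, for illustration only.
elements = [
    DirtyElement("customer_address", 20_000, 75_000),
    DirtyElement("product_code", 50_000, 10_000),
    DirtyElement("order_date", 5_000, 30_000),
]

to_be_cleansed = build_cleansing_list(elements)
print([e.name for e in to_be_cleansed])
```

Here product_code stays off the list because cleansing it would cost more than the loss it causes.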
For each prioritized dirty data item ask: Can it be cleansed? You may have to do
some research to find out whether the "good data" still exists anywhere. Places to search could be
other files and databases, old documentation, manual file folders and even desk drawers.
Sometimes the data values are so convoluted that to write the transformation logic, you may have
to find some "old-timers" who still remember what all the data values meant. Then there will be
times when, after several days of research, you find out that you couldn't cleanse a data element
even if you wanted to; and you have to remove the item from your cleansing goal.
As you document your data cleansing goal, you want to include the following
information:

The degree of current "dirtiness" (either by percentage or number of records)

The dollars lost due to its "dirtiness"

The cost for cleansing it

The degree of "cleanliness" you want to achieve (either by percentage or number of records)

Minimize Inconsistent Reports


Addressing another common complaint about current DSS environments, namely
inconsistent reports, will most likely become one of your data warehouse goals. Inconsistent
reports are mainly caused by misuse of data, and the primary reason for misuse of data is
disagreement or misunderstanding of the meaning or the content of data. Correcting this problem
is another predicament in data warehousing, because it requires the interested business units to
resolve their disagreements or misunderstandings. This type of effort has more than once
torpedoed a data warehouse project because it took too long to resolve the disputes. Ignoring the
issue is not a solution either. We suggest the following guidelines:

1.3. JNTU Syllabus with Additional Topics

S.No | Unit No | Topic | Additional Topics
1 | I | Introduction: Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining systems, Major issues in Data Mining. Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation | -
2 | II | Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining | -
3 | III | Data Mining Primitives, Languages, and System Architectures: Data Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based on a Data Mining Query Language, Architectures of Data Mining Systems | Testing methods, Black box testing
4 | IV | Concepts Description: Characterization and Comparison, Data Generalization and Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining Descriptive Statistical Measures in Large Databases | -
5 | V | Mining Association Rules in Large Databases: Association Rule Mining, Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation Analysis, Constraint-Based Association Mining | -
6 | VI | Classification and Prediction: Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Classification by Backpropagation, Classification Based on Concepts from Association Rule Mining, Other Classification Methods, Prediction, Classifier Accuracy | -
7 | VII | Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Grid-Based Methods, Model-Based Clustering Methods, Density-Based Methods, Outlier Analysis | -
8 | VIII | Mining Complex Types of Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web | -

1.4. Sources of Information

1.4.1. Text books:
1. Data Mining: Concepts and Techniques - JIAWEI HAN & MICHELINE KAMBER, Harcourt India.

1.4.2. Reference Text Books:
1. Data Mining: Introductory and Advanced Topics - MARGARET H DUNHAM, PEARSON EDUCATION.
2. Data Mining Techniques - ARUN K PUJARI, University Press.
3. Data Warehousing in the Real World - SAM ANAHORY & DENNIS MURRAY, Pearson Edn Asia.
4. Data Warehousing Fundamentals - PAULRAJ PONNAIAH, WILEY STUDENT EDITION.
5. The Data Warehouse Lifecycle Toolkit - RALPH KIMBALL, WILEY STUDENT EDITION.

1.4.3. Websites: Http://www.jntu.ac.in/

1.4.4. Journals:

1.5. Unit wise Summary

S.No | Unit No | Topic | Additional Topics
1 | 1 | Introduction: Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining systems, Major issues in Data Mining. Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation | -
2 | 2 | Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining | QTP
3 | 3 | Data Mining Primitives, Languages, and System Architectures: Data Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based on a Data Mining Query Language, Architectures of Data Mining Systems | -
4 | 4 | Concepts Description: Characterization and Comparison: Data Generalization and Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining Descriptive Statistical Measures in Large Databases | Silk Testing
5 | 5 | Mining Association Rules in Large Databases: Association Rule Mining, Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation Analysis, Constraint-Based Association Mining | -
6 | 6 | Classification and Prediction: Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Classification by Backpropagation, Classification Based on Concepts from Association Rule Mining, Other Classification Methods, Prediction, Classifier Accuracy | KVCHART APPLICATION
7 | 7 | Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Outlier Analysis | Automation Techniques
8 | 8 | Mining Complex Types of Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web | Agel model

1.6. Micro Plan

S.No | Unit No | Topics to be covered | Reg/Additional | Teaching aids used (LCD/OHP/BB)
- | 1 | Introduction: Fundamentals of data mining | Regular | OHP, BB
- | - | Data Mining Functionalities | Regular | OHP, BB
- | - | Classification of Data Mining systems | Regular | OHP, BB
- | - | Major issues in Data Mining | Regular | OHP, BB
- | - | Data Preprocessing: Need for Preprocessing the Data | Regular | BB
- | - | Data Cleaning, Data Integration and Transformation | - | -
- | 2 | Data Warehouse and OLAP Technology for Data Mining | Regular | OHP, BB
- | - | Data Warehouse | Regular | BB
- | - | Multidimensional Data Model | Regular | OHP, BB
- | - | Data Warehouse Architecture | Regular | BB
10 | - | Data Warehouse Implementation | Regular | BB
11 | - | Further Development of Data Cube Technology | Regular | OHP, BB
- | - | From Data Warehousing to Data Mining | Regular | BB
12 | 3 | Data Mining Primitives, Languages, and System Architectures | Regular | OHP, BB
13 | - | Data Mining Primitives | Regular | BB
14 | - | Data Mining Query Languages | Regular | BB
15 | - | Designing Graphical User Interfaces Based on a Data Mining Query Language | Regular | OHP, BB
16 | - | Architectures of Data Mining Systems | Regular | BB
17 | 4 | Concepts Description | Regular | BB
18 | - | Characterization and Comparison | Regular | BB
19 | - | Data Generalization and Summarization-Based Characterization | Regular | BB
20 | - | Analytical Characterization: Analysis of Attribute Relevance | Regular | BB
21 | - | Mining Class Comparisons: Discriminating between Different Classes | Regular | BB
22 | - | Mining Descriptive Statistical Measures in Large Databases | - | -
23 | 5 | Mining Association Rules in Large Databases | Regular | BB
24 | - | Association Rule Mining | Regular | BB
25 | - | Mining Single-Dimensional Boolean Association Rules from Transactional Databases | Regular | BB
26 | - | Mining Multilevel Association Rules from Transaction Databases | Regular | BB
27 | - | Mining Multidimensional Association Rules from Relational Databases and Data Warehouses; From Association Mining to Correlation Analysis; Constraint-Based Association Mining | Regular | BB
28 | 6 | Classification and Prediction | Regular | OHP, BB
29 | - | Issues Regarding Classification and Prediction | Regular | BB
30 | - | Classification by Decision Tree Induction | Regular | BB
31 | - | Bayesian Classification | Regular | OHP
32 | - | Classification by Backpropagation; Classification Based on Concepts from Association Rule Mining | Regular | OHP, BB
- | - | Other Classification Methods, Prediction, Classifier Accuracy | Regular | OHP, BB
33 | 7 | Cluster Analysis Introduction | Regular | BB
34 | - | Types of Data in Cluster Analysis | Regular | BB
35 | - | Categorization of Major Clustering Methods | Regular | OHP, BB
36 | - | Partitioning Methods | Regular | BB
37 | 8 | Mining Complex Types of Data | Regular | OHP, BB
38 | - | Multidimensional Analysis and Descriptive Mining of Complex Data Objects | Regular | BB
39 | - | Mining Spatial Databases | Regular | LCD, OHP, BB
40 | - | Mining Multimedia Databases | Regular | OHP, BB
41 | - | Mining Time-Series and Sequence Data | Regular | BB
42 | - | Mining Text Databases | Regular | OHP, BB
43 | - | Mining the World Wide Web | Regular | BB

1.7. Subject Contents

1.7.1. Synopsis page for each period (62 pages)
1.7.2. Detailed Lecture notes containing:
1. PPTs
2. OHP slides
3. Subjective type questions (approximately 5 to 8 in number)
4. Objective type questions (approximately 20 to 30 in number)
5. Any simulations
1.8. Course Review ( By the concerned Faculty):
(I)Aims
(II) Sample check
(III) End of the course report by the concerned faculty
GUIDELINES:
Distribution of periods:
No. of classes required to cover JNTU syllabus: 40
No. of classes required to cover Additional topics:
No. of classes required to cover Assignment tests (for every 2 units 1 test):
No. of classes required to cover tutorials:
No. of classes required to cover Mid tests:
No. of classes required to solve University Question papers:
-------
Total periods: 62

UNIT-I

DEFINITIONS:
DATA MINING: Data mining refers to extracting or mining knowledge from
large amounts of data.
DATA MINING FUNCTIONALITIES: Characterization and discrimination;
mining frequent patterns, associations, and correlations; association analysis;
classification and prediction; cluster analysis; outlier analysis; trend and
evolution analysis.
CLASSIFICATION OF DATA MINING SYSTEMS:
General functionality
Descriptive data mining
Predictive data mining
Data mining systems can be classified by various criteria:
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted

Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend,
deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
MAJOR ISSUES IN DATA MINING
Mining methodology and user interaction issues
Performance issues
Issues relating to the diversity of data types

DATA PREPROCESSING
Integrating multiple, heterogeneous data sources
DATA CLEANSING
Ensure consistency in naming conventions, encoding structures, attribute measures,
etc. among different data sources
BITS
1. Regression is the oldest and most well-known statistical technique that the data mining
community utilizes.
2. Data mining is the use of automated data analysis techniques to uncover previously
undetected relationships among data items.
3. Three of the major data mining techniques are regression, classification and clustering.
4. Regression takes a numerical dataset and develops a mathematical formula that fits the
data.
5. Model construction describes a set of predetermined classes.
6. The model is represented as classification rules, decision trees, or mathematical formulae.
7. New data is classified based on the training set.
8. Clustering is a data mining (machine learning) technique used to place data elements
into related groups without advance knowledge of the group definitions.
9. Data mining is referred to as extracting or mining knowledge from large amounts of
data.
10. Clustering techniques include k-means clustering and expectation maximization (EM)
clustering.
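
As an illustration of bit 4, the sketch below fits a straight line to a small numerical dataset by ordinary least squares. The sample points and the function name fit_line are assumptions made for this example; the notes do not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted formula can then be used to predict y for an unseen x, which is the sense in which regression "develops a mathematical formula that fits the data".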

Easy Questions

1. What is data mining, and what is a data warehouse?
2. Explain data mining functionalities.
3. Explain the major issues in data mining.
4. Explain the classification of data mining systems.
5. Explain a multi-dimensional data model.
6. Explain data warehouse architecture.
7. Explain preprocessing techniques.

UNIT-II
DATA WAREHOUSING
A decision support database that is maintained separately from the organization's
operational database.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management's decision-making process.
DEFINITIONS:
OLAP (on-line analytical processing)
Major task of a data warehouse system
Data analysis and decision making
MULTIDIMENSIONAL DATA MODEL
Star schema
Snowflake schema
Fact constellations
CUBE DEFINITION (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>

DATA WAREHOUSE APPLICATIONS
Supports querying, basic statistical analysis, and reporting using crosstabs, tables,
charts and graphs
Multidimensional analysis of data warehouse data
Supports basic OLAP operations: slice-dice, drilling, pivoting
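
The OLAP operations named above can be illustrated on a tiny fact table held in memory. This is a minimal sketch, not how an OLAP server is implemented: the dimension names (quarter, item, city), the sales figures, and the helper functions are all hypothetical, and roll_up stands in for drilling up, the reverse of drilling down.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure (sales).
facts = [
    {"quarter": "Q1", "item": "phone", "city": "Hyderabad", "sales": 100},
    {"quarter": "Q1", "item": "laptop", "city": "Hyderabad", "sales": 250},
    {"quarter": "Q2", "item": "phone", "city": "Chennai", "sales": 120},
    {"quarter": "Q2", "item": "laptop", "city": "Hyderabad", "sales": 300},
]


def slice_cube(rows, dimension, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dimension] == value]


def dice_cube(rows, conditions):
    """Dice: select on two or more dimensions (dimension -> allowed values)."""
    return [r for r in rows
            if all(r[d] in allowed for d, allowed in conditions.items())]


def roll_up(rows, keep_dims, measure="sales"):
    """Roll up: aggregate the measure over the dimensions not in keep_dims."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in keep_dims)] += r[measure]
    return dict(totals)


q1 = slice_cube(facts, "quarter", "Q1")
hyd_laptops = dice_cube(facts, {"city": {"Hyderabad"}, "item": {"laptop"}})
by_item = roll_up(facts, ["item"])
print(by_item)
```

Pivoting would correspond to presenting the same aggregated totals with rows and columns exchanged, so it needs no additional computation here.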

Major Tasks in Data Preprocessing

Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Data integration:

Combines data from multiple sources into a coherent store
Redundant data often occur when multiple databases are integrated
The same attribute may have different names in different databases
One attribute may be a derived attribute in another table, e.g., annual revenue
Redundant data may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
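
A small sketch of detecting a redundant attribute by correlation analysis follows. It computes the Pearson correlation coefficient between two numeric attributes; the attribute names, the sample values, and the 0.95 threshold are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical attributes from two sources; annual revenue is derived
# from monthly revenue, so the two should be perfectly correlated.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
REDUNDANCY_THRESHOLD = 0.95  # assumed cut-off, not from the notes
is_redundant = abs(r) >= REDUNDANCY_THRESHOLD
print(round(r, 4), is_redundant)
```

A coefficient close to +1 or -1 suggests that one attribute can be derived from the other and is a candidate for removal during integration.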
Data reduction strategies
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation

What is the lowest level of a data cube?
The aggregated data for an individual entity of interest,
e.g., a customer in a phone calling data warehouse.

Parametric methods
Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
Log-linear models: obtain value at a point in m-D space as the product on
appropriate marginal subspaces

Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
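
Two of the non-parametric families listed above, histograms and sampling, can be sketched as follows. The price values, the bucket count, and the fixed seed are hypothetical choices for this example. The histogram helper assumes the values are not all equal, since the bucket width would otherwise be zero.

```python
import random


def equal_width_histogram(values, num_buckets):
    """Replace raw values by (low, high, count) triples for equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets  # assumes hi > lo
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]


def simple_random_sample(values, k, seed=0):
    """Simple random sample without replacement (seed fixed for repeatability)."""
    return random.Random(seed).sample(values, k)


# Hypothetical attribute values.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 30]

hist = equal_width_histogram(prices, 3)
sample = simple_random_sample(prices, 5)
print([count for _, _, count in hist])
```

Either the three bucket counts or the five sampled values can stand in for the fifteen original values, which is the point of numerosity reduction.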
Discretization
reduce the number of values for a given continuous attribute by dividing the range
of the attribute into intervals. Interval labels can then be used to replace actual data
values.
Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as numeric
values for the attribute age) by higher level concepts (such as young, middle-aged,
or senior).
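
The age example above can be sketched as a simple mapping from numeric values to interval labels. The cut points 30 and 60 are assumptions chosen for illustration; the notes give only the labels young, middle-aged and senior.

```python
AGE_BOUNDARIES = [30, 60]  # hypothetical upper bounds for the first two intervals
AGE_LABELS = ["young", "middle-aged", "senior"]


def discretize(value, boundaries, labels):
    """Return the label of the first interval whose upper bound exceeds value."""
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]  # values at or above the last boundary


ages = [22, 35, 47, 61, 29, 70]
generalized = [discretize(a, AGE_BOUNDARIES, AGE_LABELS) for a in ages]
print(generalized)
```

After this step the attribute age has three distinct values instead of one per distinct age, which reduces the data that later mining steps must handle.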
BITS
1. A data warehouse is a structured repository of historic data.
2. A data warehouse integrates data from multiple data sources.
3. A data warehouse is a copy of transaction data specifically structured for query and analysis.
4. OLAP stands for On-Line Analytical Processing.
5. OLAP can be broadly divided into two types: MOLAP and ROLAP.
6. A data warehouse maintains its functions in three layers: staging, integration, and access.
7. The data accessed for reporting and analyzing, together with the tools for reporting and
analyzing data, is also called the data mart.
8. Data access layer is the interface between the operational and informational access layer.
9. The data warehousing concept was intended to provide an architectural model for the flow
of data from operational systems to decision support environments.
10. The integration layer is used to integrate data and to have a level of abstraction from
users.

Easy Questions
1. Explain the pre-processing procedure.
2. Explain data transformation.
3. Explain data integration.
4. Explain data reduction.

UNIT-III
DEFINITIONS
DATAMINING PRIMITIVES
More flexible user interaction
Foundation for design of graphical user interface
Standardization of data mining industry and practice
DATAMINING QUERY LANGUAGES
A DMQL can provide the ability to support ad-hoc and interactive data mining
by providing a standardized language like SQL,
to achieve an effect similar to the one SQL has on relational databases
Foundation for system development and evolution
Facilitates information exchange, technology transfer, commercialization and
wide acceptance
What tasks should be considered in the design of GUIs based on a data mining
query language?

Data collection and data mining query composition


Presentation of discovered patterns
Hierarchy specification and manipulation
Manipulation of data mining primitives
Interactive multilevel mining
Other miscellaneous information

What Defines a Data Mining Task ?


Task-relevant data
Type of knowledge to be mined
Background knowledge
Pattern interestingness measurements
Visualization of discovered patterns
Task-Relevant Data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
What types of knowledge are to be mined?
Characterization
Discrimination

Association
Classification/prediction
Clustering
Outlier analysis
Other data mining tasks

What is the syntax for DMQL?
task-relevant data
the kind of knowledge to be mined
concept hierarchy specification
interestingness measure
pattern presentation and visualization
Syntax for Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]

Coupling a data mining system with a DB/DW system
No coupling: flat file processing
Loose coupling: fetching data from the DB/DW
Semi-tight coupling: enhanced DM performance
Association rule language specifications
MSQL (Imielinski & Virmani '99)
MineRule (Meo, Psaila and Ceri '96)
Query flocks based on Datalog syntax (Tsur et al. '98)
Syntax for Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
Discrimination

Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
What is the syntax for task-relevant data specification?
use database database_name, or use data warehouse data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition
BITS
1. Primitives of data mining include background knowledge and interestingness measures.
2. Background knowledge is the information about the domain to be mined.
3. Set-grouping hierarchies organize values for a given attribute into groups, sets, or
ranges of values.
4. Certainty (confidence) is defined as the ratio of the number of tuples containing both A and B
to the number of tuples containing A.
5. Data mining tools perform data analysis and contribute greatly to business strategies,
knowledge bases, and scientific and medical research.
6. Data mining is made more realistic by designing a query language and a good
architecture.
7. Drilling down is a specialization of data: concept values are replaced by lower-level
concepts.
8. Association rules that satisfy both the minimum confidence and support threshold are
referred to as strong association rules.
9. A data mining language must be designed to facilitate flexible and effective knowledge
discovery.
10. Semi-tight coupling: besides linking a DM system to a DB/DW system, it provides efficient
implementations of a few DM primitives.
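
Bits 4 and 8 can be made concrete with a short sketch that computes support and confidence for a rule A => B and checks whether the rule is strong. The transactions and the thresholds are hypothetical, and the confidence function assumes the antecedent occurs in at least one transaction.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Transactions containing both A and B divided by transactions containing A."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)  # assumes A occurs at least once


def is_strong(transactions, antecedent, consequent, min_sup, min_conf):
    """A rule is strong if it meets both the support and confidence thresholds."""
    rule_support = support(transactions, set(antecedent) | set(consequent))
    rule_confidence = confidence(transactions, antecedent, consequent)
    return rule_support >= min_sup and rule_confidence >= min_conf


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

conf_bread_milk = confidence(transactions, {"bread"}, {"milk"})
strong = is_strong(transactions, {"bread"}, {"milk"}, min_sup=0.5, min_conf=0.6)
print(round(conf_bread_milk, 3), strong)
```

Here bread => milk has support 2/4 and confidence 2/3, so it is strong under the assumed thresholds of 0.5 and 0.6.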

Easy Questions
1. Explain data mining primitives.
2. Explain a data mining query language.
3. Explain the architecture of data mining systems.
4. Design graphical user interfaces based on a data mining query language.

UNIT-IV
Descriptive mining: describes concepts or task-relevant data sets in
concise, summarized, informative, discriminative forms
Predictive mining: based on data and analysis, constructs models for the
database and predicts the trend and properties of unknown data
Concept description
Characterization: provides a concise and succinct summarization of the
given collection of data
Comparison: provides descriptions comparing two or more collections
of data
Data generalization
A process which abstracts a large set of task-relevant data in a database
from low conceptual levels to higher ones.
Generalized relation
Relations where some or all attributes are generalized, with counts or
other aggregation values accumulated.
Cross tabulation
Mapping results into cross tabulation form (similar to contingency
tables).

Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.

Quantitative characteristic rules
Mapping generalized results into characteristic rules with quantitative information
associated with them
Decision tree
each internal node tests an attribute
each branch corresponds to attribute value
each leaf node assigns a classification
ID3 algorithm
build decision tree based on training objects with known class labels to classify
testing objects
rank attributes with information gain measure
minimal height
the least number of tests to classify an object
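The attribute ranking used by ID3 can be illustrated with a small sketch. The training rows, attribute names, and helper names below are hypothetical and chosen only for illustration; the code computes the information gain of each attribute so that the one with the highest gain can be tested first.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder

# Hypothetical training objects with known class labels ("play")
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

# Rank attributes by information gain, highest first
ranked = sorted(["outlook", "windy"],
                key=lambda a: information_gain(rows, a, "play"),
                reverse=True)
```

In this toy data "outlook" separates the classes completely, so it would be chosen as the root test.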
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions -correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are
added, and outliers are plotted individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is
the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
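The quartile-based measures above can be computed with the Python standard library. This is a minimal sketch: the sample values are made up, and the "inclusive" quartile convention is an assumption (other conventions give slightly different Q1 and Q3).

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
```

For this sample only the value 100 falls outside the 1.5 x IQR fences, so a boxplot would draw it as an individual point.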
Standard deviation:
the square root of the variance
Measures spread about the mean
It is zero if and only if all the values are equal
Both the deviation and the variance are algebraic
Difference in philosophies and basic assumptions
Positive and negative samples in learning-from-example: positive used for
generalization, negative - for specialization
Positive samples only in data mining: hence generalization-based; to drill down,
backtrack the generalization to a previous state
Difference in methods of generalizations
Machine learning generalizes on a tuple by tuple basis
Data mining generalizes on an attribute by attribute basis
BITS

1. Characterization of the composition of the postsynaptic proteome (PSP) provides a
framework for understanding the overall organization and function.
2. Clustering using representatives is called CURE.
3. The Data Mining Server must be integrated with the data warehouse and the OLAP
server to embed ROI-focused business analysis directly into this infrastructure.
4. A decision tree is a technique used for classification of a dataset.
5. Classification is the process of dividing a dataset into mutually exclusive groups.
6. Data cleansing is the process of ensuring that all values in a dataset are consistent and
correctly recorded.
7. A data warehouse is a system for storing and delivering massive quantities of data.
8. An analytical model is a structure and process for analyzing a dataset.
9. Data navigation is the process of viewing different dimensions, slices, and levels of detail
of a multidimensional database.
10. Logistic regression is a linear regression that predicts the proportions of a categorical target
variable, such as type of customer, in a population.

Easy Questions
1. Explain what is concept description?
2. Data generalization and summarization-based characterization?
3. Analytical characterization: analysis of attribute relevance?
4. Mining descriptive statistical measures in large databases?

UNIT-V
Association rule mining
Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational
databases, and other information repositories.
Basic Concepts of Association Rule
Given a database of transactions, where each transaction is a list of items (purchased by a
customer in a visit)
Find all rules that correlate the presence of one set of items with that of another set
of items
Find frequent patterns
Example for frequent itemset mining is market basket analysis.

Association rule performance measures


Confidence
Support
Minimum support threshold
Minimum confidence threshold
Market Basket Analysis
Shopping baskets
Each item has a Boolean variable representing the presence or absence of that item.
Each basket can be represented by a Boolean vector of values assigned to these
variables.

Identify patterns from the Boolean vectors
Patterns can be represented by association rules.
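The Boolean-vector view of shopping baskets, together with the support and confidence measures, can be sketched as follows. The item catalogue and baskets are hypothetical examples, not data from the course.

```python
# Hypothetical item catalogue and shopping baskets
items = ["bread", "butter", "coke", "milk"]
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "coke"},
    {"milk", "coke"},
]

def to_boolean_vector(basket):
    """One Boolean per catalogue item: True if the item is in the basket."""
    return [item in basket for item in items]

vectors = [to_boolean_vector(b) for b in baskets]

def support(itemset):
    """Fraction of baskets whose vector is True for every item in itemset."""
    idx = [items.index(i) for i in itemset]
    hits = sum(all(v[i] for i in idx) for v in vectors)
    return hits / len(vectors)

def confidence(antecedent, consequent):
    """support(A and B) / support(A) for the rule A => B."""
    return support(antecedent | consequent) / support(antecedent)
```

Here `support({"milk", "bread"})` is 0.5 and the rule milk => bread has confidence 2/3, since two of the three baskets containing milk also contain bread.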

Apriori Algorithm
Single dimensional, single-level, Boolean frequent item sets
Finding frequent item sets using candidate generation
Generating association rules from frequent item sets
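The two Apriori phases listed above (finding frequent item sets by candidate generation, then generating rules) can be sketched in a few lines. This is a simplified illustration with hypothetical function names and transactions, not an optimized implementation; minimum support is given as an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

def rules(frequent, min_conf):
    """Generate (lhs, rhs, confidence) rules from the frequent itemsets."""
    out = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

# Hypothetical transactions
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "coke"}]
freq = apriori(transactions, min_support=2)
strong = rules(freq, min_conf=1.0)
```

The prune step relies on the Apriori property: any subset of a frequent item set must itself be frequent, which is also why `frequent[lhs]` is always available in `rules`.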
Single-dimensional rules
buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules
Inter-dimension association rules: no repeated predicates
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules: repeated predicates
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical Attributes
finite number of possible values, no ordering among values
Quantitative Attributes
numeric, implicit ordering among values
Static Discretization of Quantitative Attributes
Discretized prior to mining using concept hierarchy.
Numeric values are replaced by ranges.

In a relational database, finding all frequent k-predicate sets will require k
or k+1 table scans.
Data cube is well suited for mining.
The cells of an n-dimensional cuboid correspond to the predicate sets.
Mining from data cubes can be much faster.
Objective measures
Two popular measurements
support
confidence

Subjective measures
A rule (pattern) is interesting if
*it is unexpected (surprising to the user); and/or
*actionable (the user can do something with it)

Kinds of constraints
Knowledge type constraint: classification, association, etc.
Data constraint: SQL-like queries
Dimension/level constraints
Rule constraint
Interestingness constraints
A constraint Ca is anti-monotone iff for any pattern S not satisfying
Ca, none of the super-patterns of S can satisfy Ca
A constraint Cm is monotone iff for any pattern S satisfying Cm,
every super-pattern of S also satisfies it
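The two definitions can be checked on a toy example. The sketch below assumes hypothetical non-negative item prices; under that assumption a constraint of the form sum(S.price) <= v is anti-monotone and sum(S.price) >= v is monotone. The names `prices`, `sum_at_most`, `sum_at_least`, and `supersets` are illustrative only.

```python
from itertools import combinations

prices = {"a": 10, "b": 20, "c": 40}  # hypothetical, non-negative prices

def sum_at_most(itemset, v=50):
    """Anti-monotone when prices are non-negative: adding items cannot lower the sum."""
    return sum(prices[i] for i in itemset) <= v

def sum_at_least(itemset, v=50):
    """Monotone when prices are non-negative: adding items cannot lower the sum."""
    return sum(prices[i] for i in itemset) >= v

def supersets(itemset):
    """All proper supersets of itemset drawn from the price catalogue."""
    rest = [i for i in prices if i not in itemset]
    for r in range(1, len(rest) + 1):
        for extra in combinations(rest, r):
            yield set(itemset) | set(extra)

# {"b", "c"} violates sum <= 50, so no superset can satisfy it
violating = {"b", "c"}
none_recover = not any(sum_at_most(s) for s in supersets(violating))

# {"a", "c"} satisfies sum >= 50, so every superset satisfies it too
satisfying = {"a", "c"}
all_keep = all(sum_at_least(s) for s in supersets(satisfying))
```

This is why an anti-monotone constraint can be used for pruning: once a pattern fails it, its super-patterns need not be generated.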
Succinctness Property of Constraints
For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
Given that A1 is the collection of sets of size 1 satisfying C, any set S satisfying C is
based on A1, i.e., it contains a subset belonging to A1
Example:
sum(S.Price) ≥ v is not succinct
min(S.Price) ≤ v is succinct
BITS
1. An association rule is a pattern that states when X occurs, Y occurs with certain
probability
2. Goal: find all rules that satisfy the user-specified minimum support (minsup) and
minimum confidence (minconf).
3. Table data need to be converted to transaction form for association mining.
4. The subset function finds all the candidates contained in a transaction.
5. Transaction reduction: a transaction that does not contain any frequent k-itemset is
useless in subsequent scans.
6. Sampling: mining on a subset of the given data; needs a lower support threshold plus a
method to determine the completeness.
7. Iceberg query: compute aggregates over one or a set of attributes only for those whose
aggregate value is above a certain threshold.
8. A rule is redundant if its support is close to the expected value, based on the rule's
ancestor.
9. Data cube is well suited for mining.
10. Distance between clusters measures degree of association

Easy Questions
1. Explain association rule mining?
2. Mining single-dimensional Boolean association rules from transactional
databases?
3. Explain mining multilevel association rules from transactional databases?
4. Explain mining multidimensional association rules from transactional ?
5. Explain from association mining to correlation analysis?

UNIT-VI
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data

Prediction:
models continuous-valued functions
predicts unknown or missing values
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Issues regarding classification and prediction: comparing classification
methods
Accuracy
Speed and scalability
Robustness
Scalability
Interpretability
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning: majority voting is
employed for classifying the leaf
There are no samples left
Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued attributes

Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
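Rule extraction can be sketched by walking every root-to-leaf path. The tree below is a hypothetical example encoded as nested tuples and dictionaries (an assumed representation, chosen only for this illustration); each path becomes one IF-THEN rule whose condition is the conjunction of the attribute-value pairs along the path.

```python
# Hypothetical tree: internal nodes are (attribute, {value: subtree});
# leaves are class labels given as strings.
tree = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit", {"fair": "buys", "excellent": "does_not_buy"}),
})

def extract_rules(node, path=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if isinstance(node, str):  # leaf: holds the class prediction
        condition = " AND ".join(f"{a} = {v}" for a, v in path) or "TRUE"
        return [f"IF {condition} THEN class = {node}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, path + ((attribute, value),)))
    return rules

tree_rules = extract_rules(tree)
```

The example tree has five leaves, so five rules are produced, for instance "IF age = youth AND student = yes THEN class = buys".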
Two approaches to avoid overfitting
Prepruning: Halt tree construction early: do not split a node if this would result
in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a fully grown tree: get a sequence of
progressively pruned trees
Use a set of data different from the training data to decide which is the best
pruned tree
Approaches to Determine the Final Tree Size
Separate training and testing sets
Use cross-validation, e.g., 10-fold cross-validation
Use all the data for training
Use minimum description length (MDL) principle
Enhancements to basic decision tree induction
Allow for continuous-valued attributes
Handle missing attribute values
Attribute construction

Classification: a classical problem extensively studied by statisticians and
machine learning researchers
Scalability: Classifying data sets with millions of examples and hundreds of
attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification methods)
convertible to simple and easy to understand classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
Bayesian Classification
Statistical classifiers
Based on Bayes' theorem
Naïve Bayesian classification
Class conditional independence
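Under class conditional independence, the posterior score of a class is the prior multiplied by the per-attribute conditional probabilities. The sketch below uses a hypothetical five-row training set and plain frequency counts without smoothing (a simplifying assumption; practical implementations usually add a Laplace correction to avoid zero probabilities).

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                value_counts[(attr, r[target])][value] += 1
    return class_counts, value_counts

def predict(model, sample):
    """Return (best_class, scores) where score = P(C) * product of P(x_i | C)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior P(C)
        for attr, value in sample.items():
            score *= value_counts[(attr, cls)][value] / n  # P(x_i | C)
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical training data
rows = [
    {"student": "yes", "income": "low", "buys": "yes"},
    {"student": "yes", "income": "high", "buys": "yes"},
    {"student": "no", "income": "high", "buys": "no"},
    {"student": "no", "income": "low", "buys": "no"},
    {"student": "yes", "income": "high", "buys": "no"},
]
model = train(rows, "buys")
label, scores = predict(model, {"student": "yes", "income": "low"})
```

For the sample shown, the score for "yes" is 2/5 x 2/2 x 1/2 = 0.2, against 3/5 x 1/3 x 1/3 for "no", so "yes" is predicted.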
Bayesian belief networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Several cases of learning Bayesian belief networks
Given both network structure and all the variables: easy
Given network structure but only some variables

Extracting rules from a trained network
Discretize activation values; replace each individual activation value by the cluster
average while maintaining the network accuracy
Enumerate the output from the discretized activation values to find rules between
activation value and output
Find the relationship between the input and activation value
Combine the above two to have rules relating the output to input
Rough sets are used to approximately or roughly define equivalent classes
A rough set for a given class C is approximated by two sets: a lower approximation
(certain to be in C) and an upper approximation (cannot be described as not
belonging to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard,
but a discernibility matrix is used to reduce the computation intensity
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy
values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed
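The fuzzification and voting steps can be sketched as follows. The triangular membership functions and their breakpoints are assumptions made for this illustration (the notes do not fix a particular membership shape), as are the function names.

```python
def triangular(x, left, peak, right):
    """Degree of membership in [0.0, 1.0] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical income categories (income in thousands)
income_sets = {"low": (0, 20, 40), "medium": (30, 50, 70), "high": (60, 80, 100)}

def fuzzify(income):
    """Map a crisp income to a fuzzy value for each category."""
    return {name: triangular(income, *abc) for name, abc in income_sets.items()}

def vote(rule_votes):
    """Sum the truth values contributed by each applicable rule, per category."""
    totals = {}
    for category, truth in rule_votes:
        totals[category] = totals.get(category, 0.0) + truth
    return totals

memberships = fuzzify(35)
totals = vote([("approve", 0.25), ("reject", 0.5), ("approve", 0.5)])
```

An income of 35 belongs partly to both "low" and "medium", which shows how more than one fuzzy value may apply to a single sample.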
What Is Prediction?
First, construct a model
Second, use model to predict unknown value

Major method for prediction is regression
Linear and multiple regressions
Non-linear regression
Linear regression: Y = α + βX
Two parameters, α and β, specify the line and are to be estimated by using the data
at hand.
Apply the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above.
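The least squares estimate for simple linear regression has a closed form: the slope is the ratio of the covariance of X and Y to the variance of X, and the intercept follows from the means. A minimal sketch, with made-up data points and the hypothetical names `alpha` (intercept) and `beta` (slope):

```python
def fit_line(xs, ys):
    """Least squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical observations lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
alpha, beta = fit_line(xs, ys)
predicted = alpha + beta * 5  # predict the unknown value at x = 5
```

Once the two parameters are estimated, the fitted line is used to predict unknown values, as in the last line.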
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
BITS
1. Model construction: describing a set of predetermined classes.
2. Scalability: classifying data sets with millions of examples and hundreds of attributes
with reasonable speed.
3. Classification predicts categorical class labels.
4. Data Cleaning preprocesses data in order to reduce noise and handle missing values.
5. Probabilistic prediction predicts multiple hypotheses, weighted by their probabilities.
6. CAEP stands for Classification by aggregating emerging patterns.
7. In Genetic Algorithm each rule is represented by a string of bits.
8. Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership.
9. Prediction models continuous-valued functions.
10. Predictive modeling: predict data values or construct generalized linear models based
on the database data.

Easy Questions
1. What is classification? What is prediction?
2. Explain issues regarding classification and prediction?
3. Explain classification by decision tree induction?
4. Explain Bayesian classification?
5. Explain classification by back propagation?
6. Explain classification based on concepts from association rule mining?
7. Explain other classification methods?
8. Explain prediction and classification accuracy?

UNIT-VII
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science (market research)
WWW
Examples of Clustering Applications
Marketing, Land use, Insurance, City-planning, Earth-quake studies
A good clustering method will produce high quality clusters with
High intra-class similarity
Low inter-class similarity
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Type of data in clustering analysis

Interval-scaled variables
Binary variables
Categorical, Ordinal, and Ratio Scaled variables
Variables of mixed types
Major Clustering Approaches
Partitioning algorithms
Hierarchy algorithms
Density-based
Grid-based
Model Based
Outlier Analysis
CLARA (Clustering Large Applications) (Kaufman and Rousseeuw, 1990)
Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each sample, and gives
the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang,
Ramakrishnan, Livny (SIGMOD'96)
CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis,
E.H. Han and V. Kumar (1999)
DBSCAN Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database.
Continue the process until all of the points have been processed.
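The DBSCAN steps above can be sketched with a brute-force neighborhood search. The parameter names `eps` and `min_pts` stand for Eps and MinPts; the sample points are hypothetical, and the linear-scan `region_query` is a simplification (a spatial index would normally be used for large databases).

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may later become a border point
            continue
        labels[i] = cluster  # p is a core point: a cluster is formed
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster += 1
    return labels

# Hypothetical 2-D points: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) has too few neighbors to be a core point and is never reached from one, so it stays labeled as noise.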
Limitations of COBWEB
The assumption that the attributes are independent of each other is often too strong
because correlation may exist
Not suitable for clustering large database data: skewed tree and expensive
probability distributions
Neural network approaches
Represent each cluster as an exemplar, acting as a prototype of the cluster
New objects are distributed to the cluster whose exemplar is the most similar
according to some distance measure
Outliers
A set of objects that are considerably dissimilar from the remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that
at least a fraction p of the objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm

Nested-loop algorithm
Cell-based algorithm
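The DB(p, D)-outlier definition translates directly into a naive nested-loop check. This sketch compares every pair of objects and omits the block-wise buffering that a disk-based nested-loop algorithm would use; the function name and sample points are hypothetical.

```python
from math import dist

def db_outliers(points, p, d):
    """Objects for which at least a fraction p of all objects lie farther than d away."""
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for q in points if dist(o, q) > d)
        if far / n >= p:
            outliers.append(o)
    return outliers

# Hypothetical 2-D data: two groups and one distant object
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
found = db_outliers(points, p=0.8, d=5)
```

With p = 0.8 and D = 5, only (50, 50) qualifies: six of the seven objects lie farther than 5 from it, while each grouped object has only four such objects.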
Sequential exception technique
Simulates the way in which humans can distinguish unusual objects from among a
series of supposedly like objects
OLAP data cube technique
Uses data cubes to identify regions of anomalies in large multidimensional data
BITS
1. Clustering is the assignment of a set of observations into subsets.
2. Subspace clustering methods look for clusters that can only be seen in a particular
projection of the data.
3. Many clustering algorithms require the specification of the number of clusters to
produce in the input data set, prior to execution of the algorithm.
4. A distance measure determines how the similarity of two elements is calculated.
5. Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree
structure called a dendrogram.
6. QT clustering is an alternative method of partitioning data, invented for gene clustering.
7. QT clustering QT stands for Quality Threshold.
8. Formal concept analysis is a technique for generating clusters(called formal concepts)
of objects and attributes.
9. Evaluation of clustering is sometimes referred to as Cluster validation.
10. Several different clustering systems based on mutual information have been
proposed.

Easy Questions
1. What is cluster analysis?
2. Explain types of data in cluster analysis?
3. Explain a categorization of major clustering methods?
4. Explain partitioning methods?
5. Explain hierarchical methods?
6. Explain density-based methods?
7. Explain grid-based methods?
8. Explain model-based clustering methods?
9. Explain outlier analysis?

UNIT-VIII
Set-valued attribute
Generalization of each value in the set into its corresponding higher-level concepts
Derivation of the general behavior of the set, such as the number of elements in the
set, the types or value ranges in the set, or the weighted average for numerical data
hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports,
music, video_games}
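Generalizing a set-valued attribute amounts to mapping each element through a concept hierarchy and, optionally, deriving summary behavior such as counts. The mapping below is hypothetical (in particular, placing chess under sports is an assumption made to reproduce the example above).

```python
from collections import Counter

# Hypothetical concept hierarchy: low-level value -> higher-level concept
concept_hierarchy = {
    "tennis": "sports", "hockey": "sports", "chess": "sports",
    "violin": "music", "nintendo_games": "video_games",
}

def generalize(values):
    """Replace each value in the set by its higher-level concept."""
    return {concept_hierarchy.get(v, v) for v in values}

def concept_counts(values):
    """Derive general behavior of the set: how many elements fall under each concept."""
    return Counter(concept_hierarchy.get(v, v) for v in values)

hobby = {"tennis", "hockey", "chess", "violin", "nintendo_games"}
generalized = generalize(hobby)
counts = concept_counts(hobby)
```

Values missing from the hierarchy are kept unchanged, which is one possible design choice; another would be to map them to a catch-all concept.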
List-valued or a sequence-valued attribute
Same as set-valued attributes except that the order of the elements in the sequence
should be observed in the generalization

Spatial data:
Generalize detailed geographic points into clustered regions, such as business,
residential, industrial, or agricultural areas, according to land usage
Require the merge of a set of geographic areas by spatial operations
Image data:
Extracted by aggregation and/or approximation
Size, color, shape, texture, orientation, and relative positions and structures of the
contained objects or regions in the image
Music data:
Summarize its melody: based on the approximate patterns that repeatedly occur in
the segment
Summarize its style: based on its tone, tempo, or the major musical instruments
played
Object identifier: generalize to the lowest level of class in the class/subclass
hierarchies
Class composition hierarchies
generalize nested structured data
generalize only objects closely related in semantics to the current one
Plan: a variable sequence of actions
E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price,
seat>
Plan mining: extraction of important or significant generalized (sequential)
patterns from a planbase (a large collection of plans)
E.g., Discover travel patterns in an air flight database, or
find significant patterns from the sequences of actions in the repair of automobiles

Spatial data warehouse: Integrated, subject-oriented, time-variant, and
nonvolatile spatial data repository for data analysis and decision making
Spatial data integration: a big issue
Structure-specific formats (raster- vs. vector-based, OO vs. relational models,
different storage and indexing)
Vendor-specific formats (ESRI, MapInfo, Intergraph)
Spatial data cube: multidimensional spatial database
Both dimensions and measures may contain spatial components
Spatial association rule: A ⇒ B [s%, c%]
A and B are sets of spatial or nonspatial predicates
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
Hierarchy of spatial relationship:
g_close_to: near_by, touch, intersect, contain, etc.
First search for rough relationship and then refine it
Spatial classification
Analyze spatial objects to derive classification schemes, such as decision trees in
relevance to certain spatial properties (district, highway, river, etc.)
Example: Classify regions in a province into rich vs. poor according to the average
family income
Description-based retrieval systems
Build indices and perform object retrieval based on image descriptions, such as
keywords, captions, size, and time of creation

Labor-intensive if performed manually
Results are typically of poor quality if automated
Content-based retrieval systems
Support retrieval based on the image content, such as color histogram, texture,
shape, objects, and wavelet transforms
Image sample-based queries:
Find all of the images that are similar to the given image sample
Compare the feature vector (signature) extracted from the sample with the feature
vectors of images that have already been extracted and indexed in the image
database
Image feature specification queries:
Specify or sketch image features like color, texture, or shape, which are translated
into a feature vector
Match the feature vector with the feature vectors of the images in the database
Time-series database
Consists of sequences of values or events changing with time
Data is recorded at regular intervals
Characteristic time-series components
Trend, cycle, seasonal, irregular
Estimation of cyclic variations
If (approximate) periodicity of cycles occurs, cyclic index can be constructed in
much the same manner as seasonal indexes
Estimation of irregular variations
By adjusting the data for trend, seasonal and cyclic variations

Steps for performing a similarity search
Atomic matching
Find all pairs of gap-free windows of a small length that are similar
Window stitching
Stitch similar windows to form pairs of large similar subsequences allowing gaps
between atomic matches
Subsequence Ordering
Linearly order the subsequence matches to determine whether enough similar
pieces exist
Problems with the Web linkage structure
Not every hyperlink represents an endorsement
Other purposes are for navigation or for paid advertisements
If the majority of hyperlinks are for endorsement, the collective opinion will still
dominate
One authority will seldom have its Web page point to its rival authorities in the
same field
Authoritative pages are seldom particularly descriptive
Hub
Set of Web pages that provides collections of links to authorities
HITS (Hyperlink-Induced Topic Search)
Explore interactions between hubs and authoritative pages
Use an index-based search engine to form the root set
Expand the root set into a base set
Apply weight-propagation
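The weight-propagation step of HITS can be sketched as repeated mutual reinforcement between hub and authority scores. The link graph below is hypothetical, the root-set and base-set construction is not shown, and the fixed iteration count is an assumption standing in for a convergence test.

```python
from math import sqrt

def hits(links, iterations=50):
    """links: {page: [pages it points to]}. Returns (authority, hub) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of the pages p points to
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

# Hypothetical base set: two hub pages linking to two candidate authorities
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(links)
```

Page a1 is linked from both hubs and ends up with the highest authority score; h1 links to both authorities and ends up as the best hub.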

Design of a Web Log Miner
Web log is filtered to generate a relational database
A data cube is generated from the database
OLAP is used to drill-down and roll-up in the cube
OLAM is used for mining interesting knowledge
Benefits of Multi-Layer Meta-Web
Multi-dimensional Web info summary analysis
Approximate and intelligent query answering
Web high-level query answering (WebSQL, WebML)
Web content and structure mining
Observing the dynamics/evolution of the Web
BITS
1. A time Series Database consists of sequences of values or events obtained over repeated
measurements of time.
2. Sequential Pattern Mining is the discovery of frequently occurring ordered events as
patterns.
3. DSMS stands for Data Stream Management System.
4. A spatial database stores a large amount of space-related data, such as maps and medical
imaging data.
5. Spatial data mining refers to extraction of knowledge, spatial relationships that are not
explicitly stored in spatial databases.
6. MBR stands for minimum bounding rectangle which is taken as rough estimation of a
merged region.
7. A set-valued attribute may be homogeneous or heterogeneous.
8. Data cleaning refers to preprocessing data in order to remove or reduce data noise.
9. Scalability refers to ability to construct the classifier or predictor efficiently given large
amount of data.
10. Decision tree induction is a learning of decision trees from class labeled training tuples.

Easy Questions
1. Explain multidimensional analysis and descriptive mining of complex data objects?
2. Explain mining spatial databases?
3. Explain mining multimedia databases?
4. Explain mining time-series and sequence data?
5. Explain mining text databases?
6. Explain mining the World-Wide Web?
