DWM Course
IT
Version
Year:
III
Document Number :GCET/IT/304 **
Semester:
Pages:90
No. of
Prepared by :
1) Name :
2) Sign
Updated by :
Y.RAJU
1) Name
2) Sign
3) Design :ASSOC.PROFF
3) Design
4) Date
4) Date
:2
Verified by :
1) Name :
2) Sign
:
Sign
:
3) Design :
4) Date
Approved by (HOD) :
1) Name:
2)
3) Design :
4) Date :
2) Sign
3) Date
SYLLABUS
UNIT-I
INTRODUCTION: Fundamentals of data mining, Data Mining Functionalities,
Classification of Data Mining systems, Major issues in Data Mining.
Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and
Transformation, Data Reduction, Discretization and Concept Hierarchy Generation.
UNIT-II
Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional
Data Model, Data Warehouse Architecture, Data Warehouse
Implementation, Further Development of Data Cube Technology, From Data Warehousing to
Data Mining.
UNIT-III
DATA MINING PRIMITIVES, LANGUAGES AND SYSTEM ARCHITECTURES: Data
Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based
on a Data Mining Query Language, Architectures of Data Mining Systems.
UNIT-IV
CONCEPT DESCRIPTION: Characterization and Comparison: Data Generalization and
Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute
Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining
Descriptive Statistical Measures in Large Databases.
UNIT-V
MINING ASSOCIATION RULES IN LARGE DATABASES: Association Rule Mining,
Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining
Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association
Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation
Analysis, Constraint-Based Association Mining.
UNIT-VI
CLASSIFICATION AND PREDICTION: Issues Regarding Classification and Prediction,
Classification by Decision Tree Induction, Bayesian Classification, Classification by
Backpropagation, Classification Based on Concepts from Association Rule Mining, Other
Classification Methods, Prediction, Classifier Accuracy.
UNIT-VII
CLUSTER ANALYSIS INTRODUCTION: Types of Data in Cluster Analysis, A
Categorization of Major Clustering Methods, Partitioning Methods, Density-Based Methods,
Grid-Based Methods, Model-Based Clustering Methods, Outlier Analysis.
UNIT-VIII
MINING COMPLEX TYPES OF DATA: Multidimensional Analysis and Descriptive Mining of
Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining
Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web.
TEXT BOOKS :
1. Data Mining Concepts and Techniques - JIAWEI HAN & MICHELINE
KAMBER Harcourt India.
REFERENCES :
1. Data Mining Introductory and Advanced Topics - MARGARET H DUNHAM,
PEARSON EDUCATION
2. Data Mining Techniques - ARUN K PUJARI, University Press.
3. Data Warehousing in the Real World - SAM ANAHORY & DENNIS
MURRAY, Pearson Edn Asia.
Department: IT
1.1.
Data mining is a powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. Data mining tools predict future trends and
behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision support systems. Data mining tools
can answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may miss
because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining
techniques can be implemented rapidly on existing software and hardware platforms to enhance
the value of existing information resources, and can be integrated with new products and systems
as they are brought on-line. When implemented on high performance client/server or parallel
processing computers, data mining tools can analyze massive databases to deliver answers to
questions such as, "Which clients are most likely to respond to my next promotional mailing, and
why?"
This white paper provides an introduction to the basic technologies of data mining.
Examples of profitable applications illustrate its relevance to today's business environment, and
a basic description shows how data warehouse architectures can evolve to deliver the value of
data mining to end users.
Never try to cleanse ALL the data. Everyone would like to have all the data perfectly
clean, but nobody is willing to pay for the cleansing or to wait for it to get done. To clean it all
would simply take too long. The time and cost involved often exceed the benefit.
Never cleanse NOTHING. In other words, always plan to clean something. After all,
one of the reasons for building the data warehouse is to provide cleaner and more reliable data
than you have in your existing OLTP or DSS systems.
Determine the benefits of having clean data. Examine the reasons for building the data
warehouse:
Determine the cost for cleansing the data. Before you make cleansing all the dirty data
your goal, you must determine the cleansing cost for each dirty data element. Examine how long
it would take to perform the following tasks:
Compare cost for cleansing to dollars lost by leaving it dirty. Everything in business
must be cost-justified. This applies to data cleansing as well. For each data element, compare the
cost for cleansing it to the business loss being incurred by leaving it dirty and decide whether to
include it in your data cleansing goal. If dollars lost exceeds the cost of cleansing, put the data on
the "to be cleansed" list. If cost for cleansing exceeds dollars lost, do not put the data on the "to
be cleansed" list.
Prioritize the dirty data you considered for your data cleansing goal. A difficult part
of compromising is balancing the time you have for the project with the goals you are trying to
achieve. Even though you may have been cautious in selecting dirty data for your cleansing goal,
you may still have too much dirty data on your "to be cleansed" list. Prioritize your list.
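The cost comparison and prioritization steps above can be sketched in a few lines of Python. This is only an illustration: the element names and dollar figures are made up, and ranking by net benefit (dollars lost minus cleansing cost) is an assumed way to prioritize, not a rule stated in these notes.

```python
from dataclasses import dataclass

@dataclass
class DirtyElement:
    name: str              # hypothetical data element name
    cleansing_cost: float  # estimated cost to cleanse the element
    dollars_lost: float    # estimated business loss if it stays dirty

def to_be_cleansed(elements):
    """Keep elements whose loss exceeds the cleansing cost, largest net benefit first."""
    selected = [e for e in elements if e.dollars_lost > e.cleansing_cost]
    # Assumed priority rule: highest (loss - cost) first
    return sorted(selected, key=lambda e: e.dollars_lost - e.cleansing_cost, reverse=True)

elements = [
    DirtyElement("customer_address", 5_000, 20_000),
    DirtyElement("product_code", 12_000, 8_000),   # costs more to clean than it loses
    DirtyElement("order_date", 2_000, 9_000),
]

for e in to_be_cleansed(elements):
    print(e.name, e.dollars_lost - e.cleansing_cost)
```

Here "product_code" stays off the "to be cleansed" list because its cleansing cost exceeds the dollars lost.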
For each prioritized dirty data item ask: Can it be cleansed? You may have to do
some research to find out whether the "good data" still exists anywhere. Places to search could be
other files and databases, old documentation, manual file folders and even desk drawers.
Sometimes the data values are so convoluted that to write the transformation logic, you may have
to find some "old-timers" who still remember what all the data values meant. Then there will be
times when, after several days of research, you find out that you couldn't cleanse a data element
even if you wanted to; and you have to remove the item from your cleansing goal.
As you document your data cleansing goal, you want to include the following
information:
records)
of records)
S.no | UNIT NO | Topic | Additional Topics
1
1 Introduction
Fundamentals of data mining,
Data Mining Functionalities
Classification of Data Mining systems,
Major issues in Data Mining.
Cleaning, Data Integration and Transformation
Data Reduction
Discretization and Concept Hierarchy Generation
2
Data Model
Development of Data Cube Technology
From Data Warehousing to Data Mining.
UNIT-III
Data Mining Primitives
Testing methods
Languages
System Architectures
UNIT-IV
Concepts Description.
Characterization and Comparison
Data Generalization and Summarization-Based Characterization
Analytical Characterization
Analysis of Attribute Relevance
Mining Class Comparisons
Discriminating between Different Classes,
Mining Descriptive Statistical Measures in Large Databases
UNIT-V
Mining Association Rules in Large Databases:
Association Rule Mining,
Mining Single-Dimensional Boolean Association Rules from Transactional Databases,
Warehouses, From Association Mining to Correlation Analysis,
Constraint-Based Association Mining
Association Mining
6
UNIT-VI
Classification and Prediction
Issues Regarding Classification and Prediction
Classification by Decision Tree Induction,
Bayesian Classification
Classification by Backpropagation,
Classification Based on Concepts from Association Rule Mining,
UNIT-VII
Cluster Analysis Introduction.
UNIT-VIII
Mining Complex Types of Data
Multidimensional Analysis and Descriptive Mining of Complex Data Objects
Mining Spatial Databases
Time-Series and Sequence
I.4.2. Reference Text Books:
1. Data Mining Introductory and Advanced Topics - MARGARET H DUNHAM, PEARSON EDUCATION
2. Data Mining Techniques - ARUN K PUJARI, University Press.
3. Data Warehousing in the Real World - SAM ANAHORY & DENNIS MURRAY, Pearson Edn Asia.
4. Data Warehousing Fundamentals - PAULRAJ PONNIAH, WILEY STUDENT EDITION.
5. The Data Warehouse Lifecycle Toolkit - RALPH KIMBALL, WILEY STUDENT EDITION.
I.4.4. Journals:-
S.no | UNIT NO | Topic | Additional Topics
1
of Data Mining systems,
Major issues in Data Mining.
Data Preprocessing: Need for Preprocessing the Data, Data Cleaning,
Data Integration and Transformation,
Data Reduction,
Discretization and Concept Hierarchy Generation
2
Data Mining Data Warehouse,
Data Warehouse Architecture
Data Warehouse Implementation
QTP
Further Development of Data Cube Technology
From Data Warehousing to Data Mining.
Multidimensional Data Model
3
Data Mining Primitives,
Silk Testing
Characterization:
Analysis of Attribute Relevance, Mining Class Comparisons
Discriminating between Different Classes,
Mining Descriptive Statistical Measures in
5
Multilevel Association Rules from Transaction Databases
Mining Multidimensional Association Rules from Relational Databases and Data Warehouses,
From Association Mining to Correlation Analysis
Constraint-Based Association Mining.
by Decision Tree Induction,
Bayesian Classification
KVCHART APPLICATION
Classification Based on Concepts from
7 Cluster Analysis Introduction
Types of Data in Cluster Analysis,
Automation Techniques
S.L No | Unit No | Total no of Periods | Topics to be covered | Reg/Additional | Teaching aids used (LCD/OHP/BB) | Remarks
Regular | OHP,BB
Regular | OHP,BB
Regular | OHP,BB
Major issues in Data Mining. | Regular | OHP,BB
Data Preprocessing: Need for Preprocessing the Data | Regular | BB
Data Cleaning, Data Integration and Transformation
2
Data Warehouse and | Regular | OHP,BB
BB
Warehouse,
Multidimensional Data Model. | Regular | OHP,BB
Regular | BB
10 | Data Warehouse Implementation, | Regular | BB
11 | Further Development of Data Cube Technology, | Regular | OHP,BB
Regular | BB
12 | Data Mining Primitives | Regular
13 | Regular | BB
14 | Regular | BB
15
OHP,BB
OHP,BB
BB
Systems.
Languages, and System Architectures | Regular | OHP,BB
17 | Concepts Description | Regular | BB
18 | Regular | BB
19 | Characterization | BB
20 | Analytical Characterization: Analysis of Attribute Relevance | Regular | BB
Mining Class Comparisons: Discriminating | Regular | BB
21
23 | Regular | BB
24 | Regular | BB
25 | Mining Single-Dimensional Boolean Association Rules from Transactional Databases | Regular | BB
26 | Transaction Databases | BB
27 | Relational Databases and Data Warehouses | BB
Analysis, Constraint-Based Association Mining.
6
28 | Classification and Prediction | Regular | OHP,BB
29 | Issues Regarding Classification and Prediction | Regular | BB
30 | Regular | BB
31 | Bayesian Classification | Regular | OHP
32 | Classification by Backpropagation, | Regular | OHP,BB
Prediction, Classifier Accuracy. | Regular | OHP,BB
33 | Cluster Analysis Introduction | Regular | BB
34 | Regular | BB
35 | Categorization of Major Clustering Methods | Regular | OHP,BB
36 | Partitioning Methods | Regular | BB
37 | Regular | OHP,BB
38 | Mining of Complex | BB
39 | Regular | LCD,OHP,BB
40 | Regular | OHP,BB
41 | Regular | BB
42 | Regular | OHP,BB
43 | Mining the World Wide Web. | Regular | BB
1.ppts
2.ohp slides
3. Subjective type questions (approximately 5 to 8 in number)
4. Objective type questions (approximately 20 to 30 in number)
5. Any simulations
1.8. Course Review ( By the concerned Faculty):
(I)Aims
(II) Sample check
(III) End of the course report by the concerned faculty
GUIDELINES:
Distribution of periods:
: 40
No. of classes required to cover Assignment tests (for every 2 units 1 test)
Question papers
-------
Total periods
62
UNIT-I
DEFINITIONS:
DATA MINING: Data mining refers to extracting or mining knowledge from
large amounts of data.
DATA MINING FUNCTIONALITIES: Characterization and discrimination,
mining frequent patterns, associations, and correlations, association analysis,
classification and prediction, cluster analysis, outlier analysis, trend and
evolution analysis.
CLASSIFICATION OF DATA MINING SYSTEMS:
General functionality
Descriptive data mining
Predictive data mining
Classification according to various criteria:
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
DATA PREPROCESSING
integrating multiple, heterogeneous data sources
DATA CLEANSING
Ensure consistency in naming conventions, encoding structures, attribute measures,
etc. among different data sources
BITS
1. Regression is the oldest and most well-known statistical technique that the data mining
community utilizes.
2. Data mining is the use of automated data analysis techniques to uncover previously
undetected relationships among data items.
3. Three of the major data mining techniques are regression, classification and clustering.
4. Regression takes a numerical dataset and develops a mathematical formula that fits the
data.
5.
6.
7.
8.
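Bit 4 above describes regression as developing a mathematical formula that fits a numerical dataset. A minimal sketch of that idea, assuming a straight-line model y = a + b*x fitted by ordinary least squares, is shown below. The data points are made up for illustration and lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # made-up data on the line y = 1 + 2x
a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f} * x")
```

The two numbers a and b are the "formula"; once they are known, a value of y can be predicted for any new x.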
Easy Questions
UNIT-II
DATA WAREHOUSING
A decision support database that is maintained separately from the organization's
operational database
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management's decision-making process.
DEFINITIONS:
OLAP (on-line analytical processing)
<measure _list>
DATA WAREHOUSE APPLICATIONS
Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables,
charts and graphs
Analytical processing: multidimensional analysis of data warehouse data;
supports basic OLAP operations: slice-dice, drilling, pivoting
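The basic OLAP operations named above (slice, dice, drilling, pivoting) can be sketched on a tiny in-memory fact table. This is an illustrative sketch only: the dimensions (year, region, product), the sales figures and the function names are all assumptions, and a real OLAP server would work on a stored data cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact table: (year, region, product, sales)
facts = [
    ("2023", "North", "TV", 100),
    ("2023", "North", "Phone", 150),
    ("2023", "South", "TV", 80),
    ("2024", "North", "TV", 120),
    ("2024", "South", "Phone", 200),
]
DIMS = ("year", "region", "product")

def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]

def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the given value sets."""
    return [r for r in rows
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]

def roll_up(rows, keep):
    """Roll-up: sum sales over every dimension not listed in keep."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[DIMS.index(d)] for d in keep)] += r[-1]
    return dict(totals)

def pivot(rows, row_dim, col_dim):
    """Pivot: arrange two dimensions as a cross-tab of summed sales."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[DIMS.index(row_dim)]][r[DIMS.index(col_dim)]] += r[-1]
    return {k: dict(v) for k, v in table.items()}

print(roll_up(facts, ("year",)))             # coarse view: totals per year
print(roll_up(facts, ("year", "region")))    # drill down: add the region level
print(pivot(facts, "region", "product"))
```

Drilling down is shown here as the reverse of roll-up: the same aggregation with one more dimension kept.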
Parametric methods
Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
Log-linear models: obtain value at a point in m-D space as the product on
appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
Discretization
reduce the number of values for a given continuous attribute by dividing the range
of the attribute into intervals. Interval labels can then be used to replace actual data
values.
Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as numeric
values for the attribute age) by higher level concepts (such as young, middle-aged,
or senior).
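Discretization and concept hierarchies, as described above, can be sketched as follows. Equal-width binning is only one possible discretization method, and the age cut-offs (30 and 60) are assumptions chosen for illustration; the notes name the concepts young, middle-aged and senior but do not give boundaries.

```python
def equal_width_bins(values, k):
    """Divide the range of values into k equal-width intervals; return one label per value.

    Assumes the values are not all equal (otherwise the width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        i = min(int((v - lo) / width), k - 1)   # put the maximum into the last interval
        labels.append(f"[{lo + i * width:g}, {lo + (i + 1) * width:g})")
    return labels

def age_concept(age):
    """Replace a numeric age by a higher-level concept (hypothetical cut-offs)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [20, 25, 40, 55, 80]
print(equal_width_bins(ages, 3))
print([age_concept(a) for a in ages])
```

In both cases the interval label or concept name replaces the raw value, which reduces the number of distinct values the mining step has to handle. Note that the last interval label is printed as half-open although it also holds the maximum value.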
BITS
1.
2.
3.
4.
5.
6. A data warehouse maintains its functions in three layers: staging, integration, and access.
7. The data accessed for reporting and analyzing, together with the tools for reporting and analyzing
data, is also called the data mart.
8. The data access layer is the interface between the operational and informational access layers.
9. The data warehousing concept was intended to provide an architectural model for the flow
of data from operational systems to decision support environments.
10. The integration layer is used to integrate data and to have a level of abstraction from
users.
Easy Questions
1.
2.
3.
4.
UNIT-III
DEFINITIONS
DATAMINING PRIMITIVES
More flexible user interaction
Foundation for design of graphical user interface
Standardization of data mining industry and practice
DATAMINING QUERY LANGUAGES
A DMQL can provide the ability to support ad-hoc and interactive data mining
By providing a standardized language like SQL, a DMQL aims
to achieve an effect similar to the one SQL has had on relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer, commercialization and
wide acceptance
What kinds of knowledge can be specified in a data mining
query?
Association
Classification/prediction
Clustering
Outlier analysis
Other data mining tasks
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
What tasks should be considered in the design of GUIs based on a data mining
query language?
Data collection and data mining query composition
Presentation of discovered patterns
Hierarchy specification and manipulation
Manipulation of data mining primitives
Interactive multilevel mining
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
What is the syntax for task-relevant data specification?
use database database_name, or use data warehouse data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition
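To show how the clauses of the task-relevant data specification fit together, the sketch below assembles one as plain text. It does not assume any DMQL interpreter: the function name, the database, the relations and the attributes are all hypothetical, and the code only follows the clause order given above.

```python
def task_relevant_spec(database, relations, attributes,
                       where=None, order_by=None, group_by=None, having=None):
    """Build a task-relevant data specification string, clause by clause."""
    lines = [
        f"use database {database}",
        "from " + ", ".join(relations) + (f" where {where}" if where else ""),
        "in relevance to " + ", ".join(attributes),
    ]
    # Optional clauses are emitted only when supplied
    if order_by:
        lines.append("order by " + ", ".join(order_by))
    if group_by:
        lines.append("group by " + ", ".join(group_by))
    if having:
        lines.append(f"having {having}")
    return "\n".join(lines)

spec = task_relevant_spec(
    "sales_db",                          # hypothetical database name
    ["customer", "purchases"],           # hypothetical relations
    ["age", "income", "item_name"],      # hypothetical attributes
    where="purchases.amount > 100",
    group_by=["age"],
)
print(spec)
```

A GUI built on a data mining query language could collect the same pieces from form fields and compose the query in this way.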
BITS
1. Primitives of data mining include background knowledge and interestingness measures.
2. Background knowledge is the information about the domain to be mined.
3. Set-grouping hierarchies organize values for a given attribute into groups, sets, or
ranges of values.
4. Certainty (confidence) is defined as the ratio of the number of tuples containing both A and B
to the number of tuples containing A.
5. Data mining tools perform data analysis and contribute greatly to business strategies,
knowledge bases, and scientific and medical research.
6. Data mining becomes more realistic when a query language and a good architecture are
designed.
7. Drilling down is a specialization of data: concept values are replaced by lower-level
concepts.
8. Association rules that satisfy both the minimum confidence and the minimum support thresholds are
referred to as strong association rules.
9. A data mining language must be designed to facilitate flexible and effective knowledge
discovery.
10. Semi-tight coupling: besides linking a DM system to a DB/DW system, it provides efficient
implementations of a few DM primitives.
Easy Questions
1. Explain data mining primitives.
UNIT-IV
Descriptive mining: describes concepts or task-relevant data sets in
concise, summarized, informative, discriminative forms
Predictive mining: based on data and analysis, constructs models for the
database, and predicts the trend and properties of unknown data
Concept description
Characterization: provides a concise and succinct summarization of the
given collection of data
Comparison: provides descriptions comparing two or more collections
of data
Data generalization
A process which abstracts a large set of task-relevant data in a database
from low conceptual levels to higher ones.
Generalized relation
Relations where some or all attributes are generalized, with counts or
other aggregation values accumulated.
Cross tabulation
Mapping results into cross tabulation form (similar to contingency
tables).
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
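Data generalization and the resulting generalized relation with accumulated counts can be sketched as below. The concept hierarchies (city to country, age to age group) and the sample tuples are hypothetical; the point is only that low-level values are replaced by higher-level concepts and identical generalized tuples are merged with a count.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country
CITY_TO_COUNTRY = {"Hyderabad": "India", "Chennai": "India", "Toronto": "Canada"}

def age_group(age):
    """Hypothetical hierarchy for age; the cut-offs are assumptions."""
    return "young" if age < 30 else "middle-aged" if age < 60 else "senior"

def generalize(tuples):
    """Generalize (city, age) tuples and merge identical results, keeping a count."""
    counts = Counter((CITY_TO_COUNTRY[city], age_group(age)) for city, age in tuples)
    return [(country, group, n) for (country, group), n in sorted(counts.items())]

students = [("Hyderabad", 21), ("Chennai", 23), ("Toronto", 35), ("Hyderabad", 41)]
for row in generalize(students):
    print(row)
```

Each output row is one tuple of the generalized relation; the last field is the accumulated count, which could equally be another aggregate such as a sum.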
build decision tree based on training objects with known class labels to classify
testing objects
rank attributes with information gain measure
minimal height
the least number of tests to classify an object
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and
outliers are plotted individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
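The five-number summary and the 1.5 x IQR outlier rule can be computed with Python's standard library, as in the sketch below. Quartiles can be defined in several slightly different ways; this sketch assumes the "inclusive" method of statistics.quantiles, and the sample data are made up.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values):
    """Values lying more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]     # made-up sample with one extreme value
print(five_number_summary(data))
print(iqr_outliers(data))
```

For this sample Q1 = 4 and Q3 = 8, so IQR = 4 and the fences are -2 and 14; only the value 30 falls outside them and would be plotted individually in a boxplot.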
1.
5.
6. Data cleansing is the process of ensuring that all values in a dataset are consistent and
correctly recorded.
7. A data warehouse is a system for storing and delivering massive quantities of data.
8. An analytical model is a structure and process for analyzing a dataset.
9. Data navigation is the process of viewing different dimensions, slices, and levels of detail
of a multidimensional database.
10. Logistic regression is a linear regression that predicts the proportions of a categorical target
variable, such as type of customer, in a population.
Easy Questions
1.
2.
3.
4.
UNIT-V
Association rule mining
Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational
databases, and other information repositories.
Basic Concepts of Association Rule
Given a database of transactions, where each transaction is a list of items (purchased by a
customer in a visit)
Find all rules that correlate the presence of one set of items with that of another set
of items
Find frequent patterns
An example of frequent itemset mining is market basket analysis.
Apriori Algorithm
Single dimensional, single-level, Boolean frequent item sets
Finding frequent item sets using candidate generation
Generating association rules from frequent item sets
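The two Apriori steps listed above, finding frequent item sets by candidate generation and then generating association rules from them, can be sketched as follows. This is a compact teaching sketch, not an optimized implementation: the baskets, the thresholds and the function names are assumptions, and support is given as an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules meeting min_confidence."""
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = n / frequent[lhs]   # support(A and B) / support(A)
                if confidence >= min_confidence:
                    yield set(lhs), set(itemset - lhs), confidence

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "coke"}]
frequent = apriori(baskets, min_support=2)
for lhs, rhs, conf in rules(frequent, min_confidence=0.6):
    print(lhs, "=>", rhs, round(conf, 2))
```

The prune step relies on the fact that a superset of an infrequent itemset cannot be frequent, which is why {milk, bread, butter} is never counted here: its subset {milk, butter} occurs only once.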
Single-dimensional rules
buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules
Inter-dimension association rules - no repeated predicates
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules - repeated predicates
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical Attributes
finite number of possible values, no ordering among values
Quantitative Attributes
numeric, implicit ordering among values
Static Discretization of Quantitative Attributes
Discretized prior to mining using concept hierarchy.
Numeric values are replaced by ranges.
Subjective measures
A rule (pattern) is interesting if
*it is unexpected (surprising to the user); and/or
*actionable (the user can do something with it)
Kinds of constraints
Knowledge type constraint: classification, association, etc.
Data constraint: SQL-like queries
Dimension/level constraints
Rule constraint
Interestingness constraints
A constraint Ca is anti-monotone iff, for any pattern S not satisfying
Ca, none of the super-patterns of S can satisfy Ca
Easy Questions
1. Explain association rule mining.
2. Explain mining single-dimensional Boolean association rules from transactional
databases.
3. Explain mining multilevel association rules from transactional databases.
4. Explain mining multidimensional association rules from relational databases and data warehouses.
5. Explain the move from association mining to correlation analysis.
UNIT-VI
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
Prediction:
models continuous-valued functions
predicts unknown or missing values
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the
existence of classes or clusters in the data
Issues regarding classification and prediction: comparing classification
methods
Accuracy
Speed and scalability
Robustness
Scalability
Interpretability
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority voting is
employed for classifying the leaf
There are no samples left
Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued attributes
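Information gain, the attribute selection measure named above, can be computed as in the sketch below for categorical attributes. The tiny training set and its attribute names are made up for illustration; the formulas are the usual entropy and expected-entropy-after-split definitions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training examples with known class label "buys"
rows = [
    {"student": "yes", "income": "high", "buys": "yes"},
    {"student": "yes", "income": "low",  "buys": "yes"},
    {"student": "no",  "income": "high", "buys": "no"},
    {"student": "no",  "income": "low",  "buys": "no"},
]
for attribute in ("student", "income"):
    print(attribute, information_gain(rows, attribute, "buys"))
```

In this sample "student" separates the classes perfectly and "income" not at all, so a decision tree builder that ranks attributes by information gain would test "student" at the root.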
Easy Questions
1. What is classification? What is prediction?
UNIT-VII
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Interval-scaled variables
Binary variables
Categorical, Ordinal, and Ratio Scaled variables
Variables of mixed types
Major Clustering Approaches
Partitioning algorithms
Hierarchical algorithms
Density-based
Grid-based
Model Based
Outlier Analysis
CLARA (Clustering Large Applications) (Kaufman and Rousseeuw, 1990)
Built into statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each sample, and gives
the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
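The CLARA idea described above (draw several samples, run PAM on each, keep the best clustering) can be sketched as follows. The PAM routine here is a simplified swap-based version written for illustration, not the original algorithm as published; the function names, the sample sizes and the one-dimensional data are assumptions.

```python
import random

def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    """Simplified PAM: keep swapping a medoid with a non-medoid while the cost drops."""
    medoids = list(points[:k])          # assumes the points are distinct
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                if total_cost(points, trial, dist) < total_cost(points, medoids, dist):
                    medoids, improved = trial, True
    return medoids

def clara(points, k, dist, n_samples=5, sample_size=6, seed=0):
    """Run PAM on random samples; keep the medoids that are cheapest on ALL points."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, dist)
        cost = total_cost(points, medoids, dist)   # evaluated on the whole data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost

points = [1, 2, 3, 4, 5, 50, 51, 52, 53, 54]       # two obvious 1-D groups
medoids, cost = clara(points, k=2, dist=lambda a, b: abs(a - b))
print(sorted(medoids), cost)
```

PAM only ever sees the sample, which is what makes CLARA cheaper on large data sets, and also why a biased sample can give a poor clustering of the whole data set.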
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang,
Ramakrishnan, Livny (SIGMOD'96)
Nested-loop algorithm
Cell-based algorithm
Sequential exception technique
Simulates the way in which humans can distinguish unusual objects from among a
series of supposedly like objects
OLAP data cube technique
Uses data cubes to identify regions of anomalies in large multidimensional data
BITS
1. Clustering is the assignment of a set of observations into subsets.
2. Subspace clustering methods look for clusters that can only be seen in a particular
projection of the data.
3. Many clustering algorithms require the specification of the number of clusters to
produce in the input data set, prior to execution of the algorithm.
4. A distance measure determines how the similarity of two elements is calculated.
5. Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree
structure called a dendrogram.
6. QT clustering is an alternative method of partitioning data, invented for gene clustering.
7. In QT clustering, QT stands for Quality Threshold.
8. Formal concept analysis is a technique for generating clusters (called formal concepts)
of objects and attributes.
9. Evaluation of clustering is sometimes referred to as cluster validation.
10. Several different clustering systems based on mutual information have been
proposed.
Easy Questions
1. What is Cluster Analysis?
2. Explain Types of Data in Cluster Analysis?
3. Explain
4. Explain Partitioning Methods?
UNIT-VIII
Set-valued attribute
Generalization of each value in the set into its corresponding higher-level concepts
Derivation of the general behavior of the set, such as the number of elements in the
set, the types or value ranges in the set, or the weighted average for numerical data
hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports,
music, video_games}
List-valued or a sequence-valued attribute
Same as set-valued attributes except that the order of the elements in the sequence
should be observed in the generalization
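The hobby example for set-valued attributes can be sketched as below. The mapping follows the generalized set given in the notes, except that the notes do not say which concept chess belongs to; placing it under sports is an assumption made only so the example runs.

```python
from collections import Counter

# Concept hierarchy for the hobby example
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",              # assumption: not stated in the notes
    "violin": "music",
    "nintendo_games": "video_games",
}

def generalize_set(values, hierarchy):
    """Map each element to its higher-level concept and count elements per concept."""
    return Counter(hierarchy[v] for v in values)

hobby = {"tennis", "hockey", "chess", "violin", "nintendo_games"}
general = generalize_set(hobby, HOBBY_HIERARCHY)
print(sorted(general))      # the generalized set of concepts
print(len(hobby))           # number of elements, one summary of the set's general behavior
print(dict(general))        # how many original values fall under each concept
```

For a list-valued or sequence-valued attribute the same mapping would be applied element by element to a list instead of a set, so that the original order is kept.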
Spatial data:
Generalize detailed geographic points into clustered regions, such as business,
residential, industrial, or agricultural areas, according to land usage
Require the merge of a set of geographic areas by spatial operations
Image data:
Extracted by aggregation and/or approximation
Size, color, shape, texture, orientation, and relative positions and structures of the
contained objects or regions in the image
Music data:
Summarize its melody: based on the approximate patterns that repeatedly occur in
the segment
Summarize its style: based on its tone, tempo, or the major musical instruments
played
Object identifier: generalize to the lowest level of class in the class/subclass
hierarchies
Class composition hierarchies
generalize nested structured data
generalize only objects closely related in semantics to the current one
Plan: a variable sequence of actions
E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price,
seat>
Plan mining: extraction of important or significant generalized (sequential)
patterns from a planbase (a large collection of plans)
E.g., Discover travel patterns in an air flight database, or
find significant patterns from the sequences of actions in the repair of automobiles
Easy questions