A.V.C. COLLEGE OF ENGINEERING
MANNAMPANDAL, MAYILADUTHURAI - 609 305
COURSE MATERIAL
SEM : VII
DESIGNATION : Asst.Professor
Lesson Plan
SYLLABUS
ELECTIVE II
CS1011 – DATA WAREHOUSING AND DATA MINING    L T P
3 0 0
UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES, CONCEPT DESCRIPTION 8
Why preprocessing − Cleaning − Integration − Transformation − Reduction −
Discretization – Concept hierarchy generation − Data mining primitives − Query
language − Graphical user interfaces − Architectures − Concept description − Data
generalization − Characterizations − Class comparisons − Descriptive statistical
measures.
TEXT BOOKS
1. Han, J. and Kamber, M., “Data Mining: Concepts and Techniques”, Harcourt India /
Morgan Kaufmann, 2001.
2. Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson
Education, 2004.
REFERENCES
1. Sam Anahory and Dennis Murray, “Data Warehousing in the Real World”, Pearson
Education, 2003.
2. David Hand, Heikki Mannila and Padhraic Smyth, “Principles of Data Mining”, PHI,
2004.
3. W.H. Inmon, “Building the Data Warehouse”, 3rd Edition, Wiley, 2003.
4. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining and OLAP”,
McGraw-Hill Edition, 2001.
5. Paulraj Ponniah, “Data Warehousing Fundamentals”, Wiley-Interscience Publication,
2003.
UNIT I BASICS OF DATA WAREHOUSING
Introduction − Data warehouse − Multidimensional data model − Data warehouse
architecture − Implementation − Further development − Data warehousing to data mining.
Bill Inmon coined the term Data Warehouse in 1990, which he defined in the
following way: "A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision making
process".
• Integrated: Data that is gathered into the data warehouse from a variety
of sources and merged into a coherent whole. Data cleaning and data
integration techniques are applied. It is used to ensure consistency in
naming conventions, encoding structures, attribute measures, etc.
among different data sources. E.g., Hotel price: currency, tax, breakfast
covered, etc.
• Non-volatile: A data warehouse is a physically separate store of data. It usually
requires only two operations in data accessing:
o initial loading of data and
o access of data.
Data Mart: Departmental subsets that focus on selected subjects. A data mart is
a segment of a data warehouse that can provide data for reporting and analysis
on a section, unit, department or operation in the company. E.g. sales, payroll,
production. Data marts are sometimes complete individual data warehouses
which are usually smaller than the corporate data warehouse.
• Data warehouses enable queries that cut across different segments of a
company's operation. E.g. production data could be compared against
inventory data even if they were originally stored in different databases
with different structures.
• Queries that would be complex in normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on
transaction systems.
• Data warehousing is an efficient way to manage and report on data that comes
from a variety of sources and is non-uniform and scattered throughout a company.
• Data warehousing is an efficient way to manage demand for lots of
information from lots of users.
• Data warehousing provides the capability to analyze large amounts of
historical data for nuggets of wisdom that can provide an organization with
competitive advantage.
Other software that needs to be considered is the interface software that provides
transformation and metadata capability, such as PRISM Solutions Warehouse
Manager. A final piece of software that is important is the software needed for
changed data capture.
A rough sizing of data needs to be done to determine the fitness of the hardware
and software platforms. If the hardware and DBMS software are much too large
for the data warehouse, the costs of building and running the data warehouse will
be exorbitant. Even though performance will be no problem, development and
operational costs and finances will be a problem.
Conversely, if the hardware and DBMS software are much too small for the size of
the data warehouse, then performance of operations and the ultimate end user
satisfaction with the data warehouse will suffer. So, it is important that there be a
comfortable fit between the data warehouse and the hardware and DBMS software
that will house and manipulate the warehouse.
There are two kinds of factors that drive the building and use of a data warehouse. They are:
Business factors:
• Business users want to make decisions quickly and correctly using all
available data.
Technological factors:
• To address the incompatibility of operational data stores
• IT infrastructure is changing rapidly: its capacity is increasing and its cost is
decreasing, making it easier to build a data warehouse
The central repository for corporate wide data helps us maintain one version of
truth of the data. The data in the EDW is stored at the most detail level. The reason
to build the EDW on the most detail level is to leverage the flexibility to be used
by multiple departments and to cater for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased
cost.
Once the EDW is implemented we start building subject-area-specific data marts
which contain data in a denormalized form, also called a star schema. The data in the
marts are usually summarized based on the end users' analytical requirements.
The reason to denormalize the data in the mart is to provide faster access to the
data for the end users' analytics. If we were to query a normalized schema
for the same analytics, we would end up with complex multi-level joins that
would be much slower compared to queries on the denormalized schema.
The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important
for the data to be reliable, consistent across subject areas and for reconciliation in
case of data related contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time
and a larger initial investment. The business has to wait for the EDW to be implemented
and the data marts to be built before they can access their reports.
b. Bottom Up Approach
A Conformed fact has the same definition of measures, same dimensions joined to
it and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing
and integrating data marts as and when the requirements are clear. We do not have
to wait until the overall requirements of the warehouse are known. We should
implement the bottom up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only
one data mart.
The data warehouse design approach must be a business-driven, continuous and iterative
engineering approach. In addition to the general considerations, the following
specific points are relevant to data warehouse design:
1. Data Content
The content and structure of the data warehouse are reflected in its data model.
The data model is the template that describes how information will be organized
within the integrated warehouse framework. The data in a data warehouse must be
detailed data. It must be formatted, cleaned up and transformed to fit the
warehouse data model.
2. Meta Data
It defines the location and contents of data in the warehouse. Meta data is
searchable by users to find definitions or subject areas. In other words, it must
provide decision support oriented pointers to warehouse data and thus provides a
logical link between warehouse data and decision support applications.
3. Data Distribution
One of the biggest challenges when designing a data warehouse is the data
placement and distribution strategy. Data volumes continue to grow rapidly.
Therefore, it becomes necessary to know how the data should be divided across
multiple servers and which users should get access to which types of data. The
data can be distributed based on the subject area, location (geographical region), or
time (current, month, year).
4. Tools
A number of tools are available that are specifically designed to help in the
implementation of the data warehouse. All selected tools must be compatible with
the given data warehouse environment and with each other. All tools must be able
to use a common Meta data repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
1.2.4 Implementation Considerations
Access Tools
Data warehouse implementation relies on selecting suitable data access tools. The
best way to choose this is based on the type of data and the kind of access it
permits for a particular user. The following lists the various types of data that can
be accessed:
• Simple tabular form data
• Ranking data
• Multivariable data
• Time series data
• Graphing, charting and pivoting data
• Complex textual search data
• Statistical analysis data
• Data for testing of hypothesis, trends and patterns
• Predefined repeatable queries
• Ad hoc user specified queries
• Reporting and analysis data
• Complex queries with multiple joins, multi level sub queries and
sophisticated search criteria
Data Extraction, Clean Up, Transformation and Migration
• The tool must support flat files and indexed files, since much corporate data is
still stored in these formats
• The tool must have the capability to merge data from multiple data stores
• The tool should have specification interface to indicate the data to be
extracted
• The tool should have the ability to read data from data dictionary
• The code generated by the tool should be completely maintainable
• The tool should permit the user to extract the required data
• The tool must have the facility to perform data type and character set
translation
• The tool must have the capability to create summarization, aggregation and
derivation of records
• The data warehouse database system must be able to perform loading data
directly from these tools
Metadata
Meta data can define all data elements and their attributes, data sources and timing
and the rules that govern data use and data transformations.
Power Users: These users can use predefined as well as user-defined queries to create simple
and ad hoc reports. They can engage in drill-down operations and
may have experience in using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform
standard analysis on the info they retrieve. These users have the knowledge about
the use of query and report tools.
There are two advantages of having parallel relational database technology for
the data warehouse:
• Linear Speed up: refers to the ability to add processors in order to reduce the
response time for a given workload
• Linear Scale up: refers to the ability to provide the same performance on the same
requests as the database size increases, by increasing resources proportionally
Fig. 1.3.2.1 Shared Memory Architecture
Each node has its own data cache as the memory is not shared among the
nodes. Cache consistency must be maintained across the nodes and a lock manager
is needed to maintain the consistency. Additionally, instance locks using the DLM
on the Oracle level must be maintained to ensure that all nodes in the cluster see
identical data.
There is additional overhead in maintaining the locks and ensuring that the data
caches are consistent. The performance impact is dependent on the hardware and
software components, such as the bandwidth of the high-speed bus through which
the nodes communicate, and DLM performance.
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scaleup. Oracle Parallel
Server can access the disks on a shared nothing system as long as the operating
system provides transparent disk access, but this access is expensive in terms of
latency.
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
• Facts are core data element being analyzed
• Dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse
should be based upon the analysis of project requirements, accessible tools and
project team preferences.
Fig. 1.4.1.1 Star Schema
Star schema has points radiating from a center. The center of the star consists of
fact table and the points of the star are the dimension tables. Usually the fact tables
in a star schema are in third normal form (3NF) whereas dimensional tables are de-
normalized. Star schema is the simplest architecture and is most commonly used
and recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data
(facts) and a multipart index composed of foreign keys from the primary keys of
related dimension tables.
A fact table typically has two types of columns: foreign keys to dimension tables
and measures (columns that contain numeric facts). A fact table can contain facts
at the detail or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month,
quarter, year), Region dimension (profit by country, state, city), Product dimension
(profit for product1, product2).
Measures
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit
measure which represents profit on each sale.
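To make the fact table/dimension table relationship concrete, the following small Python sketch joins a sales fact table to a time dimension and rolls the profit measure up by month. The table contents, keys and values are hypothetical, not taken from this material:

# Hypothetical star-schema tables kept as plain Python dictionaries/lists.
# Dimension tables: surrogate key -> descriptive attributes.
time_dim = {1: {"month": "Jan", "year": 2004}, 2: {"month": "Feb", "year": 2004}}
product_dim = {10: {"name": "Laptop"}, 11: {"name": "Printer"}}

# Fact table: foreign keys into the dimensions plus numeric measures.
sales_fact = [
    {"time_key": 1, "product_key": 10, "units": 3, "profit": 900.0},
    {"time_key": 1, "product_key": 11, "units": 5, "profit": 250.0},
    {"time_key": 2, "product_key": 10, "units": 2, "profit": 600.0},
]

# "Join" each fact row to its time dimension and roll profit up by month.
profit_by_month = {}
for row in sales_fact:
    month = time_dim[row["time_key"]]["month"]
    profit_by_month[month] = profit_by_month.get(month, 0.0) + row["profit"]

print(profit_by_month)  # {'Jan': 1150.0, 'Feb': 600.0}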
Cubes are data processing units composed of fact tables and dimensions from the
data warehouse. They provide multidimensional views of data, querying and
analytical capabilities to clients.
1.4.3 Fact constellation schema: For each star schema it is possible to construct
fact constellation schema. The fact constellation architecture contains multiple fact
tables that share many dimension tables.
Dimensions are hierarchical in nature i.e. time dimension may contain hierarchies
for years, quarters, months, week and day. GEOGRAPHY may contain country,
state, city etc.
Fig. 1.5.1 Multidimensional cube
Each side of the cube represents one of the elements of the question. The x-axis
represents the time, the y-axis represents the products and the z-axis represents
different centers. The cells in the cube represent the number of products sold or
can represent the price of the items.
When the size of the dimension increases, the size of the cube will also increase
exponentially. The time response of the cube depends on the size of the cube.
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and
year -> total sales by region and by year
• Selection (slice) defines a sub cube
– e.g., sales where city = Palo Alto and date = 1/15/96
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization Operations (e.g., Pivot or dice)
Operations:
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data,
or introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
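As an illustration of roll-up and slice, the following Python sketch operates on a tiny cube stored as a dictionary keyed by (city, product, year). The city-to-region mapping and all numbers are made-up assumptions used only for demonstration:

# A tiny illustrative data cube stored as {(city, product, year): sales}.
cube = {
    ("Palo Alto", "TV", 1996): 120, ("Palo Alto", "PC", 1996): 200,
    ("San Jose",  "TV", 1996): 150, ("San Jose",  "PC", 1997): 180,
}
city_to_region = {"Palo Alto": "West", "San Jose": "West"}

# Roll-up: aggregate sales from the city level to the region level.
by_region_year = {}
for (city, product, year), sales in cube.items():
    key = (city_to_region[city], year)
    by_region_year[key] = by_region_year.get(key, 0) + sales

# Slice: select the sub-cube where year = 1996.
slice_1996 = {k: v for k, v in cube.items() if k[2] == 1996}

print(by_region_year)   # {('West', 1996): 470, ('West', 1997): 180}
print(slice_1996)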
12) Unlimited dimensions and aggregation levels: This depends on the kind of
business, where multiple dimensions and hierarchies can be defined.
1.8 Data Warehouse to Data Mining
1.8.1 Data Warehouse Usage
1.8.2 From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES,
CONCEPT DESCRIPTION
Why preprocessing − Cleaning − Integration − Transformation − Reduction −
Discretization – Concept hierarchy generation − Data mining primitives − Query
language − Graphical user interfaces − Architectures − Concept description − Data
generalization − Characterizations − Class comparisons − Descriptive statistical
measures.
We need data preprocessing because data in the real world are dirty: they can be incomplete,
noisy and inconsistent. These data need to be preprocessed in order to
improve the quality of the data and the quality of the mining results.
• If there is no quality data, then there are no quality mining results. Quality decisions
must be based on quality data.
• If there is much irrelevant and redundant information present or noisy and
unreliable data, then knowledge discovery during the training phase is more
difficult.
• Incomplete data may come from
o “Not applicable” data value when collected
o Different considerations between the time when the data was collected
and when it is analyzed.
o Due to Human/hardware/software problems
o e.g., occupation=“ ”.
• Noisy data (incorrect values) may come from
o Faulty data collection by instruments
o Human or computer error at data entry
o Errors in data transmission, which can introduce erroneous or outlier values, e.g.,
Salary=“-10”
• Inconsistent data may come from
o Different data sources
o Functional dependency violation (e.g., modify some linked data)
o Having discrepancies in codes or names. e.g., Age=“42”
Birthday=“03/07/1997”
• Data integration
o Integration of multiple databases, data cubes, or files
• Data transformation
o Normalization and aggregation
• Data reduction
o Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
o Part of data reduction but with particular importance, especially for
numerical data
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
i. Missing Values:
The various methods for handling the problem of missing values in data tuples
include:
(a) Ignoring the tuple: When the class label is missing the tuple can be
ignored. This method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is
time-consuming and may not be a reasonable task for large data sets with
many missing values, especially when the value to be filled in is not easily
determined.
(c) Using a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
“Unknown,” or −∞. If missing values are replaced by, say, “Unknown,”
then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown” .
(d) Using the attribute mean for quantitative (numeric) values or
attribute mode for categorical (nominal) values, for all samples
belonging to the same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that
of the given tuple.
(e) Using the most probable value to fill in the missing value: This may
be determined with regression, inference-based tools using Bayesian
formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
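The following Python sketch illustrates method (d) above on made-up customer records, filling a missing income with the mean income of customers in the same credit-risk class:

# Illustrative records with a missing income value (None); data are hypothetical.
customers = [
    {"risk": "low",  "income": 52000},
    {"risk": "low",  "income": 48000},
    {"risk": "low",  "income": None},     # to be filled in
    {"risk": "high", "income": 21000},
]

# Method (d): replace the missing value with the mean income of customers
# that belong to the same class (here, the same credit-risk category).
def class_mean(records, cls):
    vals = [r["income"] for r in records if r["risk"] == cls and r["income"] is not None]
    return sum(vals) / len(vals)

for r in customers:
    if r["income"] is None:
        r["income"] = class_mean(customers, r["risk"])

print(customers[2])  # {'risk': 'low', 'income': 50000.0}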
In this technique,
1. The data are first sorted.
2. Then the sorted list is partitioned into equi-depth bins.
3. Then one can smooth by bin means, smooth by bin medians, smooth by bin
boundaries, etc.
a. Smoothing by bin means: Each value in the bin is replaced by the
mean value of the bin.
b. Smoothing by bin medians: Each value in the bin is replaced by the
bin median.
c. Smoothing by boundaries: The min and max values of a bin are
identified as the bin boundaries. Each bin value is replaced by the
closest boundary value.
• Example: Binning Methods for Data Smoothing
o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
o Partition into (equi-depth) bins (equi-depth of 4 since each bin
contains four values):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
o Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
o Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore,
each original value in this bin is replaced by the value 9.
Smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median. In smoothing by bin boundaries, the minimum and maximum
values in a given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
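A small Python sketch of the two smoothing techniques applied to the price data above; rounding the bin means to whole dollars is an implementation choice made here for readability:

# Equi-depth binning and smoothing for the sorted price data used above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]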
Suppose that the data for analysis include the attribute age. The age values for the
data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
The following steps are required to smooth the above data using smoothing by bin
means with a bin depth of 3.
• Step 1: Sort the data. (This step is not required here as the data are already
sorted.)
• Step 2: Partition the data into equidepth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin and replace each of the
values in the bin by the bin mean (rounded to the nearest integer):
Bin 1: 15, 15, 15 Bin 2: 18, 18, 18 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 27, 27, 27 Bin 6: 34, 34, 34
Bin 7: 35, 35, 35 Bin 8: 40, 40, 40 Bin 9: 56, 56, 56
2. Regression: data can be smoothed by fitting the data to regression functions.
• Linear regression involves finding the “best” line to fit two variables, so that
one variable can be used to predict the other.
Data integration combines data from multiple sources into a coherent store. There are a
number of issues to consider during data integration.
Issues:
• Schema integration: refers to the integration of metadata from different sources.
• Entity identification problem: Identifying entity in one data source similar
to entity in another table. For example, customer_id in one database and
customer_no in another database refer to the same entity
• Detecting and resolving data value conflicts: Attribute values from
different sources can be different due to different representations, different
scales. E.g. metric vs. British units
• Redundancy: Redundancy can occur due to the following reasons:
• Object identification: The same attribute may have different names
in different databases
• Derived Data: one attribute may be derived from another attribute.
• Correlation analysis can be used to detect such redundancy, as the sketch
below illustrates.
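A minimal Python sketch of correlation analysis for redundancy detection, using Pearson's correlation coefficient on two made-up salary attributes (a high correlation suggests one attribute may be redundant):

import math

# Sample attribute values are invented; monthly_salary is derived from annual_salary.
annual_salary = [30000, 45000, 60000, 75000, 90000]
monthly_salary = [2500, 3750, 5000, 6250, 7500]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(annual_salary, monthly_salary)
print(round(r, 3))  # 1.0 -> perfectly correlated, so one attribute is redundant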
2.6.Data Discretization
Raw data values for attributes are replaced by ranges or higher conceptual levels in
data discretization.
The various methods used in Data Discretization are Binning, Histogram Analysis,
Entropy-Based Discretization, Interval Merging by χ2 (chi-square) Analysis and Clustering.
Concept hierarchies
reduce the data by collecting and replacing low level concepts (such
as numeric values for the attribute age) by higher level concepts
(such as young, middle-aged, or senior).
Prepare for further analysis
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Segmentation by natural partitioning
Entropy-Based Discretization
Concept Hierarchy Generation for Categorical Data
Characterization
Discrimination
Association
Classification/prediction
Clustering
Outlier analysis
Other data mining tasks
Schema hierarchy
E.g., street < city < province_or_state < country
Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
email address: login-name < department < university < country
Rule-based hierarchy
low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 -
P2) < $50
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = n(A and B)/ n (B), classification reliability or
accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant rules, e.g., the
Canada vs. Vancouver rule implication support ratio)
Motivation
A DMQL can provide the ability to support ad-hoc and interactive
data mining
By providing a standardized language like SQL, we hope to
achieve an effect similar to that which SQL has had on
relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer,
commercialization and wide acceptance
Design
DMQL is designed with the primitives described earlier
Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
Discrimination
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
Classification
Mine_Knowledge_Specification ::=
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Prediction
Mine_Knowledge_Specification ::=
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Syntax for interestingness measure specification
What tasks should be considered in the design GUIs based on a data mining
query language?
Data collection and data mining query composition
Presentation of discovered patterns
Hierarchy specification and manipulation
Manipulation of data mining primitives
Interactive multilevel mining
Other miscellaneous information
2.11Concept Description
Concept description:
can handle complex data types of the attributes and their
aggregations
a more automated process
OLAP:
restricted to a small number of dimension and measure types
user-controlled process
Data generalization
A process which abstracts a large set of task-relevant data in a
database from a low conceptual levels to higher ones.
Approaches:
Data cube approach(OLAP approach)
Attribute-oriented induction approach
Limitations of the data cube approach: it can
handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values; it also
lacks intelligent analysis and cannot tell which dimensions should be
used and what levels the generalization should reach.
Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the
initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct values for A, and
there exists a set of generalization operators on A, then select an operator
and generalize A.
Attribute-threshold control: typically 2-8, user-specified or default.
Generalized relation threshold control: control the final relation/rule size.
Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3)
mapping into rules, cross tabs, visualization presentations.
Example
DMQL: Describe general characteristics of graduate students in the Big-
University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#,
gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
Task
Compare graduate and undergraduate students using discriminant
rule.
DMQL query
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence,
phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Given
attributes name, gender, major, birth_place, birth_date, residence,
phone# and gpa
Gen(ai) = concept hierarchies on attributes ai
Ui = attribute analytical thresholds for attributes ai
Ti = attribute generalization thresholds for attributes ai
R = attribute relevance threshold
1. Data collection
target and contrasting classes
2. Attribute relevance analysis
remove attributes name, gender, major, phone#
3. Synchronous generalization
controlled by user-specified dimension thresholds
prime target and contrasting class(es) relations/cuboids
Class Description
Quantitative characteristic rule
necessary
Quantitative discriminant rule
sufficient
Quantitative description rule
necessary and sufficient
2.14 Mining descriptive statistical measures in large databases
Mean
Weighted arithmetic mean
Median: A holistic measure
Middle value if odd number of values, or average of the middle two
values otherwise
estimated by interpolation
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: mean - mode = 3 × (mean - median)
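The following Python sketch computes these measures for a small, made-up salary sample and checks the empirical relation numerically:

import statistics

# Descriptive measures for a small, invented set of salary values (in thousands).
salaries = [30, 31, 31, 32, 33, 35, 40, 45, 52]

mean = statistics.mean(salaries)       # arithmetic mean
median = statistics.median(salaries)   # middle value of the sorted data
mode = statistics.mode(salaries)       # most frequently occurring value

# Weighted arithmetic mean: each value weighted, e.g., by an observed count.
weights = [1, 2, 1, 1, 1, 1, 1, 1, 1]
weighted_mean = sum(v * w for v, w in zip(salaries, weights)) / sum(weights)

print(mean, median, mode, weighted_mean)
# Empirical relation for moderately skewed data: mean - mode ≈ 3 * (mean - median)
print(mean - mode, 3 * (mean - median))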
Each basket can then be represented by a Boolean vector of values assigned to
these variables. The Boolean vectors can be analyzed for buying patterns which
reflect items that are frequently associated or purchased together. These patterns can
be represented in the form of association rules.
For example, the information that customers who purchase computers also tend to
buy financial management software at the same time is represented in association
Rule.
computer =>financial management software [support = 2%; confidence =
60%]
Example of association rule mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.
The discovery of such associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by
customers. For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip to the supermarket? Such
information can lead to increased sales by helping retailers to do selective
marketing and plan their shelf space.
For instance, placing milk and bread within close proximity may further encourage
the sale of these items together within single visits to the store.
• support, s: the probability that a transaction contains both X and Y.
• confidence, c: the conditional probability that a transaction containing X also
contains Y.
Rule support and confidence are two measures of rule interestingness; they
respectively reflect the usefulness and certainty of discovered rules.
A support of 2% for association Rule means that 2% of all the transactions
under analysis show that computer and financial management software are
purchased together
A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. Such thresholds can be set by users or domain experts.
Rules that satisfy both a minimum support threshold (min sup) and a
minimum confidence threshold (min conf) are called strong. By convention, we
write min sup and min conf values so as to occur between 0% and 100%, rather
than between 0.0 and 1.0.
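A minimal Python sketch of how support and confidence are computed for the rule "computer => financial_software"; the five transactions below are made up for illustration:

# Computing support and confidence over a small, invented transaction list.
transactions = [
    {"computer", "financial_software"},
    {"computer", "printer"},
    {"computer", "financial_software", "printer"},
    {"printer"},
    {"financial_software"},
]

both = sum(1 for t in transactions if {"computer", "financial_software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # P(computer and financial_software)
confidence = both / antecedent       # P(financial_software | computer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 40%, confidence = 67%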
age(X, “30…39”) ∧ income(X, “42K…48K”) => buys(X, “high resolution TV”)
In the above said examples the items bought are referenced at different levels of
abstraction. We refer to the rule set mined as consisting of multilevel association
rules. If, instead, the rules within a given set do not reference items or attributes at
different levels of abstraction, then the set contains single-level association rules.
Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is
found. This set is denoted L1. L1 is used to find L2, the frequent 2-itemsets, which
is used to find L3, and so on, until no more frequent k-itemsets can be found. The
finding of each Lk requires one full scan of the database. To improve the
efficiency of the level-wise generation of frequent itemsets, an important property
called the Apriori property is used to reduce the search space.
The Apriori property. All non-empty subsets of a frequent itemset must also be
frequent.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining
Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in
Lk-1. The notation li[j] refers to the jth item in li.
By convention, Apriori assumes that items within a transaction or itemset are
sorted in increasing lexicographic order. It also ensures that no duplicates are
generated.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not
be frequent, but all of the frequent k-itemsets are included in Ck. A scan of
the database to determine the count of each candidate in C k would result in
the determination of Lk. Ck can be huge, and so this could involve heavy
computation.
The level-wise generation can be written in pseudocode as:
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in the database do
increment the count of all candidates in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
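The following Python sketch implements the same level-wise loop, with the join and prune steps, on nine AllElectronics-style transactions like those in the example that follows; the transaction contents and the minimum support count of 2 are assumptions made for illustration:

from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
min_support = 2

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
L = [{frozenset([i]) for i in items if count(frozenset([i])) >= min_support}]

k = 1
while L[-1]:
    # Join step: candidate (k+1)-itemsets formed from frequent k-itemsets.
    candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
    # Prune step: every k-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in L[-1] for s in combinations(c, k))}
    # Scan the database to keep candidates with enough support.
    L.append({c for c in candidates if count(c) >= min_support})
    k += 1

frequent = [set(s) for level in L for s in level]
print(frequent)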
To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-
itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if
any (k-1)-subset of a candidate k-itemset is not in L k-1, then the candidate cannot
be frequent either and so can be removed from Ck. This subset testing can be done
quickly by maintaining a hash tree of all frequent itemsets.
Fig. 3.2.1.1 Transactional data for an All Electronics branch
Let us look at a concrete example of Apriori, based on the AllElectronics
transaction database, D, shown above. There are nine transactions in this database.
1. In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in
order to count the number of occurrences of each item.
2. Suppose that the minimum transaction support count required is 2 (i.e., min sup
= 2). The set of frequent 1-itemsets, L 1, can then be determined. It consists of the
candidate 1-itemsets having minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm joins L1 with
itself to generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence). This
can be done for confidence, where the conditional probability is expressed in terms
of itemset support count:
1. Hash-based technique:
A 2-itemset whose corresponding bucket count in the hash table is below the
support threshold cannot be frequent and thus should be removed from the
candidate set. Such a hash-based technique may substantially reduce the number of
the candidate k-itemsets examined (especially when k = 2).
3. Partitioning:
It is used for partitioning the data to find candidate itemsets. A partitioning
technique can be used which requires just two database scans to mine the frequent
itemsets. It consists of two phases.
itemsets. Partition size and the number of partitions are set so that each
partition can fit into main memory and therefore be read only once in each
phase.
4. Sampling:
It is used for Mining on a subset of the given data. The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
5.Calendric market basket analysis: Finding itemsets that are frequent in a set of
user-defined time intervals. Calendric market basket analysis uses transaction time
stamps to define subsets of the given database .
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
Compactness
reduce irrelevant information—infrequent items are gone
frequency descending ordering: more frequent items are more likely
to be shared
never be larger than the original database (not counting node-links
and counts)
Example: For Connect-4 DB, compression ratio could be over 100
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so
far
If the conditional FP-tree contains a single path, simply enumerate all
the patterns
Example: Suppose we are given the task-relevant set of transactional data for
sales at the computer department of an AllElectronics branch, showing the items
purchased for each transaction TID. A concept hierarchy defines a sequence of
mappings from a set of low level concepts to higher level, more general concepts.
Data can be generalized by replacing low level concepts within the data by their
corresponding higher level concepts, or ancestors, from the concept hierarchy.
Fig. 2.3.2 Multilevel Mining with Reduced Support
Rules generated from association rule mining with concept hierarchies are called
multiple-level or multilevel association rules, since they consider more than one
concept level.
A drawback of the uniform support approach is that items at lower levels of abstraction
are unlikely to occur as frequently as those at higher levels of abstraction. If the minimum
support threshold is set too high, it could miss several meaningful associations
occurring at low abstraction levels.
If the threshold is set too low, it may generate many uninteresting associations
occurring at high abstraction levels. This provides the motivation for the following
approach.
For mining multiple-level associations with reduced support, there are a number of
alternative search strategies.
These include:
1. Level-By-Level Independent: This is a full breadth search, where no background
knowledge of frequent itemsets is used for pruning. Each node is examined,
regardless of whether or not its parent node is found to be frequent.
2. Level-Cross Filtering By Single Item: An item at the i-th level is examined if and
only if its parent node at the (i-1)-th level is frequent.
If a node is frequent, its children will be examined; otherwise, its descendents are
pruned from the search. For example, the descendent nodes of “computer” (i.e.,
“laptop computer” and “home computer”) are not examined, since “computer” is
not frequent.
3. Level-Cross Filtering By K-Item Set: A k-itemset at the ith level is examined if
and only if its corresponding parent k-itemset at the (i-1)th level is frequent.
UNIT IV CLASSIFICATION AND CLUSTERING
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in
classifying new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or
missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
4.2 Issues regarding classification and prediction:
Evaluating Classification Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
4.3 Classification by Decision Tree Induction
• Decision tree
o A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute.
o Each branch represents an outcome of the test, and each leaf node holds
a class label.
o The topmost node in a tree is the root node.
o Internal nodes are denoted by rectangles, and leaf nodes are denoted by
ovals.
o Some decision tree algorithms produce only binary trees whereas others
can produce non binary trees.
• Decision tree generation consists of two phases
o Tree construction
Attribute selection measures are used to select the attribute that
best partitions the tuples into distinct classes.
o Tree pruning
Tree pruning attempts to identify and remove such branches,
with the goal of improving classification accuracy on unseen
data.
• Use of decision tree: Classifying an unknown sample
o Test the attribute values of the sample against the decision
tree
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …,
Sv}
• If Si contains pi examples of P and ni examples of N, the entropy, or the
expected information needed to classify objects in all subtrees Si, is
E(A) = Σ i=1..v [(pi + ni) / (p + n)] × I(pi, ni),
where I(pi, ni) = −(pi/(pi+ni)) log2(pi/(pi+ni)) − (ni/(pi+ni)) log2(ni/(pi+ni)).
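A short Python sketch of this computation; the subset counts at the end are hypothetical (9 positive and 5 negative tuples split by an attribute into three subsets):

import math

def info(p, n):
    """I(p, n): information needed to classify a set with p positive, n negative."""
    total = p + n
    result = 0.0
    for x in (p, n):
        if x:
            result -= (x / total) * math.log2(x / total)
    return result

def expected_info(subsets):
    """E(A) for subsets [(p1, n1), (p2, n2), ...] produced by attribute A."""
    p = sum(s[0] for s in subsets)
    n = sum(s[1] for s in subsets)
    return sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in subsets)

# Example: an attribute splits 9 positive / 5 negative tuples into 3 subsets.
subsets = [(2, 3), (4, 0), (3, 2)]
gain = info(9, 5) - expected_info(subsets)
print(round(gain, 3))  # information gain of this attribute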
Bayesian Theorem
• Given training data D, the posterior probability of a hypothesis h,
P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
• MAP (maximum posteriori) hypothesis
• Practical difficulty: It requires initial knowledge of many
probabilities, significant computational cost.
• The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or
precondition. The “THEN”-part (or right-hand side) is the rule consequent.
R1 can also be written as
R1: (age = youth) ∧ (student = yes) => (buys computer = yes).
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Fig. If Then Example
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees
Prediction
• (Numerical) prediction is similar to classification
o construct a model
o use model to predict continuous or ordered value for a given input
• Prediction is different from classification
o Classification refers to predict categorical class label
o Prediction models continuous-valued functions
• Major method for prediction: regression
o model the relationship between one or more independent or predictor
variables and a dependent or response variable
• Regression analysis
o Linear and multiple regression
o Non-linear regression
o Other regression methods: generalized linear model, Poisson regression,
log-linear models, regression trees
Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”)
until a new instance must be classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean space.
Locally weighted regression
Constructs local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based
inference
Lazy methods may consider the query instance when deciding how to generalize
beyond the training data; eager methods cannot, since they have already chosen
the global approximation before seeing the query
Efficiency: Lazy - less time training but more time predicting
Accuracy
Lazy method effectively uses a richer hypothesis space since it uses
many local linear functions to form its implicit global approximation
to the target function
Eager: must commit to a single hypothesis that covers the entire
instance space
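A minimal k-nearest-neighbor sketch in Python illustrating lazy evaluation; the training points, labels and query are made up:

import math
from collections import Counter

# Training examples are stored; all work happens at classification time.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]

def knn_classify(query, k=3):
    # Rank stored examples by Euclidean distance to the query, take the k nearest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.0)))  # 'A'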
4.7 Prediction
Linear regression: Y = α + β X
Two parameters, α and β, specify the line and are to be estimated by
using the data at hand, applying
the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated by a
product of lower-order tables.
Probability: p(a, b, c, d) = α_ab × β_ac × χ_ad × δ_bcd
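A short Python sketch of least-squares estimation of α and β in Y = α + β X, on made-up data:

# Closed-form least-squares estimates for simple linear regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
       sum((x - mean_x) ** 2 for x in xs)
alpha = mean_y - beta * mean_x

print(round(alpha, 3), round(beta, 3))   # estimated intercept and slope
print(alpha + beta * 6.0)                # predicted Y for X = 6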
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. A cluster is a collection of data objects that are similar
to one another within the same cluster and are dissimilar to the objects in other
clusters. A cluster of data objects can be treated collectively as one group and so
may be considered as a form of data compression.
Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection.
The quality of a clustering result depends on both the similarity measure used by
the method and its implementation. The quality of a clustering method is also
measured by its ability to discover some or all of the hidden patterns.
• Dissimilarity matrix: It is otherwise known as object-by-object structure.
This stores a collection of proximities that are available for all pairs of n
objects. It is often represented by an n-by-n table:
• Distance measure for symmetric binary variables:
d(i, j) = (b + c) / (a + b + c + d)
• Distance measure for asymmetric binary variables:
d(i, j) = (b + c) / (a + b + c)
• Jaccard coefficient (similarity measure for asymmetric binary variables):
sim(i, j) = a / (a + b + c)
Example: dissimilarity between binary variables for a patient record table:
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
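The following Python sketch computes the asymmetric binary dissimilarities for this table, coding Y and P as 1 and N as 0, and ignoring gender as a symmetric attribute:

# d(i, j) = (b + c) / (a + b + c) for the patient table above.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],   # Fever, Cough, Test-1 .. Test-4
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

def d(i, j):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

print(round(d(patients["Jack"], patients["Mary"]), 2))  # 0.33
print(round(d(patients["Jack"], patients["Jim"]), 2))   # 0.67
print(round(d(patients["Jim"], patients["Mary"]), 2))   # 0.75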
Where A and B are positive constants, and t typically represents time. Common
examples include the growth of a bacteria population or the decay of a radioactive
element.
One approach is to group each kind of variable together, performing a separate
cluster analysis for each variable type. This is feasible if these analyses derive
compatible results.
All variable types could be processed together for performing a single cluster
analysis. The different variables can be combined into a single dissimilarity
matrix, with a common scale of the interval [0.0, 1.0].
Partitioning Methods:
Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k ≤ n.
It classifies the data into k groups, which together satisfy the following
requirements:
(1) Each group must contain at least one object and
(2) Each object must belong to exactly one group.
Given k, the number of partitions to construct, a partitioning method creates an
initial partitioning. It then uses an iterative relocation technique that attempts to
improve the partitioning by moving objects from one group to another. The
general criterion of a good partitioning is that objects in the same cluster are
“close” or related to each other, whereas objects of different clusters are “far apart”
or very different.
There are various kinds of other criteria for judging the quality of partitions.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data
objects. A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
• The agglomerative approach- is also known as bottom-up approach. It starts
with each object forming a separate group. It successively merges the
objects or groups that are close to one another, until all of the groups are
merged into one or until a termination condition holds.
• The divisive approach – is known as top-down approach. It starts with all of
the objects in the same cluster and the cluster is split up into smaller
clusters, until eventually each object is in one cluster or until a termination
condition holds.
Density-based Methods: A given cluster can be grown as long as the density in
the “neighborhood” exceeds some threshold; that is, for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum
number of points. Such a method can be used to filter out noise and discover
clusters of arbitrary shape.
Grid-based methods: Grid-based methods quantize the object space into a finite
number of cells that form a grid structure. Fast processing time, which is typically
independent of the number of data objects and dependent only on the number of
cells in each dimension in the quantized space is the merit of this method.
Model-based methods: Model-based methods hypothesize a model for each of the
clusters and find the best fit of the data to the given model. A model-based
algorithm may locate clusters by constructing a density function that reflects the
spatial distribution of the data points. It is used to automatically determine the
number of clusters based on standard statistics.
Clustering high-dimensional data: It is used for analysis of objects containing a
large number of features or dimensions. Frequent pattern–based clustering extracts
distinct frequent patterns among subsets of dimensions that occur frequently. It
uses such patterns to group objects and generate meaningful clusters.
Constraint-based clustering : It is a clustering approach that performs clustering
by incorporation of user-specified or application-oriented constraints. Various
kinds of constraints can be specified, either by a user or as per application
requirements.
The squared-error criterion is defined as E = Σ i=1..k Σ p∈Ci |p − mi|²,
where E is the sum of the square error for all objects in the data set; p is the point
in space representing a given object; and mi is the mean of cluster Ci (both p and
mi are multidimensional). In other words, for each object in each cluster, the
distance from the object to its cluster center is squared, and the distances are
summed. This criterion tries to make the resulting k clusters as compact and as
separate as possible.
Steps are:
• arbitrarily choose k objects in D as the initial representative objects or
seeds;
• repeat
o assign each remaining object to the cluster with the nearest
representative object;
o randomly select a non representative object
o Compute the total cost, S, of swapping representative object.
o if S < 0 then swap objects to form the new set of k representative
objects;
• until no change;
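A compact PAM-style sketch of these steps in Python; the 2-D points, the choice of Manhattan distance and k = 2 are assumptions made for illustration:

import random

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
k = 2

def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])   # Manhattan distance

def total_cost(medoids):
    # Cost = sum of distances from each object to its nearest representative object.
    return sum(min(dist(p, m) for m in medoids) for p in points)

random.seed(0)
medoids = random.sample(points, k)               # arbitrary initial medoids
improved = True
while improved:                                  # repeat ... until no change
    improved = False
    for m in list(medoids):
        for candidate in points:
            if candidate in medoids:
                continue
            trial = [candidate if x == m else x for x in medoids]
            if total_cost(trial) < total_cost(medoids):   # swap lowers the cost
                medoids = trial
                improved = True

clusters = {m: [p for p in points if min(medoids, key=lambda x: dist(p, x)) == m]
            for m in medoids}
print(medoids)
print(clusters)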
• CLARA draws multiple samples of the data set, applies PAM on each
sample, and returns its best clustering as the output.
The effectiveness of CLARA depends on the sample size. PAM searches for the
best k medoids among a given data set, whereas CLARA searches for the best k
medoids among the selected sample of the data set. CLARA cannot find the best
clustering if any of the best sampled medoids is not among the best k medoids.
Hierarchical Methods
BIRCH introduces two concepts, clustering feature and clustering feature tree (CF
tree), which are used to summarize cluster representations. These structures help
the clustering method achieve good speed and scalability in large databases and
also make it effective for incremental and dynamic clustering of incoming objects.
Fig. 3.4.2.1 A CF tree structure.
Density-Based Methods
To discover clusters with arbitrary shape, density-based clustering methods have
been developed. These typically regard clusters as dense regions of objects in the
data space that are separated by regions of low density (representing noise).
a. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a
density- based clustering algorithm. The algorithm grows regions with sufficiently
high density into clusters and discovers clusters of arbitrary shape in spatial
databases with noise. It defines a cluster as a maximal set of density-connected
points.
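A compact Python sketch of the DBSCAN idea (grow clusters from core points whose eps-neighborhood contains at least min_pts points); the data set, eps and min_pts values are made up:

import math

points = [(1, 1), (1.2, 1.1), (0.9, 1.3), (5, 5), (5.1, 5.2), (4.9, 5.1), (9, 1)]
eps, min_pts = 0.6, 3

def neighbors(i):
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

labels = {}                      # point index -> cluster id, or -1 for noise
cluster_id = 0
for i in range(len(points)):
    if i in labels:
        continue
    seeds = neighbors(i)
    if len(seeds) < min_pts:     # not a core point (may later join a cluster)
        labels[i] = -1
        continue
    cluster_id += 1              # start a new cluster and expand it
    labels[i] = cluster_id
    queue = [j for j in seeds if j != i]
    while queue:
        j = queue.pop()
        if labels.get(j, -1) == -1:        # noise or unvisited border point
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:          # j is also a core point
                queue.extend(n for n in j_neighbors if labels.get(n, -1) == -1)

print(labels)   # two clusters plus one noise point (index 6)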
Fig. OPTICS terminology.
Grid-Based Methods
The grid-based clustering approach uses a multi resolution grid data structure. It
quantizes the object space into a finite number of cells that form a grid structure on
which all of the operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically
independent of the number of data objects, yet dependent on only the number of
cells in each dimension in the quantized space.
Fig. A hierarchical structure for STING clustering.
a. Expectation-Maximization
Expectation-Maximization is used to cluster the data using a finite mixture density
model of k probability distributions, where each distribution represents a cluster.
The problem is to estimate the parameters of the probability distributions so as to
best fit the data.
Fig. Expectation- Maximization
b. Conceptual Clustering
Conceptual clustering is a form of clustering in machine learning that, given a set
of unlabeled objects, produces a classification scheme over the objects. Unlike
conventional clustering, which primarily identifies groups of like objects,
conceptual clustering goes one step further by also finding characteristic
descriptions for each group, where each group represents a concept or class.
A neural network is a set of connected input/output units, where each connection
has a weight associated with it. Neural networks have several properties that make
them popular for clustering.
The attributes of an object assigned to a cluster can be predicted from the attributes
of the Cluster’s exemplar. Self-organizing feature maps (SOMs) are one of the
most popular neural network methods for cluster analysis. They are sometimes
referred to as Kohonen self-organizing feature maps, after their creator, Teuvo
Kohonen, or as topologically ordered maps.
PROCLUS finds the best set of medoids by a hill-climbing process similar to that
used in CLARANS, but generalized to deal with projected clustering. It adopts a
distance measure called Manhattan segmental distance, which is the Manhattan
distance on a set of relevant dimensions.
The Iteration Phase selects a random set of k medoids from this reduced set (of
medoids), and replaces “bad” medoids with randomly chosen new medoids if the
clustering is improved. For each medoid, a set of dimensions is chosen whose
average distances are small compared to statistical expectation. The total number
of dimensions associated to medoids must be k.
The Refinement Phase computes new dimensions for each medoid based on the
clusters found, reassigns points to medoids, and removes outliers.
Constraint-based semi-supervised clustering relies on user-provided labels or
constraints to guide the algorithm toward a more appropriate data partitioning.
This includes modifying the objective function based on constraints, or initializing
and constraining the clustering process based on the labeled objects.
Many data mining algorithms try to minimize the influence of outliers or eliminate
them altogether. This, however, could result in the loss of important hidden
information because one person's noise could be another person's signal.
It is useful for fraud detection, where outliers may indicate fraudulent activity.
Thus, outlier detection and analysis is an interesting data mining task, referred to
as outlier mining.
Outlier mining has wide applications. It can be used in fraud detection by detecting
unusual usage of credit cards or telecommunication services. In addition, it is
useful in customized marketing for identifying the spending behavior of customers
with extremely low or extremely high incomes, or in medical analysis for finding
unusual responses to various medical treatments.
The outlier mining problem can be viewed as two sub problems:
(1) Define what data can be considered as inconsistent in a given data set, and
(2) Find an efficient method to mine the outliers so defined.
Application of a statistical discordancy test requires knowledge of the data set
parameters (such as the assumed data distribution), knowledge of distribution
parameters (such as the mean and variance), and the expected number of outliers.
79
• Consecutive (or sequential) procedures: An example of such a procedure is
the inside out procedure. The object that is least “likely” to be an outlier is
tested first. If it is found to be an outlier, then all of the more extreme values
are also considered outliers; otherwise, the next most extreme object is
tested, and so on. This procedure tends to be more effective than block
procedures.
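As a much-simplified illustration of a distribution-based check (a hedged sketch only: it assumes the values come from a normal distribution with known mean and variance and uses a fixed z-score cutoff, whereas real discordancy tests such as Grubbs' test use critical values that depend on the sample size):

def flag_outliers(values, mu, sigma, z_cutoff=3.0):
    """Flag values more than z_cutoff standard deviations from the assumed mean,
    given known distribution parameters (mean mu, standard deviation sigma)."""
    return [v for v in values if abs(v - mu) / sigma > z_cutoff]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
print(flag_outliers(data, mu=10.0, sigma=0.2))   # [25.0]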
80
To define the local outlier factor of an object, we need to introduce the concepts of
k-distance, k-distance neighborhood, reachability distance, and local reachability
density.
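For instance, a hedged sketch of computing local outlier factors with scikit-learn's LocalOutlierFactor (an assumed library; internally it builds on the k-distance and reachability-density notions mentioned above, and the data here are illustrative):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),   # a dense region
               np.array([[8.0, 8.0]])])        # one isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 marks detected outliers
scores = -lof.negative_outlier_factor_         # larger score = more outlying
print(np.where(labels == -1)[0])               # index of the isolated point (expected: [100])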
81
UNIT V RECENT TRENDS
Introduction
82
An increasingly important task in data mining is to mine
complex types of data, including complex objects, spatial data, multimedia data,
time-series data, text data, and the World Wide Web.
In an object-relational or object-oriented database, each object is typically described by:
1) An object identifier,
2) A set of attributes that may contain sophisticated data structures, set- or list-
valued data, class composition and hierarchies, multimedia data, and so on, and
3) A set of methods that specify the computational routines or rules associated
with the object class.
To facilitate generalization and induction in object-relational and object-
oriented databases, it is important to study how the generalized data can be used
for multidimensional analysis and data mining.
83
Spatial databases have many features distinguishing them from relational
databases. They carry topological and/or distance information, usually organized
by sophisticated, multidimensional spatial indexing structures that are accessed by
spatial data access methods and often require spatial reasoning, geometric
computation, and spatial knowledge representation techniques.
Spatial data mining can be used for understanding spatial data, discovering spatial
relationships and relationships between spatial and nonspatial data, constructing
spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.
It is expected to have wide applications in geographic information systems,
geomarketing, remote sensing, image database exploration, medical imaging,
navigation, traffic control, environmental studies, and many other areas where
spatial data are used.
For example, consider the spatial association rule
is_a(X, "school") ∧ close_to(X, "sports_center") ⇒ close_to(X, "park") [0.5%, 80%]
This rule states that 80% of schools that are close to sports centers are also close to
parks, and 0.5% of the data belongs to such a case.
Examples of spatial predicates include distance information, topological relations
(such as intersect, overlap, and disjoint), and spatial orientations (such as left_of and west_of).
Progressive refinement can be adopted in spatial association analysis.
• The method first mines large data sets roughly using a fast algorithm.
• It then improves the quality of mining in a pruned data set using a more expensive algorithm.
• The superset coverage property is used to ensure that the pruned data set covers
the complete set of answers when applying the high-quality data mining algorithms.
• It preserves all of the potential answers. In other words, it should allow a
false-positive test, which might include some data sets that do not belong to
the answer sets, but it should not allow a false-negative test, which might
exclude some potential answers.
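To make this filter-and-refine idea concrete, here is a minimal hypothetical sketch: a cheap bound computed from minimum bounding rectangles (MBRs) prunes object pairs that can never satisfy close_to, so the surviving candidate set is a superset of the true answers (no false negatives), and only those candidates get the exact, expensive distance computation. All names, data, and thresholds are illustrative assumptions.

import math

def mbr_min_dist(a, b):
    """Lower bound on the distance between any two points inside the
    axis-aligned rectangles a and b, each given as (xmin, ymin, xmax, ymax)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def exact_min_dist(obj_a, obj_b):
    """Expensive refinement step: exact minimum distance between two point sets."""
    return min(math.dist(p, q) for p in obj_a for q in obj_b)

def close_pairs(objects, mbrs, threshold):
    ids = list(objects)
    candidates = [(i, j) for i in ids for j in ids if i < j
                  and mbr_min_dist(mbrs[i], mbrs[j]) <= threshold]   # rough filter
    return [(i, j) for i, j in candidates
            if exact_min_dist(objects[i], objects[j]) <= threshold]  # refinement

objs = {"school_1": [(0, 0)], "park_1": [(1, 1)], "park_2": [(50, 50)]}
boxes = {k: (min(x for x, _ in v), min(y for _, y in v),
             max(x for x, _ in v), max(y for _, y in v)) for k, v in objs.items()}
print(close_pairs(objs, boxes, threshold=5.0))   # [('park_1', 'school_1')]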
84
For mining spatial associations related to the spatial predicate close_to, we can first
collect the candidates that pass the minimum support threshold using a rough,
inexpensive spatial evaluation, and then refine the surviving candidates with exact
(and more costly) spatial computation, following the progressive refinement strategy above.
Multimedia database systems are increasingly common owing to the popular use
of audio-video equipment, digital cameras, CD-ROMs, and the Internet. Typical
multimedia database systems include NASA’s EOS (Earth Observation System),
various kinds of image and audio-video databases, and Internet databases.
Content-based retrieval uses visual features to index images and promotes object
retrieval based on feature similarity, which is highly desirable in many
applications.
85
In a content-based image retrieval system, there are often two kinds of queries:
• Image-sample-based queries and
• Image feature specification queries.
Image-sample-based queries
It is used to find all of the images that are similar to the given image sample. This
search compares the feature vector (or signature) extracted from the sample with
the feature vectors of images that have already been extracted and indexed in the
image database. Based on this comparison, images that are close to the sample
image are returned.
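A minimal sketch of this signature-comparison step, assuming each indexed image has already been reduced to a fixed-length feature vector; the vectors and the Euclidean measure below are illustrative, while real systems derive the signatures from color, texture, and shape features.

import numpy as np

def query_by_sample(sample_signature, indexed_signatures, top_k=3):
    """Return the ids of the indexed images whose feature vectors are
    closest to the sample image's feature vector."""
    ids = list(indexed_signatures)
    vectors = np.array([indexed_signatures[i] for i in ids])
    dists = np.linalg.norm(vectors - np.asarray(sample_signature), axis=1)
    order = np.argsort(dists)[:top_k]
    return [(ids[i], float(dists[i])) for i in order]

# Usage with tiny made-up 4-dimensional signatures
index = {"img_1": [0.9, 0.1, 0.3, 0.5],
         "img_2": [0.2, 0.8, 0.7, 0.1],
         "img_3": [0.85, 0.15, 0.35, 0.45]}
print(query_by_sample([0.9, 0.1, 0.3, 0.5], index, top_k=2))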
86
In region-based matching, a region in one image could be a translation or scaling of a matching region in the other.
Therefore, a similarity measure between the query image Q and a target
image T can be defined in terms of the fraction of the area of the two
images covered by matching pairs of regions from Q and T. Such a region-
based similarity search can find images containing similar objects, where
these objects may be translated or scaled.
A multimedia data cube can contain additional dimensions and measures for
multimedia information, such as color, texture, and shape.
Typical examples include searching for and editing of particular video clips in a TV
studio, detecting suspicious persons or scenes in surveillance videos, searching for
particular events in a personal multimedia repository such as MyLifeBits,
discovering patterns and outliers in weather radar recordings, and finding a
particular melody or tune in an MP3 audio album.
Time-series database
• Consists of sequences of values or events changing with time
• Data is recorded at regular intervals
• Characteristic time-series components: trend, cycle, seasonal, irregular
• Applications
o Financial: stock price, inflation
o Industry: power consumption
o Scientific: experiment results
o Meteorological: precipitation
88
• Cyclic movements or cyclic variations: e.g., business cycles; may or may not be periodic
• Seasonal movements or seasonal variations: i.e., almost identical patterns that a
time series appears to follow during corresponding months of successive years
• Irregular or random movements
Time-series analysis is the decomposition of a time series into these four basic movements:
• Additive model: TS = T + C + S + I
• Multiplicative model: TS = T × C × S × I
Seasonal index
• A set of numbers showing the relative values of a variable during the months of the year
• E.g., if the sales during October, November, and December are 80%, 120%, and
140% of the average monthly sales for the whole year, respectively, then 80, 120,
and 140 are the seasonal index numbers for these months
Deseasonalized data
• Data adjusted for seasonal variations, for better trend and cyclic analysis
• Obtained by dividing the original monthly data by the seasonal index numbers for
the corresponding months
Estimation of cyclic variations
• If (approximate) periodicity of cycles occurs, a cyclic index can be constructed in
much the same manner as seasonal indexes
Estimation of irregular variations
• By adjusting the data for trend, seasonal, and cyclic variations
With the systematic analysis of the trend, cyclic, seasonal, and irregular
components, it is possible to make long- or short-term predictions with
reasonable quality.
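As a small worked example of deseasonalization, the sketch below divides the original monthly figures by the seasonal index numbers (expressed as percentages, using the October–December indexes from the example above); the sales values themselves are illustrative assumptions.

def deseasonalize(values, seasonal_indexes):
    """Divide each original monthly value by its seasonal index (in percent)."""
    return [v / (s / 100.0) for v, s in zip(values, seasonal_indexes)]

# October-December sales with seasonal index numbers 80, 120, 140
sales = [40_000, 66_000, 77_000]
indexes = [80, 120, 140]
print(deseasonalize(sales, indexes))   # [50000.0, 55000.0, 55000.0]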
89
5.5 Text Databases and IR
Information Retrieval
Information retrieval is concerned with locating relevant documents in a collection
based on user input such as keywords or example documents. Typical IR systems
include online library catalogs and online document management systems.
90
Keyword-Based Retrieval
91
Stop list
• A set of words that are deemed "irrelevant", even though they may appear
frequently, e.g., a, the, of, for, with, etc.
• Stop lists may vary when the document set varies
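As a minimal sketch of keyword-based retrieval with a stop list (the documents, stop words, and AND semantics of the query below are illustrative assumptions):

from collections import defaultdict

STOP_WORDS = {"a", "the", "of", "for", "with", "and", "in"}

def build_index(docs):
    """Inverted index: keyword -> set of document ids, ignoring stop words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def search(index, query):
    """Return documents containing every non-stop-word query keyword."""
    keywords = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not keywords:
        return set()
    return set.intersection(*(index.get(w, set()) for w in keywords))

docs = {1: "data mining for the web", 2: "mining of spatial data", 3: "text retrieval"}
index = build_index(docs)
print(search(index, "data mining"))   # {1, 2}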
92
Deficiencies
• A topic of any breadth may easily contain hundreds of thousands of documents
• Many documents that are highly relevant to a topic may not contain the keywords
defining them (the synonymy problem)
93
XML can help solve heterogeneity for vertical applications,
but the freedom to define tags can make horizontal
applications on the Web more heterogeneous
94
• Semantic integration of heterogeneous, distributed genome databases
o Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
• Similarity search and comparison among DNA sequences
o Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
o Identify gene sequence patterns that play roles in various diseases
• Association analysis: identification of co-occurring gene sequences
o Most diseases are not triggered by a single gene but by a combination of genes acting together
o Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples
• Path analysis: linking genes to different disease development stages
o Different genes may become active at different stages of the disease
o Develop pharmaceutical interventions that target the different stages separately
• Visualization tools and genetic data analysis
95
Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs).
Tools: data visualization, linkage analysis, classification, clustering, outlier
analysis, and sequential pattern analysis tools (to find unusual access sequences).
96