Data Warehousing/Mining Comp 150 DW Chapter 9. Mining Complex Types of Data
Set-valued attribute
  Generalization of each value in the set into its corresponding higher-level concepts
  Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data
  E.g., hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}
List-valued or sequence-valued attribute
  Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization
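The hobby example can be sketched in a few lines of Python. This is only an illustration: the one-level hierarchy `HOBBY_HIERARCHY` and both function names are hypothetical, and chess is assumed to map to sports because the generalized set in the example has only three concepts.

```python
# Hypothetical one-level concept hierarchy for the hobby example.
# Assumption: chess is filed under "sports" so the result matches the slide.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",
    "violin": "music",
    "nintendo_games": "video_games",
}

def generalize_set(values, hierarchy):
    """Replace each value by its higher-level concept; duplicates collapse."""
    return {hierarchy.get(v, v) for v in values}

def summarize_set(values, hierarchy):
    """Derive general behavior of the set: its size and per-concept counts."""
    counts = {}
    for v in values:
        concept = hierarchy.get(v, v)
        counts[concept] = counts.get(concept, 0) + 1
    return {"size": len(values), "concept_counts": counts}

hobby = {"tennis", "hockey", "chess", "violin", "nintendo_games"}
print(generalize_set(hobby, HOBBY_HIERARCHY))
print(summarize_set(hobby, HOBBY_HIERARCHY))
```

For a list-valued attribute the same mapping would be applied to a list rather than a set, so that element order is preserved.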
Spatial data:
  Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
  Require the merge of a set of geographic areas by spatial operations
Image data:
  Extracted by aggregation and/or approximation
  Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image
Music data:
  Summarize its melody: based on the approximate patterns that repeatedly occur in the segment
  Summarize its style: based on its tone, tempo, or the major musical instruments played
Object identifier: generalize to the lowest level of class in the class/subclass hierarchies
Class composition hierarchies
  Generalize nested structured data
  Generalize only objects closely related in semantics to the current one
Construction and mining of object cubes
  Extend the attribute-oriented induction method
    Apply a sequence of class-based generalization operators on different attributes
    Continue until getting a small number of generalized objects that can be summarized concisely in high-level terms
  For efficient implementation
    Examine each attribute, generalize it to simple-valued data
    Construct a multidimensional data cube (object cube)
  Problem: it is not always desirable to generalize a set of values to single-valued data
Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans)
E.g., Discover travel patterns in an air flight database, or find significant patterns from the sequences of actions in the repair of automobiles
Method
Attribute-oriented induction on sequence data
A generalized travel plan: <small-big-small> E.g., big: same airline, small-big: nearby region
Multidimensional Analysis
Strategy
  Generalize the planbase in different directions
  Look for sequential patterns in the generalized plans
  Derive high-level plans
Multidimensional Generalization
Multi-D generalization of the planbase
Plan#   Loc_Seq
1       ALB - JFK - ORD - LAX - SAN
2       SPI - ORD - JFK - SYR
. . .   . . .
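One way to picture the generalization of a location sequence is sketched below. The airport-size lookup `AIRPORT_SIZE` (S = small, L = large) and the function name are hypothetical and not taken from the planbase itself; runs of the same size class are merged and marked with `+`.

```python
from itertools import groupby

# Hypothetical airport-size lookup (S = small, L = large); an assumption
# made for illustration, not data from the planbase.
AIRPORT_SIZE = {
    "ALB": "S", "JFK": "L", "ORD": "L", "LAX": "L",
    "SAN": "S", "SPI": "S", "SYR": "S",
}

def generalize_plan(loc_seq, size_of):
    """Map each airport to its size class, then merge runs of equal classes."""
    sizes = [size_of[airport] for airport in loc_seq]
    merged = []
    for label, run in groupby(sizes):
        run_length = len(list(run))
        merged.append(label if run_length == 1 else label + "+")
    return "-".join(merged)

plans = {
    1: ["ALB", "JFK", "ORD", "LAX", "SAN"],
    2: ["SPI", "ORD", "JFK", "SYR"],
}
for plan_id, loc_seq in plans.items():
    print(plan_id, generalize_plan(loc_seq, AIRPORT_SIZE))
```

Under these assumed size classes both plans collapse to the same generalized pattern, which is the kind of regularity plan mining looks for.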
Other plans: 1.5% of chances, there are other patterns: SS, L-S-L
Spatial data warehouse: Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis and decision making
Spatial data integration: a big issue
  Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
  Vendor-specific formats (ESRI, MapInfo, Integraph, etc.)
Dimension modeling
  nonspatial
    e.g. temperature: 25-30 degrees generalizes to hot
  spatial-to-nonspatial
    e.g. region B.C. generalizes to description western provinces
  spatial-to-spatial
    e.g. region Burnaby generalizes to region Lower Mainland
Measures
  numerical
    distributive (e.g. count, sum)
    algebraic (e.g. average)
    holistic (e.g. median, rank)
  spatial
    collection of spatial pointers (e.g. pointers to all regions with 25-30 degrees in July)
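The difference between the three kinds of numerical measures can be illustrated with a small sketch. The temperature readings and function names are hypothetical. Count and sum are merged directly from per-region partial results, the average is derived from those two, and the median has to go back to the underlying values.

```python
from statistics import median

def merge_count_sum(partials):
    """Distributive: combine per-partition (count, sum) pairs directly."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_count, total_sum

def merge_average(partials):
    """Algebraic: the average is computed from a fixed set of distributive values."""
    count, total = merge_count_sum(partials)
    return total / count

# Hypothetical temperature readings from two primitive regions.
region_a = [25.0, 27.0, 30.0]
region_b = [26.0, 28.0]
partials = [(len(r), sum(r)) for r in (region_a, region_b)]

print(merge_count_sum(partials))
print(merge_average(partials))
# Holistic: the median needs the raw values, not bounded-size summaries.
print(median(region_a + region_b))
```

This is why distributive and algebraic measures are cheap to roll up in a cube, while holistic ones are not.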
Input
  A map with about 3,000 weather probes scattered in B.C.
  Daily data for temperature, precipitation, wind velocity, etc.
  Concept hierarchies for all attributes
Output
A map that reveals patterns: merged (similar) regions
Goals
  Interactive analysis (drill-down, slice, dice, pivot, roll-up)
  Fast response time
  Minimizing storage space used
Challenge
A merged region may contain hundreds of primitive regions (polygons)
Dimension table
Fact table
Spatial Merge
  Precomputing all: too much storage space
  On-line merge: very expensive
On-line aggregation: collect and store pointers to spatial objects in a spatial data cube
expensive and slow, need efficient aggregation techniques
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
Examples
  is_a(x, large_town) ^ intersect(x, highway) -> adjacent_to(x, water) [7%, 85%]
  is_a(x, large_town) ^ adjacent_to(x, georgia_strait) -> close_to(x, u.s.a.) [1%, 78%]
Apply only to those objects which have passed the rough spatial association test (no less than min_support)
Spatial classification
  Analyze spatial objects to derive classification schemes, such as decision trees in relevance to certain spatial properties (district, highway, river, etc.)
  Example: Classify regions in a province into rich vs. poor according to the average family income
Wavelet Analysis
Wavelet-based signature
  Use the dominant wavelet coefficients of an image as its signature
  Wavelets capture shape, texture, and location information in a single unified framework
  Improved efficiency and reduced the need for providing multiple search primitives
  May fail to identify images containing similar objects that differ in location or size
  Similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other
  Compute and compare signatures at the granularity of regions, not the entire image
by image colors
by color percentage
by color layout
by texture density
by texture layout
by object model
by illumination invariance
by keywords
Design and construction similar to that of traditional data cubes from relational data
Contain additional dimensions and measures for multimedia information, such as color, texture, and shape
Feature descriptor: a set of vectors for each visual characteristic
Layout descriptor: contains a color layout vector and an edge layout vector
[Figure: image data cube views. Cross Tab of Format (JPEG, GIF) by Colour (RED, WHITE, BLUE) with Sum totals; Group By Colour (RED, WHITE, BLUE) with Sum; Measurement]
Format of image
Duration
Colors
Textures
Keywords
Size
Width
Height
Internet domain of image
Internet domain of parent pages
Image popularity
Classification in MultiMediaMiner
Special features:
  Need # of occurrences besides Boolean existence
    E.g., two red squares and one blue circle implies theme air-show
  Need spatial relationships
    Blue on top of white squared object is associated with brown bottom
  Need multi-resolution and progressive refinement mining
    It is expensive to explore detailed associations among objects at high resolution
    It is crucial to ensure the completeness of search at multi-resolution space
Difficult to implement a data cube efficiently given a large number of dimensions, especially serious in the case of multimedia data cubes
Many of these attributes are set-oriented instead of single-valued
Restricting number of dimensions may lead to the modeling of an image at a rather rough, limited, and imprecise scale
More research is needed to strike a balance between efficiency and power of representation
Time-series database
  Consists of sequences of values or events changing with time
  Data is recorded at regular intervals
  Characteristic time-series components
Applications
  Financial: stock price, inflation
  Biomedical: blood pressure
  Meteorological: precipitation
A time series can be illustrated as a time-series graph which describes a point moving with the passage of time
Categories of Time-Series Movements
  Long-term or trend movements (trend curve)
  Seasonal movements or seasonal variations: i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
Seasonal index
  Set of numbers showing the relative values of a variable during the months of the year
  E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are seasonal index numbers for these months
Deseasonalized data
  Data adjusted for seasonal variations
  E.g., divide the original monthly data by the seasonal index numbers for the corresponding months
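A minimal sketch of the seasonal index and the deseasonalizing step, assuming a single year of monthly figures. The `sales` values are hypothetical, chosen so that October to December reproduce the 80, 120 and 140 of the example; in practice the indices would be estimated from several years of data.

```python
def seasonal_indices(monthly_values):
    """Each month's value as a percentage of the average monthly value."""
    average = sum(monthly_values) / len(monthly_values)
    return [100.0 * value / average for value in monthly_values]

def deseasonalize(monthly_values, indices):
    """Divide each monthly value by its seasonal index (taken as a fraction)."""
    return [value / (index / 100.0) for value, index in zip(monthly_values, indices)]

# Hypothetical monthly sales, January to December; the average is 100.
sales = [100, 100, 100, 100, 100, 100, 100, 100, 60, 80, 120, 140]

indices = seasonal_indices(sales)
print([round(i) for i in indices[9:]])   # indices for Oct, Nov, Dec

# Apply the indices to a second (also hypothetical) year of data.
next_year = [110, 108, 112, 109, 111, 110, 113, 110, 70, 90, 130, 150]
print([round(v, 1) for v in deseasonalize(next_year, indices)])
```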
With the systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality
Normal database query finds exact match
Similarity search finds data sequences that differ only slightly from the given query sequence
Two categories of similarity queries
  Whole matching: find a sequence that is similar to the query sequence
  Subsequence matching: find all pairs of similar sequences
Typical Applications
  Financial market
  Market basket data analysis
  Scientific databases
  Medical diagnosis
Allow for gaps within a sequence or differences in offsets or amplitudes
Normalize sequences with amplitude scaling and offset translation
Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences
Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction
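The normalization and envelope test can be sketched as follows. This covers only those two steps; sliding windows, gaps and the matching fraction are left out. The function names, the reading of the envelope as plus or minus `epsilon` around each point, and the `max_outliers` parameter are assumptions made for illustration.

```python
def normalize(seq):
    """Offset translation (subtract the mean) and amplitude scaling (divide by std)."""
    n = len(seq)
    mean = sum(seq) / n
    std = (sum((x - mean) ** 2 for x in seq) / n) ** 0.5
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]

def within_envelope(a, b, epsilon, max_outliers=0):
    """True if b stays within +/- epsilon of a, ignoring up to max_outliers points."""
    if len(a) != len(b):
        return False
    violations = sum(1 for x, y in zip(a, b) if abs(x - y) > epsilon)
    return violations <= max_outliers

# Hypothetical subsequences: s2 is s1 scaled by 10 and shifted by 5.
s1 = [1, 2, 3, 4, 5]
s2 = [15, 25, 35, 45, 55]
s3 = [5, 4, 3, 2, 1]

print(within_envelope(normalize(s1), normalize(s2), epsilon=0.1))
print(within_envelope(normalize(s1), normalize(s3), epsilon=0.1))
```

After normalization `s1` and `s2` coincide, so they pass the envelope test even though their raw values differ in both offset and amplitude.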
Find all of the sequences that are similar to some sequence in class A, but not similar to any sequence in class B
Should be able to support various kinds of queries: range queries, all-pair queries, and nearest neighbor queries
Mining of frequently occurring patterns related to time or other sequences
Sequential pattern mining usually concentrates on symbolic patterns
Examples
  Renting Star Wars, then Empire Strikes Back, then Return of the Jedi in that order
  Collection of ordered events within an interval
Applications
  Targeted marketing
  Customer retention
  Weather prediction
Sequential patterns with support > 0.25 {(C), (H)} {(C), (DG)}
Periodicity Analysis
Periodicity is everywhere: tides, seasons, daily power consumption, etc.
Full periodicity
  Every point in time contributes (precisely or approximately) to the periodicity
Partial periodicity
  Only some segments contribute to the periodicity
  E.g., Jim reads NY Times 7:00-7:30 am every week day
Cyclic association rules
  Associations which form cycles
Methods
  Full periodicity: FFT, other statistical analysis methods
  Partial and cyclic periodicity: variations of Apriori-like mining methods
Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
Data stored is usually semi-structured
Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
Information retrieval
  A field developed in parallel with database systems
  Information is organized into (a large number of) documents
  Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Information Retrieval
Typical IR systems
  Online library catalogs
  Online document management systems
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
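Both measures reduce to set operations on document identifiers. The sketch below assumes documents are identified by hashable IDs; the function name and the sample IDs are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the definitions prescribe.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    # Assumption: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.3f} recall={recall:.3f}")
```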
Keyword-Based Retrieval
A document is represented by a string, which can be identified by a set of keywords
Queries may use expressions of keywords
  E.g., car and repair shop, tea or coffee, DBMS but not Oracle
  Queries and retrieval should consider synonyms, e.g., repair and maintenance
Stop list
  Set of words that are deemed irrelevant, even though they may appear frequently
  E.g., a, the, of, for, with, etc.
  Stop lists may vary when document set varies
$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|}$$
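A minimal sketch of the cosine measure between two term-frequency vectors, assuming both vectors are plain lists over the same vocabulary. The function name and the sample vectors are hypothetical, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: a document with no terms matches nothing
    return dot / (norm1 * norm2)

# Hypothetical term-frequency vectors over a shared vocabulary.
doc_a = [1, 2, 3]
doc_b = [2, 4, 6]   # same direction as doc_a
doc_c = [0, 0, 0]

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity([1, 0], [0, 1]))
```

Because only the angle between the vectors matters, a long document and a short one with the same term proportions score as highly similar.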
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual patterns
Hypertext analysis
  Patterns in anchors/links
  Anchor text correlations with linked objects
Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them
First preprocess the text data by parsing, stemming, removing stop words, etc.
Then invoke association mining algorithms
  Consider each document as a transaction
  View a set of keywords in the document as a set of items in the transaction
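The document-as-transaction view can be sketched as below. This is a brute-force count of keyword pairs rather than a full association mining algorithm such as Apriori, and it skips stemming. The stop list, the sample documents and the function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"a", "the", "of", "for", "with", "and"}

def to_transaction(text):
    """Treat a document as a transaction: its set of non-stop-word keywords."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def frequent_pairs(docs, min_support):
    """Return keyword pairs whose support (fraction of documents) meets min_support."""
    transactions = [to_transaction(doc) for doc in docs]
    counts = Counter()
    for transaction in transactions:
        counts.update(combinations(sorted(transaction), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

docs = [
    "data mining of the web",
    "web data mining",
    "mining text data",
    "the web",
]
for pair, support in sorted(frequent_pairs(docs, min_support=0.5).items()):
    print(pair, support)
```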
Motivation
Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.)
A classification problem
  Training set: Human experts generate a training data set
  Classification: The computer system discovers the classification rules
  Application: The discovered rules can be applied to classify new/unknown documents
Extract keywords and terms by information retrieval and simple association analysis techniques
Obtain concept hierarchies of keywords and terms using
  Available term classes, such as WordNet
  Expert knowledge
  Some keyword classification systems
Classify documents in the training set into class hierarchies
Apply term association mining method to discover sets of associated terms
Use the terms to maximally distinguish one class of documents from others
Derive a set of association rules associated with each document class
Order the classification rules based on their occurrence frequency and discriminative power
Use the rules to classify new documents
Document Clustering
Automatically group related documents based on their contents
Require no training sets or predetermined taxonomies, generate a taxonomy at runtime
Major steps
  Preprocessing: remove stop words, stem, feature extraction, lexical analysis
  Hierarchical clustering: compute similarities, applying clustering algorithms
  Slicing: fan-out controls, flatten the tree to a configurable number of levels
The WWW is a huge, widely distributed, global information service center for
  Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  Hyper-link information
  Access and usage information
[Figure: number of Internet hosts over time, plotted at three-year intervals from Sep-69]
Broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful
  99% of the Web information is useless to 99% of Web users
  How can we find high-quality Web pages on a specified topic?
Index-based: search the Web, index Web pages, and build and store huge keyword-based indices
Help locate sets of Web pages containing certain keywords
Deficiencies
  A topic of any breadth may easily contain hundreds of thousands of documents
  Many documents that are highly relevant to a topic may not contain keywords defining them
Searches for
  Web access patterns
  Web structures
  Regularity and dynamics of Web contents
Problems
  The abundance problem
  Limited coverage of the Web: hidden Web sources, majority of data in DBMS
  Limited query interface based on keyword-oriented search
  Limited customization to individual users
Search Result Mining
  Search Engine Result Summarization
  Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997)
General Access Pattern Tracking
Customized Usage Tracking
Web Structure Mining
  Using Links
    PageRank (Brin et al., 1998), CLEVER (Chakrabarti et al., 1998)
    Use interconnections between web pages to give weight to pages.
  Using Generalization
    MLDB (1994), VWV (1998)
    Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
General Access Pattern Tracking
  Web Log Mining (Zaïane, Xin and Han, 1998)
  Uses KDD techniques to understand general access patterns and trends.
  Can shed light on better structure and grouping of resource providers.
Hub
Set of Web pages that provides collections of links to authorities
Explore interactions between hubs and authoritative pages
Use an index-based search engine to form the root set
  Many of these pages are presumably relevant to the search topic
  Some of them should contain links to most of the prominent authorities
Expand the root set: include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set, up to a designated size cutoff
Apply weight-propagation
  An iterative process that determines numerical estimates of hub and authority weights
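The weight-propagation step can be sketched as a simple iteration over a link graph. The update rule follows the commonly described hub/authority scheme (as in Kleinberg's HITS); the slides do not spell out the exact update or normalization, so the Euclidean normalization, the fixed iteration count, the function name and the toy graph are all assumptions.

```python
def hub_authority_weights(links, iterations=20):
    """links maps each page to the list of pages it links to.

    Returns (hub, authority) dictionaries of weights per page.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the weights stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical base set: three hub-like pages pointing at two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
hub, auth = hub_authority_weights(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Pages pointed to by many good hubs end up with high authority weight, and pages pointing to many good authorities end up with high hub weight.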
Assign a class label to each document from a set of predefined topic categories
Based on a set of examples of preclassified documents
Example
  Use Yahoo!'s taxonomy and its associated documents as training and test sets
  Derive a Web document classification scheme
  Use the scheme to classify new Web documents by assigning categories from the same taxonomy
Layer0: the Web itself
Layer1: the Web page descriptor layer
  Contains descriptive information for pages on the Web
  An abstraction of Layer0: substantially smaller but still rich enough to preserve most of the interesting, general information
  Organized into dozens of semistructured classes: document, person, organization, ads, directory, sales, software, game, stocks, library_catalog, geographic_data, scientific_data, etc.
Layer2 and up: various Web directory services constructed on top of Layer1
  Provide multidimensional, application-specific services
[Figure: layers of the multilayered Web information base, labeled Layer0, Layer1, and Generalized Descriptions]
Potential problem
XML can help solve heterogeneity for vertical applications, but the freedom to define tags can make horizontal applications on the Web more heterogeneous
Benefits:
  Multi-dimensional Web info summary analysis
  Approximate and intelligent query answering
  Web high-level query answering (WebSQL, WebML)
  Web content and structure mining
  Observing the dynamics/evolution of the Web
Mining Web log records to discover user access patterns of Web pages
Applications
  Target potential customers for electronic commerce
  Enhance the quality and delivery of Internet information services to the end user
  Improve Web server system performance
  Identify potential prime advertisement locations
Conduct studies to
Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping
[Figure: Web log mining process. Recoverable labels: Web log, Database, Data Cube; steps 1 Data Cleaning, 3 OLAP, 4 Data Mining]
Summary (1)
  Mining complex types of data includes object data, spatial data, multimedia data, time-series data, text data, and Web data
  Object data can be mined by multi-dimensional generalization of complex structured data, such as plan mining for flight sequences
  Spatial data warehousing, OLAP and mining facilitate multidimensional spatial analysis and finding spatial associations, classifications and trends
  Multimedia data mining needs content-based retrieval and similarity search integrated with mining methods
Summary (2)
  Time-series/sequential data mining includes trend analysis, similarity search in time series, mining sequential patterns and periodicity in time sequence
  Text mining goes beyond keyword-based and similarity-based information retrieval and discovers knowledge from semi-structured data using methods like keyword-based association and document classification
  Web mining includes mining Web link structures to identify authoritative Web pages, the automatic classification of Web documents, building a multilayered Web information base, and Weblog mining