Mining Object, Spatial,
Multimedia, Text, and Web Data
Data Mining
Mining Complex Types of Data
Mining spatial data
Mining image data
Mining text data
Mining the Web
Mining Spatial Databases
Spatial database
Space-related data: maps, VLSI layouts, …
Topological, distance information organized by spatial
indexing structures
Spatial data warehousing
Issue: different representations & structures
Dimensions
Nonspatial: e.g., 25-30 degrees → hot
Spatial-to-nonspatial: “New York” → “western provinces”
Spatial-to-spatial: equi-temperature region → 0-5 degree region
Measures
numerical
Spatial: collection of spatial pointers (0-5 degree region)
Example: BC Weather Pattern
Analysis
Input
A map with about 3,000 weather probes scattered in B.C.
Daily data for temperature, wind velocity, etc.
Concept hierarchies for all attributes
Output
A map that reveals patterns: merged (similar) regions
Goals
Interactive analysis (drill-down, slice, dice, pivot, roll-up)
Fast response time, minimal storage space used
Challenge
A merged region may contain hundreds of “primitive”
regions (polygons)
Spatial Merge
Precomputing: too much storage space
On-line merge: very expensive
Spatial Association Analysis
Spatial association rule: A ⇒ B [s%, c%]
A and B are sets of spatial or nonspatial predicates
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
Example
is_a(x, “school”) ∧ close_to(x, “sports_center”) ⇒ close_to(x, “park”) [7%, 85%]
Progressive Refinement
First search for rough relationships (e.g., g_close_to covering
close_to, touch, intersect) using a rough evaluation (e.g.,
minimum bounding rectangles, MBRs)
Then apply fine evaluation only to those objects that have passed
the rough test
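A minimal Python sketch of this filter-and-refine idea, assuming toy rectangles as object MBRs and an illustrative 1.0-unit close_to threshold (no spatial library; the exact-distance step is only a placeholder):

# Filter-and-refine sketch for close_to(x, y): MBRs first, exact distance second.
# Object geometry and the 1.0 threshold are illustrative assumptions.
import math

objects = {                       # name -> MBR as (xmin, ymin, xmax, ymax)
    "school_1":      (0.0, 0.0, 1.0, 1.0),
    "sports_center": (1.2, 0.5, 2.0, 1.5),
    "park":          (9.0, 9.0, 10.0, 10.0),
}

def mbr_distance(a, b):
    """Lower bound on the true distance between two objects."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def exact_distance(a, b):
    # Placeholder for an exact geometric test (e.g., polygon-to-polygon distance).
    return mbr_distance(a, b)

THRESHOLD = 1.0                   # "close_to" cutoff (assumed)

names = list(objects)
candidates = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
              if mbr_distance(objects[p], objects[q]) <= THRESHOLD]     # rough test
close_pairs = [(p, q) for p, q in candidates
               if exact_distance(objects[p], objects[q]) <= THRESHOLD]  # refinement
print(close_pairs)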
Spatial Classification
Spatial classification
Analyze spatial objects to derive classification schemes,
such as decision trees in relevance to spatial properties
Example
Classify regions into rich vs. poor
Properties: containing university, containing highway, near
ocean, etc.
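A small sketch of such a classifier, assuming scikit-learn is available; the boolean features and rich/poor labels below are invented for illustration:

# Toy decision-tree classification of regions as "rich" vs. "poor"
# from boolean spatial properties; data and labels are made up.
from sklearn.tree import DecisionTreeClassifier

# features: [contains_university, contains_highway, near_ocean]
X = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
y = ["rich", "rich", "poor", "poor", "rich", "poor"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[0, 1, 1]]))   # classify a new region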
Spatial Cluster Analysis
Constraint-based clustering
Selection of relevant objects before clustering
Parameters as constraints
K-means, density-based: radius, min points
Clustering with obstructed distance
[Figure: spatial data with obstacles (a river and a mountain ridge separating clusters C1-C4) vs. clustering without taking obstacles into consideration]
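For the parameters-as-constraints case, a short sketch with scikit-learn's DBSCAN, using a radius (eps) and a minimum-points constraint on made-up coordinates (obstacle-aware distances are not handled here):

# Density-based clustering with radius / min-points constraints (toy coordinates).
# Obstacle-aware clustering would require replacing the Euclidean metric.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],      # dense group A
                   [10, 10], [10, 11], [11, 10],        # dense group B
                   [5, 5]])                             # isolated point

labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(points)
print(labels)    # cluster ids; -1 marks noise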
Mining Image Data - Retrieval
Description-based retrieval systems
Retrieval based on image descriptions, such as keywords,
captions, size, etc.
Labor-intensive, poor quality
Content-based retrieval systems
Retrieval based on the image content (features), such as
color histogram, texture, shape, and wavelet transforms
Sample-based queries
Find all of the images that are similar to the features of a given
image
Feature specification queries
Specify or sketch image features like color, texture, or shape,
which are translated into a feature vector
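A minimal sketch of a sample-based query with a color-histogram feature, using NumPy on synthetic images; the histogram-intersection ranking is one of several reasonable similarity choices:

# Content-based retrieval sketch: color histograms as feature vectors,
# ranked by histogram intersection. Images here are random arrays.
import numpy as np

def color_histogram(img, bins=8):
    """Normalized per-channel histogram concatenated into one feature vector."""
    h = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    v = np.concatenate(h).astype(float)
    return v / v.sum()

rng = np.random.default_rng(0)
database = {f"img_{i}": rng.integers(0, 256, (32, 32, 3)) for i in range(5)}
query = rng.integers(0, 256, (32, 32, 3))

q = color_histogram(query)
ranked = sorted(database,
                key=lambda name: -np.minimum(q, color_histogram(database[name])).sum())
print(ranked[:3])   # most similar images first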
Mining Image Data - Retrieval
Combining searches
Search for “blue sky” (top layout grid is blue)
Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”)
Classification of Image Data
Classification
Decision tree
Based on descriptive features
Based on content features
Feature extraction
Extract features for classification from raw image
Various image analysis techniques are required
Data transformation, edge detection, etc.
Example
Classify sky images to recognize galaxies, stars, etc.
By using properties obtained from image analysis
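A toy sketch of the feature-extraction step with NumPy, computing two simple properties (mean brightness and a gradient-based edge density) from a synthetic image; such features would then feed a classifier:

# Toy feature extraction from a raw (synthetic) grayscale image:
# mean brightness and a simple edge-density measure via gradients.
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((64, 64))                 # stand-in for a sky image

gy, gx = np.gradient(image)
features = {
    "mean_brightness": float(image.mean()),
    "edge_density": float(np.hypot(gx, gy).mean()),   # crude edge strength
}
print(features)    # such features would feed a decision tree or other classifier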
Mining Text Databases
Text databases (document databases)
Large collections of documents from various sources
News articles, research papers, books, e-mail messages, and
Web pages
Data stored is usually semi-structured
Traditional information retrieval techniques become
inadequate for the increasingly vast amounts of text data
Information retrieval
Information is organized into documents
Information retrieval problem
Locating relevant documents based on user input, such as
keywords or example documents
Basic Measures for IR
Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
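Both measures reduce to set intersections; a small sketch with invented document ids:

# Precision and recall from the relevant and retrieved document sets (toy ids).
relevant  = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

hit = relevant & retrieved
precision = len(hit) / len(retrieved)   # 2/3
recall    = len(hit) / len(relevant)    # 2/4
print(precision, recall)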
Keyword-Based Retrieval
A document is represented by a set of keywords
Retrieval by keyword matching
Queries may use expressions of keywords
(Car and accessory), (C++ or Java)
Major difficulties
Synonymy: same meaning but different word
Ex> Query: “software”; a relevant document about programming may
not contain the keyword
Polysemy: same word but different meaning
Ex> Query: “mining”; an irrelevant document about gold mining does
contain the keyword
Similarity-Based Retrieval
A document is represented as a keyword vector
Retrieval by similarity computing
Basic techniques
Stop list – set of words that are frequent but irrelevant
Ex> a, the, of, for, with, …
Stemming – use a common word stem
Ex> drug, drugs, drugged → drug
Weighting – count frequency
Term frequency, inverse document frequency, …
Similarity metrics
Measure the closeness of a document to a query
Cosine similarity: sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
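A compact sketch of the whole pipeline in plain Python, with a tiny stop list, a crude suffix-stripping stand-in for real stemming, raw term counts as weights, and cosine similarity; the documents and query are invented:

# Similarity-based retrieval sketch: stop-word removal, crude stemming,
# term-frequency vectors, cosine similarity. Documents are invented.
import math
from collections import Counter

STOP = {"a", "the", "of", "for", "with", "and", "to"}

def tokens(text):
    out = []
    for w in text.lower().split():
        if w in STOP:
            continue
        out.append(w[:-1] if w.endswith("s") else w)   # toy stemmer
    return out

def cosine(c1, c2):
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = {
    "d1": "effects of drugs on patients",
    "d2": "mining gold with heavy machinery",
    "d3": "data mining of patient drug records",
}
query = Counter(tokens("drug mining"))
vectors = {d: Counter(tokens(t)) for d, t in docs.items()}
print(sorted(docs, key=lambda d: -cosine(query, vectors[d])))  # best match first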
TF-IDF Weighting
TF (Term Frequency)
TF = f(t, d): how many times term t appears in doc d
More frequent → more relevant to the topic
Normalization:
Document length varies : relative frequency preferred
IDF (Inverse Document Frequency)
IDF = 1 + log(n / k): based on how many documents term t appears in
n : total number of docs
k : number of docs in which term t appears (the document frequency)
Less frequent among documents → more discriminative
TF-IDF weighting
weight(t, d) = TF(t, d) * IDF(t)
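A direct sketch of this weighting on a toy corpus, using relative term frequency and the IDF variant shown above (1 + log(n/k)):

# TF-IDF sketch using relative term frequency and IDF = 1 + log(n / k).
import math
from collections import Counter

docs = {
    "d1": "rocket launch to the moon",
    "d2": "car and truck sales",
    "d3": "moon rover and lunar car",
}
tokenized = {d: t.split() for d, t in docs.items()}
n = len(docs)                                          # total number of documents
df = Counter(term for toks in tokenized.values() for term in set(toks))

def tf(term, doc):
    toks = tokenized[doc]
    return toks.count(term) / len(toks)                # relative frequency

def idf(term):
    return 1 + math.log(n / df[term])                  # df[term] = k

def weight(term, doc):
    return tf(term, doc) * idf(term)

print(weight("moon", "d1"), weight("car", "d2"))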
Latent Semantic Indexing
Reduce the dimension of keyword matrix
To resolve the synonym problem and the size problem
Use singular value decomposition (SVD) techniques
Example
     universe  rocket  moon  car  truck
D1      1        0      1     1     0
D2      0        1      1     0     0
D3      1        0      0     0     0
D4      0        0      0     1     1
D5      0        0      0     1     0
D6      0        0      0     0     1
SVD
Singular Value Decomposition
Decompose the matrix A[m×n]
A[m×n] = U[m×m] S[m×n] (V[n×n])^T
Reduce dimension
Select the largest k singular values
A'[m×n] = U[m×k] S[k×k] (V[n×k])^T
Projection of A into k dimensions
A'[m×n] V[n×k] = U[m×k] S[k×k]
Computing similarity
A A^T = U S V^T (U S V^T)^T
      = U S V^T V S^T U^T
      = (U S)(U S)^T
SVD
[Worked example: the numerical U, S, and V^T matrices for the term-document matrix above (singular values 2.16, 1.59, 1.28, 1.00, 0.39), the projected document vectors A V = U S, and the resulting document-similarity matrix (U S)(U S)^T]
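A sketch of the reduction with NumPy's SVD on the document-term matrix from the example above, keeping k = 2 singular values and comparing documents in the reduced space:

# Latent semantic indexing sketch: SVD of the document-term matrix above,
# keeping k = 2 singular values, then comparing documents in the reduced space.
import numpy as np

# rows: D1..D6, columns: universe, rocket, moon, car, truck
A = np.array([[1, 0, 1, 1, 0],
              [0, 1, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]            # projection A V_k = U_k S_k

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(docs_k[0], docs_k[1]))     # D1 vs D2 in the latent space
print(cos(docs_k[0], docs_k[3]))     # D1 vs D4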
Automatic Document
Classification
Motivation
Automatic classification for the tremendous number of on-line
text documents (Web pages, e-mails, etc.)
A classification problem
Training set: Human experts generate a training data set
Classification (learning): The system discovers the
classification rules
Methods
Extract keywords and weights from documents
Documents are represented as (keyword, weight) pairs
Classify training documents into classes
Apply classification algorithm
Decision tree, Bayesian, neural network, etc.
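A short sketch of the training and classification steps, assuming scikit-learn; a bag-of-words count vectorizer stands in for keyword/weight extraction and a multinomial naive Bayes model for the Bayesian classifier, with invented training texts:

# Toy document classification: bag-of-words counts + multinomial naive Bayes.
# Training texts and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["stock market falls", "team wins the final match",
               "quarterly earnings report", "player scores twice"]
train_labels = ["business", "sports", "business", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)                 # (keyword, weight) matrix
clf = MultinomialNB().fit(X, train_labels)

print(clf.predict(vec.transform(["market report on earnings"])))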
Mining the World-Wide Web
WWW provides rich sources for data mining
Content information
Hyperlink information
Usage information
Challenges
Too huge for effective data warehousing and data mining
Too complex and heterogeneous
Growing and changing very rapidly
Web Search Engines
Index-based
Search the Web, collect Web pages, index Web pages, and
build and store huge keyword-based indices
Locate sets of Web pages containing certain keywords
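A minimal inverted-index sketch of the keyword lookup (pages and text are invented); a conjunctive query becomes a set intersection over posting lists:

# Minimal inverted-index sketch: map each keyword to the set of pages containing it,
# then answer a conjunctive keyword query by set intersection. Pages are invented.
from collections import defaultdict

pages = {
    "url1": "cheap flights to tokyo",
    "url2": "tokyo travel guide and hotels",
    "url3": "hotel booking tips",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

query = ["tokyo", "hotels"]
result = set.intersection(*(index[w] for w in query))
print(result)    # pages containing all query keywords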
Deficiencies
A topic of any breadth may easily contain hundreds of
thousands of documents
Many documents that are highly relevant to a topic may not
contain keywords defining them (synonymy, polysemy)
Web Content Mining -
Classification
Web page/site classification
Assign a class label to each web page from a set of
predefined topic categories
Based on a set of examples of preclassified documents
Example
Use Yahoo!'s taxonomy and its associated documents as
training and test sets
Derive a Web document classification model
Use the model to classify new Web documents by assigning
categories from the same taxonomy
Methods
Keyword-based classification, use of hyperlink information,
statistical models, …
Web Structure Mining
Finding authoritative Web pages
Retrieving pages that are not only relevant, but also of high
quality, or authoritative on the topic
Hyperlinks can be used to infer the notion of authority
A hyperlink pointing to another Web page can be considered as
the author's endorsement of that page
Problems
Not every hyperlink represents an endorsement
One authority will seldom point to its rival authority
Authoritative pages are seldom particularly descriptive
Hub
A Web page (or set of pages) that provides collections of links
to authorities
HITS (Hyperlink-Induced
Topic Search)
Method
1. Use an index-based search engine to form the root set
2. Expand the root set into a base set
Include all of the pages that the root-set pages link to, and all
of the pages that link to a page in the root set
3. Apply weight-propagation
Determines numerical estimates of hub and authority
weights
4. Output a list of the pages
Pages with large hub weights and large authority weights for the
given search topic
Systems based on the HITS algorithm
Clever, Google
Achieve better quality search results than AltaVista, Yahoo!
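A compact sketch of the weight-propagation step on a small hypothetical base-set graph: authority weights sum the hub weights of in-linking pages, hub weights sum the authority weights of linked-to pages, normalized each round:

# HITS weight propagation on a small hypothetical base-set graph.
import math

links = {                     # page -> pages it links to
    "p1": ["p3", "p4"],
    "p2": ["p3", "p4"],
    "p3": ["p4"],
    "p4": [],
}
pages = list(links)
auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

for _ in range(20):
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    na = math.sqrt(sum(v * v for v in auth.values()))
    nh = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(sorted(pages, key=auth.get, reverse=True))   # best authorities first
print(sorted(pages, key=hub.get, reverse=True))    # best hubs first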
Web Usage Mining
Mining Web log records
Discover user access patterns
Typical Web log entry - URL requested, the IP address from
which the request originated, timestamp, etc.
OLAP on the Weblog database
Find the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
Data mining on Weblog records
Find association patterns, sequential patterns, and trends of
Web accessing
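A small sketch of the OLAP-style counts mentioned above (top accessed pages, busiest hours), over made-up log records; a real system would first parse server log lines:

# OLAP-style roll-up over Web log records: top-N pages and busiest hours.
# Log entries are made up; a real system would parse server log lines.
from collections import Counter

log = [  # (url, client_ip, hour_of_day)
    ("/index.html", "10.0.0.1", 9),
    ("/products.html", "10.0.0.2", 9),
    ("/index.html", "10.0.0.3", 10),
    ("/index.html", "10.0.0.1", 21),
    ("/cart.html", "10.0.0.2", 21),
]

top_pages = Counter(url for url, _, _ in log).most_common(2)
top_hours = Counter(hour for _, _, hour in log).most_common(2)
print(top_pages)
print(top_hours)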
Web Usage Mining
Applications
Target potential customers for electronic commerce
Identify potential prime advertisement locations
Enhance the quality and delivery of Internet information
services to the end user
Improve Web server system performance
Web caching, Web page prefetching, and Web page swapping