Module 7 Mining Object Spatial Multimedia Text and Web Data

The document discusses various techniques for mining different types of data. It covers mining spatial data, images, text, and the web. For spatial data, it discusses mining spatial databases, spatial data warehousing, spatial merge, spatial association analysis, spatial classification, and spatial cluster analysis. For images, it discusses content-based retrieval, classification of images, and combining image searches. For text, it discusses keyword-based retrieval, similarity-based retrieval, TF-IDF weighting, and latent semantic indexing.

Uploaded by

sangram
Mining Object, Spatial, Multimedia, Text, and Web Data

Data Mining
Mining Complex Types of Data
 Mining spatial data
 Mining image data
 Mining text data
 Mining the Web
Mining Spatial Databases
 Spatial database
 Space related data: maps, VLSI layouts, …
 Topological, distance information organized by spatial
indexing structures
 Spatial data warehousing
 Issue: different representations & structures
 Dimensions
 Nonspatial: 25-30 degrees → hot
 Spatial-to-nonspatial: “New York” → “western provinces”
 Spatial-to-spatial: equi-temperature region → 0-5 degree region
 Measures
 numerical
 Spatial: collection of spatial pointers (0-5 degree region)
Example: BC Weather Pattern Analysis
 Input
 A map with about 3,000 weather probes scattered in B.C.
 Daily data for temperature, wind velocity, etc.
 Concept hierarchies for all attributes
 Output
 A map that reveals patterns: merged (similar) regions
 Goals
 Interactive analysis (drill-down, slice, dice, pivot, roll-up)
 Fast response time and minimal storage space
 Challenge
 A merged region may contain hundreds of “primitive”
regions (polygons)
Spatial Merge
 Precomputing: too much
storage space
 On-line merge: very
expensive
Spatial Association Analysis
 Spatial association rule: A → B [s%, c%] (s = support, c = confidence)
 A and B are sets of spatial or nonspatial predicates
 Topological relations: intersects, overlaps, disjoint, etc.
 Spatial orientations: left_of, west_of, under, etc.
 Distance information: close_to, within_distance, etc.
 Example
 is_a(x, “school”) ∧ close_to(x, “sports_center”)
→ close_to(x, “park”) [7%, 85%]
 Progressive Refinement
 First search for rough relationship (e.g. g_close_to for
close_to, touch, intersect) using rough evaluation (e.g.
MBR)
 Then apply only to those objects which have passed the
rough test
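The progressive refinement idea above can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the object names, the coordinates, and the use of vertex-to-vertex distance as the "exact" test are all assumptions made for the example. The rough test compares minimum bounding rectangles (MBRs); because the MBR distance never exceeds the true distance, the filter cannot discard a pair that would pass the exact test.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a point list: (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_distance(a, b):
    """Smallest distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def exact_distance(p, q):
    """Minimum vertex-to-vertex distance; a stand-in for an exact geometric test."""
    return min(math.hypot(a[0] - b[0], a[1] - b[1]) for a in p for b in q)

def close_pairs(objects, threshold):
    """Return pairs satisfying close_to, using the MBR test as a cheap filter."""
    names = sorted(objects)
    boxes = {name: mbr(objects[name]) for name in names}
    result = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mbr_distance(boxes[a], boxes[b]) > threshold:
                continue  # fails the rough test, so skip the expensive check
            if exact_distance(objects[a], objects[b]) <= threshold:
                result.append((a, b))
    return result

# Hypothetical objects given as polygon vertex lists.
objects = {
    "school": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "park": [(2, 0), (3, 0), (3, 1), (2, 1)],
    "sports_center": [(10, 10), (11, 10), (11, 11), (10, 11)],
}
pairs = close_pairs(objects, threshold=1.5)
```

Only the pairs that survive the MBR filter reach `exact_distance`, which is where the savings come from when the exact test is costly.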
Spatial Classification
 Spatial classification
 Analyze spatial objects to derive classification schemes,
such as decision trees, in relation to spatial properties
 Example
 Classify regions into rich vs. poor
 Properties: containing university, containing highway, near
ocean, etc.
Spatial Cluster Analysis
 Constraints-based clustering
 Selection of relevant objects before clustering
 Parameters as constraints
 K-means: number of clusters k; density-based: radius, min points
 Clustering with obstructed distance
[Figure, two panels: (1) spatial data with obstacles (river, bridge, mountain); (2) clustering without taking obstacles into consideration (clusters C1-C4)]
Mining Image Data - Retrieval
 Description-based retrieval systems
 Retrieval based on image descriptions, such as keywords,
captions, size, etc.
 Labor-intensive, poor quality
 Content-based retrieval systems
 Retrieval based on the image content (features), such as
color histogram, texture, shape, and wavelet transforms
 Sample-based queries
 Find all images whose features are similar to those of a given
sample image
 Feature specification queries
 Specify or sketch image features like color, texture, or shape,
which are translated into a feature vector
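A sample-based query on color features might look like the sketch below. It is only an illustration under simplifying assumptions: images are plain lists of RGB tuples, each channel is quantized into two bins, and histogram intersection is used as the similarity measure. The image names and pixel values are made up.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized color histogram of a list of (r, g, b) pixels (0-255 each)."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1
    return [h / len(pixels) for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def most_similar(query_pixels, database):
    """Name of the stored image whose histogram best matches the sample image."""
    q = color_histogram(query_pixels)
    return max(
        database,
        key=lambda name: histogram_intersection(q, color_histogram(database[name])),
    )

# Hypothetical image database: mostly-blue and mostly-green images.
database = {
    "sky": [(30, 60, 220)] * 4,
    "grass": [(20, 200, 30)] * 4,
}
sample = [(40, 80, 240)] * 3 + [(250, 250, 250)]
best = most_similar(sample, database)
```

A feature specification query would work the same way, except that the query histogram is built from the user's sketch or stated colors rather than from a sample image.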
Mining Image Data - Retrieval
Combining searches
 Search for “blue sky”: top layout grid is blue
 Search for “airplane in blue sky”: top layout grid is blue and keyword = “airplane”
Classification of Image Data
 Classification
 Decision tree
 Based on descriptive features
 Based on content features
 Feature extraction
 Extract features for classification from raw image
 Various image analysis techniques are required
 Data transformation, edge detection, etc.
 Example
 Classify sky images to recognize galaxies, stars, etc.
 By using properties obtained from image analysis
Mining Text Databases
 Text databases (document databases)
 Large collections of documents from various sources
 News articles, research papers, books, e-mail messages, and
Web pages
 Data stored is usually semi-structured
 Traditional information retrieval techniques become
inadequate for the increasingly vast amounts of text data
 Information retrieval
 Information is organized into documents
 Information retrieval problem
 Locating relevant documents based on user input, such as
keywords or example documents
Basic Measures for IR
 Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
\[ \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \]
 Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
\[ \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \]
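As a small worked example of the two measures, the sketch below computes them from sets of document identifiers. The identifiers are made up, and returning 0.0 for an empty set is a convention chosen here, not something the definitions prescribe.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four relevant documents; the system returned three, two of them relevant.
precision, recall = precision_recall(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d1", "d2", "d5"},
)
```

Here precision is 2/3 and recall is 2/4: retrieving more documents can raise recall but typically lowers precision.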
Keyword-Based Retrieval
 A document is represented by a set of keywords
 Retrieval by keyword matching
 Queries may use expressions of keywords
 (Car and accessory), (C++ or Java)
 Major difficulties
 Synonymy: same meaning but different word
 Ex: query “software” → a document about programming is missed because it does
not contain the keyword
 Polysemy: same word but different meaning
 Ex: query “mining” → a document about gold mining is retrieved merely because it
contains the keyword
Similarity-Based Retrieval
 A document is represented as a keyword vector
 Retrieval by similarity computing
 Basic techniques
 Stop list – set of words that are frequent but irrelevant
 Ex> a, the, of, for, with, …
 Stemming – use a common word stem
 Ex> drug, drugs, drugged → drug
 Weighting – count frequency
 Term frequency, inverse document frequency, …
 Similarity metrics
 Measure the closeness of a document to a query
 Cosine similarity:
\[ \text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|} \]
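The pieces above (stop list, keyword vector, cosine similarity) fit together roughly as follows. This is a minimal sketch: the stop list is just the handful of words from the example, tokenization is plain whitespace splitting, and stemming is left out.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "of", "for", "with"}

def keyword_vector(text):
    """Term-count vector of a text, with stop words removed."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention used here for empty vectors
    return dot / (norm1 * norm2)

query = keyword_vector("mining of the web")
doc_a = keyword_vector("web mining")
doc_b = keyword_vector("spatial databases")
sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
```

After stop-word removal the query and `doc_a` have identical vectors, so their similarity is 1 (up to floating-point rounding), while `doc_b` shares no terms with the query and scores 0.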
TF-IDF Weighting
 TF (Term Frequency)
 TF = f(t, d): how many times term t appears in doc d
 More frequent → more relevant to topic
 Normalization:
 Document length varies: relative frequency preferred
 IDF (Inverse Document Frequency)
 IDF = 1 + log(n / k): decreases as term t appears in more documents
 n: total number of docs
 k: number of docs in which term t appears (the document frequency)
 Less frequent among documents → more discriminative
 TF-IDF weighting
weight(t, d) = TF(t, d) * IDF(t)
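The weighting above translates almost directly into code. The sketch below assumes relative frequency for TF and the natural logarithm for IDF (the slide does not fix the base); the three tiny documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document {term: weight} using TF = relative frequency, IDF = 1 + log(n / k)."""
    n = len(docs)
    # k for each term: number of documents containing it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * (1 + math.log(n / doc_freq[term]))
            for term, count in counts.items()
        })
    return weights

docs = [
    ["spatial", "data", "mining"],
    ["web", "data", "mining"],
    ["text", "data"],
]
weights = tf_idf(docs)
```

In the first document all three terms have the same TF, so the ordering comes from IDF alone: "spatial" (one document) outweighs "mining" (two documents), which outweighs "data" (all three).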
Latent Semantic Indexing
 Reduce the dimension of keyword matrix
 To resolve the synonym problem and the size problem
 Uses singular value decomposition (SVD)
 Example
      universe  rocket  moon  car  truck
D1       1        0      1     1     0
D2       0        1      1     0     0
D3       1        0      0     0     0
D4       0        0      0     1     1
D5       0        0      0     1     0
D6       0        0      0     0     1
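SVD
 Singular Value Decomposition
 Decompose the matrix A (m × n)
\[ A_{m \times n} = U_{m \times m} S_{m \times n} (V_{n \times n})^T \]
 Reduce dimension
 Select the largest k singular values
\[ A'_{m \times n} = U_{m \times k} S_{k \times k} (V_{n \times k})^T \]
 Projection of A into k dimensions
\[ A'_{m \times n} V_{n \times k} = U_{m \times k} S_{k \times k} \]
 Computing similarity
\[ A A^T = U S V^T (U S V^T)^T = U S V^T V S^T U^T = (US)(US)^T \]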
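To make the projection step concrete, the sketch below applies it to the six-document, five-term matrix from the Latent Semantic Indexing example with k = 2. To stay free of external libraries it finds the top right singular vectors by power iteration on A^T A; that is a substitute chosen for this illustration, and a library SVD routine would normally be used instead. The function names are hypothetical.

```python
import math

# Rows D1..D6; columns universe, rocket, moon, car, truck.
A = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def top_right_singular_vectors(A, k, iters=500):
    """Top-k right singular vectors of A via power iteration on A^T A."""
    n = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    found = []
    for idx in range(k):
        v = normalize([float(i + 1 + idx) for i in range(n)])  # arbitrary start
        for _ in range(iters):
            v = [dot(row, v) for row in AtA]
            for u in found:  # remove components along vectors already found
                proj = dot(v, u)
                v = [a - proj * b for a, b in zip(v, u)]
            v = normalize(v)
        found.append(v)
    return found

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

V2 = top_right_singular_vectors(A, k=2)
# Row i of A V_k equals row i of U_k S_k: document i in the 2-D concept space.
coords = [[dot(row, v) for v in V2] for row in A]
sim_d1_d2 = cosine(coords[0], coords[1])
sim_d4_d6 = cosine(coords[3], coords[5])
```

The sign of each singular vector is arbitrary, so individual coordinates may come out negated, but cosine similarities between documents are unaffected. In the reduced space D4 and D6 end up close together although only one of them mentions "car", which is the synonymy effect the technique is meant to capture.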
SVD
\[ U = \begin{pmatrix}
-0.75 & -0.29 & 0.28 & 0.00 & -0.53 \\
-0.28 & -0.53 & -0.75 & 0.00 & 0.29 \\
-0.20 & -0.19 & 0.45 & 0.58 & 0.63 \\
-0.45 & 0.63 & -0.20 & 0.00 & 0.19 \\
-0.33 & 0.22 & 0.12 & -0.58 & 0.41 \\
-0.12 & 0.41 & -0.33 & 0.58 & -0.22
\end{pmatrix}, \quad
S = \begin{pmatrix}
2.16 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 1.59 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 1.28 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.00 & 0.39
\end{pmatrix}, \quad
V^T = \ldots \]
\[ AV = US_2 = \begin{pmatrix}
-1.62 & -0.46 \\
-0.60 & -0.84 \\
-0.04 & -0.30 \\
-0.97 & 1.00 \\
-0.71 & 0.35 \\
-0.26 & 0.65
\end{pmatrix}, \quad
(US)(US)^T = \begin{pmatrix}
1.00 & 0.78 & 0.40 & 0.47 & 0.74 & 0.10 \\
 & 1.00 & 0.88 & -0.18 & 0.16 & -0.54 \\
 & & 1.00 & -0.62 & -0.32 & -0.87 \\
 & & & 1.00 & 0.94 & 0.93 \\
 & & & & 1.00 & 0.74 \\
 & & & & & 1.00
\end{pmatrix} \]
Automatic Document Classification
 Motivation
 Automatic classification for the tremendous number of on-line
text documents (Web pages, e-mails, etc.)
 A classification problem
 Training set: Human experts generate a training data set
 Classification (learning): The system discovers the
classification rules
 Methods
 Extract keywords and weights from documents
 Documents are represented as (keyword, weight) pairs
 Classify training documents into classes
 Apply classification algorithm
 Decision tree, Bayesian, neural network, etc.
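One of the listed options, a Bayesian classifier, can be sketched as follows. This is a minimal multinomial naive Bayes with add-one smoothing, using raw keyword counts as the weights; the training documents and class labels are invented, and words unseen in training are simply ignored.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-class document counts and keyword counts from (terms, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)
    return class_counts, term_counts, vocab

def classify(model, terms):
    """Return the class with the highest log-posterior for the given keywords."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in terms:
            if t in vocab:  # skip words never seen in training
                score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set prepared by human experts.
training = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "win"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
model = train(training)
label_a = classify(model, ["team", "win", "match"])
label_b = classify(model, ["bank", "stock"])
```

Swapping the counts for TF-IDF weights, or the classifier for a decision tree or neural network, changes the model but not the overall train-then-classify workflow.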
Mining the World-Wide Web
 WWW provides rich sources for data mining
 Contents information
 Hyperlink information
 Usage information
 Challenges
 Too large for effective data warehousing and data mining
 Too complex and heterogeneous
 Growing and changing very rapidly
Web Search Engines
 Index-based
 Search the Web, collect Web pages, index Web pages, and
build and store huge keyword-based indices
 Locate sets of Web pages containing certain keywords
 Deficiencies
 A topic of any breadth may easily contain hundreds of
thousands of documents
 Many documents that are highly relevant to a topic may not
contain keywords defining them (synonymy, polysemy)
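The index-based approach and its second deficiency can both be seen in a toy inverted index. The sketch below is an assumption-laden miniature: the page names and texts are made up, tokenization is whitespace splitting, and a query is treated as a conjunction of its keywords.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Pages containing every keyword of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical crawled pages.
pages = {
    "a.html": "used car dealers",
    "b.html": "automobile dealers",
    "c.html": "car rental",
}
index = build_index(pages)
matches = search(index, "car dealers")
```

Only `a.html` matches: `b.html` is about the same topic but uses "automobile", so pure keyword matching misses it.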
Web Contents Mining - Classification
 Web page/site classification
 Assign a class label to each web page from a set of
predefined topic categories
 Based on a set of examples of preclassified documents
 Example
 Use Yahoo!'s taxonomy and its associated documents as
training and test sets
 Derive a Web document classification model
 Use the model to classify new Web documents by assigning
categories from the same taxonomy
 Methods
 Keyword-based classification, use of hyperlink information,
statistical models, …
Web Structure Mining
 Finding authoritative Web pages
 Retrieving pages that are not only relevant, but also of high
quality, or authoritative on the topic
 Hyperlinks can be used to infer authority
 A hyperlink pointing to another Web page can be
considered the author's endorsement of that page
 Problems
 Not every hyperlink represents an endorsement
 One authority will seldom point to its rival authority
 Authoritative pages are seldom particularly descriptive
 Hub
 Set of Web pages that provides collections of links to
authorities
HITS (Hyperlink-Induced Topic Search)
 Method
1. Use an index-based search engine to form the root set
2. Expand the root set into a base set
 Include all of the pages that the root-set pages link to, and all
of the pages that link to a page in the root set
3. Apply weight-propagation
 Determines numerical estimates of hub and authority
weights
4. Output a list of the pages
 Pages with large hub weights and pages with large authority weights for the
given search topic
 Systems based on link analysis
 Clever (HITS); Google (PageRank, a related link-based method)
 Reported to achieve better-quality search results than AltaVista, Yahoo!
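The weight-propagation step (step 3) can be sketched as a simple iteration over the base set's link graph. The sketch assumes the graph is already given as a dictionary from page to outgoing links; the page names, the iteration count, and the Euclidean normalization are choices made for this example.

```python
import math

def hits(links, iterations=50):
    """Estimate hub and authority weights for a link graph {page: [linked pages]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the weights stay bounded across iterations.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth

# Hypothetical base set: three hub-like pages pointing at two candidate authorities.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
```

Pages pointed to by many good hubs accumulate authority weight, and pages that point to many good authorities accumulate hub weight; the two sets of weights reinforce each other until they stabilize.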
Web Usage Mining
 Mining Web log records
 Discover user access patterns
 Typical Web log entry - URL requested, the IP address from
which the request originated, timestamp, etc.
 OLAP on the Weblog database
 Find the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
 Data mining on Weblog records
 Find association patterns, sequential patterns, and trends of
Web accessing
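A "top N" query over Web log records can be sketched as below. The log format (IP address, timestamp, and URL separated by whitespace) and the sample entries are assumptions for illustration; real server logs carry more fields and need a proper parser.

```python
from collections import Counter

def parse_entry(line):
    """Split a simplified log line into its three assumed fields."""
    ip, timestamp, url = line.split()
    return {"ip": ip, "timestamp": timestamp, "url": url}

def top_n(entries, field, n):
    """The n most frequent values of a field, with their counts."""
    return Counter(entry[field] for entry in entries).most_common(n)

# Hypothetical log records.
log_lines = [
    "10.0.0.1 2024-01-01T09:00:00 /index.html",
    "10.0.0.2 2024-01-01T09:01:00 /index.html",
    "10.0.0.1 2024-01-01T09:02:00 /products.html",
    "10.0.0.1 2024-01-01T09:03:00 /index.html",
]
entries = [parse_entry(line) for line in log_lines]
top_pages = top_n(entries, "url", 1)
top_users = top_n(entries, "ip", 1)
```

The same parsed entries could be grouped by hour of the timestamp to find the most frequently accessed time periods, or ordered per IP address as input to sequential pattern mining.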
Web Usage Mining
 Applications
 Target potential customers for electronic commerce
 Identify potential prime advertisement locations
 Enhance the quality and delivery of Internet information
services to the end user
 Improve Web server system performance
 Web caching, Web page prefetching, and Web page swapping
