Mining Object, Spatial,
Multimedia, Text, and Web Data
Data Mining
Mining Complex Types of Data
Mining spatial data
Mining image data
Mining text data
Mining the Web
Mining Spatial Databases
Spatial database
Space-related data: maps, VLSI layouts, …
Topological, distance information organized by spatial
indexing structures
Spatial data warehousing
Issue: different representations & structures
Dimensions
Nonspatial: e.g., 25-30 degrees → hot
Spatial-to-nonspatial: “New York” → “western provinces”
Spatial-to-spatial: equi-temperature region → 0-5 degree region
Measures
numerical
Spatial: collection of spatial pointers (0-5 degree region)
Example: BC Weather Pattern
Analysis
Input
A map with about 3,000 weather probes scattered in B.C.
Daily data for temperature, wind velocity, etc.
Concept hierarchies for all attributes
Output
A map that reveals patterns: merged (similar) regions
Goals
Interactive analysis (drill-down, slice, dice, pivot, roll-up)
Fast response time, minimal storage space used
Challenge
A merged region may contain hundreds of “primitive”
regions (polygons)
Spatial Merge
Precomputing: too much storage space
On-line merge: very expensive
Spatial Association Analysis
Spatial association rule: A ⇒ B [s%, c%]
A and B are sets of spatial or nonspatial predicates
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
Example
is_a(x, “school”) ∧ close_to(x, “sports_center”) ⇒ close_to(x, “park”) [7%, 85%]
Progressive Refinement
First search for rough relationships (e.g., g_close_to covering
close_to, touch, intersect) using a rough evaluation (e.g.,
minimum bounding rectangles, MBRs)
Then apply fine evaluation only to those objects that have passed
the rough test
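A minimal Python sketch of this filter-and-refine idea, assuming toy rectangles as object MBRs and an illustrative 1.0-unit close_to threshold (no spatial library; the exact-distance step is only a placeholder):

# Filter-and-refine sketch for close_to(x, y): MBRs first, exact distance second.
# Object geometry and the 1.0 threshold are illustrative assumptions.
import math

objects = {                       # name -> MBR as (xmin, ymin, xmax, ymax)
    "school_1":      (0.0, 0.0, 1.0, 1.0),
    "sports_center": (1.2, 0.5, 2.0, 1.5),
    "park":          (9.0, 9.0, 10.0, 10.0),
}

def mbr_distance(a, b):
    """Lower bound on the true distance between two objects."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def exact_distance(a, b):
    # Placeholder for an exact geometric test (e.g., polygon-to-polygon distance).
    return mbr_distance(a, b)

THRESHOLD = 1.0                   # "close_to" cutoff (assumed)

names = list(objects)
candidates = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
              if mbr_distance(objects[p], objects[q]) <= THRESHOLD]     # rough test
close_pairs = [(p, q) for p, q in candidates
               if exact_distance(objects[p], objects[q]) <= THRESHOLD]  # refinement
print(close_pairs)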
Spatial Classification
Spatial classification
Analyze spatial objects to derive classification schemes,
such as decision trees in relevance to spatial properties
Example
Classify regions into rich vs. poor
Properties: containing university, containing highway, near
ocean, etc.
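A small sketch of such a classifier, assuming scikit-learn is available; the boolean features and rich/poor labels below are invented for illustration:

# Toy decision-tree classification of regions as "rich" vs. "poor"
# from boolean spatial properties; data and labels are made up.
from sklearn.tree import DecisionTreeClassifier

# features: [contains_university, contains_highway, near_ocean]
X = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
y = ["rich", "rich", "poor", "poor", "rich", "poor"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[0, 1, 1]]))   # classify a new region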
Spatial Cluster Analysis
Constraint-based clustering
Selection of relevant objects before clustering
Parameters as constraints
K-means, density-based: radius, min points
Clustering with obstructed distance
[Figure: spatial data with obstacles (a river and a mountain ridge separating clusters C1-C4) vs. clustering without taking obstacles into consideration]
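For the parameters-as-constraints case, a short sketch with scikit-learn's DBSCAN, using a radius (eps) and a minimum-points constraint on made-up coordinates (obstacle-aware distances are not handled here):

# Density-based clustering with radius / min-points constraints (toy coordinates).
# Obstacle-aware clustering would require replacing the Euclidean metric.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],      # dense group A
                   [10, 10], [10, 11], [11, 10],        # dense group B
                   [5, 5]])                             # isolated point

labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(points)
print(labels)    # cluster ids; -1 marks noise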
Mining Image Data - Retrieval
Description-based retrieval systems
Retrieval based on image descriptions, such as keywords,
captions, size, etc.
Labor-intensive, poor quality
Content-based retrieval systems
Retrieval based on the image content (features), such as
color histogram, texture, shape, and wavelet transforms
Sample-based queries
Find all of the images that are similar to the features of a given
image
Feature specification queries
Specify or sketch image features like color, texture, or shape,
which are translated into a feature vector
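A minimal sketch of a sample-based query with a color-histogram feature, using NumPy on synthetic images; the histogram-intersection ranking is one of several reasonable similarity choices:

# Content-based retrieval sketch: color histograms as feature vectors,
# ranked by histogram intersection. Images here are random arrays.
import numpy as np

def color_histogram(img, bins=8):
    """Normalized per-channel histogram concatenated into one feature vector."""
    h = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    v = np.concatenate(h).astype(float)
    return v / v.sum()

rng = np.random.default_rng(0)
database = {f"img_{i}": rng.integers(0, 256, (32, 32, 3)) for i in range(5)}
query = rng.integers(0, 256, (32, 32, 3))

q = color_histogram(query)
ranked = sorted(database,
                key=lambda name: -np.minimum(q, color_histogram(database[name])).sum())
print(ranked[:3])   # most similar images first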
Mining Image Data - Retrieval
Combining searches
Search for “blue sky” (top layout grid is blue)
Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”)
Classification of Image Data
Classification
Decision tree
Based on descriptive features
Based on content features
Feature extraction
Extract features for classification from raw image
Various image analysis techniques are required
Data transformation, edge detection, etc.
Example
Classify sky images to recognize galaxies, stars, etc.
By using properties obtained from image analysis
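A toy sketch of the feature-extraction step with NumPy, computing two simple properties (mean brightness and a gradient-based edge density) from a synthetic image; such features would then feed a classifier:

# Toy feature extraction from a raw (synthetic) grayscale image:
# mean brightness and a simple edge-density measure via gradients.
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((64, 64))                 # stand-in for a sky image

gy, gx = np.gradient(image)
features = {
    "mean_brightness": float(image.mean()),
    "edge_density": float(np.hypot(gx, gy).mean()),   # crude edge strength
}
print(features)    # such features would feed a decision tree or other classifier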
Mining Text Databases
Text databases (document databases)
Large collections of documents from various sources
News articles, research papers, books, e-mail messages, and
Web pages
Data stored is usually semi-structured
Traditional information retrieval techniques become
inadequate for the increasingly vast amounts of text data
Information retrieval
Information is organized into documents
Information retrieval problem
Locating relevant documents based on user input, such as
keywords or example documents
Basic Measures for IR
Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
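Both measures reduce to set intersections; a small sketch with invented document ids:

# Precision and recall from the relevant and retrieved document sets (toy ids).
relevant  = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

hit = relevant & retrieved
precision = len(hit) / len(retrieved)   # 2/3
recall    = len(hit) / len(relevant)    # 2/4
print(precision, recall)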
Keyword-Based Retrieval
A document is represented by a set of keywords
Retrieval by keyword matching
Queries may use expressions of keywords
(Car and accessory), (C++ or Java)
Major difficulties
Synonymy: same meaning but different word
Ex> Query: “software”; a relevant document about programming may
not contain the keyword
Polysemy: same word but different meaning
Ex> Query: “mining”; an irrelevant document about gold mining does
contain the keyword
Similarity-Based Retrieval
A document is represented as a keyword vector
Retrieval by similarity computing
Basic techniques
Stop list – set of words that are frequent but irrelevant
Ex> a, the, of, for, with, …
Stemming – use a common word stem
Ex> drug, drugs, drugged → drug
Weighting – count frequency
Term frequency, inverse document frequency, …
Similarity metrics
Measure the closeness of a document to a query
Cosine similarity: sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
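A compact sketch of the whole pipeline in plain Python, with a tiny stop list, a crude suffix-stripping stand-in for real stemming, raw term counts as weights, and cosine similarity; the documents and query are invented:

# Similarity-based retrieval sketch: stop-word removal, crude stemming,
# term-frequency vectors, cosine similarity. Documents are invented.
import math
from collections import Counter

STOP = {"a", "the", "of", "for", "with", "and", "to"}

def tokens(text):
    out = []
    for w in text.lower().split():
        if w in STOP:
            continue
        out.append(w[:-1] if w.endswith("s") else w)   # toy stemmer
    return out

def cosine(c1, c2):
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = {
    "d1": "effects of drugs on patients",
    "d2": "mining gold with heavy machinery",
    "d3": "data mining of patient drug records",
}
query = Counter(tokens("drug mining"))
vectors = {d: Counter(tokens(t)) for d, t in docs.items()}
print(sorted(docs, key=lambda d: -cosine(query, vectors[d])))  # best match first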
TF-IDF Weighting
TF (Term Frequency)
TF = f(t, d): how many times term t appears in doc d
More frequent → more relevant to the topic
Normalization:
Document length varies : relative frequency preferred
IDF (Inverse Document Frequency)
IDF = 1 + log(n / k): based on how many documents term t appears in
n : total number of docs
k : number of docs in which term t appears (the document frequency)
Less frequent among documents → more discriminative
TF-IDF weighting
weight(t, d) = TF(t, d) * IDF(t)
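A direct sketch of this weighting on a toy corpus, using relative term frequency and the IDF variant shown above (1 + log(n/k)):

# TF-IDF sketch using relative term frequency and IDF = 1 + log(n / k).
import math
from collections import Counter

docs = {
    "d1": "rocket launch to the moon",
    "d2": "car and truck sales",
    "d3": "moon rover and lunar car",
}
tokenized = {d: t.split() for d, t in docs.items()}
n = len(docs)                                          # total number of documents
df = Counter(term for toks in tokenized.values() for term in set(toks))

def tf(term, doc):
    toks = tokenized[doc]
    return toks.count(term) / len(toks)                # relative frequency

def idf(term):
    return 1 + math.log(n / df[term])                  # df[term] = k

def weight(term, doc):
    return tf(term, doc) * idf(term)

print(weight("moon", "d1"), weight("car", "d2"))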
Latent Semantic Indexing
Reduce the dimension of keyword matrix
To resolve the synonym problem and the size problem
Use singular value decomposition (SVD) techniques
Example
     universe  rocket  moon  car  truck
D1      1        0      1     1     0
D2      0        1      1     0     0
D3      1        0      0     0     0
D4      0        0      0     1     1
D5      0        0      0     1     0
D6      0        0      0     0     1
SVD
Singular Value Decomposition
Decompose the matrix A[m×n]
A[m×n] = U[m×m] S[m×n] (V[n×n])^T
Reduce dimension
Select the largest k singular values
A'[m×n] = U[m×k] S[k×k] (V[n×k])^T
Projection of A into k dimensions
A'[m×n] V[n×k] = U[m×k] S[k×k]
Computing similarity
A A^T = U S V^T (U S V^T)^T
      = U S V^T V S^T U^T
      = (U S)(U S)^T
SVD
[Worked example: the numerical U, S, and V^T matrices for the term-document matrix above (singular values 2.16, 1.59, 1.28, 1.00, 0.39), the projected document vectors A V = U S, and the resulting document-similarity matrix (U S)(U S)^T]
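A sketch of the reduction with NumPy's SVD on the document-term matrix from the example above, keeping k = 2 singular values and comparing documents in the reduced space:

# Latent semantic indexing sketch: SVD of the document-term matrix above,
# keeping k = 2 singular values, then comparing documents in the reduced space.
import numpy as np

# rows: D1..D6, columns: universe, rocket, moon, car, truck
A = np.array([[1, 0, 1, 1, 0],
              [0, 1, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]            # projection A V_k = U_k S_k

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(docs_k[0], docs_k[1]))     # D1 vs D2 in the latent space
print(cos(docs_k[0], docs_k[3]))     # D1 vs D4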
Automatic Document
Classification
Motivation
Automatic classification for the tremendous number of on-line
text documents (Web pages, e-mails, etc.)
A classification problem
Training set: Human experts generate a training data set
Classification (learning): The system discovers the
classification rules
Methods
Extract keywords and weights from documents
Documents are represented as (keyword, weight) pairs
Classify training documents into classes
Apply classification algorithm
Decision tree, Bayesian, neural network, etc.
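A short sketch of the training and classification steps, assuming scikit-learn; a bag-of-words count vectorizer stands in for keyword/weight extraction and a multinomial naive Bayes model for the Bayesian classifier, with invented training texts:

# Toy document classification: bag-of-words counts + multinomial naive Bayes.
# Training texts and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["stock market falls", "team wins the final match",
               "quarterly earnings report", "player scores twice"]
train_labels = ["business", "sports", "business", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)                 # (keyword, weight) matrix
clf = MultinomialNB().fit(X, train_labels)

print(clf.predict(vec.transform(["market report on earnings"])))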
Mining the World-Wide Web
WWW provides rich sources for data mining
Content information
Hyperlink information
Usage information
Challenges
Too huge for effective data warehousing and data mining
Too complex and heterogeneous
Growing and changing very rapidly
Web Search Engines
Index-based
Search the Web, collect Web pages, index Web pages, and
build and store huge keyword-based indices
Locate sets of Web pages containing certain keywords
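A minimal inverted-index sketch of the keyword lookup (pages and text are invented); a conjunctive query becomes a set intersection over posting lists:

# Minimal inverted-index sketch: map each keyword to the set of pages containing it,
# then answer a conjunctive keyword query by set intersection. Pages are invented.
from collections import defaultdict

pages = {
    "url1": "cheap flights to tokyo",
    "url2": "tokyo travel guide and hotels",
    "url3": "hotel booking tips",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

query = ["tokyo", "hotels"]
result = set.intersection(*(index[w] for w in query))
print(result)    # pages containing all query keywords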
Deficiencies
A topic of any breadth may easily contain hundreds of
thousands of documents
Many documents that are highly relevant to a topic may not
contain keywords defining them (synonymy, polysemy)
Web Content Mining -
Classification
Web page/site classification
Assign a class label to each web page from a set of
predefined topic categories
Based on a set of examples of preclassified documents
Example
Use Yahoo!'s taxonomy and its associated documents as
training and test sets
Derive a Web document classification model
Use the model to classify new Web documents by assigning
categories from the same taxonomy
Methods
Keyword-based classification, use of hyperlink information,
statistical models, …
Web Structure Mining
Finding authoritative Web pages
Retrieving pages that are not only relevant, but also of high
quality, or authoritative on the topic
Hyperlinks can be used to infer the notion of authority
A hyperlink pointing to another Web page can be considered as
the author's endorsement of that page
Problems
Not every hyperlink represents an endorsement
One authority will seldom point to its rival authority
Authoritative pages are seldom particularly descriptive
Hub
A Web page (or set of pages) that provides collections of links
to authorities
HITS (Hyperlink-Induced
Topic Search)
Method
1. Use an index-based search engine to form the root set
2. Expand the root set into a base set
Include all of the pages that the root-set pages link to, and all
of the pages that link to a page in the root set
3. Apply weight-propagation
Determines numerical estimates of hub and authority
weights
4. Output a list of the pages
Pages with large hub weights and large authority weights for the
given search topic
Systems based on the HITS algorithm
Clever, Google
Achieve better quality search results than AltaVista, Yahoo!
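A compact sketch of the weight-propagation step on a small hypothetical base-set graph: authority weights sum the hub weights of in-linking pages, hub weights sum the authority weights of linked-to pages, normalized each round:

# HITS weight propagation on a small hypothetical base-set graph.
import math

links = {                     # page -> pages it links to
    "p1": ["p3", "p4"],
    "p2": ["p3", "p4"],
    "p3": ["p4"],
    "p4": [],
}
pages = list(links)
auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

for _ in range(20):
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    na = math.sqrt(sum(v * v for v in auth.values()))
    nh = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(sorted(pages, key=auth.get, reverse=True))   # best authorities first
print(sorted(pages, key=hub.get, reverse=True))    # best hubs first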
Web Usage Mining
Mining Web log records
Discover user access patterns
Typical Web log entry - URL requested, the IP address from
which the request originated, timestamp, etc.
OLAP on the Weblog database
Find the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
Data mining on Weblog records
Find association patterns, sequential patterns, and trends of
Web accessing
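A small sketch of the OLAP-style counts mentioned above (top accessed pages, busiest hours), over made-up log records; a real system would first parse server log lines:

# OLAP-style roll-up over Web log records: top-N pages and busiest hours.
# Log entries are made up; a real system would parse server log lines.
from collections import Counter

log = [  # (url, client_ip, hour_of_day)
    ("/index.html", "10.0.0.1", 9),
    ("/products.html", "10.0.0.2", 9),
    ("/index.html", "10.0.0.3", 10),
    ("/index.html", "10.0.0.1", 21),
    ("/cart.html", "10.0.0.2", 21),
]

top_pages = Counter(url for url, _, _ in log).most_common(2)
top_hours = Counter(hour for _, _, hour in log).most_common(2)
print(top_pages)
print(top_hours)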
Web Usage Mining
Applications
Target potential customers for electronic commerce
Identify potential prime advertisement locations
Enhance the quality and delivery of Internet information
services to the end user
Improve Web server system performance
Web caching, Web page prefetching, and Web page swapping