CSCE822 Data Mining and Warehousing
6/27/2012
Data Mining: Principles and Algorithms
$\epsilon U + (1 - \epsilon) M$, where $U_{ij} = 1/n$ for all $i, j$
$(\epsilon U + (1 - \epsilon) M)^T$
$(\epsilon U + (1 - \epsilon) M)^T p = p$
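The fixed-point equation $(\epsilon U + (1 - \epsilon) M)^T p = p$ can be solved by power iteration. The sketch below is only illustrative: it assumes $M$ is a row-stochastic transition matrix, and the value of `eps`, the tolerance, and the 3-page graph are hypothetical choices that do not come from the slides.

```python
def pagerank(M, eps=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for (eps*U + (1-eps)*M)^T p = p, with U_ij = 1/n.

    M is assumed to be row-stochastic (each row sums to 1).
    eps=0.15 is an assumed value, not one given in the slides.
    """
    n = len(M)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # j-th entry of (eps*U + (1-eps)*M)^T p
        new_p = [
            sum((eps / n + (1.0 - eps) * M[i][j]) * p[i] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical 3-page graph: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
M = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
ranks = pagerank(M)
```

Because every row of the mixed matrix sums to 1, the entries of `ranks` keep summing to 1 across iterations, so no renormalisation step is needed in this sketch.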
Layout Structure
Compared to plain text, a web page is a 2D presentation
Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
Different parts of a page are not equally important
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD
TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN
Hyperlink:
URL: http://www.cnn.com/...
Anchor Text: Al Qaeda
Image:
URL: http://www.cnn.com/image/...
Alt & Caption: Iran nuclear
Anchor Text: CNN Homepage News
Web Page Block: Better Information Unit
Importance = Med
Importance = Low
Importance = High
Web Page Blocks
Motivation for VIPS (VIsion-based Page Segmentation)
Problems of treating a web page as an atomic unit
A web page usually contains more than pure content
Noise: navigation, decoration, interaction, etc.
Multiple topics
Different parts of a page are not equally important
A web page has internal structure
Two-dimensional logical structure & visual layout presentation
More structured than a free-text document
Less structured than a structured document
Layout: the 3rd dimension of a web page
1st dimension: content
2nd dimension: hyperlink
Is DOM a Good Representation of Page Structure?
Page segmentation using DOM
Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
DOM is more related to content display and does not necessarily reflect semantic structure
How about XML?
It still has a long way to go before it replaces HTML
VIPS Algorithm
Motivation:
In many cases, topics can be distinguished by visual cues such as position, distance, font, color, etc.
Goal:
Extract the semantic structure of a web page based on its visual
presentation.
Procedure:
Partition the web page top-down based on the separators
Result
A tree structure in which each node corresponds to a block in the page.
Each node is assigned a value (Degree of Coherence) that indicates how coherent the content in the block is, based on visual perception.
Each block will be assigned an importance value
Hierarchy or flat
VIPS: An Example
[Figure: block tree. Web Page -> VB1, VB2, ...; VB2 -> VB2_1, VB2_2, ...; VB2_2 -> VB2_2_1, VB2_2_2, VB2_2_3, VB2_2_4, ...]
A hierarchical structure of layout blocks
A Degree of Coherence (DoC) is defined for each block
It shows the intra-block coherence
The DoC of a child block must be no less than that of its parent
The Permitted Degree of Coherence (PDoC) can be pre-defined to achieve different granularities for the content structure
Segmentation stops only when the DoC of every block is no less than the PDoC
The smaller the PDoC, the coarser the content structure
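The stopping rule (keep dividing until every block's DoC is no less than the PDoC) can be sketched as a recursive walk over a block tree. This is not the VIPS implementation: the real algorithm derives blocks and DoC values from visual cues and separators, whereas the sketch assumes a hypothetical `Block` class whose DoC values and children are already given.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical layout block with a precomputed Degree of Coherence."""
    name: str
    doc: float
    children: List["Block"] = field(default_factory=list)


def segment(block: Block, pdoc: float) -> List[Block]:
    """Return the blocks at which top-down segmentation stops for a given PDoC."""
    # Stop when the block is coherent enough or cannot be divided further.
    if block.doc >= pdoc or not block.children:
        return [block]
    result: List[Block] = []
    for child in block.children:
        result.extend(segment(child, pdoc))
    return result


# Hypothetical tree; each child's DoC is no less than its parent's.
page = Block("Web Page", 1, [
    Block("VB1", 8),
    Block("VB2", 4, [
        Block("VB2_1", 9),
        Block("VB2_2", 6, [Block("VB2_2_1", 10), Block("VB2_2_2", 9)]),
    ]),
])

coarse = [b.name for b in segment(page, pdoc=4)]  # ['VB1', 'VB2']
fine = [b.name for b in segment(page, pdoc=8)]    # ['VB1', 'VB2_1', 'VB2_2_1', 'VB2_2_2']
```

Raising `pdoc` from 4 to 8 forces VB2 and VB2_2 to be divided further, which illustrates why a smaller PDoC gives a coarser content structure.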
Example of Web Page Segmentation (1)
( DOM Structure ) ( VIPS Structure )
Example of Web Page Segmentation (2)
Can be applied to web image retrieval
Surrounding text extraction
( DOM Structure ) ( VIPS Structure )
Web Page Block: Better Information Unit
Page Segmentation
Vision based approach
Block Importance Modeling
Statistical learning
Importance = Med
Importance = Low
Importance = High
Web Page Blocks
Block-based Web Search
Index blocks instead of whole pages
Block retrieval
Combining DocRank and BlockRank
Block query expansion
Select expansion terms from relevant blocks
Experiments
Dataset
TREC 2001 Web Track
WT10g corpus (1.69 million pages), crawled in 1997.
50 queries (topics 501-550)
TREC 2002 Web Track
.GOV corpus (1.25 million pages), crawled in 2002.
49 queries (topics 551-600)
Retrieval System
Okapi, with weighting function BM2500
Preprocessing
Stop-word list (about 220)
Do not use stemming
Do not consider phrase information
Tune $b$, $k_1$, and $k_3$ to achieve the best baseline
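To show where $b$, $k_1$, and $k_3$ enter the weighting function, here is a minimal sketch of the standard Okapi BM25 form. That BM2500 as used in these experiments matches this form exactly is an assumption, and the default parameter values are common textbook choices, not the tuned values from the experiments.

```python
import math


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
               b=0.75, k1=1.2, k3=1000.0):
    """Standard Okapi BM25 score of one document for one query.

    query_tf / doc_tf: term -> frequency in the query / document.
    doc_freq: term -> number of documents containing the term.
    b, k1, k3 defaults are assumed values, not the tuned ones.
    """
    # Length normalisation: b controls how strongly long documents are penalised.
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if tf == 0 or n == 0:
            continue
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        doc_part = (k1 + 1.0) * tf / (K + tf)        # saturating document term frequency
        query_part = (k3 + 1.0) * qtf / (k3 + qtf)   # saturating query term frequency
        score += idf * doc_part * query_part
    return score


# Hypothetical statistics for a one-term query.
query = {"nuclear": 1}
df = {"nuclear": 10}
high = bm25_score(query, {"nuclear": 3}, 100, 100, df, 1000)
low = bm25_score(query, {"nuclear": 1}, 100, 100, df, 1000)
none = bm25_score(query, {"iran": 2}, 100, 100, df, 1000)
```

The same function can score a block instead of a whole page by passing block-level term frequencies and lengths, which is one way to read "index blocks instead of whole pages".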
Block Retrieval on TREC 2001 and TREC 2002
[Figure, two panels (TREC 2001 Result, TREC 2002 Result): Average Precision (%) vs. combining parameter α (0 to 1); curves: VIPS (Block Retrieval), Baseline (Doc Retrieval)]
Query Expansion on TREC 2001 and TREC 2002
[Figure, two panels (TREC 2001 Result, TREC 2002 Result): Average Precision (%) vs. number of blocks/docs (3, 5, 10, 20, 30); curves: Block QE (VIPS), FullDoc QE, Baseline]
Block-level Link Analysis
A Sample of User Browsing Behavior
Improving PageRank using Layout Structure
Z: block-to-page matrix (link structure)
X: page-to-block matrix (layout structure)
Block-level PageRank:
Compute PageRank on the page-to-page graph
BlockRank:
Compute PageRank on the block-to-block graph
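$W_P = XZ$
$W_B = ZX$
$Z_{bp} = \begin{cases} 1/s_b & \text{if there is a link from the } b\text{-th block to the } p\text{-th page} \\ 0 & \text{otherwise} \end{cases}$
$X_{pb} = \begin{cases} f_p(b) & \text{if the } b\text{-th block is in the } p\text{-th page} \\ 0 & \text{otherwise} \end{cases}$
where $f$ is the block importance function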
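A small worked example may help in checking the shapes: $X$ is pages × blocks and $Z$ is blocks × pages, so $W_P = XZ$ is pages × pages and $W_B = ZX$ is blocks × blocks. The sketch assumes that $s_b$ is the number of pages block $b$ links to and that $f_p(b)$ is normalised to sum to 1 within each page. The layout, importance values, and links are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


num_pages, num_blocks = 2, 3
block_page = [0, 0, 1]               # hypothetical: blocks 0, 1 are in page 0; block 2 in page 1
importance = [0.7, 0.3, 1.0]         # hypothetical f_p(b), summing to 1 within each page
links = {0: [1], 1: [0, 1], 2: [0]}  # hypothetical: block -> pages it links to

# X: page-to-block matrix (layout structure), X[p][b] = f_p(b) if block b is in page p.
X = [[importance[b] if block_page[b] == p else 0.0 for b in range(num_blocks)]
     for p in range(num_pages)]

# Z: block-to-page matrix (link structure), Z[b][p] = 1/s_b if block b links to page p.
# s_b is assumed to be the number of pages that block b links to.
Z = [[1.0 / len(links[b]) if p in links[b] else 0.0 for p in range(num_pages)]
     for b in range(num_blocks)]

W_P = matmul(X, Z)  # page-to-page graph, 2 x 2
W_B = matmul(Z, X)  # block-to-block graph, 3 x 3
```

Under these assumptions every row of `W_P` and `W_B` sums to 1, so either matrix can be used as the transition matrix in a PageRank-style iteration.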
Using Block-level PageRank to Improve Search
[Figure: Average Precision vs. combining parameter α (0.8 to 1); curves: BLPR-Combination, PR-Combination]
Block-level PageRank achieves 15-25%
improvement over PageRank (SIGIR04)
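Search = $\alpha \cdot \mathrm{IR\_Score} + (1 - \alpha) \cdot \mathrm{PageRank}$
The PageRank term is either the standard PageRank (PR-Combination) or the block-level PageRank (BLPR-Combination)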
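The linear combination of the IR score and the link-based score can be sketched as follows. The function names and the candidate tuples are hypothetical, and the sketch assumes both scores have already been normalised to comparable ranges, which the slides do not specify.

```python
def combined_score(ir_score, pagerank, alpha):
    """alpha * IR_Score + (1 - alpha) * PageRank."""
    return alpha * ir_score + (1.0 - alpha) * pagerank


def rerank(candidates, alpha):
    """Sort (doc_id, ir_score, pagerank) tuples by combined score, best first."""
    return sorted(candidates,
                  key=lambda c: combined_score(c[1], c[2], alpha),
                  reverse=True)


# Hypothetical candidates with scores assumed to lie in [0, 1].
candidates = [("d1", 0.9, 0.1), ("d2", 0.5, 0.8)]
by_content = rerank(candidates, alpha=1.0)  # IR score only
by_links = rerank(candidates, alpha=0.0)    # link score only
```

Passing block-level PageRank values in place of PageRank gives the BLPR combination without changing the code.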
Mining Web Images Using Layout & Link Structure (ACMMM04)
Image Graph Model & Spectral Analysis
Block-to-block graph: $W_B = ZX$
Block-to-image matrix (container relation): $Y$
$Y_{ij} = \begin{cases} 1/s_i & \text{if } I_j \in b_i \\ 0 & \text{otherwise} \end{cases}$
Image-to-image graph: $W_I = Y^T W_B Y$
ImageRank
Compute PageRank on the image graph
Image clustering
Graphical partitioning on the image graph
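The image graph $W_I = Y^T W_B Y$ can be computed directly from a block-to-block graph and the container relation. In this sketch $s_i$ is assumed to be the number of images contained in block $b_i$. The block graph `W_B` and the image placement are hypothetical toy values.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


num_images = 2
# Hypothetical container relation: block -> images it contains.
images_in_block = {0: [0], 1: [], 2: [1]}

# Hypothetical block-to-block graph W_B (3 blocks).
W_B = [
    [0.0, 0.0, 1.0],
    [0.35, 0.15, 0.5],
    [0.7, 0.3, 0.0],
]

# Y: block-to-image matrix, Y[i][j] = 1/s_i if image j is in block i.
# s_i is assumed to be the number of images in block i.
Y = [[1.0 / len(images_in_block[i]) if j in images_in_block[i] else 0.0
      for j in range(num_images)]
     for i in range(len(W_B))]

# W_I: image-to-image graph, num_images x num_images.
W_I = matmul(matmul(transpose(Y), W_B), Y)
```

Here image 0 sits in block 0 and image 1 in block 2, so `W_I[0][1]` simply picks up the block-level weight from block 0 to block 2.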
ImageRank
[Figure panels: Relevance Ranking, Importance Ranking, Combined Ranking]
ImageRank vs. PageRank
Dataset
26.5 million web pages
11.6 million images
Query set
45 hot queries in Google image search statistics
Ground truth
Five volunteers were chosen to evaluate the top 100 results returned by the system (iFind)
Ranking method
$s(x) = \alpha \cdot rank_{importance}(x) + (1 - \alpha) \cdot rank_{relevance}(x)$
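The rank combination $s(x) = \alpha \cdot rank_{importance}(x) + (1 - \alpha) \cdot rank_{relevance}(x)$ and a precision-at-k check can be sketched as below. The sketch assumes that rank means a 1-based position where smaller is better, which the slides do not state. The scores and the relevant set are hypothetical.

```python
def rank_positions(scores):
    """Map each item to its 1-based rank; the highest score gets rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def combined_ranking(importance, relevance, alpha):
    """Order items by s(x) = alpha*rank_importance(x) + (1-alpha)*rank_relevance(x)."""
    r_imp = rank_positions(importance)
    r_rel = rank_positions(relevance)
    s = {x: alpha * r_imp[x] + (1.0 - alpha) * r_rel[x] for x in relevance}
    # A smaller combined rank value means a better position (assumed convention).
    return sorted(s, key=s.get)


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for x in ranking[:k] if x in relevant) / k


# Hypothetical importance (e.g. ImageRank) and relevance scores for three images.
importance = {"a": 0.9, "b": 0.1, "c": 0.5}
relevance = {"a": 0.2, "b": 0.8, "c": 0.6}

ranking = combined_ranking(importance, relevance, alpha=0.25)
p_at_2 = precision_at_k(ranking, relevant={"b"}, k=2)
```

With these toy values the combined rank values at `alpha=0.25` are 2.5 for `a`, 1.5 for `b`, and 2.0 for `c`, so `b` is ranked first.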
ImageRank vs PageRank
Image search accuracy using ImageRank and PageRank. Both of them achieved their best results at α = 0.25.
[Figure: Image search accuracy (ImageRank vs. PageRank): P@10 vs. alpha (0 to 1); curves: ImageRank, PageRank]
Summary
Web search can be improved further by mining web page layout structure
Leverage visual cues for web information analysis & information extraction
Demos:
http://www.ews.uiuc.edu/~dengcai2
Papers
VIPS demo & dll
Slide Credits
Slides in this presentation are partially based on the work of Han (textbook slides)