Module 3_1
24-02-2025 4
Coursera Courses Recommended
https://www.coursera.org/learn/social-media-analytics-introduction#modules
Evaluation Scheme (DJ19)
List of Experiments
Module 3: Structured Data Extraction
A program for extracting structured data is usually called a wrapper.
Flat List Page
Nested List Page
Detail Page
Fig. 9.3(A) contains some nested data records, which makes the problem more interesting and also harder. The first product, “Cabinet Organizers by Copco,” has two sizes (9-in. and 12-in.) with different prices. These two organizers are not at the same level as “Cabinet Organizers by Copco.”
Our objective: extract the data and produce the data table given in Fig. 9.3(B). In that table, “image 1” and “Cabinet Organizers by Copco” are repeated in the first two rows because of the nesting.
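The flattening step can be sketched in Python. This is only an illustration of how nesting leads to repeated values in the output table, not the extraction algorithm itself. The record layout, the field names (`image`, `name`, `variants`), and the price strings are hypothetical placeholders; the actual values in Fig. 9.3 are not reproduced here.

```python
def flatten(records):
    """Turn nested data records into flat table rows.

    Each record has top-level fields plus a list of nested variants.
    The top-level fields are repeated once per variant.
    """
    rows = []
    for record in records:
        for variant in record["variants"]:
            rows.append((record["image"], record["name"],
                         variant["size"], variant["price"]))
    return rows


# Hypothetical nested record; the prices are placeholders, not figure values.
records = [
    {
        "image": "image 1",
        "name": "Cabinet Organizers by Copco",
        "variants": [
            {"size": "9-in.", "price": "PRICE_A"},
            {"size": "12-in.", "price": "PRICE_B"},
        ],
    },
]

rows = flatten(records)
for row in rows:
    print(row)
```

Both printed rows start with the same image and product name, which is the repetition the table in Fig. 9.3(B) shows.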
1st Approach to Data Extraction: Wrapper Induction (Supervised Learning)
Wrapper Induction
• A wrapper induction system learns data extraction rules from a set of labeled
training examples.
• Labeling is usually done manually, which simply involves marking the data
items in the training pages/examples that the user wants to extract.
• The learned rules are then applied to extract target data from other pages with
the same mark-up encoding or the same template.
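The learn-then-apply cycle in the bullets above can be sketched as follows. This is a deliberately simplified illustration and not a specific wrapper induction system: it assumes a rule is just a pair of delimiter strings (the text immediately before and after the labeled item), and the function names `learn_rule` and `apply_rule` and the sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def learn_rule(examples):
    """Learn (left, right) delimiters from labeled (page, target) pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)          # position of the labeled item
        lefts.append(page[:start])          # everything before it
        rights.append(page[start + len(target):])  # everything after it
    return common_suffix(lefts), os.path.commonprefix(rights)


def apply_rule(rule, page):
    """Extract the item between the learned delimiters, or None."""
    left, right = rule
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical training pages that share one template; the price is labeled.
training = [
    ("<html><b>Name:</b> <i>Pan</i><br><b>Price:</b> <i>$10</i></html>", "$10"),
    ("<html><b>Name:</b> <i>Kettle</i><br><b>Price:</b> <i>$25</i></html>", "$25"),
]
rule = learn_rule(training)

new_page = "<html><b>Name:</b> <i>Mug</i><br><b>Price:</b> <i>$7</i></html>"
print(apply_rule(rule, new_page))
```

The learned delimiters only work on pages that follow the same template as the training pages, which is the restriction stated in the last bullet.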
2nd Approach to Data Extraction: Automatic Extraction (Unsupervised Learning)
A bijection is a relation between two sets such that each element of either set is paired with exactly one element of the other set.
Example:
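As a small illustration of the definition, the following Python sketch checks whether a list of pairs is a bijection between two sets. The function name `is_bijection` and the sample sets are hypothetical and chosen only for this example.

```python
def is_bijection(pairs, set_a, set_b):
    """Return True if `pairs` pairs every element of set_a with exactly one
    element of set_b, and every element of set_b with exactly one of set_a."""
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    return (
        len(left) == len(set(left))        # no element of A is used twice
        and len(right) == len(set(right))  # no element of B is used twice
        and set(left) == set(set_a)        # every element of A is paired
        and set(right) == set(set_b)       # every element of B is paired
    )


A = {1, 2, 3}
B = {"a", "b", "c"}

one_to_one = [(1, "a"), (2, "b"), (3, "c")]
not_one_to_one = [(1, "a"), (2, "a"), (3, "c")]  # "a" used twice, "b" unused

print(is_bijection(one_to_one, A, B))
print(is_bijection(not_one_to_one, A, B))
```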
Instance-Based Wrapper Learning
String Matching and Tree Matching
• The goal is to find an encoding template from a set of encoded instances of the same type, such as HTML-encoded strings.
• The key idea is to detect repeated patterns in these strings. Because HTML tags nest, the encoded instances can also be modeled as trees, specifically DOM (Document Object Model) trees.
• Both string matching and tree matching algorithms are used to find these patterns; tree matching is particularly useful because it handles the nested structure of HTML elements.
String Edit Distance
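A minimal Python sketch of string edit distance, using the standard dynamic-programming formulation. It assumes unit cost for each insertion, deletion, and substitution; the slides may use a different cost model, so treat the costs as an assumption.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string s into string t (unit costs assumed)."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and first j of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # delete all i characters
    for j in range(n + 1):
        d[0][j] = j              # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]


print(edit_distance("kitten", "sitting"))
```

The same function works on sequences of HTML tags instead of characters, for example two lists of tag names, because it only relies on indexing and equality.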
Tree Edit Distance/Tree Matching
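One common way to compare two tag trees is the Simple Tree Matching algorithm, sketched below in Python. Choosing this algorithm is an assumption, since the slide content for this topic is not available in the text. The sketch also assumes trees are represented as `(label, children)` tuples, a representation picked here for brevity. The function returns the number of node pairs in a maximum matching that preserves parent-child relationships and sibling order.

```python
def simple_tree_matching(a, b):
    """Size of a maximum matching between trees a and b.

    Each tree is a (label, children) tuple. Roots with different
    labels do not match at all.
    """
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:
        return 0
    m, n = len(children_a), len(children_b)
    # M[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b, keeping sibling order.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(children_a[i - 1], children_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


# Hypothetical tag trees for two similar table rows.
row_1 = ("tr", [("td", [("img", [])]), ("td", [("b", []), ("i", [])])])
row_2 = ("tr", [("td", [("img", [])]), ("td", [("b", [])])])

print(simple_tree_matching(row_1, row_2))
```

A high match count relative to the tree sizes suggests the two subtrees were generated from the same template.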
Building DOM Trees
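A minimal Python sketch of building a tag tree from HTML with the standard library's `html.parser`. It is an illustration under simplifying assumptions: text and attributes are ignored, only a small hand-picked set of void tags is recognized, and nodes are plain `(tag, children)` tuples. The class name `DomBuilder` and the helper `build_dom` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomBuilder(HTMLParser):
    """Builds a tree of (tag, children) tuples from start/end tags."""

    def __init__(self):
        super().__init__()
        self.root = ("document", [])
        self.stack = [self.root]   # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # attach to the current open element
        if tag not in VOID_TAGS:
            self.stack.append(node)      # it may contain children

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


html = ("<table><tr><td><img src='a.png'></td>"
        "<td><b>Name</b><br>text</td></tr></table>")
tree = build_dom(html)
print(tree)
```

Real pages often contain missing or mismatched tags, so a production system would typically rely on a browser engine or a tolerant HTML parser rather than this sketch.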
Extraction Given a List Page: Nested Data Records