Web Content Mining and NLP


Bing Liu
Department of Computer Science
University of Illinois at Chicago
liub@cs.uic.edu
http://www.cs.uic.edu/~liub
Introduction
 The Web is perhaps the single largest distributed data source in the world that is easily accessible.
 Web mining
 Web usage mining: mine usage logs and web traffic.
 Web structure mining: mine hyperlinks and communities.
 Web content mining: mine page contents.
 We focus on Web content mining.
 Still a very large topic. We will not discuss traditional tasks: Web page classification, clustering, etc.

Bing Liu, UIC 2


Different types of data
 Structured data
 The data are usually retrieved from backend databases and displayed in Web pages following some fixed templates.
 Semi-structured data
 Each page is organized in some way to some extent, usually as a hierarchy of blocks.
 Unstructured data:
 natural language text



Roadmap

 Introduction
1. Structured data extraction
2. Information integration
3. Information synthesis
4. Opinion mining
 Conclusions
(Data types covered along the way: structured data, semi-structured data, unstructured text.)



Structured Data Extraction
 A large amount of information on the Web is
contained in regularly structured data objects.
 often data records retrieved from databases.
 Important: lists of products and services.
 Applications: Gather data to provide value-
added services
 comparative shopping, object search, opinion
mining, etc.
 Two types of pages with structured data:
 List pages, and detail pages



List Page – two lists of products
[Figure: a list page containing two lists of products.]



Detail Page – detailed
description



Extraction Task: an illustration (nesting)

image 1 | Cabinet Organizers by Copco | 9-in. Round Turntable: White | ***** | $4.95
image 1 | Cabinet Organizers by Copco | 12-in. Round Turntable: White | ***** | $7.95
image 2 | Cabinet Organizers | 14.75x9 Cabinet Organizer (Non-skid): White | ***** | $7.95
image 2 | Cabinet Organizers | 22x6 Cookware Lid Rack | ***** | $19.95



Data Model and Solution
Web data model: Nested relations
 See formal definitions in (Grumbach and Mecca, ICDT-99; Liu,
Web Data Mining 2006)

Solve the problem


 Two main types of techniques
 Wrapper induction – supervised
 Automatic extraction – unsupervised
 Information that can be exploited
 Source files (e.g., Web pages in HTML)
 Represented as strings or trees

 Visual information (e.g., rendering information)



Tree and Visual information

[Figure: the HTML tag tree of a page (HTML, HEAD, BODY, TABLE, P, TBODY, TR and TD nodes), with two groups of TR/TD nodes marked as data record 1 and data record 2.]



Wrapper Induction (Muslea et al.,
Agents-99)
 Using machine learning to generate extraction rules.
 The user marks the target items in a few training pages.
 The system learns extraction rules from these pages.
 The rules are applied to extract items from other pages.
Training Examples
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

Output Extraction Rules


      Start rules:      End rules:
R1:   SkipTo(()         SkipTo())
R2:   SkipTo(-<b>)      SkipTo(</b>)
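
The two rules above can be tried out on the four training examples. The following is a minimal sketch, not the learner from the cited paper: it only applies the already-learned rules, the function names `skip_to` and `extract` are made up for illustration, and the end landmark is searched forward from the start position, which is a simplification.

```python
from typing import List, Optional, Tuple

# (start landmark, end landmark) pairs taken from rules R1 and R2 on the slide.
RULES: List[Tuple[str, str]] = [("(", ")"), ("-<b>", "</b>")]


def skip_to(text: str, landmark: str, start: int = 0) -> int:
    """Return the index just past the first `landmark` at or after `start`, or -1."""
    pos = text.find(landmark, start)
    return -1 if pos < 0 else pos + len(landmark)


def extract(text: str, rules: List[Tuple[str, str]] = RULES) -> Optional[str]:
    """Try each rule in order and return the text between its two landmarks."""
    for start_landmark, end_landmark in rules:
        begin = skip_to(text, start_landmark)
        if begin < 0:
            continue  # start rule did not fire, try the next rule
        end = text.find(end_landmark, begin)
        if end < 0:
            continue  # end landmark missing, try the next rule
        return text[begin:end]
    return None


EXAMPLES = [
    "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515",
    "90 Colfax, <b>Palms</b>, Phone (800) 508-1570",
    "523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293",
    "403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008",
]
codes = [extract(example) for example in EXAMPLES]
print(codes)
```

R1 handles the examples that write the code in parentheses (E2, E4) and R2 handles the ones that put it in bold after a hyphen (E1, E3), which is why two disjunctive rules are needed.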



Automated extraction
There are two main problem formulations:
Problem 1: Extraction based on a single
list page (Liu et al., KDD-03; Liu, 2006)
Problem 2: Extraction based on multiple
input pages of the same type (list pages or
detail pages) (Grumbach and Mecca, ICDT-99).
 Problem 1 is more general: Algorithms for solving
Problem 1 can solve Problem 2.
 Thus, we only discuss Problem 1.



Automatic Extraction: Problem 1
[Figure: a list page with two marked data regions (data region 1 and data region 2) and the data records inside them.]



Solution Techniques (Liu et al.
KDD-2003)
 Identify data regions and data records: by
finding repeated patterns
 string matching
 treat HTML source as a string
 tree matching
 treat HTML source as a tree
 Align data items: Multiple alignment
 Align items in more than two data records



String edit distance
(definition)



An example
 The edit distance matrix and back trace path
 alignment
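
The matrix and back trace from this slide did not survive extraction. As a stand-in, here is a small sketch of the usual unit-cost string edit distance (insert, delete and change each cost 1) with a back trace that recovers one alignment. The cost model and the function names are assumptions, not taken from the slide.

```python
from typing import List, Tuple


def edit_distance_matrix(s: str, t: str) -> List[List[int]]:
    """d[i][j] = edit distance between the prefixes s[:i] and t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # delete s[i-1]
                          d[i][j - 1] + 1,           # insert t[j-1]
                          d[i - 1][j - 1] + change)  # match or change
    return d


def align(s: str, t: str) -> Tuple[str, str]:
    """Follow the back trace path from d[m][n] to d[0][0]; '-' marks a gap."""
    d = edit_distance_matrix(s, t)
    i, j = len(s), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


distance = edit_distance_matrix("kitten", "sitting")[-1][-1]
aligned_s, aligned_t = align("kitten", "sitting")
print(distance)
print(aligned_s)
print(aligned_t)
```

Each column of the alignment where the two rows differ corresponds to one edit operation, so the number of differing columns equals the distance.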



Tree edit distance or tree
matching



Simple Tree Matching (Liu, Web Data Mining 2006)
 Let A = RA:〈A1, …, Ak〉 and B = RB:〈B1, …, Bn〉 be two trees, where RA and RB are their roots and Ai and Bj are their i-th and j-th first-level sub-trees.
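
The rest of this slide was an image. A hedged sketch of the recursion commonly given for Simple Tree Matching follows: if the roots differ the score is 0, otherwise the first-level sub-trees are matched in order with dynamic programming and 1 is added for the roots. Trees are represented here as `(label, children)` tuples, which is an assumption made for the example.

```python
from typing import List, Tuple

Tree = Tuple[str, list]  # (node label, list of child trees)


def leaf(label: str) -> Tree:
    """Helper for building a node without children."""
    return (label, [])


def simple_tree_matching(a: Tree, b: Tree) -> int:
    """Number of node pairs in a maximum order-preserving top-down matching."""
    if a[0] != b[0]:
        return 0  # different roots: the trees do not match at all
    kids_a, kids_b = a[1], b[1]
    k, n = len(kids_a), len(kids_b)
    # m[i][j] = best matching between the first i sub-trees of a and first j of b
    m: List[List[int]] = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            m[i][j] = max(m[i][j - 1], m[i - 1][j], m[i - 1][j - 1] + w)
    return m[k][n] + 1  # +1 for the matched roots


t1 = ("p", [leaf("b"), leaf("d")])
t2 = ("p", [leaf("b"), leaf("n"), leaf("d")])
score = simple_tree_matching(t1, t2)
print(score)
```

Here the roots and the children `b` and `d` match, while `n` has no partner.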



Multiple alignment

 Pairwise alignment is not sufficient because a web page usually contains more than two data records.
 We need multiple alignment.
 There are many existing techniques, e.g.,
 Partial tree alignment. It iteratively matches all trees. Each pairwise matching aligns only those nodes that can be matched (Zhai and Liu, WWW-05).
 It is a least commitment approach



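An example

Input trees (each has root p and the children listed):
T1: … x b d
T2: b n c k g
T3: b c d h k

Ts = T1 (seed tree): … x b d
Match T2 against Ts: no node inserted; Ts stays … x b d
Match T3 against Ts: c, h, and k inserted; new Ts: … x b c d h k
T2 is matched again; new Ts: … x b n c d h k g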
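
The example can be replayed with a much simplified, one-level sketch. It is not the algorithm of the cited paper: children are plain label lists, matching uses a longest common subsequence instead of tree matching, and a run of unmatched nodes is inserted only when its two matched neighbours are adjacent in the seed (or it sits at an end of the seed). The leading "…" of T1 is dropped, and all names are illustrative.

```python
from collections import deque
from typing import List, Tuple


def lcs_pairs(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j)); i += 1; j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def merge_into_seed(seed: List[str], tree: List[str]) -> Tuple[List[str], bool]:
    """Insert unmatched runs of `tree` whose position in `seed` is unambiguous.

    Returns the new seed and a flag telling whether some nodes were left out.
    """
    seed_of = {j: i for i, j in lcs_pairs(seed, tree)}  # tree index -> seed index
    inserts, leftover, j = [], False, 0
    while j < len(tree):
        if j in seed_of:
            j += 1
            continue
        start = j
        while j < len(tree) and j not in seed_of:
            j += 1
        left = seed_of[start - 1] if start > 0 else -1      # -1: before the seed
        right = seed_of[j] if j < len(tree) else len(seed)  # len(seed): after it
        if right == left + 1:
            inserts.append((right, tree[start:j]))  # neighbours are adjacent
        else:
            leftover = True                         # position is ambiguous
    new_seed = list(seed)
    for pos, run in sorted(inserts, reverse=True):  # back to front keeps indices valid
        new_seed[pos:pos] = run
    return new_seed, leftover


def partial_alignment(trees: List[List[str]]) -> List[str]:
    seed, queue, stalled = list(trees[0]), deque(trees[1:]), 0
    while queue and stalled < len(queue):
        tree = queue.popleft()
        new_seed, leftover = merge_into_seed(seed, tree)
        grew = len(new_seed) > len(seed)
        seed = new_seed
        if leftover:
            queue.append(tree)  # try again once the seed has grown
            stalled = 0 if grew else stalled + 1
        else:
            stalled = 0
    return seed


result = partial_alignment([list("xbd"), list("bnckg"), list("bcdhk")])
print(" ".join(result))
```

As on the slide, T2 contributes nothing on the first pass, T3 adds c, h and k, and T2 then fits on its second pass.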



Roadmap

 Introduction
1. Structured data extraction
2. Information integration
3. Information synthesis
4. Opinion mining
 Conclusions
(Data types covered along the way: structured data, semi-structured text, unstructured text.)



Information Integration
 The extracted data from different sites need to be
integrated to produce a consistent database.
 Integration means:
 Schema match: match columns in different data tables
(e.g., product names).
 Data instance match: match values, e.g., “Coke” = “Coca
Cola”?
 Unfortunately, not much research has been done so
far in this extraction context.
 Much of the research has been focused on the
integration of Web query interfaces



Web Query Interface Integration
(Wu et al., SIGMOD-04; Dragut et al., VLDB-06)
[Figure: a global query interface built on top of the query interfaces of united.com, airtravel.com, delta.com and hotwire.com.]



Constructing global query interface (QI)
 A unified query interface:
 Conciseness - Combine semantically similar fields over source interfaces
 Completeness - Retain source-specific fields
 User-friendliness – Highly related fields are close together
 Two-phased integration
 Interface Matching – Identify semantically similar fields
 Interface Integration – Merge the source query interfaces



Schema Matching as
Correlation Mining (He and Chang,
KDD-04)
 This technique needs a large number of
input query interfaces.
 Synonym attributes are negatively correlated
 they are alternatives, rarely co-occur.
 e.g., Author = writer
 Group attributes have positive correlation
 they often co-occur in query interfaces
 e.g., {Last Name, First Name}
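
To make the two kinds of correlation concrete, here is a toy sketch. It does not use the correlation measure of the cited paper; it uses a plain Jaccard co-occurrence score, and the four query interfaces, the thresholds and the function names are all invented for the example.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def cooccurrence(interfaces: List[Set[str]]) -> Dict[Pair, float]:
    """Jaccard score per attribute pair: interfaces with both / interfaces with either."""
    attrs = sorted(set().union(*interfaces))
    scores: Dict[Pair, float] = {}
    for a, b in combinations(attrs, 2):
        both = sum(1 for s in interfaces if a in s and b in s)
        either = sum(1 for s in interfaces if a in s or b in s)
        scores[(a, b)] = both / either
    return scores


def support(interfaces: List[Set[str]], attr: str) -> int:
    """Number of interfaces that contain the attribute."""
    return sum(1 for s in interfaces if attr in s)


def mine(interfaces: List[Set[str]], group_min: float = 0.8,
         min_support: int = 2) -> Tuple[List[Pair], List[Pair]]:
    scores = cooccurrence(interfaces)
    # Positive correlation: the pair nearly always appears together.
    groups = [p for p, j in scores.items() if j >= group_min]
    # Negative correlation: both are frequent but never appear together.
    synonyms = [p for p, j in scores.items()
                if j == 0.0 and all(support(interfaces, x) >= min_support for x in p)]
    return groups, synonyms


# Hypothetical book-search interfaces, each given as its set of field labels.
INTERFACES = [
    {"author", "title", "subject"},
    {"last name", "first name", "title", "category"},
    {"author", "title", "category"},
    {"last name", "first name", "title", "subject"},
]
groups, synonyms = mine(INTERFACES)
print(groups)
print(synonyms)
```

On this toy input the grouping step finds {first name, last name}, and the synonym candidates pair author with each name field and subject with category, which mirrors the slide's examples. The method needs many interfaces because both scores are unreliable on small samples.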

 Positive correlation mining as potential groups
 Mining positive correlations: {Last Name, First Name}
 Negative correlation mining as potential matchings
 Mining negative correlations: Author = {Last Name, First Name}
 Matching selection as model construction
 Author (any) = {Last Name, First Name}
 Subject = Category
 Format = Binding



A clustering approach to schema
matching (Wu et al. SIGMOD-04)
 Hierarchical modeling
 Bridging effect
 “a2” and “c2” might not look
similar themselves but they
might both be similar to “b3”
 1:m mappings
 Aggregate and is-a types
 User interaction helps in:
 learning of matching
thresholds
 resolution of uncertain
mappings



Find 1:1 Mappings via
Clustering
[Figure: the example interfaces, the initial similarity matrix, and the matrix after one merge.]

 Similarity functions
 linguistic similarity
 domain similarity

…, final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}
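
The similarity matrix itself was a figure and is lost, so the sketch below uses made-up similarity values chosen so that greedy agglomerative clustering ends in the clusters listed above. The single-link cluster similarity, the threshold, and the rule that two fields of the same interface are never merged are assumptions made for illustration, not details confirmed by the slide.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical pairwise field similarities; unlisted pairs count as 0.
SIM: Dict[Tuple[str, str], float] = {
    ("a1", "b1"): 0.9, ("b1", "c1"): 0.85, ("a1", "c1"): 0.8, ("b2", "c2"): 0.7,
}


def pair_sim(sim: Dict[Tuple[str, str], float], x: str, y: str) -> float:
    return sim.get((x, y), sim.get((y, x), 0.0))


def cluster_sim(sim: Dict[Tuple[str, str], float], c1: Set[str], c2: Set[str]) -> float:
    """Single link: similarity of the most similar pair across the two clusters."""
    return max(pair_sim(sim, x, y) for x in c1 for y in c2)


def share_interface(c1: Set[str], c2: Set[str]) -> bool:
    """Field 'a1' belongs to interface 'a' (naming convention of this example)."""
    return bool({f[0] for f in c1} & {f[0] for f in c2})


def cluster_fields(fields: List[str], sim: Dict[Tuple[str, str], float],
                   threshold: float = 0.5) -> List[Set[str]]:
    clusters: List[Set[str]] = [{f} for f in fields]
    while True:
        best: float = threshold
        best_pair: Optional[Tuple[Set[str], Set[str]]] = None
        for c1, c2 in combinations(clusters, 2):
            if share_interface(c1, c2):
                continue  # a 1:1 mapping never joins two fields of one interface
            s = cluster_sim(sim, c1, c2)
            if s >= best:
                best, best_pair = s, (c1, c2)
        if best_pair is None:
            return clusters  # nothing left above the threshold
        c1, c2 = best_pair
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)


final = cluster_fields(["a1", "a2", "b1", "b2", "b3", "c1", "c2"], SIM)
print(final)
```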



“Bridging” Effect
[Figure: three fields A, B and C; whether A matches B is unknown (?).]

Observations:
- It is difficult to match the “vehicle” field, A, with the “make” field, B
- But A’s instances are similar to C’s, and C’s label is similar to B’s
- Thus, C might serve as a “bridge” to connect A and B!

Note: Connections might also be made via labels



Complex Mappings

Aggregate type – the contents of the fields on the many side are parts of the content of the field on the one side

Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics



Complex Mappings (Cont’d)

Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side

Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics



Instance-Based Matching via
Query Probing (Wang et al., VLDB-04)
 Both query interfaces and returned results
(instances) are considered in matching.
 Assumption: A global schema (GS) and a set of
instances are given.
 The method uses each instance value (IV) of every attribute in GS to probe the underlying database and count how many times IV appears in the returned results.
 These counts are used to help matching.
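
The counting idea can be sketched on a toy source. Everything below is hypothetical: the in-memory `TOY_DATABASE` stands in for a real Web database behind a query interface, `toy_probe` stands in for submitting a query, and picking the column with the most hits is just one simple way to use the counts.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, str]

# Stand-in for a Web database whose result columns have no usable labels.
TOY_DATABASE: List[Row] = [
    {"col1": "Web Data Mining", "col2": "Bing Liu"},
    {"col1": "Data Mining: Concepts and Techniques", "col2": "Jiawei Han"},
    {"col1": "Machine Learning", "col2": "Tom Mitchell"},
]


def toy_probe(value: str) -> List[Row]:
    """Pretend keyword query: return every row with a cell containing the value."""
    return [row for row in TOY_DATABASE
            if any(value.lower() in cell.lower() for cell in row.values())]


def count_hits(values: List[str], probe: Callable[[str], List[Row]]) -> Counter:
    """For each result column, count the returned cells that contain a probed value."""
    hits: Counter = Counter()
    for value in values:
        for row in probe(value):
            for column, cell in row.items():
                if value.lower() in cell.lower():
                    hits[column] += 1
    return hits


def match_schema(global_schema: Dict[str, List[str]],
                 probe: Callable[[str], List[Row]]) -> Dict[str, str]:
    """Map each global attribute to the result column where its values show up most."""
    mapping: Dict[str, str] = {}
    for attribute, values in global_schema.items():
        hits = count_hits(values, probe)
        if hits:
            mapping[attribute] = hits.most_common(1)[0][0]
    return mapping


GLOBAL_SCHEMA = {
    "title": ["Data Mining", "Machine Learning"],
    "author": ["Bing Liu", "Tom Mitchell"],
}
mapping = match_schema(GLOBAL_SCHEMA, toy_probe)
print(mapping)
```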



Query Interface and Result Page
[Figure: a query interface and a result page, with the label “Title?” marking the field to be matched.]



The core problem

 Recognizing domain specific synonyms


 Words
 Phrases
 Other general expressions
 An NLP problem!
 Existing methods exploited both linguistic and
semi-structured information in Web pages.



Roadmap

 Introduction
1. Structured data extraction
2. Information integration
3. Information synthesis
4. Opinion mining
 Conclusions
(Data types covered along the way: structured data, semi-structured text, unstructured text.)



Information/knowledge
synthesis
 Web search paradigm:
 Given a query, a few words
 A search engine returns a ranked list of pages.
 The user then browses and reads the top-ranked
pages to find what s/he wants.
 Sufficient for navigational queries
 if one is looking for a specific piece of information,
e.g., homepage of a person, a paper.
 Not sufficient for informational queries
 open-ended research or exploration

Information synthesis: a growing trend
 Problems with individual pages:
 Bias
 incompleteness
 A growing trend among web search engines: go
beyond the traditional paradigm of presenting a
list of ranked pages to provide more varied, and
comprehensive information about a search topic.
 To provide unbiased and more complete info:
 Find and integrate related bits and pieces:
 Information synthesis!



Bing search of “cell phone”



Mining a book (Liu et al WWW-2003, Nitin
et al, coming)
 Traditionally, when one wants to learn about a topic,
 one reads a book or a survey paper.
 Learning in-depth knowledge of a topic from the Web
is becoming increasingly popular.
 Web’s convenience,
 richness of information and diversity
 For emerging topics, it may be essential - no book.
 Can we help such learning by mining “a book” from
the Web given a topic?
 Knowledge in a book is well organized:
 Table of Contents
 Detailed description pages



An example
 Given the topic “data mining”, can the system produce the following concept hierarchy?
 Classification
     Decision trees
         … (Web pages containing the descriptions of the topic)
     Naïve Bayes
         …
     …
 Clustering
     Hierarchical
     Partitioning
     K-means
     ….
 Association rules
 Sequential patterns
 …



Exploiting information
redundancy
 Web information redundancy: many Web pages contain similar information.
 Observation 1: If some phrases are mentioned in a number of pages, they are likely to be important concepts or sub-topics of the given topic.
 This means that we can use data mining to find concepts and sub-topics:
 What are candidate words or phrases that may represent concepts or sub-topics?



Each Web page is already
organized
 Observation 2: The contents of most Web pages are
already organized.
 Different levels of headings
 Emphasized words and phrases
 They are indicated by various HTML emphasizing tags,
e.g., <H1>, <H2>, <H3>, <B>, <I>, etc.
 We utilize existing page organizations to find a global
organization of the topic.
 Cannot rely on only one page because it is often incomplete and mainly focuses on what the page authors are familiar with or are working on.
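
A minimal sketch of collecting the emphasized phrases of one page, using Python's standard `html.parser` module. Only the tags named on this slide are tracked, and the class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import List

# Emphasizing tags named on the slide (HTMLParser reports tag names in lower case).
EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "i"}


class EmphasisCollector(HTMLParser):
    """Collect the text found inside emphasizing tags, one phrase per outermost tag."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0                 # how many emphasizing tags are currently open
        self.buffer: List[str] = []
        self.phrases: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())  # normalize whitespace
                if text:
                    self.phrases.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def emphasized_phrases(html: str) -> List[str]:
    collector = EmphasisCollector()
    collector.feed(html)
    collector.close()
    return collector.phrases


PAGE = ("<h1>Data Mining</h1><p>We study <b>decision trees</b> and "
        "<i>k-means</i>.</p><h2>About <b>clustering</b></h2>")
phrases = emphasized_phrases(PAGE)
print(phrases)
```

Phrases collected this way from many pages are the raw material for finding a global organization of the topic.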



Using language patterns to find
sub-topics
 Certain syntactic language patterns express
some relationship of concepts.
 The following patterns represent hierarchical
relationships, concepts and sub-concepts:
 Such as
 For example (e.g.,)
 Including
 E.g., “There are many clustering techniques
(e.g., hierarchical, partitioning, k-means, k-
medoids).”

PANKOW (Cimiano et al., WWW-04) and KnowItAll (Etzioni et al., WWW-04)
 Linguistic patterns, first 4 from (Hearst COLING-92):

1: <concept>s such as <instance>
2: such <concept>s as <instance>
3: <concept>s, (especially | including) <instance>
4: <instance> (and | or) other <concept>s
5: the <instance> <concept>
6: the <concept> <instance>
7: <instance>, a <concept>
8: <instance> is a <concept>
…
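
Pattern 1 can be approximated with a regular expression. This is only a rough sketch: the cited systems rely on proper linguistic processing, whereas here the concept is taken to be the one or two words before "such as" and the instance list is assumed to run to the end of the sentence. The function name is made up.

```python
import re
from typing import List, Tuple

# "<concept>s such as <instance>, <instance> and <instance>"
SUCH_AS = re.compile(r"\b((?:[A-Za-z-]+ )?[A-Za-z-]+s) such as ([^.;()]+)")
# Separators inside the instance list: commas, "and", "or".
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def such_as_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (concept, instance) pairs found with pattern 1."""
    pairs: List[Tuple[str, str]] = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1)
        for instance in LIST_SEPARATOR.split(match.group(2)):
            instance = instance.strip()
            if instance:
                pairs.append((concept, instance))
    return pairs


SENTENCE = ("There are many clustering techniques such as hierarchical, "
            "partitioning, k-means and k-medoids.")
pairs = such_as_pairs(SENTENCE)
print(pairs)
```

Without a noun-phrase chunker the regular expression will mis-segment longer concepts or lists that are followed by more words, so a real system would need more than this.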

Put them together
1. Crawl the set of pages (a set of given documents)
2. Identify important phrases using
   a. HTML emphasizing tags, e.g., <h1>,…,<h4>, <b>, <strong>, <big>, <i>, <em>, <u>, <li>, <dt>.
   b. Language patterns.
3. Perform data mining (frequent itemset mining) to find frequent itemsets (candidate concepts)
    Data mining can weed out peculiarities of individual pages to find the essentials.
4. Eliminate unlikely itemsets (using heuristic rules).
5. Rank the remaining itemsets, which are main concepts.
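
The frequent itemset step can be illustrated with a brute-force sketch. It assumes each emphasized phrase is one transaction whose items are its words, and it enumerates small itemsets directly instead of using a pruning algorithm such as Apriori. The phrases and the support threshold are invented.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Itemset = Tuple[str, ...]


def frequent_itemsets(transactions: List[List[str]], min_support: int,
                      max_size: int = 2) -> Dict[Itemset, int]:
    """Count every itemset of up to `max_size` words and keep the frequent ones."""
    counts: Counter = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, duplicates removed
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical emphasized phrases gathered from several pages, split into words.
PHRASES = [
    ["decision", "trees"],
    ["decision", "trees", "learning"],
    ["naive", "bayes"],
    ["naive", "bayes"],
    ["decision", "trees"],
    ["clustering"],
]
frequent = frequent_itemsets(PHRASES, min_support=2)
print(frequent)
```

Itemsets seen on only one page, such as "learning" or "clustering" here, drop out, which is the sense in which mining weeds out the peculiarities of individual pages.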



Additional techniques
 Segment a page into different sections.
 Find sub-topics/concepts only in the appropriate sections.
 Mutual reinforcements:
 Using sub-concepts search to help each other
 …
 Finding definition of each concept using syntactic
patterns (again)
 {is | are} [adverb] {called | known as | defined as} {concept}
 {concept} {refer(s) to | satisfy(ies)} …
 {concept} {is | are} [determiner] …
 {concept} {is | are} [adverb] {being used to | used to | referred to |
employed to | defined as | formalized as | described as |
concerned with | called} …
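
The first two definition patterns translate fairly directly into regular expressions. In this sketch the optional [adverb] slot is approximated by a single word ending in "-ly", which is an assumption; the other patterns are left out, and the function names are illustrative.

```python
import re
from typing import List, Pattern


def definition_patterns(concept: str) -> List[Pattern[str]]:
    """Build regular expressions for the first two patterns and one concept."""
    c = re.escape(concept)
    return [
        # {is | are} [adverb] {called | known as | defined as} {concept}
        re.compile(rf"\b(?:is|are)\s+(?:\w+ly\s+)?(?:called|known as|defined as)\s+{c}\b",
                   re.IGNORECASE),
        # {concept} {refer(s) to | satisfy(ies)} ...
        re.compile(rf"\b{c}\s+(?:refers? to|satisf(?:y|ies))\b", re.IGNORECASE),
    ]


def is_definition(sentence: str, concept: str) -> bool:
    """True if the sentence matches one of the implemented definition patterns."""
    return any(p.search(sentence) for p in definition_patterns(concept))


print(is_definition("The discovery of patterns in data is commonly called data mining.",
                    "data mining"))
print(is_definition("Data mining refers to extracting patterns from data.", "data mining"))
print(is_definition("Data mining is hard.", "data mining"))
```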



Some concepts extraction results

Data Mining: Clustering, Classification, Data Warehouses, Databases, Knowledge Discovery, Web Mining, Information Discovery, Association Rules, Machine Learning, Sequential Patterns

Web Mining: Web Usage Mining, Web Content Mining, Data Mining, Webminers, Text Mining, Personalization, Information Extraction, Semantic Web Mining, XML, Mining Web Data

Classification: Neural networks, Trees, Naive bayes, Decision trees, K nearest neighbor, Regression, Neural net, Sliq algorithm, Parallel algorithms, Classification rule learning, ID3 algorithm, C4.5 algorithm, Probabilistic models

Clustering: Hierarchical, K means, Density based, Partitioning, K medoids, Distance based methods, Mixture models, Graphical techniques, Intelligent miner, Agglomerative, Graph based algorithms



The core problems

 Recognize key concepts in a domain
 Discover their relationships
 Mainly hierarchical relations
 Recognize domain specific synonyms
 Existing methods exploit structures or organizations in a page and language patterns.



Roadmap

 Introduction
1. Structured data extraction
2. Information integration
3. Information synthesis
4. Opinion mining
 Conclusions
(Data types covered along the way: structured data, semi-structured text, unstructured text.)



Opinion mining
 We now move to unstructured text on the Web.
 A major line of Web content mining research is to extract specific types of information from text in Web pages.
 Factual information, e.g.,
 Extract unreported side effects of drugs from Web pages.
 Extract infectious diseases from online news.
 Extract economic data from reports of different countries.
 Opinions
 We focus on this topic as the Web has enabled the task. There
is also a growing interest in this topic.
 It is useful to everyone: individuals and organizations.



Word-of-Mouth on the Web
 The Web has dramatically changed the way that
people express their opinions. One can
 post reviews of products at merchant sites, and
 express opinions on almost anything in forums, discussion
groups, and blogs, which are collectively called the user
generated content.
 Opinion mining or sentiment analysis aims to extract
and summarize opinions
 Benefits:
 Potential Customer: No need to read many reviews, etc.
 Product manufacturer: market intelligence, product
benchmarking.



Sentiment Classification of Reviews
(Turney, ACL-02; Pang et al., EMNLP-02; ……)
 Classify reviews based on the overall sentiment expressed by authors, i.e.,
 Positive or negative
 Related to but different from traditional topic-based text
classification.
 Here the opinion words (e.g., great, beautiful, bad, etc) are
important, not topic words.
 Some representative techniques
 Use opinion phrases
 Use traditional text classification method
 Use a custom-designed score function
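
As a toy illustration of why opinion words matter more than topic words, the sketch below classifies a review by counting words from two tiny hand-made word lists. It is not the method of either cited paper; the word lists and the scoring rule are assumptions made for the example.

```python
import re

# Tiny illustrative opinion-word lists; a real system would need far larger ones.
POSITIVE = {"great", "beautiful", "amazing", "good"}
NEGATIVE = {"bad", "hazy", "blurry", "boring"}


def classify_review(text: str) -> str:
    """Score = positive opinion words minus negative opinion words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_review("The pictures coming out of this camera are amazing."))
print(classify_review("Pictures were blurry and the menu is bad."))
print(classify_review("It arrived on Tuesday."))
```

Topic words such as "camera" or "pictures" play no role in the score, which is the contrast with topic-based classification drawn above.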



Feature-Based Opinion Summarization
(Hu and Liu, KDD-04)
 Sentiment classification does not find what exactly consumers liked or disliked.
 You may say that people can read reviews, but
 In online shopping, a lot of people write reviews
 Time consuming and boring to read all the reviews
 How?
 Opinion summarization is a natural solution
 What is an effective summary?

A Review Example and a Summary

Review:
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.
The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …
….

Summary:
Feature1: picture
Positive: 12
 The pictures coming out of this camera are amazing.
 Overall this is a good camera with a really good picture clarity.
…
Negative: 2
 The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
 Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
…



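Visual Summarization & Comparison
(Liu et al., WWW-05)
 Summary of reviews of Digital camera 1
[Figure: chart of positive (+) and negative (_) opinions for the features Picture, Battery, Zoom, Size and Weight.]
 Comparison of reviews of Digital camera 1 and Digital camera 2
[Figure: the same kind of chart (+ / _) comparing the two cameras.]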

Mining Tasks
(Hu and Liu, KDD-04; Liu, Web Data Mining book 2006)
Task 1: Identifying and extracting object
features that have been commented on in
each review.
Task 2: Determining whether the opinions on
the features are positive, negative or neutral.
Task 3: Grouping synonym features.
 Produce a feature-based opinion summary.
 A structured and quantitative summary.
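
Once Tasks 1 and 2 have produced (feature, orientation, sentence) triples, Task 3 and the final summary are mostly bookkeeping. The sketch below assumes those triples are already given; the synonym table and the battery-life sentence are invented, while the picture sentences come from the review example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical synonym table for Task 3: surface form -> canonical feature name.
SYNONYMS = {"photo": "picture", "image": "picture"}

Opinion = Tuple[str, str, str]  # (feature, orientation, sentence)


def summarize(opinions: List[Opinion]) -> Dict[str, Dict[str, int]]:
    """Group opinions by canonical feature and count positive and negative sentences."""
    grouped = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, orientation, sentence in opinions:
        canonical = SYNONYMS.get(feature, feature)
        if orientation in ("positive", "negative"):  # neutral opinions are skipped
            grouped[canonical][orientation].append(sentence)
    return {feature: {o: len(s) for o, s in by_orientation.items()}
            for feature, by_orientation in grouped.items()}


OPINIONS: List[Opinion] = [
    ("picture", "positive", "The pictures coming out of this camera are amazing."),
    ("photo", "positive", "Overall this is a good camera with a really good picture clarity."),
    ("picture", "negative", "The pictures come out hazy if your hands shake."),
    ("battery life", "negative", "(made-up sentence) The battery runs out quickly."),
    ("size", "neutral", "(made-up sentence) It is a small camera."),
]
summary = summarize(OPINIONS)
print(summary)
```

The per-feature counts are what make the summary quantitative, and they are the numbers a bar-chart comparison of two products would plot.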



Existing Research
 Current algorithms are combinations of
 Natural language processing (NLP) methods, and
 Part-of-speech tagging, parsing, etc.
 Pre-compiled opinion words and phrases.
 Data mining or machine learning techniques.
 Opinion mining is a fascinating problem
 Technically very challenging. It is NLP!
 It touches every aspect of NLP, yet it is confined/targeted
 20-60 companies working on it in USA alone.
 We will discuss it in more detail tomorrow.

Roadmap

 Introduction
1. Structured data extraction
2. Information integration
3. Information synthesis
4. Opinion mining
 Conclusions
(Data types covered along the way: structured data, semi-structured text, unstructured text.)



Conclusions
 We briefly discussed:
 Structured data extraction
 Information integration
 Information synthesis
 Opinion mining
 The tasks look different, but there is a common theme:
 Extraction and integration
 All are related to and need some level of NLP.
 Integration has been regarded as the most difficult
task by database researchers.
 Core problem: recognizing domain “synonym”: words, phrases
and expressions

