Web Mining
Opportunities and Challenges
The Web offers unprecedented opportunities and
challenges for data mining:
The amount of information on the Web is huge and easily
accessible.
The coverage of Web information is very wide and diverse. One
can find information about almost anything.
Information and data of almost all types exist on the Web, e.g.,
structured tables, text, multimedia data, etc.
Much of the Web information is semi-structured due to the
nested structure of HTML code.
Much of the Web information is linked. There are hyperlinks
among pages within a site, and across different sites.
Much of the Web information is redundant. The same piece of
information or its variants may appear in many pages.
The Web is noisy. A Web page typically contains a mixture of
many kinds of information, e.g., main contents,
advertisements, navigation panels, copyright notices, etc.
The Web is also about services. Many Web sites and pages
enable people to perform operations with input parameters,
i.e., they provide services.
The Web is dynamic. Information on the Web changes
constantly. Keeping up with the changes and monitoring the
changes are important issues.
Above all, the Web is a virtual society. It is not only about
data, information and services, but also about interactions
among people, organizations and automatic systems, i.e.,
communities.
Data Mining vs. Web Mining
Traditional data mining
• Data is structured and relational
• Well-defined tables, columns, rows, keys, and constraints
Web data
• Semi-structured and unstructured
• Readily available
• Rich in features and patterns
Web mining may be divided into
three categories:
• Web content mining
• Web structure mining
• Web usage mining
Web Content Mining
:: example – clustered search results
Users can drill down within clusters to view sub-topics or to view the
relevant subset of results.
Web Content Mining
:: example – personalized content delivery
Google's personalized news is an example of a content-based
recommender system, which recommends items (in part) based on the
similarity of their content to a user's profile (gathered from search
and click history).
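The idea of matching item content against a user profile can be sketched as follows. This is a minimal illustration, not a description of how Google's system works: it assumes the profile is just the summed bag-of-words of previously clicked documents and that similarity is cosine similarity over raw term counts. The function names (`term_vector`, `cosine`, `recommend`) and the sample texts are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(history, candidates, top_k=2):
    """Rank candidate items by content similarity to the user's profile.

    Assumption: the profile is the sum of the term vectors of the
    documents in the user's click/search history.
    """
    profile = Counter()
    for doc in history:
        profile.update(term_vector(doc))
    scored = [(cosine(profile, term_vector(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Hypothetical click history and candidate headlines.
history = ["python data mining tutorial", "web mining with python"]
candidates = [
    "football match results",
    "data mining on the web",
    "cooking pasta recipes",
]
print(recommend(history, candidates, top_k=1))
```

A realistic system would typically weight terms (e.g., with TF-IDF) and remove stop words; raw counts are used here only to keep the sketch short.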
Applications of Web content mining:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools
Web-Structure Mining
Generates a structural summary of the Web site
and its Web pages:
• Discovering the structure of Web pages.
• Discovering useful patterns in the hyperlink
structure connecting Web sites or Web resources.
• Discovering the nature of the hierarchy of hyperlinks in
the Web site and its structure.
• Finding information about Web pages.
• Retrieving information about the relevance and
quality of a Web page.
Basic idea of PageRank:
• The rank of a page depends on the ranks of the pages
pointing to it.
• The out-degree of a page is the number of edges
pointing away from it; it is used to compute the
contribution of the page to the pages it points to.
• The final PageRank value represents the probability
that a random surfer will reach the page.
• d is the probability that a random surfer jumps to the
page directly rather than getting there via navigation.
[Figure: Illustration of PageRank propagation]
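The propagation described above can be sketched in a few lines. The slides do not give the update formula, so this sketch assumes the common form PR(p) = d/n + (1 - d) * sum over pages q linking to p of PR(q) / outdegree(q), with d read as the direct-jump probability defined above. Many presentations use the opposite convention, where d is the probability of following a link (typically 0.85). The function name `pagerank`, the default values, and the treatment of pages without outgoing links are assumptions made for illustration.

```python
def pagerank(links, d=0.15, iterations=100):
    """Iteratively estimate PageRank for a small link graph.

    links: dict mapping each page to the list of pages it links to.
    d: probability of jumping directly to a random page
       (assumed convention; see the lead-in).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the direct-jump share d / n.
        new_rank = {p: d / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly over its out-degree.
                share = (1 - d) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links spreads its
                # rank evenly over all pages.
                share = (1 - d) * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical three-page site: A links to B and C, B to C, C to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

Because each iteration redistributes the full rank mass, the values keep summing to 1 and can be read as the random surfer's visit probabilities.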
In general, four kinds of data mining techniques are
applied in the Web mining domain to discover
user navigation patterns:
• Association rule mining
• Sequential pattern mining
• Clustering
• Classification
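As an illustration of the first technique, the sketch below derives simple association rules of the form "visitors of page X also visit page Y" from user sessions, using support and confidence. It is a minimal example restricted to page pairs, not a full Apriori implementation. The function name `pair_rules`, the thresholds, and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Find rules X -> Y between pairs of pages visited in one session.

    support(X, Y)  = fraction of sessions containing both X and Y
    confidence     = sessions with both X and Y / sessions with X
    """
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        # Count each page once per session; visit order is ignored.
        pages = sorted(set(session))
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions.
        for x, y in ((a, b), (b, a)):
            confidence = count / page_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


# Hypothetical sessions: each is the list of pages one user visited.
sessions = [
    ["home", "products", "cart"],
    ["home", "products"],
    ["home", "about"],
    ["products", "cart"],
]
for x, y, support, confidence in pair_rules(sessions):
    print(f"{x} -> {y} (support={support:.2f}, confidence={confidence:.2f})")
```

Sequential pattern mining differs in that it keeps the order of page visits, which this sketch deliberately discards.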
Applications of Web Mining
With the rapid growth of the World Wide Web, Web mining has become a
very active topic in Web research. E-commerce and e-services
are claimed to be the killer applications for Web mining.
Web mining now also plays an important role in helping e-commerce
websites and e-services understand how their
sites and services are used, and provide better services to
their customers and users.