Web Mining
Mining the World-Wide Web
[Figure: Web log mining pipeline. Web log -> (1) Data Cleaning -> Database -> (2) Data Cube Creation -> Data Cube -> (3) OLAP -> Sliced and diced cube -> (4) Data Mining -> Knowledge]
[Figure: Number of hosts over time. x-axis: Sep-69 to Sep-99; y-axis: Hosts, 0 to 40,000,000]
Web Mining: A more challenging task
Searches for
Web access patterns
Web structures
Regularity and dynamics of Web contents
Problems
The “abundance” problem
Limited coverage of the Web: hidden Web
sources, majority of data in DBMS
Limited query interface based on keyword-oriented search
Limited customization to individual users
What is Web Mining?
Cont…
Web Content Mining: application of
data mining techniques to unstructured or
semi-structured text, typically HTML
documents.
Web Structure Mining: use of the
hyperlink structure of the web as an
additional information source.
Web Usage Mining: analysis of user
interaction with a web server.
Usage patterns
Number of visitors
Popularity, e.g., of products, movies, music
WEB CONTENT MINING
Web Content Mining
Web content consists of
several types of data, such as:
Textual
Image
Audio
Video
Metadata, as well as
Hyperlinks.
Cont…
Recent research on mining multiple types
of data is termed multimedia data
mining.
The textual part of web content data
consists of unstructured data such as
free text, semi-structured data such as
HTML documents, and more structured
data such as data in tables or
database-generated HTML.
Most of the web content data is
unstructured, free text data.
Cont…
As a result, the techniques of text
mining can be directly employed for
web content mining in such cases.
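As a minimal sketch of applying a text mining step to web content, the example below strips the markup from an HTML page and counts term frequencies. It uses only the Python standard library. The names `TextExtractor`, `term_frequencies`, and `SAMPLE_PAGE` are hypothetical and were invented for this illustration, and the sample page is made-up data.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def term_frequencies(html):
    """Return a Counter of lowercase alphabetic terms found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return Counter(re.findall(r"[a-z]+", text))


# Hypothetical sample page, invented for illustration.
SAMPLE_PAGE = """
<html><head><title>Web mining</title><style>p {color: red}</style></head>
<body><h1>Web Mining</h1>
<p>Mining the web means mining content, structure and usage.</p></body></html>
"""

freqs = term_frequencies(SAMPLE_PAGE)
print(freqs.most_common(3))
```

A real pipeline would typically add stop-word removal and stemming before any classification or clustering step; those are omitted here to keep the sketch short.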
Web Content Mining
It describes the discovery of useful
information from the web content.
The web contains many kinds of data, such
as:
Government information, which is gradually
being placed on the web.
Many commercial institutions are
moving their business and services
online.
Cont…
Digital libraries that are also
accessible from the web.
Another type of web content that cannot
be ignored is web applications, which
users access through web interfaces.
Many applications are being migrated to
the web, and
many types of applications are emerging
in the web environment itself.
Web Mining
• Content: text & multimedia mining
• Structure: link analysis, graph
mining
• Usage: log analysis, query mining
• Relate all of the above
– Web characterization
– Particular applications
WEB STRUCTURE MINING
Web Structure Mining
Web Structure Mining is concerned with
discovering the model underlying the link
structure of the web.
It is used to study the topology of the
hyperlinks, with or without the description of
the links.
This model is used to categorize web
pages.
It is useful for generating information such as the
similarity and relationship between
different web sites.
Cont…
Web structure mining is also used to discover
authority sites for a subject and
overview (or hub) sites
that point to many authorities.
It is seen that Web content mining
attempts to explore the structure within
a document (intra-document structure).
Web structure mining studies the
structure of documents within the web
itself (inter-document structure).
Cont…
Some algorithms used to model web
topology are:
1. HITS
2. PageRank
3. CLEVER
These models are applied to calculate
the quality rank or relevance of
each web page.
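The hub and authority idea behind HITS can be sketched in a few lines. This is a simplified illustration, not a full implementation: it runs on a hand-made link graph, whereas HITS is normally applied to a query-specific subgraph. The function name `hits`, the graph `LINKS`, and its page names are assumptions made for this example.

```python
def hits(graph, iterations=50):
    """Simplified HITS. `graph` maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[d] for d in graph.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: two overview pages pointing at two content pages.
LINKS = {
    "portal": ["paperA", "paperB"],
    "blog": ["paperA"],
    "paperA": [],
    "paperB": [],
}

hub_scores, auth_scores = hits(LINKS)
print("best hub:", max(hub_scores, key=hub_scores.get))
print("best authority:", max(auth_scores, key=auth_scores.get))
```

In this toy graph the page with the most outgoing links to well-cited pages gets the highest hub score, and the page cited by both overview pages gets the highest authority score.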
Techniques used in modeling topology
PageRank:
Here the importance of a document
is measured by counting citations or
backlinks to the document.
This gives some approximation of
a document’s importance or quality.
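The backlink idea can be illustrated with a small power-iteration sketch. Note that the usual formulation of PageRank does more than count backlinks: each backlink is weighted by the rank of the linking page divided by that page's number of outgoing links. The damping factor of 0.85, the even redistribution of rank from pages without outgoing links, and the names `pagerank` and `LINKS` are assumptions chosen for this example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration sketch. `graph` maps each page to the pages it links to."""
    pages = sorted(set(graph) | {q for targets in graph.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = graph.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it cites.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: page "c" has three backlinks.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "a" has only one backlink, but it comes from the highly ranked page "c", so "a" ends up ranked above "b" and "d", which have no backlinks at all.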
Cont…
Social network analysis:
It is another way of studying the web
link structure.
Web structure mining utilizes the
hyperlink structure of the web to apply
social network analysis.
Social network analysis studies ways to measure
the relative standing or importance of
individuals in the network.
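One simple measure of relative standing is in-degree centrality: the fraction of other nodes that link to a given node. The sketch below applies it to a small link graph. It is only one of several possible measures, and the function name `in_degree_centrality` and the site names are hypothetical.

```python
def in_degree_centrality(graph):
    """Fraction of the other nodes that link to each node."""
    nodes = set(graph) | {q for targets in graph.values() for q in targets}
    counts = {p: 0 for p in nodes}
    for src, targets in graph.items():
        for dst in set(targets):  # count each linking node only once
            if dst != src:        # ignore self-links
                counts[dst] += 1
    denom = max(len(nodes) - 1, 1)
    return {p: c / denom for p, c in counts.items()}


# Hypothetical toy network of four web sites.
SITE_LINKS = {
    "siteA": ["siteC"],
    "siteB": ["siteC", "siteA"],
    "siteC": [],
    "siteD": ["siteC"],
}

centrality = in_degree_centrality(SITE_LINKS)
for site in sorted(centrality, key=centrality.get, reverse=True):
    print(site, round(centrality[site], 2))
```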
WEB USAGE MINING
Web usage mining
Web usage mining deals with
studying the data generated by
web surfers’ sessions or behavior.
Web content and structure mining
utilize the real or primary data on
the web,
whereas web usage mining mines the
secondary data derived from the
interactions of users with the
web.
Cont…
The secondary data includes data from:
Web server access logs
Proxy server logs
Browser logs
User profiles
Registration data
User sessions or transactions
Cookies
User queries
Bookmark data
Mouse clicks and scrolls, and
any other data that result
from these interactions.
Cont…
This data can be accumulated by the
web server.
Analysis of the web access logs of
different web sites can facilitate an
understanding of user behavior
and of the web structure.
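As a rough sketch of what such log analysis can look like, the example below parses a few access log lines and derives two of the usage measures mentioned earlier: the number of visitors and the popularity of pages. It assumes the logs are in the Common Log Format and uses the number of distinct client hosts as a stand-in for visitors. The log lines, `LOG_PATTERN`, and `summarize` are invented for this illustration.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical sample log, invented for illustration.
SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products/42 HTTP/1.0" 200 512
10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /products/42 HTTP/1.0" 200 512
10.0.0.3 - - [10/Oct/2000:13:58:40 -0700] "GET /missing HTTP/1.0" 404 -
this line is malformed and is skipped during cleaning
"""


def summarize(log_text):
    """Return (number of distinct hosts, Counter of successfully served paths)."""
    hosts = set()
    page_hits = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # data cleaning: drop lines that do not parse
        hosts.add(match.group("host"))
        if match.group("status").startswith("2"):
            page_hits[match.group("path")] += 1
    return len(hosts), page_hits


visitors, hits_per_page = summarize(SAMPLE_LOG)
print("distinct hosts:", visitors)
print("most requested:", hits_per_page.most_common(1))
```

Counting hosts undercounts or overcounts real visitors when proxies or shared addresses are involved, which is one reason cookies, registration data, and session identification appear in the list of secondary data sources.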
Size of the Web
Number of pages
Technically, infinite
Much duplication (30-40%)
Best estimate of “unique” static HTML
pages comes from search engine claims
Until last year, Google claimed 8 billion(?),
Yahoo claimed 20 billion
Google recently announced that their index
contains 1 trillion pages
How to explain the discrepancy?
Trends in Data Mining
Summary
Mining Multimedia Databases
A multimedia database system stores
and manages a large collection of
multimedia objects, such as:
Audio data
Image data
Video data
Sequence data and
Hypertext data,
which contains text, text markup, and
linkages.
Cont…
Multimedia database systems are
increasingly common to the popular
use of
Important Questions
Write short notes on
Mining Spatial Databases
Mining Multimedia Databases
Mining Time-Series and Sequence Data
Mining Text Databases
Mining the World-Wide Web
Data Mining Applications
Social Impact of Data Mining
Trends in Data Mining.
UNIT-V