Unit 5 DM

Web mining
□ Web mining - the application of data mining techniques to automatically discover and extract information from Web documents/services.
□ Web mining research integrates work from several research communities, such as:
Database (DB)
Information retrieval (IR)
The sub-areas of machine learning (ML)
Natural language processing (NLP)
Mining the World-Wide Web
WWW is huge, widely distributed, global
information source for
□ – Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
□ – Hyperlink information
□ – Access and usage information
□ – Web site contents and organization
Mining the World-Wide Web
□ Growing and changing very rapidly
– Broad diversity of user communities
□ Only a small portion of the information on
the Web is truly relevant or useful to Web
users
– How to find high-quality Web pages on
a specified topic?
□ WWW provides rich sources for data
mining
Challenges on WWW Interactions
□ Finding Relevant Information
□ Creating knowledge from Information
available
□ Personalization of the information
□ Learning about customers / individual
users.
Web Mining can play an important
Role!
Web Mining: more challenging

□ Searches for:
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
□ Problems:
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources; the majority of data is in DBMSs
– Limited query interfaces based on keyword-oriented search
– Limited customization to individual users
– Dynamic and semi-structured content
Web Mining : Subtasks
□ Resource Finding
– Task of retrieving intended web-documents
□ Information Selection & Pre-processing
– Automatic selection and pre-processing of specific information from retrieved web resources
□ Generalization
– Automatic Discovery of patterns in web sites
□ Analysis
– Validation and / or interpretation of mined
patterns
Web Mining Taxonomy
Web Mining

□ Web Content Mining
□ Web Usage Mining
□ Web Structure Mining
Web Content Mining
□ Discovery of useful information from web
contents / data / documents
– Web data contents: text, image, audio, video,
metadata and hyperlinks.
• Information Retrieval View ( Structured +
Semi-Structured)
– Assist / Improve information finding
– Filtering Information to users on user profiles
• Database View
– Model Data on the web
– Integrate them for more sophisticated queries
Issues in Web Content Mining
□ Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and
collocations
- Hypertext classification/categorization.
- Extracting key phrases from text documents.
- Learning extraction models/rules.
- Hierarchical clustering.
- Predicting relationships between words.
Issues in Web Content Mining
□ Developing Web query systems
– WebOQL, XML-QL
□ Mining multimedia data
- Mining images from satellite data
- Mining images to identify small volcanoes on Venus.
Web Usage Mining
□ Web usage mining is also known as Web log mining
– mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of users while surfing the web
Web Usage Mining
Applications
– Target potential customers for electronic
commerce
– Enhance the quality and delivery of Internet
information services to the end user
– Improve Web server system performance
– Identify potential prime advertisement
locations
– Facilitate personalization / adaptive sites
– Improve site design
– Fraud/intrusion detection
– Predict user’s actions (allows prefetching)
□ Web Usage Mining
□ Web usage mining is used for mining
the web log records (access
information of web pages) and helps
to discover the user access patterns
of web pages.
□ The Web server registers a web log entry for every web page access.
□ Analysis of similarities in web log records can be useful to identify users with similar access patterns.
Some of the techniques to discover and analyze web usage patterns are:

i) Session and visitor analysis
The analysis of preprocessed data can be performed as session analysis, which includes records of visitors, days, sessions, etc. This information can be used to analyze the behavior of visitors.
□ A report is generated after this analysis, which contains the details of frequently visited web pages.
□ ii) OLAP (Online Analytical Processing)
OLAP performs multidimensional analysis of complex data.
□ OLAP can be performed on different
parts of log related data in a certain
interval of time.
□ The OLAP tool can be used to derive
the important business intelligence
metrics.
Problems with Web Logs
□ Identifying users
– Clients may have multiple streams
– Clients may access web from multiple hosts
– Proxy servers: many clients/one address
– Proxy servers: one client/many addresses
□ Data not in log
– POST data (i.e., CGI request) not recorded
– Cookie data stored elsewhere
Problems with Web Logs
□ Missing data
– Pages may be cached
– Referring page requires client cooperation
– When does a session end?
– Use of forward and backward pointers
• Typically a 30-minute timeout is used
• Web content may be dynamic
– May not be able to reconstruct what the user saw
• Use of spiders and automated agents that automatically request web pages
Problems with Web Logs
Like most data mining tasks, web log
mining requires preprocessing
– To identify users
– To match sessions to other data
– To fill in missing data
Essentially, to reconstruct the click stream
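To make the sessionization step concrete, here is a minimal sketch in Python; the log entries, client IDs, and the `sessionize` helper are all hypothetical, and the 30-minute inactivity timeout follows the heuristic mentioned above.

```python
from datetime import datetime, timedelta

# Hypothetical log entries: (client_id, url, timestamp).
log = [
    ("10.0.0.1", "/index.html", datetime(2024, 1, 1, 9, 0)),
    ("10.0.0.1", "/products/", datetime(2024, 1, 1, 9, 5)),
    ("10.0.0.1", "/index.html", datetime(2024, 1, 1, 10, 0)),  # >30 min gap: new session
    ("10.0.0.2", "/index.html", datetime(2024, 1, 1, 9, 1)),
]

def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group log entries into per-client sessions using an inactivity timeout."""
    sessions = []
    open_session = {}  # client_id -> index of that client's current session
    for client, url, ts in sorted(entries, key=lambda e: (e[0], e[2])):
        idx = open_session.get(client)
        if idx is None or ts - sessions[idx][-1][2] > timeout:
            sessions.append([(client, url, ts)])  # start a new session
            open_session[client] = len(sessions) - 1
        else:
            sessions[idx].append((client, url, ts))
    return sessions

# The four entries above fall into three sessions.
```

Note that this sketch sidesteps the user-identification problems listed above: it trusts the client address, which proxies and caching can defeat.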
Log Data - Simple Analysis
Statistical analysis of users
– Length of path
– Viewing time
– Number of page views
• Statistical analysis of site
– Most common pages viewed
– Most common invalid URL
Web Log – Data Mining Applications
□ Association rules
– Find pages that are often viewed together
□ Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
□ Classification
– Relate user attributes to patterns
WUM – Association Rule Generation
□ Discovers the correlations between pages that are
most often referenced together in a single server
session
□ Provides information such as:
– What sets of pages are frequently accessed together by Web users?
– What page will be fetched next?
– What paths are frequently followed by Web users?
Association rule
A ⇒ B [ Support = 60%, Confidence = 80% ]
Example
“50% of visitors who accessed URLs /infor-f.html and /labo/infos.html also visited /situation.html”
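Support and confidence figures like these can be computed directly from session data. A small sketch (the sessions and the `rule_stats` helper are illustrative, not from the original source):

```python
# Hypothetical server sessions: the set of pages each visitor viewed.
sessions = [
    {"/infor-f.html", "/labo/infos.html", "/situation.html"},
    {"/infor-f.html", "/labo/infos.html", "/situation.html"},
    {"/infor-f.html", "/labo/infos.html"},
    {"/infor-f.html"},
    {"/situation.html"},
]

def rule_stats(antecedent, consequent, sessions):
    """Support and confidence of the rule: antecedent => consequent."""
    n = len(sessions)
    both = sum(1 for s in sessions if antecedent <= s and consequent <= s)
    ante = sum(1 for s in sessions if antecedent <= s)
    return both / n, (both / ante if ante else 0.0)

sup, conf = rule_stats({"/infor-f.html", "/labo/infos.html"},
                       {"/situation.html"}, sessions)
# Antecedent occurs in 3 of 5 sessions, both sides in 2:
# support = 2/5 = 0.4, confidence = 2/3.
```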
Associations & Correlations
□ Page associations from usage data
– User sessions
– User transactions
□ Page associations from content data
– similarity based on content analysis
□ Page associations based on structure
– link connectivity between pages
==> Obtain frequent itemsets
Examples:
60% of clients who accessed /products/ also accessed /products/software/webminer.htm.
30% of clients who accessed /specialoffer.html placed an online order in /products/software/.
(Example from the official IBM Olympics site)
{Badminton, Diving} ===> {Table Tennis} (α = 69.7%, s = 0.35%)
WUM – Clustering
Groups together a set of items having similar
characteristics
User Clusters
Discover groups of users exhibiting similar
browsing patterns
Page recommendation
User’s partial session is classified into a single
cluster.
The links contained in this cluster are recommended
WUM – Clustering
Page clusters
□ Discover groups of pages having
related content
□ Usage based frequent pages
□ Page recommendation
□ The links are presented based on
how often URL references occur
together across user sessions
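One simple way to realize usage-based clustering is to group sessions by overlap in the pages they reference. The following is an illustrative sketch only; the threshold, session data, and function names are assumptions, not a standard algorithm:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of pages."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_sessions(sessions, threshold=0.5):
    """Greedy clustering: a session joins the first cluster whose
    accumulated page set it resembles; otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"pages": set, "members": list of sessions}
    for s in sessions:
        for c in clusters:
            if jaccard(s, c["pages"]) >= threshold:
                c["members"].append(s)
                c["pages"] |= s  # grow the cluster's page set
                break
        else:
            clusters.append({"pages": set(s), "members": [s]})
    return clusters

# Hypothetical sessions: two "news" visitors, one "shop" visitor.
sessions = [
    {"/news", "/sports"},
    {"/news", "/sports", "/weather"},
    {"/products", "/cart"},
]
clusters = cluster_sessions(sessions)
# Two clusters: a news/sports group and a shopping group.
```

The cluster's page set can then serve as the recommendation list described above: links in the cluster a partial session falls into are suggested to the user.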
Website Usage Analysis
□ Why develop a Website usage / utilization analysis tool?
• Knowledge about how visitors use a Website could:
- Prevent disorientation and help designers place important information/functions exactly where visitors look for them, and in the way users need it
- Build an adaptive Website server
Clustering and Classification
□ clients who often access
/products/software/webminer.html
tend to be from educational institutions.
□ clients who placed an online order for software
tend to be students in the 20-25 age group and live
in the United States.
□ 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
Sequential Patterns & Clusters
30% of clients who visited /products/software/ had done a search in Yahoo using the keyword “software” before their visit.
60% of clients who placed an online order for WEBMINER placed another online order for software within 15 days.
Web Structure Mining
□ Discovers the link structure of hyperlinks at the inter-document level to generate a structural summary of the Website and its Web pages.
– Direction 1: based on the hyperlinks, categorizing Web pages and the generated information.
– Direction 2: discovering the structure of the Web document itself.
– Direction 3: discovering the nature of the hierarchy or network of hyperlinks in Websites of a particular domain.
Web Structure Mining

□ Finding authoritative Web pages
– Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
• Hyperlinks can infer the notion of authority
– The Web consists not only of pages, but also of hyperlinks pointing from one page to another
– These hyperlinks contain an enormous amount of latent human annotation
– A hyperlink pointing to another Web page can be considered the author's endorsement of that page
Web Structure Mining
As mentioned by Kosala and Blockeel [7] and Pujari [19], Web Structure Mining tries to discover the underlying link structures of the web by building a model which can be used to categorize web pages and generate important information like similarity or relationships between web pages.
Apart from this, it can also be used to find authorities and hubs.
Algorithms used to model web topology are HITS, PAGERANK and CLEVER.
Improvement of HITS, by adding content information to the link structure or by using outlier filtering, can be done for web page categorization and for discovering micro-communities on the web.
According to Pujari [19], web structure mining can be used for:
– Finding the quality of a page in terms of the Authority of a Page and the Ranking of a Page.
– Finding interesting web structures like graph patterns for co-citations, social choice, etc.
– Classifying web pages according to various areas of interest.
Research at the hyperlink level is also known as Hyperlink Analysis.
Google and the PageRank Algorithm
PageRank™ - Introduction

The heart of Google’s searching software is PageRank™, a system for ranking web pages developed by Larry Page and Sergey Brin at Stanford University.
PageRank™ - Introduction

Essentially, Google interprets a link from page A to page B as a vote, by page A, for page B.
BUT these votes don’t weigh the same, because Google also analyzes the page that casts the vote.
The original PageRank™
algorithm

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where:
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of the pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti,
d is a damping factor which can be set between 0 and 1.
PageRank™ - Introduction

It’s obvious that the PageRank™ algorithm does not rank the whole website; rather, PageRank™ is determined for each page individually. Furthermore, the PageRank™ of page A is recursively defined by the PageRank™ of those pages which link to page A.
Page Rank ….
The PageRanks™ of the pages Ti which link to page A do not influence the PageRank™ of page A uniformly.
The PageRank™ of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A will benefit from a link to it on page T.
The weighted PageRanks™ of the pages Ti are then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank™.
Finally, the sum of the weighted PageRanks™ of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1. Thereby, the extent of the PageRank™ benefit a page gains from another page linking to it is reduced.
The PR of each page depends on the PR of the
pages pointing to it. But we won’t know
what PR those pages have until the pages
pointing to them have their PR calculated
and so on. So what we do is make a guess.
Example
A ↔ B (each page links to the other)

Each page has one outgoing link, so C(A) = 1 and C(B) = 1.
d (damping factor) = 0.85
PR(A)= (1 – d) + d(PR(B)/1)
PR(B)= (1 – d) + d(PR(A)/1)
i.e.
PR(A)= 0.15 + 0.85 * 1
=1
PR(B)= 0.15 + 0.85 * 1
=1
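The fixed point in this example can be checked by iterating the two formulas from an arbitrary starting guess; a quick sketch:

```python
# Iterate the two equations above from an arbitrary starting guess.
d = 0.85
pr_a = pr_b = 0.5  # any starting values converge

for _ in range(100):
    # Simultaneous update: each page has exactly one outbound link (C = 1).
    pr_a, pr_b = (1 - d) + d * pr_b, (1 - d) + d * pr_a

# Both values converge to 1, matching the worked example.
```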
PAGERANK algorithm
The formula used for calculating the PageRank of a page is recursive: it starts with any set of ranks and iterates the computation until it converges. The algorithm is as follows:
Step 1. Initialise the rank value of each page to 1/n, where n is the total number of pages to be ranked.
Step 2. Choose a value of the damping factor d such that 0 < d < 1, e.g. 0.85 or 0.15.
Step 3. Let PR be an array of elements representing the PageRank of each web page. Repeat for each node i where 0 < i < n:
PR[i] = 1 - d
For all pages Q that have an inward link to i, compute
PR[i] = PR[i] + d * A[Q] / Qn
where Qn = the outdegree of Q and A holds the rank values from the previous iteration.
Step 4. Update A[i] = PR[i] for 0 < i < n.
Repeat from Step 3 until the PR[i] values converge, i.e. the values of two consecutive iterations are similar.
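The steps above can be sketched in Python as follows (the three-page link graph is a made-up example, not from the source):

```python
def pagerank(links, d=0.85, tol=1e-6):
    """PageRank following the steps above.
    links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # Step 1: initialise to 1/n
    while True:
        new = {}
        for p in pages:  # Step 3: PR[i] = (1-d) + d * sum over in-links
            new[p] = (1 - d) + d * sum(
                rank[q] / len(links[q]) for q in pages if p in links[q]
            )
        if all(abs(new[p] - rank[p]) < tol for p in pages):
            return new  # values of two consecutive iterations are similar
        rank = new  # Step 4: carry the ranks into the next iteration

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# C, linked from both A and B, ends up with the highest rank.
```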
Authority and Hub
□ A page that is referenced by lot of important pages (has
more back links) is more important (Authority)
■ A page referenced by a single important page may be more
important than that referenced by five unimportant pages
□ A page that references a lot of important pages is also
important (Hub)
□ “Importance” can be propagated
■ Your importance is the weighted sum of the importance
conferred on you by the pages that refer to you
■ The importance you confer on a page may be proportional
to how many other pages you refer to (cite)
□ (Also what you say about them when you cite them!)
□ A page is a good authoritative page with respect to a
given query if it is referenced (i.e., pointed to) by many
(good hub) pages that are related to the query.
□ A page is a good hub page with respect to a given query
if it points to many good authoritative pages with
respect to the query.
□ Good authoritative pages (authorities) and good hub
pages (hubs) reinforce each other.
HITS (Hyperlink Induced Topic
Search)
□ HITS is a link analysis algorithm that rates web pages on the concepts of authority and hub. It was proposed by Jon Kleinberg and is based on the principle that a document has a high authority weight if it is pointed to by many documents with high hub weights, and a document has a high hub weight if it points to many documents with high authority weights.
The steps in this algorithm are:
1. A subgraph (R) is created by the algorithm based on the query (keywords).
2. Then, the hub and authority weights for each node are calculated.
3. Further, R is expanded to a base set B of pages linked to or from R.
4. Again, weights for authorities and hubs are calculated.
5. Pages with the highest ranks in B are returned.
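The mutual reinforcement between hubs and authorities is typically computed by power iteration. A minimal sketch on a toy graph (the page names and the normalization choice are illustrative assumptions):

```python
def hits(links, iters=50):
    """HITS power iteration. links: dict page -> list of pages it points to."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority: sum of hub weights of the pages pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so the weights stay bounded.
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# Hypothetical graph: H1 and H2 each link to A1 and A2.
auth, hub = hits({"H1": ["A1", "A2"], "H2": ["A1", "A2"], "A1": [], "A2": []})
# A1/A2 get high authority weights, H1/H2 get high hub weights.
```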
HITS
HITS provided good results. However, it is not stable on the class of authority-connected graphs and did not work well in a few cases:
a) Sometimes a set of documents on one host points to a single document on another host, or a single document on one host points to a set of documents on another host. These situations may give a misleading result for a good hub or a good authority.
b) Links generated automatically by tools may again provide a wrong definition of a good hub or a good authority.
c) Sometimes pages point to other pages which are non-relevant to the query topic. This may lead to wrong results for hubs and authorities.
SOCIAL NETWORK ANALYSIS

According to Pujari, Social Network Analysis is yet another way of studying the web link structure. It uses an exponential damping factor in its algorithms. Social network analysis studies ways to measure the relative standing or importance of individuals in a network.
The basic premise here is that if a webpage points a link to another web page, then the former is endorsing the importance of the latter in some sense.
Also, if there exists a link from one node to another and back from the latter to the former, it signifies some kind of mutual reinforcement, while links from one node to different nodes show the existence of co-citation.
Ranking pages
Kleinberg discusses a heuristic method of giving weights to the links. A link is said to be a transverse link if it is between pages with different domain names, and an intrinsic link if it is between pages with the same domain name. Intrinsic links convey less information about the page than transverse links, so they are not taken into account and hence are deleted from the graph.
Botafogo proposes another way of ranking pages, through the notions of Index Node and Reference Node. An Index Node is one whose outdegree is significantly larger than the average outdegree of the graph. A Reference Node is one whose indegree is significantly larger than the average indegree of the graph.
Ranking pages
For determining collections of similar pages, we need to define a similarity measure between pages. Two similarity functions are:
Bibliographic coupling: For a pair of nodes p and q, the bibliographic coupling is equal to the number of nodes that receive links from both p and q.
Co-citation: For a pair of nodes p and q, the co-citation number is the number of nodes that point to both p and q.
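Both similarity measures are straightforward to compute on a link graph; a small sketch with a made-up graph:

```python
def bibliographic_coupling(links, p, q):
    """Number of nodes that both p and q link to."""
    return len(set(links[p]) & set(links[q]))

def co_citation(links, p, q):
    """Number of nodes that link to both p and q."""
    return sum(1 for outs in links.values() if p in outs and q in outs)

# Hypothetical graph: p and q both cite y; a and b each cite both p and q.
links = {
    "p": ["x", "y"],
    "q": ["y", "z"],
    "a": ["p", "q"],
    "b": ["p", "q"],
    "x": [], "y": [], "z": [],
}
# bibliographic_coupling(links, "p", "q") == 1 (the shared target y)
# co_citation(links, "p", "q") == 2 (the nodes a and b)
```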
Web Content/Structure Mining
□ Mining of the textual content on the
Web
□ Data collection via Web crawlers
□ Web pages include hyperlinks
■ Authoritative pages
■ Hubs
■ Hyperlink-induced topic search (HITS) algorithm
Web Usage Mining
□ Extraction of information from data
generated through Web page visits and
transactions…
■ data stored in server access logs, referrer
logs, agent logs, and client-side cookies
■ user characteristics and usage profiles
■ metadata, such as page attributes, content
attributes, and usage data
□ Clickstream data
□ Clickstream analysis
Web Usage Mining
□ Web usage mining applications
■ Determine the lifetime value of clients
■ Design cross-marketing strategies across
products.
■ Evaluate promotional campaigns
■ Target electronic ads and coupons at user
groups based on user access patterns
■ Predict user behavior based on previously
learned rules and users' profiles
■ Present dynamic information to users based
on their interests and profiles…
Web Usage Mining
(clickstream analysis)
Web Mining Success Stories
□ Amazon.com, Ask.com, Scholastic.com
□ Website Optimization Ecosystem
Web Mining Tools
Problems based on PageRank:
