
Web Mining Course

The document introduces web mining, which involves discovering useful information from the World Wide Web and its usage patterns. It discusses that the web is the largest database ever built but contains both structured and unstructured data. Three main types of web mining are then introduced: web content mining which analyzes text-based web content; web structure mining which examines relationships between web pages and links; and web usage mining which analyzes user interactions and behavior on the web. The goal is to understand how users interact with websites and gain insights from patterns in web data.

Uploaded by

Rani Shamas
Introduction to Web Mining

WWW: Facts

Web mining: discovering useful information from the World Wide Web and its usage patterns.

• The Web is the largest database ever built.
• The Web is not a relational database.
• Some of it is structured, some is semi-structured, and some is unstructured.
• The size of the Web is effectively unbounded, because pages can be generated dynamically.
• The content is dynamic and contains duplicates and inconsistencies.
• Queries are non-deterministic.
• The Web is a huge, widely distributed collection of:
    • documents of all sorts (static as well as dynamically generated content and services)
    • hyperlink information
• Mining interesting nuggets of information leads to a wealth of information and knowledge.
• Challenge: the data is unstructured, huge, and dynamic.

Warehousing a Meta-Web: Web yellow page service

Problems

• The “abundance” problem:
    • 99% of the information is of no interest to 99% of people.
• Limited coverage of the Web:
    • hidden Web sources; the majority of the data sits in DBMSs.
• Limited query interface based on keyword-oriented search.
• Limited customization to individual users.

Web content mining

Web page content mining, also known as web text mining or web data mining, is the process of
extracting valuable information, patterns, and insights from unstructured web content. It involves
analyzing and extracting knowledge from the vast amount of text-based information available on
the internet, including web pages, articles, blog posts, forums, social media posts, and other
textual data.

Web content mining can encompass a wide range of tasks and techniques, including:

• Text Preprocessing
• Text Extraction
• Keyword Extraction
• Sentiment Analysis
• Text Classification
• Opinion Mining: identifying opinions, attitudes, and subjective information expressed in the text.
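
As a rough illustration of two of the tasks listed above (text preprocessing and keyword extraction), the following minimal sketch tokenizes a page's text, drops stop words, and ranks the remaining words by frequency. The stop-word list, the function names, and the sample text are assumptions made for this example; they do not come from the course notes, and real systems use far richer preprocessing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "on", "in", "to", "it", "from"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def top_keywords(text, k=2):
    """Return the k most frequent tokens that survive preprocessing."""
    return Counter(preprocess(text)).most_common(k)


# Hypothetical page text used only for this example.
page_text = (
    "Web mining discovers patterns from the Web. "
    "Content mining analyzes the text of Web pages."
)
keywords = top_keywords(page_text)
print(keywords)
```

Frequency counting is the simplest possible keyword score; weighting schemes such as TF-IDF are a common next step when more than one document is available.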

Web structure mining

Web structure mining is a branch of web mining that focuses on analyzing and discovering
patterns and knowledge from the structural components of the World Wide Web. It involves
examining the relationships and connections between web pages, websites, and other web-based
resources to gain insights into the organization, navigation, and interlinking of information on
the web.

Three related techniques are usually discussed here:

• Link Analysis: analyzes the hyperlinks that connect web pages.
• Web Usage Mining: analyzes user interactions with the web, including clickstreams and navigation patterns. It is normally classed as a separate branch of web mining rather than a type of structure mining.
• Web Page Clustering: groups similar web pages based on their content, structure, or link patterns.

Web usage mining

Web usage mining is a branch of web mining that focuses on the analysis of user interactions
and behavior on the World Wide Web. It involves discovering meaningful patterns, trends, and
insights from the vast amount of user-generated data, such as clickstreams, session data, and
navigation patterns. The goal of web usage mining is to understand how users navigate websites,
interact with web pages, and utilize web-based applications and services.
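
To make the idea of mining clickstreams and navigation patterns more concrete, here is a minimal sketch that splits a click log into sessions and counts page-to-page transitions. The record layout, the 30-minute inactivity timeout, and all names are assumptions chosen for the example, not something the notes prescribe.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream records: (user_id, timestamp_in_seconds, page).
clicks = [
    ("u1", 0, "/home"),
    ("u1", 30, "/products"),
    ("u1", 90, "/cart"),
    ("u2", 10, "/home"),
    ("u2", 50, "/products"),
    ("u1", 4000, "/home"),      # long gap, so this starts a new session for u1
    ("u1", 4020, "/products"),
]

SESSION_TIMEOUT = 1800  # seconds of inactivity that end a session (assumed value)


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group each user's clicks into sessions split by inactivity gaps."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for events in by_user.values():
        current = [events[0][1]]
        for (prev_ts, _), (ts, page) in zip(events, events[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def transition_counts(sessions):
    """Count how often one page is followed directly by another within a session."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(clicks)
transitions = transition_counts(sessions)
print(sessions)
print(transitions.most_common(1))
```

Transition counts like these are one simple starting point for discovering frequent navigation paths.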

Web Structure Mining

Web structure mining is the process of extracting knowledge from the interconnections of
hypertext documents on the World Wide Web.

The Web is a Graph

Pages are nodes; hyperlinks are directed edges.

Interesting questions:

• What is the distribution of in-degrees and out-degrees?
• What does its connectivity structure look like?
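
The first question can be explored on a small scale with a sketch like the one below, which treats pages as nodes and hyperlinks as directed edges and tallies how many pages have each in-degree and out-degree. The four-page graph and the function name are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical toy web graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def degrees(graph):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    out_deg = {n: len(graph.get(n, [])) for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    return in_deg, out_deg


in_deg, out_deg = degrees(links)

# Degree distribution: how many pages have each degree value.
in_distribution = Counter(in_deg.values())
out_distribution = Counter(out_deg.values())
print(sorted(in_distribution.items()))
print(sorted(out_distribution.items()))
```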

Evaluation of Web pages

There are two approaches:

PageRank: for discovering the most important pages on the Web (as used by Google)

Hubs and authorities: a more detailed evaluation of the importance of Web pages

Basic definition of importance:

A page is important if important pages link to it

Intuition

Web pages are not equally “important”


www.amazon.com vs. www.gcuf.edu.pk

Links as citations: a page cited often is more important

www.amazon.com has 234,000 inlinks

www.gcuf.edu.pk has 1,000 inlinks

Are all links equal?

Recursive model: being cited by a highly cited paper counts a lot…

Eigenvector prestige measure
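
The recursive model can be sketched as a power iteration in the style of PageRank: every page repeatedly passes a share of its current score to the pages it links to, so a link from a highly ranked page counts for more. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the toy graph are all assumptions for this example rather than details given in the notes.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration for a PageRank-style score on {page: [linked pages]}."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share of the total score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # A page splits its damped score evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical toy graph: C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph A ends up ranked well above B even though each has a single in-link, because A's link comes from the highly ranked page C. That is the "not all links are equal" effect.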

Connectivity

Weakly connected components:

links are treated as undirected

about 90% of pages form a single component

Strongly connected components:

SCC: a set of nodes such that for every ordered pair (u, v) in the set there is a directed path from u to v

links are followed only in their direction

about 28% of pages form a strongly connected core

the sizes of the strongly connected components also follow a power law
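
The two definitions can be compared on a toy graph with the sketch below. It finds weak components by searching the undirected version of the graph, and strong components by intersecting the set of nodes a page can reach with the set of nodes that can reach it. This intersection approach is chosen for readability and is an assumption of the example; web-scale graphs would call for a linear-time algorithm such as Tarjan's or Kosaraju's. The edge list and helper names are hypothetical.

```python
from collections import deque


def reachable(start, adj):
    """Return the set of nodes reachable from start by following adj edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def build_adjacency(edges):
    """Return (forward, backward, undirected) adjacency dicts for directed edges."""
    fwd, bwd, und = {}, {}, {}
    for u, v in edges:
        for node in (u, v):
            fwd.setdefault(node, set())
            bwd.setdefault(node, set())
            und.setdefault(node, set())
        fwd[u].add(v)
        bwd[v].add(u)
        und[u].add(v)
        und[v].add(u)
    return fwd, bwd, und


def components(nodes, component_of):
    """Partition nodes using a function that maps a node to its whole component."""
    remaining, result = set(nodes), []
    while remaining:
        comp = component_of(next(iter(remaining)))
        result.append(comp)
        remaining -= comp
    return result


# Hypothetical toy graph: a 3-page cycle, one page hanging off it, and a separate pair.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "F")]
fwd, bwd, und = build_adjacency(edges)

# Weak components ignore link direction.
weak = components(fwd, lambda n: reachable(n, und))
# Strong components need a directed path both ways between every pair.
strong = components(fwd, lambda n: reachable(n, fwd) & reachable(n, bwd))

print(sorted(sorted(c) for c in weak))
print(sorted(sorted(c) for c in strong))
```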


• Central core (SCC): pages that can reach one another along directed links; about 30% of the Web
• IN group: pages that can reach the SCC but cannot be reached from it; about 20%
• OUT group: pages that can be reached from the SCC but cannot reach it; about 20%
• Tendrils: pages that cannot reach the SCC and cannot be reached from it; about 20%
• Unconnected: about 10%
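
The grouping above can be reproduced on a toy graph with the following sketch. It assumes that one page known to lie in the central core is supplied as a seed, and it separates tendrils from unconnected pages by checking whether a page is attached to the core once link direction is ignored. Both choices, along with the edge list and label names, are assumptions made for this illustration.

```python
from collections import deque


def reachable(starts, adj):
    """Return every node reachable from any node in starts via adj edges (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges, core_seed):
    """Label each node relative to the strongly connected component of core_seed."""
    fwd, bwd, und = {}, {}, {}
    for u, v in edges:
        fwd.setdefault(u, set()).add(v)
        bwd.setdefault(v, set()).add(u)
        und.setdefault(u, set()).add(v)
        und.setdefault(v, set()).add(u)
    downstream = reachable([core_seed], fwd)  # pages the core can reach
    upstream = reachable([core_seed], bwd)    # pages that can reach the core
    core = downstream & upstream
    attached = reachable([core_seed], und)    # weakly connected to the core
    labels = {}
    for node in und:
        if node in core:
            labels[node] = "SCC"
        elif node in upstream:
            labels[node] = "IN"
        elif node in downstream:
            labels[node] = "OUT"
        elif node in attached:
            labels[node] = "TENDRIL"
        else:
            labels[node] = "UNCONNECTED"
    return labels


# Hypothetical toy graph.
edges = [
    ("A", "B"), ("B", "C"), ("C", "A"),  # core cycle
    ("I", "A"),                          # I links into the core
    ("C", "O"),                          # the core links out to O
    ("I", "T"),                          # T hangs off IN and never reaches the core
    ("X", "Y"),                          # separate island
]
labels = bow_tie(edges, core_seed="A")
for node in sorted(labels):
    print(node, labels[node])
```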
