Process of Web Mining and Categories of Web Mining
1. Data Collection:
o Data is collected from web resources such as websites, logs, or social media
platforms using techniques like web scraping, APIs, or server logs.
o Tools such as Scrapy, BeautifulSoup, or Selenium are often used for automated data
extraction.
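As a hedged sketch of this step, the link-discovery part of a scraper can be written with only the standard library's html.parser (the sample page below stands in for a response that would normally be fetched with urllib or a tool like Scrapy):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, mimicking a scraper's link-discovery step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A real page would come from urllib.request.urlopen(url).read();
# a literal string stands in for it here.
page = '<html><body><a href="/docs">Docs</a><a href="https://example.com">Home</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/docs', 'https://example.com']
```

The collected links would then be queued for further crawling or passed to the preprocessing stage.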
2. Preprocessing:
o The raw data collected from the web is often noisy, redundant, and inconsistent.
Preprocessing involves cleaning the data, removing duplicates, and formatting it for
analysis.
o For example, removing HTML tags, handling missing values, and filtering irrelevant
content.
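A minimal preprocessing sketch covering the operations mentioned above: tag removal (a crude regex, adequate for well-formed fragments; real pipelines use proper HTML parsers), skipping empty entries, and dropping duplicates:

```python
import re

def strip_tags(html: str) -> str:
    """Remove HTML tags (crude regex approach, fine for well-formed fragments)."""
    return re.sub(r"<[^>]+>", "", html)

def preprocess(records):
    """Clean raw scraped records: strip markup, drop empty entries and duplicates."""
    cleaned = []
    seen = set()
    for raw in records:
        text = strip_tags(raw).strip()
        if not text or text in seen:  # handle missing values and duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["<p>Hello</p>", "<p>Hello</p>", "", "<div>World</div>"]
print(preprocess(raw))  # ['Hello', 'World']
```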
3. Pattern Discovery:
o Data mining and machine learning techniques are applied to discover meaningful
patterns and relationships.
4. Pattern Analysis:
o The extracted patterns and insights are interpreted to derive actionable knowledge.
o Visualization tools like Tableau or Python libraries such as Matplotlib and Seaborn
can help present findings in a user-friendly format.
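As an illustrative (not prescriptive) example of pattern discovery, the snippet below counts how often two pages co-occur in the same visit, a simplified form of co-occurrence/association mining:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(sessions, min_support=2):
    """Count how often two pages appear in the same session; keep pairs
    that meet the minimum support threshold."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

sessions = [["home", "cart", "checkout"],
            ["home", "blog"],
            ["home", "cart"]]
print(frequent_pairs(sessions))  # {('cart', 'home'): 2}
```

Patterns like this ("visitors who view the cart usually came from the home page") are what the analysis step then interprets.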
Web Content Mining
Definition: Focuses on extracting useful information from the content of web pages, such as
text, images, audio, video, and other multimedia data.
Key Characteristics:
o Uses text mining, data mining, and custom techniques due to the semi-structured
nature of web data.
o Rapidly growing due to the vast expansion of web content and its economic
potential.
Database Approach: Models web data into structured forms (e.g., tables or databases) to
apply data mining techniques effectively.
Challenges:
1. Web Information Integration and Schema Matching: Harmonizing data from various
sources that represent similar information differently.
2. Opinion Mining: Extracting user sentiment or opinions from reviews, blogs, and forums.
3. Noise Detection and Removal: Filtering out irrelevant parts of web pages, such as ads,
navigation links, and other non-content elements.
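A small sketch of the noise-removal challenge: text inside tags commonly treated as boilerplate is discarded while the article text is kept (the tag set here is an assumption, not a standard):

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "aside", "script", "footer"}  # assumed noise tags

class ContentFilter(HTMLParser):
    """Keeps text that appears outside boilerplate tags such as navigation bars."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting level inside boilerplate tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = "<nav>Home | About</nav><p>Actual article text.</p><footer>ads</footer>"
f = ContentFilter()
f.feed(page)
print(" ".join(f.chunks))  # Actual article text.
```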
Applications:
Web Structure Mining
Definition: Focuses on analyzing the structure of hyperlinks between web pages (i.e., the
topology of the web) to identify relationships, page importance, and web communities.
Key Characteristics:
o Examines the inter-document structure (links between web pages) rather than the
content within a page.
Hyperlink Analysis:
o HITS Algorithm: Identifies "hubs" (pages that link to many others) and "authorities"
(pages that are linked to by many others).
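A minimal power-iteration sketch of HITS (normalization simplified to sum-to-one, and a fixed iteration count standing in for a convergence test):

```python
def hits(graph, iterations=20):
    """HITS: a page's authority score sums the hub scores of pages linking to it;
    its hub score sums the authority scores of the pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority update
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(auths.values()) or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # hub update
        hubs = {p: sum(auths[t] for t in graph.get(p, ())) for p in pages}
        norm = sum(hubs.values()) or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths

# a and b both link to c, so c should emerge as the authority
graph = {"a": ["c"], "b": ["c"], "c": []}
hubs, auths = hits(graph)
print(max(auths, key=auths.get))  # c
```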
Structure Mining: Studies web page schemas or navigational structures.
Challenges:
1. Link Structure Analysis: Discovering meaningful correlations between linked web pages or
websites.
2. Community Detection: Identifying groups of related web pages that form a "community."
3. Web Schema Discovery: Revealing the structure or hierarchy of web pages to improve
navigation or data access.
Applications:
Web Usage Mining
Definition: Focuses on analyzing user behavior and navigation patterns by mining web server
logs, cookies, or clickstream data.
Key Characteristics:
o Examines secondary data generated by user interactions (e.g., logs, page views, and
click behavior).
Pattern Discovery:
o Techniques such as association rule mining, clustering, and sequential pattern
mining are applied to the preprocessed usage data.
Challenges:
1. Data Volume and Complexity: Dealing with massive, high-dimensional, and noisy web usage
data.
2. User Identification: Accurately identifying individual users in cases where multiple users
access the same account or IP address.
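A sketch of this challenge in practice: users are approximated by an IP + user-agent key, and a 30-minute inactivity gap (a common heuristic, not a standard) starts a new session:

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds of inactivity that starts a new session (heuristic)

def sessionize(events):
    """Group (user_key, timestamp, url) events into per-user sessions.
    user_key is typically IP + user agent, since true identity is unknown."""
    by_user = defaultdict(list)
    for key, ts, url in sorted(events, key=lambda e: e[1]):
        by_user[key].append((ts, url))
    sessions = []
    for key, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > SESSION_GAP:
                sessions.append((key, current))
                current = []
            current.append(url)
        sessions.append((key, current))
    return sessions

events = [("1.2.3.4|Mozilla", 0, "/home"),
          ("1.2.3.4|Mozilla", 600, "/cart"),
          ("1.2.3.4|Mozilla", 5000, "/home")]  # 5000 - 600 > 1800 -> new session
print(sessionize(events))  # two sessions: ['/home', '/cart'] and ['/home']
```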
Applications:
Major Clustering Methods
1. Partitioning Methods
o Divides the dataset into k groups, with each object assigned to exactly one
cluster, and iteratively relocates objects to improve the partition.
o Examples: k-means, k-medoids (PAM).
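A minimal sketch of k-means, the canonical partitioning method (plain Python, with a fixed iteration count instead of a convergence test):

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means on tuples: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```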
2. Hierarchical Methods
o Agglomerative Approach (Bottom-Up): Starts with each object as its own cluster
and merges the closest clusters iteratively.
o Divisive Approach (Top-Down): Starts with one large cluster and splits it
iteratively.
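A small sketch of the bottom-up (agglomerative) variant on 1-D values, using single linkage (merge the two clusters with the smallest minimum inter-point distance) until k clusters remain:

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the closest pair (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerative([1.0, 1.2, 8.0, 8.1, 15.0], k=3))
# [[1.0, 1.2], [8.0, 8.1], [15.0]]
```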
3. Density-Based Methods
o Can discover clusters of arbitrary shapes and is effective in filtering out noise or
outliers.
o Examples: DBSCAN, OPTICS, DENCLUE.
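A minimal 1-D DBSCAN sketch (the eps and min_pts values are illustrative): points with at least min_pts neighbours within eps, counting themselves, are core points; clusters grow outward from cores, and unreachable points are labelled noise:

```python
def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN on 1-D values. Labels map each point to a cluster id,
    or to -1 for noise."""
    labels = {}
    cluster_id = 0

    def neighbours(p):
        return [q for q in points if abs(p - q) <= eps]

    for p in points:
        if p in labels:
            continue
        nbrs = neighbours(p)
        if len(nbrs) < min_pts:
            labels[p] = -1            # tentatively noise
            continue
        labels[p] = cluster_id
        frontier = list(nbrs)
        while frontier:
            q = frontier.pop()
            if labels.get(q, -1) == -1:   # unvisited, or previously marked noise
                labels[q] = cluster_id
                qn = neighbours(q)
                if len(qn) >= min_pts:    # q is itself a core point: keep expanding
                    frontier.extend(qn)
        cluster_id += 1
    return labels

labels = dbscan([0.0, 0.5, 1.0, 1.5, 10.0])
print(labels)  # 10.0 ends up labelled as noise (-1)
```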
4. Grid-Based Methods
o Quantizes the object space into a finite number of cells to form a grid structure.
o All clustering operations are performed on the grid, resulting in fast processing
times that depend on the number of cells rather than the dataset size.
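A sketch of the grid-based quantization step: 2-D points are binned into cells, and only sufficiently dense cells (the density threshold here is illustrative) are kept as cluster candidates, so later work scales with the number of occupied cells rather than the number of points:

```python
from collections import defaultdict

def grid_cells(points, cell_size=1.0, min_density=2):
    """Quantize 2-D points into grid cells; keep cells holding at least
    min_density points as cluster candidates."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return {cell: pts for cell, pts in cells.items() if len(pts) >= min_density}

points = [(0.1, 0.2), (0.4, 0.9), (5.5, 5.1), (9.0, 0.0)]
print(grid_cells(points))  # {(0, 0): [(0.1, 0.2), (0.4, 0.9)]}
```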
5. Model-Based Methods
o Assumes a specific statistical model for each cluster and finds the best fit of the data
to the model.
o Useful for automatically determining the number of clusters and handling noise or
outliers.
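As an illustrative model-based example, a tiny EM fit of a two-component 1-D Gaussian mixture (unit variance and equal weights are simplifying assumptions kept for brevity):

```python
import math

def em_two_gaussians(data, iterations=50):
    """Tiny EM for a two-component 1-D Gaussian mixture (fixed unit variance,
    equal weights): the E-step computes soft assignments, the M-step moves
    each mean to the responsibility-weighted average."""
    mu = [min(data), max(data)]          # crude initialisation
    for _ in range(iterations):
        # E-step: responsibility of component 0 for each point
        r0 = []
        for x in data:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            r0.append(p0 / (p0 + p1))
        # M-step: responsibility-weighted mean updates
        w0, w1 = sum(r0), sum(1 - r for r in r0)
        mu[0] = sum(r * x for r, x in zip(r0, data)) / w0
        mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / w1
    return sorted(mu)

data = [0.0, 0.2, 0.1, 4.0, 4.1, 3.9]
print(em_two_gaussians(data))  # roughly [0.1, 4.0]
```

Fitting the model also yields per-point responsibilities, which is how model-based methods flag points that fit no component well as noise.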