Process of Web Mining and Categories of Web Mining
1. Data Collection:
o Data is collected from web resources such as websites, logs, or social media
platforms using techniques like web scraping, APIs, or server logs.
o Tools such as Scrapy, BeautifulSoup, or Selenium are often used for automated data
extraction.
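As a hedged sketch of this step, the link-discovery part of a scraper can be written with only the standard library's html.parser (the sample page below stands in for a response that would normally be fetched with urllib or a tool like Scrapy):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, mimicking a scraper's link-discovery step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A real page would come from urllib.request.urlopen(url).read();
# a literal string stands in for it here.
page = '<html><body><a href="/docs">Docs</a><a href="https://example.com">Home</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/docs', 'https://example.com']
```

The collected links would then be queued for further crawling or passed to the preprocessing stage.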
2. Preprocessing:
o The raw data collected from the web is often noisy, redundant, and inconsistent.
Preprocessing involves cleaning the data, removing duplicates, and formatting it for
analysis.
o For example, removing HTML tags, handling missing values, and filtering irrelevant
content.
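A minimal preprocessing sketch covering the operations mentioned above: tag removal (a crude regex, adequate for well-formed fragments; real pipelines use proper HTML parsers), skipping empty entries, and dropping duplicates:

```python
import re

def strip_tags(html: str) -> str:
    """Remove HTML tags (crude regex approach, fine for well-formed fragments)."""
    return re.sub(r"<[^>]+>", "", html)

def preprocess(records):
    """Clean raw scraped records: strip markup, drop empty entries and duplicates."""
    cleaned = []
    seen = set()
    for raw in records:
        text = strip_tags(raw).strip()
        if not text or text in seen:  # handle missing values and duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["<p>Hello</p>", "<p>Hello</p>", "", "<div>World</div>"]
print(preprocess(raw))  # ['Hello', 'World']
```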
3. Pattern Discovery:
o Data mining and machine learning techniques are applied to discover meaningful
patterns and relationships.
4. Pattern Analysis:
o The extracted patterns and insights are interpreted to derive actionable knowledge.
o Visualization tools like Tableau or Python libraries such as Matplotlib and Seaborn
can help present findings in a user-friendly format.
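As an illustrative (not prescriptive) example of pattern discovery, the snippet below counts how often two pages co-occur in the same visit, a simplified form of co-occurrence/association mining:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(sessions, min_support=2):
    """Count how often two pages appear in the same session; keep pairs
    that meet the minimum support threshold."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

sessions = [["home", "cart", "checkout"],
            ["home", "blog"],
            ["home", "cart"]]
print(frequent_pairs(sessions))  # {('cart', 'home'): 2}
```

Patterns like this ("visitors who view the cart usually came from the home page") are what the analysis step then interprets.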
Web Content Mining
Definition: Focuses on extracting useful information from the content of web pages, such as
text, images, audio, video, and other multimedia data.
Key Characteristics:
o Uses text mining, data mining, and custom techniques due to the semi-structured
nature of web data.
o Rapidly growing due to the vast expansion of web content and its economic
potential.
Database Approach: Models web data into structured forms (e.g., tables or databases) to
apply data mining techniques effectively.
Challenges:
1. Web Information Integration and Schema Matching: Harmonizing data from various
sources that represent similar information differently.
2. Opinion Mining: Extracting user sentiment or opinions from reviews, blogs, and forums.
3. Noise Detection and Removal: Filtering out irrelevant parts of web pages, such as ads,
navigation links, and other non-content elements.
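A small sketch of the noise-removal challenge: text inside tags commonly treated as boilerplate is discarded while the article text is kept (the tag set here is an assumption, not a standard):

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "aside", "script", "footer"}  # assumed noise tags

class ContentFilter(HTMLParser):
    """Keeps text that appears outside boilerplate tags such as navigation bars."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting level inside boilerplate tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = "<nav>Home | About</nav><p>Actual article text.</p><footer>ads</footer>"
f = ContentFilter()
f.feed(page)
print(" ".join(f.chunks))  # Actual article text.
```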
Applications:
Web Structure Mining
Definition: Focuses on analyzing the structure of hyperlinks between web pages (i.e., the
topology of the web) to identify relationships, page importance, and web communities.
Key Characteristics:
o Examines the inter-document structure (links between web pages) rather than the
content within a page.
Hyperlink Analysis:
o HITS Algorithm: Identifies "hubs" (pages that link to many others) and "authorities"
(pages that are linked to by many others).
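A minimal power-iteration sketch of HITS (normalization simplified to sum-to-one, and a fixed iteration count standing in for a convergence test):

```python
def hits(graph, iterations=20):
    """HITS: a page's authority score sums the hub scores of pages linking to it;
    its hub score sums the authority scores of the pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority update
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(auths.values()) or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # hub update
        hubs = {p: sum(auths[t] for t in graph.get(p, ())) for p in pages}
        norm = sum(hubs.values()) or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths

# a and b both link to c, so c should emerge as the authority
graph = {"a": ["c"], "b": ["c"], "c": []}
hubs, auths = hits(graph)
print(max(auths, key=auths.get))  # c
```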
Structure Mining: Studies web page schemas or navigational structures.
Challenges:
1. Link Structure Analysis: Discovering meaningful correlations between linked web pages or
websites.
2. Community Detection: Identifying groups of related web pages that form a "community."
3. Web Schema Discovery: Revealing the structure or hierarchy of web pages to improve
navigation or data access.
Applications:
Web Usage Mining
Definition: Focuses on analyzing user behavior and navigation patterns by mining web server
logs, cookies, or clickstream data.
Key Characteristics:
o Examines secondary data generated by user interactions (e.g., logs, page views, and
click behavior).
Pattern Discovery:
o Techniques such as association rule mining, clustering, and sequential pattern
mining are applied to the preprocessed usage data.
Challenges:
1. Data Volume and Complexity: Dealing with massive, high-dimensional, and noisy web usage
data.
2. User Identification: Accurately identifying individual users in cases where multiple users
access the same account or IP address.
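A sketch of this challenge in practice: users are approximated by an IP + user-agent key, and a 30-minute inactivity gap (a common heuristic, not a standard) starts a new session:

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds of inactivity that starts a new session (heuristic)

def sessionize(events):
    """Group (user_key, timestamp, url) events into per-user sessions.
    user_key is typically IP + user agent, since true identity is unknown."""
    by_user = defaultdict(list)
    for key, ts, url in sorted(events, key=lambda e: e[1]):
        by_user[key].append((ts, url))
    sessions = []
    for key, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > SESSION_GAP:
                sessions.append((key, current))
                current = []
            current.append(url)
        sessions.append((key, current))
    return sessions

events = [("1.2.3.4|Mozilla", 0, "/home"),
          ("1.2.3.4|Mozilla", 600, "/cart"),
          ("1.2.3.4|Mozilla", 5000, "/home")]  # 5000 - 600 > 1800 -> new session
print(sessionize(events))  # two sessions: ['/home', '/cart'] and ['/home']
```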
Applications:
Major Clustering Methods
1. Partitioning Methods
o Divides the dataset into k groups, with each object assigned to exactly one
cluster, and iteratively relocates objects to improve the partition.
o Examples: k-means, k-medoids (PAM).
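A minimal sketch of k-means, the canonical partitioning method (plain Python, with a fixed iteration count instead of a convergence test):

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means on tuples: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```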
2. Hierarchical Methods
o Agglomerative Approach (Bottom-Up): Starts with each object as its own cluster
and merges the closest clusters iteratively.
o Divisive Approach (Top-Down): Starts with one large cluster and splits it
iteratively.
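A small sketch of the bottom-up (agglomerative) variant on 1-D values, using single linkage (merge the two clusters with the smallest minimum inter-point distance) until k clusters remain:

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the closest pair (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerative([1.0, 1.2, 8.0, 8.1, 15.0], k=3))
# [[1.0, 1.2], [8.0, 8.1], [15.0]]
```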
3. Density-Based Methods
o Can discover clusters of arbitrary shapes and is effective in filtering out noise or
outliers.
o Examples: DBSCAN, OPTICS, DENCLUE.
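A minimal 1-D DBSCAN sketch (the eps and min_pts values are illustrative): points with at least min_pts neighbours within eps, counting themselves, are core points; clusters grow outward from cores, and unreachable points are labelled noise:

```python
def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN on 1-D values. Labels map each point to a cluster id,
    or to -1 for noise."""
    labels = {}
    cluster_id = 0

    def neighbours(p):
        return [q for q in points if abs(p - q) <= eps]

    for p in points:
        if p in labels:
            continue
        nbrs = neighbours(p)
        if len(nbrs) < min_pts:
            labels[p] = -1            # tentatively noise
            continue
        labels[p] = cluster_id
        frontier = list(nbrs)
        while frontier:
            q = frontier.pop()
            if labels.get(q, -1) == -1:   # unvisited, or previously marked noise
                labels[q] = cluster_id
                qn = neighbours(q)
                if len(qn) >= min_pts:    # q is itself a core point: keep expanding
                    frontier.extend(qn)
        cluster_id += 1
    return labels

labels = dbscan([0.0, 0.5, 1.0, 1.5, 10.0])
print(labels)  # 10.0 ends up labelled as noise (-1)
```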
4. Grid-Based Methods
o Quantizes the object space into a finite number of cells to form a grid structure.
o All clustering operations are performed on the grid, resulting in fast processing
times that depend on the number of cells rather than the dataset size.
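A sketch of the grid-based quantization step: 2-D points are binned into cells, and only sufficiently dense cells (the density threshold here is illustrative) are kept as cluster candidates, so later work scales with the number of occupied cells rather than the number of points:

```python
from collections import defaultdict

def grid_cells(points, cell_size=1.0, min_density=2):
    """Quantize 2-D points into grid cells; keep cells holding at least
    min_density points as cluster candidates."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return {cell: pts for cell, pts in cells.items() if len(pts) >= min_density}

points = [(0.1, 0.2), (0.4, 0.9), (5.5, 5.1), (9.0, 0.0)]
print(grid_cells(points))  # {(0, 0): [(0.1, 0.2), (0.4, 0.9)]}
```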
5. Model-Based Methods
o Assumes a specific statistical model for each cluster and finds the best fit of the data
to the model.
o Useful for automatically determining the number of clusters and handling noise or
outliers.
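As an illustrative model-based example, a tiny EM fit of a two-component 1-D Gaussian mixture (unit variance and equal weights are simplifying assumptions kept for brevity):

```python
import math

def em_two_gaussians(data, iterations=50):
    """Tiny EM for a two-component 1-D Gaussian mixture (fixed unit variance,
    equal weights): the E-step computes soft assignments, the M-step moves
    each mean to the responsibility-weighted average."""
    mu = [min(data), max(data)]          # crude initialisation
    for _ in range(iterations):
        # E-step: responsibility of component 0 for each point
        r0 = []
        for x in data:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            r0.append(p0 / (p0 + p1))
        # M-step: responsibility-weighted mean updates
        w0, w1 = sum(r0), sum(1 - r for r in r0)
        mu[0] = sum(r * x for r, x in zip(r0, data)) / w0
        mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / w1
    return sorted(mu)

data = [0.0, 0.2, 0.1, 4.0, 4.1, 3.9]
print(em_two_gaussians(data))  # roughly [0.1, 4.0]
```

Fitting the model also yields per-point responsibilities, which is how model-based methods flag points that fit no component well as noise.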