
Web Data Mining Important Algorithms Notes

Uploaded by

workiimeee.02

1. Association Rule Mining

**Apriori Algorithm**:

- Identifies frequent itemsets by iteratively expanding them using a bottom-up approach.

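- Uses a minimum support threshold to decide which itemsets are frequent; the confidence threshold is applied later, when rules are generated.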

- Key Steps: Generate candidate itemsets -> Prune infrequent ones -> Repeat.
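
The generate, prune, repeat loop above can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function name `apriori`, the `baskets` data, and the choice to express support as a fraction of transactions are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Generate: join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support and keep only the frequent candidates.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical market-basket data for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=0.6)
```

The prune step relies on the fact that a superset of an infrequent itemset cannot be frequent, which is what keeps the candidate set small.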

**Association Rule Generation**:

- Derives rules from frequent itemsets.

- Uses confidence and lift measures to evaluate rules.
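
As a rough sketch of how a single rule A -> C might be scored, the snippet below computes support, confidence, and lift directly from transaction counts. The function name `rule_metrics` and the sample data are hypothetical, and the code assumes both the antecedent and the consequent occur in at least one transaction.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Score the rule antecedent -> consequent over a list of transactions."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(a <= t for t in transactions)         # contain A
    count_c = sum(c <= t for t in transactions)         # contain C
    count_ac = sum((a | c) <= t for t in transactions)  # contain A and C
    return {
        "support": count_ac / n,
        # confidence = P(C | A)
        "confidence": count_ac / count_a,
        # lift = P(A and C) / (P(A) * P(C))
        "lift": (count_ac * n) / (count_a * count_c),
    }

# Hypothetical market-basket data for illustration.
sample_baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
metrics = rule_metrics(sample_baskets, {"diapers"}, {"beer"})
```

A lift above 1 suggests the antecedent and consequent occur together more often than they would if they were independent; a lift near 1 suggests the rule carries little information.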

**PrefixSpan Algorithm**:

- Mines sequential patterns using a pattern-growth approach.

- Avoids candidate generation by exploring projected databases.
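
The projected-database idea can be illustrated with a simplified sketch. It assumes each sequence is a list of single items (the full algorithm also handles itemsets as sequence elements), and the names `prefixspan` and `click_paths` are made up for the example.

```python
def prefixspan(sequences, min_count):
    """Return {pattern tuple: support count} for sequences of single items."""
    results = {}

    def grow(prefix, projected):
        # Count, per suffix, which items could extend the current prefix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project: keep only what follows the first occurrence of item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, new_projected)

    grow((), [tuple(s) for s in sequences])
    return results

# Hypothetical page-visit sequences for illustration.
click_paths = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b"],
]
patterns = prefixspan(click_paths, min_count=2)
```

Each recursive call only looks at the suffixes that follow the current prefix, so no candidate sequences are generated and then tested against the full database.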

2. Information Retrieval

**Rocchio Method**:

- A relevance feedback algorithm in the vector space model.

- Moves the query vector toward the relevant documents and away from the irrelevant ones.
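
A minimal sketch of the update, assuming the query and documents are equal-length lists of term weights. The weights `alpha`, `beta`, and `gamma` are illustrative values, and clipping negative weights to zero is a common convention assumed here rather than something stated in these notes.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector after one round of feedback."""
    size = len(query)

    def centroid(docs):
        # Average vector of a document set; zero vector if the set is empty.
        if not docs:
            return [0.0] * size
        return [sum(d[i] for d in docs) / len(docs) for i in range(size)]

    rel = centroid(relevant)
    irr = centroid(irrelevant)
    # Move toward the relevant centroid, away from the irrelevant one.
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(query, rel, irr)]

# Hypothetical three-term vocabulary for illustration.
new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    irrelevant=[[0.0, 0.0, 1.0]],
)
```

In this example the second term gains weight because it appears in a relevant document, while the third term stays at zero because it appears only in the irrelevant one.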

**Statistical Language Model**:

- A probabilistic approach that ranks each document by the probability that its language model generates the query.

- Techniques: Unigram, Bigram models, smoothing methods.
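
One way to make this concrete is a unigram query-likelihood score with linear interpolation (Jelinek-Mercer) smoothing, chosen here only as one example of a smoothing method. The function name, the `lam` value, and the two tiny documents are assumptions for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | doc) under a smoothed unigram model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        # Mix the document model with the collection model so that a
        # term missing from the document does not zero out the score.
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score

# Hypothetical documents for illustration.
docs = {
    "d1": "web mining finds patterns in web data".split(),
    "d2": "link analysis ranks pages".split(),
}
collection = [token for tokens in docs.values() for token in tokens]
query = ["web", "mining"]
scores = {name: query_log_likelihood(query, tokens, collection)
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Without smoothing, `d2` would receive probability zero for this query; with it, `d2` still gets a finite but lower score than `d1`.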

Other Key Concepts:

- Phrase Queries: Search exact sequences of words.

- Proximity Queries: Search words near each other.

- Stemming: Reduces words to their root (stem) form.

- Meta-Search: Aggregates results from multiple search engines.

- Web Page Preprocessing: Tokenization, stop-word removal, stemming.
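
The three preprocessing steps can be sketched as a small pipeline. The stop-word list and the suffix-stripping stemmer below are deliberately crude placeholders invented for this example; a real system would typically use a fuller stop list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # Tokenization: lowercase, then split on anything non-alphanumeric.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop-word removal followed by stemming.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

terms = preprocess("The crawler is crawling and indexed links")
```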

3. Link Analysis Algorithms

**PageRank Algorithm**:

- Ranks web pages based on link structure.

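- Uses the random surfer model: a surfer follows links at random and occasionally jumps to a random page, and a page's rank is the probability that the surfer is on it.

- Strengths: Scores are query-independent and can be computed offline, so the method scales to large link graphs.

- Weaknesses: Sensitive to link spam, such as link farms built to inflate a page's score.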
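
A minimal power-iteration sketch of the random surfer model follows. The damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages with no out-links, and the four-page `LINKS` graph are all assumptions made for illustration; the code also assumes every link target appears as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n
                    for v in nodes}
        # Each page shares its rank equally among the pages it links to.
        for v in nodes:
            for target in graph[v]:
                new_rank[target] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank

# Hypothetical link graph for illustration.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(LINKS)
top_page = max(ranks, key=ranks.get)
```

Page `D` has no in-links, so it keeps only the share that comes from random jumps, while `C` collects rank from three pages.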

**HITS Algorithm**:

- Assigns hub and authority scores.

- Based on mutual reinforcement between hubs and authorities.
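
The mutual reinforcement can be sketched as an iterative update: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The fixed iteration count, the Euclidean normalization, and the small graph are assumptions for this example, which also assumes every link target is a key in the graph.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = list(graph)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: sum hub scores of pages that link in.
        auth = {v: 0.0 for v in nodes}
        for v in nodes:
            for target in graph[v]:
                auth[target] += hub[v]
        # Hub update: sum authority scores of pages linked to.
        hub = {v: sum(auth[t] for t in graph[v]) for v in nodes}
        # Normalize both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth

# Hypothetical graph: two hub-like pages pointing at two content pages.
LINK_GRAPH = {
    "h1": ["a1", "a2"],
    "h2": ["a1"],
    "a1": [],
    "a2": [],
}
hub_scores, auth_scores = hits(LINK_GRAPH)
```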

**Proximity Prestige**:

- Measures the importance of a page by how close other pages are to it, based on the link distances from the pages that can reach it.

**Co-citation & Bibliographic Coupling**:

- Co-citation: Two documents are co-cited when another document cites both of them.

- Bibliographic Coupling: Two documents are coupled when they both cite the same source.
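
Both measures reduce to simple counts over a citation map, as in this sketch. The function names and the `CITES` data are hypothetical; the map is assumed to send each document to the set of documents it cites.

```python
def cocitation(citations, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for refs in citations.values() if a in refs and b in refs)

def bibliographic_coupling(citations, a, b):
    """Number of sources cited by both a and b."""
    return len(set(citations.get(a, ())) & set(citations.get(b, ())))

# Hypothetical citation data: document -> set of documents it cites.
CITES = {
    "p1": {"x", "y"},
    "p2": {"x", "y", "z"},
    "p3": {"y", "z"},
}
xy_cocited = cocitation(CITES, "x", "y")
p1_p2_coupled = bibliographic_coupling(CITES, "p1", "p2")
p1_p3_coupled = bibliographic_coupling(CITES, "p1", "p3")
```

Co-citation looks at incoming links (who cites the pair), while bibliographic coupling looks at outgoing links (what the pair cites).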

4. Web Crawling

**Basic Crawler Algorithm**:

- Fetches web pages, extracts links, and repeats.

- Components: URL frontier, fetch module, parser.
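
The fetch, extract, repeat loop and its three components can be sketched as below. To keep the example runnable offline, the fetch module is passed in as a function and backed by a hypothetical in-memory `FAKE_WEB` dictionary; the names `crawl` and `LinkParser`, the breadth-first order, and the `max_pages` limit are all choices made for illustration. Politeness delays and robots.txt checks are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Parser component: collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=10):
    """fetch(url) returns HTML text, or None if the page is unavailable."""
    frontier = deque([seed])  # URL frontier (FIFO gives breadth-first order)
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)     # fetch module
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}
pages = crawl("http://example.test/", FAKE_WEB.get)
```

The `seen` set is what stops the crawler from revisiting pages when links form cycles.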

**Crawler Ethics & Conflicts**:

- Follow robots.txt.

- Avoid overloading servers.

- Respect site policies and bandwidth.
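
As a small illustration of the first two points, Python's standard `urllib.robotparser` module can check whether a URL may be fetched and whether the site requests a delay between requests. The robots.txt content, the `notes-bot` user agent, and the URLs below are invented for the example; the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)
allowed_public = rules.can_fetch("notes-bot", "http://example.test/index.html")
allowed_private = rules.can_fetch("notes-bot", "http://example.test/private/data.html")
# Seconds the site asks crawlers to wait between requests (None if unset).
delay_seconds = rules.crawl_delay("notes-bot")
```

A crawler would call `can_fetch` before each request and sleep for at least `delay_seconds` between requests to the same host.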

5. Opinion Mining & Sentiment Analysis

**Sentiment vs. Sentiment Phrase Classification**:

- Sentiment Classification: Overall opinion (positive/negative).

- Sentiment Phrase Classification: Classifies individual opinionated phrases or expressions rather than the whole document.

**Feature-based Opinion Mining**:

- Identifies sentiment towards specific features.

- Techniques: Dependency parsing, aspect extraction.

**Opinion Search & Spam Detection**:

- Opinion Search: Retrieves opinion-rich content.

- Challenges: Spam detection, sarcasm, domain-dependence.

6. Web Usage Mining

**Web Usage Mining Process**:

- Discovers usage patterns from web log data.

- Steps: Data collection -> Preprocessing -> Pattern discovery -> Analysis.

**Data Fusion & Cleaning**:

- Fusion: Combine data from multiple sources.

- Cleaning: Remove irrelevant/incomplete entries.

**Sessionization**:

- Divides each user's log into meaningful sessions.

- Based on time thresholds or navigation behavior.
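
The time-threshold variant can be sketched as follows: a new session starts whenever the gap between consecutive requests from the same user exceeds a timeout. The 30-minute (1800-second) timeout and the sample timestamps are assumptions for illustration.

```python
def sessionize(timestamps, timeout=1800):
    """Split one user's request times (in seconds) into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session if the gap is within the timeout.
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical request times for a single user, in seconds.
user_hits = [0, 600, 1200, 4000, 4100, 9000]
user_sessions = sessionize(user_hits)
```

Navigation-based sessionization would instead look at whether each request is reachable from the previous page, which requires the site's link structure in addition to the log.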
