1. Association Rule Mining
**Apriori Algorithm**:
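- Identifies frequent itemsets bottom-up: frequent k-itemsets are extended to candidate (k+1)-itemsets, level by level.
- Uses a minimum support threshold to decide which itemsets are frequent; the confidence threshold applies later, when rules are generated.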
- Key Steps: Generate candidate itemsets -> Prune infrequent ones -> Repeat.
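The generate, prune, repeat loop above can be sketched as follows. This is a minimal illustration rather than an optimized implementation; the function name `apriori`, the `min_support` parameter (a fraction of transactions), and the sample baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Generate: join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(transactions, min_support=0.5)
```

With this data, {milk, butter} falls below the threshold, so the three-item candidate is pruned without ever being counted.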
**Association Rule Generation**:
- Derives rules from frequent itemsets.
- Uses confidence and lift measures to evaluate rules.
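As a rough sketch of rule generation, the function below splits each frequent itemset into an antecedent and a consequent and scores the rule. It assumes the input dictionary holds the support of every frequent itemset and all of its subsets; `generate_rules` and the sample supports are hypothetical names and values for illustration.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence, lift) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                c = itemset - a
                # confidence(A -> C) = support(A and C) / support(A)
                confidence = supp / frequent[a]
                # lift(A -> C) = confidence(A -> C) / support(C)
                lift = confidence / frequent[c]
                if confidence >= min_confidence:
                    rules.append((a, c, confidence, lift))
    return rules


# Hypothetical supports for a small basket dataset.
frequent = {
    frozenset({"bread"}): 0.75,
    frozenset({"milk"}): 0.75,
    frozenset({"butter"}): 0.5,
    frozenset({"bread", "milk"}): 0.5,
    frozenset({"bread", "butter"}): 0.5,
}
rules = generate_rules(frequent, min_confidence=0.8)
```

A lift above 1 suggests the antecedent and consequent co-occur more often than they would if they were independent.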
**PrefixSpan Algorithm**:
- Mines sequential patterns using a pattern-growth approach.
- Avoids candidate generation by exploring projected databases.
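The projected-database idea can be illustrated with a simplified sketch. It assumes each sequence is a list of single items (full PrefixSpan also handles itemset elements), and `prefixspan` and `min_count` are names chosen for this example.

```python
from collections import defaultdict


def prefixspan(sequences, min_count):
    """Return [(pattern, count)] for frequent sequential patterns."""
    results = []

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = defaultdict(int)
        for seq in projected:
            for item in set(seq):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + [item]
            results.append((pattern, count))
            # Project: keep only the suffix after the first occurrence of item.
            new_projected = [
                seq[seq.index(item) + 1:] for seq in projected if item in seq
            ]
            mine(pattern, new_projected)

    mine([], [list(s) for s in sequences])
    return results


# Hypothetical click sequences.
patterns = prefixspan([["a", "b", "c"], ["a", "c"], ["b", "c"]], min_count=2)
```

Each recursive call only looks at suffixes that follow the current prefix, so no candidate sequences are generated and tested against the full database.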
2. Information Retrieval
**Rocchio Method**:
- A relevance feedback algorithm in the vector space model.
- Moves the query vector toward documents judged relevant and away from those judged non-relevant.
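A minimal sketch of the Rocchio update, assuming documents and queries are equal-length term-weight lists. The weights `alpha`, `beta`, and `gamma` are tunable; the defaults below are commonly quoted starting values, not fixed by the method, and clipping negative weights to zero is a common convention assumed here.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector after relevance feedback."""

    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(col) / len(docs) for col in zip(*docs)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    # q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)
    return [
        max(0.0, alpha * q + beta * r - gamma * n)
        for q, r, n in zip(query, rel, non)
    ]


# Hypothetical three-term vocabulary.
new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
```

The second term gains weight because it appears in a relevant document, while the third stays at zero because it only appears in the non-relevant one.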
**Statistical Language Model**:
- Probabilistic approach that ranks documents by the probability that each document's language model generates the query.
- Techniques: Unigram, Bigram models, smoothing methods.
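As one possible illustration, the sketch below scores documents with a unigram query-likelihood model and Jelinek-Mercer smoothing, which mixes document and collection term frequencies. The function name, the mixing weight `lam`, and the toy documents are assumptions made for the example; other smoothing methods would fit the same outline.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | doc) under a smoothed unigram model."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)
        p_coll = coll_tf[term] / len(collection)
        # Smoothing keeps unseen terms from zeroing out the whole score.
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term absent from the entire collection
        score += math.log(p)
    return score


# Hypothetical tokenized documents.
d1 = ["web", "mining", "web"]
d2 = ["data", "mining"]
collection = d1 + d2
query = ["web", "mining"]
score_d1 = query_log_likelihood(query, d1, collection)
score_d2 = query_log_likelihood(query, d2, collection)
```

Here `d2` never mentions "web" but still receives a finite score, because the collection model supplies a small probability for the missing term.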
Other Key Concepts:
- Phrase Queries: Search exact sequences of words.
- Proximity Queries: Search words near each other.
- Stemming: Reduces inflected words to a common stem (e.g., "crawling" and "crawls" -> "crawl").
- Meta-Search: Aggregates results from multiple search engines.
- Web Page Preprocessing: Tokenization, stop-word removal, stemming.
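The preprocessing steps in the last bullet might be chained as in this sketch. The stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper for illustration only; it is not the Porter stemmer or any other standard algorithm.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}


def naive_stem(word):
    # Strip one common suffix if at least three characters remain.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Tokenize on runs of letters and digits, lowercased.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words, then stem what is left.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("The crawlers are fetching pages")
# terms == ["crawler", "fetch", "page"]
```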
3. Link Analysis Algorithms
**PageRank Algorithm**:
- Ranks web pages based on link structure.
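- Uses the random surfer model: with probability d (the damping factor) the surfer follows a random out-link, otherwise jumps to a random page.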
- Strengths: Scalable, robust.
- Weaknesses: Sensitive to link spam.
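A minimal power-iteration sketch of PageRank on a small adjacency dictionary. It assumes every link target also appears as a key, and the damping value 0.85 and the fixed iteration count are conventional choices made for the example, not requirements.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random-jump share that every page receives.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical three-page web.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this graph C ends up ranked highest because it receives links from both A and B, and A follows because it inherits all of C's rank.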
**HITS Algorithm**:
- Assigns hub and authority scores.
- Based on mutual reinforcement between hubs and authorities.
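The mutual reinforcement can be sketched as an iterative update, assuming the graph is a dictionary of out-links. The function name `hits`, the iteration count, and the L2 normalization step are choices made for this illustration.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) score dictionaries."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[src]
        auth = normalize(new_auth)
        # Hub score: sum of authority scores of pages linked to.
        hub = normalize(
            {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        )
    return hub, auth


# Hypothetical graph: two hub pages pointing at two authority pages.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```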
**Proximity Prestige**:
- Measures the importance of a page by how many other pages can reach it through links and how short those paths are.
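One common formulation from social network analysis divides the fraction of pages that can reach a page by their average distance to it. The sketch below assumes that formulation and an out-link dictionary as input; the function name is hypothetical.

```python
from collections import deque


def proximity_prestige(graph, target):
    """Fraction of pages reaching target, divided by their mean distance."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    # Reverse the edges so a BFS from target finds every page that reaches it.
    reverse = {n: [] for n in nodes}
    for src, targets in graph.items():
        for t in targets:
            reverse[t].append(src)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse[node]:
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    distances = [d for n, d in dist.items() if n != target]
    if not distances or len(nodes) < 2:
        return 0.0  # no other page can reach the target
    reach_fraction = len(distances) / (len(nodes) - 1)
    mean_distance = sum(distances) / len(distances)
    return reach_fraction / mean_distance


# Hypothetical chain A -> B -> C.
chain = {"A": ["B"], "B": ["C"], "C": []}
score_c = proximity_prestige(chain, "C")
score_a = proximity_prestige(chain, "A")
```

The score reaches 1.0 only when every other page links directly to the target.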
**Co-citation & Bibliographic Coupling**:
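- Co-citation: Two documents are co-cited when a third document cites both; the more often this happens, the more related they are assumed to be.
- Bibliographic Coupling: Two documents are coupled when they cite the same sources; the more shared references, the stronger the coupling.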
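Both measures reduce to simple counts over a citation map. The sketch below assumes a dictionary from each citing document to the set of documents it cites; the function names and sample data are illustrative.

```python
def cocitation(citations, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for refs in citations.values() if a in refs and b in refs)


def bibliographic_coupling(citations, a, b):
    """Number of references that a and b have in common."""
    return len(set(citations.get(a, ())) & set(citations.get(b, ())))


# Hypothetical citation data: paper -> papers it cites.
citations = {
    "p1": {"x", "y"},
    "p2": {"x", "y", "z"},
    "p3": {"y"},
}
xy_cocited = cocitation(citations, "x", "y")                 # cited together by p1 and p2
p1_p2_coupled = bibliographic_coupling(citations, "p1", "p2")  # share x and y
```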
4. Web Crawling
**Basic Crawler Algorithm**:
- Fetches web pages, extracts links, and repeats.
- Components: URL frontier, fetch module, parser.
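A breadth-first sketch of the fetch, parse, enqueue loop. To keep it runnable offline, the fetch module is passed in as a function and the example uses a hypothetical in-memory "web"; a real crawler would substitute an HTTP client and add politeness controls.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Parser component: collects href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    frontier = deque([seed])  # URL frontier (FIFO gives breadth-first order)
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)  # fetch module; returns None on failure
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory site standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}
visited = crawl("http://example.test/", FAKE_WEB.get)
```

The `seen` set is what prevents the crawler from looping forever on pages that link back to each other.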
**Crawler Ethics & Conflicts**:
- Follow robots.txt.
- Avoid overloading servers.
- Respect site policies and bandwidth.
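Python's standard library includes a robots.txt parser that a crawler can consult before each fetch. The sketch below parses a hypothetical robots.txt from a string rather than downloading it; the bot name and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("ExampleBot", "http://example.test/index.html")
blocked = rp.can_fetch("ExampleBot", "http://example.test/private/data.html")
delay = rp.crawl_delay("ExampleBot")  # seconds to wait between requests
```

A polite crawler would skip any URL for which `can_fetch` returns False and sleep for at least `delay` seconds between requests to the same host.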
5. Opinion Mining & Sentiment Analysis
**Sentiment vs. Sentiment Phrase Classification**:
- Sentiment Classification: Overall opinion (positive/negative).
- Sentiment Phrase Classification: Classifies individual opinionated expressions rather than whole documents.
**Feature-based Opinion Mining**:
- Identifies sentiment towards specific features.
- Techniques: Dependency parsing, aspect extraction.
**Opinion Search & Spam Detection**:
- Opinion Search: Retrieves opinion-rich content.
- Challenges: Spam detection, sarcasm, domain-dependence.
6. Web Usage Mining
**Web Usage Mining Process**:
- Discovers usage patterns from web server log data.
- Steps: Data collection -> Preprocessing -> Pattern discovery -> Analysis.
**Data Fusion & Cleaning**:
- Fusion: Combine data from multiple sources.
- Cleaning: Remove irrelevant/incomplete entries.
**Sessionization**:
- Divides each user's log into meaningful sessions.
- Based on time thresholds or navigation behavior.
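A time-threshold version of sessionization can be sketched as below. It assumes the input is one user's request timestamps in seconds; the 30-minute inactivity timeout is a widely used default, chosen here as an assumption.

```python
def sessionize(timestamps, timeout=1800):
    """Split one user's request times (in seconds) into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session if the gap is within the timeout.
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical request times for a single user.
sessions = sessionize([0, 600, 1200, 5000, 5100])
# sessions == [[0, 600, 1200], [5000, 5100]]
```

Navigation-based heuristics would instead start a new session when a request cannot be reached by a link from the pages already in the session.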