DWM Notes

Module 1: Data Warehousing Fundamentals

Introduction to Data Warehouse, Data warehouse architecture, Data warehouse versus Data
Marts, E-R Modeling versus Dimensional Modeling, Information Package Diagram, Data
Warehouse Schemas; Star Schema, Snowflake Schema, Factless Fact Table, Fact
Constellation Schema. Update to the dimension tables. Major steps in ETL process, OLTP
versus OLAP, OLAP operations: Slice, Dice, Rollup, Drilldown and Pivot.

1. Introduction to Data Warehouse:


A data warehouse is a large storage system that collects and stores data from different
sources in one place. It helps businesses make better decisions by organizing data in a way
that's easy to understand and analyze. The data is cleaned, organized, and made available for
quick reporting and analysis. It's like a big, organized library of information that helps
companies track their performance and plan for the future.
2. Data Warehouse Architecture:

The data warehouse architecture is the framework that defines how data is collected,
stored, and accessed. It typically has four main layers:

1. Data Source Layer: This is where data comes from different places like databases,
flat files, or external sources (e.g., sales systems, customer records). The data can be
structured or unstructured.
2. Data Staging Layer: This is a temporary area where raw data is cleaned,
transformed, and prepared for storage. The process is called ETL (Extract,
Transform, Load). Here, the data is made consistent and usable.
3. Data Storage Layer: This is where the cleaned and transformed data is stored in
the data warehouse. The data is often stored in databases organized into schemas
like star schema or snowflake schema.
4. Data Presentation/Access Layer: This layer is where end-users or applications
access the stored data for reporting, analysis, and decision-making. It includes tools
for generating reports, dashboards, and performing OLAP (Online Analytical
Processing) operations.

In simple terms, the architecture flows like this: data sources → staging area → data
warehouse storage → user access for reporting/analysis.

3. Data Warehouse vs. Data Marts:

● Data Warehouse: A large, centralized system that stores data from various
departments or sources.
● Data Mart: A smaller, more focused subset of the data warehouse, tailored for
specific departments or business areas (e.g., marketing, sales).

4. E-R Modeling vs. Dimensional Modeling:

● E-R Modeling (Entity-Relationship): Focuses on capturing relationships between entities in transactional databases.
● Dimensional Modeling: Optimized for data warehouses, this approach organizes
data into facts and dimensions, simplifying queries and reporting.
5. Information Package Diagram:

This diagram helps in identifying key dimensions and facts required for decision-making. It
shows the relationships between business processes, dimensions, and metrics, forming the
foundation of dimensional modeling.

6. Data Warehouse Schemas:

● Star Schema: A central fact table is linked to several dimension tables. It's simple
and easy to navigate.
● Snowflake Schema: An extension of the star schema where dimension tables are
normalized into multiple related tables, reducing redundancy.
● Factless Fact Table: A fact table that doesn't contain measurable facts but captures
event occurrences or conditions.
● Fact Constellation Schema: Multiple fact tables share dimension tables, forming
a more complex schema used in enterprise-level data warehouses.

7. Update to Dimension Tables:

Dimension tables can be updated using strategies like:

● Slowly Changing Dimensions (SCD): Handling changes in data over time


without losing historical accuracy.
○ Type 1: Overwrites old data.
○ Type 2: Adds new records for changes.
○ Type 3: Adds a new column for updated data.
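
For illustration, here is a minimal Type 2 sketch using pandas. The dimension table and its columns (cust_key, cust_id, city, start_date, end_date, is_current) are hypothetical; the point is that a change expires the current row and appends a new row with a fresh surrogate key, so history is preserved.

```python
import pandas as pd

# Hypothetical customer dimension with one current row.
dim_customer = pd.DataFrame([
    {"cust_key": 1, "cust_id": "C001", "city": "Pune",
     "start_date": "2022-01-01", "end_date": None, "is_current": True},
])

def scd_type2_update(dim, cust_id, new_city, change_date):
    """SCD Type 2: expire the current row and append a new versioned row."""
    dim = dim.copy()
    current = (dim["cust_id"] == cust_id) & (dim["is_current"])
    if dim.loc[current, "city"].iloc[0] == new_city:
        return dim                                   # nothing changed, keep the current row
    dim.loc[current, "end_date"] = change_date       # close out the old version
    dim.loc[current, "is_current"] = False
    new_row = {"cust_key": dim["cust_key"].max() + 1, "cust_id": cust_id,
               "city": new_city, "start_date": change_date,
               "end_date": None, "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

print(scd_type2_update(dim_customer, "C001", "Mumbai", "2024-06-01"))
```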

8. Major Steps in ETL Process:

● Extract: Retrieving data from different sources.


● Transform: Cleaning, formatting, and converting data into the desired format.
● Load: Loading the transformed data into the data warehouse.
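
A minimal ETL sketch using pandas and SQLite (standing in for a real warehouse); the inline CSV, column names, and table name are illustrative assumptions.

```python
import io
import sqlite3
import pandas as pd

# Extract: read raw data from a source (an inline CSV stands in for a real export file).
raw_csv = io.StringIO(
    "order_id,amount,order_date\n"
    "1001,250.0,2024-01-05\n"
    "1002,,2024-01-06\n"
    "1003,99.5,2024-01-07\n"
)
raw = pd.read_csv(raw_csv)

# Transform: clean and standardize before loading.
raw = raw.dropna(subset=["amount"])                       # drop rows with missing measures
raw["amount"] = raw["amount"].astype(float)               # enforce a consistent numeric type
raw["order_date"] = pd.to_datetime(raw["order_date"]).dt.date

# Load: append the cleaned rows into the warehouse table (SQLite stands in for a real DW).
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("fact_sales", conn, if_exists="append", index=False)
```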

9. OLTP vs. OLAP:

● OLTP (Online Transaction Processing): Manages transactional data, optimized for inserting, updating, and deleting records (e.g., banking systems).
● OLAP (Online Analytical Processing): Supports complex queries and data
analysis, optimized for retrieving and summarizing data (e.g., sales reporting).

10. OLAP Operations:

● Slice: Selecting the data for a single value of one dimension, e.g., sales in a specific region.
● Dice: Selecting a sub-cube by specifying conditions on two or more dimensions, e.g., sales in a region during a specific time frame.
● Rollup: Aggregating data along a dimension, e.g., monthly sales rolled up into
quarterly data.
● Drilldown: Breaking down data into finer levels of detail, e.g., going from yearly to
monthly sales data.
● Pivot: Rotating the data to view it from different perspectives, e.g., switching rows
and columns in a report.
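
These operations can be imitated on a small, made-up sales table with pandas; this is only an illustration of the ideas, not an OLAP server.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "product": ["Pen", "Pen", "Book", "Pen"],
    "amount":  [100, 150, 200, 120],
})

slice_east = sales[sales["region"] == "East"]                           # Slice: fix one dimension
dice = sales[(sales["region"] == "East") & (sales["quarter"] == "Q1")]  # Dice: fix several dimensions
rollup = sales.groupby("region")["amount"].sum()                        # Roll-up: aggregate to a coarser level
drilldown = sales.groupby(["region", "quarter"])["amount"].sum()        # Drill-down: finer level of detail
pivot = sales.pivot_table(values="amount", index="region",
                          columns="quarter", aggfunc="sum")             # Pivot: rotate the view
print(pivot)
```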

_______________________________________________________________

Module 2 - Introduction to Data Mining, Data Exploration and Data Pre-processing

Data Mining Task Primitives, Architecture, KDD process, Issues in Data Mining,
Applications of Data Mining, Data Exploration: Types of Attributes, Statistical Description of
Data, Data Visualization, Data Preprocessing: Descriptive data summarization, Cleaning,
Integration & transformation, Data reduction, Data Discretization and Concept hierarchy
generation.

1. Data Mining Task Primitives:

Data mining task primitives are the core functionalities or operations that define the
types of patterns or knowledge that can be discovered during the data mining process. These
task primitives help specify the data mining tasks and guide the mining process to extract
meaningful patterns from large datasets. Here are the key task primitives used in data
mining:

1. Classification

● Definition: Classification is a supervised learning task where the objective is to map data instances into predefined categories or classes.
● Process:
○ The model is trained on a labeled dataset, where each instance is associated
with a known class.
○ Once trained, the model predicts the class labels of new, unseen instances
based on the features.
● Applications:
○ Spam detection (email classified as spam or not spam).
○ Disease diagnosis (patients classified as having a particular disease or not).
○ Image recognition (classifying objects in an image).
● Techniques:
○ Decision Trees
○ Naive Bayes
○ Support Vector Machines (SVM)
○ Neural Networks

Evaluation Metrics in Classification

When evaluating the performance of a classification model, several key metrics are used,
including accuracy, precision, recall, F1 score, and the confusion matrix. Each of
these metrics offers unique insights into how well the model is performing, especially in
situations with class imbalances or varying importance of different types of errors.

1. Confusion Matrix
● Definition: A confusion matrix is a table that describes the performance of a
classification model by comparing predicted labels to actual labels.

It consists of four components:

● True Positives (TP): The number of correct positive predictions (predicted as positive and actually positive).
● True Negatives (TN): The number of correct negative predictions (predicted as
negative and actually negative).
● False Positives (FP) (Type I Error): The number of incorrect positive predictions
(predicted as positive but actually negative).
● False Negatives (FN) (Type II Error): The number of incorrect negative
predictions (predicted as negative but actually positive).
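
From these four counts, the metrics mentioned above follow directly. A small sketch; the counts are illustrative, not from any real dataset.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)            # of predicted positives, how many were right
    recall    = tp / (tp + fn)            # of actual positives, how many were found
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```
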
2. Clustering

● Definition: Clustering is an unsupervised learning task that groups a set of data objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.
● Process:
○ Unlike classification, clustering does not use predefined labels. The algorithm
identifies patterns and structures in the data to form clusters based on
similarity measures.
● Applications:
○ Market segmentation (grouping customers based on purchasing behavior).
○ Image segmentation (dividing an image into meaningful parts).
○ Social network analysis (identifying communities within networks).
● Techniques:
○ k-Means Clustering
○ Hierarchical Clustering (Agglomerative, Divisive)
○ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

3. Regression

● Definition: Regression is a supervised learning task where the goal is to predict a continuous numerical value based on input data.
● Process:
○ A regression model learns from historical data and estimates the relationship
between the input variables (independent variables) and the output
(dependent variable).
● Applications:
○ Predicting house prices based on features like size, location, and amenities.
○ Forecasting sales revenue based on historical sales data.
○ Estimating the demand for electricity or other utilities.
● Techniques:
○ Linear Regression
○ Polynomial Regression
○ Decision Tree Regression
○ Neural Networks for Regression
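
A minimal least-squares regression sketch with NumPy on made-up house-price data (sizes and prices are purely illustrative).

```python
import numpy as np

# Made-up training data: house size (sq. ft.) vs. price (in lakhs).
size  = np.array([500, 750, 1000, 1250, 1500], dtype=float)
price = np.array([25, 34, 46, 55, 63], dtype=float)

# Fit a straight line price = a*size + b by ordinary least squares.
a, b = np.polyfit(size, price, deg=1)
print(f"predicted price for 1100 sq. ft.: {a * 1100 + b:.1f} lakhs")
```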

4. Association Rule Mining

● Definition: Association rule mining discovers relationships or associations between variables in large datasets. It is commonly used in market basket analysis to find which items are frequently bought together.
● Process:
○ The algorithm identifies sets of items (itemsets) that frequently appear
together in transactions. Then, it generates association rules, which are
implications of the form: "If item A is bought, item B is likely to be bought."
● Applications:
○ Market Basket Analysis (finding frequent product combinations).
○ Recommender systems (recommending products based on customer
behavior).
○ Fraud detection (detecting unusual patterns of transactions).
● Techniques:
○ Apriori Algorithm
○ FP-Growth Algorithm
○ Eclat Algorithm

5. Anomaly Detection

● Definition: Anomaly detection identifies data points or patterns that deviate significantly from the expected behavior. These anomalies could indicate fraud, errors, or novel events.
● Process:
○ Anomaly detection algorithms analyze historical data to understand normal
behavior, then flag outliers or unusual data points as anomalies.
● Applications:
○ Fraud detection (identifying fraudulent transactions).
○ Network security (detecting unusual patterns in network traffic).
○ Equipment maintenance (detecting equipment malfunctions through
abnormal sensor data).
● Techniques:
○ Statistical Methods (Z-score)
○ Machine Learning Methods (Isolation Forest, One-Class SVM)
○ Distance-Based Methods (k-Nearest Neighbors)
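
A minimal z-score sketch with NumPy on made-up sensor readings; points more than two standard deviations from the mean are flagged as anomalies.

```python
import numpy as np

# Illustrative sensor readings with one obvious outlier.
readings = np.array([20.1, 19.8, 20.3, 20.0, 35.7, 20.2, 19.9])

z_scores = (readings - readings.mean()) / readings.std()
anomalies = readings[np.abs(z_scores) > 2]   # flag points more than 2 std. devs. from the mean
print(anomalies)                             # -> [35.7]
```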

Summary of Data Mining Task Primitives:

● Classification: Assigns labels to data based on learned patterns.


● Clustering: Groups similar data points without predefined labels.
● Regression: Predicts continuous values based on input features.
● Association Rule Mining: Discovers relationships between variables in large
datasets.
● Anomaly Detection: Identifies unusual or abnormal data points that deviate from
the norm.

2. Data Mining Architecture:

The architecture includes the following layers:

● Data sources: Raw data collected from databases, flat files, or other sources.
● Database/Data Warehouse Server: Stores and manages the data.
● Knowledge Base: Contains domain knowledge, used to guide the mining process.
● Data Mining Engine: The core component that performs tasks like classification,
clustering, association rule mining, etc.
● Pattern Evaluation Module: Interprets patterns and determines their validity and
usefulness.
● User Interface: Allows interaction with the system and visualization of results.

3. KDD Process (Knowledge Discovery in Databases):

This process involves multiple steps:

● Data Selection: Choosing relevant data from the database.


● Data Preprocessing: Cleaning and preparing data for mining.
● Data Transformation: Converting data into appropriate forms for mining.
● Data Mining: Applying algorithms to extract patterns.
● Pattern Evaluation: Evaluating the extracted patterns for their importance.
● Knowledge Representation: Presenting the knowledge in an understandable
form.

4. Issues in Data Mining:

Common challenges include:

● Data quality: Incomplete, noisy, or inconsistent data can affect mining results.
● Performance: Scalability and efficiency when mining large datasets.
● Privacy and security: Concerns over sensitive information being revealed.
● Interpretation of results: Ensuring that the patterns discovered are useful and
interpretable.

5. Applications of Data Mining:

● Business: Market basket analysis, customer segmentation, fraud detection.


● Healthcare: Disease prediction, patient behavior analysis.
● Education: Student performance prediction, personalized learning paths.
● Finance: Risk management, stock market prediction.
● Social Media: Sentiment analysis, trend detection.

6. Data Exploration:
● Types of Attributes:
○ Nominal: Categorical data without order (e.g., color, gender).
○ Ordinal: Categorical data with a meaningful order (e.g., ranking).
○ Interval: Numeric data where the difference is meaningful but no true zero
point (e.g., temperature in Celsius).
○ Ratio: Numeric data with a true zero point (e.g., age, salary).
● Statistical Description of Data: Basic statistics like mean, median, mode,
standard deviation, etc., are used to describe data properties.
● Data Visualization: Graphical representation of data (e.g., histograms, scatter
plots) to identify patterns and trends visually.
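
A quick sketch with pandas and matplotlib on made-up ages, combining a statistical summary with a simple histogram.

```python
import pandas as pd
import matplotlib.pyplot as plt

ages = pd.Series([23, 25, 25, 31, 34, 35, 40, 41, 47, 52], name="age")

print(ages.describe())                 # count, mean, std, min, quartiles, max
print("mode:", ages.mode().tolist())

ages.plot(kind="hist", bins=5, title="Age distribution")   # simple histogram
plt.show()
```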

7. Data Preprocessing:

● Descriptive Data Summarization: Summarizing the main features of the data using measures like mean, median, and mode.
● Data Cleaning: Handling missing values, noise, and inconsistencies.
● Data Integration and Transformation: Combining data from multiple sources
and transforming it into a suitable format.
● Data Reduction: Reducing the volume of data while maintaining its integrity. This
includes methods like dimensionality reduction, aggregation, and data compression.
● Data Discretization and Concept Hierarchy Generation: Reducing the
number of distinct values in an attribute (discretization) and grouping data into
meaningful hierarchies (e.g., from individual products to product categories).
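
A small cleaning-and-transformation sketch with pandas on made-up records, covering missing-value filling, text standardization, and min-max normalization.

```python
import pandas as pd

# Illustrative raw records with missing and inconsistent values.
df = pd.DataFrame({"income": [30000, None, 52000, 61000],
                   "city":   ["pune", "Pune ", "Mumbai", None]})

# Cleaning: fill missing numerics with the median, standardize text, drop unusable rows.
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].str.strip().str.title()
df = df.dropna(subset=["city"])

# Transformation: min-max normalization of income to the [0, 1] range.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)
```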

8. Data Reduction, Discretization and Concept Hierarchy Generation:

● Data Reduction:
○ Dimensionality Reduction: Reducing the number of features or attributes
while retaining essential information, often using techniques like PCA
(Principal Component Analysis).
○ Numerosity Reduction: Representing data in a compact form by using
parametric models (e.g., regression) or non-parametric methods (e.g.,
clustering).
○ Data Compression: Encoding data more efficiently to reduce its size
without losing significant information.
● Data Discretization: Dividing continuous data into discrete intervals or categories.
It helps transform numeric attributes into categorical ones. Techniques include
Binning, Histogram Analysis, and Cluster Analysis.
● Concept Hierarchy Generation: Organizing data into different levels of
abstraction, often visualized in a hierarchical structure. For instance, "City → State →
Country" represents a concept hierarchy for geographical data.

_______________________________________________________________

Module 3: Classification

Basic Concepts, Decision Tree Induction, Naïve Bayesian Classification, Accuracy and Error
measures, Evaluating the Accuracy of a Classifier: Holdout & Random Subsampling, Cross
Validation, Bootstrap.
1. Classification: Basic Concepts:

Classification is a supervised learning technique where the model learns from labeled
training data to predict the category or class of new, unseen instances. The key goal is to map
input features (attributes) to a discrete class label.

● Examples: Predicting whether an email is spam or not, classifying patients as having a disease or not based on medical tests.

2. Decision Tree Induction:

● A Decision Tree is a tree-like model where each node represents a feature (attribute), each branch represents a decision rule, and each leaf node represents an outcome (class label).
● Induction refers to the process of creating the tree based on the training data.
Popular algorithms like ID3, C4.5, or CART are used to split the data at each node
by selecting the best attribute that minimizes impurity (measured by criteria like
Gini Index or Entropy).
● Advantages: Easy to interpret and visualize, handles both categorical and numerical
data.
● Disadvantages: Can overfit the training data if not pruned.
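
A minimal induction sketch with scikit-learn on its built-in Iris dataset, using a CART-style tree with the Gini index as the splitting criterion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# CART-style tree; limiting depth is a simple way to reduce overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```
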
5. Evaluating the Accuracy of a Classifier:

To evaluate how well a classifier performs, especially on unseen data, various validation
techniques are used.

● Holdout Method: The data is split into two sets: a training set and a test set. The
classifier is trained on the training set and evaluated on the test set. However, this
method may lead to high variance because the accuracy depends heavily on how the
data is split.
● Random Subsampling: The data is randomly split into training and test sets
multiple times. The classifier's accuracy is averaged across multiple runs to get a
more reliable estimate.
● Cross Validation: The data is divided into k subsets (or folds). The classifier is
trained on k-1 subsets and tested on the remaining subset. This process is repeated k times, with each subset used exactly once as the test set. The final accuracy is the average across all k iterations. 10-fold cross-validation is a
common choice.
● Bootstrap: A resampling technique where multiple random samples are drawn
with replacement from the dataset to create multiple training sets. The classifier is
trained on each sample, and accuracy is averaged across samples. Bootstrap gives an
estimate of accuracy and variance.

6. Holdout & Random Subsampling (continued):

● Holdout Method: This method is simple and efficient for large datasets. The
dataset is typically split into a training set (around 70%-80% of the data) and a test
set (20%-30%). A limitation of the holdout method is that the model's performance
may vary depending on how the data is split, and some important patterns might be
missed.
● Random Subsampling: This is an extension of the holdout method where the
process of splitting data into training and test sets is repeated multiple times. Each
time, the classifier's performance is evaluated, and the final accuracy is the average
across all iterations. This helps reduce the variability introduced by the random split
in the holdout method, but it can still lead to bias if the training sets are not
representative.

7. Cross Validation:

● k-Fold Cross Validation: The data is divided into k equal-sized parts or folds. For
each of the k iterations, one fold is used as the test set, and the remaining k-1 folds
are used as the training set. After all k iterations, the results are averaged to provide
an overall accuracy estimate. This method ensures that all data points are used both
for training and testing, providing a more reliable evaluation.
○ 10-fold cross-validation is widely used because it strikes a balance between the reliability of the accuracy estimate and the computational cost.
● Leave-One-Out Cross Validation (LOO-CV): A special case of cross-validation
where k is equal to the number of data points. For each iteration, only one data point
is used as the test set, and the rest are used for training. This method is very thorough
but can be computationally expensive for large datasets.
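
A 10-fold cross-validation sketch with scikit-learn, again using the Iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold serves exactly once as the test set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```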

8. Bootstrap:

● In bootstrap resampling, several training datasets are created by randomly selecting samples with replacement from the original dataset. This means that the same instance can appear multiple times in the training set.
● Typically, for a dataset with n instances, we take n random samples to form a
training set, and the instances not selected form the test set (about 1/3 of the data).
The model is trained on the bootstrapped training set, and performance is evaluated
on the test set.
● This process is repeated multiple times, and the performance measures (accuracy,
error rates) are averaged across all iterations.
● Advantages: It helps reduce bias and variance in the performance estimates and
works well even with small datasets.
● Disadvantages: Computationally more intensive.
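
A bootstrap sketch with NumPy and scikit-learn: each round trains on n indices sampled with replacement and tests on the out-of-bag rows (roughly one third of the data). The number of rounds (50) is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
accuracies = []

for _ in range(50):                               # 50 bootstrap rounds
    idx = rng.integers(0, n, size=n)              # sample n indices with replacement
    oob = np.setdiff1d(np.arange(n), idx)         # out-of-bag rows form the test set
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))

print("bootstrap accuracy estimate:", np.mean(accuracies).round(3))
```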

Summary of Classifier Evaluation Methods:


1. Holdout Method: Single split of the data into training and test sets.
2. Random Subsampling: Multiple holdout evaluations, averaging the results.
3. Cross-Validation: Divides the data into k subsets for more reliable results.
4. Bootstrap: Creates multiple training sets by sampling with replacement, useful for
small datasets.

_______________________________________________________________

Module 4 - Clustering

Types of data in Cluster analysis, Partitioning Methods (k-Means, k-Medoids), Hierarchical Methods (Agglomerative, Divisive).

1. Clustering:
Clustering is an unsupervised learning technique used to group similar data points
together without predefined labels. The goal is to partition data into meaningful subgroups,
where data points in the same cluster are more similar to each other than to those in other
clusters.

2. Types of Data in Cluster Analysis:

Cluster analysis can be applied to different types of data, including:

● Numerical Data: Data with continuous numerical values (e.g., age, income,
temperature).
● Categorical Data: Data that has distinct categories or labels (e.g., gender, color).
● Mixed-Type Data: A combination of numerical and categorical data, often seen in
real-world datasets.

3. Partitioning Methods:

These methods divide the dataset into a set of non-overlapping clusters such that each data
point belongs to exactly one cluster. Two popular partitioning methods are k-Means and
k-Medoids.

a. k-Means:

● How it works:
1. Initialization: Choose k initial cluster centroids randomly.
2. Assignment: Assign each data point to the nearest cluster centroid based on
a distance metric (typically Euclidean distance).
3. Update: Recompute the cluster centroids by averaging the data points in
each cluster.
4. Repeat: The process of assignment and updating is repeated until the
centroids stabilize (i.e., the cluster assignments no longer change).
● Advantages: Simple and efficient for large datasets.
● Disadvantages: Sensitive to the initial selection of centroids and may converge to
local optima. It also works best with spherical-shaped clusters and is affected by
outliers.
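
A minimal k-Means sketch with scikit-learn on two made-up groups of 2-D points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually separable groups of 2-D points (illustrative data).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)           # assign each point to its nearest centroid
print("cluster labels:", labels)
print("centroids:\n", kmeans.cluster_centers_)
```
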
b. k-Medoids (PAM – Partitioning Around Medoids):

● How it works:
1. Similar to k-Means, but instead of using the mean of the data points to define
the centroid, k-Medoids uses actual data points (medoids) as cluster centers.
2. The medoid is the point in the cluster whose average dissimilarity to all other
points in the cluster is minimal.
3. The algorithm iterates by swapping points with medoids and reassigning
points until no further improvement is possible.
● Advantages: More robust than k-Means because it minimizes the effect of outliers
and noise.
● Disadvantages: More computationally expensive than k-Means.

4. Hierarchical Methods:

Hierarchical clustering creates a hierarchy of clusters that can be represented in a tree-like structure called a dendrogram. There are two main types: agglomerative (bottom-up) and divisive (top-down).

a. Agglomerative (Bottom-Up):

● How it works:
1. Start with each data point as its own individual cluster (i.e., each point is a
cluster).
2. At each step, merge the two clusters that are the closest (based on a distance
metric like Euclidean distance, Manhattan distance, etc.).
3. Repeat until all points are merged into a single cluster.
● Advantages: Does not require specifying the number of clusters in advance. Can
capture the hierarchical structure in data.
● Disadvantages: Computationally expensive for large datasets, as the algorithm
requires recalculating distances at each step.
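
An agglomerative sketch with SciPy using average linkage and Euclidean distance, cutting the dendrogram so that two clusters remain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Bottom-up merging using Euclidean distance and average linkage.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram so that two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)        # e.g. [1 1 1 2 2 2]
```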

b. Divisive (Top-Down):

● How it works:
1. Start with all data points in one cluster.
2. At each step, split the cluster into two smaller clusters (based on dissimilarity
or distance between points).
3. Repeat until each point is in its own cluster or a stopping condition is met.
● Advantages: Can be more efficient than agglomerative methods in some cases and
also captures hierarchical relationships.
● Disadvantages: Less commonly used than agglomerative methods, and it can be
sensitive to how the splits are made.

Key Differences Between Partitioning and Hierarchical Methods:

● Partitioning Methods (like k-Means and k-Medoids) require the number of clusters to be specified in advance, while hierarchical methods do not.
● Hierarchical Methods create a nested structure of clusters, which allows for
different levels of granularity in the clustering.
● k-Means is fast and efficient but sensitive to the initial choice of clusters, while
k-Medoids is more robust but computationally slower.

These clustering techniques are useful in a variety of fields, including customer segmentation, image segmentation, and market basket analysis. Each method has its own strengths and is chosen based on the type of data and the desired clustering output.

_______________________________________________________________

Module 5- Mining frequent patterns and associations

Market Basket Analysis, Frequent Item sets, Closed Item sets, and Association Rule,
Frequent Pattern Mining, Apriori Algorithm, Association Rule Generation, Improving the
Efficiency of Apriori, Mining Frequent Itemsets without candidate generation,
Introduction to Mining Multilevel Association Rules and Mining Multidimensional
Association Rules.

1. Mining Frequent Patterns and Associations:

● Frequent pattern mining involves finding recurring patterns, correlations, or associations in data. This is often applied in transactional datasets, such as retail data, to discover how items co-occur, leading to actionable insights.

2. Market Basket Analysis:

● Market Basket Analysis is a type of frequent pattern mining used in retail to identify items that are frequently bought together by customers. For example, if customers who buy bread also frequently buy butter, a store might consider placing these items near each other to boost sales.
● Association Rules are generated in this context, showing how the purchase of one
item is associated with another. An example rule could be: "If a customer buys X,
they are likely to also buy Y."

3. Frequent Itemsets:

● A Frequent Itemset is a set of items that appear together in a dataset with a frequency greater than a predefined minimum support threshold. In Market Basket Analysis, frequent itemsets represent products often bought together.
Example:
○ If customers frequently buy "bread" and "butter" together, then {"bread",
"butter"} is a frequent itemset.

4. Closed Itemsets:

● A Closed Itemset is a frequent itemset where no proper superset of the itemset has
the same support count. In other words, it’s an itemset that cannot be extended by
adding more items without decreasing its frequency.
Example:
○ If the itemset {"bread", "butter"} appears 50 times and adding "milk" makes it
appear fewer than 50 times, then {"bread", "butter"} is a closed itemset.

5. Association Rule:

● An Association Rule is an implication of the form X → Y, meaning that transactions containing itemset X tend to also contain itemset Y (e.g., "If a customer buys bread, they are likely to also buy butter"). Rules are evaluated using support (how often X and Y occur together) and confidence (how often Y appears in transactions that contain X).

6. Frequent Pattern Mining:

● This is the process of discovering frequent itemsets from large datasets. The most
well-known algorithm for frequent pattern mining is the Apriori Algorithm.

7. Apriori Algorithm:

● How it works:
1. It starts by finding all single items (1-itemsets) that meet the minimum
support threshold.
2. It then generates larger itemsets (2-itemsets, 3-itemsets, etc.) by combining
the frequent itemsets of the previous iteration and checking their support.
3. This process continues until no more frequent itemsets can be generated.
● Downward Closure Property: Apriori relies on the property that all non-empty
subsets of a frequent itemset must also be frequent. This allows the algorithm to
reduce the search space by eliminating infrequent itemsets early.
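
A compact Apriori-style sketch in plain Python over a toy transaction list. It applies the downward-closure idea by only combining itemsets that were frequent at the previous level; candidate generation is simplified relative to the full algorithm.

```python
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"bread", "milk"}, {"butter", "milk"}, {"bread", "butter"},
]
min_support = 0.4   # an itemset must appear in at least 40% of transactions

def support(itemset):
    """Fraction of transactions that contain the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Level k: combine frequent (k-1)-itemsets and keep those that meet min_support.
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent[:-1]:
    for itemset in level:
        print(set(itemset), round(support(itemset), 2))
```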

8. Association Rule Generation:

● After finding frequent itemsets using the Apriori Algorithm, association rules are
generated by identifying subsets of the frequent itemsets that satisfy the minimum
confidence threshold.
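
For a rule X → Y, confidence is support(X ∪ Y) / support(X). A tiny sketch with illustrative support counts (not real data):

```python
# Illustrative support counts out of 100 transactions.
support_counts = {frozenset(["bread"]): 60, frozenset(["bread", "butter"]): 45}

def confidence(antecedent, consequent):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    both = support_counts[frozenset(antecedent) | frozenset(consequent)]
    return both / support_counts[frozenset(antecedent)]

# Confidence of the rule {bread} -> {butter}: 45 / 60 = 0.75.
print(confidence({"bread"}, {"butter"}))
```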

9. Improving the Efficiency of Apriori:

● The basic Apriori Algorithm can be inefficient, as it generates a large number of candidate itemsets and repeatedly scans the dataset. Some techniques to improve its efficiency include:
○ Reducing Candidate Generation: Only generating candidates from
frequent itemsets, rather than generating all possible itemsets.
○ Transaction Reduction: A transaction that contains no frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be skipped in later scans.
○ Hash-based techniques: Using hash structures to reduce the number of
candidate itemsets.

10. Mining Frequent Itemsets Without Candidate Generation:

● FP-Growth (Frequent Pattern Growth) is an alternative to the Apriori algorithm that avoids candidate generation. It uses a compressed representation of the dataset called an FP-tree (Frequent Pattern tree) and recursively extracts the frequent itemsets from it.
○ Advantages: FP-Growth is faster than Apriori because it eliminates the need
to generate candidate itemsets and avoids multiple database scans.

11. Mining Multilevel Association Rules:

● Multilevel Association Rules mine associations at different levels of abstraction.


For instance, you might discover that customers who buy "laptops" also tend to buy
"mouse" (specific level), but at a higher level, you might find that people who buy
"electronics" also tend to buy "accessories" (general level).
Example:
○ Level 1: "Electronics → Accessories"
○ Level 2: "Laptops → Mouse"
● Mining multilevel rules requires defining minimum support thresholds for each level,
as higher-level rules tend to be more frequent than lower-level ones.

12. Mining Multidimensional Association Rules:

● Multidimensional Association Rules involve multiple dimensions or attributes, such as time, location, or customer demographic data, in addition to the items themselves.
Example:
○ "Customers from New York who are between 25-35 years old are likely to buy
laptops and headphones together."
● These rules allow for more complex patterns to be discovered, integrating different
aspects of the data.

Summary of Key Concepts:

● Frequent Itemsets: Groups of items that frequently appear together in transactions.
● Association Rules: Implications showing relationships between itemsets.
● Apriori Algorithm: A popular method for discovering frequent itemsets by
generating candidates.
● FP-Growth: A more efficient algorithm for mining frequent itemsets without
candidate generation.
● Multilevel and Multidimensional Rules: Allow for discovering patterns at
different levels of abstraction and across multiple dimensions.

_______________________________________________________________

Module 6 - Web Mining

Introduction, Web Content Mining: Crawlers, Harvest System, Virtual Web View,
Personalization, Web Structure Mining: Page Rank, Clever, Web Usage Mining.

Web Mining

Web Mining is the process of discovering useful and relevant patterns, knowledge, and
insights from the web by extracting data from web pages, web structure, and user
interactions. It is broadly divided into three categories: Web Content Mining, Web
Structure Mining, and Web Usage Mining.

1. Web Content Mining:

● Definition: Web content mining involves extracting useful information from the
content of web pages. This content could include text, images, audio, video, and other
multimedia formats.
● Techniques: It uses techniques from natural language processing (NLP),
information retrieval, and machine learning to process and mine useful data.

Key Concepts:

● Crawlers:
○ Crawlers (also known as spiders or bots) are automated programs that
systematically browse the web to collect data from websites. They start with a
set of seed URLs and follow the links to gather new web pages for mining or
indexing.
○ Example: Search engines like Google use crawlers to gather and index the vast amount of information on the web. (A minimal crawler sketch appears after this list.)
● Harvest System:
○ A Harvest System is a web resource discovery and indexing system that
combines web crawlers with data repositories to provide content-based
indexing of distributed web resources. It allows efficient retrieval of
large-scale web data.
● Virtual Web View:
○ This refers to the creation of a customized, virtual view of the web that is
tailored for specific users or user groups. It essentially "filters" content to
provide only the relevant information based on user preferences or needs.
● Personalization:
○ Personalization is the process of adapting the web content or services to the
needs of specific users or user groups. It can be based on previous browsing
behavior, user profiles, or direct input from users.
○ Example: Online stores like Amazon recommend products to users based on
their past purchases and browsing history.
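
As noted under Crawlers above, here is a minimal breadth-first crawler sketch using requests and BeautifulSoup. The seed URL and page limit are placeholders, and a real crawler would also respect robots.txt and politeness delays.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    """Start from a seed URL, fetch pages, and follow the links found on them."""
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                                  # skip unreachable pages
        print("fetched:", url)
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])            # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://example.com")                          # placeholder seed URL
```
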
2. Web Structure Mining:

● Definition: Web structure mining focuses on the structure of hyperlinks within the
web, i.e., the connections between different web pages. It uses graph theory to model
the web as a directed graph where web pages are nodes, and hyperlinks are the edges
connecting the nodes.

Key Concepts:

● PageRank:
○ PageRank is an algorithm used by Google to rank web pages in search
results based on their importance. The rank of a web page depends on the
number and quality of links pointing to it.
○ The basic idea is that if a page has many high-quality links pointing to it, it is likely more important and should rank higher in search results. (A minimal PageRank sketch appears after this list.)
● Clever Algorithm:
○ The Clever Algorithm (also known as HITS – Hyperlink-Induced Topic
Search) is another link analysis algorithm that distinguishes between two
types of web pages:
■ Authorities: Pages that are highly relevant to a specific topic.
■ Hubs: Pages that link to multiple authority pages.
○ The Clever algorithm ranks pages based on their hub and authority scores,
where hubs link to many authorities, and authorities are linked by many hubs.
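
A minimal PageRank power-iteration sketch on a tiny hypothetical three-page link graph (damping factor 0.85; 50 iterations chosen arbitrarily).

```python
# Tiny hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
damping, pages = 0.85, list(links)
rank = {p: 1 / len(pages) for p in pages}            # start with equal ranks

for _ in range(50):                                   # iterate until ranks stabilize
    new_rank = {}
    for p in pages:
        # Each page shares its rank equally among the pages it links to.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})      # C collects the most link weight
```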

3. Web Usage Mining:

● Definition: Web usage mining involves analyzing user behavior on websites by examining web logs, clickstreams, and other user activity data. This helps in understanding how users interact with websites and can be used to improve the user experience and increase site performance.

Key Concepts:

● Web Logs:
○ Web logs record user interactions with a website. They typically contain
information like the user’s IP address, pages visited, time spent on each page,
and the sequence of clicks.
○ Analysis of web logs can provide insights into user navigation patterns,
frequently visited pages, and traffic sources.
● Clickstream Analysis:
○ Clickstream analysis tracks the sequence of clicks a user makes as they
navigate through a website. This data is used to analyze user behavior and
preferences.
● Applications:
○ Personalization: Web usage mining can help personalize content for users
based on their past behavior.
○ Improving Website Design: Understanding user navigation paths can lead
to better website design, improving the user experience.
○ Recommendation Systems: Usage mining can be used to recommend
products, content, or services to users based on their previous activity.

Summary of Key Concepts:

● Web Content Mining: Extracts data from web page content using crawlers and
personalization systems.
● Web Structure Mining: Analyzes the structure of links on the web to rank pages
using algorithms like PageRank and Clever.
● Web Usage Mining: Analyzes user behavior on websites using logs and
clickstreams to improve user experience and personalization.
