DWM Notes
Introduction to Data Warehouse, Data warehouse architecture, Data warehouse versus Data
Marts, E-R Modeling versus Dimensional Modeling, Information Package Diagram, Data
Warehouse Schemas; Star Schema, Snowflake Schema, Factless Fact Table, Fact
Constellation Schema. Update to the dimension tables. Major steps in ETL process, OLTP
versus OLAP, OLAP operations: Slice, Dice, Rollup, Drilldown and Pivot.
The data warehouse architecture is the framework that defines how data is collected, stored, and accessed. It typically has four main layers:
1. Data Source Layer: This is where data comes from different places like databases,
flat files, or external sources (e.g., sales systems, customer records). The data can be
structured or unstructured.
2. Data Staging Layer: This is a temporary area where raw data is cleaned,
transformed, and prepared for storage. The process is called ETL (Extract,
Transform, Load). Here, the data is made consistent and usable.
3. Data Storage Layer: This is where the cleaned and transformed data is stored in
the data warehouse. The data is often stored in databases organized into schemas
like star schema or snowflake schema.
4. Data Presentation/Access Layer: This layer is where end-users or applications
access the stored data for reporting, analysis, and decision-making. It includes tools
for generating reports, dashboards, and performing OLAP (Online Analytical
Processing) operations.
In simple terms, the architecture flows like this: data sources → staging area → data
warehouse storage → user access for reporting/analysis.
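To make this flow concrete, the following minimal Python sketch walks one record set through extract, transform, and load. The file name sales_source.csv, its region/amount columns, and the SQLite warehouse.db table are illustrative assumptions, not part of these notes.

```python
# A minimal ETL sketch: extract raw rows from a flat file, transform them,
# and load them into a SQLite table acting as the warehouse storage layer.
# (File name, column names, and table layout are hypothetical.)
import csv
import sqlite3

def extract(path):
    # Extract: read raw records from a source flat file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and standardize the raw records.
    cleaned = []
    for row in rows:
        if not row.get("amount"):          # drop incomplete records
            continue
        cleaned.append({
            "region": row["region"].strip().upper(),   # make values consistent
            "amount": float(row["amount"]),            # enforce a numeric type
        })
    return cleaned

def load(rows, db="warehouse.db"):
    # Load: store the prepared records in the warehouse table.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:region, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_source.csv")))
```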
Data Warehouse versus Data Marts:
● Data Warehouse: A large, centralized system that stores data from various departments or sources.
● Data Mart: A smaller, more focused subset of the data warehouse, tailored for
specific departments or business areas (e.g., marketing, sales).
Information Package Diagram: This diagram helps in identifying the key dimensions and facts required for decision-making. It shows the relationships between business processes, dimensions, and metrics, forming the foundation of dimensional modeling.
Data Warehouse Schemas (a small star-schema sketch follows this list):
● Star Schema: A central fact table is linked to several dimension tables. It is simple and easy to navigate.
● Snowflake Schema: An extension of the star schema where dimension tables are
normalized into multiple related tables, reducing redundancy.
● Factless Fact Table: A fact table that doesn't contain measurable facts but captures
event occurrences or conditions.
● Fact Constellation Schema: Multiple fact tables share dimension tables, forming
a more complex schema used in enterprise-level data warehouses.
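To make the star schema concrete, here is a small pandas sketch in which a fact table references two dimension tables through surrogate keys; all table and column names are illustrative assumptions.

```python
# A minimal star-schema sketch using pandas (table/column names are hypothetical).
import pandas as pd

# Dimension tables: descriptive attributes keyed by surrogate keys.
dim_product = pd.DataFrame({"product_key": [1, 2], "product_name": ["Bread", "Butter"]})
dim_date = pd.DataFrame({"date_key": [10, 11], "month": ["Jan", "Feb"]})

# Fact table: foreign keys to the dimensions plus numeric measures.
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "date_key": [10, 10, 11],
    "units_sold": [5, 2, 7],
})

# A typical star-join: attach dimension attributes to the facts for analysis.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key"))
print(report.groupby(["month", "product_name"])["units_sold"].sum())
```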
_______________________________________________________________
Data Mining Task Primitives, Architecture, KDD process, Issues in Data Mining,
Applications of Data Mining, Data Exploration: Types of Attributes, Statistical Description of
Data, Data Visualization, Data Preprocessing: Descriptive data summarization, Cleaning,
Integration & transformation, Data reduction, Data Discretization and Concept hierarchy
generation.
Data mining task primitives are the core functionalities or operations that define the
types of patterns or knowledge that can be discovered during the data mining process. These
task primitives help specify the data mining tasks and guide the mining process to extract
meaningful patterns from large datasets. Here are the key task primitives used in data
mining:
1. Classification: Assigning data objects to predefined classes on the basis of a model learned from labeled training data.
When evaluating the performance of a classification model, several key metrics are used, including accuracy, precision, recall, F1 score, and the confusion matrix. Each of these metrics offers unique insights into how well the model is performing, especially in situations with class imbalances or varying importance of different types of errors.
● Confusion Matrix: A table that describes the performance of a classification model by comparing predicted labels to actual labels (a small worked example follows this list).
3. Regression: Predicting a continuous numeric value, rather than a discrete class label, from the input attributes.
5. Anomaly Detection: Identifying data objects that deviate markedly from the general behaviour of the data, such as outliers.
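The following small Python example shows how accuracy, precision, recall, and F1 are computed from the counts in a 2x2 confusion matrix; the counts themselves are made up for illustration.

```python
# Evaluation metrics computed directly from a 2x2 confusion matrix.
tp, fp = 40, 10   # predicted positive: correct / incorrect
fn, tn = 5, 45    # predicted negative: incorrect / correct

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
precision = tp / (tp + fp)                    # of predicted positives, how many were right
recall    = tp / (tp + fn)                    # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```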
The architecture of a data mining system typically includes the following components:
● Data sources: Raw data collected from databases, flat files, or other sources.
● Database/Data Warehouse Server: Stores and manages the data.
● Knowledge Base: Contains domain knowledge, used to guide the mining process.
● Data Mining Engine: The core component that performs tasks like classification,
clustering, association rule mining, etc.
● Pattern Evaluation Module: Interprets patterns and determines their validity and
usefulness.
● User Interface: Allows interaction with the system and visualization of results.
Major issues in data mining include:
● Data quality: Incomplete, noisy, or inconsistent data can affect mining results.
● Performance: Scalability and efficiency when mining large datasets.
● Privacy and security: Concerns over sensitive information being revealed.
● Interpretation of results: Ensuring that the patterns discovered are useful and
interpretable.
6. Data Exploration:
● Types of Attributes:
○ Nominal: Categorical data without order (e.g., color, gender).
○ Ordinal: Categorical data with a meaningful order (e.g., ranking).
○ Interval: Numeric data where the difference is meaningful but no true zero
point (e.g., temperature in Celsius).
○ Ratio: Numeric data with a true zero point (e.g., age, salary).
● Statistical Description of Data: Basic statistics like mean, median, mode,
standard deviation, etc., are used to describe data properties.
● Data Visualization: Graphical representation of data (e.g., histograms, scatter
plots) to identify patterns and trends visually.
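A short pandas sketch of these descriptive statistics, using a made-up numeric attribute:

```python
# Basic statistical description of a numeric attribute with pandas.
import pandas as pd

ages = pd.Series([23, 25, 25, 31, 40, 52, 25, 31])

print("mean:  ", ages.mean())            # arithmetic average
print("median:", ages.median())          # middle value
print("mode:  ", ages.mode().tolist())   # most frequent value(s)
print("std:   ", ages.std())             # spread around the mean
# ages.plot(kind="hist") would give a quick histogram for visual exploration.
```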
7. Data Preprocessing:
● Data Reduction:
○ Dimensionality Reduction: Reducing the number of features or attributes
while retaining essential information, often using techniques like PCA
(Principal Component Analysis).
○ Numerosity Reduction: Representing data in a compact form by using
parametric models (e.g., regression) or non-parametric methods (e.g.,
clustering).
○ Data Compression: Encoding data more efficiently to reduce its size
without losing significant information.
● Data Discretization: Dividing continuous data into discrete intervals or categories.
It helps transform numeric attributes into categorical ones. Techniques include
Binning, Histogram Analysis, and Cluster Analysis.
● Concept Hierarchy Generation: Organizing data into different levels of
abstraction, often visualized in a hierarchical structure. For instance, "City → State →
Country" represents a concept hierarchy for geographical data.
_______________________________________________________________
Module 3: Classification
Basic Concepts, Decision Tree Induction, Naïve Bayesian Classification, Accuracy and Error
measures, Evaluating the Accuracy of a Classifier: Holdout & Random Subsampling, Cross
Validation, Bootstrap.
1. Classification: Basic Concepts:
Classification is a supervised learning technique where the model learns from labeled
training data to predict the category or class of new, unseen instances. The key goal is to map
input features (attributes) to a discrete class label.
To evaluate how well a classifier performs, especially on unseen data, various validation
techniques are used.
● Holdout Method: The data is split into a training set (typically 70%-80% of the data) and a test set (20%-30%). The classifier is trained on the training set and evaluated on the test set. This method is simple and efficient for large datasets, but the accuracy estimate can vary considerably depending on how the data is split, and some important patterns may be missed.
● Random Subsampling: An extension of the holdout method in which the data is randomly split into training and test sets multiple times. The classifier's accuracy is averaged across all runs, which reduces the variability introduced by a single random split, although the estimate can still be biased if the training sets are not representative.
● Cross Validation: The data is divided into k subsets (folds). The classifier is trained on k-1 folds and tested on the remaining fold; this is repeated k times so that each fold is used exactly once as the test set, and the final accuracy is the average across all k iterations. This ensures that every data point is used for both training and testing, giving a more reliable evaluation (see the sketch after this list).
○ 10-fold cross-validation is widely used because it balances the reliability of the accuracy estimate against computational cost.
○ Leave-One-Out Cross Validation (LOO-CV): A special case of cross-validation where k equals the number of data points. Each iteration uses a single data point as the test set and the rest for training. It is very thorough but computationally expensive for large datasets.
● Bootstrap: A resampling technique where multiple random samples are drawn with replacement from the dataset to create multiple training sets. The classifier is trained on each sample, and accuracy is averaged across the samples, giving an estimate of both the accuracy and its variance.
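The evaluation techniques above can be sketched with scikit-learn; the dataset (Iris) and the decision tree classifier are placeholder choices, not prescribed by these notes.

```python
# Holdout, 10-fold cross-validation, and bootstrap accuracy estimation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: one train/test split (here 70% / 30%).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation: average accuracy over 10 train/test rotations.
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())

# Bootstrap: train on samples drawn with replacement, test on the left-out points.
accs = []
for i in range(50):
    idx = resample(np.arange(len(X)), replace=True, random_state=i)
    oob = np.setdiff1d(np.arange(len(X)), idx)      # out-of-bag points
    accs.append(clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))
print("bootstrap accuracy:", np.mean(accs))
```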
_______________________________________________________________
Module 4 - Clustering
1. Clustering:
Clustering is an unsupervised learning technique used to group similar data points
together without predefined labels. The goal is to partition data into meaningful subgroups,
where data points in the same cluster are more similar to each other than to those in other
clusters.
2. Types of Data in Clustering:
● Numerical Data: Data with continuous numerical values (e.g., age, income, temperature).
● Categorical Data: Data that has distinct categories or labels (e.g., gender, color).
● Mixed-Type Data: A combination of numerical and categorical data, often seen in
real-world datasets.
3. Partitioning Methods:
These methods divide the dataset into a set of non-overlapping clusters such that each data
point belongs to exactly one cluster. Two popular partitioning methods are k-Means and
k-Medoids.
a. k-Means:
● How it works:
1. Initialization: Choose k initial cluster centroids randomly.
2. Assignment: Assign each data point to the nearest cluster centroid based on
a distance metric (typically Euclidean distance).
3. Update: Recompute the cluster centroids by averaging the data points in
each cluster.
4. Repeat: The process of assignment and updating is repeated until the
centroids stabilize (i.e., the cluster assignments no longer change).
● Advantages: Simple and efficient for large datasets.
● Disadvantages: Sensitive to the initial selection of centroids and may converge to
local optima. It also works best with spherical-shaped clusters and is affected by
outliers.
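The steps above can be illustrated with a minimal NumPy sketch; the two-dimensional toy points are made up, and empty-cluster handling is omitted for brevity.

```python
# A minimal k-Means sketch: initialize, assign, update, repeat until stable.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. initialize
    for _ in range(n_iter):
        # 2. assignment: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. update: mean of the points assigned to each cluster
        # (empty-cluster handling is omitted in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # 4. converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels, centroids, sep="\n")
```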
b. k-Medoids (PAM – Partitioning Around Medoids):
● How it works:
1. Similar to k-Means, but instead of using the mean of the data points to define
the centroid, k-Medoids uses actual data points (medoids) as cluster centers.
2. The medoid is the point in the cluster whose average dissimilarity to all other
points in the cluster is minimal.
3. The algorithm iterates by swapping points with medoids and reassigning
points until no further improvement is possible.
● Advantages: More robust than k-Means because it minimizes the effect of outliers
and noise.
● Disadvantages: More computationally expensive than k-Means.
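As a rough illustration, here is a simplified k-Medoids sketch in the Voronoi-iteration style (it re-picks the best medoid within each cluster rather than performing the full PAM swap search); the toy points, including one outlier, are made up.

```python
# Simplified k-Medoids: assign points to the nearest medoid, then pick as the
# new medoid the cluster member with the smallest total distance to the others.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)          # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # medoid = member minimizing total dissimilarity within the cluster
            new_medoids[j] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):          # stop when medoids stabilize
            break
        medoids = new_medoids
    return labels, X[medoids]

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [30.0, 30.0]])
labels, centers = k_medoids(X, k=2)
print(labels, centers, sep="\n")
```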
4. Hierarchical Methods:
a. Agglomerative (Bottom-Up):
● How it works:
1. Start with each data point as its own individual cluster (i.e., each point is a
cluster).
2. At each step, merge the two clusters that are the closest (based on a distance
metric like Euclidean distance, Manhattan distance, etc.).
3. Repeat until all points are merged into a single cluster.
● Advantages: Does not require specifying the number of clusters in advance. Can
capture the hierarchical structure in data.
● Disadvantages: Computationally expensive for large datasets, as the algorithm
requires recalculating distances at each step.
b. Divisive (Top-Down):
● How it works:
1. Start with all data points in one cluster.
2. At each step, split the cluster into two smaller clusters (based on dissimilarity
or distance between points).
3. Repeat until each point is in its own cluster or a stopping condition is met.
● Advantages: Can be more efficient than agglomerative methods in some cases and
also captures hierarchical relationships.
● Disadvantages: Less commonly used than agglomerative methods, and it can be
sensitive to how the splits are made.
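The agglomerative (bottom-up) approach described above can be sketched with SciPy's hierarchical clustering utilities; the toy points and the choice of single linkage are illustrative assumptions.

```python
# Agglomerative clustering: linkage() repeatedly merges the two closest
# clusters; fcluster() cuts the resulting dendrogram into flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [8.0, 8.0], [8.1, 7.9]])

Z = linkage(X, method="single")                   # merge history (nearest-neighbour distance)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```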
_______________________________________________________________
Market Basket Analysis, Frequent Item sets, Closed Item sets, and Association Rule,
Frequent Pattern Mining, Apriori Algorithm, Association Rule Generation, Improving the
Efficiency of Apriori, Mining Frequent Itemsets without candidate generation,
Introduction to Mining Multilevel Association Rules and Mining Multidimensional
Association Rules.
3. Frequent Itemsets:
● A Frequent Itemset is a set of items that appears together in at least a specified minimum number of transactions (the minimum support threshold), e.g., {"bread", "butter"} occurring in many market baskets.
4. Closed Itemsets:
● A Closed Itemset is a frequent itemset where no proper superset of the itemset has
the same support count. In other words, it’s an itemset that cannot be extended by
adding more items without decreasing its frequency.
Example:
○ If the itemset {"bread", "butter"} appears 50 times and adding "milk" makes it
appear fewer than 50 times, then {"bread", "butter"} is a closed itemset.
5. Association Rule:
● An Association Rule is an implication of the form X → Y between itemsets, e.g., {"bread", "butter"} → {"milk"}. Its strength is measured by support (how often X and Y occur together) and confidence (how often Y appears in transactions that contain X).
6. Frequent Pattern Mining:
● This is the process of discovering frequent itemsets from large datasets. The most well-known algorithm for frequent pattern mining is the Apriori Algorithm.
7. Apriori Algorithm:
● How it works:
1. It starts by finding all single items (1-itemsets) that meet the minimum
support threshold.
2. It then generates larger itemsets (2-itemsets, 3-itemsets, etc.) by combining
the frequent itemsets of the previous iteration and checking their support.
3. This process continues until no more frequent itemsets can be generated.
● Downward Closure Property: Apriori relies on the property that all non-empty
subsets of a frequent itemset must also be frequent. This allows the algorithm to
reduce the search space by eliminating infrequent itemsets early.
8. Association Rule Generation:
● After finding frequent itemsets using the Apriori Algorithm, association rules are generated by identifying subsets of the frequent itemsets that satisfy the minimum confidence threshold (a small end-to-end sketch follows below).
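The sketch below illustrates the Apriori idea and rule generation on a made-up transaction list; the support and confidence thresholds are arbitrary, and the subset-based candidate pruning step is omitted for brevity.

```python
# Level-wise frequent itemset mining in the spirit of Apriori, then rule generation.
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
min_support, min_conf = 0.4, 0.7
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: frequent 1-itemsets, then candidate 2-itemsets, and so on.
items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
all_frequent, k = set(frequent), 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = {c for c in candidates if support(c) >= min_support}
    all_frequent |= frequent
    k += 1

# Rule generation: split each frequent itemset into antecedent -> consequent.
for itemset in (s for s in all_frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support(itemset) / support(antecedent)
            if conf >= min_conf:
                print(set(antecedent), "->", set(itemset - antecedent),
                      f"(support={support(itemset):.2f}, confidence={conf:.2f})")
```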
_______________________________________________________________
Introduction, Web Content Mining: Crawlers, Harvest System, Virtual Web View,
Personalization, Web Structure Mining: Page Rank, Clever, Web Usage Mining.
Web Mining
Web Mining is the process of discovering useful and relevant patterns, knowledge, and
insights from the web by extracting data from web pages, web structure, and user
interactions. It is broadly divided into three categories: Web Content Mining, Web
Structure Mining, and Web Usage Mining.
1. Web Content Mining:
● Definition: Web content mining involves extracting useful information from the content of web pages. This content could include text, images, audio, video, and other multimedia formats.
● Techniques: It uses techniques from natural language processing (NLP),
information retrieval, and machine learning to process and mine useful data.
Key Concepts:
● Crawlers:
○ Crawlers (also known as spiders or bots) are automated programs that
systematically browse the web to collect data from websites. They start with a
set of seed URLs and follow the links to gather new web pages for mining or
indexing.
○ Example: Search engines like Google use crawlers to gather and index the vast amount of information on the web (a minimal crawler sketch appears after this list).
● Harvest System:
○ A Harvest System is a web resource discovery and indexing system that
combines web crawlers with data repositories to provide content-based
indexing of distributed web resources. It allows efficient retrieval of
large-scale web data.
● Virtual Web View:
○ This refers to the creation of a customized, virtual view of the web that is
tailored for specific users or user groups. It essentially "filters" content to
provide only the relevant information based on user preferences or needs.
● Personalization:
○ Personalization is the process of adapting the web content or services to the
needs of specific users or user groups. It can be based on previous browsing
behavior, user profiles, or direct input from users.
○ Example: Online stores like Amazon recommend products to users based on
their past purchases and browsing history.
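As a rough illustration of how a crawler follows links outward from seed URLs, here is a minimal sketch using the third-party requests and BeautifulSoup libraries; the seed URL is a placeholder, and real-world politeness rules (robots.txt, rate limiting) are omitted.

```python
# A minimal breadth-first web crawler sketch.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    queue, visited = deque([seed_url]), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                              # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        print("fetched:", url, "| title:", soup.title.string if soup.title else "")
        # Follow hyperlinks to discover new pages.
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return visited

# crawl("https://example.com")   # placeholder seed URL
```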
2. Web Structure Mining:
● Definition: Web structure mining focuses on the structure of hyperlinks within the
web, i.e., the connections between different web pages. It uses graph theory to model
the web as a directed graph where web pages are nodes, and hyperlinks are the edges
connecting the nodes.
Key Concepts:
● PageRank:
○ PageRank is an algorithm used by Google to rank web pages in search
results based on their importance. The rank of a web page depends on the
number and quality of links pointing to it.
○ The basic idea is that if a page has many high-quality links pointing to it, it is likely more important and should rank higher in search results (a minimal PageRank sketch appears after this list).
● Clever Algorithm:
○ The Clever Algorithm (also known as HITS – Hyperlink-Induced Topic
Search) is another link analysis algorithm that distinguishes between two
types of web pages:
■ Authorities: Pages that are highly relevant to a specific topic.
■ Hubs: Pages that link to multiple authority pages.
○ The Clever algorithm ranks pages based on their hub and authority scores,
where hubs link to many authorities, and authorities are linked by many hubs.
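The link-analysis idea behind PageRank can be sketched with a few lines of power iteration on a toy link graph; the graph itself and the damping factor value are illustrative assumptions.

```python
# Power-iteration sketch of PageRank on a small made-up link graph.
import numpy as np

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}  # page -> outgoing links
pages = sorted(links)
n, d = len(pages), 0.85                      # d is the usual damping factor
index = {p: i for i, p in enumerate(pages)}

# Column-stochastic matrix M: M[j, i] = 1/outdegree(i) if page i links to page j.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[index[dst], index[src]] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):                         # power iteration until ranks settle
    rank = (1 - d) / n + d * M @ rank
print(dict(zip(pages, rank.round(3))))
```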
3. Web Usage Mining:
● Definition: Web usage mining analyzes how users interact with websites, mainly by mining web server logs and clickstream data to discover navigation and usage patterns.
Key Concepts:
● Web Logs:
○ Web logs record user interactions with a website. They typically contain
information like the user’s IP address, pages visited, time spent on each page,
and the sequence of clicks.
○ Analysis of web logs can provide insights into user navigation patterns,
frequently visited pages, and traffic sources.
● Clickstream Analysis:
○ Clickstream analysis tracks the sequence of clicks a user makes as they
navigate through a website. This data is used to analyze user behavior and
preferences.
● Applications:
○ Personalization: Web usage mining can help personalize content for users
based on their past behavior.
○ Improving Website Design: Understanding user navigation paths can lead
to better website design, improving the user experience.
○ Recommendation Systems: Usage mining can be used to recommend
products, content, or services to users based on their previous activity.
Summary:
● Web Content Mining: Extracts data from web page content using crawlers and personalization systems.
● Web Structure Mining: Analyzes the structure of links on the web to rank pages
using algorithms like PageRank and Clever.
● Web Usage Mining: Analyzes user behavior on websites using logs and
clickstreams to improve user experience and personalization.