Datawarehouse and Data Mining Final Notes
Data Warehouse: A large storage system that collects, integrates, and organizes data from
multiple sources. It is used for reporting and analysis to help businesses make decisions.
Data Mining: The process of analyzing large datasets to discover patterns, trends, and useful
insights. It helps in predicting future outcomes and improving decision-making.
Note: memorize it
Steps to design a data warehouse:
Understand Business Needs: Identify what data needs to be stored and analyzed.
Identify Data Sources: Gather data from databases, applications, and other sources.
Design Data Model: Structure the data into tables (Fact & Dimension tables).
ETL Process (Extract, Transform, Load): Extract data from the sources, clean and transform it, and load it into the warehouse.
Choose Storage & Tools: Select a database system (e.g., Snowflake, Redshift).
Create Data Marts: Organize data for specific business functions (e.g., sales, finance).
Implement Reporting & Analysis: Use tools like Power BI or Tableau for insights.
Optimize & Maintain: Monitor performance and update as needed.
Roll-up: Summarizes data from a low level to a high level (e.g., roll up from sales per day to sales per month).
Drill-down: Go into details from high level to low level (e.g., drill down from sales per
month to sales per day).
Slice: Filters data by one dimension (e.g., sales for year 2024 only).
Dice: Filters data by multiple dimensions (e.g., sales for year=2024, Product=A, Region=
X).
Pivot (Rotation): Rearranges data to view it from a different perspective (e.g., switching
rows and columns in a sales report).
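A minimal sketch of these OLAP operations in Python, assuming pandas is available; the toy sales table and its column names are invented for illustration.

import pandas as pd

# Toy sales table (hypothetical columns: date, product, region, sales).
sales = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-15"]),
    "product": ["A", "B", "A", "B"],
    "region":  ["X", "X", "Y", "Y"],
    "sales":   [100, 150, 200, 120],
})

# Roll-up: summarize sales per day into sales per month.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["sales"].sum()

# Drill-down is the reverse: go back to the per-day rows of one month.
january_days = sales[(sales["date"].dt.year == 2024) & (sales["date"].dt.month == 1)]

# Slice: filter by one dimension (year 2024 only).
year_2024 = sales[sales["date"].dt.year == 2024]

# Dice: filter by multiple dimensions (year=2024, product=A, region=X).
diced = sales[(sales["date"].dt.year == 2024)
              & (sales["product"] == "A")
              & (sales["region"] == "X")]

# Pivot: rearrange rows and columns to view sales by product per region.
pivoted = sales.pivot_table(values="sales", index="region",
                            columns="product", aggfunc="sum")
print(monthly, pivoted, sep="\n")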
Data mining techniques:
o Classification: Categorizes data into groups (e.g., spam vs. non-spam emails).
o Clustering: Groups similar data points together (e.g., customer segmentation).
o Association Rule Mining: Finds relationships between data (e.g., "People who buy bread
often buy butter").
o Regression: Predicts values based on past data (e.g., future sales prediction).
o Anomaly Detection: Identifies unusual patterns (e.g., fraud detection in banking).
o Sequential Pattern Mining: Finds patterns over time (e.g., customers buying a phone
first, then accessories).
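A minimal classification sketch in Python, assuming scikit-learn is installed; the tiny spam/non-spam data set and its two features are invented for illustration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per email: [number of links, number of spam words].
X = [[5, 8], [7, 6], [0, 1], [1, 0], [6, 9], [0, 2]]
y = [1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = non-spam

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[4, 7], [0, 1]]))   # expected: spam, non-spam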
ETL (Extract, Transform, Load): ETL is a process where data is first extracted from sources,
transformed into a structured format, and then loaded into a data warehouse. Since
transformation happens before loading, it ensures clean, structured, and high-quality data. It is
used for structured data (e.g., data from Customer Relationship Management (CRM) systems).
ELT (Extract, Load, Transform): ELT first extracts data and loads it as-is into a storage
system (like a data lake), where transformation happens later when needed. This approach is
faster and works well with large volumes of raw data. It is used for big data and
unstructured data (e.g., logs, IoT, social media).
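A minimal ETL sketch in Python, assuming pandas is available; the CSV file, column names, and the SQLite warehouse target are all hypothetical.

import sqlite3
import pandas as pd

# Extract: read raw data from a source system (hypothetical CRM export).
raw = pd.read_csv("crm_customers.csv")

# Transform: clean and restructure before loading.
clean = (raw.drop_duplicates()
            .dropna(subset=["customer_id"])
            .assign(signup_date=lambda d: pd.to_datetime(d["signup_date"])))

# Load: write the structured result into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("dim_customer", conn, if_exists="replace", index=False)

# In ELT, the raw data would be loaded first and transformed later,
# inside the storage system, only when it is needed.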
Data Mart: A small, focused database with specific data for a particular department (e.g., sales
or marketing). It's organized and easy to analyze.
Data Lake: A large storage system that holds all types of raw data (structured and unstructured).
It's messy and unorganized but stores everything for future use.
OLAP (Online Analytical Processing)
Purpose: Used for analyzing data and decision-making (e.g., generating reports, trends).
When: When you need to analyze historical data or get insights from large datasets.
Where: Used in data warehouses to store and analyze data.
Operations: Includes actions like slicing, dicing, drilling down, and pivoting to explore
data from different angles.
OLTP (Online Transaction Processing)
Purpose: Used for real-time transactions and daily operations (e.g., placing orders,
processing payments).
When: When you need to process live transactions quickly and efficiently.
Where: Used in databases to manage current, operational data.
Operations: Includes tasks like inserting, updating, deleting, and selecting data for daily
business processes.
Star Schema
What it is: A central fact table linked to dimension tables (like time, product).
When to use: Use when you need simple and fast querying for reporting.
Example: Sales data linked to time, product, and location.
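A minimal star-schema sketch in Python, assuming pandas is available; the fact and dimension tables and their column names are invented for illustration.

import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["A", "B"]})
dim_time    = pd.DataFrame({"time_id": [10, 11], "month": ["2024-01", "2024-02"]})

# Central fact table holding the measures plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "time_id":    [10, 10, 11],
    "amount":     [100, 150, 200],
})

# Querying a star schema is a simple join from the fact table to each dimension.
report = (fact_sales.merge(dim_product, on="product_id")
                    .merge(dim_time, on="time_id"))
print(report.groupby(["month", "product_name"])["amount"].sum())

In a snowflake schema, dim_product would itself be split into further tables (e.g., category and supplier).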
Snowflake Schema
What it is: Similar to the star schema, but dimension tables are split into more detailed sub-tables.
When to use: Use when you need to save storage space and maintain data integrity.
Example: A product table broken into sub-tables for categories and suppliers.
Galaxy Schema (Fact Constellation)
What it is: Multiple fact tables share common dimension tables, forming a set of interlinked star schemas.
When to use: Use when you need to model several related business processes in one warehouse.
Example: Sales and shipping fact tables both linked to shared date and product dimensions.
Note: Memorize it
A multi-tier architecture in a data warehouse has layers through which data flows: a bottom tier (the warehouse database server, loaded from the data sources through ETL), a middle tier (the OLAP server), and a top tier (front-end query, reporting, and analysis tools).
Note: Understand and memorize the architecture diagrams. These diagrams apply to every scenario.
Data warehouse architectures:
Top-Down Architecture
What it is: The centralized data warehouse is built first, and then data marts are
created from it.
When to use: Use when you want a single, consistent source of truth for all
departments and need to integrate data across the organization.
Example: Building an enterprise-level warehouse first and then creating department-
specific data marts from it.
Bottom-Up Architecture
What it is: Data marts are built first, and then they are integrated into a centralized data
warehouse.
When to use: Use when you need a fast solution and can start with department-specific
data before integrating them into a bigger system.
Example: Building department-specific marts first (like sales or marketing) and then
combining them later into a centralized warehouse.
Federated Architecture
What it is: Data is stored in multiple, independent systems that remain separate but are
linked together to act as a unified system.
When to use: Use when you need to maintain existing systems and don’t want to
centralize everything into one data warehouse.
Example: Keeping separate databases for different departments but allowing them to
access data across systems when needed.
Support Vector Machine (SVM): A machine learning algorithm used for classification. It finds
the best line (or hyperplane) that separates different classes of data with the largest possible
margin.
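A minimal SVM sketch in Python, assuming scikit-learn is installed; the toy two-class points are invented for illustration.

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]   # two well-separated groups
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # linear kernel: find the separating hyperplane
clf.fit(X, y)
print(clf.predict([[2, 2], [9, 9]]))   # expected: [0 1]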
Convolutional Neural Network (CNN): A type of deep learning model used mainly for image
recognition. It uses layers to detect patterns in images (like edges or shapes) and helps in tasks
like classifying or detecting objects in pictures.
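A minimal CNN sketch in Python, assuming PyTorch is installed (any deep learning framework would do); the layer sizes and the 28x28 grayscale input are chosen only for illustration.

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # detect local patterns (edges, shapes)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                 # map the features to 10 class scores
)

dummy_images = torch.randn(4, 1, 28, 28)         # a batch of 4 fake grayscale images
print(cnn(dummy_images).shape)                   # torch.Size([4, 10])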
Confusion Matrix: A confusion matrix summarizes how well a model classifies data by showing the number of correct and incorrect predictions for each class.
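A minimal confusion-matrix sketch in Python, assuming scikit-learn is installed; the true and predicted labels are made up for illustration.

from sklearn.metrics import confusion_matrix

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

# Rows = actual class, columns = predicted class.
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
# [[2 1]
#  [1 2]]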
Diffusion: The spreading of information or patterns discovered from data to other parts of a system, to people, or across networks. It's like how a trend spreads through a group of people.
Types of Clustering:
K-Means Clustering:
o The data is divided into K groups (clusters) based on similarity.
o It starts by picking K random centers (centroids) and assigns each data point to
the nearest center.
o Then, the centers are updated, and the process repeats until the groups don’t
change much.
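A minimal K-Means sketch in Python, assuming scikit-learn is installed; the 2-D points and K=2 are invented for illustration.

from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the two final centroids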
Frequent Pattern: A frequent pattern is a pattern or set of items that appear together in a dataset
more often than a specified threshold (support). In data mining, finding frequent patterns helps to
identify relationships, trends, and associations between data items.
Market Basket Analysis: Market Basket Analysis is a data mining technique used to discover
associations between items that frequently co-occur in transactions. It is often used in retail to
understand customer purchasing behavior.
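A minimal market-basket sketch in plain Python showing how support and confidence are computed; the transactions are invented for illustration.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pattern: {bread, butter} appears in 3 of 5 transactions.
print(support({"bread", "butter"}))                        # 0.6

# Rule "bread -> butter": confidence = support({bread, butter}) / support({bread}).
print(support({"bread", "butter"}) / support({"bread"}))   # 0.75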
Note: memorize it
Data preprocessing:
1. Data Cleaning: Removing errors or inconsistencies in the data, like missing values or
duplicates.
2. Data Integration: Combining data from different sources into a unified format.
3. Data Reduction: Reducing the size of the data by selecting important features or
aggregating data.
4. Data Transformation: Converting data into a suitable format or structure for analysis
(e.g., normalization or scaling).
5. Data Discretization: Converting continuous data into discrete categories or bins for
easier analysis.
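A minimal preprocessing sketch in Python, assuming pandas is available, showing cleaning, transformation, and discretization on an invented table (integration would be a merge of sources, and reduction a selection of columns).

import pandas as pd

df = pd.DataFrame({"age": [25, 25, None, 40, 95],
                   "income": [30000, 30000, 42000, 58000, 120000]})

# 1. Cleaning: drop duplicate rows and fill the missing age with the median.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# 4. Transformation: min-max normalize income into the range [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / \
                    (df["income"].max() - df["income"].min())

# 5. Discretization: bin the continuous ages into categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "middle", "senior"])
print(df)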
Note: memorize it
Data Quality: How fit the data is for its intended use. Key dimensions include accuracy, completeness, consistency, timeliness, believability, and interpretability.
Note: memorize it
Knowledge Presentation: Present valuable insights using charts or reports for easier
understanding.