
Data Warehouse and Data Mining Notes

Data Warehouse: A large storage system that collects, integrates, and organizes data from
multiple sources. It is used for reporting and analysis to help businesses make decisions.
Data Mining: The process of analyzing large datasets to discover patterns, trends, and useful
insights. It helps in predicting future outcomes and improving decision-making.

Note: memorize it
Steps to design a data warehouse:
 Understand Business Needs: Identify what data needs to be stored and analyzed.
 Identify Data Sources: Gather data from databases, applications, and other sources.
 Design Data Model: Structure the data into tables (Fact & Dimension tables).
 ETL Process (Extract, Transform, Load): Clean, transform, and load data into the
warehouse.
 Choose Storage & Tools: Select a database system (e.g., Snowflake, Redshift).
 Create Data Marts: Organize data for specific business functions (e.g., sales, finance).
 Implement Reporting & Analysis: Use tools like Power BI or Tableau for insights.
 Optimize & Maintain: Monitor performance and update as needed.

OLAP Operations (Online Analytical Processing):

 Roll-up: Summarizes data from a low level to a high level (e.g., rolling up from sales per day to
sales per month).
 Drill-down: Go into details from high level to low level (e.g., drill down from sales per
month to sales per day).
 Slice: Filters data by one dimension (e.g., sales for year 2024 only).
 Dice: Filters data by multiple dimensions (e.g., sales for year=2024, Product=A, Region=X).
 Pivot (Rotation): Rearranges data to view it from a different perspective (e.g., switching
rows and columns in a sales report).
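
A minimal sketch of these operations using pandas on a made-up sales table; the column names (year, month, product, region, amount) and values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sales data; columns and values are made up for illustration.
sales = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2023],
    "month":   ["Jan", "Jan", "Feb", "Dec"],
    "product": ["A", "B", "A", "A"],
    "region":  ["X", "X", "Y", "X"],
    "amount":  [100, 150, 200, 120],
})

# Roll-up: summarize from (year, month) up to the year level.
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: go back to the finer (year, month) granularity.
drilldown = sales.groupby(["year", "month"])["amount"].sum()

# Slice: filter on a single dimension (year = 2024).
slice_2024 = sales[sales["year"] == 2024]

# Dice: filter on multiple dimensions (year, product, region).
dice = sales[(sales["year"] == 2024) & (sales["product"] == "A") & (sales["region"] == "X")]

# Pivot: rotate the view, products as rows and regions as columns.
pivot = sales.pivot_table(index="product", columns="region", values="amount", aggfunc="sum")
```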

Data Mining Techniques

o Classification: Categorizes data into groups (e.g., spam vs. non-spam emails).
o Clustering: Groups similar data points together (e.g., customer segmentation).

o Association Rule Mining: Finds relationships between data (e.g., "People who buy bread
often buy butter").
o Regression: Predicts values based on past data (e.g., future sales prediction).
o Anomaly Detection: Identifies unusual patterns (e.g., fraud detection in banking).
o Sequential Pattern Mining: Finds patterns over time (e.g., customers buying a phone
first, then accessories).
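
A small sketch of two of these techniques (classification and regression) with scikit-learn; the toy feature values and labels are assumptions, not a real dataset:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: made-up feature vectors (e.g., word counts) labelled spam (1) or not (0).
X_cls = [[3, 0], [0, 2], [4, 1], [0, 3]]
y_cls = [1, 0, 1, 0]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[2, 0]]))               # predicted class for a new email

# Regression: predict a numeric value (e.g., future sales) from past data.
X_reg = [[1], [2], [3], [4]]               # e.g., month number
y_reg = [100, 120, 140, 160]               # e.g., sales
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))                  # forecast for month 5
```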

ETL (Extract, Transform, Load): ETL is a process where data is first extracted from sources,
transformed into a structured format, and then loaded into a data warehouse. Since
transformation happens before loading, it ensures clean, structured, and high-quality data. It is
used for structured data (e.g. Customer Relationship Management (CRM)).

ELT (Extract, Load, Transform): ELT first extracts data and loads it as-is into a storage
system (like a data lake), where transformation happens later when needed. This approach is
faster and works well with large, unstructured, and raw data. It is used for Big data and
unstructured data (e.g., logs, IoT, social media).
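
A minimal sketch contrasting the two orderings in Python with pandas; the file names, paths, and the cleaning step are hypothetical placeholders:

```python
import pandas as pd

def extract():
    # Hypothetical CRM export; the file name is a placeholder.
    return pd.read_csv("crm_customers.csv")

def transform(df):
    # Example cleaning: drop duplicate rows and lower-case the column names.
    return df.drop_duplicates().rename(columns=str.lower)

def load(df, path):
    # Stand-in for loading into a warehouse or lake table.
    df.to_csv(path, index=False)

# ETL: extract -> transform -> load, so only clean data reaches the warehouse.
load(transform(extract()), "warehouse_customers.csv")

# ELT: extract -> load the raw data as-is (data lake), transform later when needed.
load(extract(), "lake_customers_raw.csv")
clean = transform(pd.read_csv("lake_customers_raw.csv"))
```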

Data Mart: A small, focused database with specific data for a particular department (e.g., sales
or marketing). It's organized and easy to analyze.

Data Lake: A large storage system that holds all types of raw data (structured and unstructured).
It's messy and unorganized but stores everything for future use.

OLAP (Online Analytical Processing)

 Purpose: Used for analyzing data and decision-making (e.g., generating reports, trends).
 When: When you need to analyze historical data or get insights from large datasets.
 Where: Used in data warehouses to store and analyze data.
 Operations: Includes actions like slicing, dicing, drilling down, and pivoting to explore
data from different angles.

OLTP (Online Transaction Processing)

 Purpose: Used for real-time transactions and daily operations (e.g., placing orders,
processing payments).
 When: When you need to process live transactions quickly and efficiently.
 Where: Used in databases to manage current, operational data.

 Operations: Includes tasks like inserting, updating, deleting, and selecting data for daily
business processes.

Star Schema

 What it is: A central fact table linked to dimension tables (like time, product).
 When to use: Use when you need simple and fast querying for reporting.
 Example: Sales data linked to time, product, and location.
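
A small sketch of a typical star-schema query using pandas joins; the table and column names (fact_sales, dim_product, dim_time) are assumptions for illustration:

```python
import pandas as pd

# Dimension tables (hypothetical contents).
dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["A", "B"]})
dim_time = pd.DataFrame({"time_id": [10, 11], "year": [2023, 2024]})

# Central fact table referencing the dimensions by key.
fact_sales = pd.DataFrame({
    "time_id":    [10, 11, 11],
    "product_id": [1, 1, 2],
    "amount":     [100, 150, 200],
})

# Star-schema query: join the fact table to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id")
          .groupby(["year", "product_name"])["amount"].sum())
print(report)
```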

Snowflake Schema

 What it is: Similar to a star schema, but dimension tables are split into more detailed sub-tables.
 When to use: Use when you need to save storage space and maintain data integrity.
 Example: A product table broken into sub-tables for categories and suppliers.

Galaxy Schema (Fact Constellation)

 What it is: Multiple fact tables share common dimension tables.
 When to use: Use when you need to handle multiple facts with shared dimensions (e.g.,
sales and inventory data).
 Example: Sales and inventory fact tables both linked to time and product dimensions.

Note: Memorize it

A multi-tier architecture in a data warehouse has layers through which data flows:

1. Data Source Layer: Raw data from different systems is collected.
2. Data Staging Layer: Data is cleaned and prepared for storage.
3. Data Warehouse Layer: Cleaned data is stored and organized for analysis.
4. Presentation Layer: Users access and visualize the data using BI tools (e.g., Tableau,
Power BI).

Note: Understand and memorize the architecture diagrams. These diagrams apply to every
scenario.

Data Warehouse Architectures:

Top-Down Architecture

 What it is: The centralized data warehouse is built first, and then data marts are
created from it.
 When to use: Use when you want a single, consistent source of truth for all
departments and need to integrate data across the organization.
 Example: Building an enterprise-level warehouse first and then creating department-
specific data marts from it.

Bottom-Up Architecture

 What it is: Data marts are built first, and then they are integrated into a centralized data
warehouse.
 When to use: Use when you need a fast solution and can start with department-specific
data before integrating them into a bigger system.
 Example: Building department-specific marts first (like sales or marketing) and then
combining them later into a centralized warehouse.

Federated Architecture

 What it is: Data is stored in multiple, independent systems that remain separate but are
linked together to act as a unified system.
 When to use: Use when you need to maintain existing systems and don’t want to
centralize everything into one data warehouse.
 Example: Keeping separate databases for different departments but allowing them to
access data across systems when needed.

 Chi-square test: A statistical test used to determine if there is a significant difference
between observed and expected frequencies in categorical data.
 Hypothesis: A testable statement or assumption about a population. It’s divided into:
o Null hypothesis (H₀): Assumes no effect or no difference.
o Alternative hypothesis (H₁): Assumes there is an effect or difference.
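
A minimal sketch of a chi-square test of independence with SciPy; the contingency table values are made up:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table of observed counts (e.g., group vs. preference).
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)

# H0: the two variables are independent (no real difference between observed and expected).
# If p_value < 0.05, reject H0 in favour of H1 (there is a significant association).
print(chi2, p_value)
```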

Support Vector Machine (SVM): A machine learning algorithm used for classification. It finds
the best line (or hyperplane) that separates different classes of data with the largest possible
margin.
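
A small sketch of a linear SVM with scikit-learn on made-up 2-D points:

```python
from sklearn import svm

# Toy 2-D points from two classes; the values are made up for illustration.
X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

# A linear SVM finds the separating hyperplane with the largest margin.
clf = svm.SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[3, 3], [7, 7]]))
```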

Convolutional Neural Network (CNN): A type of deep learning model used mainly for image
recognition. It uses layers to detect patterns in images (like edges or shapes) and helps in tasks
like classifying or detecting objects in pictures.
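
A minimal CNN sketch in PyTorch, assuming 28x28 grayscale inputs and 10 output classes (these sizes are assumptions, not from the notes):

```python
import torch
import torch.nn as nn

# A tiny CNN: one convolution layer to detect local patterns, then a classifier.
class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learn local patterns (edges, shapes)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
        )
        self.classifier = nn.Linear(16 * 14 * 14, 10)    # map features to 10 class scores

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallCNN()
dummy = torch.randn(1, 1, 28, 28)   # one fake grayscale image
print(model(dummy).shape)           # torch.Size([1, 10])
```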

Confusion Matrix: A confusion matrix summarizes how well a model classifies data by showing
the number of correct and incorrect predictions for each class.
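
A short sketch with scikit-learn; the true labels and predictions are made up:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels vs. model predictions for a binary (spam) classifier.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
# [[2 1]   <- actual 0: 2 correct, 1 wrongly predicted as 1
#  [1 2]]  <- actual 1: 1 missed, 2 correct
```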

Diffusion: Diffusion is the spreading of information or patterns discovered from data to other
parts of a system, people, or networks. It’s like how a trend spreads through a group of people.

Types of Clustering:

 K-Means Clustering:
o The data is divided into K groups (clusters) based on similarity.
o It starts by picking K random centers (centroids) and assigns each data point to
the nearest center.
o Then, the centers are updated, and the process repeats until the groups don’t
change much.

 Hierarchical Clustering: Builds a tree-like structure (dendrogram) of clusters.
o Agglomerative (bottom-up): Starts with each data point as its own cluster and
merges the closest clusters.
o Divisive (top-down): Starts with all data in one cluster and splits it into smaller
clusters.

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
o Groups data based on density.
o It finds dense regions of data and creates clusters, while points in less dense areas
are labeled as noise or outliers.

 Gaussian Mixture Models (GMM):
o Assumes data comes from multiple Gaussian distributions (bell-shaped curves).
o It tries to fit the data into several overlapping clusters, where each cluster is
represented by a Gaussian distribution.
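
A minimal sketch of K-Means, DBSCAN, and GMM with scikit-learn on made-up 2-D points (the data and parameter values such as eps=3 are assumptions):

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Toy 2-D points forming two loose groups (values are made up).
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# K-Means: pick K = 2 centroids and assign each point to the nearest one.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)

# DBSCAN: density-based; points in sparse regions are labelled -1 (noise).
dbscan = DBSCAN(eps=3, min_samples=2).fit(X)
print(dbscan.labels_)

# GMM: fit two overlapping Gaussian components and assign each point to one.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))
```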

Frequent Pattern: A frequent pattern is a pattern or set of items that appear together in a dataset
more often than a specified threshold (support). In data mining, finding frequent patterns helps to
identify relationships, trends, and associations between data items.

Market Basket Analysis: Market Basket Analysis is a data mining technique used to discover
associations between items that frequently co-occur in transactions. It is often used in retail to
understand customer purchasing behavior.
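
A small sketch of frequent-itemset mining and association rules, assuming the third-party mlxtend library is available; the baskets and thresholds are made up:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (True = item present in the basket); data is made up.
baskets = pd.DataFrame({
    "bread":  [True, True, True, False],
    "butter": [True, True, False, False],
    "milk":   [False, True, True, True],
})

# Frequent itemsets with support >= 50%, then rules such as {bread} -> {butter}.
itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```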

Note: memorize it

Data preprocessing:

1. Data Cleaning: Removing errors or inconsistencies in the data, like missing values or
duplicates.
2. Data Integration: Combining data from different sources into a unified format.
3. Data Reduction: Reducing the size of the data by selecting important features or
aggregating data.
4. Data Transformation: Converting data into a suitable format or structure for analysis
(e.g., normalization or scaling).
5. Data Discretization: Converting continuous data into discrete categories or bins for
easier analysis.
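
A minimal sketch of cleaning, transformation, and discretization with pandas and scikit-learn (integration and reduction are omitted for brevity; the data is made up):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data containing a missing value and a duplicate row.
df = pd.DataFrame({"age": [25, 32, 32, None, 45],
                   "income": [30000, 52000, 52000, 41000, 80000]})

# 1. Cleaning: remove duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# 4. Transformation: scale numeric columns to the range [0, 1].
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# 5. Discretization: bin the scaled ages into three categories.
df["age_group"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])
print(df)
```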

Note: memorize it

Data Quality:

 Accuracy: Data must be correct and free from errors.
 Completeness: All necessary data should be present.
 Consistency: Data should be consistent across different sources.
 Timeliness: Data must be up-to-date and relevant.
 Believability: Data should be trustworthy and credible.
 Interpretability: Data should be easy to understand and interpret.

Note: memorize it

Knowledge discovery from data:

 Data Selection: Choose the relevant data for your analysis.
 Data Cleaning: Remove errors and fix inconsistencies in the data.
 Data Integration: Combine data from different sources into one dataset.
 Data Transformation: Modify the data into a suitable format for analysis.
 Data Mining: Apply algorithms to find patterns and insights in the data.
 Pattern Evaluation: Evaluate the discovered patterns to find the most useful ones.

 Knowledge Presentation: Present valuable insights using charts or reports for easier
understanding.
