Datawarehouse and Data Mining Final Notes
Data Warehouse: A large storage system that collects, integrates, and organizes data from
multiple sources. It is used for reporting and analysis to help businesses make decisions.
Data Mining: The process of analyzing large datasets to discover patterns, trends, and useful
insights. It helps in predicting future outcomes and improving decision-making.
Note: memorize it
Steps to design a data warehouse:
Understand Business Needs: Identify what data needs to be stored and analyzed.
Identify Data Sources: Gather data from databases, applications, and other sources.
Design Data Model: Structure the data into tables (Fact & Dimension tables).
ETL Process (Extract, Transform, Load): Extract data from the sources, clean and transform it, and load it into the warehouse.
Choose Storage & Tools: Select a database system (e.g., Snowflake, Redshift).
Create Data Marts: Organize data for specific business functions (e.g., sales, finance).
Implement Reporting & Analysis: Use tools like Power BI or Tableau for insights.
Optimize & Maintain: Monitor performance and update as needed.
Roll-up: Summarizes data from a low level to a high level (e.g., roll up from sales per day to sales per month).
Drill-down: Go into details from high level to low level (e.g., drill down from sales per
month to sales per day).
Slice: Filters data by one dimension (e.g., sales for year 2024 only).
Dice: Filters data by multiple dimensions (e.g., sales for year=2024, Product=A, Region=
X).
Pivot (Rotation): Rearranges data to view it from a different perspective (e.g., switching
rows and columns in a sales report).
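A minimal sketch of these OLAP operations in Python, assuming pandas is available; the toy sales table and its column names are invented for illustration.

import pandas as pd

# Toy sales table (hypothetical columns: date, product, region, sales).
sales = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-15"]),
    "product": ["A", "B", "A", "B"],
    "region":  ["X", "X", "Y", "Y"],
    "sales":   [100, 150, 200, 120],
})

# Roll-up: summarize sales per day into sales per month.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["sales"].sum()

# Drill-down is the reverse: go back to the per-day rows of one month.
january_days = sales[(sales["date"].dt.year == 2024) & (sales["date"].dt.month == 1)]

# Slice: filter by one dimension (year 2024 only).
year_2024 = sales[sales["date"].dt.year == 2024]

# Dice: filter by multiple dimensions (year=2024, product=A, region=X).
diced = sales[(sales["date"].dt.year == 2024)
              & (sales["product"] == "A")
              & (sales["region"] == "X")]

# Pivot: rearrange rows and columns to view sales by product per region.
pivoted = sales.pivot_table(values="sales", index="region",
                            columns="product", aggfunc="sum")
print(monthly, pivoted, sep="\n")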
Data mining techniques:
o Classification: Categorizes data into groups (e.g., spam vs. non-spam emails).
o Clustering: Groups similar data points together (e.g., customer segmentation).
o Association Rule Mining: Finds relationships between data (e.g., "People who buy bread
often buy butter").
o Regression: Predicts values based on past data (e.g., future sales prediction).
o Anomaly Detection: Identifies unusual patterns (e.g., fraud detection in banking).
o Sequential Pattern Mining: Finds patterns over time (e.g., customers buying a phone
first, then accessories).
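A minimal classification sketch in Python, assuming scikit-learn is installed; the tiny spam/non-spam data set and its two features are invented for illustration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per email: [number of links, number of spam words].
X = [[5, 8], [7, 6], [0, 1], [1, 0], [6, 9], [0, 2]]
y = [1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = non-spam

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[4, 7], [0, 1]]))   # expected: spam, non-spam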
ETL (Extract, Transform, Load): ETL is a process where data is first extracted from sources,
transformed into a structured format, and then loaded into a data warehouse. Since
transformation happens before loading, it ensures clean, structured, and high-quality data. It is
used for structured data (e.g., data from Customer Relationship Management (CRM) systems).
ELT (Extract, Load, Transform): ELT first extracts data and loads it as-is into a storage
system (like a data lake), where transformation happens later when needed. This approach is
faster and works well with large volumes of raw data. It is used for big data and
unstructured data (e.g., logs, IoT, social media).
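A minimal ETL sketch in Python, assuming pandas is available; the CSV file, column names, and the SQLite warehouse target are all hypothetical.

import sqlite3
import pandas as pd

# Extract: read raw data from a source system (hypothetical CRM export).
raw = pd.read_csv("crm_customers.csv")

# Transform: clean and restructure before loading.
clean = (raw.drop_duplicates()
            .dropna(subset=["customer_id"])
            .assign(signup_date=lambda d: pd.to_datetime(d["signup_date"])))

# Load: write the structured result into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("dim_customer", conn, if_exists="replace", index=False)

# In ELT, the raw data would be loaded first and transformed later,
# inside the storage system, only when it is needed.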
Data Mart: A small, focused database with specific data for a particular department (e.g., sales
or marketing). It's organized and easy to analyze.
Data Lake: A large storage system that holds all types of raw data (structured and unstructured).
It's messy and unorganized but stores everything for future use.
OLAP (Online Analytical Processing)
Purpose: Used for analyzing data and decision-making (e.g., generating reports, trends).
When: When you need to analyze historical data or get insights from large datasets.
Where: Used in data warehouses to store and analyze data.
Operations: Includes actions like slicing, dicing, drilling down, and pivoting to explore
data from different angles.
OLTP (Online Transaction Processing)
Purpose: Used for real-time transactions and daily operations (e.g., placing orders,
processing payments).
When: When you need to process live transactions quickly and efficiently.
Where: Used in databases to manage current, operational data.
Operations: Includes tasks like inserting, updating, deleting, and selecting data for daily
business processes.
Star Schema
What it is: A central fact table linked to dimension tables (like time, product).
When to use: Use when you need simple and fast querying for reporting.
Example: Sales data linked to time, product, and location.
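A minimal star-schema sketch in Python, assuming pandas is available; the fact and dimension tables and their column names are invented for illustration.

import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["A", "B"]})
dim_time    = pd.DataFrame({"time_id": [10, 11], "month": ["2024-01", "2024-02"]})

# Central fact table holding the measures plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "time_id":    [10, 10, 11],
    "amount":     [100, 150, 200],
})

# Querying a star schema is a simple join from the fact table to each dimension.
report = (fact_sales.merge(dim_product, on="product_id")
                    .merge(dim_time, on="time_id"))
print(report.groupby(["month", "product_name"])["amount"].sum())

In a snowflake schema, dim_product would itself be split into further tables (e.g., category and supplier).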
Snowflake Schema
What it is: Similar to the star schema, but dimension tables are split into more detailed sub-tables.
When to use: Use when you need to save storage space and maintain data integrity.
Example: A product table broken into sub-tables for categories and suppliers.
Galaxy Schema (Fact Constellation)
What it is: Multiple fact tables share common dimension tables, forming a set of interlinked star schemas.
When to use: Use when you need to model several related business processes in one warehouse.
Example: Sales and shipping fact tables both linked to shared date and product dimensions.
Note: Memorize it
A multi-tier architecture in a data warehouse has layers through which data flows: a bottom tier (the warehouse database server, loaded from the data sources through ETL), a middle tier (the OLAP server), and a top tier (front-end query, reporting, and analysis tools).
Note: Understand and memorize the architecture diagrams. These diagrams apply to every scenario.
Data warehouse architectures:
Top-Down Architecture
What it is: The centralized data warehouse is built first, and then data marts are
created from it.
When to use: Use when you want a single, consistent source of truth for all
departments and need to integrate data across the organization.
Example: Building an enterprise-level warehouse first and then creating department-
specific data marts from it.
Bottom-Up Architecture
What it is: Data marts are built first, and then they are integrated into a centralized data
warehouse.
When to use: Use when you need a fast solution and can start with department-specific
data before integrating them into a bigger system.
Example: Building department-specific marts first (like sales or marketing) and then
combining them later into a centralized warehouse.
Federated Architecture
What it is: Data is stored in multiple, independent systems that remain separate but are
linked together to act as a unified system.
When to use: Use when you need to maintain existing systems and don’t want to
centralize everything into one data warehouse.
Example: Keeping separate databases for different departments but allowing them to
access data across systems when needed.
Support Vector Machine (SVM): A machine learning algorithm used for classification. It finds
the best line (or hyperplane) that separates different classes of data with the largest possible
margin.
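A minimal SVM sketch in Python, assuming scikit-learn is installed; the toy two-class points are invented for illustration.

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]   # two well-separated groups
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # linear kernel: find the separating hyperplane
clf.fit(X, y)
print(clf.predict([[2, 2], [9, 9]]))   # expected: [0 1]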
Convolutional Neural Network (CNN): A type of deep learning model used mainly for image
recognition. It uses layers to detect patterns in images (like edges or shapes) and helps in tasks
like classifying or detecting objects in pictures.
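A minimal CNN sketch in Python, assuming PyTorch is installed (any deep learning framework would do); the layer sizes and the 28x28 grayscale input are chosen only for illustration.

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # detect local patterns (edges, shapes)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                 # map the features to 10 class scores
)

dummy_images = torch.randn(4, 1, 28, 28)         # a batch of 4 fake grayscale images
print(cnn(dummy_images).shape)                   # torch.Size([4, 10])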
Confusion Matrix: A confusion matrix summarizes how well a model classifies data by showing the number of correct and incorrect predictions for each class.
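A minimal confusion-matrix sketch in Python, assuming scikit-learn is installed; the true and predicted labels are made up for illustration.

from sklearn.metrics import confusion_matrix

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

# Rows = actual class, columns = predicted class.
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
# [[2 1]
#  [1 2]]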
Diffusion: The spreading of information or patterns discovered from data to other parts of a system, to people, or across networks. It's like how a trend spreads through a group of people.
Types of Clustering:
K-Means Clustering:
o The data is divided into K groups (clusters) based on similarity.
o It starts by picking K random centers (centroids) and assigns each data point to
the nearest center.
o Then, the centers are updated, and the process repeats until the groups don’t
change much.
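A minimal K-Means sketch in Python, assuming scikit-learn is installed; the 2-D points and K=2 are invented for illustration.

from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the two final centroids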
Frequent Pattern: A frequent pattern is a pattern or set of items that appear together in a dataset
more often than a specified threshold (support). In data mining, finding frequent patterns helps to
identify relationships, trends, and associations between data items.
Market Basket Analysis: Market Basket Analysis is a data mining technique used to discover
associations between items that frequently co-occur in transactions. It is often used in retail to
understand customer purchasing behavior.
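A minimal market-basket sketch in plain Python showing how support and confidence are computed; the transactions are invented for illustration.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pattern: {bread, butter} appears in 3 of 5 transactions.
print(support({"bread", "butter"}))                        # 0.6

# Rule "bread -> butter": confidence = support({bread, butter}) / support({bread}).
print(support({"bread", "butter"}) / support({"bread"}))   # 0.75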
Note: memorize it
Data preprocessing:
1. Data Cleaning: Removing errors or inconsistencies in the data, like missing values or
duplicates.
2. Data Integration: Combining data from different sources into a unified format.
3. Data Reduction: Reducing the size of the data by selecting important features or
aggregating data.
4. Data Transformation: Converting data into a suitable format or structure for analysis
(e.g., normalization or scaling).
5. Data Discretization: Converting continuous data into discrete categories or bins for
easier analysis.
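A minimal preprocessing sketch in Python, assuming pandas is available, showing cleaning, transformation, and discretization on an invented table (integration would be a merge of sources, and reduction a selection of columns).

import pandas as pd

df = pd.DataFrame({"age": [25, 25, None, 40, 95],
                   "income": [30000, 30000, 42000, 58000, 120000]})

# 1. Cleaning: drop duplicate rows and fill the missing age with the median.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# 4. Transformation: min-max normalize income into the range [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / \
                    (df["income"].max() - df["income"].min())

# 5. Discretization: bin the continuous ages into categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "middle", "senior"])
print(df)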
Note: memorize it
Data Quality: How fit the data is for its intended use. Key dimensions include accuracy, completeness, consistency, timeliness, believability, and interpretability.
Note: memorize it
Knowledge Presentation: Present valuable insights using charts or reports for easier
understanding.