
Data Warehouse and Data Mining Notes

MODULE 1: Overview and Concepts of Data Warehousing


1.1 Introduction to Data Warehousing
A Data Warehouse (DW) is a repository of integrated information, collected from multiple sources,
stored under a unified schema, and used for analysis and decision making. It is a core component of
Business Intelligence (BI) systems.
Unlike transactional databases, which are optimized for insert, update, and delete operations and
support real-time transactional processing, a DW is optimized for read access and complex queries.
The data warehouse is typically updated in batches, not in real-time.
The fundamental objective of a DW is to support strategic decision-making processes by providing a
consolidated view of the enterprise's data over time.
1.2 Importance and Need for Data Warehousing
Modern organizations deal with large volumes of data spread across various departments, formats,
and systems. This fragmented nature of data leads to:
• Inconsistent information
• Delayed access to vital data
• Difficulty in performing historical analysis
A data warehouse addresses these issues by:
• Integrating data from multiple heterogeneous sources
• Providing a historical context for business data
• Ensuring data consistency and reliability
• Enabling efficient querying and reporting
1.3 Definition by Bill Inmon
Bill Inmon, widely recognized as the "Father of Data Warehousing," defines a DW as:
"A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of
management's decision-making process."
Each component of this definition is significant:
• Subject-Oriented: Data is organized around key subjects (e.g., customer, product) rather than
applications.
• Integrated: Combines data from different sources into a cohesive, unified format.
• Time-Variant: Data is identified with a particular time period, enabling historical analysis.
• Non-Volatile: Once data is entered into the warehouse, it is not updated or deleted.
1.4 Evolution of Data Warehousing
• 1980s: Organizations began storing historical data for reporting purposes, mainly in flat files
or simple databases.
• 1990s: The concept of DW matured with dedicated architectures, ETL processes, and OLAP
tools.
• 2000s: Emergence of web-based data warehouses and integration with BI tools.
• 2010s: Cloud-based data warehousing solutions emerged (e.g., Amazon Redshift, Google
BigQuery).
• 2020s: Real-time data warehousing, integration with machine learning, and big data
platforms like Hadoop and Spark.
1.5 Data Warehousing and Business Intelligence (BI)
Business Intelligence (BI) refers to the set of techniques and tools for transforming raw data into
meaningful information to support better decision-making.
DW provides the backbone for BI by offering:
• Clean, consistent data
• Historical and trend analysis
• Fast and reliable data retrieval for reporting tools
BI tools such as Tableau, Power BI, and Qlik use DW as their data source.
1.6 Building Blocks of a Data Warehouse
1. Operational Data Sources: ERP, CRM, legacy systems, external data
2. ETL Process: Extract, Transform, and Load the data into the staging area and DW
3. Staging Area: Temporary storage for data cleansing and transformation
4. Data Storage Area: Central repository using relational or multidimensional models
5. Metadata Repository: Information about data lineage, source, transformations
6. Data Access Tools: OLAP, ad hoc query tools, dashboards
7. End Users: Analysts, managers, executives
MODULE 2: Defining Features and Architectural Types
2.1 Defining Features
1. Subject-Oriented: Data organized by business entities, not processes. For example, sales,
customers, inventory.
2. Integrated: Data from various sources are standardized (e.g., consistent formats, units,
naming conventions).
3. Time-Variant: Contains historical data for trend analysis (e.g., sales by quarter over the last 5
years).
4. Non-Volatile: Once stored, data is read-only. Changes are recorded as new entries.
5. Granularity: Refers to the level of detail in the warehouse. Higher granularity = finer detail
(e.g., daily vs. monthly sales).
2.2 Data Warehouses vs Data Marts
• Data Warehouse: Centralized repository for the entire organization. Large in size and scope.
• Data Mart: A smaller subset focused on a single department or business function (e.g.,
marketing).
• Dependent Data Mart: Draws data from the central DW.
• Independent Data Mart: Built directly from operational systems.
2.3 Types of Data Warehouse Architecture
1. Centralized Architecture:
o All data is stored in a single, centralized repository.
o Simplifies management but may become a bottleneck.
2. Independent Data Marts:
o Separate warehouses for different departments.
o Easier to implement but leads to data silos and inconsistency.
3. Hub-and-Spoke Architecture:
o Central DW (hub) with connected departmental data marts (spokes).
o Balances integration with flexibility.
4. Federated Architecture:
o Combines multiple existing systems into a virtual warehouse using middleware.
o Low implementation cost but less control over consistency and performance.

MODULE 3: Data Staging, Storage, Delivery, Metadata, and Requirements
3.1 Data Staging (ETL)
Data staging is a key step in the ETL (Extract, Transform, Load) process. It involves temporary
storage and preprocessing of data before it is loaded into the DW.
• Extraction: Retrieving data from various operational sources (databases, flat files, web
services).
• Transformation: Cleaning and converting data to fit the warehouse schema. This may
include:
o Removing duplicates
o Resolving data inconsistencies
o Converting data types and units
• Loading: Writing transformed data into the data warehouse.
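To make the three steps concrete, here is a minimal ETL sketch in Python with pandas;
the file name (sales_export.csv), column names, warehouse file, and table name are
hypothetical stand-ins, and production pipelines would typically use a dedicated ETL tool.

```python
# A minimal ETL sketch with pandas and SQLite. The file name, column names,
# and table name below are hypothetical stand-ins, not from the notes.
import sqlite3

import pandas as pd

# Extract: pull data from an operational source (here, a CSV export)
df = pd.read_csv("sales_export.csv")

# Transform: deduplicate, standardize naming, and fix data types
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.upper()  # consistent naming
df["sale_date"] = pd.to_datetime(df["sale_date"])    # consistent type

# Load: append the cleaned rows to a warehouse table
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("fact_sales", conn, if_exists="append", index=False)
```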
3.2 Data Storage
• Data is stored in a structured format, often using dimensional models:
o Star Schema: A central fact table surrounded by dimension tables.
o Snowflake Schema: Dimension tables are normalized into multiple related tables.
• The data is organized to support analytical queries efficiently.
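A star schema can be illustrated with small in-memory tables. The sketch below uses
pandas, with invented fact and dimension tables, to show the typical analytical pattern:
join the fact table to its dimensions, then aggregate a measure.

```python
# A minimal star-schema sketch: one fact table keyed to two dimension
# tables. All table contents are invented for illustration.
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2],
                            "product": ["Milk", "Bread"]})
dim_date = pd.DataFrame({"date_id": [10, 11],
                         "quarter": ["Q1", "Q2"]})
fact_sales = pd.DataFrame({"product_id": [1, 1, 2],
                           "date_id": [10, 11, 10],
                           "amount": [100, 150, 80]})

# Typical analytical query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes
joined = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_date, on="date_id"))
print(joined.groupby(["product", "quarter"])["amount"].sum())
```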
3.3 Information Delivery
Information delivery refers to the tools and mechanisms used to provide end users with access to
data in the DW.
• Dashboards: Graphical summaries for executives.
• Reports: Predefined summaries based on business requirements.
• Ad hoc Queries: Custom queries executed by users using SQL or front-end tools.
• OLAP Tools: Enable slicing, dicing, drill-down, and roll-up operations.
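As a rough illustration of these OLAP operations, the sketch below mimics a roll-up and
a slice on a small invented dataset with pandas; real OLAP servers perform the same
operations over multidimensional cubes.

```python
# A rough sketch of OLAP-style operations on flat data with pandas;
# the sales rows are invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month":  ["Jan",  "Feb",  "Jan",  "Feb"],
    "amount": [100, 120, 90, 110],
})

# Roll-up: aggregate monthly detail up to region totals
print(sales.groupby("region")["amount"].sum())

# Slice: fix one dimension at a single value (month = "Jan")
print(sales[sales["month"] == "Jan"])
```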
3.4 Metadata
Metadata is "data about data." It helps users and administrators understand, manage, and use the
data warehouse.
• Technical Metadata: Describes data sources, data types, transformations, table definitions.
• Business Metadata: Defines the meaning of data in business terms (e.g., what does
"customer" mean?).
• Operational Metadata: Includes data about ETL jobs, load times, and execution logs.
3.5 Management and Control Components
These ensure the smooth functioning of the DW.
• Scheduling: Timely execution of ETL processes.
• Monitoring: Checking system performance and data accuracy.
• Security: Ensuring authorized access.
• Auditing: Tracking data lineage and changes.
3.6 Requirement Gathering Methods
To design a DW effectively, understanding business needs is crucial. Common techniques:
• Interviews: One-on-one discussions with stakeholders.
• Questionnaires: Structured forms to collect data needs.
• Workshops: Collaborative sessions with multiple stakeholders.
• Observation: Watching users interact with current systems.
3.7 Requirements Definition Document (RDD)
An RDD captures and formalizes business needs:
• Project scope and goals
• Functional requirements (e.g., types of reports needed)
• Non-functional requirements (e.g., performance, security)
• Data requirements and constraints
3.8 Business Dimensions and Key Measurements
• Dimensions: Categorical fields used to slice and filter data (e.g., Time, Geography, Product).
• Measures: Quantitative metrics used for analysis (e.g., Sales Amount, Profit, Quantity Sold).
• These are stored in fact tables and dimension tables to support analytical queries.
MODULE 4: Data Warehouse Architecture and Infrastructure
4.1 Architectural Components
A well-structured data warehouse architecture is essential for ensuring scalability, performance,
and reliability. The major architectural components are:
1. Data Acquisition:
o Extracts data from operational systems
o Cleanses and transforms data to ensure consistency
o Loads data into staging and ultimately into the warehouse
2. Data Storage:
o Centralized data warehouse storage
o Dimensional modeling used for data organization (star or snowflake schema)
o May include OLAP cubes and data marts for department-level storage
3. Information Delivery:
o End-users access data through reporting tools, dashboards, OLAP tools, and data
mining applications
o Provides support for ad hoc querying and multidimensional analysis
4. Metadata Management:
o Maintains data dictionaries, ETL job logs, user permissions, and lineage info
5. Warehouse Management and Monitoring:
o Includes system administration, performance tuning, backup/recovery, and access
control
4.2 Types of Architectures
• Two-Tier Architecture:
o Client applications directly access DW
o Simple but not scalable for large user bases
• Three-Tier Architecture:
o Divides architecture into:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Client Tools)
o Most common and scalable model
• Cloud-Based Architecture:
o Uses platforms like Amazon Redshift, Snowflake, and Google BigQuery
o Offers elasticity, scalability, and reduced infrastructure cost
4.3 Characteristics of Good DW Infrastructure
• Scalability to handle growing data volumes
• Security for sensitive business data
• High performance for query execution
• Fault tolerance and backup systems
• Support for real-time or near-real-time data integration

MODULE 5: Data Mining Overview


5.1 What is Data Mining?
Data mining is the process of discovering previously unknown, valid patterns and relationships in
large datasets, often involving AI, machine learning, and statistics. It plays a key role in predicting
future trends and behaviors.
5.2 Knowledge Discovery in Databases (KDD) Process
1. Data Cleaning – Removing noise, missing values
2. Data Integration – Combining data from multiple sources
3. Data Selection – Identifying relevant data
4. Data Transformation – Converting data into suitable format
5. Data Mining – Applying algorithms to discover patterns
6. Pattern Evaluation – Identifying truly interesting patterns
7. Knowledge Presentation – Visualizing patterns using charts, graphs, etc.
5.3 OLAP vs Data Mining

Criteria  | OLAP            | Data Mining
Objective | Summarize data  | Discover patterns
Query     | User-defined    | Algorithmic/automated
Output    | Aggregates      | Models and patterns
Example   | Sales by region | Predict customer churn

5.4 Common Data Mining Techniques


• Association Rule Learning:
o Discovers relationships between variables (e.g., Market Basket Analysis); see the
worked support/confidence example after this list
o Example: If a customer buys milk, they are likely to buy bread
• Classification:
o Assigns items to predefined classes (e.g., email as spam or not)
• Clustering:
o Groups data points with similar characteristics without predefined labels
• Regression:
o Predicts a numeric value (e.g., forecasting stock prices)
• Outlier Detection:
o Identifies rare or unusual observations (e.g., fraud detection)
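The worked example below computes support and confidence for the "milk → bread" rule
from the association-rule bullet above; the four transactions are invented for
illustration.

```python
# Support and confidence for the candidate rule "milk -> bread";
# the four transactions are invented for illustration.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "eggs"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"milk", "bread"} <= t)
milk = sum(1 for t in transactions if "milk" in t)

support = both / n        # fraction of all baskets with milk AND bread: 2/4
confidence = both / milk  # fraction of milk baskets that also have bread: 2/3
print(support, confidence)
```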

MODULE 6: Data Mining Classifiers (Brief Introduction)


6.1 K-Nearest Neighbour (K-NN)
• Type: Instance-based, lazy learning
• Working:
o Classifies new records based on the "k" closest data points in the training set
o Distance metrics (Euclidean, Manhattan) used to calculate similarity
• Advantages:
o Simple and effective for small datasets
• Disadvantages:
o Computationally expensive for large datasets
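A from-scratch sketch of the classification step described above, assuming Euclidean
distance and majority voting; the training points and query are invented, and a library
classifier (e.g., scikit-learn's KNeighborsClassifier) would be the usual practical
choice.

```python
# A from-scratch K-NN sketch assuming Euclidean distance and majority
# voting; training points and the query are invented for illustration.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature vector."""
    # Rank training points by Euclidean distance to the query
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    # Majority vote among the k nearest labels
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (1.5, 1.5)))  # -> "A"
```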
6.2 Support Vector Machine (SVM)
• Type: Supervised learning algorithm
• Working:
o Constructs a hyperplane that best separates different classes
o Uses kernel functions to handle non-linear separations
• Advantages:
o Effective in high-dimensional spaces
o Robust and accurate
• Disadvantages:
o Requires tuning of parameters like kernel type and regularization
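A minimal sketch with scikit-learn's SVC on a toy non-linear dataset; the RBF kernel
and C value are illustrative assumptions, and they are exactly the parameters the notes
say need tuning.

```python
# A minimal SVC sketch on a toy non-linear dataset; the RBF kernel and
# C value are illustrative choices, not prescribed by the notes.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# kernel and C (regularization strength) are the parameters that
# typically require tuning, e.g., via cross-validated grid search
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```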
6.3 Naive Bayes Classifier
• Type: Probabilistic classifier based on Bayes’ Theorem
• Assumption: Features are independent of each other
• Formula: P(C|X) = P(X|C) * P(C) / P(X)
• Applications:
o Spam filtering, sentiment analysis, document classification
• Advantages:
o Fast, scalable, and works well with large datasets
• Disadvantages:
o The assumption of independence may not always hold true
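A minimal spam-filtering sketch using scikit-learn's MultinomialNB, which estimates
P(X|C) and the prior P(C) from word counts, exactly the quantities in the formula above;
the four-document corpus and its labels are invented for illustration.

```python
# A minimal spam-filtering sketch with MultinomialNB; the four-document
# corpus and its labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win a free prize now", "meeting agenda attached",
        "free money offer", "lunch at noon tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)             # word-count features
model = MultinomialNB().fit(X, labels)  # estimates P(X|C) and the prior P(C)

print(model.predict(vec.transform(["free prize money"])))  # -> ['spam']
```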
