DW Micro
Q1: What is a data warehouse? Explain its importance in modern business.
Ans: A data warehouse is a centralized repository where large volumes of data from various sources are stored, organized, and managed. It is designed to support business intelligence (BI) activities such as reporting, analysis, and decision-making. Unlike operational databases, which handle day-to-day transactions, data warehouses are optimized for query performance and analytics.
Importance of a Data Warehouse in Modern Business:
a) Centralized Data Management: Combines data from different sources (e.g., CRM, ERP, social media) into a single, unified platform. Reduces data silos and ensures consistency in reporting.
b) Improved Decision-Making: Provides actionable insights by enabling advanced analytics and visualization. Helps businesses make data-driven decisions based on historical trends and patterns.
c) Enhanced Performance: Speeds up query execution for large datasets, saving time compared to traditional databases. Supports complex queries required for forecasting, trend analysis, and customer segmentation.
d) Scalability: Handles growing data volumes as businesses expand, ensuring long-term usability.
e) Improved Customer Understanding: Tracks customer behavior, preferences, and feedback across multiple channels. Helps design personalized marketing strategies and improve customer satisfaction.

Q2: What is a schema in a data warehouse? Explain star, snowflake, and galaxy schemas.
Ans: A schema in a data warehouse is a logical structure or blueprint that defines how data is organized, stored, and related within the database. It determines the tables, fields, and relationships among them to support analytical queries efficiently.
Star Schema
Structure: The star schema has a central fact table surrounded by dimension tables, resembling a star shape.
Fact Table: Stores quantitative data (e.g., sales, revenue) and foreign keys linking to dimension tables.
Dimension Tables: Contain descriptive attributes (e.g., product name, customer details) for analysis.
Snowflake Schema
Structure: A snowflake schema is an extension of the star schema where dimension tables are normalized (split into smaller related tables). This reduces redundancy but increases complexity.
Galaxy Schema (Fact Constellation)
Structure: A galaxy schema involves multiple fact tables sharing common dimension tables. It is used when a business tracks multiple related processes.
Q3: Example of star, snowflake and galaxy schemas
Ans: Star Schema
For a sales data warehouse:
Fact Table: Sales
Columns: Sale_ID, Product_ID, Customer_ID, Store_ID, Date, Sales_Amount
Dimension Tables:
Product: Product_ID, Product_Name, Category
Customer: Customer_ID, Customer_Name, Location
Store: Store_ID, Store_Name, City
Date: Date, Month, Year
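This star layout can be expressed as DDL. Below is a minimal sketch using Python's built-in sqlite3 module; the tables and columns follow the example above, except the Date dimension is written DateDim here because DATE is reserved in some SQL dialects:

import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = conn.cursor()

# Dimension tables hold the descriptive attributes.
cur.execute("CREATE TABLE Product (Product_ID INTEGER PRIMARY KEY, Product_Name TEXT, Category TEXT)")
cur.execute("CREATE TABLE Customer (Customer_ID INTEGER PRIMARY KEY, Customer_Name TEXT, Location TEXT)")
cur.execute("CREATE TABLE Store (Store_ID INTEGER PRIMARY KEY, Store_Name TEXT, City TEXT)")
cur.execute("CREATE TABLE DateDim (Date TEXT PRIMARY KEY, Month TEXT, Year INTEGER)")

# The central fact table holds the measure plus a foreign key to every dimension.
cur.execute("""
CREATE TABLE Sales (
    Sale_ID      INTEGER PRIMARY KEY,
    Product_ID   INTEGER REFERENCES Product(Product_ID),
    Customer_ID  INTEGER REFERENCES Customer(Customer_ID),
    Store_ID     INTEGER REFERENCES Store(Store_ID),
    Date         TEXT    REFERENCES DateDim(Date),
    Sales_Amount REAL
)
""")
conn.commit()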
Snowflake Schema: Using the same sales example, the dimension table Product is normalized into:
Product: Product_ID, Product_Name, Category_ID
Category: Category_ID, Category_Name
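Continuing the same sqlite3 session, a sketch of the snowflake variant: the Category column becomes a reference to a separate, normalized Category table (Product_Snowflake is named only so it can coexist with the star-schema Product table above):

# Snowflake: the Product dimension is split into two related tables.
cur.execute("CREATE TABLE Category (Category_ID INTEGER PRIMARY KEY, Category_Name TEXT)")
cur.execute("""
CREATE TABLE Product_Snowflake (
    Product_ID   INTEGER PRIMARY KEY,
    Product_Name TEXT,
    Category_ID  INTEGER REFERENCES Category(Category_ID)
)
""")
# Queries touching category names now need an extra join
# (Sales -> Product -> Category): less redundancy, more complexity.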
Galaxy Schema:
A retail business may have:
Fact Table Sales (Product_ID, Customer_ID, Store_ID, Sales_Amount)
Fact Table Returns (Product_ID, Customer_ID, Store_ID, Return_Amount)
Both share dimension tables like Product, Customer, and Store.

Q4: Difference between database and data warehouse
Ans: A database supports day-to-day transactional processing (OLTP): it stores current, frequently updated operational data, usually in normalized tables. A data warehouse supports analytical processing (OLAP): it stores large volumes of historical data integrated from many sources and is optimized for query performance, reporting, and decision-making rather than for transactions.
Q13: Explain data mining and tasks of data mining
Ans: Data mining is the process of discovering patterns, trends, relationships, and useful insights from large datasets using statistical, mathematical, and computational techniques. It is a key step in the data analysis process and helps businesses make data-driven decisions.
Tasks of Data Mining
Data mining tasks are broadly classified into two types: descriptive and predictive tasks.
1. Descriptive Tasks: These tasks aim to summarize the characteristics or patterns in the data.
Clustering: Grouping similar data points into clusters or segments based on their characteristics. Example: Grouping customers based on purchasing behavior.
2. Predictive Tasks: These tasks involve using data to predict future outcomes.
Classification: Assigning data points to predefined categories based on input features. Example: Predicting whether an email is spam or not based on certain features.
(Both tasks are sketched in code after Q14's answer below.)

Q14: Explain architecture of data mining.
Ans: Data Source Layer: Includes data from multiple sources like databases, data warehouses, flat files, or external datasets.
Data Preprocessing Layer: Involves cleaning, transforming, and integrating data to remove inconsistencies and ensure quality data for analysis.
Data Mining Engine: This is the core component that applies algorithms and techniques like classification, clustering, regression, etc., to mine patterns from data.
Pattern Evaluation Layer: Evaluates the discovered patterns to identify the most interesting or valuable insights.
Knowledge Base: Stores knowledge about the data, data mining techniques, and the results of previous analyses.
User Interface: Allows users to interact with the system, set parameters, and visualize the results of data mining.
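A minimal sketch of the two Q13 tasks, as a mining engine might run them. scikit-learn is assumed here purely for illustration, and the toy feature values are invented:

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Descriptive task -- clustering: group customers by purchasing behavior.
# Toy features per customer: [annual_spend, visits_per_month]
customers = [[200, 2], [250, 3], [5000, 20], [5200, 18]]
segments = KMeans(n_clusters=2, n_init=10).fit_predict(customers)
print("customer segments:", segments)   # e.g. [0 0 1 1] (labels are arbitrary)

# Predictive task -- classification: assign emails to predefined categories.
# Toy features per email: [num_links, has_suspicious_words]
emails = [[10, 1], [8, 1], [0, 0], [1, 0]]
labels = ["spam", "spam", "ham", "ham"]
model = DecisionTreeClassifier().fit(emails, labels)
print(model.predict([[9, 1]]))          # -> ['spam']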
Q15: Explain data lake, Hadoop, metadata, MapReduce, and surrogate key
Ans: Data Lake
A centralized storage system for raw, unstructured, and structured data. Stores large volumes of data with schema-on-read. Used for big data analytics and real-time processing.
Hadoop
An open-source framework for distributed storage (HDFS) and parallel data processing (MapReduce). Handles big data across many machines with fault tolerance and scalability.
Metadata
Data about data: describes the structure, format, and usage of data. Helps in data discovery and management.
MapReduce
A programming model for processing large data sets in parallel. Divides tasks into Map (process) and Reduce (combine) phases. Used for efficient, scalable data processing. (A word-count sketch in code follows Q16's answer below.)
Surrogate Key
A surrogate key is a unique, system-generated identifier used in a database to represent an entity, replacing natural keys.
Unique Identifier: A sequential number (e.g., 1, 2, 3) to uniquely identify a record.
No Business Dependency: Not derived from business data (e.g., customer email).
Improves Efficiency: Reduces complexity and improves query performance.

Q16: Explain building blocks or components of data warehouse
Ans: Data Sources: Various operational systems (databases, CRM, APIs) providing raw data for ETL processing.
ETL Process:
Extract: Collects data.
Transform: Cleans and integrates data.
Load: Loads transformed data into the warehouse.
Data Storage: Central repository for structured data, including fact tables (numeric data) and dimension tables (descriptive data).
Data Modeling: Organizes data into schemas like Star Schema or Snowflake Schema for efficient querying.
OLAP Engine: Supports multidimensional querying for fast analysis (e.g., slicing, dicing).
Metadata: Describes data structure, helping manage and provide context for users.
Front-End Tools: BI tools (dashboards, reports) for data analysis and decision-making.
Data Governance and Security: Ensures data quality, integrity, and secure access.
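The Map and Reduce phases from Q15 can be imitated in plain Python. This word-count sketch shows only the programming model, not Hadoop's actual API; in Hadoop, the map calls would run in parallel across machines:

from collections import defaultdict

documents = ["big data tools", "big data mining", "data warehouse"]

# Map phase: each document is processed independently (hence parallelizable),
# emitting (key, value) pairs -- here (word, 1).
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle step: group all emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 3, 'tools': 1, 'mining': 1, 'warehouse': 1}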
Q17: Explain top-down and bottom-up approach in data warehouse
Ans: Top-Down Approach: In the top-down approach, the centralized data warehouse is built first, and data marts are created from it later.
Centralized Design: The data warehouse is developed first as the core repository.
Data Marts: Data marts are derived from the data warehouse later, based on business needs.
High-Level Integration: Focuses on building an integrated, enterprise-wide data model.
Cost and Time: Initially more costly and time-consuming.
Example: IBM and Oracle use this approach for large-scale systems.
Bottom-Up Approach: In the bottom-up approach, data marts are built first and later integrated into the data warehouse.
Data Marts First: Data marts are created based on specific business areas (e.g., sales, marketing).
Integration Later: These data marts are later integrated into a centralized data warehouse.
Faster Results: Quicker implementation, as business areas get access to data faster.
Cost-Effective: Lower initial costs; can be scaled incrementally.
Example: Retail businesses often use this approach for quicker insights.

Q18: Explain data modelling lifecycle
Ans: Requirement Gathering: Understand business needs and data requirements through stakeholder interactions.
Conceptual Data Modeling: Define high-level relationships between data entities, focusing on business terms.
Logical Data Modeling: Create detailed models, specifying attributes, keys, and relationships, independent of physical design.
Physical Data Modeling: Design how data will be stored, optimizing for performance and storage (contrasted with the logical model in the sketch after this answer).
Implementation: Create and deploy the database schema in the database system.
Data Integration and Population: Load data into the model using ETL processes.
Testing and Validation: Verify the model's correctness, performance, and alignment with business needs.
Maintenance and Optimization: Update and optimize the model as business needs evolve.
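To make Q18's logical-versus-physical distinction concrete, here is a small illustrative sketch (the entity and names are invented): the same Customer entity first as a storage-independent logical model, then mapped to engine-specific physical DDL where storage choices are made:

from dataclasses import dataclass

# Logical model: attributes, keys, and relationships,
# with no commitment to any storage engine.
@dataclass
class Customer:
    customer_id: int      # primary key
    customer_name: str
    location: str

# Physical model: the same entity expressed as concrete DDL,
# where engine-specific types and indexes are chosen for performance.
PHYSICAL_DDL = """
CREATE TABLE Customer (
    Customer_ID   INTEGER PRIMARY KEY,
    Customer_Name VARCHAR(100),
    Location      VARCHAR(100)
);
CREATE INDEX idx_customer_location ON Customer (Location);
"""
print(PHYSICAL_DDL)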