Data Warehouse & Data Mining - ALL
Short-Answer Questions
1. What is KDD? Explain data mining as a step in the process of knowledge discovery.
Data Selection: Identifying the relevant data from a larger data set. Focusing on data
that is pertinent to the problem at hand.
Data Preprocessing: Cleaning the data to remove noise and handle missing values.
Ensuring the data is in a suitable format for analysis.
Data Transformation: Transforming data into appropriate forms for mining. This
might include normalization, aggregation, and feature selection.
Data Mining: The core step where algorithms are applied to extract patterns from
the data. This can involve classification, regression, clustering, association rule
learning, and other techniques.
Pattern Evaluation: Interpreting the mined patterns to identify the truly useful ones.
Ensuring the patterns are valid, novel, potentially useful, and understandable.
Data mining is a critical step within the KDD process where the actual extraction of
patterns and knowledge occurs. During this phase, algorithms such as classification,
regression, clustering, and association rule mining are applied to the preprocessed and
transformed data to discover patterns.
2. Define data mining. On what kinds of data can data mining be performed?
Relational Databases:
Data Warehouses:
Large repositories of integrated data from multiple sources, structured for query
and analysis.
Designed to facilitate reporting and analysis, making them a prime target for
data mining.
Transactional Databases:
Spatial Databases:
Text Data:
Unstructured data in the form of text, such as documents, emails, and social
media posts.
Techniques like natural language processing (NLP) are used to extract
meaningful information from text.
Multimedia Data:
Web Data:
Data collected from the web, including web pages, clickstreams, and social
networks.
Useful for web usage mining, web structure mining, and web content mining.
Depending on the type of data and the goals of the analysis, different data mining
techniques can be employed.
Data preprocessing is a crucial step in the data mining process. It involves preparing
raw data for analysis by cleaning, transforming, and organizing it. The main reasons
for preprocessing data are:
Real-world data often contains missing values due to errors in data collection,
transmission, or entry.
Techniques such as imputation (filling in missing values) or deletion (removing
incomplete records) are used to handle missing data, ensuring the dataset is
complete and suitable for analysis.
Raw data may include errors, outliers, or noise that can distort analysis results.
Preprocessing steps like smoothing, normalization, and outlier detection help to
clean the data, improving the accuracy and reliability of data mining outcomes.
Data attributes may have different scales or units (e.g., age in years, income in
dollars), which can bias the analysis.
Normalization (scaling values to a common range) ensures that no single
attribute dominates the analysis due to its scale, allowing for fair comparison
and better algorithm performance.
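The following minimal Python sketch (with made-up income values) illustrates the min-max
normalization described above; the function name and numbers are purely illustrative.

```python
# Minimal sketch of min-max normalization (illustrative values only).
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers to the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                       # avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [25000, 48000, 52000, 110000]          # hypothetical raw income values
print(min_max_normalize(incomes))                # e.g. [0.0, 0.27, 0.32, 1.0]
```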
Data mining, the process of extracting knowledge from massive datasets, faces
several challenges. These issues can be broadly classified into three categories:
Choosing the Right Mining Method: Selecting the most appropriate data
mining technique (classification, clustering, etc.) for a specific task is crucial.
Choosing the wrong method can lead to irrelevant or meaningless patterns.
Incorporating User Knowledge: Data mining works best when combined with
human expertise. Background knowledge about the data and the desired outcome
can guide the selection of methods and interpretation of results.
Performance and Scalability:
Data Privacy and Security: As data mining deals with potentially sensitive
information, ensuring data security and privacy is paramount. Regulations like GDPR
and CCPA mandate careful handling of personal data.
Ethical Considerations: Data mining can raise ethical concerns, such as biased
algorithms or discriminatory practices. Transparency in data collection, usage, and
potential biases is crucial.
Data mining, like a well-oiled machine, has several key components working
together to extract knowledge from data. Here's a breakdown of its anatomy:
1. Data Sources:
This is the raw material – the data you want to mine for insights. It can come from
various sources like databases, transaction records, social media feeds, sensor
readings, and more.
2. Data Warehouse Server:
Imagine a giant organized storage facility. The data warehouse server holds the data
collected from various sources, often preprocessed and structured for efficient
analysis.
3. Data Mining Engine:
This is the heart of the operation. The engine uses specific algorithms and techniques
to analyze the data based on your goals. Common techniques include classification
(categorizing data), clustering (grouping similar data points), and association rule
learning (finding relationships between data).
4. Pattern Evaluation Module:
Not all patterns discovered are valuable. This module evaluates the unearthed
patterns based on pre-defined criteria like statistical significance or business
relevance. It helps identify the most useful insights.
5. Knowledge Base:
This component acts like a memory bank. It stores past data mining results and user
knowledge about the specific domain. The engine can leverage this knowledge to
refine future analysis and potentially discover even deeper insights.
6. User Interface:
This is the user's interface with the data mining system. It allows users to define
goals, select data, monitor the process, and visualize the discovered patterns and
insights in an understandable format.
Initialization:
Fitness Evaluation:
Selection:
Individuals with higher fitness values are more likely to be selected for reproduction.
This simulates the survival of the fittest principle.
Crossover:
Mutation:
Random changes (mutations) are introduced to the offspring's genetic material. This
helps in exploring new solution spaces and prevents premature convergence.
Replacement:
The new population replaces the old one, and the process repeats from step 2 until a
termination condition is met (e.g., maximum number of generations, satisfactory
fitness level).
Key Components:
Individual: A potential solution represented as a string of values (e.g., binary, integer, or
real-valued).
Population: A group of individuals.
Fitness function: Evaluates the quality of an individual.
Selection: Determines which individuals will reproduce.
Crossover: Combines genetic material from parents to create offspring.
Mutation: Introduces random changes to the offspring.
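The loop below is a minimal Python sketch of these components working together on a toy
problem (maximizing the number of 1-bits in a binary string); all parameter values are
illustrative, not prescriptive.

```python
import random

# Toy genetic algorithm: maximize the number of 1-bits in a binary string.
LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

def fitness(individual):                 # fitness function: count of 1s
    return sum(individual)

def select(population):                  # tournament selection of size 2
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent1, parent2):         # single-point crossover
    point = random.randint(1, LENGTH - 1)
    return parent1[:point] + parent2[point:]

def mutate(individual):                  # flip bits with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in individual]

# Initialization: random population of binary strings.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

# Replacement: a new population is produced each generation.
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

print("best fitness:", max(fitness(ind) for ind in population))
```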
Applications:
How it works:
Data preparation: The data is cleaned and preprocessed to ensure consistency and relevance.
Similarity measurement: A distance or similarity metric is chosen to determine how close
data points are to each other. Common metrics include Euclidean distance, Manhattan
distance, and cosine similarity.
Cluster formation: The algorithm iteratively groups data points based on their similarity. The
number of clusters can be predefined or determined automatically.
Evaluation: The quality of the clustering is assessed using various metrics, such as silhouette
coefficient, Calinski-Harabasz index, or Davies-Bouldin index.
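As a hedged illustration of these steps, the sketch below runs k-means on a handful of
made-up 2-D points and checks the result with the silhouette coefficient; it assumes
scikit-learn is installed.

```python
# Minimal clustering sketch: k-means plus a silhouette-based quality check.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

points = [[1, 2], [1, 4], [1, 0],        # small, made-up 2-D dataset
          [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:", kmeans.labels_)                         # cluster assignment per point
print("silhouette:", silhouette_score(points, kmeans.labels_))
```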
Types of Clustering:
Applications of Clustering:
Challenges in Clustering:
Determining the optimal number of clusters: The number of clusters is often subjective and
can significantly impact the results.
Handling different data types: Clustering algorithms may need to be adapted for different
data types (e.g., numerical, categorical, text).
Outliers: Outliers can significantly affect the clustering results.
Scalability: Some clustering algorithms may not be efficient for large datasets.
Key concepts:
Example:
If 80% of the transactions that contain diapers also contain beer, the rule "diapers -> beer"
has a confidence of 80%. If, say, diapers and beer appear together in 20% of all transactions,
the rule's support is 20%. Support measures how often the itemset occurs overall, while
confidence measures how often the rule holds when diapers are present.
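A short Python sketch of how support and confidence are computed from raw transactions
(the transaction list is made up for illustration):

```python
# Support and confidence for the rule {diapers} -> {beer} over made-up transactions.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_support = support({"diapers", "beer"})              # fraction with both items
confidence = rule_support / support({"diapers"})         # P(beer | diapers)
print(f"support={rule_support:.2f}, confidence={confidence:.2f}")   # 0.50, 0.67
```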
Applications:
Decision Trees
Decision trees are a supervised machine learning algorithm used for both
classification and regression tasks. They create a tree-like model of decisions and
their possible consequences.
How it works:
The tree is built recursively by splitting the data based on features that best separate the
target variable.
Each internal node represents a test on an attribute, and each branch represents an outcome
of the test.
Leaf nodes represent the final decision or prediction.
Key concepts:
Information gain: Measures the decrease in entropy (impurity) after splitting the data.
Gini index: Measures the impurity of a node based on the probability of misclassifying a
randomly chosen element.
Pruning: Reduces the size of the decision tree to avoid overfitting.
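A minimal sketch of these concepts in practice, assuming scikit-learn is installed: the tree
uses the Gini criterion, and limiting max_depth acts as a simple form of pre-pruning. The
bundled iris dataset is used only as convenient sample data.

```python
# Decision tree sketch: Gini criterion with depth limiting as simple pre-pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))        # human-readable if-then view of the learned tree
```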
Applications:
Supervised vs. Unsupervised: Decision trees are supervised, while association rule mining is
unsupervised.
Goal: Decision trees aim to predict a target variable, while association rule mining aims to
find relationships between items.
Output: Decision trees produce a tree-like structure, while association rule mining produces
a set of if-then rules.
Metrics: Decision trees use metrics like information gain or Gini index, while association rule
mining uses support, confidence, and lift.
9. Write short notes on Apriori algorithm
Apriori is a classic algorithm used for frequent itemset mining and association rule
learning. It's particularly effective in analyzing transactional databases, such as
market basket analysis.
Key Concepts:
How it Works:
1. Candidate Generation: Generates candidate itemsets based on frequent itemsets from the
previous iteration.
2. Support Counting: Calculates the support for each candidate itemset.
3. Pruning: Removes itemsets with support below the minimum support threshold.
4. Association Rule Generation: Generates association rules from frequent itemsets.
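The pure-Python sketch below works through one level of this process (frequent 1-itemsets,
candidate 2-itemsets, support-based pruning); the transactions and the minimum support
threshold are made up for illustration.

```python
from itertools import combinations

# One-level Apriori sketch: frequent 1-itemsets, then candidate and frequent 2-itemsets.
transactions = [{"bread", "milk"}, {"bread", "beer"},
                {"milk", "beer", "bread"}, {"milk", "beer"}]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {frozenset([i]) for t in transactions for i in t}
L1 = {s for s in items if support(s) >= min_support}       # frequent 1-itemsets

C2 = {a | b for a, b in combinations(L1, 2)}                # candidate generation (join step)
L2 = {s for s in C2 if support(s) >= min_support}           # pruning by minimum support
print("L1:", L1)
print("L2:", L2)
```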
Key Features:
Applications:
Limitations:
Basic Charts:
Advanced Charts:
Interactive Visualizations:
Exploratory Data Analysis (EDA): Visualizations help identify patterns, outliers, and
relationships within data.
Data Quality Assessment: Visualizations can reveal data inconsistencies and errors.
Performance Monitoring: Visualize key performance indicators (KPIs) to track business
performance.
Predictive Modeling: Visualize model outputs to understand their behavior and accuracy.
Storytelling: Create compelling narratives using visualizations to communicate insights to
stakeholders.
Choose the right visualization: Consider the type of data and the message you want to
convey.
Keep it simple: Avoid cluttering visualizations with unnecessary elements.
Use color effectively: Choose colors that enhance understanding and avoid color blindness
issues.
Interactive elements: Allow users to explore data dynamically.
Contextual information: Provide clear labels, titles, and legends.
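A small matplotlib sketch of a clearly labeled bar chart (the monthly figures are made up);
it assumes matplotlib is installed.

```python
# Simple, labeled bar chart of made-up monthly sales figures.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 142]                    # hypothetical KPI values

plt.bar(months, sales, color="steelblue")
plt.title("Monthly Sales")                      # clear title ...
plt.xlabel("Month")                             # ... and axis labels
plt.ylabel("Sales (units)")
plt.tight_layout()
plt.show()
```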
Information is data that has been processed, organized, and interpreted to give it
meaning and context. It is useful for decision-making, understanding, and learning.
Information is derived from data.
Key Differences:
Example:
A list of daily sales transactions is raw data; a report showing that sales increased 15%
over the previous quarter is information derived from that data.
Data warehousing involves extracting, transforming, and loading (ETL) data from
multiple OLTP systems into a centralized repository for analysis and reporting. The
data warehouse is designed for analytical processing (OLAP), which involves
complex queries and aggregations.
Data mining is the process of discovering patterns in large data sets. While OLTP
systems generate the raw data, they are not directly used for data mining due to their
focus on transaction processing.
Instead, data mining techniques are applied to the data stored in the data warehouse.
This is because data warehouses contain historical, integrated, and summarized data,
which is ideal for analysis.
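The pandas sketch below illustrates this flow in miniature: an OLTP-style extract is cleaned,
aggregated, and loaded as a summary table. Column names, values, and the output file name are
hypothetical, and pandas is assumed to be installed.

```python
# Tiny ETL sketch: extract OLTP-style rows, transform (clean + aggregate), load a summary.
import pandas as pd

orders = pd.DataFrame({                              # "extract": stand-in for an OLTP pull
    "order_id": [1, 2, 3, 4],
    "customer": ["A", "B", "A", None],
    "amount":   [100.0, 250.0, 75.0, 60.0],
})

clean = orders.dropna(subset=["customer"])           # "transform": drop incomplete rows
summary = clean.groupby("customer", as_index=False)["amount"].sum()

summary.to_csv("daily_sales_summary.csv", index=False)   # "load": write the summary table
print(summary)
```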
1. Business Users:
2. IT Professionals:
Data Engineers: Responsible for designing, building, and maintaining the data warehouse
infrastructure.
Data Scientists: Develop advanced analytical models and algorithms.
Database Administrators (DBAs): Manage and optimize the data warehouse database.
3. Knowledge Workers:
Varying levels of technical expertise: From non-technical executives to highly skilled data
scientists.
Different analytical needs: Ranging from simple reporting to complex data mining.
Diverse roles and responsibilities: Covering various business functions and IT departments.
To effectively serve these diverse users, data warehouses often employ a layered
architecture with different levels of abstraction, providing various tools and interfaces
to meet the needs of different user groups.
In a data warehouse, a schema is the logical description of how the data is organized: it
defines the fact tables, the dimension tables, and the relationships between them. The most
common multidimensional schemas are the Star Schema, in which a central fact table is
linked directly to denormalized dimension tables; the Snowflake Schema, in which the
dimension tables are further normalized into related sub-tables; and the Fact Constellation
(galaxy) Schema, in which multiple fact tables share dimension tables.
The choice of schema is a trade-off: the star schema favors simpler, faster queries, while the
snowflake schema reduces redundancy and storage at the cost of additional joins.
Key Characteristics:
Normalized Dimension Tables: Unlike the Star Schema where dimension tables are
denormalized, the Snowflake Schema breaks down dimension tables into multiple related
tables, creating a hierarchical or "snowflake" structure.
Hierarchical Structure: This normalization leads to a hierarchical arrangement of dimension
tables, resembling a snowflake when visualized.
Reduced Data Redundancy: By normalizing dimension tables, data redundancy is reduced
compared to the Star Schema.
Increased Query Complexity: While offering storage efficiency, the Snowflake Schema can
introduce additional joins during query processing, potentially impacting query performance.
Example:
2. Performance Optimization:
Query Efficiency: Queries that touch only a subset of the data can be directed to the relevant
partitions, so less data is scanned and response times improve.
Parallel Processing: Loading, indexing, and aggregation can run on several partitions in
parallel, which is important for very large fact tables.
3. Easier Manageability:
Maintenance Operations: Tasks such as loading new data, rebuilding indexes, and purging old
records can be carried out on individual partitions without taking the whole warehouse offline.
4. Data Recovery:
Isolation: If a partition becomes corrupted or inaccessible, the data on other partitions
remains unaffected, and individual partitions can be backed up and restored independently,
increasing the chances of data recovery.
5. Archiving Historical Data:
Ageing Out: Older partitions (for example, fact data from previous years) can be archived or
dropped without disturbing the current data.
Key Concepts:
Horizontal Partitioning (Sharding): Divides data rows across multiple servers or storage
devices based on a specific criterion (e.g., date, geographic location, customer ID).
Vertical Partitioning: Splits data columns into different partitions based on data type or
usage patterns.
Range Partitioning: Divides data based on a range of values for a specific column (e.g., date
range, numeric range).
Hash Partitioning: Distributes data evenly across partitions using a hash function.
List Partitioning: Divides data based on values in a specific column that belong to a
predefined list.
How it Works:
Example:
Consider a customer table with a customer_id column. We want to partition this table
into 4 partitions using hash partitioning based on the customer_id.
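A rough Python sketch of that idea (the customer IDs are made up; a real database engine
performs this assignment internally):

```python
# Assigning rows to 4 partitions by hashing customer_id.
NUM_PARTITIONS = 4

def partition_for(customer_id):
    return hash(customer_id) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for customer_id in [101, 102, 103, 104, 105, 106, 107, 108]:
    partitions[partition_for(customer_id)].append(customer_id)

for p, ids in partitions.items():
    print(f"partition {p}: {ids}")
```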
Advantages:
Even Data Distribution: Ideally, hash partitioning can distribute data evenly across partitions,
improving query performance and load balancing.
Scalability: Adding or removing partitions is relatively straightforward.
Dynamic Partitioning: New data can be added to partitions dynamically based on the hash
function.
Limitations:
Data Skew: If the data distribution is not uniform, it can lead to uneven load across partitions.
Partition Range Restrictions: Hash partitioning does not allow for partition-based range
queries efficiently.
Use Cases:
Distributing large datasets across multiple servers or storage devices.
Improving query performance by parallelizing data processing.
Balancing load across multiple nodes in a distributed database system.
Measures or Facts: These are the quantitative values that are the primary focus of analysis.
Examples include sales amount, quantity sold, profit, cost, etc.
Foreign Keys: These link the fact table to dimension tables, providing context to the
measures. For instance, a sales fact table might have foreign keys to product, customer, and
time dimensions.
Granularity: This defines the level of detail in the fact table. It could be at a transaction level
(e.g., each sale), daily, monthly, or yearly level.
1. Identify the Business Process: Clearly define the business process that the fact table will
represent (e.g., sales, inventory, finance).
2. Determine the Grain: Establish the level of detail for each fact table row. This is crucial for
defining the fact table's structure and the types of analysis it can support.
3. Identify Measures: Determine the quantitative data points that will be stored in the fact
table (e.g., sales amount, quantity sold, profit margin).
4. Identify Dimensions: Identify the factors that provide context to the measures (e.g., product,
customer, time, location).
5. Create Foreign Keys: Include foreign keys in the fact table to link it to the corresponding
dimension tables.
6. Consider Performance: Optimize the fact table for query performance by using appropriate
data types, indexing, and partitioning strategies.
Example:
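As a hedged illustration, the sketch below uses Python's built-in sqlite3 module to create a
simple sales fact table at transaction grain, with foreign keys to product, customer, and date
dimensions; all table and column names are hypothetical.

```python
# Minimal star-schema sales fact table using the built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, full_date TEXT);

CREATE TABLE fact_sales (               -- grain: one row per sale
    product_id   INTEGER REFERENCES dim_product(product_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    date_id      INTEGER REFERENCES dim_date(date_id),
    quantity     INTEGER,               -- measure
    sales_amount REAL                   -- measure
);
""")
print("fact and dimension tables created")
```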
Best Practices:
Keep fact tables simple: Focus on core measures and avoid storing derived or calculated
values.
Choose appropriate data types: Use data types that optimize storage and performance.
Consider indexing: Create indexes on frequently used columns to improve query
performance.
Partition fact tables: Divide large fact tables into smaller partitions for better manageability
and performance.
Unified View of Data: Combines data from various sources into a single, consistent view,
providing a holistic understanding of the business.
Enhanced Data Quality: Improves data consistency and accuracy through cleansing and
transformation processes.
Facilitates Data Analysis: Supports complex analytical queries and data mining techniques to
uncover valuable insights.
Historical Perspective: Stores historical data, enabling trend analysis, forecasting, and
performance benchmarking.
Faster Query Response: Optimizes data for analytical processing, leading to faster query
execution times.
Reduced Data Redundancy: Eliminates inconsistencies and redundant data, improving data
management efficiency.
Enhanced Data Accessibility: Provides easy access to data for authorized users, empowering
them to make informed decisions.
Automation: Enables automation of routine reporting and analysis tasks, freeing up
resources for strategic initiatives.
Competitive Advantage
Other Benefits
Data Security and Governance: Centralized data management enhances security and
compliance with data regulations.
Scalability: Can accommodate growing data volumes and increasing analytical demands.
Support for Business Intelligence: Provides a foundation for business intelligence and
reporting tools.
Examples of Metadata:
For a digital image: file size, resolution, creation date, author, camera model
For a book: title, author, publication date, publisher, number of pages, ISBN
For a database table: column names, data types, relationships to other tables
Types of Metadata:
Descriptive metadata: Describes the resource, such as title, author, and subject.
Structural metadata: Describes the structure of the resource, such as how parts are
organized.
Administrative metadata: Describes technical information about the resource, such as file
size, format, and creation date.
Example:
Imagine a large retail company with a central data warehouse containing all sales,
customer, product, and financial data. The marketing department might require a data
mart specifically focused on customer demographics, purchase history, and marketing
campaign performance. This data mart would be a subset of the overall data
warehouse, optimized for the marketing team's analytical needs.
Key Characteristics of a Data Mart:
Focused on a single business area or department, smaller in scope than the enterprise
data warehouse, sourced from a subset of its data, and optimized for the queries and
reports of its user group.
Examples of Legacy Systems:
Mainframe computers: Once the backbone of large-scale computing, but now largely
replaced by modern servers.
COBOL applications: Written in an older programming language, still used in some financial
and government systems.
Proprietary hardware and software: Systems that use unique hardware or software, making
them difficult to upgrade or replace.
Improved Data Quality: Accurate and complete metadata enhances data quality and
reliability.
Enhanced Data Discoverability: Well-managed metadata makes it easier to find and access
relevant data.
Better Data Understanding: Metadata provides context and meaning to data, facilitating
data interpretation and analysis.
Increased Data Usability: Accurate and consistent metadata promotes data sharing and
reuse.
Support for Data Governance: Metadata plays a crucial role in data governance initiatives by
providing information about data ownership, lineage, and compliance.
Types of Backup
There are primarily three types of backups used to safeguard digital assets:
1. Full Backup
2. Incremental Backup
Definition: Backs up only files that have changed since the last backup, regardless of whether
it was a full or incremental backup.
Advantages: Faster than full backups, requires less storage space.
Disadvantages: Requires the last full backup and all subsequent incremental backups for
restoration.
3. Differential Backup
Definition: Backs up all files that have changed since the last full backup.
Advantages: Faster than full backups, requires less storage space than full backups but more
than incremental backups.
Disadvantages: Requires the last full backup for restoration.
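The toy sketch below contrasts what an incremental and a differential backup would copy,
using made-up file names and modification times:

```python
# Which files an incremental vs. a differential backup would copy (made-up data).
files = {                      # file -> last-modified time (hours since the last full backup)
    "orders.db": 1, "customers.db": 5, "products.db": 12,
}
last_full_backup = 0           # time of the last full backup
last_any_backup = 8            # time of the most recent backup of any kind

incremental = [f for f, t in files.items() if t > last_any_backup]
differential = [f for f, t in files.items() if t > last_full_backup]

print("incremental copies:", incremental)    # only products.db changed since the last backup
print("differential copies:", differential)  # everything changed since the last full backup
```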
ESSAYS 10 MARKS
Q. Explain Data mining Architecture and its operations in detail.
Data Sources:
Data Warehouses: Centralized storage combining data from various sources, optimized for query and
analysis.
ETL (Extract, Transform, Load): Integrates data from different sources, cleans, transforms, and loads
it into a data warehouse or a data mart.
Data Cleaning: Removes noise, handles missing values, and corrects inconsistencies.
Data Warehouse or Database Server:
Acts as a central repository where integrated data is stored and managed. Provides a
multidimensional view of data, facilitating efficient querying and analysis.
Data Mining Engine (Core Component): Implements various data mining algorithms for tasks
such as classification, clustering, regression, and association rule mining.
Pattern Evaluation Module: Validates and evaluates the mined patterns for their significance and
relevance.
Knowledge Base:
Stores domain knowledge, metadata, and rules that guide the data mining process. Enhances the
accuracy and interpretability of the mined patterns.
User Interface:
Provides tools for query and reporting, visualization, and interaction with the data mining system.
Facilitates the presentation of mined knowledge in an understandable format, such as charts, graphs,
and reports.
Data Preparation:
Data Selection:
Data Mining:
Pattern Evaluation:
Knowledge Presentation:
1. Choose an Attribute: It selects the attribute that best classifies the data based on
information gain. Information gain measures how much uncertainty about the target
variable is reduced after splitting the data based on an attribute.
2. Create a Decision Node: A decision node is created for the selected attribute.
3. Create Branches: Branches are created for each possible value of the selected attribute.
4. Repeat: The process is recursively applied to each branch until all data instances belong to
the same class or a predefined stopping criterion is met.
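The sketch below shows the entropy and information-gain computation that drives step 1, on a
tiny made-up categorical dataset:

```python
from collections import Counter
from math import log2

# Entropy and information gain, the attribute-selection measure used by ID3.
data = [  # (outlook, play) - made-up categorical examples
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("rain", "no"), ("overcast", "yes"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows):
    labels = [label for _, label in rows]
    gain = entropy(labels)                         # entropy before the split
    for value in {attr for attr, _ in rows}:       # subtract weighted entropy after the split
        subset = [label for attr, label in rows if attr == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

print(f"gain(outlook) = {information_gain(data):.3f}")   # about 0.667 here
```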
Classification: ID3 is primarily used for classification tasks, where the goal is to predict the
class label of new instances based on their attributes.
Decision Support: The resulting decision tree can be easily interpreted and visualized,
making it valuable for decision-making and knowledge discovery.
Data Exploration: ID3 can help identify important attributes and relationships within the
data.
Advantages of ID3
Disadvantages of ID3
Overfitting: Can create overly complex trees that overfit the training data.
Bias Towards Multi-valued Attributes: Tends to favor attributes with more values.
Limited to Categorical Data: Handles only categorical data; numerical data need to be
discretized.
Neural networks are a class of machine learning models inspired by the structure and
functioning of the human brain. They are particularly effective for tasks involving
pattern recognition, such as image and speech recognition, natural language
processing, and time series forecasting.
Neurons (Nodes):
Layers:
Weights:
1. Parameters that connect neurons between layers, determining the strength and
direction of the signal.
2. Adjusted during training to minimize the error in predictions.
Activation Functions:
Forward Propagation:
Loss Function:
1. Measures the difference between the network's output and the actual target values.
2. Common loss functions include mean squared error (MSE) for regression and
cross-entropy for classification.
Backpropagation:
Optimization:
1. Techniques like stochastic gradient descent (SGD), Adam, and RMSprop are used to
find the optimal weights that minimize the loss function.
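A numpy sketch of forward propagation and the MSE loss for a tiny one-hidden-layer network;
the weights and data are random placeholders, and backpropagation/optimization are only noted
in the comments.

```python
# Forward propagation and mean-squared-error loss for a tiny one-hidden-layer network.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 samples, 3 input features (made up)
y = rng.normal(size=(4, 1))          # regression targets (made up)

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)    # input -> hidden layer (5 units)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)    # hidden -> output layer

def relu(z):                         # activation function
    return np.maximum(0, z)

hidden = relu(X @ W1 + b1)           # forward propagation, layer 1
output = hidden @ W2 + b2            # layer 2 (linear output)

mse = np.mean((output - y) ** 2)     # loss function
print("MSE:", mse)                   # backpropagation would adjust W1, b1, W2, b2
                                     # using gradients of this loss (e.g. via SGD or Adam)
```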
Applications
Image and Speech Recognition: Convolutional Neural Networks (CNNs) for image data and
Recurrent Neural Networks (RNNs) for sequential data.
Natural Language Processing (NLP): Tasks like sentiment analysis, machine translation, and
text generation.
Time Series Forecasting: Predicting future values based on historical data.
Decision Tree
Decision trees are a popular machine learning technique used for both classification
and regression tasks. They are a type of predictive modeling tool that maps
observations about an item to conclusions about the item's target value.
Structure of a Decision Tree
Root Node:
Decision Nodes:
Leaf Nodes:
Attribute Selection:
o Use measures like information gain (used in ID3), gain ratio (used in C4.5), or Gini
index (used in CART) to determine the best attribute for splitting the data.
Recursive Partitioning:
Pruning:
o Remove branches that have little importance to reduce overfitting and improve
generalization.
o Methods include pre-pruning (stop growing the tree early) and post-pruning
(remove branches after the tree is fully grown).
Advantages
Disadvantages
Overfitting: Trees can become overly complex and fit the noise in the data.
Bias towards Attributes with More Levels: Tends to favor attributes with many distinct
values.
Instability: Small changes in the data can result in significantly different trees.
Applications
Classification: Used in various fields such as finance for credit scoring, healthcare for
diagnosing diseases, and marketing for customer segmentation.
Regression: Predicting continuous values like house prices, stock prices, and sales forecasting.
Rule Extraction: Extracting decision rules from the tree structure for decision-making
processes.
Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are
two distinct types of systems used for managing and analyzing data in organizations.
They serve different purposes and have unique characteristics.
Typically, OLTP systems generate the raw data that is extracted, transformed, and
loaded (ETL) into a data warehouse. The data warehouse then serves as the
foundation for OLAP analysis.
A Data Warehouse Manager is responsible for overseeing the entire data warehousing
environment. Their role involves strategic planning, design, implementation, and
maintenance of the data warehouse.
Key responsibilities:
Data Acquisition: Gathering data from diverse sources and ensuring data quality.
Data Integration: Combining data from multiple systems into a consistent format.
Data Storage: Designing and managing the physical storage of data within the data
warehouse.
Data Security: Protecting sensitive data from unauthorized access.
Performance Optimization: Ensuring efficient query processing and response times.
Monitoring and Maintenance: Continuously monitoring the data warehouse's health and
performance.
User Management: Providing access to the data warehouse based on user roles and
permissions.
Query Manager
Key responsibilities:
Vertical Partitioning
Vertical Partitioning involves dividing a table into smaller tables based on columns.
Each partition contains a subset of the columns from the original table, but all rows
for those columns.
Key Characteristics:
Column-Based Division:
Use Case:
Implementation:
1. Often used in systems where certain columns are frequently accessed together.
2. Helps in isolating sensitive data (e.g., separating personal information from other
data).
Advantages:
1. Reduces the amount of data scanned during queries, leading to faster retrieval
times.
2. Can improve cache efficiency since only relevant columns are loaded into memory.
3. Enhances security and privacy by isolating sensitive columns.
Disadvantages:
1. Requires joins to reconstruct the original table if queries need data from multiple
partitions.
2. Can increase complexity in query design and database management.
Example:
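As a hedged sketch of the idea, the code below uses Python's built-in sqlite3 module to split a
hypothetical employees table into two column-based partitions that share the same key; all
names are illustrative.

```python
# Vertical partitioning sketch: one hypothetical table split into two column groups.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- frequently accessed, non-sensitive columns
CREATE TABLE employees_core (
    emp_id INTEGER PRIMARY KEY,
    name   TEXT,
    dept   TEXT
);

-- sensitive / rarely accessed columns kept in a separate partition
CREATE TABLE employees_private (
    emp_id  INTEGER PRIMARY KEY REFERENCES employees_core(emp_id),
    salary  REAL,
    address TEXT
);
""")

# Reconstructing a full row requires a join across the partitions:
conn.execute("""
SELECT c.emp_id, c.name, p.salary
FROM employees_core c JOIN employees_private p ON c.emp_id = p.emp_id
""")
print("vertical partitions created")
```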
Horizontal Partitioning
Horizontal Partitioning involves dividing a table into smaller tables based on rows.
Each partition contains a subset of the rows from the original table, but all columns
for those rows.
Key Characteristics:
Row-Based Division:
Use Case:
Implementation:
Advantages:
1. Reduces the amount of data scanned during queries, leading to faster retrieval
times.
2. Can improve performance by allowing parallel processing of queries across
partitions.
3. Simplifies data management and backup procedures by handling smaller chunks of
data.
Disadvantages:
1. Requires union operations to reconstruct the original table if queries need data
from multiple partitions.
2. Can increase complexity in database management, especially in distributed
environments.
Example:
Measures or Facts: These are the quantitative values that are the primary focus of analysis.
Examples include sales amount, quantity sold, profit, cost, etc.
Foreign Keys: These link the fact table to dimension tables, providing context to the
measures. For instance, a sales fact table might have foreign keys to product, customer, and
time dimensions.
Granularity: This defines the level of detail in the fact table. It could be at a transaction level
(e.g., each sale), daily, monthly, or yearly level.
1. Identify the Business Process: Clearly define the business process that the fact table will
represent (e.g., sales, inventory, finance).
2. Determine the Grain: Establish the level of detail for each fact table row. This is crucial for
defining the fact table's structure and the types of analysis it can support.
3. Identify Measures: Determine the quantitative data points that will be stored in the fact
table (e.g., sales amount, quantity sold, profit margin).
4. Identify Dimensions: Identify the factors that provide context to the measures (e.g., product,
customer, time, location).
5. Create Foreign Keys: Include foreign keys in the fact table to link it to the corresponding
dimension tables.
6. Consider Performance: Optimize the fact table for query performance by using appropriate
data types, indexing, and partitioning strategies.
Keep fact tables simple: Focus on core measures and avoid storing derived or calculated
values.
Choose appropriate data types: Use data types that optimize storage and performance.
Consider indexing: Create indexes on frequently used columns to improve query
performance.
Partition fact tables: Divide large fact tables into smaller partitions for better manageability
and performance.
Data Quality: Ensure data accuracy and consistency in fact tables.
Additional Considerations
Factless Fact Tables: Used to store events or flags without associated measures.
Slowly Changing Dimensions: Handle changes in dimension attributes over time.
Surrogate Keys: Consider using surrogate keys for fact table primary keys to improve
performance and maintain data integrity.
Data Mart
A data mart is a subset of a data warehouse that is designed to serve the needs of a
specific business unit or department within an organization. It is essentially a smaller,
focused version of a data warehouse, containing a specific collection of data that is
tailored to the requirements of its users. Data marts are typically used to provide quick
access to relevant data for decision-making and analysis within a particular area of the
business.
Scope:
Purpose:
Design:
1. Often designed using a dimensional model (e.g., star schema or snowflake schema).
2. Optimized for queries and reporting specific to the business area it serves.
Data Sources:
1. Sources data from the enterprise data warehouse, operational systems, or external
sources.
2. Integrates and transforms data to meet the specific needs of the data mart users.
Usage:
1. Used for operational reporting, business analysis, and decision support within a
department.
2. Facilitates faster query response times compared to querying directly from a
centralized data warehouse.
Types:
Meta Mart
Metadata Management:
1. Stores and manages metadata related to data assets across the organization.
2. Includes information on data schemas, definitions, data lineage, data quality rules,
and usage metrics.
Integration:
1. Integrates metadata from various sources, including data warehouses, data marts,
operational systems, and external data sources.
2. Provides a unified view of metadata to support data governance and stewardship.
Accessibility:
1. Provides tools and interfaces for users to browse, search, and analyze metadata.
2. Supports metadata-driven initiatives such as data lineage analysis and impact
analysis.
Impact Analysis:
1. Helps users understand the impact of changes to data structures or definitions
across the organization.
2. Supports decision-making processes related to data integration, migration, and
transformation.
Types:
Backing up a data warehouse is critical for ensuring data integrity, availability, and
disaster recovery preparedness. It involves creating copies of the data stored in the
data warehouse and storing them securely to prevent data loss in case of hardware
failure, system errors, or other unforeseen events. The backup strategy should align
with the organization's data recovery objectives and operational requirements.
1. Full Backup
Description: Copies all data from the data warehouse to a backup storage
location.
Advantages:
Considerations:
o Requires significant storage space.
o Longer backup and restore times compared to incremental backups.
2. Incremental Backup
Description: Copies only the data that has changed since the last backup.
Advantages:
Considerations:
3. Differential Backup
Description: Copies data that has changed since the last full backup.
Advantages:
Considerations:
4. Snapshot Backup
Advantages:
Considerations:
5. Off-site Backup
Description: Stores backup copies of data warehouse outside the primary data
center or facility.
Advantages:
Considerations:
o Requires secure transfer and storage mechanisms to protect data during transit and
at rest.
o May incur additional costs for off-site storage facilities or cloud storage services.
Verify Backups: Regularly verify backup integrity and test restore procedures
to validate data recoverability.
Secure Backup Data: Encrypt backup data during storage and transmission to
protect against unauthorized access and data breaches.
b . Classification
c . Regression
d . Pruning
4. ID3 uses the _____________ attribute selection measure. a
a . Information Gain
b . Gain Ratio
c. Gini Index
d . Tree pruning.
5. Decision tree builds _____________________models in a
the form of a tree structure
a . classification
b . regression
c . both
d . none
6. ______________________ is a type of data mining b
technique that is used to build classification models in the
form of a tree-like structure.
a . Naive bayes Classifier
b . Decision Tree
c . K-Nearest Neighbor
b . Information
c . Light
d . Noise
8. An association rule can be extracted from a given ________ a
itemset by using a level-wise approach.
a . Frequent
b . Candidate
c . Similar
d . Infrequent
9. ___________ of a decision tree has only one incoming b
edge and no outgoing.
a. Pattern
b. Leaf Node
c. Data Transformation
d. KDD
10. _______errors are the expected errors generated by a b
model because of unknown records.
a . Training
b . Generalization
c . Test
d . Misclassification
Unit III
1. Which field of data mining helps in removing uncertainty, b
noise etc?
a) Data preprocessing
b) Data Mining
c) Outlier detection and removal
d) Uncertainty Reasoning
2. Pick the wrong data mining functionality among the given d
data mining functionalities.
a) Classification
b) Clustering
c) Class Description
d) Object Description
3. A data warehouse is which of the following? c
a) Can be updated by end users
b) Contains numerous naming conventions and formats.
c) Organized around important subject areas.
d) Contains only current data.
4. The data is stored, retrieved and updated in_______. b
a) OLAP
b) OLTP
c) SMTP
d) FTP
5. _________ describes the data contained in the data c
warehouse.
a) Relational data
b) Operational data
c) Metadata
d) Informational data
6. Record cannot be updated in________. d
a) OLTP
b) Files
c) RDBMS
d) Data warehouse
7. Data warehouse contains _________ data that is never c
found in the operational environment.
a) Normalized
b) Informational
c) Summary
d) Denormalized
8. In Data Warehousing, how many approaches are there for d
the integration of heterogeneous databases?
a. 5
b. 4
c. 3
d. 2
9. In Data Warehousing, which of these is the correct c
advantage of the Update-Driven Approach?