Final DWM
A dimension table is wide, the fact table deep. Explain.
1. *Fact Table*:
A fact table contains the *measures* or *quantitative data* related to a business process.
These measures represent the events or transactions that occur within a specific subject
area.
- The granularity of a fact table is at the *lowest level*—it captures detailed data for each
event.
- Fact tables store *numeric values*, such as sales revenue, quantities sold, or profit
margins.
- They are typically *deep* tables: many rows (one per event) but relatively few attributes.
- Fact tables are used for *analysis* and *decision-making*.
Examples of fact tables include sales transactions, inventory movements, or website clicks.
2. *Dimension Table*:
- A dimension table provides *contextual information* about the data in the fact table. It
contains descriptive attributes that help categorize and filter the measures.
- Dimension tables are *wide* because they include more attributes (columns) related to
the grain of the table.
- These attributes are typically in *text format* and provide additional details about the
events.
- Dimension tables are comparatively *shallow*, with far fewer records than fact tables.
- They help organize data into hierarchies, such as time (year, quarter, month), geography
(country, region), or product categories.
- Examples of dimension tables include date dimensions, customer dimensions, or product
dimensions.
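As a small illustration of this "wide vs. deep" contrast, here is a minimal sketch using pandas; the table and column names are invented for the example:

```python
# Minimal sketch of a star-schema fragment using pandas (illustrative names only).
import pandas as pd

# Dimension table: wide (many descriptive attributes), relatively few rows.
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Cola 1L", "Orange Juice 1L"],
    "category": ["Beverages", "Beverages"],
    "subcategory": ["Soft Drinks", "Juices"],
    "brand": ["FizzCo", "SunFarm"],
    "package_size": ["1L", "1L"],
})

# Fact table: deep (one narrow row per sales event), mostly keys and numeric measures.
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102, 20240102],
    "product_key": [1, 2, 1, 1],
    "quantity_sold": [3, 1, 5, 2],
    "sales_amount": [4.50, 2.10, 7.50, 3.00],
})

# Joining the deep fact table to the wide dimension adds context for analysis.
report = fact_sales.merge(dim_product, on="product_key")
print(report.groupby("subcategory")["sales_amount"].sum())
```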
Components of the Apriori algorithm
The Apriori algorithm is built on three components:
1. Support 2. Confidence 3. Lift
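For a rule A -> B: support is the fraction of transactions containing A ∪ B, confidence is support(A ∪ B) / support(A), and lift is confidence(A -> B) / support(B). The snippet below is a minimal sketch of these three formulas on a made-up basket list, not a full Apriori implementation:

```python
# Hedged sketch: computing support, confidence, and lift for a single rule A -> B
# on a small made-up transaction list (not a full Apriori implementation).

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"milk"}
sup_ab = support(A | B)                 # support(A -> B) = support(A ∪ B)
confidence = sup_ab / support(A)        # confidence(A -> B) = support(A ∪ B) / support(A)
lift = confidence / support(B)          # lift(A -> B) = confidence(A -> B) / support(B)

print(f"support={sup_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```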
Multilevel Association Rules: Multilevel association rules extend traditional association
rule mining to incorporate hierarchies or levels of abstraction in the data. This allows for
the discovery of relationships at different levels of granularity, enabling more nuanced
insights into the underlying associations.
Example: In a retail dataset, instead of only mining associations at the product level,
multilevel association rule mining might explore relationships between product
categories (e.g., beverages) and specific items (e.g., cola), as well as between
subcategories (e.g., soft drinks) and individual products.
Approaches:
Level-wise Mining: Extend the Apriori algorithm to handle hierarchical data structures
and mine association rules at different levels of the hierarchy.
Constraint-based Methods: Incorporate constraints that enforce relationships between
levels of abstraction, guiding the rule mining process.
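As a rough sketch of the level-wise idea, the snippet below counts support at the item level and again after rolling items up to their categories; the items and product hierarchy are invented for illustration:

```python
# Hedged sketch of the multilevel idea: count support at the item level and again
# after mapping each item to its higher-level category (the hierarchy below is made up).
from collections import Counter

hierarchy = {"cola": "soft drinks", "lemonade": "soft drinks", "espresso": "coffee"}

transactions = [
    {"cola", "espresso"},
    {"lemonade"},
    {"cola", "lemonade"},
    {"espresso"},
]

def supports(txns):
    counts = Counter()
    for t in txns:
        for item in t:
            counts[item] += 1
    return {item: c / len(txns) for item, c in counts.items()}

# Low-level (item) supports: "cola" and "lemonade" each appear in half the baskets.
print(supports(transactions))

# High-level (category) supports: "soft drinks" appears in 3 of 4 baskets,
# so a pattern that is weak at the item level may still be strong at the category level.
rolled_up = [{hierarchy[i] for i in t} for t in transactions]
print(supports(rolled_up))
```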
Multidimensional Association Rules: Multidimensional association rules consider
associations among multiple attributes or dimensions in the dataset, beyond just
itemsets. This allows for the discovery of complex patterns involving multiple variables,
facilitating a more comprehensive understanding of the data.
Example: In a healthcare dataset, multidimensional association rule mining might
uncover relationships between patient demographics (e.g., age, gender), medical
conditions (e.g., diabetes), and treatment outcomes (e.g., medication adherence).
Approaches:
MD-Mine Algorithm: Specifically designed for mining multidimensional association
rules, considering correlations among attributes across multiple dimensions.
Cuboid-based Approaches: Construct cuboids representing combinations of dimensions
and mine association rules within each cuboid to capture multidimensional relationships.
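One common way to approach multidimensional rules (independent of any specific algorithm named above) is to encode each (attribute, value) pair as a pseudo-item and then count co-occurrences as usual. The records and threshold in this sketch are invented for illustration:

```python
# Hedged sketch: turn each (attribute, value) pair into a pseudo-item and count
# how often pairs of them occur together across records (illustrative data only).
from collections import Counter
from itertools import combinations

records = [
    {"age_group": "60+", "condition": "diabetes", "adherence": "high"},
    {"age_group": "60+", "condition": "diabetes", "adherence": "high"},
    {"age_group": "30-59", "condition": "diabetes", "adherence": "low"},
    {"age_group": "60+", "condition": "hypertension", "adherence": "high"},
]

# Encode every record as a set of (attribute, value) items.
baskets = [frozenset(r.items()) for r in records]

pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

min_count = 2
for pair, count in pair_counts.items():
    if count >= min_count:
        print(pair, "support count:", count)
```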
Why is the entity-relationship modeling technique not suitable for a data warehouse?
How is dimensional modeling different?
In computing, a data warehouse, also known as an enterprise data warehouse, is a system
used for reporting and data analysis and is considered a core component of business
intelligence. A data warehouse is a central repository of information that can be analyzed
to make more useful decisions. Data flows into a data warehouse from transactional
systems, relational databases, and other sources such as application log files and transaction
applications. ER modelling is optimized for transaction processing, and ER models are
difficult to query because of their complexity. Hence, ER models are not suitable for
high-performance retrieval of data. The conceptual Entity-Relationship (ER) model is
extensively used for database design in relational database environments that support
day-to-day operations. Multidimensional (MD) data modeling is crucial in data warehouse
design because it is targeted at managerial decision support. Multidimensional data
modeling supports decision making by allowing users to drill down for more detailed
information, roll up to view summarized information, slice and dice a dimension to select
a specific point of interest, and pivot to re-orientate the view of MD data.
Dimensional modeling is a data modeling approach that is more flexible from the
perspective of the user. Dimensional and relational models each store data in their own way,
with specific advantages. Dimensional models are built around business processes; they
ensure that dimension tables use a surrogate key, and dimension tables store the history of
the dimensional information.
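The OLAP operations mentioned above (roll-up, drill-down, slice, pivot) can be sketched with pandas on a tiny invented sales table; this is only an illustration of the operations, not a warehouse implementation:

```python
# Hedged sketch of roll-up, drill-down, slice, and pivot on a made-up sales cube.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "amount":  [100, 150, 120, 130, 170, 160],
})

# Roll-up: summarize from (year, quarter) detail to year level.
print(sales.groupby("year")["amount"].sum())

# Drill-down: break the yearly totals back out by quarter.
print(sales.groupby(["year", "quarter"])["amount"].sum())

# Slice: fix one dimension (region == "East") and look at the rest.
print(sales[sales["region"] == "East"])

# Pivot: re-orientate the view, regions as columns, quarters as rows.
print(sales.pivot_table(index="quarter", columns="region", values="amount", aggfunc="sum"))
```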
Q1 a) Every data structure in the data warehouse contains the time element. Why?
Every data structure in the data warehouse contains a time element because, by the nature
of its mission, the warehouse must hold historical data rather than only current figures.
Keeping the time element with every structure lets the warehouse store snapshots of the
data as it changes, so trends can be analyzed over time rather than just the current state.
A typical data warehouse has four main components: the central database, ETL tools,
metadata, and access tools. All of these components are designed to perform quickly,
allowing you to collect information and analyze data on the go.
Que8. Explain ETL process in detail.
1. Extraction:
- This is the first step in the ETL process. It involves extracting data from multiple sources,
which can include databases, files, applications, web services, etc.
- The extraction process can be either full extraction, where all the data is extracted every
time, or incremental extraction, where only the newly added or updated data since the last
extraction is retrieved.
- Extracted data may often be in different formats and structures, so it needs to be
consolidated and standardized for further processing.
2. Transformation:
- Once the data is extracted, it undergoes transformation. This step involves cleaning,
filtering, and structuring the data to make it suitable for analysis and storage in the data
warehouse.
- Transformation tasks may include:
- Data cleansing: Removing or correcting errors, inconsistencies, or duplicates in the data.
- Data validation: Ensuring that the data meets certain quality standards and business rules.
- Data aggregation: Combining and summarizing data from multiple sources.
- Data enrichment: Adding additional information or attributes to the data.
- Data normalization: Standardizing data formats, units, and values.
- Transformation is a critical step as it ensures that the data is accurate, consistent, and
relevant for analysis.
3. Loading:
- Once the data is extracted and transformed, it is loaded into the data warehouse for
storage and analysis.
- Loading involves inserting the transformed data into the appropriate tables or structures
within the data warehouse.
- There are different loading strategies, including:
- Full load: Loading all the data from scratch each time.
- Incremental load: Loading only the new or changed data since the last load.
- Parallel load: Loading data into multiple tables simultaneously to improve performance.
- After loading, the data is available for querying, reporting, and analysis by end-users or
analytical tools.
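A minimal end-to-end ETL sketch follows; it assumes a hypothetical orders.csv source file and a local SQLite database as the target, so the file, table, and column names are illustrative only:

```python
# Minimal ETL sketch (assumes a hypothetical "orders.csv" source; the column names
# and the SQLite target are illustrative, not a prescribed design).
import sqlite3
import pandas as pd

# 1. Extract: pull raw data from a source file.
raw = pd.read_csv("orders.csv")            # e.g. columns: order_id, amount, country

# 2. Transform: cleanse, validate, and standardize.
clean = raw.drop_duplicates(subset="order_id")                # data cleansing
clean = clean[clean["amount"] > 0]                            # data validation rule
clean["country"] = clean["country"].str.strip().str.upper()   # data normalization

# 3. Load: insert the transformed rows into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_orders", conn, if_exists="append", index=False)
```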
Q6 What are the three major areas in the data warehouse? Relate and explain the
architectural components to the three major areas
The three major areas in a data warehouse are the data staging area, the data storage area,
and the data presentation area. This division is logical as it allows for a clear separation of
tasks and functionalities within the data warehouse architecture. The architectural
components support and enable the functionalities of each major area.
Explanation:
The three major areas in a data warehouse are the data staging area, the data storage area,
and the data presentation area.
The data staging area is where data from various sources is extracted, transformed, and
cleansed before being loaded into the data warehouse. This area acts as a temporary
storage space for data before it is processed and integrated into the data storage area.
The data storage area is the core of the data warehouse architecture. It stores the
integrated and processed data in a structured format for efficient retrieval and analysis.
This area typically consists of a data mart or a data warehouse server.
The data presentation area is where the processed data is made available for users to
access and analyze. This area includes tools and technologies for data visualization, report
generation, and interactive querying.
This division is logical because it allows for a clear separation of tasks and functionalities
within the data warehouse architecture. Each area focuses on a specific aspect of the data
warehousing process, enabling efficient data management and analysis. The architectural
components, such as ETL (Extract, Transform, Load) tools for the data staging area, the
data storage server for the data storage area, and reporting tools for the data presentation
area, support and enable the functionalities of each major area.
Define initial load, incremental load and full refresh.
1. *Initial Load*:
- The initial load, also known as the *full load*, involves populating all the data warehouse
tables for the very first time.
- During the initial load, all the records from the source system are loaded into the target
database or data warehouse. It erases any existing data in the tables and replaces it with
fresh data.
2. *Incremental Load*:
- The incremental load, also referred to as the *delta load*, occurs periodically after the
initial load, applying ongoing changes as required.
- Instead of loading the entire dataset, only the new or updated data since the last
extraction is loaded into the target system. This approach is more efficient in terms of
resource utilization and speed, as it focuses only on the changes.
- After the data is loaded into the data warehouse database, verify the referential integrity
between the dimensions and the fact tables to ensure that all records belong to the
appropriate records in the other tables. The DBA must verify that each record in the fact
table is related to one record in each dimension table that will be used in combination with
that fact table.
3. *Full Refresh*:
- A full refresh involves deleting the contents of one or more tables in the data warehouse
and reloading them with fresh data.
- Unlike incremental load, which only handles changes, a full refresh replaces all existing
data. Organizations typically use full refresh when they need to ensure complete
consistency or periodically reset the data.
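The sketch below contrasts the three load types against a small SQLite source/target pair; the table and column names are assumed purely for illustration:

```python
# Hedged sketch of initial load, incremental load, and full refresh in SQLite
# (table and column names are assumed for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE dw_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2024-01-01"), (2, 25.0, "2024-01-02")])

# Initial load: populate the warehouse table for the very first time.
conn.execute("INSERT INTO dw_orders SELECT * FROM source_orders")

# Incremental (delta) load: only rows changed since the last extraction.
last_load = "2024-01-01"
conn.execute("INSERT OR REPLACE INTO dw_orders "
             "SELECT * FROM source_orders WHERE updated_at > ?", (last_load,))

# Full refresh: wipe the table and reload it with fresh data.
conn.execute("DELETE FROM dw_orders")
conn.execute("INSERT INTO dw_orders SELECT * FROM source_orders")
conn.commit()
```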
Data integration in data mining refers to the process of combining data from multiple
sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the
data. These sources may include multiple data cubes, databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mapping between queries over the source schemas and the global schema.
What is data integration?
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze
data that is spread across multiple systems or platforms, in order to gain a more complete
and accurate understanding of the data.
Data integration can be challenging due to the variety of data formats, structures, and
semantics used by different data sources. Different data sources may use different data
types, naming conventions, and schemas, making it difficult to combine the data into a
single view. Data integration typically involves a combination of manual and automated
processes, including data profiling, data mapping, data transformation, and data
reconciliation.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.
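A minimal sketch of the <G, S, M> idea with pandas: two heterogeneous sources (S) are mapped (M) onto a single global schema (G). The source schemas and data are invented for the example:

```python
# Minimal sketch: map two heterogeneous sources onto one global schema and
# reconcile a simple inconsistency (country-code casing). Illustrative data only.
import pandas as pd

# S: two sources with different schemas for the same concept.
crm = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Ann Lee", "Raj Patel"], "cntry": ["us", "in"]})
erp = pd.DataFrame({"customer_no": [3], "name": ["Mia Chen"], "country_code": ["SG"]})

# M: per-source mappings from local columns to the global schema G(customer_id, name, country).
mapped_crm = crm.rename(columns={"cust_id": "customer_id", "full_name": "name", "cntry": "country"})
mapped_erp = erp.rename(columns={"customer_no": "customer_id", "country_code": "country"})

# G: the unified view, with a simple reconciliation step.
unified = pd.concat([mapped_crm, mapped_erp], ignore_index=True)
unified["country"] = unified["country"].str.upper()
print(unified)
```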
Data Cleaning: Every data science enthusiast quickly learns one truth: real-world data is
messy. This step involves removing any inconsistencies, errors, or outliers that might skew
the results.
Data Integration: Data often comes from multiple sources, each with its own format and
structure. This step merges this data into a unified set, ensuring consistency and reducing
redundancy.
Data Selection: Not all data is relevant for every analysis. Here, you'll select the subset of
data that pertains to your specific objective.
Data Transformation: Data might need to be summarized, aggregated, or otherwise
transformed to make it suitable for mining.
Data Mining: This is where the magic happens! Using various algorithms and techniques,
patterns, trends, and relationships are extracted from the data.
Pattern Evaluation: Not all patterns are useful or interesting. This step helps filter out the
noise, ensuring only valuable insights are considered.
Knowledge Presentation: After all the hard work, it's time to share the findings. This often
involves visualizations, reports, or other means to make the knowledge accessible and
understandable.
These tasks are critical for ensuring that the data in the warehouse is accurate, consistent,
and ready for analysis. Proper transformation ensures that data from diverse sources can
be integrated and used effectively in decision-making processes.
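A compact sketch of the selection, transformation, and evaluation steps on a made-up table follows; the "mining" step is reduced to a simple correlation purely for brevity:

```python
# Compact sketch of selection / transformation / evaluation on invented data;
# the "mining" step here is just a correlation, not a real mining algorithm.
import pandas as pd

data = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "visits":   [2, 15, 7, 30],
    "spend":    [20.0, 160.0, 80.0, 310.0],
    "signup":   ["2021", "2022", "2022", "2023"],
})

# Data selection: keep only the attributes relevant to the question at hand.
subset = data[["visits", "spend"]]

# Data transformation: scale both columns to a comparable 0-1 range.
scaled = (subset - subset.min()) / (subset.max() - subset.min())

# "Mining": extract a simple pattern, here the visits/spend correlation.
pattern = scaled["visits"].corr(scaled["spend"])

# Pattern evaluation: keep the finding only if it looks strong enough to report.
if abs(pattern) >= 0.7:
    print(f"Interesting pattern: visits and spend correlate at {pattern:.2f}")
```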
Describe slowly changing dimensions. What are the three types? Explain each type
very briefly.
Slowly Changing Dimensions
Slowly changing dimensions refer to how data in your data warehouse changes over time.
Slowly changing dimensions have the same natural key but other data columns that may or
may not change over time, depending on the type of dimension it is.
Slowly changing dimensions are important in data analytics to track how a record is
changing over time. The way the database is designed directly reflects whether historical
attributes can be tracked or not, determining different metrics available for the business to
use.
For example, if data is constantly being overwritten for one natural key, the business will
never be able to see how changes in that row’s attributes affect key performance indicators.
If a company continually iterates on a product and its different features, but doesn’t track
how those features have changed, it will have no idea how customer retention, revenue,
customer acquisition cost, or other marketing analytics were directly impacted by those
changes.
Types of slowly changing dimensions
Type 0
Type 0 refers to dimensions that never change. You can think of these as mapping tables in
your data warehouse that will always remain the same, such as states, zipcodes, and county
codes. Date_dim tables that you may use to simplify joins are also considered type 0
dimensions. In addition to mapping tables, other pieces of data like social security number
and date of birth are considered type 0 dimensions.
Type 1
Type 1 refers to data that is overwritten by new data without keeping a historical record of
that old piece of data. With this type, there is no way to keep track of changes over time. I’ve
seen many companies use this type of dimension accidentally, not realizing that they can
never get the old values back. When implementing this dimension, make sure you do not
need to track the trends in that data column over time.
A good example of this is customer addresses. You don't need to keep track of how a
customer's address has changed over time; you just need to know you are sending an order
to the right place.
Type 2
Type 2 dimensions are always created as a new record. If a detail in the data changes, a new
row will be added to the table with a new primary key. However, the natural key would
remain the same in order to map a record change to one another. Type 2 dimensions are the
most common approach to tracking historical records.
There are a few different ways you can handle type 2 dimensions from an analytics
perspective. The first is by adding a flag column to show which record is currently active.
This is the approach Fivetran takes with data
tables that have CDC implemented. Instead of deleting any historic records, they will add a
new one with the _FIVETRAN_DELETED column set to FALSE. The old record will then be
set to TRUE for this _FIVETRAN_DELETED column. Now, when querying this data, you can
use this column to filter for records that are active while still being able to get historical
records if needed.
You can also handle type 2 dimensions by adding a timestamp column or two to show when
a new record was created or made active and when it was made ineffective. Instead of
checking for whether a record is active or not, you can find the most recent timestamp and
assume that is the active data row. You can then piece together the timestamps to get a full
picture of how a row has changed over time.
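A minimal sketch of a Type 2 change using both an active flag and effective dates is shown below; the table structure and names are illustrative rather than a prescribed design:

```python
# Hedged sketch of a Type 2 change: the old row is kept, flagged inactive and closed
# with an end date, and a new row gets a new surrogate key (names are illustrative).
from datetime import date

customer_dim = [
    {"surrogate_key": 101, "customer_id": "C-7", "address": "12 Oak St",
     "is_active": True, "effective_from": date(2022, 1, 1), "effective_to": None},
]

def apply_type2_change(dim, customer_id, new_address, change_date, next_key):
    for row in dim:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["is_active"] = False              # close out the historical row
            row["effective_to"] = change_date
    dim.append({                                  # insert the new current row
        "surrogate_key": next_key, "customer_id": customer_id, "address": new_address,
        "is_active": True, "effective_from": change_date, "effective_to": None,
    })

apply_type2_change(customer_dim, "C-7", "99 Pine Ave", date(2024, 6, 1), next_key=102)
for row in customer_dim:
    print(row)
```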
Metadata is data that describes and contextualizes other data. It provides information
about the content, format, structure, and other characteristics of data, and can be used to
improve the organization, discoverability, and accessibility of data.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized
using metadata standards and schemas. There are many metadata standards that have been
developed to facilitate the creation and management of metadata, such as Dublin Core,
schema.org, and the Metadata Encoding and Transmission Standard (METS). Metadata
schemas define the structure and format of metadata and provide a consistent framework
for organizing and describing data.
Metadata can be used in a variety of contexts, such as libraries, museums, archives, and
online platforms. It can be used to improve the discoverability and ranking of content in
search engines and to provide context and additional information about search results.
Metadata can also support data governance by providing information about the ownership,
use, and access controls of data, and can facilitate interoperability by providing information
about the content, format, and structure of data, and by enabling the exchange of data
between different systems and applications. Metadata can also support data preservation
by providing information about the context, provenance, and preservation needs of data,
and can support data visualization by providing information about the data’s structure and
content, and by enabling the creation of interactive and customizable visualizations.
Several Examples of Metadata:
Metadata is data that provides information about other data. Here are a few examples of
metadata:
File metadata: This includes information about a file, such as its name, size, type, and
creation date.
Image metadata: This includes information about an image, such as its resolution, color
depth, and camera settings.
Music metadata: This includes information about a piece of music, such as its title, artist,
album, and genre.
Video metadata: This includes information about a video, such as its length, resolution, and
frame rate.
Document metadata: This includes information about a document, such as its author, title,
and creation date.
Database metadata: This includes information about a database, such as its structure,
tables, and fields.
Types of Metadata:
There are many types of metadata that can be used to describe different aspects of data,
such as its content, format, structure, and provenance. Some common types of metadata
include:
Descriptive metadata: This type of metadata provides information about the content,
structure, and format of data, and may include elements such as title, author, subject, and
keywords. Descriptive metadata helps to identify and describe the content of data and can
be used to improve the discoverability of data through search engines and other tools.
Administrative metadata: This type of metadata provides information about the
management and technical characteristics of data, and may include elements such as file
format, size, and creation date. Administrative metadata helps to manage and maintain data
over time and can be used to support data governance and preservation.
Structural metadata: This type of metadata provides information about the relationships
and organization of data, and may include elements such as links, tables of contents, and
indices. Structural metadata helps to organize and connect data and can be used to facilitate
the navigation and discovery of data.
Provenance metadata: This type of metadata
provides information about the history and origin of data, and may include elements such as
the creator, date of creation, and sources of data. Provenance metadata helps to provide
context and credibility to data and can be used to support data governance and
preservation.
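As a small illustration, the sketch below reads administrative (file-system) metadata with os.stat and pairs it with hand-written descriptive metadata; the file name and field values are assumed for the example:

```python
# Small sketch: administrative metadata read from the file system plus hand-written
# descriptive metadata for the same file (file name and values are illustrative).
import os
from datetime import datetime

path = "sales_report.csv"          # assumed to exist for the example
stat = os.stat(path)

administrative = {
    "file_name": os.path.basename(path),
    "size_bytes": stat.st_size,
    "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
}

descriptive = {
    "title": "Monthly sales report",
    "author": "Analytics team",
    "subject": "retail sales",
    "keywords": ["sales", "monthly", "report"],
}

print(administrative)
print(descriptive)
```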
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and
1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data mining process as it helps to ensure
that the data is in a format that is suitable for analysis and modeling, and that it is free of
errors and inconsistencies. Data transformation can also help to improve the performance
of data mining algorithms by reducing the dimensionality of the data and by scaling the
data to a common range of values.
The data are transformed in ways that are ideal for mining the data. The data
transformation involves steps that are:
1. Smoothing: It is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or any other noise form. The concept behind data smoothing is that it will be
able to identify simple changes to help predict different trends and patterns. This serves as
a help to analysts or traders who need to look at a lot of data which can often be difficult to
digest for finding patterns that they wouldn’t see otherwise.
2. Aggregation: Data collection or aggregation is the method of storing and presenting data
in a summary format. The data may be obtained from multiple data sources to integrate
these data sources into a data analysis description. This is a crucial step since the accuracy
of data analysis insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
3. Discretization: It is a process of transforming continuous data into a set of small intervals.
Many real-world data mining activities involve continuous attributes, yet many of the
existing data mining frameworks are unable to handle these attributes. Also, even if a data
mining task can manage a continuous attribute, it can significantly improve its efficiency by
replacing the continuous values with discrete ones. For example, age can be binned into
intervals (1-10, 11-20, ...) or mapped to categories (young, middle-aged, senior).
4. Attribute Construction: Where new attributes are created & applied to assist the mining
process from the given set of attributes. This simplifies the original data & makes the mining
more efficient.
5. Generalization: It converts low-level data attributes to high-level data attributes using a
concept hierarchy. For example, age initially in numerical form (22, 25) is converted into a
categorical value (young, old). Similarly, categorical attributes such as house addresses may
be generalized to higher-level definitions such as town or country.
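A short sketch of several of these transformations (smoothing, normalization, discretization, and generalization) on an invented list of ages:

```python
# Small sketch of smoothing, normalization, discretization, and generalization;
# the ages, bin edges, and labels are made up for illustration.
import pandas as pd

ages = pd.Series([22, 25, 31, 47, 52, 68])

# Normalization: min-max scale the values into the range [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Smoothing: a simple rolling mean reduces noise in the series.
smoothed = ages.rolling(window=3, min_periods=1).mean()

# Discretization: convert the continuous ages into interval bins.
bins = pd.cut(ages, bins=[0, 30, 60, 100])

# Generalization: map the intervals up a concept hierarchy to categorical labels.
labels = pd.cut(ages, bins=[0, 30, 60, 100], labels=["young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized.round(2),
                    "smoothed": smoothed.round(1), "bin": bins, "label": labels}))
```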
Que6. Explain Updates to the dimension table [SCD].
Dimension tables are more stable and less volatile than fact tables: a fact table changes as
the number of rows increases, whereas a dimension table changes as the attributes
themselves change.
Type 1 change: correction of errors
- Overwrite the value of the attribute with the new value in the dimension table row.
- The old value of the attribute is discarded, that is, not preserved.
- No other changes are made in the dimension table row, and the key of the dimension
table row is not affected.
Type 2 change: preservation of history
- A new dimension table row is added with the new value of the changed attribute.
- A new column, such as an effective date, is added to the dimension table.
- The original row is not changed and its key stays the same; the new row is inserted with a
new surrogate key in the dimension table.
Type 3 change: tentative soft revision
- An "old" field is added in the dimension table for the affected attribute.
- The new value of the attribute is kept in the current field, and a current effective date
field is also added for the changed attribute.
- No new dimension row is added to the dimension table.
Updates to the dimension table
- Dimension tables tend to be more stable and less volatile compared to fact tables.
- The fact table changes due to the increase in the number of rows.
- A dimension table changes not only because of an increase in the number of rows but
also because of changes to the attributes themselves.
SCD (slowly changing dimension) is the customary term used for managing the issues
associated with the impact of changes to the attributes of a dimension table.
Design approaches to SCDs are categorized into three types:
- Type 1: overwrite the dimension record.
- Type 2: add a new dimension row.
- Type 3: create a new field in the dimension record.
Worked exercise: find the frequent itemsets and association rules for the transactions
below, with minimum support count = 2 and minimum confidence = 60%.
TID 10: {1, 3, 4}
TID 20: {2, 3, 5}
TID 30: {1, 2, 3, 5}
TID 40: {2, 5}
TID 50: {1, 3, 5}
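Since the dataset is tiny, the brute-force sketch below simply enumerates every candidate itemset, keeps those meeting the minimum support count, and prints the rules that meet the confidence threshold; it is exhaustive counting rather than an optimized Apriori implementation:

```python
# Brute-force sketch for the exercise above: enumerate every candidate itemset over the
# five transactions, keep those with support count >= 2, then list rules with
# confidence >= 60%. (Exhaustive counting, not an optimized Apriori implementation.)
from itertools import combinations

transactions = {
    10: {1, 3, 4},
    20: {2, 3, 5},
    30: {1, 2, 3, 5},
    40: {2, 5},
    50: {1, 3, 5},
}
items = sorted(set().union(*transactions.values()))
min_count, min_conf = 2, 0.6

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions.values())

# Frequent itemsets of every size that meet the minimum support count.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        c = count(set(cand))
        if c >= min_count:
            frequent[frozenset(cand)] = c

# Association rules A -> B from each frequent itemset with at least two items.
for itemset, c in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            conf = c / count(set(antecedent))
            if conf >= min_conf:
                consequent = itemset - set(antecedent)
                print(f"{set(antecedent)} -> {set(consequent)} "
                      f"(count={c}, confidence={conf:.0%})")
```

Running the sketch shows that the largest frequent itemsets under these thresholds are {1, 3, 5} and {2, 3, 5}, each with a support count of 2.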