Final DWM

A dimension table is wide, the fact table deep.

Explain
1. *Fact Table*:
A fact table contains the *measures* or *quantitative data* related to a business process.
These measures represent the events or transactions that occur within a specific subject
area.
- The granularity of a fact table is at the *lowest level*—it captures detailed data for each
event.
- Fact tables store *numeric values*, such as sales revenue, quantities sold, or profit
margins.
- They are typically *deep* tables: relatively few attributes (columns), but a very large number of rows.
- Fact tables are used for *analysis* and *decision-making*.
Examples of fact tables include sales transactions, inventory movements, or website clicks.
2. *Dimension Table*:
- A dimension table provides *contextual information* about the data in the fact table. It
contains descriptive attributes that help categorize and filter the measures.
- Dimension tables are *wide* because they include more attributes (columns) related to
the grain of the table.
- These attributes are typically in *text format* and provide additional details about the
events.
- Dimension tables are *wide* tables, with far fewer records compared to fact tables.
- They help organize data into hierarchies, such as time (year, quarter, month), geography
(country, region), or product categories.
- Examples of dimension tables include date dimensions, customer dimensions, or product
dimensions.
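The contrast can be made concrete with a minimal sketch in Python (pandas assumed; the table names, columns, and values are hypothetical): a narrow but deep sales fact table whose foreign key joins to a small, wide product dimension.

```python
import pandas as pd

# Dimension table: wide (many descriptive columns), few rows (hypothetical data)
dim_product = pd.DataFrame({
    "product_key": [1, 2, 3],
    "product_name": ["Cola", "Chips", "Juice"],
    "category": ["Beverages", "Snacks", "Beverages"],
    "brand": ["FizzCo", "CrunchCo", "FreshCo"],
})

# Fact table: deep (many rows), few columns, numeric measures plus foreign keys
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1, 3, 2],
    "date_key":    [20240101, 20240101, 20240102, 20240102, 20240103],
    "quantity_sold": [10, 5, 7, 3, 8],
    "sales_revenue": [25.0, 10.0, 17.5, 9.0, 16.0],
})

# Join the fact table to its dimension and roll revenue up by category
report = (fact_sales.merge(dim_product, on="product_key")
                    .groupby("category")["sales_revenue"].sum())
print(report)
```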
Components of Apriori algorithm
The Apriori algorithm relies on three measures:
1. Support – how frequently the itemset appears in the transactions.
2. Confidence – how often the rule holds when the antecedent occurs.
3. Lift – the ratio of the observed confidence to the confidence expected if the antecedent and consequent were independent.
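A small worked example of the three measures, computed by hand in Python over a made-up set of transactions for the rule {bread} -> {butter}:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_bread = sum("bread" in t for t in transactions)             # 4
count_butter = sum("butter" in t for t in transactions)           # 4
count_both = sum({"bread", "butter"} <= t for t in transactions)  # 3

support = count_both / n               # P(bread and butter)     = 0.60
confidence = count_both / count_bread  # P(butter | bread)       = 0.75
lift = confidence / (count_butter / n) # 0.75 / 0.80              = 0.9375

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.4f}")
```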
Multilevel Association Rules: Multilevel association rules extend traditional association
rule mining to incorporate hierarchies or levels of abstraction in the data. This allows for
the discovery of relationships at different levels of granularity, enabling more nuanced
insights into the underlying associations.
Example: In a retail dataset, instead of only mining associations at the product level,
multilevel association rule mining might explore relationships between product
categories (e.g., beverages) and specific items (e.g., cola), as well as between
subcategories (e.g., soft drinks) and individual products.
Approaches:
 Level-wise Mining: Extend the Apriori algorithm to handle hierarchical data structures
and mine association rules at different levels of the hierarchy.
 Constraint-based Methods: Incorporate constraints that enforce relationships between
levels of abstraction, guiding the rule mining process.
Multidimensional Association Rules: Multidimensional association rules consider
associations among multiple attributes or dimensions in the dataset, beyond just
itemsets. This allows for the discovery of complex patterns involving multiple variables,
facilitating a more comprehensive understanding of the data.
Example: In a healthcare dataset, multidimensional association rule mining might
uncover relationships between patient demographics (e.g., age, gender), medical
conditions (e.g., diabetes), and treatment outcomes (e.g., medication adherence).
Approaches:
 MD-Mine Algorithm: Specifically designed for mining multidimensional association rules, considering correlations among attributes across multiple dimensions.
 Cuboid-based Approaches: Construct cuboids representing combinations of dimensions
and mine association rules within each cuboid to capture multidimensional relationships.
Why is the entity-relationship modeling technique not suitable for a data warehouse? How is dimensional modeling different?
In computing, a data warehouse, also known as an enterprise data warehouse, is a system
used for reporting and data analysis and is considered as a core component of business
intelligence. A data warehouse is a central repository of information that can be analyzed
to make more useful decisions. Data flows into a data warehouse from transactional
systems, relational databases, and other sources like application log files and transaction
applications. ER modelling is optimized for transaction processing, and ER models are difficult to query because of their complexity; hence, ER models are not suitable for high-performance retrieval of data. The conceptual Entity-Relationship (ER) model is extensively used for database design in relational database environments that support day-to-day operations. Multidimensional (MD) data modeling, on the other hand, is central to data warehouse design because it targets managerial decision support. Multidimensional data modeling supports decision making by allowing users to drill down for more detailed information, roll up to view summarized information, slice and dice a dimension to select a specific point of interest, and pivot to re-orient the view of the MD data.
Dimensional modeling is a data modeling approach that is more flexible from the user's perspective. Dimensional and relational models each have their own way of storing data, with specific advantages. Dimensional models are built around business processes, use surrogate keys in their dimension tables, and store the history of the dimensional information.

Star Schema vs. Snowflake Schema
- A star schema contains fact tables and dimension tables; a snowflake schema contains fact tables, dimension tables, and sub-dimension tables.
- Star schema is a top-down model; snowflake schema is a bottom-up model.
- Star schema uses more space; snowflake schema uses less space.
- Star schema queries take less time to execute; snowflake schema queries take more time than star schema queries.
- In a star schema, normalization is not used; in a snowflake schema, both normalization and denormalization are used.
- The star schema design is very simple; the snowflake schema design is complex.
- The query complexity of a star schema is low; the query complexity of a snowflake schema is higher than that of a star schema.
- A star schema is very simple to understand; a snowflake schema is more difficult to understand.
- A star schema has fewer foreign keys; a snowflake schema has more foreign keys.
- A star schema has high data redundancy; a snowflake schema has low data redundancy.

Q1 a) Every data structure in the data warehouse contains the time element. Why?
Every data structure in a data warehouse has a temporal component. Because the warehouse's mission is to support analysis over time, it must hold historical data rather than only current figures; the time element (implicit or explicit) in every key structure is what makes trend analysis and comparison across periods possible.
A typical data warehouse has four main components: the central database, ETL tools, metadata, and access tools. All of these components are designed to perform quickly, so information can be collected and analyzed on the go.
Que8. Explain ETL process in detail.
1. Extraction:
- This is the first step in the ETL process. It involves extracting data from multiple sources, which can include databases, files, applications, web services, etc.
- The extraction can be either full extraction, where all the data is extracted every time, or incremental extraction, where only the data added or updated since the last extraction is retrieved.
- Extracted data may be in different formats and structures, so it needs to be consolidated and standardized for further processing.
2. Transformation:
- Once the data is extracted, it undergoes transformation. This step involves cleaning, filtering, and structuring the data to make it suitable for analysis and storage in the data warehouse.
- Transformation tasks may include:
  - Data cleansing: removing or correcting errors, inconsistencies, or duplicates in the data.
  - Data validation: ensuring that the data meets quality standards and business rules.
  - Data aggregation: combining and summarizing data from multiple sources.
  - Data enrichment: adding additional information or attributes to the data.
  - Data normalization: standardizing data formats, units, and values.
- Transformation is a critical step, as it ensures that the data is accurate, consistent, and relevant for analysis.
3. Loading:
- Once the data is extracted and transformed, it is loaded into the data warehouse for storage and analysis. Loading involves inserting the transformed data into the appropriate tables or structures within the data warehouse.
- There are different loading strategies, including:
  - Full load: loading all the data from scratch each time.
  - Incremental load: loading only the new or changed data since the last load.
  - Parallel load: loading data into multiple tables simultaneously to improve performance.
- After loading, the data is available for querying, reporting, and analysis by end users or analytical tools.
A minimal code sketch of these three steps is shown below.
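The sketch assumes a source file orders.csv with order_id, customer, and amount columns, and a SQLite database as the warehouse target; all of those names are illustrative, not prescribed by the ETL process itself.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: drop incomplete records, standardize formats and types."""
    cleaned = []
    for r in rows:
        if not r.get("amount"):                          # data cleansing: skip incomplete records
            continue
        cleaned.append({
            "order_id": int(r["order_id"]),              # type conversion
            "customer": r["customer"].strip().title(),   # normalization of text values
            "amount": round(float(r["amount"]), 2),
        })
    return cleaned

def load(rows, conn):
    """Load: insert transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (:order_id, :customer, :amount)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("orders.csv")), conn)
```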

Q6 What are the three major areas in the data warehouse? Relate and explain the
architectural components to the three major areas
The three major areas in a data warehouse are the data staging area, the data storage area,
and the data presentation area. This division is logical as it allows for a clear separation of
tasks and functionalities within the data warehouse architecture. The architectural
components support and enable the functionalities of each major area.
Explanation:
The three major areas in a data warehouse are the data staging area, the data storage area,
and the data presentation area.
The data staging area is where data from various sources is extracted, transformed, and
cleansed before being loaded into the data warehouse. This area acts as a temporary
storage space for data before it is processed and integrated into the data storage area.
The data storage area is the core of the data warehouse architecture. It stores the
integrated and processed data in a structured format for efficient retrieval and analysis.
This area typically consists of a data mart or a data warehouse server.
The data presentation area is where the processed data is made available for users to
access and analyze. This area includes tools and technologies for data visualization, report
generation, and interactive querying.
This division is logical because it allows for a clear separation of tasks and functionalities
within the data warehouse architecture. Each area focuses on a specific aspect of the data
warehousing process, enabling efficient data management and analysis. The architectural
components, such as ETL (Extract, Transform, Load) tools for the data staging area, the
data storage server for the data storage area, and reporting tools for the data presentation
area, support and enable the functionalities of each major area.
Define initial load, incremental load and full refresh.
1. *Initial Load*: populating all the data warehouse tables for the very first time. The initial load, also known as the *full load*, loads all the records from the source systems into the target data warehouse, erasing any existing data in the tables and replacing it with fresh data.
2. *Incremental Load*: periodically applying the ongoing changes as per the requirement. The incremental load, also referred to as the *delta load*, occurs after the initial load; instead of loading the entire dataset, only the data added or updated since the last extraction is loaded into the target system. This is more efficient in resource utilization and speed, since it focuses only on the changes. After the data is loaded into the data warehouse database, the referential integrity between the dimensions and the fact tables should be verified to ensure that all records belong to the appropriate records in the other tables; the DBA must verify that each record in the fact table is related to one record in each dimension table that will be used in combination with that fact table.
3. *Full Refresh*: deleting the contents of one or more tables in the data warehouse and reloading them with fresh data. Unlike an incremental load, which only handles changes, a full refresh replaces all existing data. Organizations typically use a full refresh when they need to ensure complete consistency or to periodically reset the data.
A small code sketch contrasting these strategies is shown below.
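In the sketch below (SQLite in memory; table and column names are assumptions), the incremental load filters the staging source on a high-water-mark timestamp, while the full refresh truncates and reloads the target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_orders (order_id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE dw_orders    (order_id INTEGER, amount REAL, updated_at TEXT)")

def initial_load(conn):
    # Initial (full) load: copy everything from the staging source for the first time.
    conn.execute("INSERT INTO dw_orders SELECT * FROM stage_orders")

def incremental_load(conn, last_load_ts):
    # Incremental (delta) load: only rows changed since the last load's high-water mark.
    conn.execute("INSERT INTO dw_orders SELECT * FROM stage_orders WHERE updated_at > ?",
                 (last_load_ts,))

def full_refresh(conn):
    # Full refresh: wipe the target table and reload it from scratch.
    conn.execute("DELETE FROM dw_orders")
    conn.execute("INSERT INTO dw_orders SELECT * FROM stage_orders")

initial_load(conn)
incremental_load(conn, last_load_ts="2024-06-01T00:00:00")
full_refresh(conn)
```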

Data integration in data mining refers to the process of combining data from multiple
sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the
data. These sources may include multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mappings between queries over the source and global schemas.
What is data integration :
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze
data that is spread across multiple systems or platforms, in order to gain a more complete
and accurate understanding of the data.
Data integration can be challenging due to the variety of data formats, structures, and
semantics used by different data sources. Different data sources may use different data
types, naming conventions, and schemas, making it difficult to combine the data into a
single view. Data integration typically involves a combination of manual and automated
processes, including data profiling, data mapping, data transformation, and data
reconciliation.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.

ER Modeling vs. Dimensional Modeling
- ER modeling is transaction-oriented; dimensional modeling is subject-oriented.
- ER modeling uses entities and relationships; dimensional modeling uses fact tables and dimension tables.
- ER models have few levels of granularity; dimensional models have multiple levels of granularity.
- ER models hold real-time information; dimensional models hold historical information.
- ER modeling eliminates redundancy; dimensional modeling plans for redundancy.
- ER models handle high transaction volumes using a few records at a time; dimensional models handle low transaction volumes using many records at a time.
- ER models hold highly volatile data; dimensional models hold non-volatile data.
- ER modeling covers both the physical and logical model; dimensional modeling covers the physical model.
- Normalization is suggested for ER models; de-normalization is suggested for dimensional models.
- ER modeling supports OLTP applications; dimensional modeling supports OLAP applications.
- Example: an ER-modeled application is used for buying products from e-commerce websites like Amazon; a dimensionally modeled application analyzes the buying patterns of customers in various cities over the past 10 years.
D) Concatenated Key: A row in a fact table relates to a combination of rows from the dimension tables. Each row in the fact table is identified by the concatenation of the primary keys of its dimension tables (for example, four dimension tables such as product, customer, time, and sales representative).
Table deep, not wide
A fact table contains fewer attributes than the dimension tables, but the number of records in a fact table is large.
Eg. 3 products, 5 customers, 30 days, 10 sales representatives:
No. of rows = 3 × 5 × 30 × 10 = 4500 rows
Data Grain
It is the level of detail of the measurement. Here we can identify a single order on a certain date for a specific customer by a specific sales representative.
Que4. What are the characteristics of DWM?
Subject oriented: A data warehouse targets the modelling and analysis of data for decision makers. It therefore typically provides a concise and straightforward view around a particular subject such as customer, product, or sales.
Integrated: A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, etc. This requires data cleaning and integration during warehousing to ensure consistency; data from separate sources needs to be aligned, harmonized, and standardized.
Time variant: Historical information is kept in the data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months ago, or even older data. This differs from a transaction system, where often only the most current data is kept. Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
Non-volatile: The data warehouse is a physically separate data store, transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse. Non-volatile means that once data has entered the warehouse, it does not change. Only two procedures are required for data access: 1. initial loading of data, and 2. access to the data.
Q) The process of making a group of abstract objects into classes of similar objects
is known as clustering.
Points to Remember:
 One group of data objects is treated as a cluster.
 In cluster analysis, the first step is to partition the set of data into groups based on data similarity, and then labels are assigned to the groups.
 The biggest advantage of clustering over classification is that it can adapt to changes and helps single out useful features that differentiate the groups.
Applications of cluster analysis :
 It is widely used in many applications such as image processing, data analysis, and pattern
recognition.
 It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
 It also helps in information discovery by classifying documents on the web.
Clustering Methods:
Clustering can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-Based Method
 Grid-Based Method
 Model-Based Method
 Constraint-Based Method
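As a concrete illustration of the partitioning method, the following sketch (scikit-learn assumed; the 2-D points are made up) groups points into two clusters with k-means.

```python
from sklearn.cluster import KMeans

# Hypothetical 2-D points: two loose groups around (1, 1) and (8, 8)
points = [[1, 2], [1, 1], [2, 1], [8, 8], [9, 8], [8, 9]]

# Partitioning method: k-means splits the data into k clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster label assigned to each point
print(kmeans.cluster_centers_)  # centroid of each cluster
```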
OLAP, which stands for Online Analytical Processing, operations are a set of techniques
used in data mining and business intelligence for analyzing large datasets. OLAP operations
enable users to extract insights from multidimensional data sets quickly and efficiently.
These operations help users explore data from multiple perspectives, gain insights into data
relationships, and make informed decisions based on data analysis.
One example of an OLAP operation is the "slice" operation. This operation allows users to
extract data from a multidimensional cube by selecting a single dimension and a specific
value for that dimension. For instance, if a user wants to analyze data for a particular
product, they can slice the cube by selecting the product dimension and the desired product
value.
Operations in OLAP
Drill Down
The Drill Down OLAP operation allows users to view more detailed data by expanding a
particular dimension in a multidimensional cube. For example, a user can drill down into
the product dimension to view data for individual products, or a user can expand quarterly
sales data into monthly sales figures. This operation is important in data mining as it
enables users to explore data at lower levels of detail and gain insights into specific aspects
of the data. By drilling down into the data, users can identify trends and patterns that may
not be apparent at higher levels of aggregation, allowing for more targeted analysis and
decision-making.
Drill Up
It is the opposite operation of Drill Down. The Drill Up OLAP operation allows users to view
data at a higher level of aggregation by collapsing a specific dimension in a
multidimensional cube. For instance, a user can drill up from monthly sales figures into quarterly or yearly totals to view data at a higher level of the time hierarchy. By drilling up, users can identify trends and
patterns that may not be visible at lower levels of detail, enabling them to make informed
decisions based on a complete picture of the data.
Slice
The Slice OLAP operation allows users to extract data from a multidimensional cube by
selecting a single dimension and a specific value for that dimension. For example, if a user
wants to analyze data for a particular product, they can slice the cube by selecting the
product dimension and the desired product value. This operation is important in data
mining, enabling users to extract data based on specific criteria and allowing for more
focused and targeted analysis.
Dice
The Dice OLAP operation allows users to extract data from a multidimensional cube by
selecting multiple dimensions and specific values for each selected dimension. For example,
if a user wants to analyze data for a particular product and a specific period, they can dice
the cube by selecting both the product and time dimensions and the desired values for each
dimension. This operation is important in data mining, enabling users to extract data based
on multiple criteria, allowing for more focused and targeted analysis.
Pivot
The Pivot OLAP operation allows users to rotate the orientation of a multidimensional cube
to view the data from a different perspective. For example, if a user wants to analyze sales
data by product category and sales channel, they can pivot the cube to view the sales data
by sales channel and product category. This operation is important in data mining as it
enables users to view the same data from different angles and gain new insights into the
patterns and relationships in the data.
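These five operations can be imitated on a tiny, hypothetical sales cube with pandas (all column and member names are invented): roll-up and drill-down change the aggregation level, slice and dice apply filters, and pivot reshapes the view.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "product": ["Cola", "Chips", "Cola", "Chips", "Cola", "Chips"],
    "channel": ["Store", "Web", "Store", "Web", "Web", "Store"],
    "revenue": [100, 80, 120, 90, 150, 70],
})

# Roll-up (drill up): aggregate from quarter level up to year level
rollup = sales.groupby("year")["revenue"].sum()

# Drill-down: expand the year level back into year + quarter detail
drilldown = sales.groupby(["year", "quarter"])["revenue"].sum()

# Slice: fix a single value of one dimension (product = "Cola")
slice_cola = sales[sales["product"] == "Cola"]

# Dice: fix values on several dimensions at once
dice = sales[(sales["product"] == "Cola") & (sales["year"] == 2023)]

# Pivot: rotate the cube to view revenue by channel vs. product
pivot = sales.pivot_table(index="channel", columns="product",
                          values="revenue", aggfunc="sum")

print(rollup, drilldown, slice_cola, dice, pivot, sep="\n\n")
```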
A) Demonstrate multidimensional association rule mining with a suitable example.
Association rule mining is a technique used to discover interesting relationships or
patterns in large datasets. Multidimensional association rule mining extends this concept
to consider multiple dimensions or attributes simultaneously.
1. In multidimensional association rules:
 Attributes can be categorical or quantitative.
 Quantitative attributes are numeric and incorporate a hierarchy.
 Numeric attributes must be discretized.
 A multidimensional association rule involves more than one dimension or predicate.
Eg: age(X, "20..29") ∧ buys(X, "IBM Laptop Computer") ⇒ buys(X, "HP Inkjet Printer")
2. Three approaches to mining multidimensional association rules:
I. Using static discretization of quantitative attributes.
- Discretization is static and occurs prior to mining.
- Discretized attributes are treated as categorical.
- Use the Apriori algorithm to find all k-frequent predicate sets (this requires k or k+1 table scans).
- Every subset of a frequent predicate set must be frequent.
Eg: if in a data cube the 3-D cuboid (age, income, buys) is frequent, then (age, income), (age, buys), and (income, buys) are also frequent.
Data cubes are well suited for mining since they make mining faster.
The cells of an n-dimensional data cuboid correspond to the predicate sets.
II. Using dynamic discretization of quantitative attributes.
- Known as mining quantitative association rules.
- Numeric attributes are dynamically discretized.
Eg: age(X, "20..25") ∧ income(X, "30K..41K") ⇒ buys(X, "Laptop Computer")
(Figure: 2-D grid of tuples over the discretized quantitative attributes.)
III. Using distance-based discretization with clustering.
- This is a dynamic discretization process that considers the distance between data points.
- It involves a two-step mining process:
  1. Perform clustering to find the intervals of the attributes involved.
  2. Obtain association rules by searching for groups of clusters that occur together.
- The resulting rules may satisfy:
 Clusters in the rule antecedent are strongly associated with clusters in the consequent.
 Clusters in the antecedent occur together.
 Clusters in the consequent occur together.
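A short sketch of the static-discretization step from approach I, assuming pandas; the age and income bins are arbitrary choices for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "age":    [22, 27, 34, 41, 58],
    "income": [18000, 32000, 45000, 60000, 75000],
    "buys_laptop": [1, 1, 0, 1, 0],
})

# Static discretization: bin the quantitative attributes before mining,
# then treat the resulting intervals as ordinary categorical values.
customers["age_band"] = pd.cut(customers["age"], bins=[20, 30, 40, 60],
                               labels=["20..29", "30..39", "40..59"])
customers["income_band"] = pd.cut(customers["income"],
                                  bins=[0, 30000, 50000, 100000],
                                  labels=["low", "mid", "high"])

# Each (age_band, income_band, buys_laptop) combination is a candidate predicate set
print(customers.groupby(["age_band", "income_band"], observed=True)["buys_laptop"].sum())
```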
Explain architecture of Data Mining with a diagram.
(Diagram: components of the data mining architecture.)
The components of the data mining architecture work together as follows:
- The user inputs a request.
- The request is sent to the data mining engine for pattern evaluation.
- The system tries to seek a solution to the query using the existing databases and generates the metadata.
- The metadata is sent to the data mining engine for analysis, which may interact with the pattern evaluation module to find the results.
- The obtained result is then interpreted at the front end via a suitable interface.
Database/Data Warehouse Server: The database/data warehouse server is the crucial element of the data mining architecture. It contains the cleaned and integrated data, under a unified schema, ready to be processed, and it retrieves the relevant data based on the user's request.
Data Mining Engine: The data mining engine comprises modules or tools for performing various tasks such as data clustering, data classification, prediction, and correlation analysis on the data stored in the database/data warehouse server. This set of tools includes:
- an interpreter to transmit commands to the computer,
- gear between the engine and the data warehouse to produce and handle bidirectional communication, and
- a set of data mining algorithms.
Pattern Evaluation: The pattern evaluation module investigates a pattern using a threshold value. It works in collaboration with the data mining engine and uses interestingness measures to find interesting and useful patterns. Pattern evaluation may also coordinate with the mining module, depending on the data mining technique used. It is suggested to push the evaluation of pattern interestingness as far as possible into the mining procedure, to confine the search to desirable patterns and ensure an effective data mining process.
Graphical User Interface: The GUI serves as the link between the user and the data mining system. It hides the complex process of data interpretation and presents the data in an easy, readable format. The main components of a GUI are:
- Legend: some visualization results need colours, icons, or labels; a legend at the bottom of the visualizer page helps to interpret the results.
- Status bar: the status bar facilitates the display of textual information.
- Toolbar: every view provides a specific toolbar to access the crucial features of that view.
Knowledge Base: A knowledge base is defined as the repository of domain-specific or
general knowledge gathered from data sources. It stores large amounts of organized data
and follows a defined schema or “data model” that facilitates its storage, retrieval, and
modification, and is powered by artificial intelligence and machine learning algorithms.
Knowledge Base gives inputs to the data mining engine and helps in pattern evaluation.
Q) Explain data mining as a step in KDD. Give the architecture of a typical data mining system.

Data Cleaning: Every data science enthusiast quickly learns one truth: real-world data is messy. This step involves removing any inconsistencies, errors, or outliers that might skew the results.
Data Integration: Data often comes from multiple sources, each with its own format and structure. This step merges this data into a unified set, ensuring consistency and reducing redundancy.
Data Selection: Not all data is relevant for every analysis. Here, you select the subset of data that pertains to your specific objective.
Data Transformation: Data might need to be summarized, aggregated, or otherwise transformed to make it suitable for mining.
Data Mining: This is where the magic happens! Using various algorithms and techniques, patterns, trends, and relationships are extracted from the data.
Pattern Evaluation: Not all patterns are useful or interesting. This step helps filter out the noise, ensuring only valuable insights are considered.
Knowledge Presentation: After all the hard work, it's time to share the findings. This often involves visualizations, reports, or other means to make the knowledge accessible and understandable.

Architecture of Typical Data mining system


1. Database, data warehouse, or other information repository:
o This is the information repository.
o Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server:
o It fetches the relevant data, as per the user's request, needed for the data mining task.
3. Knowledge base:
o This is used to guide the search and to evaluate the interestingness of the hidden patterns found in the data.
4. Data mining engine:
o It performs the data mining tasks such as characterization, association, classification, cluster analysis, etc.
5. Pattern evaluation module:
o It is integrated with the mining module and focuses the search on only the interesting patterns.
6. Graphical user interface:
o This module is used to communicate between the user and the data mining system and allows users to browse database or data warehouse schemas.
Q3] Write two data mining classification techniques.
Data mining classification is a process that involves the analysis of data to identify patterns and relationships. The objective of classification is to build a model that can be used to predict the class or category of new data instances based on their attributes or features. Classification is a supervised learning technique, meaning it uses a labeled dataset to build a predictive model.
Data Mining Classification Techniques:
1. Decision Trees: A decision tree is a graphical representation of a decision-making process. It consists of nodes that represent the features of the data and branches that represent the decisions based on those features. Decision trees are easy to interpret and can handle both categorical and numerical data.
2. Naive Bayes: Naive Bayes is a probabilistic algorithm that makes predictions based on the probabilities of the features. It assumes that the features are independent of each other and calculates the probability of each class based on the probabilities of the features.
3. K-Nearest Neighbors (KNN): KNN is a non-parametric algorithm that makes predictions based on the similarity of the features. It calculates the distance between the new data instance and the existing data instances and selects the k nearest neighbours to make a prediction.
4. Support Vector Machines (SVM): SVM is a linear or nonlinear algorithm that separates the data into different classes using a hyperplane. The objective of SVM is to maximize the margin between the hyperplane and the nearest data points.
5. Random Forest: A random forest is an ensemble of decision trees that makes predictions by averaging the predictions of multiple decision trees. Each decision tree is built using a random subset of the features and a random subset of the data instances.
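A minimal sketch of two of these techniques, assuming scikit-learn and its bundled Iris dataset; the split parameters and tree depth are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Decision tree: splits on feature thresholds, easy to interpret
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Naive Bayes: assumes features are independent given the class
nb = GaussianNB().fit(X_train, y_train)

print("decision tree accuracy:", tree.score(X_test, y_test))
print("naive bayes accuracy:  ", nb.score(X_test, y_test))
```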

Linear Regression vs. Logistic Regression
- Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
- Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
- In linear regression, we predict the values of continuous variables; in logistic regression, we predict the values of categorical variables.
- In linear regression, we find the best-fit line, by which we can easily predict the output; in logistic regression, we find the S-curve, by which we can classify the samples.
- The least squares estimation method is used to estimate the model coefficients in linear regression; the maximum likelihood estimation method is used in logistic regression.
- The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
- Linear regression requires a linear relationship between the dependent and independent variables; logistic regression does not require the relationship to be linear.
- In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables.
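A short sketch contrasting the two models, assuming scikit-learn; the toy data (hours studied versus exam score and pass/fail) is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])   # independent variable
score = np.array([35, 45, 50, 62, 70, 80])          # continuous target
passed = np.array([0, 0, 0, 1, 1, 1])               # categorical target (0/1)

# Linear regression: fits a best-fit line, predicts a continuous value
lin = LinearRegression().fit(hours, score)
print("predicted score for 4.5 h:", lin.predict([[4.5]])[0])

# Logistic regression: fits an S-curve, predicts a class (pass/fail)
log = LogisticRegression().fit(hours, passed)
print("predicted pass/fail for 4.5 h:", log.predict([[4.5]])[0])
print("probability of passing:", log.predict_proba([[4.5]])[0, 1])
```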
OLAP vs. OLTP
- Purpose: OLAP helps you analyze large volumes of data to support decision-making; OLTP helps you manage and process real-time transactions.
- Data source: OLAP uses historical and aggregated data from multiple sources; OLTP uses real-time and transactional data from a single source.
- Data structure: OLAP uses multidimensional (cube) or relational databases; OLTP uses relational databases.
- Data model: OLAP uses star schema, snowflake schema, or other analytical models; OLTP uses normalized or denormalized models.
- Volume of data: OLAP has large storage requirements, think terabytes (TB) and petabytes (PB); OLTP has comparatively smaller storage requirements, think gigabytes (GB).
- Response time: OLAP has longer response times, typically in seconds or minutes; OLTP has shorter response times, typically in milliseconds.
- Example applications: OLAP is good for analyzing trends, predicting customer behavior, and identifying profitability; OLTP is good for processing payments, customer data management, and order processing.
Que3: Explain Metadata and its types.
Data warehouse users can use metadata in a variety of situations to build, maintain, and manage the system. The basic idea of metadata in the data warehouse is that it is 'data about data', e.g. the index page of a book.
Metadata can hold information about the data warehouse such as:
 the sources of any extracted data,
 the use of that warehouse data,
 features of the data, and
 any kind of data and its values.
Metadata is categorised into three types:
Operational metadata: Data for the data warehouse comes from several operational systems of the organisation, which contain different data structures with varying field lengths and data types. In selecting data from the source systems, you sometimes need to split records, sometimes combine records that use multiple coding schemes and field lengths, and sometimes the end user wants the original dataset. This is handled by operational metadata, which contains all the information about the operational data sources.
Extraction and transformation metadata: This contains data about the extraction of data from the source systems and the various transformations applied to the data before it is stored in the data warehouse. The primary purpose of this metadata is to map every individual data element from its source system to the data warehouse; it therefore identifies each data element by its source field name and destination field name, along with keys, format, and size.
End-user metadata: This acts as a navigational map of the data warehouse, enabling end users to find information from the data warehouse using their own business terminology. It translates a cryptic name or code of a data element into meaningful information, e.g. a field named 'Name' is presented to the user as 'customer name'.
B) Explain the architecture of DWM with a diagram.
A data warehouse is a heterogeneous collection of different data sources organised under a unified schema. There are two approaches for constructing a data warehouse: the top-down approach and the bottom-up approach. The top-down approach is explained below.
1. Top-down approach:
The essential components are discussed below:
External Sources – External source is a source from where data is collected irrespective of
the type of data. Data can be structured, semi structured and unstructured as well.
Stage Area – Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
E (Extracted): Data is extracted from External data source.
T (Transform): Data is transformed into the standard format.
L(Load): Data is loaded into data warehouse after transforming it into the standard format.
Data-warehouse –
After cleansing of data, it is stored in the data warehouse as central repository. It actually
stores the meta data and the actual data gets stored in the data marts. Note that data
warehouse stores the data in its purest form in this top-down approach.
Data Marts – A data mart is also a part of the storage component. It stores the information of a particular function of an organisation, handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
Data Mining – The practice of analysing the big data present in the data warehouse is data mining. It is used to find hidden patterns present in the database or data warehouse with the help of data mining algorithms.
This approach is defined by Inmon as – data warehouse as a central repository for the
complete organisation and data marts are created from it after the complete data
warehouse has been created.

Q) Explain the data cleaning techniques in data mining.
Data cleaning is a method to remove all the possible noise from data and clean it. Proper, cleaned data is used for data analysis and for finding key insights, patterns, etc. Data cleaning increases data consistency and entails normalizing the data. The data derived from existing sources may be inaccurate, unreliable, complex, and sometimes incomplete.
Data Cleaning Process
The data cleaning process handles data cleaning, but before inconsistent data can be handled it must first be identified. The following phases are used in the data cleaning process.
Identify Inconsistent Details - Due to different factors, such as the data type, discrepancies can arise in data with many optional fields that allow candidates to fill in missing details. While entering the results, the candidates could have made a mistake, or some details might be out of date, such as an address or phone number that needs updating. These can be causes of contradictory details.
Identify Missing Values - If a record lacks several attributes and their values, it can be ignored.
Remove Noisy Data and Missing Values - Noisy data is information without meaning; the term is also used for corrupt records. Noisy data contributes no valuable information to the data mining process and only increases the volume of data in the data warehouse, so it should be removed so that mining can run efficiently.
Q) Elucidate market basket analysis with an example.
A data mining technique used to uncover purchase patterns in a retail setting is known as Market Basket Analysis. In simple terms, market basket analysis in data mining analyzes the combinations of products that are bought together.
It is a careful study of the purchases made by a customer in a supermarket, and it identifies patterns of items frequently purchased together. This analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this analysis task. Example: data mining concepts are used in sales and marketing to provide better customer service, to improve cross-selling opportunities, and to increase direct-mail response rates. Customer retention, in the form of pattern identification and prediction of likely defections, is possible with data mining. Risk assessment and fraud detection also use data mining concepts to identify inappropriate or unusual behaviour.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}:
IF means antecedent: an antecedent is an item found within the data.
THEN means consequent: a consequent is an item found in combination with the antecedent.
Types of Market Basket Analysis
Descriptive market basket analysis: This sort of analysis looks for patterns and connections that exist between the components of a market basket. This kind of study is mostly used to understand consumer behaviour, including which products are purchased in combination and what the most typical item combinations are. Descriptive market basket analysis helps retailers place products in their stores more profitably by showing which products are frequently bought together.
Predictive Market Basket Analysis:
Market basket analysis that predicts future purchases based on past purchasing patterns is
known as predictive market basket analysis. Large volumes of data are analyzed using
machine learning algorithms in this sort of analysis in order to create predictions about
which products are most likely to be bought together in the future. Retailers may make
data-driven decisions about which products to carry, how to price them, and how to
optimize shop layouts with the use of predictive market basket research.
Differential Market Basket Analysis: Differential market basket analysis analyses two sets
of market basket data to identify variations between them. Comparing the behavior of
various client segments or the behavior of customers over time is a common usage for this
kind of study. Retailers can respond to shifting consumer behavior by modifying their
marketing and sales tactics with the help of differential market basket analysis.
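One way to mine such IF -> THEN rules programmatically is sketched below, assuming the mlxtend library is installed and that its apriori and association_rules functions are available in this form; the baskets and thresholds are made up.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a basket; True means the item was in that basket (hypothetical data)
baskets = pd.DataFrame(
    [[True,  True,  False],
     [True,  True,  True ],
     [False, True,  True ],
     [True,  True,  False]],
    columns=["bread", "butter", "milk"],
)

# Frequent itemsets with support >= 0.5, then rules with confidence >= 0.7
frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)

# antecedents = IF part, consequents = THEN part
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```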
Que7. Write 3 insides of each Dimension table and fact table.
Inside the Dimension Table:
1) Dimension table key: the primary key of the dimension table uniquely identifies each row in the table.
2) Drilling down and rolling up: the attributes in a dimension table provide the ability to go from higher levels of aggregation to lower levels of detail.
3) Multiple hierarchies: dimension tables often provide for multiple hierarchies, so drilling down may be performed along any of them.
Inside the Fact Table:
1) Concatenated key: a row in the fact table is identified by the primary keys of the related dimension tables taken together.
2) Table deep, not wide: a fact table has fewer attributes than a dimension table but a very large number of records.
3) Data grain: the level of detail of the measurements, e.g. a single order on a certain date for a specific customer by a specific sales representative.
Q) Name the set of basic transformation tasks in DWM and give an example for each.
In Data Warehousing and Data Warehouse Management (DWM), basic transformation tasks
are essential for preparing data to be loaded into the data warehouse. These tasks involve
cleaning, transforming, and consolidating data from various sources to ensure it is in a
suitable format for analysis. Here are the main transformation tasks:
1.Data Cleaning:
- *Removing Duplicates*: Identifying and removing duplicate records from the dataset.
- *Handling Missing Values*: Addressing missing data by imputation (filling in missing
values), removing records, or other techniques.
- *Correcting Errors*: Identifying and correcting errors in data, such as typos, incorrect
values, or inconsistent formats.
2. *Data Integration*:
- *Combining Data*: Merging data from multiple sources into a single, unified dataset.
- *Resolving Data Conflicts*: Addressing discrepancies between data sources, such as
differing formats, naming conventions, or conflicting data values.
- *Data Alignment*: Ensuring that data from different sources is correctly aligned and
consistent with each other.
3. *Data Transformation*:
- *Normalization*: Converting data to a standard format, such as transforming all dates to a
common date format or converting currencies to a common currency.
- *Aggregation*: Summarizing detailed data into higher-level aggregates, such as calculating
totals or averages.
- *Disaggregation*: Breaking down aggregated data into more detailed components, if
necessary.
- *Data Type Conversion*: Changing data types to ensure compatibility and correctness,
such as converting strings to dates or integers to floats.
- *Deriving New Values*: Creating new attributes or columns based on existing data, such as
calculating a new metric or deriving a category.
4. *Data Reduction*:
- *Filtering*: Removing unnecessary or irrelevant data to reduce the size of the dataset and
improve performance.
- *Sampling*: Selecting a representative subset of the data for analysis.
- *Data Compression*: Using techniques to reduce the storage size of the data without losing
important information.
5. *Data Consolidation*:
- *Summarization*: Creating summary tables or reports that provide an overview of the
data.
- *Data Integration*: Combining data from different sources into a cohesive, unified view.
6. *Data Loading*:
- *Staging*: Temporarily storing data in a staging area before it is loaded into the data
warehouse.
- *Incremental Loading*: Adding only new or changed data to the data warehouse, rather
than reloading all data.

These tasks are critical for ensuring that the data in the warehouse is accurate, consistent,
and ready for analysis. Proper transformation ensures that data from diverse sources can
be integrated and used effectively in decision-making processes.
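A compact sketch of a few of these tasks (cleaning, de-duplication, type conversion, aggregation), assuming pandas; the column names, values, and rules are illustrative only.

```python
import pandas as pd

raw = pd.DataFrame({
    "customer":   ["alice", "ALICE", "bob", None],
    "order_date": ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07"],
    "amount":     ["100.5", "100.5", "abc", "80"],
})

# Data cleaning: drop records with a missing key, fix inconsistent case,
# then remove exact duplicates
df = raw.dropna(subset=["customer"]).copy()
df["customer"] = df["customer"].str.title()
df = df.drop_duplicates()

# Data type conversion: strings to numbers/dates; invalid values become NaN/NaT
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Aggregation / deriving new values: summarize detail rows into per-customer totals
summary = df.groupby("customer")["amount"].sum().rename("total_amount")
print(summary)
```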
Describe slowly changing dimensions. What are the three types? Explain each type
very briefly.
Slowly Changing Dimensions

Slowly changing dimensions refer to how data in your data warehouse changes over time.
Slowly changing dimensions have the same natural key but other data columns that may or
may not change over time depending on the type of dimensions that it is.
Slowly changing dimensions are important in data analytics to track how a record is
changing over time. The way the database is designed directly reflects whether historical
attributes can be tracked or not, determining different metrics available for the business to
use.
For example, if data is constantly being overwritten for one natural key, the business will
never be able to see how changes in that row’s attributes affect key performance indicators.
If a company continually iterates on a product and its different features, but doesn’t track
how those features have changed, it will have no idea how customer retention, revenue,
customer acquisition cost, or other marketing analytics were directly impacted by those
changes.
Types of slowly changing dimensions
Type 0
Type 0 refers to dimensions that never change. You can think of these as mapping tables in
your data warehouse that will always remain the same, such as states, zipcodes, and county
codes. Date_dim tables that you may use to simplify joins are also considered type 0
dimensions. In addition to mapping tables, other pieces of data like social security number
and date of birth are considered type 0 dimensions.
Type 1
Type 1 refers to data that is overwritten by new data without keeping a historical record of
that old piece of data. With this type, there is no way to keep track of changes over time. I’ve
seen many companies use this type of dimension accidentally, not realizing that they can
never get the old values back. When implementing this dimension, make sure you do not
need to track the trends in that data column over time.
A good example of this is customer addresses. You don’t need to keep track of how a
customer’s address has changed over time, you just need to know you are sending an order
to the right place.
Type 2
Type 2 dimensions are always created as a new record. If a detail in the data changes, a new
row will be added to the table with a new primary key. However, the natural key would
remain the same in order to map a record change to one another. Type 2 dimensions are the
most common approach to tracking historical records.
There are a few different ways you can handle type 2 dimensions from an analytics
perspective. The first is by adding a flag column to show which record is currently active. This is the approach Fivetran takes with data
tables that have CDC implemented. Instead of deleting any historic records, they will add a
new one with the _FIVETRAN_DELETED column set to FALSE. The old record will then be
set to TRUE for this _FIVETRAN_DELETED column. Now, when querying this data, you can
use this column to filter for records that are active while still being able to get historical
records if needed.
You can also handle type 2 dimensions by adding a timestamp column or two to show when
a new record was created or made active and when it was made ineffective. Instead of
checking for whether a record is active or not, you can find the most recent timestamp and
assume that is the active data row. You can then piece together the timestamps to get a full
picture of how a row has changed over time.
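A minimal sketch of the Type 2 approach, using an in-memory SQLite table with hypothetical column names: the currently active row is closed off (flag plus end date) and a new versioned row is inserted under the same natural key.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_sk   INTEGER PRIMARY KEY,   -- surrogate key, new per version
    customer_id   TEXT,                  -- natural key, stays the same
    address       TEXT,
    is_active     INTEGER,
    valid_from    TEXT,
    valid_to      TEXT)""")

conn.execute("INSERT INTO dim_customer VALUES (1, 'C001', 'Old Street 1', 1, '2023-01-01', NULL)")

def scd_type2_update(conn, customer_id, new_address):
    now = datetime.now().strftime("%Y-%m-%d")
    # Close the currently active row instead of overwriting it
    conn.execute("""UPDATE dim_customer SET is_active = 0, valid_to = ?
                    WHERE customer_id = ? AND is_active = 1""", (now, customer_id))
    # Insert a new row with the same natural key and the changed attribute
    conn.execute("""INSERT INTO dim_customer
                    (customer_sk, customer_id, address, is_active, valid_from, valid_to)
                    VALUES ((SELECT MAX(customer_sk) + 1 FROM dim_customer), ?, ?, 1, ?, NULL)""",
                 (customer_id, new_address, now))

scd_type2_update(conn, "C001", "New Avenue 9")
for row in conn.execute("SELECT * FROM dim_customer ORDER BY customer_sk"):
    print(row)
```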
Metadata is data that describes and contextualizes other data. It provides information
about the content, format, structure, and other characteristics of data, and can be used to
improve the organization, discoverability, and accessibility of data.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized
using metadata standards and schemas. There are many metadata standards that have been
developed to facilitate the creation and management of metadata, such as Dublin Core,
schema.org, and the Metadata Encoding and Transmission Standard (METS). Metadata
schemas define the structure and format of metadata and provide a consistent framework
for organizing and describing data.
Metadata can be used in a variety of contexts, such as libraries, museums, archives, and
online platforms. It can be used to improve the discoverability and ranking of content in
search engines and to provide context and additional information about search results.
Metadata can also support data governance by documenting the ownership, use, and access
controls of data. It can facilitate interoperability by describing the content, format, and
structure of data, enabling data exchange between different systems and applications. It can
support data preservation by recording the context, provenance, and preservation needs of
data, and it can support data visualization by describing the data's structure and content,
enabling the creation of interactive and customizable visualizations.
Several Examples of Metadata:
Metadata is data that provides information about other data. Here are a few examples of
metadata:
- File metadata: This includes information about a file, such as its name, size, type, and creation date.
- Image metadata: This includes information about an image, such as its resolution, color depth, and camera settings.
- Music metadata: This includes information about a piece of music, such as its title, artist, album, and genre.
- Video metadata: This includes information about a video, such as its length, resolution, and frame rate.
- Document metadata: This includes information about a document, such as its author, title, and creation date.
- Database metadata: This includes information about a database, such as its structure, tables, and fields.
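As a small, hedged example of file metadata in practice, Python's standard library can read the basic attributes of a file; the path below is only a placeholder.

import os
import datetime

path = "report.pdf"  # placeholder; replace with a real file path
info = os.stat(path)

print("Name:", os.path.basename(path))
print("Size (bytes):", info.st_size)
print("Last modified:", datetime.datetime.fromtimestamp(info.st_mtime))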
Types of Metadata:
There are many types of metadata that can be used to describe different aspects of data,
such as its content, format, structure, and provenance. Some common types of metadata
include:
- Descriptive metadata: This type of metadata provides information about the content, structure, and format of data, and may include elements such as title, author, subject, and keywords. Descriptive metadata helps to identify and describe the content of data and can be used to improve the discoverability of data through search engines and other tools.
- Administrative metadata: This type of metadata provides information about the management and technical characteristics of data, and may include elements such as file format, size, and creation date. Administrative metadata helps to manage and maintain data over time and can be used to support data governance and preservation.
- Structural metadata: This type of metadata provides information about the relationships and organization of data, and may include elements such as links, tables of contents, and indices. Structural metadata helps to organize and connect data and can be used to facilitate the navigation and discovery of data.
- Provenance metadata: This type of metadata provides information about the history and origin of data, and may include elements such as the creator, date of creation, and sources of data. Provenance metadata helps to provide context and credibility to data and can be used to support data governance and preservation.
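As a purely hypothetical illustration, the four types above could be collected for a single dataset in one record; every field name here is invented for the sketch and does not follow any formal standard.

# Invented example record grouping the four metadata types for one dataset.
dataset_metadata = {
    "descriptive": {
        "title": "Monthly Retail Sales",
        "author": "Analytics Team",
        "keywords": ["sales", "retail", "monthly"],
    },
    "administrative": {
        "file_format": "CSV",
        "size_bytes": 1_048_576,
        "created": "2024-01-05",
    },
    "structural": {
        "tables": ["sales_fact", "date_dim", "store_dim"],
        "primary_table": "sales_fact",
    },
    "provenance": {
        "source_system": "POS export",
        "derived_from": ["raw_sales_2023.csv"],
    },
}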
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and
1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it ensures that the data
is in a format suitable for analysis and modeling and that it is free of errors and
inconsistencies. It can also improve the performance of data mining algorithms, for example
by reducing the dimensionality of the data and by scaling the data to a common range of
values (a small sketch of min-max normalization, step 3, follows below).
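A minimal pandas sketch of min-max normalization, using made-up revenue values: each value x is rescaled to (x - min) / (max - min) so that the result lies in [0, 1].

import pandas as pd

# Toy data; the column name is made up for illustration.
sales = pd.DataFrame({"revenue": [120.0, 450.0, 80.0, 300.0]})

# Min-max normalization: x' = (x - min) / (max - min).
col = sales["revenue"]
sales["revenue_scaled"] = (col - col.min()) / (col.max() - col.min())
print(sales)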
The data are transformed into forms that are well suited for mining. Data transformation
involves the following steps:
1. Smoothing: A process used to remove noise from the dataset using algorithms. It allows
important features of the dataset to be highlighted and helps in predicting patterns. When
data is collected, it can be manipulated to eliminate or reduce variance or other forms of
noise. The idea behind data smoothing is that it identifies simple changes that help predict
trends and patterns. This helps analysts or traders who must look at large amounts of data
that would otherwise be difficult to digest, revealing patterns they would not see otherwise.
2. Aggregation: Aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources and integrated into a single
summary for analysis. This is a crucial step, since the accuracy of data analysis insights is
highly dependent on the quantity and quality of the data used. Gathering accurate,
high-quality data in sufficient quantity is necessary to produce relevant results. Aggregated
data is useful for everything from decisions about financing and product business strategy
to pricing, operations, and marketing strategies. For example, daily sales data may be
aggregated to compute monthly and annual totals.
3. Discretization: A process of transforming continuous data into a set of small intervals.
Most real-world data mining tasks involve continuous attributes, yet many existing data
mining frameworks are unable to handle them. Even when a data mining task can manage a
continuous attribute, its efficiency can often be significantly improved by replacing the
continuous values with discrete ones. For example, numeric values can be grouped into
intervals such as (1-10, 11-20), and age can be mapped to labels such as young, middle-aged,
senior.
4. Attribute Construction: New attributes are constructed from the given set of attributes
and added to assist the mining process. This simplifies the original data and makes the
mining more efficient.
5. Generalization: Converts low-level data attributes to high-level data attributes using
concept hierarchies. For example, an age attribute in numerical form (22, 25) can be
converted into categorical values (young, old), and categorical attributes such as house
addresses may be generalized to higher-level concepts such as town or country. (A combined
sketch of smoothing, aggregation, discretization, and generalization follows after this list.)
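The following is a combined, hedged pandas sketch of smoothing (moving average), aggregation (monthly totals), and discretization/generalization (age groups); all column names and values are made up for illustration.

import pandas as pd

# Toy daily data; everything here is illustrative.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "age": [22, 25, 47, 63, 35, 58],
    "sales": [100.0, 130.0, 90.0, 160.0, 120.0, 140.0],
})

# Smoothing: a 3-day moving average damps day-to-day noise.
df["sales_smoothed"] = df["sales"].rolling(window=3, min_periods=1).mean()

# Aggregation: roll daily sales up to monthly totals.
monthly_sales = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()

# Discretization / generalization: map numeric age onto categorical groups.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle-aged", "senior"])

print(df)
print(monthly_sales)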
Que6. Explain Updates to the dimension table [SCD].
Dimension tables are more stable and less volatile than fact tables. While a fact table
changes as the number of rows increases, a dimension table changes as the attributes
themselves change.
Type 1 change: correction of errors
- Overwrite the attribute values with the new values in the dimension table row.
- The old value of the attribute is discarded, i.e., it is not preserved.
- No other changes are made in the dimension table row, and the key of the dimension table
row is not affected.
Type 2 change: preservation of history
- A new dimension table row containing the new (changed) attribute values is added.
- A new column, such as an effective date, is added to the dimension table.
- The original row is not changed and its key remains the same; the new row is inserted into
the dimension table with a new surrogate key.
Type 3 change: tentative soft revisions
- An "old" field is added in the dimension table for the affected attribute.
- The new value of the attribute is kept in the "current" field. A current effective date field
is also added for the changed attribute.
- No new dimension row is added to the dimension table.
Update to the dimension table
- Dimension tables tend to be more stable and less volatile compared to fact tables.
- The fact table changes due to the increase in the number of rows.
- A dimension table changes not only because of an increase in the number of rows but also
because of changes to the attributes themselves.
SCD [slowly changing dimensions] is the customary term used for managing the issues
associated with the impact of changes to the attributes of a dimension table.
Design approaches to SCDs are categorized into 3 types:
- Type 1: overwrite the dimension record.
- Type 2: add a new dimension record (row).
- Type 3: create a new field in the dimension record (a small type 3 sketch follows below).
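Here is a minimal sketch of a type 3 change using pandas, with invented columns current_category, old_category, and category_effective_date; the current value is shifted into the "old" field and no new row is added.

import pandas as pd

# Hypothetical product dimension using type 3 columns; names are illustrative.
dim_product = pd.DataFrame({
    "product_key": [10],
    "product_id": ["P-100"],
    "current_category": ["Beverages"],
    "old_category": [None],
    "category_effective_date": [pd.Timestamp("2020-01-01")],
})

def scd_type3_update(dim, product_id, new_category, change_date):
    """Shift the current value into the 'old' field and store the new value;
    no new row is added and the key stays the same."""
    mask = dim["product_id"] == product_id
    dim.loc[mask, "old_category"] = dim.loc[mask, "current_category"]
    dim.loc[mask, "current_category"] = new_category
    dim.loc[mask, "category_effective_date"] = change_date
    return dim

dim_product = scd_type3_update(dim_product, "P-100", "Soft Drinks",
                               pd.Timestamp("2024-06-01"))
print(dim_product)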
Apriori exercise (minimum support count = 2, minimum confidence = 60%):
TID 10: 1, 3, 4
TID 20: 2, 3, 5
TID 30: 1, 2, 3, 5
TID 40: 2, 5
TID 50: 1, 3, 5
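This looks like a small Apriori exercise. A brute-force Python sketch (enumerating all itemsets rather than the level-wise candidate generation of the real algorithm) that finds the frequent itemsets for a minimum support count of 2 and prints the rules meeting 60% confidence could look like this:

from itertools import combinations

# Transactions from the exercise above (TID -> items).
transactions = {
    10: {1, 3, 4},
    20: {2, 3, 5},
    30: {1, 2, 3, 5},
    40: {2, 5},
    50: {1, 3, 5},
}
MIN_SUPPORT_COUNT = 2
MIN_CONFIDENCE = 0.60

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

# Brute-force enumeration of all itemsets (fine for 5 items; real Apriori
# would prune candidates level by level instead).
all_items = sorted(set().union(*transactions.values()))
frequent = {}
for k in range(1, len(all_items) + 1):
    for combo in combinations(all_items, k):
        itemset = frozenset(combo)
        count = support_count(itemset)
        if count >= MIN_SUPPORT_COUNT:
            frequent[itemset] = count

# Generate rules A -> B from each frequent itemset with at least two items.
for itemset, count in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = count / frequent[antecedent]
            if confidence >= MIN_CONFIDENCE:
                consequent = itemset - antecedent
                print(f"{set(antecedent)} -> {set(consequent)} "
                      f"(support={count}, confidence={confidence:.0%})")

On these transactions the largest frequent itemsets turn out to be {1, 3, 5} and {2, 3, 5}, each with a support count of 2.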