
Data Warehouse and Data Mining

Unit - III
Data Preprocessing Techniques in Data Mining
Introduction
Data processing is collecting raw data and translating it into usable information. The raw data is collected,
filtered, sorted, processed, analyzed, stored, and then presented in a readable format. It is usually performed in
a step-by-step process by a team of data scientists and data engineers in an organization.

Data processing can be carried out automatically or manually. Nowadays, most data is processed automatically with the help of computers, which is faster and gives more accurate results. The data can be converted into different forms, such as graphics or audio, depending on the software and the data processing methods used.

The collected data is then processed and translated into the desired form according to requirements, so that it is useful for performing tasks. Data may be acquired from Excel files, databases, and text files, as well as from unorganized sources such as audio clips, images, GPRS data, and video clips.

The most commonly used tools for data processing are Storm, Hadoop, HPCC, Statwing, Qubole, and CouchDB. The processing of data is a key step of the data mining process, and processing raw data is a complicated task. How data is processed largely depends on the following factors:
• The volume of data that needs to be processed.
• The complexity of the data processing operations.
• The capacity and built-in technology of the respective computer systems.
• Technical skills and time constraints.

Tasks in Data Preprocessing

Data cleaning
Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset. Some techniques used in data
cleaning are −
Handling missing values
This scenario occurs when some values in the data are missing. Common options (a short sketch follows this list) are:
• Standard values can be used to fill in the missing values manually, but only for a small dataset.
• The attribute's mean or median value can be used to replace missing values, for normally distributed and skewed (non-normal) data respectively.
• Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.

• The most probable value, estimated using regression or a decision-tree algorithm, can also be used to fill in the missing value.
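These options can be sketched briefly in Python with pandas (assuming the library is available); the column names and the tiny example table below are invented purely for illustration.

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (illustrative only).
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 45, 38],
    "income": [30000, np.nan, 42000, np.nan, 52000],
})

# Mean imputation: reasonable when the attribute is roughly normally distributed.
df["age_filled"] = df["age"].fillna(df["age"].mean())

# Median imputation: more robust when the distribution is skewed (non-normal).
df["income_filled"] = df["income"].fillna(df["income"].median())

# Ignoring tuples: drop rows with missing values (sensible only for large datasets
# where relatively few tuples are affected).
df_dropped = df.dropna(subset=["age", "income"])
print(df)
print(df_dropped)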
Noisy Data
Noisy data are data that cannot be interpreted by a machine and contain unnecessary or faulty values. Some ways to handle them (a binning sketch follows this list) are −
• Binning − This method smooths noisy data. The sorted data are divided into equal-sized bins, and a smoothing method is applied to each bin: smoothing by bin means (bin values are replaced by the bin mean), smoothing by bin medians (bin values are replaced by the bin median), and smoothing by bin boundaries (the minimum and maximum bin values are identified and each bin value is replaced by the closest boundary value).
• Regression − Regression functions are used to smooth the data. Regression can be linear (one independent variable) or multiple (two or more independent variables).
• Clustering − It groups similar data into clusters and can be used for finding outliers.
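A minimal sketch of the binning methods above in plain Python; the sorted sample values and the choice of three equal-depth bins are assumptions made for illustration.

from statistics import median

# Equal-frequency (equal-depth) binning followed by three smoothing strategies.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

# Smoothing by bin medians: every value in a bin is replaced by the bin median.
by_medians = [[median(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer of the
# bin's minimum or maximum value.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9.0, 9.0, 9.0, 9.0], [22.75, ...], [29.25, ...]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]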
Data integration
The process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset, creating a single and consistent view of the data. The major problems during data integration are schema integration (integrating metadata collected from various sources), entity identification (recognizing the same real-world entities across different databases), and detecting and resolving data value conflicts. A small integration sketch follows.
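A small sketch of integration with pandas, assuming the library is available; the two source tables, their column names, and the renaming step (a simple form of schema integration) are hypothetical.

import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [250.0, 100.0, 75.0]})

# Schema integration: reconcile differing attribute names before combining.
billing = billing.rename(columns={"customer_id": "cust_id"})

# Entity identification: rows are matched on the shared key so that records
# from both sources refer to the same real-world entity.
integrated = crm.merge(billing, on="cust_id", how="outer")
print(integrated)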
Data transformation
In this step, the format or structure of the data is changed to make it suitable for the mining process. Methods for data transformation are −
Normalization − Scaling data so that it falls within a specific, smaller range (e.g., -1.0 to 1.0); a normalization sketch follows this list.
Discretization − Divides continuous data into intervals, which reduces the data size.
Attribute Selection − New attributes are derived from the given attributes to help the mining process.
Concept Hierarchy Generation − Attributes are generalized from a lower level to a higher level in a concept hierarchy.
Aggregation − A summary of the data is stored; the result depends on the quality and quantity of the data available.
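A short sketch of min-max normalization into the range -1.0 to 1.0 mentioned above; the function name and the salary values are assumptions for illustration.

def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Scale values linearly from their observed range into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

salaries = [30000, 45000, 52000, 75000, 98000]
print(min_max_normalize(salaries))
# The smallest value maps to -1.0, the largest to 1.0, and the rest fall in between.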
Data reduction
Data reduction increases storage efficiency and reduces the amount of data to be analyzed while producing almost the same analytical results. Analysis becomes harder when working with huge amounts of data, so reduction is used to avoid that. The main approaches to data reduction are −
Data Compression
Data is compressed to make analysis more efficient. Lossless compression means there is no loss of data during compression; lossy compression means unnecessary information is removed during compression.
Numerosity Reduction
The volume of the data is reduced, i.e., only a model of the data is stored instead of the whole data, which provides a smaller representation of the data without loss of information.
Dimensionality reduction
The number of attributes (random variables) is reduced so that the dimensionality of the data set becomes lower. Attributes are combined without losing their essential characteristics.


Stages of Data Processing


1. Data Collection
The collection of raw data is the first step of the data processing cycle. The raw data collected has a huge impact
on the output produced. Hence, raw data should be gathered from defined and accurate sources so that the
subsequent findings are valid and usable. Raw data can include monetary figures, website cookies, profit/loss
statements of a company, user behavior, etc.
2. Data Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw data to remove unnecessary
and inaccurate data. Raw data is checked for errors, duplication, miscalculations, or missing data and
transformed into a suitable form for further analysis and processing. This ensures that only the highest quality
data is fed into the processing unit.
3. Data Input
In this step, the raw data is converted into machine-readable form and fed into the processing unit. This can be
in the form of data entry through a keyboard, scanner, or any other input source.
4. Data Processing
In this step, the raw data is subjected to various data processing methods using machine learning and artificial
intelligence algorithms to generate the desired output. This step may vary slightly from process to process
depending on the source of data being processed (data lakes, online databases, connected devices, etc.) and the
intended use of the output.
5. Data Interpretation or Output
The data is finally transmitted and displayed to the user in a readable form like graphs, tables, vector files,
audio, video, documents, etc. This output can be stored and further processed in the next data processing cycle.
6. Data Storage
The last step of the data processing cycle is storage, where data and metadata are stored for further use. This
allows quick access and retrieval of information whenever needed. Proper data storage is also necessary for compliance with data protection legislation such as the GDPR.
Methods of Data Processing
There are three main data processing methods:
1. Manual Data Processing
Data is processed manually in this method. The entire procedure of data collection, filtering, sorting, calculation, and other logical operations is carried out by human effort, without any electronic device or automation software. It is a low-cost method and needs very few tools, but it is error-prone, labor-intensive, and time-consuming.
2. Mechanical Data Processing
Data is processed mechanically through the use of devices and machines. These can include simple devices
such as calculators, typewriters, printing press, etc. Simple data processing operations can be achieved with this
method. It has much fewer errors than manual data processing, but the increase in data has made this method
more complex and difficult.
3. Electronic Data Processing

Data is processed with modern technologies using data processing software and programs. The software gives
a set of instructions to process the data and yield output. This method is the most expensive but provides the
fastest processing speeds with the highest reliability and accuracy of output.
Types of Data Processing
There are different types of data processing based on the source of data and the steps taken by the processing unit to generate an output. There is no one-size-fits-all method for processing raw data.

1. Batch Processing: In this type of data processing, data is collected and processed in batches. It is used for large
amounts of data. For example, the payroll system.
2. Single User Programming Processing: This is usually done by a single person for personal use. The technique is suitable even for small offices.
3. Multiple Programming Processing: This technique allows simultaneously storing and executing more than
one program in the Central Processing Unit (CPU). Data is broken down into frames and processed using two
or more CPUs within a single computer system. It is also known as parallel processing. Further, the multiple
programming techniques increase the respective computer's overall working efficiency. A good example of
multiple programming processing is weather forecasting.
4. Real-time Processing: This technique allows the user to interact directly with the computer system and processes data as soon as it is received. It is also known as the direct mode or interactive mode technique and is typically dedicated to a single task. It is a form of online processing that always remains in execution. For example, withdrawing money from an ATM.
5. Online Processing: In this technique, data is entered and executed directly rather than being stored and accumulated first and processed later. The technique is designed to reduce data entry errors, as it validates data at various points and ensures that only correct data is entered. It is widely used for online applications. For example, barcode scanning.
6. Time-sharing Processing: This is another form of online data processing in which several users share the resources of an online computer system. It is adopted when results are needed quickly and, as the name suggests, the system allocates each user a share of processing time.
7. Distributed Processing: This is a specialized data processing technique in which various remotely located computers are interconnected with a single host computer, forming a computer network. All of these systems are connected through a high-speed communication network, while the central computer system maintains the master database and monitors the network, which facilitates communication between the computers.

Data Reduction in Data Mining


Data mining is applied to selected data in a very large database. When data analysis and mining are performed on a huge amount of data, processing takes a very long time, which can make the task impractical and infeasible.
Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume while preserving the integrity of the data. Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume yet maintains the integrity of the original data. By reducing the data, the efficiency of the data mining process is improved while producing the same (or almost the same) analytical results.
Techniques of Data Reduction
1. Dimensionality Reduction
Dimensionality reduction keeps only the attributes required for the analysis and eliminates the remaining attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. Three common methods of dimensionality reduction are listed here; a PCA sketch follows the list.
i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. This is useful for data reduction because the data obtained from the wavelet transform can be truncated: a compressed approximation is obtained by retaining only a small fraction of the strongest wavelet coefficients. Wavelet transforms can be applied to data cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose the data set to be analyzed has tuples with n attributes. Principal component analysis searches for k n-dimensional orthogonal vectors (the principal components, with k ≤ n) that can best represent the data, and projects the data onto them.
iii. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to the data mining task and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating these redundant and irrelevant attributes.
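A brief PCA sketch, assuming NumPy and scikit-learn are available; the synthetic five-attribute data and the choice of k = 2 components are assumptions used only to show the projection onto principal components.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 tuples with 5 correlated attributes (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
noise = 0.05 * rng.normal(size=(100, 3))
data = np.hstack([base, base @ rng.normal(size=(2, 3)) + noise])

# Project the 5 original attributes onto k = 2 principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(data.shape, "->", reduced.shape)         # (100, 5) -> (100, 2)
print(pca.explained_variance_ratio_.round(3))  # share of variance kept by each component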
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller form. It includes two types, parametric and non-parametric numerosity reduction (a short sketch of both follows this subsection).
i. Parametric: Parametric numerosity reduction stores only the parameters of a model of the data instead of the original data. One family of parametric methods comprises regression and log-linear models.
o Regression and Log-Linear: Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms, attributes x and y are numeric database attributes, whereas w and b are the regression coefficients.
Multiple linear regression models the response variable y as a linear function of two or more predictor variables.
A log-linear model discovers the relationship between two or more discrete attributes in the database. Suppose we have a set of tuples presented in an n-dimensional space; the log-linear model is then used to estimate the probability of each tuple in that multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction irrespective of data size, but they may not achieve as high a volume of reduction as parametric methods. Common non-parametric data reduction techniques include histograms, clustering, sampling, and data cube aggregation.
o Histogram: A histogram is a graph that represents a frequency distribution, which describes how often each value appears in the data. A histogram uses the binning method to represent an attribute's data distribution: it partitions the values into disjoint subsets called bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of a single attribute, a histogram can also be built over multiple attributes; it can effectively represent up to about five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be calculated using a distance function: the more similar the objects in a cluster, the closer together they appear within the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster.
o The cluster representations then replace the original data. This technique is more effective if the data can be organized into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce the large data set
into a much smaller data sample. Below we will discuss the different methods in which we can sample
a large data set D containing N tuples:
o Simple random sample without replacement (SRSWOR) of size s: s tuples are drawn from the N tuples in data set D (s < N). The probability of drawing any tuple from data set D is 1/N, which means all tuples have an equal probability of being sampled.
o Simple random sample with replacement (SRSWR) of size s: Similar to SRSWOR, except that each tuple drawn from data set D is recorded and then placed back into D so that it can be drawn again.
o Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. The
data reduction can be applied by implementing SRSWOR on these clusters. A simple random
sample of size s could be generated from these clusters where s<M.
o Stratified sample: The large data set D is partitioned into mutually disjoint sets called 'strata'.
A simple random sample is taken from each stratum to get stratified data. This method is
effective for skewed data.
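A compact sketch contrasting parametric reduction (keeping only the coefficients w and b of y = wx + b) with non-parametric reduction by simple random sampling; the synthetic data, the seed, and the sample size of 50 are assumptions.

import random

random.seed(42)

# Synthetic data: y is roughly linear in x with some noise (illustrative only).
N = 1000
data = [(x, 3.0 * x + 7.0 + random.gauss(0, 2.0)) for x in range(N)]

# Parametric reduction: fit y = w*x + b by least squares and keep only (w, b)
# instead of the N original tuples.
mean_x = sum(x for x, _ in data) / N
mean_y = sum(y for _, y in data) / N
w = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum((x - mean_x) ** 2 for x, _ in data)
b = mean_y - w * mean_x
print(f"stored model: y = {w:.3f}x + {b:.3f}")

# Non-parametric reduction: simple random sample without replacement (SRSWOR)
# of s = 50 tuples; every tuple has the same chance of being chosen.
srswor = random.sample(data, k=50)

# Simple random sample with replacement (SRSWR): a tuple may be drawn more than once.
srswr = [random.choice(data) for _ in range(50)]
print(len(srswor), len(srswr))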
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional aggregation
that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the year 2022. If you
want to get the annual sale per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation
provides you with the required data, which is much smaller in size, and thereby we achieve data reduction even without
losing any data.
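A minimal sketch of this quarterly-to-annual roll-up using pandas (assuming it is available); the sales figures are invented for illustration.

import pandas as pd

# Hypothetical quarterly sales for All Electronics (values are illustrative).
quarterly = pd.DataFrame({
    "year":    [2018] * 4 + [2019] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales":   [224000, 408000, 350000, 586000, 310000, 402000, 390000, 610000],
})

# Roll up from the quarter level to the year level of the data cube: eight tuples
# are reduced to two, with no loss of the information needed for annual analysis.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)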


4. Data Compression
Data compression employs modification, encoding, or conversion of the structure of the data so that it consumes less space. It builds a compact representation of information by removing redundancy and representing data in binary form. If the original data can be restored exactly from the compressed form, the compression is lossless; if the original form cannot be fully restored, the compression is lossy. Dimensionality reduction and numerosity reduction methods can also be regarded as forms of data compression.

This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding (a run-length sketch follows). We can divide it into two types based on the compression technique used.
i. Lossless Compression: Encoding techniques such as run-length encoding allow simple and modest reductions in data size. Lossless data compression uses algorithms that restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, yet the decompressed image is meaningfully equivalent to the original. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of lossy compression.
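A small lossless-compression sketch using run-length encoding, one of the encoding mechanisms mentioned above; the function names and the input string are assumptions.

def run_length_encode(text):
    """Lossless compression: store each symbol once together with its run length."""
    if not text:
        return []
    encoded = []
    current, count = text[0], 1
    for ch in text[1:]:
        if ch == current:
            count += 1
        else:
            encoded.append((current, count))
            current, count = ch, 1
    encoded.append((current, count))
    return encoded

def run_length_decode(pairs):
    """Restore the exact original data, showing that the scheme is lossless."""
    return "".join(ch * count for ch, count in pairs)

sample = "AAAABBBCCDAA"
packed = run_length_encode(sample)
print(packed)                               # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert run_length_decode(packed) == sample  # no information is lost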
5. Discretization Operation
The data discretization technique is used to divide attributes of a continuous nature into data with intervals. Many constant values of an attribute are replaced with labels of small intervals, so mining results can be presented in a concise and easily understandable way.
i. Top-down discretization: If we first consider one or a few points (so-called breakpoints or split points) to divide the whole range of an attribute, and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also called splitting.
ii. Bottom-up discretization: If we first consider all of the continuous values as potential split points and then remove some of them by merging neighbouring values into intervals, the process is called bottom-up discretization, also called merging.

Discretization in data mining
Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of the data become easier. In other words, data discretization is a method of converting the values of continuous attributes into a finite set of intervals with minimal loss of information. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization uses the class information, whereas unsupervised discretization does not; unsupervised discretization proceeds either by a top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example
Suppose we have an attribute of Age with the given values

Age 1,5,9,4,7,11,14,17,13,18, 19,31,33,36,42,44,46,70,74,78,77

Table before and after discretization:

Attribute               Age             Age                      Age                      Age
Before discretization   1, 5, 4, 9, 7   11, 14, 17, 13, 18, 19   31, 33, 36, 42, 44, 46   70, 74, 77, 78
After discretization    Child           Young                    Mature                   Old

Another example is web analytics, where we gather statistics about website visitors. For example, all visitors who visit the site from an IP address located in India are grouped under the country-level value "India". A short discretization sketch for the Age example follows.
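The Age example above can be reproduced with a short sketch; the interval boundaries (up to 10, 11-30, 31-69, 70 and above) are assumptions chosen only to match the table.

from collections import Counter

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

def discretize_age(age):
    """Map a continuous age value onto one of four interval labels."""
    if age <= 10:
        return "Child"
    elif age <= 30:
        return "Young"
    elif age <= 69:
        return "Mature"
    return "Old"

labels = [discretize_age(a) for a in ages]
print(Counter(labels))  # Counter({'Young': 6, 'Mature': 6, 'Child': 5, 'Old': 4})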
Some Famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example for outliers, skewness, or an approximately normal distribution.
Binning
Binning is a data smoothing technique that groups a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for the development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization: a clustering algorithm partitions the values of a numeric attribute x into clusters, and each cluster is then treated as one discrete interval of x.
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according to their level of generality or importance. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to higher-level, more general concepts; mapping is done from low-level concepts to high-level concepts. For example, in computer science there are many hierarchical systems: a document placed in a folder in Windows, at a specific position in the directory tree, is a familiar example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.

Let's understand concept hierarchies for the dimension location with the help of an example. A particular city can be mapped to the country it belongs to, and that country to its continent: for example, New Delhi can be mapped to India, and India can be mapped to Asia. A small mapping sketch follows these definitions.
Top-down mapping
Top-down mapping starts at the top of the hierarchy with general information and moves down to more specialized information.
Bottom-up mapping
Bottom-up mapping starts at the bottom of the hierarchy with specialized information and moves up to more general information.
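A tiny sketch of mapping values upward through a location concept hierarchy (city to country to continent); the dictionary entries are illustrative assumptions.

# Each low-level concept maps to its immediate higher-level concept.
location_hierarchy = {
    "New Delhi": "India",
    "Mumbai": "India",
    "Vancouver": "Canada",
    "India": "Asia",
    "Canada": "North America",
}

def generalize(value, hierarchy):
    """Follow the mappings upward until the most general concept is reached."""
    path = [value]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path

print(generalize("New Delhi", location_hierarchy))  # ['New Delhi', 'India', 'Asia']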

Analysis of Attribute Relevance in Data mining


Method of Attribute Relevance Analysis: There have been many studies in machine learning, statistics, and fuzzy and rough set theories on attribute relevance analysis. The general idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class or concept. Such measures include information gain, the Gini index, uncertainty, and correlation coefficients. The analysis proceeds as follows.
1. Data Collection – Collect data for both the target class and the contrasting class by query processing. For class comparison, the user provides both the target class and the contrasting class in the data mining query. For class characterization, the target class is the class to be characterized, whereas the contrasting class is the set of comparable data that is not in the target class.
2. Preliminary relevance analysis using conservative AOI (attribute-oriented induction) – This step identifies a set of dimensions and attributes to which the selected relevance measure is to be applied. Since different levels of a dimension may have dramatically different relevance with respect to a given class, each attribute defining the conceptual levels of the dimension should, in principle, be included in the relevance analysis. AOI can be used to perform some preliminary relevance analysis on the data by removing or generalizing attributes that have a very large number of distinct values (for example, name and phone#); such attributes are unlikely to be useful for concept description. The relation obtained by this application of attribute-oriented induction is called the candidate relation of the mining task.
3. Remove irrelevant and weakly relevant attributes using the selected relevance analysis measure – Each attribute in the candidate relation is evaluated using the selected relevance measure. This step produces an initial target class working relation and an initial contrasting class working relation. The attributes are then sorted (i.e., ranked) according to their computed relevance to the data mining task.
4. Generate the concept description using AOI – Perform AOI using a less conservative set of attribute generalization thresholds. If the descriptive mining task is class characterization, only the initial target class working relation is included here. If the descriptive mining task is class comparison, both the initial target class working relation and the initial contrasting class working relation are included.
Relevance Measure Components (an information-gain sketch follows this list):
1. Information Gain (ID3)
2. Gain Ratio (C4.5)
3. Gini Index
4. Chi-square contingency table statistics
5. Uncertainty Coefficient
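A short sketch of the first measure, information gain as used in ID3; the toy attribute values and class labels are assumptions.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(attribute_values, labels):
    """Reduction in class entropy obtained by partitioning on the attribute."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [lab for val, lab in zip(attribute_values, labels) if val == value]
        split_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_entropy

# Toy example: the attribute separates the two classes perfectly, so the gain
# equals the full class entropy (about 0.971 bits here).
status = ["graduate", "graduate", "undergraduate", "undergraduate", "graduate"]
target = ["yes", "yes", "no", "no", "yes"]
print(round(information_gain(status, target), 3))  # 0.971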

Class Comparison Methods in Data Mining


In many applications, users may not be interested in having a single class or concept described or characterized
but rather would prefer to mine a description comparing or distinguishing one class (or concept) from other
comparable classes (or concepts). Class discrimination or comparison (hereafter referred to as class comparison)
mines descriptions that distinguish a target class from its contrasting classes. Notice that the target and
contrasting classes must be comparable because they share similar dimensions and attributes. For example, the
three classes, person, address, and item, are not comparable.
For example, the attribute generalization process described for class characterization can be modified so that
the generalization is performed synchronously among all the classes compared. This allows the attributes in all
classes to be generalized to the same levels of abstraction. Suppose that we are given the All Electronics data
for sales in 2003 and sales in 2004 and would like to compare these two classes. Consider the dimension location
with abstractions at the city, province or state, and country levels. Each class of data should be generalized to
the same location level. They are synchronously all generalized to either the city level, the province or state
level, or the country level. Ideally, this is more useful than comparing the sales in Vancouver in 2003 with the
sales in the United States in 2004 (i.e., where each set of sales data is generalized to a different level). The users, however, should have the option to override such an automated, synchronous comparison with their own choices when preferred.
Class Comparison Methods and Implementation
The general procedure for class comparison is as follows:


1. Data Collection: The set of relevant data in the database and data warehouse is collected by query Processing
and partitioned into a target class and one or a set of contrasting classes.
2. Dimension relevance analysis: If there are many dimensions and analytical comparisons are desired, then
dimension relevance analysis should be performed. Only the highly relevant dimensions are included in the
further analysis.
3. Synchronous Generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation or cuboid. The concepts in the contrasting class or classes are generalized to the same level as those in the prime
target class relation or cuboid, forming the prime contrasting class relation or cuboid.
4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the
form of tables, charts, and rules. This presentation usually includes a "contrasting" measure (such as count%)
that reflects the comparison between the target and contrasting classes. As desired, the user can adjust the
comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting
classes.

For example, suppose the task is to compare graduate and undergraduate students using a discriminant rule. The DMQL query for this comparison would be as follows.
use University_Database
mine comparison as "graduate_students vs_undergraduate_students"
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student

From this query, we can identify:

o attributes = name, gender, program, birth_place, birth_date, residence, phone_no, and GPA
o Gen(ai) = the concept hierarchy on attribute ai
o Ui = the attribute analytical threshold for attribute ai
o Ti = the attribute generalization threshold for attribute ai
o R = the attribute relevance threshold

Data Marts –
What is Data Mart?
A data mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data subject, and it may be distributed to support business needs. Data marts are analytical data stores designed to focus on particular business functions for a specific community within an organization. Data marts are usually derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.

The fundamental use of a data mart is in Business Intelligence (BI) applications. BI is used to gather, store, access, and analyze data. Data marts can be used by smaller businesses to exploit the data they have accumulated, since a data mart is less expensive to implement than a full data warehouse.

Reasons for creating a data mart


o Provides collective data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential users are more clearly defined than in a comprehensive data warehouse
o Contains only essential business data and is less cluttered

Types of Data Marts


There are mainly two approaches to designing data marts. These approaches are
o Dependent Data Marts
o Independent Data Marts
Dependent Data Marts
A dependent data marts is a logical subset of a physical subset of a higher data warehouse. According to this technique,
the data marts are treated as the subsets of a data warehouse. In this technique, firstly a data warehouse is created from
which further various data marts can be created. These data mart are dependent on the data warehouse and extract the
essential record from it. In this technique, as the data warehouse creates the data mart; therefore, there is no need for
data mart integration. It is also known as a top-down approach.

Independent Data Marts
The second approach is independent data marts (IDM). Here, independent data marts are created first, and then a data warehouse is designed from these multiple independent data marts. Because all the data marts are designed independently, integration of the data marts is required. This is also termed the bottom-up approach, as the data marts are integrated to develop the data warehouse.

Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
A hybrid data mart combines input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integration is needed, such as after a new group or product is added to the organization.

Features of data marts:


Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse or data lake. This allows
for faster query performance since the data in the data mart is focused on a specific business unit or department.
Optimized for Query Performance: Data marts are optimized for query performance, which means that they are designed
to support fast queries and analysis of the data stored in the data mart.
Customizable: Data marts are customizable, which means that they can be designed to meet the specific needs of a
business unit or department.

Self-Contained: Data marts are self-contained, which means that they have their own set of tables, indexes, and data
models. This allows for easier management and maintenance of the data mart.
Security: Data marts can be secured, which means that access to the data in the data mart can be controlled and restricted
to specific users or groups.
Scalability: Data marts can be scaled horizontally or vertically to accommodate larger volumes of data or to support
more users.
Integration with Business Intelligence Tools: Data marts can be integrated with business intelligence tools, such as
Tableau, Power BI, or QlikView, which allows users to analyze and visualize the data stored in the data mart.
ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL) process, which means that
data is extracted from the larger data warehouse or data lake, transformed to meet the requirements of the data mart,
and loaded into the data mart.
