Unit 2 FDS
Introduction
Data Warehouse
Early methods of holding data began with punched cards and paper tape.
Magnetic tape was developed next; although data can be written and
rewritten on magnetic tape, it is not a stable medium for holding data.
Disk storage then came into existence, allowing large amounts of data
to be stored and accessed directly.
DBMS in disk storage:
Online Applications:
1.1) Query Tools: By using query tools, the user can explore the data
and generate reports or graphics in accordance with the business
requirements.
1.2) Reporting Tools: Reporting tools are used when the business
wants to see the results in a certain format on a regular basis, such as
daily, weekly, or monthly. This type of report can be saved and
retrieved at any time.
2.3) Roll up: Roll up is the opposite of drill-down. It comes into play
when the business needs summary data: by moving up the dimensional
hierarchy, it aggregates the detail-level data. Roll-ups are used to
examine a system's development and performance.
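Below is a minimal sketch of a roll-up using pandas; the sales data and
column names are hypothetical, introduced only for illustration.

import pandas as pd

# Detail-level data: one row per (date, city) sale amount.
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]),
    "city":   ["Pune", "Pune", "Mumbai"],
    "amount": [120.0, 80.0, 200.0],
})

# Roll up the time dimension from day level to month level: detail rows
# are aggregated as we move up the dimensional hierarchy.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)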
Benefits:
When a data warehouse system is operational, a business gains the
following advantages:
Disadvantages:
OLAP
“Online Analytical Processing (OLAP)” is a tool for analyzing and
processing data in real time, widely used to handle massive amounts of
information across the various dimensions of a data warehouse.
Example: when analyzing temporal data, we may want views by year,
quarter, semester, or day; this interaction with the user is handled by
OLAP.
OLTP
Or “Online Transaction Processing”, refers to systems that record all
of a business's operational transactions, guaranteeing their integrity.
This type of data is generated in massive volumes every day.
Example: a bank transaction; if it fails, the whole operation must be
reversed, and if it succeeds, it must be recorded and remain immutable.
Data warehouse
In this analysis we use structured data in the form of a cube (each
side of the cube is a dimension). The multidimensional model is the
standard in analysis tools, for example when we run arithmetic queries
with OLAP. This model offers higher query performance and also makes it
easier to create complex queries.
When the scope of the project is reduced, this model allows a more
agile implementation.
Structure
A data cube is a data structure for storing and analyzing large amounts
of multidimensional data (Pedersen, 2009b).
Fact Tables
Fact tables hold the objects to be analyzed; they are composed of
measures, the context of each dimension, and foreign keys used to link
the dimensions to the table.
Example: in our data warehouse we need to create a sales fact table,
structured as sketched below.
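The original layout is not reproduced here, so the following is an
illustrative sketch in pandas; every table and column name is an
assumption made for the example.

import pandas as pd

# Dimension table: product details, keyed by product_id.
product_dim = pd.DataFrame({
    "product_id": [1, 2],
    "name": ["Pen", "Notebook"],
})

# Fact table: measures (quantity, revenue) plus foreign keys
# (product_id, date_id, store_id) linking to the dimension tables.
sales_fact = pd.DataFrame({
    "product_id": [1, 1, 2],                       # FK -> product dimension
    "date_id":    [20230105, 20230120, 20230210],  # FK -> date dimension
    "store_id":   [10, 11, 10],                    # FK -> store dimension
    "quantity":   [3, 1, 5],                       # measure
    "revenue":    [30.0, 10.0, 100.0],             # measure
})

# Joining the fact table to a dimension resolves a foreign key into context.
print(sales_fact.merge(product_dim, on="product_id"))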
Data Reduction
Data reduction does not affect the result obtained from data mining:
the result obtained before and after reduction is the same, or almost
the same.
Data reduction aims to represent the data more compactly. When the data
size is smaller, it is simpler to apply sophisticated, computationally
expensive algorithms. The reduction may be in the number of rows
(records) or in the number of columns (dimensions).
Techniques of Data Reduction
1. Dimensionality Reduction
Whenever we encounter weakly relevant data, we keep only the attributes
required for our analysis. Dimensionality reduction eliminates
attributes from the data set under consideration, thereby reducing the
volume of the original data. It reduces data size by eliminating
outdated or redundant features. Here are three methods of
dimensionality reduction.
i. Wavelet Transform: In the wavelet transform, a data vector A is
transformed into a numerically different data vector A' such that
both A and A' are of the same length. This is useful for data
reduction because the data obtained from the wavelet transform can
be truncated: the compressed data is obtained by retaining only a
small fraction of the strongest wavelet coefficients. The wavelet
transform can be applied to data cubes, sparse data, or skewed
data.
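Here is a minimal sketch of the idea using a one-level Haar transform
in plain NumPy; a real application would use a wavelet library (such as
PyWavelets) and several decomposition levels.

import numpy as np

def haar_forward(x):
    # One level of the Haar transform: pairwise averages and differences.
    x = x.reshape(-1, 2)
    avg = (x[:, 0] + x[:, 1]) / np.sqrt(2)
    diff = (x[:, 0] - x[:, 1]) / np.sqrt(2)
    return avg, diff

def haar_inverse(avg, diff):
    x = np.empty(2 * len(avg))
    x[0::2] = (avg + diff) / np.sqrt(2)
    x[1::2] = (avg - diff) / np.sqrt(2)
    return x

A = np.array([2.0, 2.1, 8.0, 8.2, 3.0, 2.9, 5.0, 5.1])
avg, diff = haar_forward(A)

# Truncate: zero out the weak coefficients, keeping only the strongest.
diff[np.abs(diff) < 0.2] = 0.0

# A' reconstructed from the truncated coefficients approximates A.
print(haar_inverse(avg, diff))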
ii. Principal Component Analysis: Suppose we have a data set to be
analyzed whose tuples have n attributes. Principal component
analysis searches for k orthogonal vectors (the principal
components, with k ≤ n) that can best represent the data set.
In this way, the original data can be projected onto a much
smaller space, and dimensionality reduction is achieved. Principal
component analysis can be applied to sparse and skewed data.
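Below is a minimal PCA sketch in NumPy (the data here is random and
purely illustrative): tuples with n = 5 attributes are projected onto
the top k = 2 principal components.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 tuples with n = 5 attributes

X_centered = X - X.mean(axis=0)      # PCA requires mean-centered data

# The right singular vectors of the centered data are the principal
# components, ordered by how much variance they explain.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                # keep only the top-k components
X_reduced = X_centered @ Vt[:k].T    # now 100 tuples with k = 2 attributes
print(X_reduced.shape)               # (100, 2)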
iii. Attribute Subset Selection: A large data set has many
attributes, some of which are irrelevant to data mining and some
of which are redundant. Attribute subset selection reduces the
data volume and dimensionality by eliminating these redundant and
irrelevant attributes.
Attribute subset selection ensures that we get a good subset of
the original attributes even after eliminating the unwanted ones:
the resulting data distribution is as close as possible to the
distribution of the original data using all the attributes.
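Here is a minimal sketch of one simple selection strategy, a
correlation filter that keeps the k attributes most correlated with a
target (stepwise forward selection and backward elimination are other
common approaches); the data is synthetic and illustrative.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                         # 6 candidate attributes
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)  # depends on columns 0 and 3

# Score each attribute by its absolute correlation with the target.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in range(X.shape[1])])

k = 2
selected = np.argsort(scores)[::-1][:k]   # indices of the k best attributes
print("selected attributes:", selected)   # expected: columns 0 and 3
X_reduced = X[:, selected]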
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it
in a much smaller form. This technique includes two types: parametric
and non-parametric numerosity reduction.
i. Parametric: Parametric numerosity reduction involves storing
only the data parameters instead of the original data. One such
method is the regression and log-linear method.
o Regression and Log-Linear: Linear regression models the
relationship between two attributes by fitting a linear
equation to the data set. Suppose we need to model a
linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor
attribute. In data mining terms, attributes x and y are
numeric database attributes, while w and b are the
regression coefficients.
Multiple linear regression models the response variable y
as a linear function of two or more predictor variables.
The log-linear model discovers the relationship between two
or more discrete attributes in the database. Suppose we
have a set of tuples presented in n-dimensional space; the
log-linear model is used to study the probability of each
tuple in that multidimensional space.
Regression and log-linear methods can both be used for
sparse data and skewed data.
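The following sketch shows the parametric idea with NumPy and synthetic
data: 1,000 (x, y) tuples are replaced by just the two stored
parameters w and b.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=1000)
y = 4.0 * x + 1.5 + rng.normal(scale=0.5, size=1000)

# Least-squares fit of y = w*x + b.
w, b = np.polyfit(x, y, deg=1)
print(f"w = {w:.2f}, b = {b:.2f}")   # close to the true w = 4.0, b = 1.5

# The 1,000 tuples are now summarized by two parameters: any y can be
# approximated from its x via y ≈ w*x + b.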
ii. Non-Parametric: A non-parametric numerosity reduction
technique does not assume any model. Non-parametric techniques
give a more uniform reduction irrespective of data size, but may
not achieve as high a volume of data reduction as parametric ones.
Common non-parametric data reduction techniques include
histograms, clustering, sampling, data cube aggregation, and data
compression.
o Histogram: A histogram is a graph that represents a frequency
distribution, describing how often each value appears in the
data. A histogram uses the binning method to represent an
attribute's data distribution, partitioning the values into
disjoint subsets called bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed
data. Instead of only one attribute, a histogram can also be
built over multiple attributes; it can effectively represent
up to about five.
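Here is a minimal sketch with NumPy, on a hypothetical price attribute:
equal-width binning summarizes 1,000 values with a handful of counts.

import numpy as np

rng = np.random.default_rng(3)
prices = rng.uniform(1, 30, size=1000)   # hypothetical price attribute

# Six equal-width bins: 1,000 values reduced to 6 counts plus 7 bin edges.
counts, edges = np.histogram(prices, bins=6)
for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"[{lo:5.2f}, {hi:5.2f}) -> {c} values")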
o Clustering: Clustering techniques group similar objects from
the data so that the objects in a cluster are similar to each
other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be measured
with a distance function: the greater the similarity between
objects in a cluster, the closer together they appear in it.
The quality of a cluster depends on its diameter, i.e., the
maximum distance between any two objects in the cluster.
The cluster representation then replaces the original data.
This technique is more effective when the data can be
organized into distinct clusters.
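Below is a minimal sketch using scikit-learn's k-means on synthetic 2-D
data: 300 tuples are replaced by 3 representative cluster centers.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Three hypothetical groups of points.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(100, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The 300 original tuples are reduced to 3 cluster centers.
print(km.cluster_centers_)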
o Sampling: One of the methods used for data reduction is
sampling, as it can reduce a large data set to a much smaller
data sample. Below we discuss the different ways in which we
can sample a large data set D containing N tuples:
a. Simple random sample without replacement
(SRSWOR) of size s: Here s tuples are drawn from the N
tuples of data set D (s < N). The probability of drawing
any tuple from D is 1/N, meaning all tuples have an equal
probability of being sampled.
b. Simple random sample with replacement
(SRSWR) of size s: Similar to SRSWOR, but each tuple drawn
from data set D is recorded and then replaced into D so
that it can be drawn again.
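Here is a minimal sketch of both schemes with NumPy, using integers as
stand-ins for tuples.

import numpy as np

rng = np.random.default_rng(5)
D = np.arange(100)   # stand-in for a data set of N = 100 tuples
s = 10

# SRSWOR: each tuple can be drawn at most once (requires s < N).
srswor = rng.choice(D, size=s, replace=False)

# SRSWR: a drawn tuple is recorded, then replaced, so it may repeat.
srswr = rng.choice(D, size=s, replace=True)

print("without replacement:", srswor)
print("with replacement:   ", srswr)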
4. Data Compression
This technique reduces the size of files using different encoding
mechanisms, such as Huffman encoding and run-length encoding. We can
divide it into two types based on the compression technique:
i. Lossless Compression: Encoding techniques (such as run-length
encoding) allow a simple and minimal reduction in data size.
Lossless data compression uses algorithms that restore the exact
original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the
decompressed data may differ from the original data but remain
useful enough to retrieve information from. For example, the
JPEG image format is a lossy compression, but we can find
meaning equivalent to the original image. Methods such as the
discrete wavelet transform and PCA (principal component
analysis) are examples of this kind of compression.
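Here is a minimal sketch of run-length encoding, the lossless technique
mentioned above: runs of repeated values become (value, count) pairs.

def rle_encode(data):
    encoded = []
    for value in data:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([value, 1])   # start a new run
    return encoded

def rle_decode(encoded):
    return [value for value, count in encoded for _ in range(count)]

data = list("AAAABBBCCDAA")
encoded = rle_encode(data)
print(encoded)   # [['A', 4], ['B', 3], ['C', 2], ['D', 1], ['A', 2]]
assert rle_decode(encoded) == data   # lossless: exact reconstruction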
5. Discretization Operation
The data discretization technique is used to divide attributes of a
continuous nature into data with intervals. We replace many continuous
values of the attributes with labels of small intervals, so that mining
results are presented in a concise and easily understandable way.
i. Top-down discretization: If you first consider one or a couple of
points (so-called breakpoints or split points) to divide the whole
range of attribute values, and repeat this method on the resulting
intervals until the end, the process is known as top-down
discretization, also called splitting (see the sketch after this
list).
ii. Bottom-up discretization: If you first consider all the continuous
values as split points and then discard some by merging
neighborhood values into intervals, the process is called
bottom-up discretization.
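Here is a minimal sketch of top-down (splitting) discretization with
NumPy on a hypothetical age attribute; the split points and interval
labels are assumptions made for illustration.

import numpy as np

ages = np.array([3, 15, 22, 27, 35, 41, 58, 64, 77, 89])

# Split points chosen up front divide the whole range into intervals.
split_points = [18, 40, 65]
labels = ["child", "young adult", "adult", "senior"]

bins = np.digitize(ages, split_points)   # interval index for each value
for age, b in zip(ages, bins):
    print(age, "->", labels[b])          # continuous value -> interval label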
Benefits of Data Reduction
The main benefit of data reduction is simple: the more data you can fit
into a terabyte of disk space, the less capacity you will need to purchase.
Here are some benefits of data reduction:
o Data reduction can save energy.
o Data reduction can reduce your physical storage costs.
o Data reduction can decrease your data center footprint.
Data reduction greatly increases the efficiency of a storage system and
directly impacts your total spending on capacity.